diff --git a/README.md b/README.md
index ecac7ebd6..d09c022ff 100644
--- a/README.md
+++ b/README.md
@@ -19,7 +19,7 @@
### Wikidata and Wiktionary language data extraction
-**Scribe-Data** is a convenient command-line interface (CLI) for extracting and formatting language data from [Wikidata](https://www.wikidata.org/). Functionality includes allowing users to list, download, and manage language data directly from the terminal.
+**Scribe-Data** is a command-line interface (CLI) for extracting and formatting language data from [Wikidata](https://www.wikidata.org/) and other supported sources. It helps users list, download, manage, convert, and filter language data directly from the terminal.
> [!NOTE]\
> The [contributing](#contributing) section has information for those interested, with the articles and presentations in [featured by](#featured-by) also being good resources for learning more about Scribe.
@@ -28,7 +28,7 @@ Scribe applications are available on [iOS](https://github.com/scribe-org/Scribe-
Check out Scribe's [architecture diagrams](https://github.com/scribe-org/Organization/blob/main/ARCHITECTURE.md) for an overview of the organization including our applications, services and processes. It depicts the projects that [Scribe](https://github.com/scribe-org) is developing as well as the relationships between them and the external systems with which they interact. Also check out the [Wikidata and Scribe Guide](https://github.com/scribe-org/Organization/blob/main/WIKIDATAGUIDE.md) for an overview of [Wikidata](https://www.wikidata.org/) and getting language data from it.
-# Contents
+## Contents
- [Process](#process)
- [Installation](#installation)
@@ -38,7 +38,7 @@ Check out Scribe's [architecture diagrams](https://github.com/scribe-org/Organiz
- [Environment Setup](#environment-setup)
- [Featured By](#featured-by)
-# Process
+## Process
The CLI commands defined within [scribe_data/cli](https://github.com/scribe-org/Scribe-Data/blob/main/src/scribe_data/cli) and the notebooks within the various [scribe_data](https://github.com/scribe-org/Scribe-Data/tree/main/src/scribe_data) directories are used to update all data for [Scribe-iOS](https://github.com/scribe-org/Scribe-iOS), with this functionality later being expanded to update [Scribe-Android](https://github.com/scribe-org/Scribe-Android) and [Scribe-Desktop](https://github.com/scribe-org/Scribe-Desktop) once they're active.
@@ -46,7 +46,7 @@ The main data update process triggers [language based SPARQL queries](https://gi
Back to top.
-# Installation
+## Installation
Scribe-Data is available for installation via [uv](https://docs.astral.sh/uv/) (recommended) or [pip](https://pypi.org/project/scribe-data/).
@@ -80,7 +80,7 @@ pip install -e .
Back to top.
-# CLI Usage
+## CLI Usage
Scribe-Data provides a command-line interface (CLI) for efficient interaction with its language data functionality. Please see the [usage guide](https://github.com/scribe-org/Scribe-Data/blob/main/USAGE.md) or the [official documentation](https://scribe-data.readthedocs.io/) for detailed instructions.
@@ -95,10 +95,15 @@ scribe-data [command] [arguments]
### Available Commands
-- `list` (`l`): Enumerate available languages, data types and their combinations.
-- `get` (`g`): Retrieve data from Wikidata for specified languages and data types.
-- `total` (`t`): Display the total available data for given languages and data types.
-- `convert` (`c`): Transform data returned by Scribe-Data into different file formats.
+- `list` (`l`): List languages, data types and combinations of each that Scribe-Data can be used for.
+- `get` (`g`): Get data from Wikidata and other sources for the given languages and data types.
+- `total` (`t`): Check Wikidata for the total available data for the given languages and data types.
+- `convert` (`c`): Convert data returned by Scribe-Data to different file types.
+- `download` (`d`): Download Wikidata lexeme or Wiktionary dumps.
+- `interactive` (`i`): Run in interactive mode.
+- `export_contracts` (`ec`): Export Scribe-Data contracts to a local directory.
+- `check_contracts` (`cc`): Check the data in a Scribe-Data export directory to see that all needed language data is included.
+- `filter_data` (`fd`): Filter exported Scribe-Data data based on provided data contract values.
### Command Examples
@@ -108,9 +113,10 @@ scribe-data [command] [arguments]
```bash
# Commands used in the above GIF:
+scribe-data list
scribe-data list --language
scribe-data list --data-type
-scribe-data get --language English --data-type verbs -od ./scribe-data
+scribe-data get --language English --data-type verbs --output-dir ./scribe-data
scribe-data total --language English
```
@@ -120,13 +126,13 @@ scribe-data total --language English
```bash
# Commands used in the above GIF:
-scribe-data get -i
-scribe-data total -i
+scribe-data get --interactive
+scribe-data total --interactive
```
Back to top.
-# Data Contracts
+## Data Contracts
[Wikidata](https://www.wikidata.org/) has lots of [language data](https://www.wikidata.org/wiki/Wikidata:Lexicographical_data) available, but not all of it is useful for all applications. In order to make the functionality of the Scribe-Data `get` requests as simple as possible, we made the decision to always return all data for the given languages and data types. Adding the ability to pass desired forms to the commands seemed cumbersome, and larger Scribe-Data requests should be parsing [Wikidata lexeme dumps](https://dumps.wikimedia.org/wikidatawiki/entities/) as the data source.
@@ -160,7 +166,7 @@ Updating contracts shouldn't be something that Scribe-Data users should have to
Back to top.
-# Contributing
+## Contributing
@@ -200,7 +206,7 @@ Scribe does not accept direct edits to the grammar JSON files as they are source
Back to top.
-# Environment Setup
+## Environment Setup
> [!IMPORTANT]
>
@@ -288,7 +294,7 @@ See the [contribution guidelines](https://github.com/scribe-org/Scribe-Data/blob
Back to top.
-# Featured By
+## Featured By
Please see the [blog posts page on our website](https://scri.be/docs/about/blog-posts) for a list of articles on Scribe, and feel free to open a pull request to add one that you've written at [scribe-org/scri.be](https://github.com/scribe-org/scri.be)!
@@ -316,7 +322,7 @@ The following organizations have supported the development of Scribe projects th
Back to top.
-# Powered By
+## Powered By
### Contributors
diff --git a/USAGE.md b/USAGE.md
index cc815683d..86fc4c44c 100644
--- a/USAGE.md
+++ b/USAGE.md
@@ -1,33 +1,75 @@
+
+
# Scribe-Data CLI Usage
-Scribe-Data provides a command-line interface (CLI) for efficient interaction with its language data functionality.
+Scribe-Data provides a command-line interface (CLI) for extracting language data from Wikidata and other sources.
-## Basic Usage
+## Contents
+
+- [Installation](#installation)
+- [Development Build](#development-build)
+- [Basic Usage](#basic-usage)
+- [Command Examples](#command-examples)
+- [Additional Help](#additional-help)
+
+## Installation
+
+### Using uv (recommended)
+
+```bash
+uv pip install scribe-data
+```
-To utilize the Scribe-Data CLI, you can execute the following command in your terminal:
+### Using pip
```bash
pip install scribe-data
+```
-# For a development build:
-git clone https://github.com/scribe-org/Scribe-Data.git # or ideally your fork
+Back to top.
+
+## Development Build
+
+```bash
+git clone https://github.com/scribe-org/Scribe-Data.git # or your fork
cd Scribe-Data
+
+# With uv (recommended)
+uv sync --all-groups
+source .venv/bin/activate # macOS/Linux
+# .venv\Scripts\activate # Windows
+
+# Or with pip
+python -m venv .venv
+source .venv/bin/activate # macOS/Linux
+# .venv\Scripts\activate # Windows
pip install -e .
+```
+
+Back to top.
+
+## Basic Usage
-scribe-data -h # view the cli options
+```bash
+scribe-data -h
scribe-data [command] [arguments]
```
-## Available Commands
+### Available Commands
-- `list` (`l`): Enumerate available languages, data types and their combinations.
-- `get` (`g`): Retrieve data from Wikidata for specified languages and data types.
-- `total` (`t`): Display the total available data for given languages and data types.
-- `convert` (`c`): Transform data returned by Scribe-Data into different file formats.
+- `list` (`l`): List the languages, data types, and combinations available in Scribe-Data.
+- `get` (`g`): Get data from Wikidata and other sources for the selected languages and data types.
+- `total` (`t`): Show the total available data for selected languages and data types.
+- `convert` (`c`): Convert Scribe-Data output into different file types.
+- `download` (`d`): Download Wikidata lexeme or Wiktionary dumps.
+- `interactive` (`i`): Run Scribe-Data in interactive mode.
+- `export_contracts` (`ec`): Export Scribe-Data contracts to a local directory.
+- `check_contracts` (`cc`): Check that an export directory contains the language data needed by the contracts.
+- `filter_data` (`fd`): Filter exported Scribe-Data data based on contract values.
-## Available Arguments
+### Available Arguments
-The following arguments can be passed to the Scribe-Data commands whenever sensible:
+The following arguments can be passed to commands where applicable:
- `--language` (`-lang`): The language to run the command for.
- `--data-type` (`-dt`): The data type to run the command for.
@@ -36,107 +78,69 @@ The following arguments can be passed to the Scribe-Data commands whenever sensi
- `--output-type` (`-ot`): The file type that the command should output.
- `--outputs-per-entry` (`-ope`): How many outputs should be generated per data entry.
- `--all` (`-a`): Get all results from the command.
+- `--interactive` (`-i`): Run in interactive mode where supported.
-## Command Examples
-
-### List Command
-
-1. Display all available options:
-
- ```bash
- scribe-data list # -a --all
- ```
-
-2. Display available languages:
+Back to top.
- ```bash
- scribe-data list -lang # --language
- ```
-
-3. Display available data types:
-
- ```bash
- scribe-data list -dt # --data-type
- ```
-
-### Total Command
-
-1. Display total available data for a specific data type (e.g. nouns):
-
- ```bash
- scribe-data total -dt nouns
- ```
-
-2. Display total available data for a specific language (e.g. English):
-
- ```bash
- scribe-data total -lang English
- ```
-
-3. Display total available data for both language and data type (e.g. English nouns):
-
- ```bash
- scribe-data total -lang English -dt nouns
- ```
-
-### Get Command
-
-1. Get all available languages and data types:
+## Command Examples
- ```bash
- scribe-data get -a # --all
- ```
+### List
-2. Get specific language and data type (e.g. German nouns):
+```bash
+scribe-data list
+scribe-data list --language
+scribe-data list --data-type
+```
- ```bash
- scribe-data get -lang German -dt nouns
- ```
+### Total
-### Convert Command
+```bash
+scribe-data total --data-type nouns
+scribe-data total --language English
+scribe-data total --language English --data-type nouns
+```
-1. Retrieve data for both language and data type (e.g. English nouns) in CSV format:
+### Get
- ```bash
- scribe-data get -lang english -dt verbs -od ./output_data -ot csv
- ```
+```bash
+scribe-data get --all
+scribe-data get --language German --data-type nouns
+```
-2. Retrieve data for both language and data type (e.g. English nouns) in TSV format:
+### Convert
- ```bash
- scribe-data get -lang english -dt verbs -od ./output_data -ot tsv
- ```
+```bash
+scribe-data get --language English --data-type verbs --output-dir ./output_data --output-type csv
-### Interactive Get Mode
+scribe-data get --language English --data-type verbs --output-dir ./output_data --output-type tsv
+```
-The CLI also offers an interactive get mode, which can be initiated with the following command:
+### Interactive Mode
```bash
-scribe-data get -i # --interactive
+scribe-data interactive
+scribe-data get --interactive
+scribe-data total --interactive
```
-This mode guides users through the data retrieval process with a series of prompts:
-
-1. Language selection: Users can choose from a list of available languages or select all.
-2. Data type selection: Users can specify which types of data to get.
-3. Output configuration: Users can set the file format, export directory, and overwrite preferences.
+Back to top.
-The interactive mode is particularly useful for users who prefer a guided approach or are exploring the available data options.
+## Additional Help
-## Additional Assistance
-
-For more detailed information on each command and its options, append the `--help` flag:
+For detailed information on any command, use:
```bash
-scribe-data -h # --help
+scribe-data -h
scribe-data [command] -h
```
-The CLI also has functions to check the version and upgrade the package if necessary.
+Version and upgrade commands are also available:
```bash
-scribe-data -v # --version
-scribe-data -u # --upgrade
+scribe-data -v
+scribe-data -u
```
-For comprehensive usage instructions and examples, please refer to the [official documentation](https://scribe-data.readthedocs.io/).
+For more information, see the [official documentation](https://scribe-data.readthedocs.io/).
+
+Back to top.
diff --git a/complexipy-snapshot.json b/complexipy-snapshot.json
index fdd4cd3ac..08928dd2a 100644
--- a/complexipy-snapshot.json
+++ b/complexipy-snapshot.json
@@ -158,22 +158,42 @@
]
},
{
- "path": "scribe_data/cli/convert.py",
- "file_name": "convert.py",
+ "path": "scribe_data/cli/convert/to_csv_or_tsv.py",
+ "file_name": "to_csv_or_tsv.py",
+ "functions": [
+ {
+ "name": "convert_to_csv_or_tsv",
+ "complexity": 123
+ }
+ ]
+ },
+ {
+ "path": "scribe_data/cli/convert/to_json.py",
+ "file_name": "to_json.py",
"functions": [
{
"name": "convert_to_json",
"complexity": 90
+ }
+ ]
+ },
+ {
+ "path": "scribe_data/cli/convert/to_sqlite.py",
+ "file_name": "to_sqlite.py",
+ "functions": [
+ {
+ "name": "wiktionary_translations_to_sqlite",
+ "complexity": 17
},
{
- "name": "convert_to_csv_or_tsv",
- "complexity": 123
+ "name": "convert_to_sqlite",
+ "complexity": 75
}
]
},
{
- "path": "scribe_data/cli/download.py",
- "file_name": "download.py",
+ "path": "scribe_data/cli/download/wikidata_lexeme_dump.py",
+ "file_name": "wikidata_lexeme_dump.py",
"functions": [
{
"name": "download_wd_lexeme_dump",
@@ -184,12 +204,18 @@
"complexity": 28
},
{
- "name": "available_closest_lexeme_dumpfile",
+ "name": "available_closest_lexeme_dump_file",
"complexity": 29
- },
+ }
+ ]
+ },
+ {
+ "path": "scribe_data/cli/download/wiktionary_dump.py",
+ "file_name": "wiktionary_dump.py",
+ "functions": [
{
"name": "download_wiktionary_dumps",
- "complexity": 31
+ "complexity": 32
}
]
},
@@ -204,11 +230,11 @@
]
},
{
- "path": "scribe_data/cli/interactive.py",
- "file_name": "interactive.py",
+ "path": "scribe_data/cli/interactive/run.py",
+ "file_name": "run.py",
"functions": [
{
- "name": "start_interactive_mode",
+ "name": "run_interactive_mode",
"complexity": 36
}
]
@@ -224,21 +250,13 @@
]
},
{
- "path": "scribe_data/cli/total.py",
- "file_name": "total.py",
+ "path": "scribe_data/cli/total/print_values.py",
+ "file_name": "print_values.py",
"functions": [
{
"name": "get_datatype_list",
"complexity": 21
},
- {
- "name": "get_total_lexemes",
- "complexity": 29
- },
- {
- "name": "total_wrapper",
- "complexity": 30
- },
{
"name": "print_total_lexemes",
"complexity": 35
@@ -246,16 +264,22 @@
]
},
{
- "path": "scribe_data/load/data_to_sqlite.py",
- "file_name": "data_to_sqlite.py",
+ "path": "scribe_data/cli/total/query.py",
+ "file_name": "query.py",
"functions": [
{
- "name": "wiktionary_translations_to_sqlite",
- "complexity": 17
- },
+ "name": "query_total_lexemes",
+ "complexity": 29
+ }
+ ]
+ },
+ {
+ "path": "scribe_data/cli/total/wrapper.py",
+ "file_name": "wrapper.py",
+ "functions": [
{
- "name": "data_to_sqlite",
- "complexity": 75
+ "name": "total_wrapper",
+ "complexity": 30
}
]
},
diff --git a/docs/source/index.rst b/docs/source/index.rst
index ec6292c1b..569a20607 100644
--- a/docs/source/index.rst
+++ b/docs/source/index.rst
@@ -81,10 +81,15 @@ To utilize the Scribe-Data CLI, you can execute variations of the following comm
Available Commands
==================
-- ``list`` (``l``): Enumerate available languages, data types and their combinations.
-- ``get`` (``g``): Retrieve data from Wikidata for specified languages and data types.
-- ``total`` (``t``): Display the total available data for given languages and data types.
-- ``convert`` (``c``): Transform data returned by Scribe-Data into different file formats.
+- ``list`` (``l``): List languages, data types and combinations of each that Scribe-Data can be used for.
+- ``get`` (``g``): Get data from Wikidata and other sources for the given languages and data types.
+- ``total`` (``t``): Check Wikidata for the total available data for the given languages and data types.
+- ``convert`` (``c``): Convert data returned by Scribe-Data to different file types.
+- ``download`` (``d``): Download Wikidata lexeme or Wiktionary dumps.
+- ``interactive`` (``i``): Run in interactive mode.
+- ``export_contracts`` (``ec``): Export Scribe-Data contracts to a local directory.
+- ``check_contracts`` (``cc``): Check the data in a Scribe-Data export directory to see that all needed language data is included.
+- ``filter_data`` (``fd``): Filter exported Scribe-Data data based on provided data contract values.
Contents
========
diff --git a/docs/source/scribe_data/cli/cli_utils.rst b/docs/source/scribe_data/cli/cli_utils.rst
new file mode 100644
index 000000000..01b772593
--- /dev/null
+++ b/docs/source/scribe_data/cli/cli_utils.rst
@@ -0,0 +1,8 @@
+cli_utils.py
+============
+
+`View code on Github `_
+
+.. automodule:: scribe_data.cli.cli_utils
+ :members:
+ :private-members:
diff --git a/docs/source/scribe_data/cli/contracts/check.rst b/docs/source/scribe_data/cli/contracts/check.rst
new file mode 100644
index 000000000..792d18e22
--- /dev/null
+++ b/docs/source/scribe_data/cli/contracts/check.rst
@@ -0,0 +1,8 @@
+check.py
+========
+
+`View code on Github `_
+
+.. automodule:: scribe_data.cli.contracts.check
+ :members:
+ :private-members:
diff --git a/docs/source/scribe_data/cli/contracts/export.rst b/docs/source/scribe_data/cli/contracts/export.rst
new file mode 100644
index 000000000..d428175db
--- /dev/null
+++ b/docs/source/scribe_data/cli/contracts/export.rst
@@ -0,0 +1,8 @@
+export.py
+=========
+
+`View code on Github `_
+
+.. automodule:: scribe_data.cli.contracts.export
+ :members:
+ :private-members:
diff --git a/docs/source/scribe_data/cli/contracts/filter.rst b/docs/source/scribe_data/cli/contracts/filter.rst
new file mode 100644
index 000000000..dee57861e
--- /dev/null
+++ b/docs/source/scribe_data/cli/contracts/filter.rst
@@ -0,0 +1,8 @@
+filter.py
+=========
+
+`View code on Github `_
+
+.. automodule:: scribe_data.cli.contracts.filter
+ :members:
+ :private-members:
diff --git a/docs/source/scribe_data/cli/contracts/index.rst b/docs/source/scribe_data/cli/contracts/index.rst
new file mode 100644
index 000000000..2d910498a
--- /dev/null
+++ b/docs/source/scribe_data/cli/contracts/index.rst
@@ -0,0 +1,12 @@
+contracts/
+==========
+
+`View code on Github `_
+
+
+.. toctree::
+ :maxdepth: 1
+
+ check
+ export
+ filter
diff --git a/docs/source/scribe_data/cli/convert/index.rst b/docs/source/scribe_data/cli/convert/index.rst
new file mode 100644
index 000000000..abd6b4bcc
--- /dev/null
+++ b/docs/source/scribe_data/cli/convert/index.rst
@@ -0,0 +1,13 @@
+convert/
+========
+
+`View code on Github `_
+
+
+.. toctree::
+ :maxdepth: 1
+
+ to_csv_or_tsv
+ to_json
+ to_sqlite
+ wrapper
diff --git a/docs/source/scribe_data/cli/convert/to_csv_or_tsv.rst b/docs/source/scribe_data/cli/convert/to_csv_or_tsv.rst
new file mode 100644
index 000000000..378119042
--- /dev/null
+++ b/docs/source/scribe_data/cli/convert/to_csv_or_tsv.rst
@@ -0,0 +1,8 @@
+to_csv_or_tsv.py
+================
+
+`View code on Github `_
+
+.. automodule:: scribe_data.cli.convert.to_csv_or_tsv
+ :members:
+ :private-members:
diff --git a/docs/source/scribe_data/cli/convert/to_json.rst b/docs/source/scribe_data/cli/convert/to_json.rst
new file mode 100644
index 000000000..def5c94dd
--- /dev/null
+++ b/docs/source/scribe_data/cli/convert/to_json.rst
@@ -0,0 +1,8 @@
+to_json.py
+==========
+
+`View code on Github `_
+
+.. automodule:: scribe_data.cli.convert.to_json
+ :members:
+ :private-members:
diff --git a/docs/source/scribe_data/cli/convert/to_sqlite.rst b/docs/source/scribe_data/cli/convert/to_sqlite.rst
new file mode 100644
index 000000000..b902894dc
--- /dev/null
+++ b/docs/source/scribe_data/cli/convert/to_sqlite.rst
@@ -0,0 +1,8 @@
+to_sqlite.py
+============
+
+`View code on Github `_
+
+.. automodule:: scribe_data.cli.convert.to_sqlite
+ :members:
+ :private-members:
diff --git a/docs/source/scribe_data/cli/convert/wrapper.rst b/docs/source/scribe_data/cli/convert/wrapper.rst
new file mode 100644
index 000000000..c998b092c
--- /dev/null
+++ b/docs/source/scribe_data/cli/convert/wrapper.rst
@@ -0,0 +1,8 @@
+wrapper.py
+==========
+
+`View code on Github `_
+
+.. automodule:: scribe_data.cli.convert.wrapper
+ :members:
+ :private-members:
diff --git a/docs/source/scribe_data/cli/download/index.rst b/docs/source/scribe_data/cli/download/index.rst
new file mode 100644
index 000000000..271dd5f20
--- /dev/null
+++ b/docs/source/scribe_data/cli/download/index.rst
@@ -0,0 +1,11 @@
+download/
+=========
+
+`View code on Github `_
+
+
+.. toctree::
+ :maxdepth: 1
+
+ wikidata_lexeme_dump
+ wiktionary_dump
diff --git a/docs/source/scribe_data/cli/download/wikidata_lexeme_dump.rst b/docs/source/scribe_data/cli/download/wikidata_lexeme_dump.rst
new file mode 100644
index 000000000..ad782909e
--- /dev/null
+++ b/docs/source/scribe_data/cli/download/wikidata_lexeme_dump.rst
@@ -0,0 +1,8 @@
+wikidata_lexeme_dump.py
+=======================
+
+`View code on Github `_
+
+.. automodule:: scribe_data.cli.download.wikidata_lexeme_dump
+ :members:
+ :private-members:
diff --git a/docs/source/scribe_data/cli/download/wiktionary_dump.rst b/docs/source/scribe_data/cli/download/wiktionary_dump.rst
new file mode 100644
index 000000000..4469e4c82
--- /dev/null
+++ b/docs/source/scribe_data/cli/download/wiktionary_dump.rst
@@ -0,0 +1,8 @@
+wiktionary_dump.py
+==================
+
+`View code on Github `_
+
+.. automodule:: scribe_data.cli.download.wiktionary_dump
+ :members:
+ :private-members:
diff --git a/docs/source/scribe_data/cli/get.rst b/docs/source/scribe_data/cli/get.rst
new file mode 100644
index 000000000..0010611e8
--- /dev/null
+++ b/docs/source/scribe_data/cli/get.rst
@@ -0,0 +1,8 @@
+get.py
+======
+
+`View code on Github `_
+
+.. automodule:: scribe_data.cli.get
+ :members:
+ :private-members:
diff --git a/docs/source/scribe_data/cli.rst b/docs/source/scribe_data/cli/index.rst
similarity index 99%
rename from docs/source/scribe_data/cli.rst
rename to docs/source/scribe_data/cli/index.rst
index d88e6cabb..f112ea91c 100644
--- a/docs/source/scribe_data/cli.rst
+++ b/docs/source/scribe_data/cli/index.rst
@@ -5,6 +5,22 @@ cli/
Scribe-Data provides a command-line interface (CLI) for efficient interaction with its language data functionality.
+.. toctree::
+ :maxdepth: 2
+
+ contracts/index
+ convert/index
+ download/index
+ interactive/index
+ list/index
+ total/index
+
+.. toctree::
+ :maxdepth: 1
+
+ cli_utils
+ get
+
Usage
-----
diff --git a/docs/source/scribe_data/cli/interactive/config.rst b/docs/source/scribe_data/cli/interactive/config.rst
new file mode 100644
index 000000000..239a4394b
--- /dev/null
+++ b/docs/source/scribe_data/cli/interactive/config.rst
@@ -0,0 +1,8 @@
+config.py
+=========
+
+`View code on Github `_
+
+.. automodule:: scribe_data.cli.interactive.config
+ :members:
+ :private-members:
diff --git a/docs/source/scribe_data/cli/interactive/execute.rst b/docs/source/scribe_data/cli/interactive/execute.rst
new file mode 100644
index 000000000..36948cf5a
--- /dev/null
+++ b/docs/source/scribe_data/cli/interactive/execute.rst
@@ -0,0 +1,8 @@
+execute.py
+==========
+
+`View code on Github `_
+
+.. automodule:: scribe_data.cli.interactive.execute
+ :members:
+ :private-members:
diff --git a/docs/source/scribe_data/cli/interactive/index.rst b/docs/source/scribe_data/cli/interactive/index.rst
new file mode 100644
index 000000000..efdcc4156
--- /dev/null
+++ b/docs/source/scribe_data/cli/interactive/index.rst
@@ -0,0 +1,13 @@
+interactive/
+============
+
+`View code on Github `_
+
+
+.. toctree::
+ :maxdepth: 1
+
+ config
+ execute
+ prompt
+ run
diff --git a/docs/source/scribe_data/cli/interactive/prompt.rst b/docs/source/scribe_data/cli/interactive/prompt.rst
new file mode 100644
index 000000000..08044fd7f
--- /dev/null
+++ b/docs/source/scribe_data/cli/interactive/prompt.rst
@@ -0,0 +1,8 @@
+prompt.py
+=========
+
+`View code on Github `_
+
+.. automodule:: scribe_data.cli.interactive.prompt
+ :members:
+ :private-members:
diff --git a/docs/source/scribe_data/cli/interactive/run.rst b/docs/source/scribe_data/cli/interactive/run.rst
new file mode 100644
index 000000000..f62c6e96f
--- /dev/null
+++ b/docs/source/scribe_data/cli/interactive/run.rst
@@ -0,0 +1,8 @@
+run.py
+======
+
+`View code on Github `_
+
+.. automodule:: scribe_data.cli.interactive.run
+ :members:
+ :private-members:
diff --git a/docs/source/scribe_data/cli/list/data_types.rst b/docs/source/scribe_data/cli/list/data_types.rst
new file mode 100644
index 000000000..dfcd5bb60
--- /dev/null
+++ b/docs/source/scribe_data/cli/list/data_types.rst
@@ -0,0 +1,8 @@
+data_types.py
+=============
+
+`View code on Github `_
+
+.. automodule:: scribe_data.cli.list.data_types
+ :members:
+ :private-members:
diff --git a/docs/source/scribe_data/load/index.rst b/docs/source/scribe_data/cli/list/index.rst
similarity index 54%
rename from docs/source/scribe_data/load/index.rst
rename to docs/source/scribe_data/cli/list/index.rst
index d65db3b52..3a1c54407 100644
--- a/docs/source/scribe_data/load/index.rst
+++ b/docs/source/scribe_data/cli/list/index.rst
@@ -1,9 +1,12 @@
-load/
+list/
=====
-`View code on Github `_
+`View code on Github `_
+
.. toctree::
:maxdepth: 1
- data_to_sqlite
+ data_types
+ languages
+ wrapper
diff --git a/docs/source/scribe_data/cli/list/languages.rst b/docs/source/scribe_data/cli/list/languages.rst
new file mode 100644
index 000000000..33b38cb06
--- /dev/null
+++ b/docs/source/scribe_data/cli/list/languages.rst
@@ -0,0 +1,8 @@
+languages.py
+============
+
+`View code on Github `_
+
+.. automodule:: scribe_data.cli.list.languages
+ :members:
+ :private-members:
diff --git a/docs/source/scribe_data/cli/list/wrapper.rst b/docs/source/scribe_data/cli/list/wrapper.rst
new file mode 100644
index 000000000..cd8e239b5
--- /dev/null
+++ b/docs/source/scribe_data/cli/list/wrapper.rst
@@ -0,0 +1,8 @@
+wrapper.py
+==========
+
+`View code on Github `_
+
+.. automodule:: scribe_data.cli.list.wrapper
+ :members:
+ :private-members:
diff --git a/docs/source/scribe_data/cli/total/index.rst b/docs/source/scribe_data/cli/total/index.rst
new file mode 100644
index 000000000..39ba21b67
--- /dev/null
+++ b/docs/source/scribe_data/cli/total/index.rst
@@ -0,0 +1,12 @@
+total/
+======
+
+`View code on Github `_
+
+
+.. toctree::
+ :maxdepth: 1
+
+ print_values
+ query
+ wrapper
diff --git a/docs/source/scribe_data/cli/total/print_values.rst b/docs/source/scribe_data/cli/total/print_values.rst
new file mode 100644
index 000000000..e5b9398bb
--- /dev/null
+++ b/docs/source/scribe_data/cli/total/print_values.rst
@@ -0,0 +1,8 @@
+print_values.py
+===============
+
+`View code on Github `_
+
+.. automodule:: scribe_data.cli.total.print_values
+ :members:
+ :private-members:
diff --git a/docs/source/scribe_data/cli/total/query.rst b/docs/source/scribe_data/cli/total/query.rst
new file mode 100644
index 000000000..40e200401
--- /dev/null
+++ b/docs/source/scribe_data/cli/total/query.rst
@@ -0,0 +1,8 @@
+query.py
+========
+
+`View code on Github `_
+
+.. automodule:: scribe_data.cli.total.query
+ :members:
+ :private-members:
diff --git a/docs/source/scribe_data/cli/total/wrapper.rst b/docs/source/scribe_data/cli/total/wrapper.rst
new file mode 100644
index 000000000..70a7ba1e1
--- /dev/null
+++ b/docs/source/scribe_data/cli/total/wrapper.rst
@@ -0,0 +1,8 @@
+wrapper.py
+==========
+
+`View code on Github `_
+
+.. automodule:: scribe_data.cli.total.wrapper
+ :members:
+ :private-members:
diff --git a/docs/source/scribe_data/index.rst b/docs/source/scribe_data/index.rst
index 73db49062..23395dc1c 100644
--- a/docs/source/scribe_data/index.rst
+++ b/docs/source/scribe_data/index.rst
@@ -6,13 +6,13 @@ Scribe-Data
.. toctree::
:maxdepth: 1
- cli
utils
.. toctree::
:maxdepth: 2
check/index
- load/index
+ cli/index
unicode/index
wikidata/index
+ wiktionary/index
diff --git a/docs/source/scribe_data/load/data_to_sqlite.rst b/docs/source/scribe_data/load/data_to_sqlite.rst
deleted file mode 100644
index 24433ca85..000000000
--- a/docs/source/scribe_data/load/data_to_sqlite.rst
+++ /dev/null
@@ -1,20 +0,0 @@
-data_to_sqlite.py
-=================
-
-`View code on Github `_
-
-Converts all or desired JSON data generated by update_data into SQLite databases.
-
-Parameters
-----------
- languages : list of strings (default=None)
- A subset of Scribe's languages that the user wants to update.
-
-Example
--------
-
-.. code:: bash
-
- python3 data_to_sqlite.py '["French", "German"]'
-
-..
diff --git a/docs/source/scribe_data/wiktionary/index.rst b/docs/source/scribe_data/wiktionary/index.rst
new file mode 100644
index 000000000..43d46a643
--- /dev/null
+++ b/docs/source/scribe_data/wiktionary/index.rst
@@ -0,0 +1,10 @@
+wiktionary/
+===========
+
+`View code on Github `_
+
+.. toctree::
+ :maxdepth: 1
+
+ parse_constants
+ parse_translations
diff --git a/docs/source/scribe_data/wiktionary/parse_constants.rst b/docs/source/scribe_data/wiktionary/parse_constants.rst
new file mode 100644
index 000000000..5811d5e79
--- /dev/null
+++ b/docs/source/scribe_data/wiktionary/parse_constants.rst
@@ -0,0 +1,8 @@
+parse_constants.py
+==================
+
+`View code on Github `_
+
+.. automodule:: scribe_data.wiktionary.parse_constants
+ :members:
+ :private-members:
diff --git a/docs/source/scribe_data/wiktionary/parse_translations.rst b/docs/source/scribe_data/wiktionary/parse_translations.rst
new file mode 100644
index 000000000..d9691e713
--- /dev/null
+++ b/docs/source/scribe_data/wiktionary/parse_translations.rst
@@ -0,0 +1,8 @@
+parse_translations.py
+=====================
+
+`View code on Github `_
+
+.. automodule:: scribe_data.wiktionary.parse_translations
+ :members:
+ :private-members:
diff --git a/src/scribe_data/load/__init__.py b/src/scribe_data/cli/contracts/__init__.py
similarity index 100%
rename from src/scribe_data/load/__init__.py
rename to src/scribe_data/cli/contracts/__init__.py
diff --git a/src/scribe_data/cli/contracts/filter.py b/src/scribe_data/cli/contracts/filter.py
index 25899c027..194368bf8 100644
--- a/src/scribe_data/cli/contracts/filter.py
+++ b/src/scribe_data/cli/contracts/filter.py
@@ -18,6 +18,8 @@
get_language_from_iso,
)
+# MARK: Filter Metadata
+
def filter_contract_metadata(contract_file: Path) -> dict[str, Any]:
"""
@@ -63,7 +65,6 @@ def filter_contract_metadata(contract_file: Path) -> dict[str, Any]:
# Case 2: List of number types.
elif isinstance(numbers, list):
- # Filter out empty strings
filtered_numbers = [n for n in numbers if n]
# Case 3: String of number types.
@@ -71,7 +72,7 @@ def filter_contract_metadata(contract_file: Path) -> dict[str, Any]:
# Split and filter out empty strings.
filtered_numbers = [n for n in numbers.split() if n]
- # Remove duplicates and store
+ # Remove duplicates and store.
filtered_metadata["nouns"]["numbers"] = list(set(filtered_numbers))
# Filter Genders.
@@ -112,6 +113,7 @@ def filter_contract_metadata(contract_file: Path) -> dict[str, Any]:
).split()
]
conj_forms.update(cleaned_forms)
+
elif isinstance(form, list):
cleaned_forms = [
f
@@ -153,6 +155,9 @@ def filter_contract_metadata(contract_file: Path) -> dict[str, Any]:
return {}
+# MARK: Filter Export Data
+
+
def filter_exported_data(
input_file: Path, contract_metadata: dict[str, Any], data_type: str
) -> dict[str, Any]:
@@ -221,6 +226,9 @@ def filter_exported_data(
return {}
+# MARK: Export Filtered Data
+
+
def export_data_filtered_by_contracts(
contracts_dir: Path, input_dir: Path, output_dir: Path
) -> None:
@@ -305,6 +313,7 @@ def export_data_filtered_by_contracts(
open(output_file, "w", encoding="utf-8") as dst,
):
dst.write(src.read())
+
print(f"Copied unfiltered {data_type} for {matched_language}")
continue
@@ -319,6 +328,7 @@ def export_data_filtered_by_contracts(
)
with open(output_file, "w", encoding="utf-8") as f:
json.dump(filtered_data, f, ensure_ascii=False, indent=2)
+
print(
f"Exported {matched_language} {data_type} with {len(filtered_data)} entries"
)
diff --git a/src/scribe_data/cli/convert.py b/src/scribe_data/cli/convert.py
deleted file mode 100644
index a5534f3fc..000000000
--- a/src/scribe_data/cli/convert.py
+++ /dev/null
@@ -1,473 +0,0 @@
-# SPDX-License-Identifier: GPL-3.0-or-later
-"""
-Functions to convert data returned from the Scribe-Data CLI to other file types.
-"""
-
-import csv
-import json
-from pathlib import Path
-
-from scribe_data.load.data_to_sqlite import data_to_sqlite
-from scribe_data.utils import (
- DEFAULT_CSV_EXPORT_DIR,
- DEFAULT_JSON_EXPORT_DIR,
- DEFAULT_SQLITE_EXPORT_DIR,
- DEFAULT_TSV_EXPORT_DIR,
- DEFAULT_WIKTIONARY_JSON_EXPORT_DIR,
- camel_to_snake,
- check_index_exists,
-)
-
-# MARK: JSON
-
-
-def convert_to_json(
- language: str,
- data_types: str | list[str] | None,
- input_file: Path,
- output_dir: Path,
- output_type: str,
- overwrite: bool = False,
- identifier_case: str = "camel",
-) -> None:
- """
- Convert a CSV/TSV file to JSON.
-
- Parameters
- ----------
- language : str
- The language of the file to convert.
-
- data_types : Union[str, List[str]]
- The data type of the file to convert.
-
- input_file : Path
- The input CSV/TSV file path.
-
- output_dir : Path
- The output directory path for results.
-
- output_type : str
- The output format, should be "json".
-
- overwrite : bool
- Whether to overwrite existing files.
-
- identifier_case : str
- The case format for identifiers. Default is "camel".
-
- Returns
- -------
- None
- A JSON file.
- """
- if not language:
- raise ValueError(f"Language '{language.capitalize()}' is not recognized.")
-
- data_types = [data_types] if isinstance(data_types, str) else data_types
-
- if not data_types:
- return
-
- if output_dir is None:
- output_dir = DEFAULT_JSON_EXPORT_DIR
-
- json_output_dir = Path(output_dir) / language.capitalize()
- json_output_dir.mkdir(parents=True, exist_ok=True)
-
- for dtype in data_types:
- if not input_file.exists():
- print(f"No data found for {dtype} conversion at '{input_file}'.")
- continue
-
- delimiter = {".csv": ",", ".tsv": "\t"}.get(input_file.suffix.lower())
-
- if not delimiter:
- raise ValueError(
- f"Unsupported file extension '{input_file.suffix}' for {str(input_file)}. Please provide a '.csv' or '.tsv' file."
- )
-
- try:
- with input_file.open("r", encoding="utf-8") as file:
- reader = csv.DictReader(file, delimiter=delimiter)
- rows = list(reader)
-
- if not rows:
- print(f"No data found in '{input_file}'.")
- continue
-
- # Use the first row to inspect column headers.
- first_row = rows[0]
- keys = list(first_row.keys())
- data: dict = {}
-
- if len(keys) == 1:
- # Handle Case: { key: None }.
- for row in rows:
- data[row[keys[0]]] = None
-
- elif len(keys) == 2:
- # Handle Case: { key: value }.
- for row in rows:
- key = (
- camel_to_snake(row[keys[0]])
- if identifier_case == "snake"
- else row[keys[0]]
- )
- value = row[keys[1]]
- data[key] = value
-
- elif len(keys) > 2:
- if all(col in first_row for col in ["emoji", "is_base", "rank"]):
- # Handle Case: { key: [ { emoji: ..., is_base: ..., rank: ... }, { emoji: ..., is_base: ..., rank: ... } ] }.
- for row in rows:
- if reader.fieldnames and len(reader.fieldnames) > 0:
- if identifier_case == "snake":
- raw_value = row.get(reader.fieldnames[0])
- key = camel_to_snake(raw_value or "")
-
- else:
- key = row.get(reader.fieldnames[0])
-
- emoji = row.get("emoji", "").strip()
- is_base = (
- row.get("is_base", "false").strip().lower() == "true"
- )
- rank = row.get("rank", None)
- rank = int(rank) if rank and rank.isdigit() else None
-
- entry = {"emoji": emoji, "is_base": is_base, "rank": rank}
-
- if key is None:
- continue
-
- data.setdefault(key, []).append(entry)
-
- else:
- # Handle Case: { key: { value1: ..., value2: ... } }.
- for row in rows:
- data[row[keys[0]]] = {
- (
- camel_to_snake(k)
- if identifier_case == "snake"
- else k
- ): row[k]
- for k in keys[1:]
- }
-
- except (IOError, csv.Error) as e:
- print(f"Error reading '{input_file}': {e}")
- continue
-
- # Define output file path
- output_file = json_output_dir / f"{dtype}.{output_type}"
-
- if check_index_exists(output_file, overwrite):
- print(f"Skipping {dtype}")
- continue
-
- try:
- with output_file.open("w", encoding="utf-8") as file:
- json.dump(data, file, ensure_ascii=False, indent=2)
-
- except IOError as e:
- print(f"Error writing to '{output_file}': {e}")
- continue
-
- print(f"Data for {language.capitalize()} {dtype} written to {output_file}")
-
-
-# MARK: CSV or TSV
-
-
-def convert_to_csv_or_tsv(
- language: str,
- data_types: str | list[str],
- input_file: Path,
- output_dir: Path,
- output_type: str,
- overwrite: bool = False,
- identifier_case: str = "camel",
-) -> None:
- """
- Convert a JSON File to CSV/TSV file.
-
- Parameters
- ----------
- language : str
- The language of the file to convert.
-
- data_types : Union[str, List[str]]
- The data type of the file to convert.
-
- input_file : Path
- The input JSON file path.
-
- output_dir : Path
- The output directory path for results.
-
- output_type : str
- The output format, should be "csv" or "tsv".
-
- overwrite : bool
- Whether to overwrite existing files.
-
- identifier_case : str
- The case format for identifiers. Default is "camel".
-
- Returns
- -------
- None
- A CSV/TSV files.
- """
- if not language:
- raise ValueError(f"Language '{language.capitalize()}' is not recognized.")
-
- data_types = [data_types] if isinstance(data_types, str) else data_types
-
- # Modify input file path to use the provided input_file or default JSON export path.
- input_file_path = (
- input_file
- or DEFAULT_JSON_EXPORT_DIR / language.lower() / f"{data_types[0]}.json"
- )
-
- for dtype in data_types:
- if not input_file_path.exists():
- print(f"No data found for {dtype} conversion at '{input_file_path}'.")
- continue
-
- try:
- with input_file_path.open("r", encoding="utf-8") as f:
- data = json.load(f)
-
- except (IOError, json.JSONDecodeError) as e:
- print(f"Error reading '{input_file_path}': {e}")
- continue
-
- # Determine the delimiter based on output type.
- delimiter = "," if output_type == "csv" else "\t"
-
- if output_dir is None:
- output_dir = (
- DEFAULT_CSV_EXPORT_DIR
- if output_type == "csv"
- else DEFAULT_TSV_EXPORT_DIR
- )
-
- final_output_dir = output_dir / language.capitalize()
- final_output_dir.mkdir(parents=True, exist_ok=True)
-
- output_file = final_output_dir / f"{dtype}.{output_type}"
-
- if check_index_exists(output_file, overwrite):
- print(f"Skipping {dtype}")
- continue
-
- try:
- with output_file.open("w", newline="", encoding="utf-8") as file:
- writer = csv.writer(file, delimiter=delimiter)
-
- # Handle different JSON structures based on the format.
- if isinstance(data, dict):
- first_key = list(data.keys())[0]
-
- first_val = next(iter(data.values())) if data else None
- if isinstance(first_val, dict):
- # Handle case: { key: { value1: ..., value2: ... } }.
- columns = sorted(first_val.keys())
- header = [
- camel_to_snake(dtype[:-1])
- if identifier_case == "snake"
- else dtype[:-1]
- ]
- header += [
- camel_to_snake(col) if identifier_case == "snake" else col
- for col in columns
- ]
- writer.writerow(header)
-
- for key, value in data.items():
- row = [key] + [value.get(col, "") for col in columns]
- writer.writerow(row)
-
- elif isinstance(data[first_key], list):
- if all(isinstance(item, dict) for item in data[first_key]):
- # Handle case: { key: [ { value1: ..., value2: ... } ] }.
- if "emoji" in data[first_key][0]: # emoji specific case
- columns = ["word", "emoji", "is_base", "rank"]
- writer.writerow(
- [camel_to_snake(col) for col in columns]
- if identifier_case == "snake"
- else columns
- )
-
- for key, value in data.items():
- for item in value:
- row = [
- key,
- item.get("emoji", ""),
- item.get("is_base", ""),
- item.get("rank", ""),
- ]
- writer.writerow(row)
-
- else:
- if identifier_case == "snake":
- columns = [camel_to_snake(dtype[:-1])] + [
- camel_to_snake(col)
- for col in data[first_key][0].keys()
- ]
-
- else:
- columns = [dtype[:-1]] + list(
- data[first_key][0].keys()
- )
- writer.writerow(columns)
-
- for key, value in data.items():
- for item in value:
- row = [key] + [
- item.get(col, "") for col in columns[1:]
- ]
- writer.writerow(row)
-
- elif all(isinstance(item, str) for item in data[first_key]):
- # Handle case: { key: [value1, value2, ...] }.
- header = [
- camel_to_snake(dtype[:-1])
- if identifier_case == "snake"
- else dtype[:-1]
- ]
- header += [
- f"autosuggestion_{i + 1}"
- for i in range(len(data[first_key]))
- ]
- writer.writerow(header)
- for key, value in data.items():
- row = [key] + value
- writer.writerow(row)
-
- else:
- # Handle case: { key: value }.
- writer.writerow(
- [
- camel_to_snake(dtype[:-1])
- if identifier_case == "snake"
- else dtype[:-1],
- "value",
- ]
- )
-
- for key, value in data.items():
- writer.writerow([key, value])
-
- except IOError as e:
- print(f"Error writing to '{output_file}': {e}")
- continue
-
- print(f"Data for {language.capitalize()} {dtype} written to '{output_file}'")
-
-
-# MARK: Convert Wrapper
-
-
-def convert_wrapper(
- languages: list[str] | None,
- data_types: list | None,
- input_path: Path,
- output_dir: Path,
- output_type: str,
- overwrite: bool = False,
- identifier_case: str = "camel",
- all: bool = False,
-) -> None:
- """
- Convert data to the specified output type: JSON, CSV/TSV, or SQLite.
-
- Parameters
- ----------
- languages : Optional[List[str]]
- The language(s) of the data to convert.
-
- data_types : Optional[List[str]]
- The data type(s) of the data to convert.
-
- input_path : Path
- The path to the input file or directory.
-
- output_dir : Path
- The output directory where converted files will be stored.
-
- output_type : str
- The desired output format. Can be 'json', 'csv', 'tsv', or 'sqlite'.
-
- overwrite : bool, optional, default=False
- Whether to overwrite existing output files.
-
- identifier_case : str, optional, default='camel'
- The case format for identifiers.
-
- all : bool, optional, default=False
- Convert all languages and data types.
-
- Returns
- -------
- None
- This function does not return any value; it performs a conversion operation.
- """
- # Route the function call to the correct conversion function.
- if output_dir is None:
- output_dir = {
- "json": DEFAULT_JSON_EXPORT_DIR,
- "csv": DEFAULT_CSV_EXPORT_DIR,
- "tsv": DEFAULT_TSV_EXPORT_DIR,
- "sqlite": DEFAULT_SQLITE_EXPORT_DIR,
- }.get(output_type, DEFAULT_JSON_EXPORT_DIR)
-
- if input_path is None and data_types:
- is_wiktionary = any(
- isinstance(dt, str) and dt.startswith("wiktionary")
- for dt in (data_types if isinstance(data_types, list) else [data_types])
- )
- input_path = (
- DEFAULT_WIKTIONARY_JSON_EXPORT_DIR
- if is_wiktionary
- else DEFAULT_JSON_EXPORT_DIR
- )
-
- if output_type == "json" and languages and data_types:
- convert_to_json(
- language=languages[0], # only one language possible
- data_types=data_types,
- input_file=input_path,
- output_dir=output_dir,
- output_type=output_type,
- overwrite=overwrite,
- identifier_case=identifier_case,
- )
-
- elif output_type in {"csv", "tsv"} and languages and data_types:
- convert_to_csv_or_tsv(
- language=languages[0], # only one language possible
- data_types=data_types,
- input_file=input_path,
- output_dir=output_dir,
- output_type=output_type,
- overwrite=overwrite,
- identifier_case=identifier_case,
- )
-
- elif output_type == "sqlite":
- data_to_sqlite(
- languages=languages,
- specific_tables=data_types,
- identifier_case=identifier_case,
- input_file=input_path,
- output_file=output_dir,
- overwrite=overwrite,
- )
-
- else:
- raise ValueError(
- f"Unsupported output type '{output_type}'. Must be 'json', 'csv', 'tsv' or 'sqlite'."
- )
diff --git a/src/scribe_data/cli/convert/__init__.py b/src/scribe_data/cli/convert/__init__.py
new file mode 100644
index 000000000..e69de29bb
diff --git a/src/scribe_data/cli/convert/to_csv_or_tsv.py b/src/scribe_data/cli/convert/to_csv_or_tsv.py
new file mode 100644
index 000000000..8329223dc
--- /dev/null
+++ b/src/scribe_data/cli/convert/to_csv_or_tsv.py
@@ -0,0 +1,206 @@
+# SPDX-License-Identifier: GPL-3.0-or-later
+"""
+Functions to convert data returned from the Scribe-Data CLI to CSV or TSV files.
+"""
+
+import csv
+import json
+from pathlib import Path
+
+from scribe_data.utils import (
+ DEFAULT_CSV_EXPORT_DIR,
+ DEFAULT_JSON_EXPORT_DIR,
+ DEFAULT_TSV_EXPORT_DIR,
+ camel_to_snake,
+ check_index_exists,
+)
+
+# MARK: CSV or TSV
+
+
+def convert_to_csv_or_tsv(
+ language: str,
+ data_types: str | list[str],
+ input_file: Path,
+ output_dir: Path,
+ output_type: str,
+ overwrite: bool = False,
+ identifier_case: str = "camel",
+) -> None:
+ """
+ Convert a JSON file to a CSV/TSV file.
+
+ Parameters
+ ----------
+ language : str
+ The language of the file to convert.
+
+ data_types : Union[str, List[str]]
+ The data type(s) of the file to convert.
+
+ input_file : Path
+ The input JSON file path.
+
+ output_dir : Path
+ The output directory path for results.
+
+ output_type : str
+ The output format, should be "csv" or "tsv".
+
+ overwrite : bool
+ Whether to overwrite existing files.
+
+ identifier_case : str
+ The case format for identifiers. Default is "camel".
+
+ Returns
+ -------
+ None
+ The function writes CSV/TSV files to the output directory and returns nothing.
+ """
+ if not language:
+ raise ValueError(f"Language '{language.capitalize()}' is not recognized.")
+
+ data_types = [data_types] if isinstance(data_types, str) else data_types
+
+ # Use the provided input_file, falling back to the default JSON export path.
+ input_file_path = (
+ input_file
+ or DEFAULT_JSON_EXPORT_DIR / language.lower() / f"{data_types[0]}.json"
+ )
+
+ for dtype in data_types:
+ if not input_file_path.exists():
+ print(f"No data found for {dtype} conversion at '{input_file_path}'.")
+ continue
+
+ try:
+ with input_file_path.open("r", encoding="utf-8") as f:
+ data = json.load(f)
+
+ except (IOError, json.JSONDecodeError) as e:
+ print(f"Error reading '{input_file_path}': {e}")
+ continue
+
+ # Determine the delimiter based on output type.
+ delimiter = "," if output_type == "csv" else "\t"
+
+ if output_dir is None:
+ output_dir = (
+ DEFAULT_CSV_EXPORT_DIR
+ if output_type == "csv"
+ else DEFAULT_TSV_EXPORT_DIR
+ )
+
+ final_output_dir = output_dir / language.capitalize()
+ final_output_dir.mkdir(parents=True, exist_ok=True)
+
+ output_file = final_output_dir / f"{dtype}.{output_type}"
+
+ if check_index_exists(output_file, overwrite):
+ print(f"Skipping {dtype}")
+ continue
+
+ try:
+ with output_file.open("w", newline="", encoding="utf-8") as file:
+ writer = csv.writer(file, delimiter=delimiter)
+
+ # Handle different JSON structures based on the format.
+ if isinstance(data, dict):
+ first_key = list(data.keys())[0]
+
+ first_val = next(iter(data.values())) if data else None
+ if isinstance(first_val, dict):
+ # Handle case: { key: { value1: ..., value2: ... } }.
+ columns = sorted(first_val.keys())
+ header = [
+ camel_to_snake(dtype[:-1])
+ if identifier_case == "snake"
+ else dtype[:-1]
+ ]
+ header += [
+ camel_to_snake(col) if identifier_case == "snake" else col
+ for col in columns
+ ]
+ writer.writerow(header)
+
+ for key, value in data.items():
+ row = [key] + [value.get(col, "") for col in columns]
+ writer.writerow(row)
+
+ elif isinstance(data[first_key], list):
+ if all(isinstance(item, dict) for item in data[first_key]):
+ # Handle case: { key: [ { value1: ..., value2: ... } ] }.
+ if "emoji" in data[first_key][0]: # emoji specific case
+ columns = ["word", "emoji", "is_base", "rank"]
+ writer.writerow(
+ [camel_to_snake(col) for col in columns]
+ if identifier_case == "snake"
+ else columns
+ )
+
+ for key, value in data.items():
+ for item in value:
+ row = [
+ key,
+ item.get("emoji", ""),
+ item.get("is_base", ""),
+ item.get("rank", ""),
+ ]
+ writer.writerow(row)
+
+ else:
+ if identifier_case == "snake":
+ columns = [camel_to_snake(dtype[:-1])] + [
+ camel_to_snake(col)
+ for col in data[first_key][0].keys()
+ ]
+
+ else:
+ columns = [dtype[:-1]] + list(
+ data[first_key][0].keys()
+ )
+ writer.writerow(columns)
+
+ for key, value in data.items():
+ for item in value:
+ row = [key] + [
+ item.get(col, "") for col in columns[1:]
+ ]
+ writer.writerow(row)
+
+ elif all(isinstance(item, str) for item in data[first_key]):
+ # Handle case: { key: [value1, value2, ...] }.
+ header = [
+ camel_to_snake(dtype[:-1])
+ if identifier_case == "snake"
+ else dtype[:-1]
+ ]
+ header += [
+ f"autosuggestion_{i + 1}"
+ for i in range(len(data[first_key]))
+ ]
+ writer.writerow(header)
+ for key, value in data.items():
+ row = [key] + value
+ writer.writerow(row)
+
+ else:
+ # Handle case: { key: value }.
+ writer.writerow(
+ [
+ camel_to_snake(dtype[:-1])
+ if identifier_case == "snake"
+ else dtype[:-1],
+ "value",
+ ]
+ )
+
+ for key, value in data.items():
+ writer.writerow([key, value])
+
+ except IOError as e:
+ print(f"Error writing to '{output_file}': {e}")
+ continue
+
+ print(f"Data for {language.capitalize()} {dtype} written to '{output_file}'")
diff --git a/src/scribe_data/cli/convert/to_json.py b/src/scribe_data/cli/convert/to_json.py
new file mode 100644
index 000000000..cfceda93c
--- /dev/null
+++ b/src/scribe_data/cli/convert/to_json.py
@@ -0,0 +1,172 @@
+# SPDX-License-Identifier: GPL-3.0-or-later
+"""
+Functions to convert data returned from the Scribe-Data CLI to JSON files.
+"""
+
+import csv
+import json
+from pathlib import Path
+
+from scribe_data.utils import (
+ DEFAULT_JSON_EXPORT_DIR,
+ camel_to_snake,
+ check_index_exists,
+)
+
+# MARK: JSON
+
+
+def convert_to_json(
+ language: str,
+ data_types: str | list[str] | None,
+ input_file: Path,
+ output_dir: Path,
+ output_type: str,
+ overwrite: bool = False,
+ identifier_case: str = "camel",
+) -> None:
+ """
+ Convert a CSV/TSV file to JSON.
+
+ Parameters
+ ----------
+ language : str
+ The language of the file to convert.
+
+ data_types : Union[str, List[str]]
+ The data type(s) of the file to convert.
+
+ input_file : Path
+ The input CSV/TSV file path.
+
+ output_dir : Path
+ The output directory path for results.
+
+ output_type : str
+ The output format, should be "json".
+
+ overwrite : bool
+ Whether to overwrite existing files.
+
+ identifier_case : str
+ The case format for identifiers. Default is "camel".
+
+ Returns
+ -------
+ None
+ A JSON file.
+ """
+ if not language:
+ raise ValueError(f"Language '{language.capitalize()}' is not recognized.")
+
+ data_types = [data_types] if isinstance(data_types, str) else data_types
+
+ if not data_types:
+ return
+
+ if output_dir is None:
+ output_dir = DEFAULT_JSON_EXPORT_DIR
+
+ json_output_dir = Path(output_dir) / language.capitalize()
+ json_output_dir.mkdir(parents=True, exist_ok=True)
+
+ for dtype in data_types:
+ if not input_file.exists():
+ print(f"No data found for {dtype} conversion at '{input_file}'.")
+ continue
+
+ delimiter = {".csv": ",", ".tsv": "\t"}.get(input_file.suffix.lower())
+
+ if not delimiter:
+ raise ValueError(
+ f"Unsupported file extension '{input_file.suffix}' for {str(input_file)}. Please provide a '.csv' or '.tsv' file."
+ )
+
+ try:
+ with input_file.open("r", encoding="utf-8") as file:
+ reader = csv.DictReader(file, delimiter=delimiter)
+ rows = list(reader)
+
+ if not rows:
+ print(f"No data found in '{input_file}'.")
+ continue
+
+ # Use the first row to inspect column headers.
+ first_row = rows[0]
+ keys = list(first_row.keys())
+ data: dict = {}
+
+ if len(keys) == 1:
+ # Handle Case: { key: None }.
+ for row in rows:
+ data[row[keys[0]]] = None
+
+ elif len(keys) == 2:
+ # Handle Case: { key: value }.
+ for row in rows:
+ key = (
+ camel_to_snake(row[keys[0]])
+ if identifier_case == "snake"
+ else row[keys[0]]
+ )
+ value = row[keys[1]]
+ data[key] = value
+
+ elif len(keys) > 2:
+ if all(col in first_row for col in ["emoji", "is_base", "rank"]):
+ # Handle Case: { key: [ { emoji: ..., is_base: ..., rank: ... }, { emoji: ..., is_base: ..., rank: ... } ] }.
+ for row in rows:
+ if reader.fieldnames and len(reader.fieldnames) > 0:
+ if identifier_case == "snake":
+ raw_value = row.get(reader.fieldnames[0])
+ key = camel_to_snake(raw_value or "")
+
+ else:
+ key = row.get(reader.fieldnames[0])
+
+ emoji = row.get("emoji", "").strip()
+ is_base = (
+ row.get("is_base", "false").strip().lower() == "true"
+ )
+ rank = row.get("rank", None)
+ rank = int(rank) if rank and rank.isdigit() else None
+
+ entry = {"emoji": emoji, "is_base": is_base, "rank": rank}
+
+ if key is None:
+ continue
+
+ data.setdefault(key, []).append(entry)
+
+ else:
+ # Handle Case: { key: { value1: ..., value2: ... } }.
+ for row in rows:
+ data[row[keys[0]]] = {
+ (
+ camel_to_snake(k)
+ if identifier_case == "snake"
+ else k
+ ): row[k]
+ for k in keys[1:]
+ }
+
+ except (IOError, csv.Error) as e:
+ print(f"Error reading '{input_file}': {e}")
+ continue
+
+ # Define output file path.
+ output_file = json_output_dir / f"{dtype}.{output_type}"
+
+ if check_index_exists(output_file, overwrite):
+ print(f"Skipping {dtype}")
+ continue
+
+ try:
+ with output_file.open("w", encoding="utf-8") as file:
+ json.dump(data, file, ensure_ascii=False, indent=2)
+
+ except IOError as e:
+ print(f"Error writing to '{output_file}': {e}")
+ continue
+
+ print(f"Data for {language.capitalize()} {dtype} written to {output_file}")
diff --git a/src/scribe_data/load/data_to_sqlite.py b/src/scribe_data/cli/convert/to_sqlite.py
similarity index 99%
rename from src/scribe_data/load/data_to_sqlite.py
rename to src/scribe_data/cli/convert/to_sqlite.py
index 1cb2c7bd0..c7658f065 100644
--- a/src/scribe_data/load/data_to_sqlite.py
+++ b/src/scribe_data/cli/convert/to_sqlite.py
@@ -33,10 +33,13 @@ def create_table(
----------
cursor : sqlite3.Cursor
A sqlite3 cursor.
+
identifier_case : str
Either "camel" or "snake" to determine column naming.
+
data_type : str
The name of the table to be created.
+
cols : list of str
The names of columns for the new table.
"""
@@ -72,8 +75,10 @@ def table_insert(cursor: sqlite3.Cursor, data_type: str, keys: list) -> None:
----------
cursor : sqlite3.Cursor
A sqlite3 cursor.
+
data_type : str
The name of the table to be inserted into.
+
keys : list of any
The values to be inserted into the table row.
"""
@@ -97,14 +102,19 @@ def translations_to_sqlite(
----------
language_data_type_dict : dict
A dictionary specifying the data types for each language.
+
current_languages : list
A list of current languages.
+
identifier_case : str, optional
The identifier case. Default is "snake".
+
input_file : str, optional, default=DEFAULT_JSON_EXPORT_DIR
The input JSON export directory.
+
output_file : str, optional, default=DEFAULT_SQLITE_EXPORT_DIR
The output SQLite export directory.
+
overwrite : bool, optional
If True, existing SQLite files will be overwritten without prompting.
"""
@@ -280,7 +290,7 @@ def wiktionary_translations_to_sqlite(
print(f"Wiktionary translation tables for {language} processed successfully.\n")
-def data_to_sqlite(
+def convert_to_sqlite(
languages: list[str] | None = None,
specific_tables: str | list[str] | None = None,
identifier_case: str = "camel",
diff --git a/src/scribe_data/cli/convert/wrapper.py b/src/scribe_data/cli/convert/wrapper.py
new file mode 100644
index 000000000..1e5a7db52
--- /dev/null
+++ b/src/scribe_data/cli/convert/wrapper.py
@@ -0,0 +1,121 @@
+# SPDX-License-Identifier: GPL-3.0-or-later
+"""
+Wrapper function to convert data returned from the Scribe-Data CLI to other file types.
+"""
+
+from pathlib import Path
+
+from scribe_data.cli.convert.to_csv_or_tsv import convert_to_csv_or_tsv
+from scribe_data.cli.convert.to_json import convert_to_json
+from scribe_data.cli.convert.to_sqlite import convert_to_sqlite
+from scribe_data.utils import (
+ DEFAULT_CSV_EXPORT_DIR,
+ DEFAULT_JSON_EXPORT_DIR,
+ DEFAULT_SQLITE_EXPORT_DIR,
+ DEFAULT_TSV_EXPORT_DIR,
+ DEFAULT_WIKTIONARY_JSON_EXPORT_DIR,
+)
+
+# MARK: Wrapper
+
+
+def convert_wrapper(
+ languages: list[str] | None,
+ data_types: list | None,
+ input_path: Path,
+ output_dir: Path,
+ output_type: str,
+ overwrite: bool = False,
+ identifier_case: str = "camel",
+ all: bool = False,
+) -> None:
+ """
+ Convert data to the specified output type: JSON, CSV/TSV, or SQLite.
+
+ Parameters
+ ----------
+ languages : Optional[List[str]]
+ The language(s) of the data to convert.
+
+ data_types : Optional[List[str]]
+ The data type(s) of the data to convert.
+
+ input_path : Path
+ The path to the input file or directory.
+
+ output_dir : Path
+ The output directory where converted files will be stored.
+
+ output_type : str
+ The desired output format. Can be 'json', 'csv', 'tsv', or 'sqlite'.
+
+ overwrite : bool, optional, default=False
+ Whether to overwrite existing output files.
+
+ identifier_case : str, optional, default='camel'
+ The case format for identifiers.
+
+ all : bool, optional, default=False
+ Convert all languages and data types.
+
+ Returns
+ -------
+ None
+ This function does not return any value; it performs a conversion operation.
+ """
+ # Route the function call to the correct conversion function.
+ if output_dir is None:
+ output_dir = {
+ "json": DEFAULT_JSON_EXPORT_DIR,
+ "csv": DEFAULT_CSV_EXPORT_DIR,
+ "tsv": DEFAULT_TSV_EXPORT_DIR,
+ "sqlite": DEFAULT_SQLITE_EXPORT_DIR,
+ }.get(output_type, DEFAULT_JSON_EXPORT_DIR)
+
+ if input_path is None and data_types:
+ is_wiktionary = any(
+ isinstance(dt, str) and dt.startswith("wiktionary")
+ for dt in (data_types if isinstance(data_types, list) else [data_types])
+ )
+ input_path = (
+ DEFAULT_WIKTIONARY_JSON_EXPORT_DIR
+ if is_wiktionary
+ else DEFAULT_JSON_EXPORT_DIR
+ )
+
+ if output_type == "json" and languages and data_types:
+ convert_to_json(
+ language=languages[0], # only one language possible
+ data_types=data_types,
+ input_file=input_path,
+ output_dir=output_dir,
+ output_type=output_type,
+ overwrite=overwrite,
+ identifier_case=identifier_case,
+ )
+
+ elif output_type in {"csv", "tsv"} and languages and data_types:
+ convert_to_csv_or_tsv(
+ language=languages[0], # only one language possible
+ data_types=data_types,
+ input_file=input_path,
+ output_dir=output_dir,
+ output_type=output_type,
+ overwrite=overwrite,
+ identifier_case=identifier_case,
+ )
+
+ elif output_type == "sqlite":
+ convert_to_sqlite(
+ languages=languages,
+ specific_tables=data_types,
+ identifier_case=identifier_case,
+ input_file=input_path,
+ output_file=output_dir,
+ overwrite=overwrite,
+ )
+
+ else:
+ raise ValueError(
+ f"Unsupported output type '{output_type}'. Must be 'json', 'csv', 'tsv' or 'sqlite'."
+ )
diff --git a/src/scribe_data/cli/download/__init__.py b/src/scribe_data/cli/download/__init__.py
new file mode 100644
index 000000000..e69de29bb
diff --git a/src/scribe_data/cli/download.py b/src/scribe_data/cli/download/wikidata_lexeme_dump.py
similarity index 67%
rename from src/scribe_data/cli/download.py
rename to src/scribe_data/cli/download/wikidata_lexeme_dump.py
index 06d42dbbc..df2a867f5 100644
--- a/src/scribe_data/cli/download.py
+++ b/src/scribe_data/cli/download/wikidata_lexeme_dump.py
@@ -19,7 +19,6 @@
DEFAULT_WIKIDATA_DUMP_EXPORT_DIR,
DEFAULT_WIKTIONARY_DUMP_EXPORT_DIR,
check_lexeme_dump_prompt_download,
- resolve_lang_iso,
)
@@ -53,7 +52,7 @@ def parse_date(date_string: str) -> date | None:
return None
-def available_closest_lexeme_dumpfile(
+def available_closest_lexeme_dump_file(
target_entity: str,
other_old_dumps: list,
check_wd_dump_exists: Callable[[str], str | None],
@@ -179,7 +178,7 @@ def check_wd_dump_exists(target_entity: str) -> str | None:
return
if other_old_dumps:
- if closest_date := available_closest_lexeme_dumpfile(
+ if closest_date := available_closest_lexeme_dump_file(
target_entity, other_old_dumps, check_wd_dump_exists
):
print(
@@ -294,117 +293,3 @@ def wd_lexeme_dump_download_wrapper(
except Exception as e:
rprint(f"[bold red]An error occurred: {e}[/bold red]")
-
-
-def download_wiktionary_dumps(
- output_dir: Path = DEFAULT_WIKTIONARY_DUMP_EXPORT_DIR,
- language_isos: list[str] = ["en"],
- dump_snapshot: str | None = "latest",
-) -> Path | None:
- """
- Download the latest Wiktionary pages-articles dump based on passed language isos.
-
- Parameters
- ----------
- output_dir : Path, optional, default=DEFAULT_WIKTIONARY_DUMP_EXPORT_DIR
- Directory to save the dump. Defaults to DEFAULT_WIKTIONARY_DUMP_EXPORT_DIR.
-
- language_isos : List[str], optional, default=['en']
- A list of ISO-2 codes for desired Wiktionary dumps.
-
- dump_snapshot : str, optional, default='latest'
- The Wiktionary dump snapshot to be downloaded.
-
- Returns
- -------
- Path
- Path to the downloaded file, or None if aborted/failed.
- """
- if isinstance(language_isos, str):
- language_isos = [language_isos]
-
- resolved_isos = []
- not_included_isos = []
- for lang in language_isos:
- iso = resolve_lang_iso(lang)
- if iso:
- resolved_isos.append(iso)
-
- else:
- not_included_isos.append(lang)
-
- if not_included_isos:
- iso_or_isos = "iso" if len(not_included_isos) == 1 else "isos"
- is_or_are = "is" if len(not_included_isos) == 1 else "are"
- rprint(
- f"[bold red]The following {iso_or_isos} {is_or_are} not included: {', '.join(not_included_isos)}[/bold red]"
- )
- return None
-
- language_isos = resolved_isos
- wiktionaries = [f"{iso}wiktionary" for iso in language_isos]
- wiktionary_urls = [f"https://dumps.wikimedia.org/{w}" for w in wiktionaries]
-
- Path(output_dir).mkdir(parents=True, exist_ok=True)
- for i, w, u in zip(language_isos, wiktionaries, wiktionary_urls):
- # Note: Remove the snapshot from the resulting filename so Scribe-Server always looks for one file.
- filename = f"{w}-pages-articles.xml.bz2"
- download_filename = f"{w}-{dump_snapshot}-pages-articles.xml.bz2"
- download_url = f"{u}/{dump_snapshot}/{download_filename}"
-
- rprint(f"[bold blue]Checking dump validity at {download_url}...[/bold blue]")
- try:
- response = requests.head(download_url, timeout=30)
- response.raise_for_status()
-
- except requests.exceptions.RequestException as e:
- rprint(f"[bold red]Invalid dump date or dump not found: {e}[/bold red]")
- return None
-
- output_path = output_dir / filename
-
- if output_path.exists():
- rprint(f"[bold yellow]Dump already exists: {output_path}[/bold yellow]")
- user_input = questionary.select(
- "Do you want to:",
- choices=[
- "Skip download",
- "Download and overwrite",
- ],
- ).ask()
- if user_input == "Skip download":
- rprint("[bold green]Skipping download.[/bold green]")
- return output_path
-
- rprint(f"[bold blue]Downloading to {output_path}...[/bold blue]")
- try:
- response = requests.get(download_url, stream=True, timeout=30)
- response.raise_for_status()
- total_size = int(response.headers.get("content-length", 0))
-
- with open(output_path, "wb") as f:
- with tqdm(
- total=total_size,
- unit="iB",
- unit_scale=True,
- desc=download_filename,
- ) as pbar:
- for chunk in response.iter_content(chunk_size=8192):
- if chunk:
- f.write(chunk)
- pbar.update(len(chunk))
-
- rprint(
- f"[bold green]{i.upper()}Wiktionary dump download completed successfully![/bold green]"
- )
- return output_path
-
- except requests.exceptions.RequestException as e:
- rprint(f"[bold red]Download failed: {e}[/bold red]")
- return None
-
- iso_or_isos = "iso" if len(not_included_isos) == 1 else "isos"
- was_or_were = "was" if len(not_included_isos) == 1 else "were"
- rprint(
- f"[bold green]The following {iso_or_isos} {was_or_were} successfully downloaded: {', '.join(not_included_isos)}[/bold green]"
- )
diff --git a/src/scribe_data/cli/download/wiktionary_dump.py b/src/scribe_data/cli/download/wiktionary_dump.py
new file mode 100644
index 000000000..a9fda0f62
--- /dev/null
+++ b/src/scribe_data/cli/download/wiktionary_dump.py
@@ -0,0 +1,131 @@
+# SPDX-License-Identifier: GPL-3.0-or-later
+"""
+Functions for downloading Wiktionary dumps.
+"""
+
+from pathlib import Path
+
+import questionary
+import requests
+from rich import print as rprint
+from tqdm import tqdm
+
+from scribe_data.utils import (
+ DEFAULT_WIKTIONARY_DUMP_EXPORT_DIR,
+ resolve_lang_iso,
+)
+
+
+def download_wiktionary_dumps(
+ output_dir: Path = DEFAULT_WIKTIONARY_DUMP_EXPORT_DIR,
+ language_isos: list[str] = ["en"],
+ dump_snapshot: str | None = "latest",
+) -> Path | None:
+ """
+ Download a Wiktionary pages-articles dump for the passed language isos and dump snapshot.
+
+ Parameters
+ ----------
+ output_dir : Path, optional, default=DEFAULT_WIKTIONARY_DUMP_EXPORT_DIR
+ Directory to save the dump. Defaults to DEFAULT_WIKTIONARY_DUMP_EXPORT_DIR.
+
+ language_isos : List[str], optional, default=['en']
+ A list of ISO-2 codes for desired Wiktionary dumps.
+
+ dump_snapshot : str, optional, default='latest'
+ The Wiktionary dump snapshot to be downloaded.
+
+ Returns
+ -------
+ Path or None
+ Path to the downloaded file, or None if aborted/failed.
+ """
+ if isinstance(language_isos, str):
+ language_isos = [language_isos]
+
+ resolved_isos = []
+ not_included_isos = []
+ for lang in language_isos:
+ iso = resolve_lang_iso(lang)
+ if iso:
+ resolved_isos.append(iso)
+
+ else:
+ not_included_isos.append(lang)
+
+ if not_included_isos:
+ iso_or_isos = "iso" if len(not_included_isos) == 1 else "isos"
+ is_or_are = "is" if len(not_included_isos) == 1 else "are"
+ rprint(
+ f"[bold red]The following {iso_or_isos} {is_or_are} not included: {', '.join(not_included_isos)}[/bold red]"
+ )
+ return None
+
+ language_isos = resolved_isos
+ wiktionaries = [f"{iso}wiktionary" for iso in language_isos]
+ wiktionary_urls = [f"https://dumps.wikimedia.org/{w}" for w in wiktionaries]
+
+ Path(output_dir).mkdir(parents=True, exist_ok=True)
+ for i, w, u in zip(language_isos, wiktionaries, wiktionary_urls):
+ # Note: Remove the snapshot from the resulting filename so Scribe-Server always looks for one file.
+ filename = f"{w}-pages-articles.xml.bz2"
+ download_filename = f"{w}-{dump_snapshot}-pages-articles.xml.bz2"
+ download_url = f"{u}/{dump_snapshot}/{download_filename}"
+
+ rprint(f"[bold blue]Checking dump validity at {download_url}...[/bold blue]")
+ try:
+ response = requests.head(download_url, timeout=30)
+ response.raise_for_status()
+
+ except requests.exceptions.RequestException as e:
+ rprint(f"[bold red]Invalid dump date or dump not found: {e}[/bold red]")
+ return None
+
+ output_path = output_dir / filename
+
+ if output_path.exists():
+ rprint(f"[bold yellow]Dump already exists: {output_path}[/bold yellow]")
+ user_input = questionary.select(
+ "Do you want to:",
+ choices=[
+ "Skip download",
+ "Download and overwrite",
+ ],
+ ).ask()
+ if user_input == "Skip download":
+ rprint("[bold green]Skipping download.[/bold green]")
+ return output_path
+
+ rprint(f"[bold blue]Downloading to {output_path}...[/bold blue]")
+ try:
+ response = requests.get(download_url, stream=True, timeout=30)
+ response.raise_for_status()
+ total_size = int(response.headers.get("content-length", 0))
+
+ with open(output_path, "wb") as f:
+ with tqdm(
+ total=total_size,
+ unit="iB",
+ unit_scale=True,
+ desc=download_filename,
+ ) as pbar:
+ for chunk in response.iter_content(chunk_size=8192):
+ if chunk:
+ f.write(chunk)
+ pbar.update(len(chunk))
+
+ rprint(
+ f"[bold green]{i.upper()}Wiktionary dump download completed successfully![/bold green]"
+ )
+ return output_path
+
+ except requests.exceptions.RequestException as e:
+ rprint(f"[bold red]Download failed: {e}[/bold red]")
+ return None
+
+ # Note: The summary below reports the requested isos in language_isos.
+ iso_or_isos = "iso" if len(language_isos) == 1 else "isos"
+ was_or_were = "was" if len(language_isos) == 1 else "were"
+ rprint(
+ f"[bold green]The following {iso_or_isos} {was_or_were} successfully downloaded: {', '.join(language_isos)}[/bold green]"
+ )
diff --git a/src/scribe_data/cli/get.py b/src/scribe_data/cli/get.py
index 71f7b0c18..6f5dc3a69 100644
--- a/src/scribe_data/cli/get.py
+++ b/src/scribe_data/cli/get.py
@@ -14,7 +14,7 @@
from rich import print as rprint
from SPARQLWrapper.SPARQLExceptions import EndPointInternalError
-from scribe_data.cli.convert import convert_wrapper
+from scribe_data.cli.convert.wrapper import convert_wrapper
from scribe_data.unicode.generate_emoji_keywords import generate_emoji
from scribe_data.utils import (
DEFAULT_CSV_EXPORT_DIR,
diff --git a/src/scribe_data/cli/interactive.py b/src/scribe_data/cli/interactive.py
deleted file mode 100644
index cb208e6c2..000000000
--- a/src/scribe_data/cli/interactive.py
+++ /dev/null
@@ -1,663 +0,0 @@
-# SPDX-License-Identifier: GPL-3.0-or-later
-"""
-Interactive mode functionality for the Scribe-Data CLI to allow users to select request arguments.
-"""
-
-import logging
-from pathlib import Path
-
-import questionary
-from prompt_toolkit import prompt
-from prompt_toolkit.completion import WordCompleter
-from rich import print as rprint
-from rich.console import Console
-from rich.logging import RichHandler
-from rich.table import Table
-from tqdm import tqdm
-
-from scribe_data.cli.convert import convert_wrapper
-
-# from scribe_data.cli.list import list_wrapper
-from scribe_data.cli.get import get_data
-from scribe_data.cli.total import total_wrapper
-from scribe_data.utils import (
- DEFAULT_JSON_EXPORT_DIR,
- DEFAULT_SQLITE_EXPORT_DIR,
- DEFAULT_WIKIDATA_DUMP_EXPORT_DIR,
- DEFAULT_WIKTIONARY_DUMP_EXPORT_DIR,
- DEFAULT_WIKTIONARY_JSON_EXPORT_DIR,
- data_type_metadata,
- language_metadata,
- list_all_languages,
- resolve_lang_iso,
-)
-from scribe_data.wikidata.wikidata_utils import parse_wd_lexeme_dump
-
-# MARK: Config Setup
-
-logging.basicConfig(
- level=logging.INFO,
- format="%(message)s",
- datefmt="[%X]",
- handlers=[RichHandler(markup=True)], # Enable markup for colors
-)
-console = Console()
-logger = logging.getLogger("rich")
-THANK_YOU_MESSAGE = "[bold cyan]Thank you for using Scribe-Data![/bold cyan]"
-
-
-class ScribeDataConfig:
- """
- Class for the configuration of the interactive mode.
- """
-
- def __init__(self) -> None:
- """
- Configure the interactive mode.
- """
- self.languages = list_all_languages(language_metadata)
- self.data_types = list(data_type_metadata.keys())
- self.selected_languages: list[str] = []
- self.selected_data_types: list[str] = []
- self.output_type: str = "json"
- self.output_dir: Path = DEFAULT_JSON_EXPORT_DIR
- self.overwrite: bool = False
- self.configured: bool = False
- self.identifier_case: str = "camel"
- self.input_dir: Path = DEFAULT_JSON_EXPORT_DIR
- self.output_dir_sqlite: Path = DEFAULT_SQLITE_EXPORT_DIR
-
-
-config = ScribeDataConfig()
-
-
-# MARK: Summary
-
-
-def display_summary() -> None:
- """
- Display a summary of the interactive mode request to run.
- """
- table = Table(
- title="Scribe-Data Request Configuration Summary", style="bright_white"
- )
-
- table.add_column("Setting", style="bold cyan", no_wrap=True)
- table.add_column("Value(s)", style="magenta")
-
- table.add_row("Languages", ", ".join(config.selected_languages) or "None")
- table.add_row("Data Types", ", ".join(config.selected_data_types) or "None")
- table.add_row("Output Type", config.output_type)
- table.add_row("Output Directory", str(config.output_dir))
- table.add_row("Overwrite", "Yes" if config.overwrite else "No")
-
- console.print("\n")
- console.print(table, justify="left")
- console.print("\n")
-
-
-# Helper function to create a WordCompleter.
-def create_word_completer(
- options: list[str], include_all: bool = False
-) -> WordCompleter:
- """
- Return a word completer object of the given options.
-
- Parameters
- ----------
- options : List[str]
- The options that could complete the current input.
-
- include_all : bool
- Whether 'All' should be an option.
-
- Returns
- -------
- WordCompleter
- The word completer object from which completions can be shown to the user.
- """
- if include_all:
- options = ["All"] + options
-
- return WordCompleter(options, ignore_case=True)
-
-
-# MARK: Language Selection
-
-
-def prompt_for_languages() -> None:
- """
- Request language and data type for lexeme totals.
-
- Returns
- -------
- None
- Languages are added to the configuration or are asked for.
- """
- language_completer = create_word_completer(config.languages, include_all=True)
- initial_language_selection = ", ".join(config.selected_languages)
- selected_languages = prompt(
- "Select languages (comma-separated or 'All'): ",
- default=initial_language_selection,
- completer=language_completer,
- )
- if "All" in selected_languages:
- config.selected_languages = config.languages
-
- elif selected_languages.strip(): # check if input is not just whitespace
- config.selected_languages = [
- lang.strip()
- for lang in selected_languages.split(",")
- if lang.strip() in config.languages
- ]
-
- if not config.selected_languages:
- rprint("[yellow]No language selected. Please try again.[/yellow]")
- return prompt_for_languages()
-
-
-def _wiktionary_dump_search_dirs(location: Path) -> list[Path]:
- """
- Build an ordered list of directories to search for Wiktionary dumps.
-
- Each candidate directory is resolved and included only if it exists.
- Duplicate paths are omitted while preserving the following search order:
-
- 1. The provided ``location`` directory.
- 2. The default export directory (:data:`~scribe_data.utils.DEFAULT_WIKTIONARY_DUMP_EXPORT_DIR`).
- 3. The default export directory under every ancestor of the current working directory.
- 4. The current working directory itself.
-
- Searching ancestor directories allows dumps to be found when the interactive mode
- is started from a nested folder (e.g., ``scribe_data_wiktionary_json_export/spanish``).
-
- Parameters
- ----------
- location : Path
- User-supplied dump path or search root from
- :func:`resolve_wiktionary_dump_path`.
-
- Returns
- -------
- list[Path]
- A deduplicated list of existing directories to search.
- """
- candidates = [
- location,
- DEFAULT_WIKTIONARY_DUMP_EXPORT_DIR,
- *(parent / DEFAULT_WIKTIONARY_DUMP_EXPORT_DIR for parent in Path.cwd().parents),
- Path.cwd(),
- ]
- resolved_paths = [path.expanduser().resolve() for path in candidates]
- return list(dict.fromkeys(path for path in resolved_paths if path.is_dir()))
-
-
-def resolve_wiktionary_dump_path(language: str, location: str | Path) -> Path | None:
- """
- Resolve a Wiktionary dump file for the given source language.
-
- Locates the newest Wiktionary XML dump for the specified language.
- If the ``location`` argument points directly to a file, that file is returned.
- Otherwise, it searches through a prioritized list of directories for dumps
- matching the ``{iso}wiktionary*pages-articles.xml*`` pattern.
-
- Parameters
- ----------
- language : str
- Source language name (e.g. ``german``).
-
- location : str or Path
- Path to a specific dump file, or a base directory to begin searching from.
-
- Returns
- -------
- Path or None
- The path to the newest matching dump file, the explicit file if ``location``
- is a file, or ``None`` if no matching dump is found.
- """
- path = Path(location).expanduser().resolve()
- if path.is_file():
- return path
-
- if not (iso := resolve_lang_iso(language)):
- return None
-
- dumps = [
- dump_path
- for search_dir in _wiktionary_dump_search_dirs(path)
- for dump_path in search_dir.glob(f"{iso}wiktionary*pages-articles.xml*")
- ]
- return (
- max(dumps, key=lambda dump_path: dump_path.stat().st_mtime).resolve()
- if dumps
- else None
- )
-
-
-# MARK: Data Type Selection
-
-
-def prompt_for_data_types() -> None:
- """
- Prompt the user to select data types.
-
- Returns
- -------
- None
- Data types are added to the configuration or are asked for.
- """
- data_type_completer = create_word_completer(config.data_types, include_all=True)
- initial_data_type_selection = ", ".join(config.selected_data_types)
-
- while True:
- selected_data_types = prompt(
- "Select data types (comma-separated or 'All'): ",
- default=initial_data_type_selection,
- completer=data_type_completer,
- )
- if "All" in selected_data_types.capitalize():
- config.selected_data_types = config.data_types
- break
-
- elif selected_data_types.strip(): # check if input is not just whitespace
- config.selected_data_types = [
- dt.strip()
- for dt in selected_data_types.split(",")
- if dt.strip() in config.data_types
- ]
- if config.selected_data_types:
- break # exit loop if valid data types are selected
-
- rprint("[yellow]No data type selected. Please try again.[/yellow]")
-
-
-def configure_settings() -> None:
- """
- Configure the settings of the interactive mode request.
-
- Asks for:
- - Languages
- - Data types
- - Output type
- - Output directory
- - Whether to overwrite
- """
- rprint(
- "[cyan]Follow the prompts below. Press tab for completions and enter to select.[/cyan]"
- )
- prompt_for_languages()
- prompt_for_data_types()
-
- # MARK: Outputs
-
- output_type_completer = create_word_completer(["json", "csv", "tsv"])
- config.output_type = prompt(
- "Select output type (json/csv/tsv): ",
- default="json",
- completer=output_type_completer,
- )
- while config.output_type not in ["json", "csv", "tsv"]:
- rprint("[yellow]Invalid output type selected. Please try again.[/yellow]")
- config.output_type = prompt(
- "Select output type (json/csv/tsv): ",
- default="json",
- completer=output_type_completer,
- )
-
- # MARK: Output Directory
-
- if output_dir := prompt(f"Enter output directory (default: {config.output_dir}): "):
- config.output_dir = Path(output_dir)
-
- # MARK: Overwrite Confirmation
-
- overwrite_completer = create_word_completer(["Y", "n"])
- overwrite = (
- prompt("Overwrite existing files? (Y/n): ", completer=overwrite_completer)
- or "y"
- )
- config.overwrite = overwrite.lower() == "y"
-
- config.configured = True
- display_summary()
-
-
-def run_request() -> None:
- """
- Execute the interactive mode request based on current configuration.
-
- Returns
- -------
- None
- An interactive mode request is ran.
- """
- if not config.selected_languages or not config.selected_data_types:
- rprint("[bold red]Error: Please configure languages and data types.[/bold red]")
- return
-
- # Calculate total operations
- total_operations = len(config.selected_languages) * len(config.selected_data_types)
-
- # MARK: Export Data
-
- with tqdm(
- total=total_operations,
- desc="Exporting data",
- unit="operation",
- ) as pbar:
- for language in config.selected_languages:
- for data_type in config.selected_data_types:
- pbar.set_description(f"Exporting {language} {data_type} data")
-
- try:
- get_data(
- languages=[language],
- data_types=[data_type],
- output_type=config.output_type,
- output_dir=config.output_dir,
- overwrite=config.overwrite,
- interactive=True,
- )
- # The data was successfully written to file, so we can log success
- logger.info(
- f"[green]✔ Successfully exported {language} {data_type} data.[/green]"
- )
- except Exception as e:
- logger.error(
- f"[red]✖ Failed to export {language} {data_type} data: {str(e)}[/red]"
- )
-
- pbar.update(1)
-
- if config.overwrite:
- rprint("[bold green]Data request completed successfully![/bold green]")
-
-
-def request_total_lexeme_loop() -> None:
- """
- Continuously prompts for lexeme requests until exit.
- """
- while True:
- choice = questionary.select(
- "What would you like to do?",
- choices=[
- questionary.Choice("Configure total lexemes request", "total"),
- questionary.Choice("Run total lexemes request", "run"),
- questionary.Choice(
- "Run total lexemes request with lexeme dumps", "run_all"
- ),
- questionary.Choice("Exit", "exit"),
- ],
- ).ask()
-
- if choice == "run":
- total_wrapper(
- languages=config.selected_languages,
- data_types=config.selected_data_types,
- all_bool=False,
- )
- config.selected_languages, config.selected_data_types = [], []
- rprint(THANK_YOU_MESSAGE)
- break
-
- elif choice == "run_all":
- if wikidata_dump_path := prompt(
- f"Enter Wikidata lexeme dump path (default: {str(DEFAULT_WIKIDATA_DUMP_EXPORT_DIR)}): "
- ):
- wikidata_dump_path = Path(wikidata_dump_path)
-
- else:
- wikidata_dump_path = DEFAULT_WIKIDATA_DUMP_EXPORT_DIR
-
- parse_wd_lexeme_dump(
- languages=config.selected_languages,
- wikidata_dump_path=wikidata_dump_path,
- wikidata_dump_type=["total"],
- interactive_mode=True,
- )
- break
-
- elif choice == "exit":
- return
-
- else:
- prompt_for_languages()
- prompt_for_data_types()
-
-
-# MARK: List
-
-# def see_list_languages():
-# """
-# See list of languages.
-# """
-
-# choice = select(
-# "What would you like to list?",
-# choices=[
-# Choice("All languages", "all_languages"),
-# Choice("Languages for a specific data type", "languages_for_data_type"),
-# Choice("Data types for a specific language", "data_types_for_language"),
-# ],
-# ).ask()
-
-# if choice == "all_languages":
-# list_wrapper(all_bool=True)
-
-# elif choice == "languages_for_data_type":
-# list_wrapper(data_type=True)
-
-# elif choice == "data_types_for_language":
-# list_wrapper(language=True)
-
-
-# MARK: Start
-
-
-def start_interactive_mode(operation: str | None = None) -> None:
- """
- Entry point for interactive mode.
-
- Parameters
- ----------
- operation : str
- The type of operation that interactive mode is being ran with.
- """
- while True:
- # Check if both selected_languages and selected_data_types are empty.
- if config.selected_languages or config.selected_data_types:
- choices = [
- questionary.Choice("Configure get data request", "configure"),
- questionary.Choice("Exit", "exit"),
- ]
-
- if config.configured:
- choices.insert(
- 1, questionary.Choice("Run get data request with WDQS", "run")
- )
- choices.insert(
- 2,
- questionary.Choice(
- "Run get lexemes request with lexeme dumps", "run_all"
- ),
- )
-
- elif config.selected_languages and config.selected_data_types:
- choices.insert(
- 1, questionary.Choice("Request for convert JSON", "convert_json")
- )
-
- else:
- choices.insert(
- 1, questionary.Choice("Request for total lexeme", "total")
- )
-
- elif operation == "get":
- choices = [
- questionary.Choice("Configure get data request", "configure"),
- # Choice("See list of languages", "languages"),
- questionary.Choice("Exit", "exit"),
- ]
-
- elif operation == "total":
- choices = [
- questionary.Choice("Configure total lexemes request", "total"),
- # Choice("See list of languages", "languages"),
- questionary.Choice("Exit", "exit"),
- ]
-
- elif operation == "convert":
- choices = [
- questionary.Choice("Configure convert request", "convert"),
- questionary.Choice("Exit", "exit"),
- ]
-
- elif operation == "translations":
- choices = [
- questionary.Choice("Configure translations request", "translations"),
- # Choice("See list of languages", "languages"),
- questionary.Choice("Exit", "exit"),
- ]
-
- choice = questionary.select("What would you like to do?", choices=choices).ask()
-
- if choice == "configure":
- configure_settings()
-
- elif choice == "run_all":
- if wikidata_dump_path := prompt(
- f"Enter Wikidata lexeme dump path (default: {str(DEFAULT_WIKIDATA_DUMP_EXPORT_DIR)}): "
- ):
- wikidata_dump_path = Path(wikidata_dump_path)
-
- else:
- wikidata_dump_path = DEFAULT_WIKIDATA_DUMP_EXPORT_DIR
-
- parse_wd_lexeme_dump(
- languages=config.selected_languages,
- data_types=config.selected_data_types,
- wikidata_dump_type=["form"],
- output_dir=config.output_dir,
- wikidata_dump_path=wikidata_dump_path,
- overwrite_all=config.overwrite,
- interactive_mode=True,
- )
- rprint(THANK_YOU_MESSAGE)
- break
-
- elif choice == "total":
- prompt_for_languages()
- prompt_for_data_types()
- request_total_lexeme_loop()
- break
-
- elif choice == "convert":
- prompt_for_languages()
- prompt_for_data_types()
-
- # Use the default explicitly so that if the user enters nothing, the default value is retained.
- user_input_dir = prompt(
- f"Enter input directory (default: {config.input_dir}): ",
- default=str(config.input_dir),
- )
- config.input_dir = Path(user_input_dir)
-
- user_output_dir = prompt(
- f"Enter output directory (default: {config.output_dir_sqlite}): ",
- default=str(config.output_dir_sqlite),
- )
- config.output_dir_sqlite = Path(user_output_dir)
-
- identifier_case = prompt(
- "Enter identifier case (default: camel): ",
- default="camel",
- )
- output_type = prompt(
- "Enter output type (default: sqlite): ",
- default="sqlite",
- )
- overwrite_str = prompt(
- "Overwrite existing files? (default: False): ",
- default="False",
- )
- overwrite_bool = overwrite_str.strip().lower() in ("true", "y", "yes")
-
- convert_wrapper(
- languages=config.selected_languages,
- data_types=config.selected_data_types,
- input_path=config.input_dir, # Use the updated configuration value
- output_dir=config.output_dir_sqlite,
- output_type=output_type,
- identifier_case=identifier_case,
- overwrite=overwrite_bool,
- )
- break
-
- elif choice == "translations":
- from scribe_data.wiktionary.parse_translations import (
- parse_wiktionary_translations,
- )
-
- while True:
- wiktionary_dump_language = prompt(
- "Select Wiktionary dump source language: ",
- default="english",
- completer=create_word_completer(config.languages),
- ).strip()
- if wiktionary_dump_language in config.languages:
- break
- rprint(
- f"[bold red]Error: {wiktionary_dump_language} is not a valid language.[/bold red]"
- )
-
- dump_location = prompt(
- "Enter Wiktionary dump directory or file path "
- f"(default: {DEFAULT_WIKTIONARY_DUMP_EXPORT_DIR}): ",
- default=str(DEFAULT_WIKTIONARY_DUMP_EXPORT_DIR),
- )
- wiktionary_dump_path = resolve_wiktionary_dump_path(
- wiktionary_dump_language,
- dump_location,
- )
- if not wiktionary_dump_path:
- rprint(
- f"[bold red]No {wiktionary_dump_language} Wiktionary dump found at "
- f"{dump_location}.[/bold red]"
- )
- break
-
- prompt_for_languages()
-
- translations_output_dir = prompt(
- "Enter output directory "
- f"(default: {DEFAULT_WIKTIONARY_JSON_EXPORT_DIR}): ",
- default=str(DEFAULT_WIKTIONARY_JSON_EXPORT_DIR),
- )
-
- overwrite_str = prompt(
- "Overwrite existing files? (default: False): ",
- default="False",
- )
- overwrite_bool = overwrite_str.strip().lower() in ("true", "y", "yes")
-
- parse_wiktionary_translations(
- target_languages=config.selected_languages,
- wiktionary_dump_path=Path(wiktionary_dump_path),
- output_dir=Path(translations_output_dir),
- overwrite=overwrite_bool,
- )
-
- break
-
- elif choice == "run":
- run_request()
- rprint(THANK_YOU_MESSAGE)
- break
-
- else:
- rprint(THANK_YOU_MESSAGE)
- break
-
-
-if __name__ == "__main__":
- start_interactive_mode()
diff --git a/src/scribe_data/cli/interactive/__init__.py b/src/scribe_data/cli/interactive/__init__.py
new file mode 100644
index 000000000..e69de29bb
diff --git a/src/scribe_data/cli/interactive/config.py b/src/scribe_data/cli/interactive/config.py
new file mode 100644
index 000000000..7fc43c795
--- /dev/null
+++ b/src/scribe_data/cli/interactive/config.py
@@ -0,0 +1,42 @@
+# SPDX-License-Identifier: GPL-3.0-or-later
+"""
+Interactive mode configuration for the Scribe-Data CLI to allow users to select request arguments.
+"""
+
+from pathlib import Path
+
+# from scribe_data.cli.list import list_wrapper
+from scribe_data.utils import (
+ DEFAULT_JSON_EXPORT_DIR,
+ DEFAULT_SQLITE_EXPORT_DIR,
+ data_type_metadata,
+ language_metadata,
+ list_all_languages,
+)
+
+THANK_YOU_MESSAGE = "[bold cyan]Thank you for using Scribe-Data![/bold cyan]"
+
+
+class ScribeDataConfig:
+ """
+ Class for the configuration of the interactive mode.
+ """
+
+ def __init__(self) -> None:
+ """
+ Configure the interactive mode.
+ """
+ self.languages = list_all_languages(language_metadata)
+ self.data_types = list(data_type_metadata.keys())
+ self.selected_languages: list[str] = []
+ self.selected_data_types: list[str] = []
+ self.output_type: str = "json"
+ self.output_dir: Path = DEFAULT_JSON_EXPORT_DIR
+ self.overwrite: bool = False
+ self.configured: bool = False
+ self.identifier_case: str = "camel"
+ self.input_dir: Path = DEFAULT_JSON_EXPORT_DIR
+ self.output_dir_sqlite: Path = DEFAULT_SQLITE_EXPORT_DIR
+
+
+interactive_mode_config = ScribeDataConfig()
diff --git a/src/scribe_data/cli/interactive/execute.py b/src/scribe_data/cli/interactive/execute.py
new file mode 100644
index 000000000..00f58242d
--- /dev/null
+++ b/src/scribe_data/cli/interactive/execute.py
@@ -0,0 +1,182 @@
+# SPDX-License-Identifier: GPL-3.0-or-later
+"""
+Interactive mode execution for the Scribe-Data CLI to allow users to select request arguments.
+"""
+
+import logging
+from pathlib import Path
+
+import questionary
+from prompt_toolkit import prompt
+from rich import print as rprint
+from rich.console import Console
+from rich.logging import RichHandler
+from rich.table import Table
+from tqdm import tqdm
+
+# from scribe_data.cli.list import list_wrapper
+from scribe_data.cli.get import get_data
+from scribe_data.cli.interactive.config import (
+ THANK_YOU_MESSAGE,
+ interactive_mode_config,
+)
+from scribe_data.cli.interactive.prompt import (
+ prompt_for_data_types,
+ prompt_for_languages,
+)
+from scribe_data.cli.total.wrapper import total_wrapper
+from scribe_data.utils import DEFAULT_WIKIDATA_DUMP_EXPORT_DIR
+from scribe_data.wikidata.wikidata_utils import parse_wd_lexeme_dump
+
+# MARK: Logging
+
+logging.basicConfig(
+ level=logging.INFO,
+ format="%(message)s",
+ datefmt="[%X]",
+ handlers=[RichHandler(markup=True)], # Enable markup for colors
+)
+console = Console()
+logger = logging.getLogger("rich")
+
+# MARK: Execute Request
+
+
+def execute_request() -> None:
+ """
+ Execute the interactive mode request based on current configuration.
+
+ Returns
+ -------
+ None
+ An interactive mode request is run.
+ """
+ if (
+ not interactive_mode_config.selected_languages
+ or not interactive_mode_config.selected_data_types
+ ):
+ rprint("[bold red]Error: Please configure languages and data types.[/bold red]")
+ return
+
+ # Calculate total operations
+ total_operations = len(interactive_mode_config.selected_languages) * len(
+ interactive_mode_config.selected_data_types
+ )
+
+ # MARK: Export Data
+
+ with tqdm(
+ total=total_operations,
+ desc="Exporting data",
+ unit="operation",
+ ) as pbar:
+ for language in interactive_mode_config.selected_languages:
+ for data_type in interactive_mode_config.selected_data_types:
+ pbar.set_description(f"Exporting {language} {data_type} data")
+
+ try:
+ get_data(
+ languages=[language],
+ data_types=[data_type],
+ output_type=interactive_mode_config.output_type,
+ output_dir=interactive_mode_config.output_dir,
+ overwrite=interactive_mode_config.overwrite,
+ interactive=True,
+ )
+ # The data was successfully written to file, so we can log success
+ logger.info(
+ f"[green]✔ Successfully exported {language} {data_type} data.[/green]"
+ )
+ except Exception as e:
+ logger.error(
+ f"[red]✖ Failed to export {language} {data_type} data: {str(e)}[/red]"
+ )
+
+ pbar.update(1)
+
+ if interactive_mode_config.overwrite:
+ rprint("[bold green]Data request completed successfully![/bold green]")
+
+
+def request_total_lexeme_loop() -> None:
+ """
+ Continuously prompts for lexeme requests until exit.
+ """
+ while True:
+ choice = questionary.select(
+ "What would you like to do?",
+ choices=[
+ questionary.Choice("Configure total lexemes request", "total"),
+ questionary.Choice("Run total lexemes request", "run"),
+ questionary.Choice(
+ "Run total lexemes request with lexeme dumps", "run_all"
+ ),
+ questionary.Choice("Exit", "exit"),
+ ],
+ ).ask()
+
+ if choice == "run":
+ total_wrapper(
+ languages=interactive_mode_config.selected_languages,
+ data_types=interactive_mode_config.selected_data_types,
+ all_bool=False,
+ )
+ (
+ interactive_mode_config.selected_languages,
+ interactive_mode_config.selected_data_types,
+ ) = [], []
+ rprint(THANK_YOU_MESSAGE)
+ break
+
+ elif choice == "run_all":
+ if wikidata_dump_path := prompt(
+ f"Enter Wikidata lexeme dump path (default: {str(DEFAULT_WIKIDATA_DUMP_EXPORT_DIR)}): "
+ ):
+ wikidata_dump_path = Path(wikidata_dump_path)
+
+ else:
+ wikidata_dump_path = DEFAULT_WIKIDATA_DUMP_EXPORT_DIR
+
+ parse_wd_lexeme_dump(
+ languages=interactive_mode_config.selected_languages,
+ wikidata_dump_path=wikidata_dump_path,
+ wikidata_dump_type=["total"],
+ interactive_mode=True,
+ )
+ break
+
+ elif choice == "exit":
+ return
+
+ else:
+ prompt_for_languages()
+ prompt_for_data_types()
+
+
+# MARK: Summary
+
+
+def display_summary() -> None:
+ """
+ Display a summary of the interactive mode request to run.
+ """
+ table = Table(
+ title="Scribe-Data Request Configuration Summary", style="bright_white"
+ )
+
+ table.add_column("Setting", style="bold cyan", no_wrap=True)
+ table.add_column("Value(s)", style="magenta")
+
+ table.add_row(
+ "Languages", ", ".join(interactive_mode_config.selected_languages) or "None"
+ )
+ table.add_row(
+ "Data Types", ", ".join(interactive_mode_config.selected_data_types) or "None"
+ )
+ table.add_row("Output Type", interactive_mode_config.output_type)
+ table.add_row("Output Directory", str(interactive_mode_config.output_dir))
+ table.add_row("Overwrite", "Yes" if interactive_mode_config.overwrite else "No")
+
+ console.print("\n")
+ console.print(table, justify="left")
+ console.print("\n")
diff --git a/src/scribe_data/cli/interactive/prompt.py b/src/scribe_data/cli/interactive/prompt.py
new file mode 100644
index 000000000..0fa3098a4
--- /dev/null
+++ b/src/scribe_data/cli/interactive/prompt.py
@@ -0,0 +1,195 @@
+# SPDX-License-Identifier: GPL-3.0-or-later
+"""
+Interactive mode prompting for the Scribe-Data CLI to allow users to select request arguments.
+"""
+
+from pathlib import Path
+
+from prompt_toolkit import prompt
+from prompt_toolkit.completion import WordCompleter
+from rich import print as rprint
+
+from scribe_data.cli.interactive.config import interactive_mode_config
+from scribe_data.utils import DEFAULT_WIKTIONARY_DUMP_EXPORT_DIR, resolve_lang_iso
+
+# MARK: Word Completion
+
+
+def create_word_completer(
+ options: list[str], include_all: bool = False
+) -> WordCompleter:
+ """
+ Return a word completer object of the given options.
+
+ Parameters
+ ----------
+ options : List[str]
+ The options that could complete the current input.
+
+ include_all : bool
+ Whether 'All' should be an option.
+
+ Returns
+ -------
+ WordCompleter
+ The word completer object from which completions can be shown to the user.
+ """
+ if include_all:
+ options = ["All"] + options
+
+ return WordCompleter(options, ignore_case=True)
+
+
+# MARK: Language Selection
+
+
+def prompt_for_languages() -> None:
+ """
+ Prompt the user to select languages for the request.
+
+ Returns
+ -------
+ None
+ Languages are added to the configuration or are asked for.
+ """
+ language_completer = create_word_completer(
+ interactive_mode_config.languages, include_all=True
+ )
+ initial_language_selection = ", ".join(interactive_mode_config.selected_languages)
+ selected_languages = prompt(
+ "Select languages (comma-separated or 'All'): ",
+ default=initial_language_selection,
+ completer=language_completer,
+ )
+ if "All" in selected_languages:
+ interactive_mode_config.selected_languages = interactive_mode_config.languages
+
+ elif selected_languages.strip(): # check if input is not just whitespace
+ interactive_mode_config.selected_languages = [
+ lang.strip()
+ for lang in selected_languages.split(",")
+ if lang.strip() in interactive_mode_config.languages
+ ]
+
+ if not interactive_mode_config.selected_languages:
+ rprint("[yellow]No language selected. Please try again.[/yellow]")
+ return prompt_for_languages()
+
+
+def _wiktionary_dump_search_dirs(location: Path) -> list[Path]:
+ """
+ Build an ordered list of directories to search for Wiktionary dumps.
+
+ Each candidate directory is resolved and included only if it exists.
+ Duplicate paths are omitted while preserving the following search order:
+
+ 1. The provided ``location`` directory.
+ 2. The default export directory (:data:`~scribe_data.utils.DEFAULT_WIKTIONARY_DUMP_EXPORT_DIR`).
+ 3. The default export directory under every ancestor of the current working directory.
+ 4. The current working directory itself.
+
+ Searching ancestor directories allows dumps to be found when the interactive mode
+ is started from a nested folder (e.g., ``scribe_data_wiktionary_json_export/spanish``).
+
+ Parameters
+ ----------
+ location : Path
+ User-supplied dump path or search root from
+ :func:`resolve_wiktionary_dump_path`.
+
+ Returns
+ -------
+ list[Path]
+ A deduplicated list of existing directories to search.
+ """
+ candidates = [
+ location,
+ DEFAULT_WIKTIONARY_DUMP_EXPORT_DIR,
+ *(parent / DEFAULT_WIKTIONARY_DUMP_EXPORT_DIR for parent in Path.cwd().parents),
+ Path.cwd(),
+ ]
+ resolved_paths = [path.expanduser().resolve() for path in candidates]
+ return list(dict.fromkeys(path for path in resolved_paths if path.is_dir()))
+
+
+def resolve_wiktionary_dump_path(language: str, location: str | Path) -> Path | None:
+ """
+ Resolve a Wiktionary dump file for the given source language.
+
+ Locates the newest Wiktionary XML dump for the specified language.
+ If the ``location`` argument points directly to a file, that file is returned.
+ Otherwise, it searches through a prioritized list of directories for dumps
+ matching the ``{iso}wiktionary*pages-articles.xml*`` pattern.
+
+ Parameters
+ ----------
+ language : str
+ Source language name (e.g. ``german``).
+
+ location : str or Path
+ Path to a specific dump file, or a base directory to begin searching from.
+
+ Returns
+ -------
+ Path or None
+ The path to the newest matching dump file, the explicit file if ``location``
+ is a file, or ``None`` if no matching dump is found.
+ """
+ path = Path(location).expanduser().resolve()
+ if path.is_file():
+ return path
+
+ if not (iso := resolve_lang_iso(language)):
+ return None
+
+ dumps = [
+ dump_path
+ for search_dir in _wiktionary_dump_search_dirs(path)
+ for dump_path in search_dir.glob(f"{iso}wiktionary*pages-articles.xml*")
+ ]
+ return (
+ max(dumps, key=lambda dump_path: dump_path.stat().st_mtime).resolve()
+ if dumps
+ else None
+ )
+
+
+# MARK: Data Type Selection
+
+
+def prompt_for_data_types() -> None:
+ """
+ Prompt the user to select data types.
+
+ Returns
+ -------
+ None
+ Data types are added to the configuration or are asked for.
+ """
+ data_type_completer = create_word_completer(
+ interactive_mode_config.data_types, include_all=True
+ )
+ initial_data_type_selection = ", ".join(interactive_mode_config.selected_data_types)
+
+ while True:
+ selected_data_types = prompt(
+ "Select data types (comma-separated or 'All'): ",
+ default=initial_data_type_selection,
+ completer=data_type_completer,
+ )
+ if "all" in (dt.strip().lower() for dt in selected_data_types.split(",")):
+ interactive_mode_config.selected_data_types = (
+ interactive_mode_config.data_types
+ )
+ break
+
+ elif selected_data_types.strip(): # check if input is not just whitespace
+ interactive_mode_config.selected_data_types = [
+ dt.strip()
+ for dt in selected_data_types.split(",")
+ if dt.strip() in interactive_mode_config.data_types
+ ]
+ if interactive_mode_config.selected_data_types:
+ break # exit loop if valid data types are selected
+
+ rprint("[yellow]No data type selected. Please try again.[/yellow]")
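Review note on the dump lookup in prompt.py above: the following is a minimal standalone sketch of the two ideas it combines, namely order-preserving de-duplication of candidate directories with dict.fromkeys and picking the newest matching dump by modification time. The function names ordered_existing_dirs and newest_dump and the example dump file names are hypothetical and are not part of Scribe-Data; only the glob pattern is taken from the patch.

```python
from __future__ import annotations

import os
import tempfile
from pathlib import Path


def ordered_existing_dirs(candidates: list[Path]) -> list[Path]:
    """Resolve candidates, keep existing directories, and de-duplicate in order."""
    resolved = [path.expanduser().resolve() for path in candidates]
    # dict.fromkeys keeps the first occurrence of each key, so priority order survives.
    return list(dict.fromkeys(path for path in resolved if path.is_dir()))


def newest_dump(search_dirs: list[Path], iso: str) -> Path | None:
    """Return the most recently modified dump for the given ISO code, if any."""
    dumps = [
        dump
        for directory in search_dirs
        for dump in directory.glob(f"{iso}wiktionary*pages-articles.xml*")
    ]
    return max(dumps, key=lambda dump: dump.stat().st_mtime) if dumps else None


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        # Hypothetical dump names used only to exercise the glob pattern.
        older = root / "dewiktionary-20240101-pages-articles.xml.bz2"
        newer = root / "dewiktionary-20240201-pages-articles.xml.bz2"
        older.touch()
        newer.touch()
        os.utime(older, (1_000, 1_000))
        os.utime(newer, (2_000, 2_000))

        dirs = ordered_existing_dirs([root, root / "missing", root])
        print(len(dirs))
        print(newest_dump(dirs, "de").name)
```

Because the newest file is chosen by modification time rather than by the date in its name, a freshly copied old dump would win over a newer one; that trade-off is inherited from the approach in the patch.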
diff --git a/src/scribe_data/cli/interactive/run.py b/src/scribe_data/cli/interactive/run.py
new file mode 100644
index 000000000..2aea1febb
--- /dev/null
+++ b/src/scribe_data/cli/interactive/run.py
@@ -0,0 +1,306 @@
+# SPDX-License-Identifier: GPL-3.0-or-later
+"""
+Interactive mode runner for the Scribe-Data CLI to allow users to select request arguments.
+"""
+
+from pathlib import Path
+
+import questionary
+from prompt_toolkit import prompt
+from rich import print as rprint
+
+from scribe_data.cli.convert.wrapper import convert_wrapper
+from scribe_data.cli.interactive.config import (
+ THANK_YOU_MESSAGE,
+ interactive_mode_config,
+)
+from scribe_data.cli.interactive.execute import (
+ display_summary,
+ execute_request,
+ request_total_lexeme_loop,
+)
+from scribe_data.cli.interactive.prompt import (
+ create_word_completer,
+ prompt_for_data_types,
+ prompt_for_languages,
+ resolve_wiktionary_dump_path,
+)
+from scribe_data.utils import (
+ DEFAULT_WIKIDATA_DUMP_EXPORT_DIR,
+ DEFAULT_WIKTIONARY_DUMP_EXPORT_DIR,
+ DEFAULT_WIKTIONARY_JSON_EXPORT_DIR,
+)
+from scribe_data.wikidata.wikidata_utils import parse_wd_lexeme_dump
+
+# MARK: Configure
+
+
+def configure_settings() -> None:
+ """
+ Configure the settings of the interactive mode request.
+
+ Asks for:
+ - Languages
+ - Data types
+ - Output type
+ - Output directory
+ - Whether to overwrite
+ """
+ rprint(
+ "[cyan]Follow the prompts below. Press tab for completions and enter to select.[/cyan]"
+ )
+ prompt_for_languages()
+ prompt_for_data_types()
+
+ # MARK: Outputs
+
+ output_type_completer = create_word_completer(["json", "csv", "tsv"])
+ interactive_mode_config.output_type = prompt(
+ "Select output type (json/csv/tsv): ",
+ default="json",
+ completer=output_type_completer,
+ )
+ while interactive_mode_config.output_type not in ["json", "csv", "tsv"]:
+ rprint("[yellow]Invalid output type selected. Please try again.[/yellow]")
+ interactive_mode_config.output_type = prompt(
+ "Select output type (json/csv/tsv): ",
+ default="json",
+ completer=output_type_completer,
+ )
+
+ # MARK: Output Directory
+
+ if output_dir := prompt(
+ f"Enter output directory (default: {interactive_mode_config.output_dir}): "
+ ):
+ interactive_mode_config.output_dir = Path(output_dir)
+
+ # MARK: Overwrite Confirmation
+
+ overwrite_completer = create_word_completer(["Y", "n"])
+ overwrite = (
+ prompt("Overwrite existing files? (Y/n): ", completer=overwrite_completer)
+ or "y"
+ )
+ interactive_mode_config.overwrite = overwrite.lower() == "y"
+
+ interactive_mode_config.configured = True
+ display_summary()
+
+
+# MARK: Start
+
+
+def run_interactive_mode(operation: str | None = None) -> None:
+ """
+ Entry point for interactive mode.
+
+ Parameters
+ ----------
+ operation : str, optional
+ The type of operation that interactive mode is being run with.
+ """
+ while True:
+ # Check if any languages or data types have already been selected.
+ if (
+ interactive_mode_config.selected_languages
+ or interactive_mode_config.selected_data_types
+ ):
+ choices = [
+ questionary.Choice("Configure get data request", "configure"),
+ questionary.Choice("Exit", "exit"),
+ ]
+
+ if interactive_mode_config.configured:
+ choices.insert(
+ 1, questionary.Choice("Run get data request with WDQS", "run")
+ )
+ choices.insert(
+ 2,
+ questionary.Choice(
+ "Run get lexemes request with lexeme dumps", "run_all"
+ ),
+ )
+
+ elif (
+ interactive_mode_config.selected_languages
+ and interactive_mode_config.selected_data_types
+ ):
+ choices.insert(
+ 1, questionary.Choice("Request for convert JSON", "convert_json")
+ )
+
+ else:
+ choices.insert(
+ 1, questionary.Choice("Request for total lexeme", "total")
+ )
+
+ elif operation == "get":
+ choices = [
+ questionary.Choice("Configure get data request", "configure"),
+ # Choice("See list of languages", "languages"),
+ questionary.Choice("Exit", "exit"),
+ ]
+
+ elif operation == "total":
+ choices = [
+ questionary.Choice("Configure total lexemes request", "total"),
+ # Choice("See list of languages", "languages"),
+ questionary.Choice("Exit", "exit"),
+ ]
+
+ elif operation == "convert":
+ choices = [
+ questionary.Choice("Configure convert request", "convert"),
+ questionary.Choice("Exit", "exit"),
+ ]
+
+ elif operation == "translations":
+ choices = [
+ questionary.Choice("Configure translations request", "translations"),
+ # Choice("See list of languages", "languages"),
+ questionary.Choice("Exit", "exit"),
+ ]
+
+ choice = questionary.select("What would you like to do?", choices=choices).ask()
+
+ if choice == "configure":
+ configure_settings()
+
+ elif choice == "run_all":
+ if wikidata_dump_path := prompt(
+ f"Enter Wikidata lexeme dump path (default: {DEFAULT_WIKIDATA_DUMP_EXPORT_DIR}): "
+ ):
+ wikidata_dump_path = Path(wikidata_dump_path)
+
+ else:
+ wikidata_dump_path = DEFAULT_WIKIDATA_DUMP_EXPORT_DIR
+
+ parse_wd_lexeme_dump(
+ languages=interactive_mode_config.selected_languages,
+ data_types=interactive_mode_config.selected_data_types,
+ wikidata_dump_type=["form"],
+ output_dir=interactive_mode_config.output_dir,
+ wikidata_dump_path=wikidata_dump_path,
+ overwrite_all=interactive_mode_config.overwrite,
+ interactive_mode=True,
+ )
+ rprint(THANK_YOU_MESSAGE)
+ break
+
+ elif choice == "total":
+ prompt_for_languages()
+ prompt_for_data_types()
+ request_total_lexeme_loop()
+ break
+
+ elif choice == "convert":
+ prompt_for_languages()
+ prompt_for_data_types()
+
+ # Use the default explicitly so that if the user enters nothing, the default value is retained.
+ user_input_dir = prompt(
+ f"Enter input directory (default: {interactive_mode_config.input_dir}): ",
+ default=str(interactive_mode_config.input_dir),
+ )
+ interactive_mode_config.input_dir = Path(user_input_dir)
+
+ user_output_dir = prompt(
+ f"Enter output directory (default: {interactive_mode_config.output_dir_sqlite}): ",
+ default=str(interactive_mode_config.output_dir_sqlite),
+ )
+ interactive_mode_config.output_dir_sqlite = Path(user_output_dir)
+
+ identifier_case = prompt(
+ "Enter identifier case (default: camel): ",
+ default="camel",
+ )
+ output_type = prompt(
+ "Enter output type (default: sqlite): ",
+ default="sqlite",
+ )
+ overwrite_str = prompt(
+ "Overwrite existing files? (default: False): ",
+ default="False",
+ )
+ overwrite_bool = overwrite_str.strip().lower() in ("true", "y", "yes")
+
+ convert_wrapper(
+ languages=interactive_mode_config.selected_languages,
+ data_types=interactive_mode_config.selected_data_types,
+ input_path=interactive_mode_config.input_dir, # Use the updated configuration value
+ output_dir=interactive_mode_config.output_dir_sqlite,
+ output_type=output_type,
+ identifier_case=identifier_case,
+ overwrite=overwrite_bool,
+ )
+ break
+
+ elif choice == "translations":
+ from scribe_data.wiktionary.parse_translations import (
+ parse_wiktionary_translations,
+ )
+
+ while True:
+ wiktionary_dump_language = prompt(
+ "Select Wiktionary dump source language: ",
+ default="english",
+ completer=create_word_completer(interactive_mode_config.languages),
+ ).strip()
+ if wiktionary_dump_language in interactive_mode_config.languages:
+ break
+ rprint(
+ f"[bold red]Error: {wiktionary_dump_language} is not a valid language.[/bold red]"
+ )
+
+ dump_location = prompt(
+ "Enter Wiktionary dump directory or file path "
+ f"(default: {DEFAULT_WIKTIONARY_DUMP_EXPORT_DIR}): ",
+ default=str(DEFAULT_WIKTIONARY_DUMP_EXPORT_DIR),
+ )
+ wiktionary_dump_path = resolve_wiktionary_dump_path(
+ wiktionary_dump_language,
+ dump_location,
+ )
+ if not wiktionary_dump_path:
+ rprint(
+ f"[bold red]No {wiktionary_dump_language} Wiktionary dump found at "
+ f"{dump_location}.[/bold red]"
+ )
+ break
+
+ prompt_for_languages()
+
+ translations_output_dir = prompt(
+ "Enter output directory "
+ f"(default: {DEFAULT_WIKTIONARY_JSON_EXPORT_DIR}): ",
+ default=str(DEFAULT_WIKTIONARY_JSON_EXPORT_DIR),
+ )
+
+ overwrite_str = prompt(
+ "Overwrite existing files? (default: False): ",
+ default="False",
+ )
+ overwrite_bool = overwrite_str.strip().lower() in ("true", "y", "yes")
+
+ parse_wiktionary_translations(
+ target_languages=interactive_mode_config.selected_languages,
+ wiktionary_dump_path=Path(wiktionary_dump_path),
+ output_dir=Path(translations_output_dir),
+ overwrite=overwrite_bool,
+ )
+
+ break
+
+ elif choice == "run":
+ execute_request()
+ rprint(THANK_YOU_MESSAGE)
+ break
+
+ else:
+ rprint(THANK_YOU_MESSAGE)
+ break
+
+
+if __name__ == "__main__":
+ run_interactive_mode()
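Review note on the menu construction in run.py above: the branching that builds the questionary choices can be read more easily as a pure function. The sketch below mirrors the labels and values from the patch using plain (label, value) tuples instead of questionary.Choice objects. The name build_menu is hypothetical, and the final Exit-only fallback is an assumption added here for the case where nothing is selected and the operation is missing or unrecognised; the patch itself defines no choices in that case.

```python
from __future__ import annotations


def build_menu(
    operation: str | None,
    configured: bool,
    languages: list[str],
    data_types: list[str],
) -> list[tuple[str, str]]:
    """Return (label, value) pairs for the interactive mode menu."""
    exit_choice = ("Exit", "exit")

    if languages or data_types:
        menu = [("Configure get data request", "configure"), exit_choice]
        if configured:
            menu.insert(1, ("Run get data request with WDQS", "run"))
            menu.insert(2, ("Run get lexemes request with lexeme dumps", "run_all"))
        elif languages and data_types:
            menu.insert(1, ("Request for convert JSON", "convert_json"))
        else:
            menu.insert(1, ("Request for total lexeme", "total"))
        return menu

    per_operation = {
        "get": ("Configure get data request", "configure"),
        "total": ("Configure total lexemes request", "total"),
        "convert": ("Configure convert request", "convert"),
        "translations": ("Configure translations request", "translations"),
    }
    if operation in per_operation:
        return [per_operation[operation], exit_choice]

    # Assumption: fall back to an Exit-only menu rather than leaving it undefined.
    return [exit_choice]


if __name__ == "__main__":
    for label, value in build_menu("get", True, ["english"], ["nouns"]):
        print(f"{value}: {label}")
```

Keeping the menu logic free of prompting calls would also make it straightforward to unit test each branch.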
diff --git a/src/scribe_data/cli/list.py b/src/scribe_data/cli/list.py
deleted file mode 100644
index 5a17b18ae..000000000
--- a/src/scribe_data/cli/list.py
+++ /dev/null
@@ -1,205 +0,0 @@
-# SPDX-License-Identifier: GPL-3.0-or-later
-"""
-Functions for listing languages and data types for the Scribe-Data CLI.
-"""
-
-import os
-from pathlib import Path
-
-from scribe_data.utils import (
- WIKIDATA_QUERIES_ALL_DATA_DIR,
- format_sublanguage_name,
- get_language_iso,
- get_language_qid,
- language_map,
- language_metadata,
- list_all_languages,
-)
-
-
-def list_languages() -> None:
- """
- Generate a table of languages with their ISO-2 codes and Wikidata QIDs.
-
- Returns
- -------
- None
- A table of all languages with their ISO-2 codes and Wikidata QIDs is printed.
- """
- languages = list_all_languages(language_metadata)
-
- language_col_width = max(len(lang) for lang in languages) + 2
- iso_col_width = max(len(get_language_iso(lang)) for lang in languages) + 2
- qid_col_width = max(len(get_language_qid(lang)) for lang in languages) + 2
-
- table_line_length = language_col_width + iso_col_width + qid_col_width
-
- print(
- f"{'\nLanguage':<{language_col_width}} {'ISO':<{iso_col_width}} {'QID':<{qid_col_width}}"
- )
- print("=" * table_line_length)
-
- for lang in languages:
- print(
- f"{lang.title():<{language_col_width}} {get_language_iso(lang):<{iso_col_width}} {get_language_qid(lang):<{qid_col_width}}"
- )
-
- print()
-
-
-def list_data_types(language: str = "") -> None:
- """
- List all data types or those available for a given language.
-
- Parameters
- ----------
- language : str
- The language to potentially list data types for.
- """
- languages = list_all_languages(language_metadata)
- if language:
- language = format_sublanguage_name(language, language_metadata)
- language_data = language_map.get(language.lower())
- language_dir = WIKIDATA_QUERIES_ALL_DATA_DIR / language.lower()
-
- if not language_data:
- raise ValueError(f"Language '{language.capitalize()}' is not recognized.")
-
- data_types = {f.name for f in language_dir.iterdir() if f.is_dir()}
-
- # Add emoji keywords if available.
- iso = get_language_iso(language=language)
- path_to_cldr_annotations = (
- Path(__file__).parent.parent
- / "unicode"
- / "cldr-annotations-full"
- / "annotations"
- )
- if iso in os.listdir(path_to_cldr_annotations):
- data_types.add("emoji-keywords")
-
- if not data_types:
- raise ValueError(
- f"No data types available for language '{language.capitalize()}'."
- )
-
- table_header = f"Available data types: {language.capitalize()}"
-
- else:
- data_types = set()
- for lang in languages:
- language_dir = WIKIDATA_QUERIES_ALL_DATA_DIR / format_sublanguage_name(
- lang, language_metadata
- )
- if language_dir.is_dir():
- data_types.update(f.name for f in language_dir.iterdir() if f.is_dir())
-
- data_types.add("emoji-keywords")
-
- table_header = "Available data types: All languages"
-
- table_line_length = max(len(table_header), max(len(dt) for dt in data_types))
-
- print()
- print(table_header)
- print("=" * table_line_length)
-
- data_types = sorted(data_types)
- for dt in data_types:
- print(dt.replace("_", "-"))
-
- print()
-
-
-def list_all() -> None:
- """
- List all available languages and data types.
-
- Returns
- -------
- None
- All available languages and data types are listed.
- """
- list_languages()
- list_data_types()
-
-
-def list_languages_for_data_type(data_type: str) -> None:
- """
- List the available languages for a given data type.
-
- Parameters
- ----------
- data_type : str
- The data type to check for.
-
- Returns
- -------
- None
- A list of languages for data types is printed to the terminal.
- """
- list_languages()
- # corrected_data_type = correct_data_type(data_type=data_type)
- # all_languages = list_languages_with_metadata_for_data_type(language_metadata)
-
- # # Set column widths for consistent formatting.
- # language_col_width = max(len(lang["name"]) for lang in all_languages) + 2
- # iso_col_width = max(len(lang["iso"]) for lang in all_languages) + 2
- # qid_col_width = max(len(lang["qid"]) for lang in all_languages) + 2
-
- # table_line_length = language_col_width + iso_col_width + qid_col_width
-
- # # Print table header.
- # print(
- # f"{'\nLanguage':<{language_col_width}} {'ISO':<{iso_col_width}} {'QID':<{qid_col_width}}"
- # )
- # print("=" * table_line_length)
-
- # # Iterate through the list of languages and format each row.
- # for lang in all_languages:
- # print(
- # f"{lang['name'].capitalize():<{language_col_width}} {lang['iso']:<{iso_col_width}} {lang['qid']:<{qid_col_width}}"
- # )
-
- # print()
-
-
-def list_wrapper(
- language: str = "", data_type: str = "", all_bool: bool = False
-) -> None:
- """
- Conditionally provides the full functionality of the list command.
-
- Parameters
- ----------
- language : str
- The language to potentially list data types for.
-
- data_type : str
- The data type to check for.
-
- all_bool : bool
- Whether all languages and data types should be listed.
-
- Returns
- -------
- None
- The call to list functions based on the provided arguments.
- """
- if (not language and not data_type) or all_bool:
- list_all()
-
- elif language is True and not data_type:
- list_languages()
-
- elif not language and data_type is True:
- list_data_types()
-
- elif language is True and data_type is True:
- print("Please specify either a language or a data type.")
-
- elif language is True and data_type is not None:
- list_languages_for_data_type(data_type)
-
- elif language is not None and data_type is True:
- list_data_types(language)
diff --git a/src/scribe_data/cli/list/__init__.py b/src/scribe_data/cli/list/__init__.py
new file mode 100644
index 000000000..e69de29bb
diff --git a/src/scribe_data/cli/list/data_types.py b/src/scribe_data/cli/list/data_types.py
new file mode 100644
index 000000000..4229ae3d9
--- /dev/null
+++ b/src/scribe_data/cli/list/data_types.py
@@ -0,0 +1,82 @@
+# SPDX-License-Identifier: GPL-3.0-or-later
+"""
+Functions for listing data types for the Scribe-Data CLI.
+"""
+
+import os
+from pathlib import Path
+
+from scribe_data.utils import (
+ WIKIDATA_QUERIES_ALL_DATA_DIR,
+ format_sublanguage_name,
+ get_language_iso,
+ language_map,
+ language_metadata,
+ list_all_languages,
+)
+
+# MARK: Data Types
+
+
+def list_data_types(language: str = "") -> None:
+ """
+ List all data types or those available for a given language.
+
+ Parameters
+ ----------
+ language : str
+ The language to potentially list data types for.
+ """
+ languages = list_all_languages(language_metadata)
+ if language:
+ language = format_sublanguage_name(language, language_metadata)
+ language_data = language_map.get(language.lower())
+ language_dir = WIKIDATA_QUERIES_ALL_DATA_DIR / language.lower()
+
+ if not language_data:
+ raise ValueError(f"Language '{language.capitalize()}' is not recognized.")
+
+ data_types = {f.name for f in language_dir.iterdir() if f.is_dir()}
+
+ # Add emoji keywords if available.
+ iso = get_language_iso(language=language)
+ path_to_cldr_annotations = (
+ Path(__file__).parent.parent.parent
+ / "unicode"
+ / "cldr-annotations-full"
+ / "annotations"
+ )
+ if iso in os.listdir(path_to_cldr_annotations):
+ data_types.add("emoji-keywords")
+
+ if not data_types:
+ raise ValueError(
+ f"No data types available for language '{language.capitalize()}'."
+ )
+
+ table_header = f"Available data types: {language.capitalize()}"
+
+ else:
+ data_types = set()
+ for lang in languages:
+ language_dir = WIKIDATA_QUERIES_ALL_DATA_DIR / format_sublanguage_name(
+ lang, language_metadata
+ )
+ if language_dir.is_dir():
+ data_types.update(f.name for f in language_dir.iterdir() if f.is_dir())
+
+ data_types.add("emoji-keywords")
+
+ table_header = "Available data types: All languages"
+
+ table_line_length = max(len(table_header), max(len(dt) for dt in data_types))
+
+ print()
+ print(table_header)
+ print("=" * table_line_length)
+
+ data_types = sorted(data_types)
+ for dt in data_types:
+ print(dt.replace("_", "-"))
+
+ print()
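Review note on list_data_types in data_types.py above: when no language is given, the available data types are the union of the sub-directory names found under each language directory. The sketch below reproduces that idea against a temporary directory tree. The function name collect_data_types and the directory layout are hypothetical stand-ins for WIKIDATA_QUERIES_ALL_DATA_DIR, and the emoji-keywords entry added by the patch is left out.

```python
import tempfile
from pathlib import Path


def collect_data_types(queries_dir: Path, languages: list[str]) -> list[str]:
    """Return display names of all data type directories across the languages."""
    data_types: set[str] = set()
    for language in languages:
        language_dir = queries_dir / language
        # Languages without a query directory are skipped rather than raising.
        if language_dir.is_dir():
            data_types.update(
                child.name for child in language_dir.iterdir() if child.is_dir()
            )

    # Sort first, then swap underscores for hyphens as the printed output does.
    return [data_type.replace("_", "-") for data_type in sorted(data_types)]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "english" / "nouns").mkdir(parents=True)
        (root / "english" / "proper_nouns").mkdir(parents=True)
        (root / "german" / "verbs").mkdir(parents=True)
        (root / "german" / "notes.txt").touch()  # plain files are ignored

        print(collect_data_types(root, ["english", "german", "klingon"]))
```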
diff --git a/src/scribe_data/cli/list/languages.py b/src/scribe_data/cli/list/languages.py
new file mode 100644
index 000000000..1dbc7d770
--- /dev/null
+++ b/src/scribe_data/cli/list/languages.py
@@ -0,0 +1,60 @@
+# SPDX-License-Identifier: GPL-3.0-or-later
+"""
+Functions for listing languages for the Scribe-Data CLI.
+"""
+
+from scribe_data.utils import (
+ get_language_iso,
+ get_language_qid,
+ language_metadata,
+ list_all_languages,
+)
+
+# MARK: Languages
+
+
+def list_languages() -> None:
+ """
+ Generate a table of languages with their ISO-2 codes and Wikidata QIDs.
+
+ Returns
+ -------
+ None
+ A table of all languages with their ISO-2 codes and Wikidata QIDs is printed.
+ """
+ languages = list_all_languages(language_metadata)
+
+ language_col_width = max(len(lang) for lang in languages) + 2
+ iso_col_width = max(len(get_language_iso(lang)) for lang in languages) + 2
+ qid_col_width = max(len(get_language_qid(lang)) for lang in languages) + 2
+
+ table_line_length = language_col_width + iso_col_width + qid_col_width
+
+ print(
+ "\n" + f"{'Language':<{language_col_width}} {'ISO':<{iso_col_width}} {'QID':<{qid_col_width}}"
+ )
+ print("=" * table_line_length)
+
+ for lang in languages:
+ print(
+ f"{lang.title():<{language_col_width}} {get_language_iso(lang):<{iso_col_width}} {get_language_qid(lang):<{qid_col_width}}"
+ )
+
+ print()
+
+
+def list_languages_for_data_type(data_type: str) -> None:
+ """
+ List the available languages for a given data type.
+
+ Parameters
+ ----------
+ data_type : str
+ The data type to check for.
+
+ Returns
+ -------
+ None
+ All languages are printed as filtering by data type is not yet implemented.
+ """
+ list_languages()
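Review note on list_languages in languages.py above: the table is built by padding each cell to a per-column width with f-string alignment. The sketch below shows the same technique as a function that returns the lines instead of printing them. The name format_language_table is hypothetical, and two details differ from the patch on purpose: column widths also account for the header text, and the rule is sized to the full row including the separating spaces.

```python
def format_language_table(rows: list[tuple[str, str, str]]) -> list[str]:
    """Format (language, iso, qid) rows as left-aligned, fixed-width lines."""
    headers = ("Language", "ISO", "QID")
    # Each column is as wide as its longest cell (or header) plus two spaces.
    widths = [
        max([len(headers[i]), *(len(row[i]) for row in rows)]) + 2 for i in range(3)
    ]

    def format_row(cells: tuple[str, str, str]) -> str:
        return " ".join(f"{cell:<{width}}" for cell, width in zip(cells, widths))

    header_line = format_row(headers)
    body = [format_row((name.title(), iso, qid)) for name, iso, qid in rows]

    return [header_line, "=" * len(header_line), *body]


if __name__ == "__main__":
    table = format_language_table(
        [("english", "en", "Q1860"), ("german", "de", "Q188")]
    )
    print("\n".join(table))
```

Returning the lines keeps the formatting testable without capturing standard output.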
diff --git a/src/scribe_data/cli/list/wrapper.py b/src/scribe_data/cli/list/wrapper.py
new file mode 100644
index 000000000..a056ff964
--- /dev/null
+++ b/src/scribe_data/cli/list/wrapper.py
@@ -0,0 +1,67 @@
+# SPDX-License-Identifier: GPL-3.0-or-later
+"""
+Wrapper function for listing languages and data types for the Scribe-Data CLI.
+"""
+
+from scribe_data.cli.list.data_types import list_data_types
+from scribe_data.cli.list.languages import list_languages, list_languages_for_data_type
+
+# MARK: All
+
+
+def list_all() -> None:
+ """
+ List all available languages and data types.
+
+ Returns
+ -------
+ None
+ All available languages and data types are listed.
+ """
+ list_languages()
+ list_data_types()
+
+
+# MARK: Wrapper
+
+
+def list_wrapper(
+ language: str = "", data_type: str = "", all_bool: bool = False
+) -> None:
+ """
+ Conditionally provides the full functionality of the list command.
+
+ Parameters
+ ----------
+ language : str
+ The language to potentially list data types for.
+
+ data_type : str
+ The data type to check for.
+
+ all_bool : bool
+ Whether all languages and data types should be listed.
+
+ Returns
+ -------
+ None
+ The call to list functions based on the provided arguments.
+ """
+ if (not language and not data_type) or all_bool:
+ list_all()
+
+ elif language is True and not data_type:
+ list_languages()
+
+ elif not language and data_type is True:
+ list_data_types()
+
+ elif language is True and data_type is True:
+ print("Please specify either a language or a data type.")
+
+ # Note: Included in case listing languages by data type is implemented.
+ elif language is True and data_type is not None:
+ list_languages_for_data_type(data_type)
+
+ elif language is not None and data_type is True:
+ list_data_types(language)
diff --git a/src/scribe_data/cli/main.py b/src/scribe_data/cli/main.py
index 24ae81240..14de01e95 100644
--- a/src/scribe_data/cli/main.py
+++ b/src/scribe_data/cli/main.py
@@ -14,15 +14,17 @@
from scribe_data.cli.contracts.check import check_contracts
from scribe_data.cli.contracts.export import export_contracts
from scribe_data.cli.contracts.filter import export_data_filtered_by_contracts
-from scribe_data.cli.convert import convert_wrapper
-from scribe_data.cli.download import (
- download_wiktionary_dumps,
+from scribe_data.cli.convert.wrapper import convert_wrapper
+from scribe_data.cli.download.wikidata_lexeme_dump import (
wd_lexeme_dump_download_wrapper,
)
+from scribe_data.cli.download.wiktionary_dump import (
+ download_wiktionary_dumps,
+)
from scribe_data.cli.get import get_data
-from scribe_data.cli.interactive import start_interactive_mode
-from scribe_data.cli.list import list_wrapper
-from scribe_data.cli.total import total_wrapper
+from scribe_data.cli.interactive.run import run_interactive_mode
+from scribe_data.cli.list.wrapper import list_wrapper
+from scribe_data.cli.total.wrapper import total_wrapper
from scribe_data.cli.upgrade import upgrade_cli
from scribe_data.cli.version import get_version_message
from scribe_data.utils import (
@@ -495,7 +497,7 @@ def main() -> None:
elif args.command in ["get", "g"]:
if args.interactive:
- start_interactive_mode(operation="get")
+ run_interactive_mode(operation="get")
return
else:
@@ -550,7 +552,7 @@ def main() -> None:
elif args.command in ["total", "t"]:
if args.interactive:
- start_interactive_mode(operation="total")
+ run_interactive_mode(operation="total")
else:
total_wrapper(
@@ -566,7 +568,7 @@ def main() -> None:
elif args.command in ["convert", "c"]:
if args.interactive:
- start_interactive_mode(operation="convert")
+ run_interactive_mode(operation="convert")
return
# Handle language(s) - could be string or list.
@@ -649,16 +651,16 @@ def main() -> None:
download_wiktionary_dumps(language_isos=[lang])
elif action == "Check for totals":
- start_interactive_mode(operation="total")
+ run_interactive_mode(operation="total")
elif action == "Get data":
- start_interactive_mode(operation="get")
+ run_interactive_mode(operation="get")
elif action == "Get translations":
- start_interactive_mode(operation="translations")
+ run_interactive_mode(operation="translations")
elif action == "Convert JSON":
- start_interactive_mode(operation="convert")
+ run_interactive_mode(operation="convert")
else:
print("Skipping action")
diff --git a/src/scribe_data/cli/total.py b/src/scribe_data/cli/total.py
deleted file mode 100644
index b78ad3e25..000000000
--- a/src/scribe_data/cli/total.py
+++ /dev/null
@@ -1,450 +0,0 @@
-# SPDX-License-Identifier: GPL-3.0-or-later
-"""
-Functions to check the total language data available on Wikidata.
-"""
-
-from http.client import IncompleteRead
-from pathlib import Path
-from typing import Any, cast
-from urllib.error import HTTPError
-
-from SPARQLWrapper import JSON
-
-from scribe_data.utils import (
- DEFAULT_WIKIDATA_DUMP_EXPORT_DIR,
- WIKIDATA_QUERIES_ALL_DATA_DIR,
- check_qid_is_language,
- data_type_metadata,
- format_sublanguage_name,
- language_metadata,
- language_to_qid,
- list_all_languages,
-)
-from scribe_data.wikidata.wikidata_utils import parse_wd_lexeme_dump, sparql
-
-
-def get_qid_by_input(input_str: str | None) -> str | None:
- """
- Retrieve the QID for a given language or data type input string.
-
- Parameters
- ----------
- input_str : str, optional
- The input string representing a language or data type.
-
- Returns
- -------
- str | None
- The QID corresponding to the input string, or- None if not found.
- """
- if input_str:
- if input_str in language_to_qid:
- return language_to_qid[input_str]
-
- elif input_str in data_type_metadata:
- return data_type_metadata[input_str]
-
- return None
-
-
-def get_datatype_list(language: str) -> list | dict:
- """
- Get the data types for a given language based on the project directory structure.
-
- Parameters
- ----------
- language : str
- The language to return data types for.
-
- Returns
- -------
- list | dict
- A list of the corresponding data types.
- """
- language_key = language.strip().lower() # normalize input
- languages = list_all_languages(language_metadata)
-
- # Adjust language_key for sub-languages using the format_sublanguage_name function.
- formatted_language = format_sublanguage_name(language_key, language_metadata)
- language_key = formatted_language.split(" ")[
- 0
- ].lower() # use the main language part if formatted
-
- if language_key in languages:
- if "sub_languages" in language_metadata[language_key]:
- sub_languages = language_metadata[language_key]["sub_languages"]
- data_types = []
-
- for sub_lang_key in sub_languages:
- sub_lang_dir = (
- WIKIDATA_QUERIES_ALL_DATA_DIR / sub_languages[sub_lang_key]["iso"]
- )
- if sub_lang_dir.exists():
- data_types.extend(
- [f.name for f in sub_lang_dir.iterdir() if f.is_dir()]
- )
-
- if not data_types:
- raise ValueError(
- f"No data types available for sub-languages of '{formatted_language.capitalize()}'."
- )
-
- return sorted(set(data_types)) # remove duplicates and sort
-
- else:
- language_dir = WIKIDATA_QUERIES_ALL_DATA_DIR / language_key
- if not language_dir.exists():
- raise ValueError(f"Directory '{language_dir}' does not exist.")
-
- data_types = [f.name for f in language_dir.iterdir() if f.is_dir()]
-
- if not data_types:
- raise ValueError(
- f"No data types available for language '{formatted_language.capitalize()}'."
- )
-
- return sorted(data_types)
-
- else: # return all data types
- return data_type_metadata
-
-
-# MARK: Print
-
-
-def print_total_lexemes(language: str | None = None) -> None:
- """
- Print the total number of available entities for all data types.
-
- Parameters
- ----------
- language : str, optional
- The language to display data type entity counts for.
-
- Returns
- -------
- str
- A formatted string indicating the language, data type, and total number of lexemes for all the languages, if found.
- """
- if language is None:
- print("Returning total counts for all languages and data types...\n")
-
- elif (
- isinstance(language, str)
- and language.startswith("Q")
- and language[1:].isdigit()
- ):
- print(
- f"Wikidata QID {language.capitalize()} passed. Checking validity and then all data types."
- )
- language = check_qid_is_language(qid=language)
-
- else:
- print(f"Returning total counts for {language.capitalize()} data types...\n")
-
- def print_total_header(language: str, dt: str, total_lexemes: str) -> None:
- """
- Print the header of the total command output.
-
- Parameters
- ----------
- language : str
- The language for which to count lexemes.
-
- dt : str
- The data type (e.g., "nouns", "verbs") for which to count lexemes.
-
- total_lexemes : str
- The total number of lexemes derived formatted as a string.
-
- Returns
- -------
- None
- A message is printed to the terminal about the total number of lexemes.
- """
- language_display = (
- "All Languages" if language is None else language.capitalize()
- )
- print(f"{'Language':<20} {'Data Type':<25} {'Total Wikidata Lexemes':<25}")
- print("=" * 70)
- print(f"{language_display:<20} {dt.replace('_', '-'): <25} {total_lexemes:<25}")
-
- if language is None: # all languages
- languages = list_all_languages(language_metadata)
-
- for lang in languages:
- data_types = get_datatype_list(lang)
-
- first_row = True
- for dt in data_types:
- total_lexemes = get_total_lexemes(
- language=lang, data_type=dt, do_print=False
- )
- total_lexemes = f"{total_lexemes:,}"
- if first_row:
- print_total_header(lang, dt, total_lexemes)
- first_row = False
-
- else:
- print(f"{'':<20} {dt.replace('_', ' '): <25} {total_lexemes:<25}")
-
- print()
-
- else: # individual language
- first_row = True
- if language.startswith("Q") and language[1:].isdigit():
- data_types = data_type_metadata
- for t in ["emoji_keywords"]:
- if t in data_types:
- del data_types[t]
-
- else:
- data_types = get_datatype_list(language)
-
- for dt in data_types:
- total_lexemes = get_total_lexemes(
- language=language, data_type=dt, do_print=False
- )
- total_lexemes = f"{total_lexemes:,}"
- if first_row:
- print_total_header(language, dt, total_lexemes)
- first_row = False
-
- else:
- print(f"{'':<20} {dt.replace('_', ' '): <25} {total_lexemes:<25}")
-
- print()
-
-
-# MARK: Get Total
-
-
-def get_total_lexemes(
- language: str, data_type: str, do_print: bool = True
-) -> int | None:
- """
- Get the total number of lexemes for a given language and data type from Wikidata.
-
- Parameters
- ----------
- language : str
- The language for which to count lexemes.
-
- data_type : str
- The data type (e.g., "nouns", "verbs") for which to count lexemes.
-
- do_print : bool
- Print the total lexemes for the given language and data type.
-
- Returns
- -------
- str
- A formatted string indicating the language, data type and total number of lexemes, if found.
- """
- if (
- language is not None
- and (language.startswith("Q") or language.startswith("q"))
- and language[1:].isdigit()
- ):
- language_qid = language.capitalize()
-
- else:
- language_qid = get_qid_by_input(language)
-
- if (
- data_type is not None
- and (data_type.startswith("Q") or data_type.startswith("q"))
- and data_type[1:].isdigit()
- ):
- data_type_qid = data_type.capitalize()
-
- else:
- data_type_qid = get_qid_by_input(data_type)
-
- # MARK: Construct Query
-
- query_template = """
- SELECT
- (COUNT(DISTINCT ?lexeme) as ?total)
-
- WHERE {{
- ?lexeme a ontolex:LexicalEntry .
- {language_filter}
- {data_type_filter}
- }}
- """
-
- language_filter = (
- f"?lexeme dct:language wd:{language_qid} ."
- if language_qid
- else "?lexeme dct:language ?language ."
- )
-
- data_type_filter = (
- f"?lexeme wikibase:lexicalCategory wd:{data_type_qid} ."
- if data_type_qid
- else "?lexeme wikibase:lexicalCategory ?category ."
- )
-
- query = query_template.format(
- language_filter=language_filter, data_type_filter=data_type_filter
- )
-
- # MARK: Query Results
-
- sparql.setQuery(query)
- sparql.setReturnFormat(JSON)
- try_count = 0
- max_retries = 2
- results = None
-
- while try_count <= max_retries and results is None:
- try:
- results = sparql.query().convert()
-
- except HTTPError as http_err:
- print(f"HTTPError occurred: {http_err}")
-
- except IncompleteRead as read_err:
- print(f"Incomplete read error occurred: {read_err}")
-
- try_count += 1
-
- if results is None:
- if try_count <= max_retries:
- print("The query will be retried...")
-
- else:
- print("Query failed after retries.")
- return None
-
- # Check if the query returned any results.
- if results is None:
- print("Total number of lexemes: Not found")
- return None
-
- res_dict = cast(dict[str, Any], results)
- if (
- "results" in res_dict
- and "bindings" in res_dict["results"]
- and len(res_dict["results"]["bindings"]) > 0
- ):
- total_lexemes = int(
- res_dict.get("results", {}).get("bindings", [])[0]["total"]["value"]
- )
-
- output_template = ""
- if language:
- output_template += f"\nLanguage: {language.capitalize()}\n"
-
- if data_type:
- output_template += f"Data type: {data_type}\n"
-
- output_template += f"Total number of lexemes: {total_lexemes:,}\n"
- if do_print:
- print(output_template)
-
- return total_lexemes
-
- print("Total number of lexemes: Not found")
- return None
-
-
-# MARK: Wrapper
-
-
-def total_wrapper(
- languages: list[str] | None = None,
- data_types: list[str] | None = None,
- all_bool: bool = False,
- wikidata_dump: Path | bool | None = None,
-) -> None:
- """
- Conditionally provides the full functionality of the total command.
-
- Parameters
- ----------
- languages : List[str]
- The language(s) to potentially total data types for.
-
- data_types : List[str]
- The data type(s) to check for.
-
- all_bool : bool
- Whether all languages and data types should be listed.
-
- wikidata_dump : Optional[Union[Path, bool]]
- The local Wikidata lexeme dump path that can be used to process data.
- If True, indicates the flag was used without a path.
-
- Notes
- -----
- Now accepts lists for language and data type to output a table of total lexemes.
- """
- # Note: Handle --all flag via 'or ["all"]' assignments.
- # Flag without a wikidata lexeme dump path.
- if wikidata_dump is True:
- parse_wd_lexeme_dump(
- languages=languages or ["all"],
- data_types=data_types or ["all"],
- wikidata_dump_type=["total"],
- wikidata_dump_path=DEFAULT_WIKIDATA_DUMP_EXPORT_DIR,
- )
- return
-
- # If user provided a wikidata lexeme dump path.
- if isinstance(wikidata_dump, Path):
- parse_wd_lexeme_dump(
- languages=languages or ["all"],
- data_types=data_types or ["all"],
- wikidata_dump_type=["total"],
- wikidata_dump_path=wikidata_dump,
- )
- return
-
- language = languages[0] if languages else None # in case only one is passed
- data_type = data_types[0] if data_types else None # in case only one is passed
-
- if (not languages and not data_types) and all_bool:
- print_total_lexemes()
-
- elif languages and data_types and (len(languages) > 1 or len(data_types) > 1):
- print(f"{'Language':<20} {'Data Type':<25} {'Total Wikidata Lexemes':<25}")
- print("=" * 70)
-
- for lang in languages:
- # Flag to check if it's the first data type for the language.
- first_row = True
-
- for dt in data_types:
- total_lexemes = get_total_lexemes(
- language=lang, data_type=dt, do_print=False
- )
- total_lexemes = (
- f"{int(total_lexemes):,}" if total_lexemes is not None else "N/A"
- )
- if first_row:
- print(f"{lang:<20} {dt:<25} {total_lexemes:<25}")
- first_row = False
-
- else:
- print(
- f"{'':<20} {dt:<25} {total_lexemes:<25}"
- ) # print empty space for language
-
- print()
-
- elif language is not None and data_type is None:
- print_total_lexemes(language=language)
-
- elif language is not None and data_type is not None and not all_bool:
- get_total_lexemes(language=language, data_type=data_type)
-
- elif language is not None and data_type is not None:
- print(
- f"You have already specified language {language.capitalize()} and data type {data_type} - no need to specify --all."
- )
- get_total_lexemes(language=language, data_type=data_type)
-
- else:
- raise ValueError("Invalid input or missing information")
diff --git a/src/scribe_data/cli/total/__init__.py b/src/scribe_data/cli/total/__init__.py
new file mode 100644
index 000000000..e69de29bb
diff --git a/src/scribe_data/cli/total/print_values.py b/src/scribe_data/cli/total/print_values.py
new file mode 100644
index 000000000..cc328f363
--- /dev/null
+++ b/src/scribe_data/cli/total/print_values.py
@@ -0,0 +1,185 @@
+# SPDX-License-Identifier: GPL-3.0-or-later
+"""
+Functions to display the total language data available on Wikidata.
+"""
+
+from scribe_data.cli.total.query import query_total_lexemes
+from scribe_data.utils import (
+ WIKIDATA_QUERIES_ALL_DATA_DIR,
+ check_qid_is_language,
+ data_type_metadata,
+ format_sublanguage_name,
+ language_metadata,
+ list_all_languages,
+)
+
+# MARK: Data Types
+
+
+def get_datatype_list(language: str) -> list | dict:
+ """
+ Get the data types for a given language based on the project directory structure.
+
+ Parameters
+ ----------
+ language : str
+ The language to return data types for.
+
+ Returns
+ -------
+ list | dict
+ A sorted list of data type names, or the full data type metadata if the language is not recognized.
+ """
+ language_key = language.strip().lower() # normalize input
+ languages = list_all_languages(language_metadata)
+
+ # Adjust language_key for sub-languages using the format_sublanguage_name function.
+ formatted_language = format_sublanguage_name(language_key, language_metadata)
+ language_key = formatted_language.split(" ")[
+ 0
+ ].lower() # use the main language part if formatted
+
+ if language_key in languages:
+ if "sub_languages" in language_metadata[language_key]:
+ sub_languages = language_metadata[language_key]["sub_languages"]
+ data_types = []
+
+ for sub_lang_key in sub_languages:
+ sub_lang_dir = (
+ WIKIDATA_QUERIES_ALL_DATA_DIR / sub_languages[sub_lang_key]["iso"]
+ )
+ if sub_lang_dir.exists():
+ data_types.extend(
+ [f.name for f in sub_lang_dir.iterdir() if f.is_dir()]
+ )
+
+ if not data_types:
+ raise ValueError(
+ f"No data types available for sub-languages of '{formatted_language.capitalize()}'."
+ )
+
+ return sorted(set(data_types)) # remove duplicates and sort
+
+ else:
+ language_dir = WIKIDATA_QUERIES_ALL_DATA_DIR / language_key
+ if not language_dir.exists():
+ raise ValueError(f"Directory '{language_dir}' does not exist.")
+
+ data_types = [f.name for f in language_dir.iterdir() if f.is_dir()]
+
+ if not data_types:
+ raise ValueError(
+ f"No data types available for language '{formatted_language.capitalize()}'."
+ )
+
+ return sorted(data_types)
+
+ else:
+ return data_type_metadata
+
+
+# MARK: Print Values
+
+
+def print_total_lexemes(language: str | None = None) -> None:
+ """
+ Print the total number of available entities for all data types.
+
+ Parameters
+ ----------
+ language : str, optional
+ The language to display data type entity counts for.
+
+ Returns
+ -------
+ None
+ The language, data type, and total number of lexemes are printed to the terminal as a table.
+ """
+ if language is None:
+ print("Returning total counts for all languages and data types...\n")
+
+ elif (
+ isinstance(language, str)
+ and language.startswith("Q")
+ and language[1:].isdigit()
+ ):
+ print(
+ f"Wikidata QID {language.capitalize()} passed. Checking validity and then all data types."
+ )
+ language = check_qid_is_language(qid=language)
+
+ else:
+ print(f"Returning total counts for {language.capitalize()} data types...\n")
+
+ def print_total_header(language: str, dt: str, total_lexemes: str) -> None:
+ """
+ Print the header of the total command output.
+
+ Parameters
+ ----------
+ language : str
+ The language for which to count lexemes.
+
+ dt : str
+ The data type (e.g., "nouns", "verbs") for which to count lexemes.
+
+ total_lexemes : str
+ The total number of lexemes derived formatted as a string.
+
+ Returns
+ -------
+ None
+ A message is printed to the terminal about the total number of lexemes.
+ """
+ language_display = (
+ "All Languages" if language is None else language.capitalize()
+ )
+ print(f"{'Language':<20} {'Data Type':<25} {'Total Wikidata Lexemes':<25}")
+ print("=" * 70)
+ print(f"{language_display:<20} {dt.replace('_', '-'): <25} {total_lexemes:<25}")
+
+ if language is None: # all languages
+ languages = list_all_languages(language_metadata)
+
+ for lang in languages:
+ data_types = get_datatype_list(lang)
+
+ first_row = True
+ for dt in data_types:
+ total_lexemes = query_total_lexemes(
+ language=lang, data_type=dt, do_print=False
+ )
+ total_lexemes = f"{total_lexemes:,}" if total_lexemes is not None else "N/A"
+ if first_row:
+ print_total_header(lang, dt, total_lexemes)
+ first_row = False
+
+ else:
+ print(f"{'':<20} {dt.replace('_', ' '): <25} {total_lexemes:<25}")
+
+ print()
+
+ else: # individual language
+ first_row = True
+ if language.startswith("Q") and language[1:].isdigit():
+ data_types = dict(data_type_metadata)
+ for t in ["emoji_keywords"]:
+ if t in data_types:
+ del data_types[t]
+
+ else:
+ data_types = get_datatype_list(language)
+
+ for dt in data_types:
+ total_lexemes = query_total_lexemes(
+ language=language, data_type=dt, do_print=False
+ )
+ total_lexemes = f"{total_lexemes:,}" if total_lexemes is not None else "N/A"
+ if first_row:
+ print_total_header(language, dt, total_lexemes)
+ first_row = False
+
+ else:
+ print(f"{'':<20} {dt.replace('_', ' '): <25} {total_lexemes:<25}")
+
+ print()
diff --git a/src/scribe_data/cli/total/query.py b/src/scribe_data/cli/total/query.py
new file mode 100644
index 000000000..5b1dd5e19
--- /dev/null
+++ b/src/scribe_data/cli/total/query.py
@@ -0,0 +1,173 @@
+# SPDX-License-Identifier: GPL-3.0-or-later
+"""
+Functions to check the total language data available on Wikidata.
+"""
+
+from http.client import IncompleteRead
+from typing import Any, cast
+from urllib.error import HTTPError
+
+from SPARQLWrapper import JSON
+
+from scribe_data.utils import data_type_metadata, language_to_qid
+from scribe_data.wikidata.wikidata_utils import sparql
+
+# MARK: QIDs
+
+
+def get_qid_by_input(input_str: str | None) -> str | None:
+ """
+ Retrieve the QID for a given language or data type input string.
+
+ Parameters
+ ----------
+ input_str : str, optional
+ The input string representing a language or data type.
+
+ Returns
+ -------
+ str | None
+ The QID corresponding to the input string, or None if not found.
+ """
+ if input_str:
+ if input_str in language_to_qid:
+ return language_to_qid[input_str]
+
+ elif input_str in data_type_metadata:
+ return data_type_metadata[input_str]
+
+ return None
+
+
+# MARK: Query Total
+
+
+def query_total_lexemes(
+ language: str, data_type: str, do_print: bool = True
+) -> int | None:
+ """
+ Get the total number of lexemes for a given language and data type from Wikidata.
+
+ Parameters
+ ----------
+ language : str
+ The language for which to count lexemes.
+
+ data_type : str
+ The data type (e.g., "nouns", "verbs") for which to count lexemes.
+
+ do_print : bool
+ Print the total lexemes for the given language and data type.
+
+ Returns
+ -------
+ int | None
+ The total number of lexemes, or None if the query fails or returns no results.
+ """
+ if (
+ language is not None
+ and (language.startswith("Q") or language.startswith("q"))
+ and language[1:].isdigit()
+ ):
+ language_qid = language.capitalize()
+
+ else:
+ language_qid = get_qid_by_input(language)
+
+ if (
+ data_type is not None
+ and (data_type.startswith("Q") or data_type.startswith("q"))
+ and data_type[1:].isdigit()
+ ):
+ data_type_qid = data_type.capitalize()
+
+ else:
+ data_type_qid = get_qid_by_input(data_type)
+
+ # MARK: Construct Query
+
+ query_template = """
+ SELECT
+ (COUNT(DISTINCT ?lexeme) as ?total)
+
+ WHERE {{
+ ?lexeme a ontolex:LexicalEntry .
+ {language_filter}
+ {data_type_filter}
+ }}
+ """
+
+ language_filter = (
+ f"?lexeme dct:language wd:{language_qid} ."
+ if language_qid
+ else "?lexeme dct:language ?language ."
+ )
+
+ data_type_filter = (
+ f"?lexeme wikibase:lexicalCategory wd:{data_type_qid} ."
+ if data_type_qid
+ else "?lexeme wikibase:lexicalCategory ?category ."
+ )
+
+ query = query_template.format(
+ language_filter=language_filter, data_type_filter=data_type_filter
+ )
+
+ # MARK: Query Results
+
+ sparql.setQuery(query)
+ sparql.setReturnFormat(JSON)
+ try_count = 0
+ max_retries = 2
+ results = None
+
+ while try_count <= max_retries and results is None:
+ try:
+ results = sparql.query().convert()
+
+ except HTTPError as http_err:
+ print(f"HTTPError occurred: {http_err}")
+
+ except IncompleteRead as read_err:
+ print(f"Incomplete read error occurred: {read_err}")
+
+ try_count += 1
+
+ if results is None:
+ if try_count <= max_retries:
+ print("The query will be retried...")
+
+ else:
+ print("Query failed after retries.")
+ return None
+
+ # Check if the query returned any results.
+ if results is None:
+ print("Total number of lexemes: Not found")
+ return None
+
+ res_dict = cast(dict[str, Any], results)
+ if (
+ "results" in res_dict
+ and "bindings" in res_dict["results"]
+ and len(res_dict["results"]["bindings"]) > 0
+ ):
+ total_lexemes = int(
+ res_dict.get("results", {}).get("bindings", [])[0]["total"]["value"]
+ )
+
+ output_template = ""
+ if language:
+ output_template += f"\nLanguage: {language.capitalize()}\n"
+
+ if data_type:
+ output_template += f"Data type: {data_type}\n"
+
+ output_template += f"Total number of lexemes: {total_lexemes:,}\n"
+ if do_print:
+ print(output_template)
+
+ return total_lexemes
+
+ print("Total number of lexemes: Not found")
+ return None
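Review note: the QID detection and filter construction in `query_total_lexemes` can be exercised without a network call if they are pulled into pure helpers. The following is a minimal sketch under that assumption; `normalize_qid`, `build_total_query`, and `QUERY_TEMPLATE` are hypothetical names that do not exist in this PR, and the query text simply mirrors the template in `query.py`.

```python
from __future__ import annotations

# Hypothetical module-level copy of the template used in query.py.
QUERY_TEMPLATE = """
SELECT
    (COUNT(DISTINCT ?lexeme) as ?total)

WHERE {{
    ?lexeme a ontolex:LexicalEntry .
    {language_filter}
    {data_type_filter}
}}
"""


def normalize_qid(value: str | None) -> str | None:
    """Return 'Q123' for inputs like 'q123' or 'Q123', else None."""
    if value and value[:1] in ("Q", "q") and value[1:].isdigit():
        return value.capitalize()
    return None


def build_total_query(language_qid: str | None, data_type_qid: str | None) -> str:
    """Build the counting query, falling back to unbound variables."""
    language_filter = (
        f"?lexeme dct:language wd:{language_qid} ."
        if language_qid
        else "?lexeme dct:language ?language ."
    )
    data_type_filter = (
        f"?lexeme wikibase:lexicalCategory wd:{data_type_qid} ."
        if data_type_qid
        else "?lexeme wikibase:lexicalCategory ?category ."
    )
    return QUERY_TEMPLATE.format(
        language_filter=language_filter, data_type_filter=data_type_filter
    )
```

With helpers like these, unit tests could assert on the generated SPARQL directly and mock only the `sparql.query()` call.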
diff --git a/src/scribe_data/cli/total/wrapper.py b/src/scribe_data/cli/total/wrapper.py
new file mode 100644
index 000000000..9bd97e57c
--- /dev/null
+++ b/src/scribe_data/cli/total/wrapper.py
@@ -0,0 +1,110 @@
+# SPDX-License-Identifier: GPL-3.0-or-later
+"""
+Wrapper function to check and display the total language data available on Wikidata.
+"""
+
+from pathlib import Path
+
+from scribe_data.cli.total.print_values import print_total_lexemes
+from scribe_data.cli.total.query import query_total_lexemes
+from scribe_data.utils import DEFAULT_WIKIDATA_DUMP_EXPORT_DIR
+from scribe_data.wikidata.wikidata_utils import parse_wd_lexeme_dump
+
+# MARK: Wrapper
+
+
+def total_wrapper(
+ languages: list[str] | None = None,
+ data_types: list[str] | None = None,
+ all_bool: bool = False,
+ wikidata_dump: Path | bool | None = None,
+) -> None:
+ """
+ Conditionally provides the full functionality of the total command.
+
+ Parameters
+ ----------
+ languages : List[str]
+ The language(s) to potentially total data types for.
+
+ data_types : List[str]
+ The data type(s) to check for.
+
+ all_bool : bool
+ Whether all languages and data types should be listed.
+
+ wikidata_dump : Optional[Union[Path, bool]]
+ The local Wikidata lexeme dump path that can be used to process data.
+ If True, indicates the flag was used without a path.
+
+ Notes
+ -----
+ Now accepts lists for language and data type to output a table of total lexemes.
+ """
+ # Note: Handle --all flag via 'or ["all"]' assignments.
+ # Flag without a wikidata lexeme dump path.
+ if wikidata_dump is True:
+ parse_wd_lexeme_dump(
+ languages=languages or ["all"],
+ data_types=data_types or ["all"],
+ wikidata_dump_type=["total"],
+ wikidata_dump_path=DEFAULT_WIKIDATA_DUMP_EXPORT_DIR,
+ )
+ return
+
+ # If user provided a wikidata lexeme dump path.
+ if isinstance(wikidata_dump, Path):
+ parse_wd_lexeme_dump(
+ languages=languages or ["all"],
+ data_types=data_types or ["all"],
+ wikidata_dump_type=["total"],
+ wikidata_dump_path=wikidata_dump,
+ )
+ return
+
+ language = languages[0] if languages else None # in case only one is passed
+ data_type = data_types[0] if data_types else None # in case only one is passed
+
+ if (not languages and not data_types) and all_bool:
+ print_total_lexemes()
+
+ elif languages and data_types and (len(languages) > 1 or len(data_types) > 1):
+ print(f"{'Language':<20} {'Data Type':<25} {'Total Wikidata Lexemes':<25}")
+ print("=" * 70)
+
+ for lang in languages:
+ # Flag to check if it's the first data type for the language.
+ first_row = True
+
+ for dt in data_types:
+ total_lexemes = query_total_lexemes(
+ language=lang, data_type=dt, do_print=False
+ )
+ total_lexemes = (
+ f"{int(total_lexemes):,}" if total_lexemes is not None else "N/A"
+ )
+ if first_row:
+ print(f"{lang:<20} {dt:<25} {total_lexemes:<25}")
+ first_row = False
+
+ else:
+ print(
+ f"{'':<20} {dt:<25} {total_lexemes:<25}"
+ ) # print empty space for language
+
+ print()
+
+ elif language is not None and data_type is None:
+ print_total_lexemes(language=language)
+
+ elif language is not None and data_type is not None and not all_bool:
+ query_total_lexemes(language=language, data_type=data_type)
+
+ elif language is not None and data_type is not None:
+ print(
+ f"You have already specified language {language.capitalize()} and data type {data_type} - no need to specify --all."
+ )
+ query_total_lexemes(language=language, data_type=data_type)
+
+ else:
+ raise ValueError("Invalid input or missing information")
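Review note: the table layout is built by hand in both `total_wrapper` and `print_total_lexemes`, each with its own `first_row` flag. A shared formatter might look like the sketch below; `format_total_rows` is a hypothetical helper that is not part of this PR, and it assumes a failed query is reported as `None` and shown as "N/A", as `total_wrapper` already does.

```python
from __future__ import annotations


def format_total_rows(language: str, totals: dict[str, int | None]) -> list[str]:
    """Return header, separator, and one row per data type for a language."""
    rows = [
        f"{'Language':<20} {'Data Type':<25} {'Total Wikidata Lexemes':<25}",
        "=" * 70,
    ]
    for index, (data_type, total) in enumerate(totals.items()):
        # A failed query is represented by None and rendered as "N/A".
        total_str = f"{total:,}" if total is not None else "N/A"
        # Only the first row of a language block shows the language name.
        language_cell = language if index == 0 else ""
        rows.append(f"{language_cell:<20} {data_type:<25} {total_str:<25}")
    return rows
```

Calling `print("\n".join(format_total_rows("English", {"nouns": 1234, "verbs": None})))` would print a two-row block under the shared header.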
diff --git a/src/scribe_data/wikidata/parse_dump.py b/src/scribe_data/wikidata/parse_dump.py
index d1549e4f3..6af981628 100644
--- a/src/scribe_data/wikidata/parse_dump.py
+++ b/src/scribe_data/wikidata/parse_dump.py
@@ -436,7 +436,9 @@ def process_file(self, file_path: str, batch_size: int = 50000) -> None:
"Would you like to automatically re-download the dump file?",
default=True,
).ask():
- from scribe_data.cli.download import wd_lexeme_dump_download_wrapper
+ from scribe_data.cli.download.wikidata_lexeme_dump import (
+ wd_lexeme_dump_download_wrapper,
+ )
if new_file_path := wd_lexeme_dump_download_wrapper(
dump_snapshot="latest-lexemes",
diff --git a/src/scribe_data/wikidata/wikidata_utils.py b/src/scribe_data/wikidata/wikidata_utils.py
index 13a81aaca..cfda0f9d3 100644
--- a/src/scribe_data/wikidata/wikidata_utils.py
+++ b/src/scribe_data/wikidata/wikidata_utils.py
@@ -8,7 +8,9 @@
from rich import print as rprint
from SPARQLWrapper import JSON, POST, SPARQLWrapper
-from scribe_data.cli.download import wd_lexeme_dump_download_wrapper
+from scribe_data.cli.download.wikidata_lexeme_dump import (
+ wd_lexeme_dump_download_wrapper,
+)
from scribe_data.utils import (
DEFAULT_WIKIDATA_DUMP_EXPORT_DIR,
data_type_metadata,
diff --git a/src/scribe_data/wiktionary/parse_translations.py b/src/scribe_data/wiktionary/parse_translations.py
index c32a6a156..790410116 100644
--- a/src/scribe_data/wiktionary/parse_translations.py
+++ b/src/scribe_data/wiktionary/parse_translations.py
@@ -1305,7 +1305,7 @@ def _resolve_dump_path(
import questionary
- from scribe_data.cli.download import download_wiktionary_dumps
+ from scribe_data.cli.download.wiktionary_dump import download_wiktionary_dumps
print(f"\nNo {wiktionary} dump found locally.")
should_download = questionary.select(
diff --git a/tests/cli/contracts/test_contracts_check.py b/tests/cli/contracts/test_cli_contracts_check.py
similarity index 94%
rename from tests/cli/contracts/test_contracts_check.py
rename to tests/cli/contracts/test_cli_contracts_check.py
index ea3fc0868..78849c83c 100644
--- a/tests/cli/contracts/test_contracts_check.py
+++ b/tests/cli/contracts/test_cli_contracts_check.py
@@ -74,7 +74,7 @@ def mock_contract_metadata() -> dict[str, Any]:
@patch("scribe_data.cli.contracts.check.check_contract_data_completeness")
@patch("scribe_data.cli.contracts.check.print_missing_forms")
-def test_check_contracts_with_dir(
+def test_cli_contracts_with_dir(
mock_print: MagicMock, mock_check: MagicMock, mock_export_dir: Path
) -> None:
"""
@@ -90,7 +90,7 @@ def test_check_contracts_with_dir(
@patch("scribe_data.cli.contracts.check.check_contract_data_completeness")
@patch("scribe_data.cli.contracts.check.print_missing_forms")
-def test_check_contracts_default_dir(
+def test_cli_contracts_default_dir(
mock_print: MagicMock, mock_check: MagicMock
) -> None:
"""
@@ -108,7 +108,7 @@ def test_check_contracts_default_dir(
@patch("scribe_data.cli.contracts.check.Path")
-def test_check_contracts_nonexistent_dir(mock_path: MagicMock) -> None:
+def test_cli_contracts_nonexistent_dir(mock_path: MagicMock) -> None:
"""
Test check_contracts with a nonexistent directory.
"""
@@ -125,7 +125,7 @@ def test_check_contracts_nonexistent_dir(mock_path: MagicMock) -> None:
@patch("scribe_data.cli.contracts.check.data_contracts_langs", ["English"])
@patch("scribe_data.cli.contracts.check.get_language_iso")
@patch("scribe_data.cli.contracts.check.filter_contract_metadata")
-def test_check_contract_data_completeness_json_error(
+def test_cli_contracts_data_completeness_json_error(
mock_filter_metadata: MagicMock,
mock_get_iso: MagicMock,
mock_export_dir: Path,
@@ -148,7 +148,7 @@ def test_check_contract_data_completeness_json_error(
assert "Error reading" in mock_print.call_args[0][0]
-def test_print_missing_forms_none() -> None:
+def test_cli_contracts_print_missing_forms_none() -> None:
"""
Test print_missing_forms with no missing forms.
"""
@@ -161,7 +161,7 @@ def test_print_missing_forms_none() -> None:
)
-def test_print_missing_forms_with_missing() -> None:
+def test_cli_contracts_print_missing_forms_with_missing() -> None:
"""
Test print_missing_forms with missing forms.
"""
diff --git a/tests/cli/contracts/test_export.py b/tests/cli/contracts/test_cli_contracts_export.py
similarity index 90%
rename from tests/cli/contracts/test_export.py
rename to tests/cli/contracts/test_cli_contracts_export.py
index 59671d599..34421a790 100644
--- a/tests/cli/contracts/test_export.py
+++ b/tests/cli/contracts/test_cli_contracts_export.py
@@ -24,7 +24,7 @@ def contracts_source(tmp_path: Path) -> Path:
return source
-def test_export_contracts_fresh_export(tmp_path: Path, contracts_source: Path) -> None:
+def test_cli_contracts_export_new(tmp_path: Path, contracts_source: Path) -> None:
"""
Test fresh export when no existing contracts folder.
"""
@@ -41,7 +41,7 @@ def test_export_contracts_fresh_export(tmp_path: Path, contracts_source: Path) -
assert (output_dir / "de.yaml").exists()
-def test_export_contracts_success_message(
+def test_cli_contracts_export_success(
tmp_path: Path, contracts_source: Path, capsys
) -> None:
"""
@@ -59,7 +59,7 @@ def test_export_contracts_success_message(
assert "successfully exported" in captured.out.lower()
-def test_export_contracts_overwrite_confirmed(
+def test_cli_contracts_export_overwrite_confirmed(
tmp_path: Path, contracts_source: Path
) -> None:
"""
@@ -82,7 +82,7 @@ def test_export_contracts_overwrite_confirmed(
assert not (output_dir / "old.yaml").exists()
-def test_export_contracts_overwrite_declined(
+def test_cli_contracts_export_overwrite_declined(
tmp_path: Path, contracts_source: Path, capsys
) -> None:
"""
@@ -106,7 +106,7 @@ def test_export_contracts_overwrite_declined(
assert (output_dir / "old.yaml").exists()
-def test_export_contracts_source_not_found(tmp_path: Path) -> None:
+def test_cli_contracts_export_source_not_found(tmp_path: Path) -> None:
"""
Test assertion error when source directory not found.
"""
@@ -121,7 +121,9 @@ def test_export_contracts_source_not_found(tmp_path: Path) -> None:
export_contracts(output_dir=output_dir)
-def test_export_contracts_files_content(tmp_path: Path, contracts_source: Path) -> None:
+def test_cli_contracts_export_files_content(
+ tmp_path: Path, contracts_source: Path
+) -> None:
"""
Test that exported files have correct content.
"""
@@ -137,7 +139,7 @@ def test_export_contracts_files_content(tmp_path: Path, contracts_source: Path)
assert (output_dir / "de.yaml").read_text() == "language: german\n"
-def test_export_contracts_overwrite_default_declined(
+def test_cli_contracts_export_overwrite_default_declined(
tmp_path: Path, contracts_source: Path, capsys
) -> None:
"""
diff --git a/tests/cli/contracts/test_contracts_export.py b/tests/cli/contracts/test_cli_contracts_filter.py
similarity index 93%
rename from tests/cli/contracts/test_contracts_export.py
rename to tests/cli/contracts/test_cli_contracts_filter.py
index cd617f5ef..02187b83d 100644
--- a/tests/cli/contracts/test_contracts_export.py
+++ b/tests/cli/contracts/test_cli_contracts_filter.py
@@ -19,7 +19,7 @@
class TestFilterContractMetadata:
- def test_filter_contract_metadata_empty_file(self) -> None:
+ def test_cli_contracts_filter_metadata_empty_file(self) -> None:
"""
Test filtering with an empty contract file.
"""
@@ -31,7 +31,7 @@ def test_filter_contract_metadata_empty_file(self) -> None:
"verbs": {"conjugations": []},
}
- def test_filter_contract_metadata_numbers_dict(self) -> None:
+ def test_cli_contracts_filter_metadata_numbers_dict(self) -> None:
"""
Test filtering numbers as a dictionary.
"""
@@ -48,7 +48,7 @@ def test_filter_contract_metadata_numbers_dict(self) -> None:
assert "" not in result["nouns"]["numbers"]
assert "collective" in result["nouns"]["numbers"]
- def test_filter_contract_metadata_numbers_list(self) -> None:
+ def test_cli_contracts_filter_metadata_numbers_list(self) -> None:
"""
Test filtering numbers as a list.
"""
@@ -61,7 +61,7 @@ def test_filter_contract_metadata_numbers_list(self) -> None:
result = filter_contract_metadata(Path("fake_path.json"))
assert set(result["nouns"]["numbers"]) == {"singular", "plural", "dual"}
- def test_filter_contract_metadata_numbers_string(self) -> None:
+ def test_cli_contracts_filter_metadata_numbers_string(self) -> None:
"""
Test filtering numbers as a string.
"""
@@ -74,7 +74,7 @@ def test_filter_contract_metadata_numbers_string(self) -> None:
result = filter_contract_metadata(Path("fake_path.json"))
assert set(result["nouns"]["numbers"]) == {"singular", "plural", "dual"}
- def test_filter_contract_metadata_genders(self) -> None:
+ def test_cli_contracts_filter_metadata_genders(self) -> None:
"""
Test filtering genders.
"""
@@ -93,7 +93,7 @@ def test_filter_contract_metadata_genders(self) -> None:
assert "NOT_INCLUDED" not in result["nouns"]["genders"]
assert "" not in result["nouns"]["genders"]
- def test_filter_contract_metadata_conjugations_list(self) -> None:
+ def test_cli_contracts_filter_metadata_conjugations_list(self) -> None:
"""
Test filtering conjugations as a list.
"""
@@ -107,7 +107,7 @@ def test_filter_contract_metadata_conjugations_list(self) -> None:
assert set(result["verbs"]["conjugations"]) == {"run", "runs", "ran"}
assert "[running]" not in result["verbs"]["conjugations"]
- def test_filter_contract_metadata_error_handling(self) -> None:
+ def test_cli_contracts_filter_metadata_error_handling(self) -> None:
"""
Test error handling for invalid YAML.
"""
@@ -119,7 +119,7 @@ def test_filter_contract_metadata_error_handling(self) -> None:
class TestFilterExportedData:
- def test_filter_exported_data_nouns(self) -> None:
+ def test_cli_contracts_filter_exported_data_nouns(self) -> None:
"""
Test filtering exported noun data.
"""
@@ -169,7 +169,7 @@ def test_filter_exported_data_nouns(self) -> None:
assert result["L2"]["singular"] == "dog"
assert "irrelevant" not in result["L2"]
- def test_filter_exported_data_verbs(self) -> None:
+ def test_cli_contracts_filter_exported_data_verbs(self) -> None:
"""
Test filtering exported verb data.
"""
@@ -211,7 +211,7 @@ def test_filter_exported_data_verbs(self) -> None:
# L4 should not be included as it doesn't have enough valid fields.
assert "L4" not in result
- def test_filter_exported_data_unsupported_type(self) -> None:
+ def test_cli_contracts_filter_exported_data_unsupported_type(self) -> None:
"""
Test filtering with unsupported data type.
"""
@@ -226,7 +226,7 @@ def test_filter_exported_data_unsupported_type(self) -> None:
)
assert result == {}
- def test_filter_exported_data_error_handling(self) -> None:
+ def test_cli_contracts_filter_exported_data_error_handling(self) -> None:
"""
Test error handling for invalid JSON.
"""
@@ -253,7 +253,7 @@ class TestExportContracts:
@patch("pathlib.Path.exists")
@patch("builtins.open", new_callable=mock_open)
@patch("json.dump")
- def test_export_data_filtered_by_contracts(
+ def test_cli_contracts_export_data_filtered(
self,
mock_json_dump: MagicMock,
mock_file_open: MagicMock,
@@ -361,7 +361,7 @@ def mock_path_glob(self: Path, pattern: str) -> list[Path]:
@patch("scribe_data.cli.contracts.filter.get_language_from_iso")
@patch("os.listdir")
@patch("pathlib.Path.mkdir")
- def test_export_data_filtered_by_contracts_no_language_match(
+ def test_cli_contracts_export_data_filtered_no_language_match(
self,
mock_mkdir: MagicMock,
mock_listdir: MagicMock,
@@ -394,7 +394,7 @@ def test_export_data_filtered_by_contracts_no_language_match(
@patch("os.listdir")
@patch("pathlib.Path.mkdir")
@patch("pathlib.Path.exists")
- def test_export_data_filtered_by_contracts_no_input_file(
+ def test_cli_contracts_export_data_filtered_no_input_file(
self,
mock_exists: MagicMock,
mock_mkdir: MagicMock,
@@ -428,7 +428,7 @@ def test_export_data_filtered_by_contracts_no_input_file(
@patch("scribe_data.cli.contracts.filter.get_language_from_iso")
@patch("os.listdir")
@patch("pathlib.Path.mkdir")
- def test_export_data_filtered_by_contracts_empty_metadata(
+ def test_cli_contracts_export_data_filtered_empty_metadata(
self,
mock_mkdir: MagicMock,
mock_listdir: MagicMock,
diff --git a/tests/cli/convert/test_cli_convert_to_csv_or_tsv.py b/tests/cli/convert/test_cli_convert_to_csv_or_tsv.py
new file mode 100644
index 000000000..dc4719207
--- /dev/null
+++ b/tests/cli/convert/test_cli_convert_to_csv_or_tsv.py
@@ -0,0 +1,176 @@
+# SPDX-License-Identifier: GPL-3.0-or-later
+"""
+Tests for the CLI convert functionality.
+"""
+
+import unittest
+
+import pytest
+
+from scribe_data.cli.convert.to_csv_or_tsv import convert_to_csv_or_tsv
+
+# MARK: CSV or TSV
+
+
+class TestCLIConvertToCSVorTSV(unittest.TestCase):
+ @pytest.fixture(autouse=True)
+ def _setup_fixtures(self, tmp_path):
+ self.tmp_path = tmp_path
+
+ def test_cli_convert_to_csv_or_tsv_empty_language(self) -> None:
+ json_data = '{"key1": "value1", "key2": "value2"}'
+
+ input_file = self.tmp_path / "test.json"
+ input_file.write_text(json_data, encoding="utf-8")
+ output_dir = self.tmp_path / "output"
+ output_dir.mkdir(parents=True, exist_ok=True)
+
+ with self.assertRaises(ValueError) as context:
+ convert_to_csv_or_tsv(
+ language="",
+ data_types="nouns",
+ input_file=input_file,
+ output_dir=output_dir,
+ output_type="csv",
+ overwrite=True,
+ )
+
+ self.assertEqual(str(context.exception), "Language '' is not recognized.")
+
+ def test_cli_convert_to_csv_or_tsv_standard_dict_to_csv(self) -> None:
+ json_data = '{"a": "1", "b": "2"}'
+ expected_csv_output = "preposition,value\na,1\nb,2\n"
+
+ input_file = self.tmp_path / "test.json"
+ input_file.write_text(json_data, encoding="utf-8")
+ output_dir = self.tmp_path / "output"
+ output_dir.mkdir(parents=True, exist_ok=True)
+
+ convert_to_csv_or_tsv(
+ language="English",
+ data_types="prepositions",
+ input_file=input_file,
+ output_dir=output_dir,
+ output_type="csv",
+ overwrite=True,
+ )
+
+ output_file = output_dir / "English" / "prepositions.csv"
+ actual_content = output_file.read_text(encoding="utf-8")
+ assert actual_content == expected_csv_output
+
+ def test_cli_convert_to_csv_or_tsv_standard_dict_to_tsv(self) -> None:
+ json_data = '{"a": "1", "b": "2"}'
+ expected_tsv_output = "preposition\tvalue\na\t1\nb\t2\n"
+
+ input_file = self.tmp_path / "test.json"
+ input_file.write_text(json_data, encoding="utf-8")
+ output_dir = self.tmp_path / "output"
+ output_dir.mkdir(parents=True, exist_ok=True)
+
+ convert_to_csv_or_tsv(
+ language="English",
+ data_types="prepositions",
+ input_file=input_file,
+ output_dir=output_dir,
+ output_type="tsv",
+ overwrite=True,
+ )
+
+ output_file = output_dir / "English" / "prepositions.tsv"
+ actual_content = output_file.read_text(encoding="utf-8")
+ assert actual_content == expected_tsv_output
+
+ def test_cli_convert_to_csv_or_tsv_nested_dict_to_csv(self) -> None:
+ json_data = (
+ '{"a": {"value1": "1", "value2": "x"}, "b": {"value1": "2", "value2": "y"}}'
+ )
+ expected_csv_output = "noun,value1,value2\na,1,x\nb,2,y\n"
+
+ input_file = self.tmp_path / "test.json"
+ input_file.write_text(json_data, encoding="utf-8")
+ output_dir = self.tmp_path / "output"
+ output_dir.mkdir(parents=True, exist_ok=True)
+
+ convert_to_csv_or_tsv(
+ language="English",
+ data_types="nouns",
+ input_file=input_file,
+ output_dir=output_dir,
+ output_type="csv",
+ overwrite=True,
+ )
+
+ output_file = output_dir / "English" / "nouns.csv"
+ actual_content = output_file.read_text(encoding="utf-8")
+ assert actual_content == expected_csv_output
+
+ def test_cli_convert_to_csv_or_tsv_nested_dict_to_tsv(self) -> None:
+ json_data = (
+ '{"a": {"value1": "1", "value2": "x"}, "b": {"value1": "2", "value2": "y"}}'
+ )
+ expected_tsv_output = "noun\tvalue1\tvalue2\na\t1\tx\nb\t2\ty\n"
+
+ input_file = self.tmp_path / "test.json"
+ input_file.write_text(json_data, encoding="utf-8")
+ output_dir = self.tmp_path / "output"
+ output_dir.mkdir(parents=True, exist_ok=True)
+
+ convert_to_csv_or_tsv(
+ language="English",
+ data_types="nouns",
+ input_file=input_file,
+ output_dir=output_dir,
+ output_type="tsv",
+ overwrite=True,
+ )
+
+ output_file = output_dir / "English" / "nouns.tsv"
+ actual_content = output_file.read_text(encoding="utf-8")
+ assert actual_content == expected_tsv_output
+
+ def test_cli_convert_to_csv_or_tsv_list_of_dicts_to_csv(self) -> None:
+ json_data = '{"a": [{"emoji": "😀", "is_base": true, "rank": 1}, {"emoji": "😅", "is_base": false, "rank": 2}]}'
+ expected_csv_output = "word,emoji,is_base,rank\na,😀,True,1\na,😅,False,2\n"
+
+ input_file = self.tmp_path / "test.json"
+ input_file.write_text(json_data, encoding="utf-8")
+ output_dir = self.tmp_path / "output"
+ output_dir.mkdir(parents=True, exist_ok=True)
+
+ convert_to_csv_or_tsv(
+ language="English",
+ data_types="emoji-keywords",
+ input_file=input_file,
+ output_dir=output_dir,
+ output_type="csv",
+ overwrite=True,
+ )
+
+ output_file = output_dir / "English" / "emoji-keywords.csv"
+ actual_content = output_file.read_text(encoding="utf-8")
+ assert actual_content == expected_csv_output
+
+ def test_cli_convert_to_csv_or_tsv_list_of_dicts_to_tsv(self) -> None:
+ json_data = '{"a": [{"emoji": "😀", "is_base": true, "rank": 1}, {"emoji": "😅", "is_base": false, "rank": 2}]}'
+ expected_tsv_output = (
+ "word\temoji\tis_base\trank\na\t😀\tTrue\t1\na\t😅\tFalse\t2\n"
+ )
+
+ input_file = self.tmp_path / "test.json"
+ input_file.write_text(json_data, encoding="utf-8")
+ output_dir = self.tmp_path / "output"
+ output_dir.mkdir(parents=True, exist_ok=True)
+
+ convert_to_csv_or_tsv(
+ language="English",
+ data_types="emoji-keywords",
+ input_file=input_file,
+ output_dir=output_dir,
+ output_type="tsv",
+ overwrite=True,
+ )
+
+ output_file = output_dir / "English" / "emoji-keywords.tsv"
+ actual_content = output_file.read_text(encoding="utf-8")
+ assert actual_content == expected_tsv_output
diff --git a/tests/cli/convert/test_cli_convert_to_json.py b/tests/cli/convert/test_cli_convert_to_json.py
new file mode 100644
index 000000000..35f9ae9c8
--- /dev/null
+++ b/tests/cli/convert/test_cli_convert_to_json.py
@@ -0,0 +1,185 @@
+# SPDX-License-Identifier: GPL-3.0-or-later
+"""
+Tests for the CLI convert to JSON functionality.
+"""
+
+import json
+import unittest
+from io import StringIO
+from pathlib import Path
+from unittest.mock import MagicMock, patch
+
+import pytest
+
+from scribe_data.cli.convert.to_json import convert_to_json
+
+# MARK: JSON
+
+
+class TestCLIConvertToJSON(unittest.TestCase):
+ @pytest.fixture(autouse=True)
+ def _setup_fixtures(self, tmp_path):
+ self.tmp_path = tmp_path
+
+ @patch("scribe_data.cli.convert.to_json.Path", autospec=True)
+ def test_cli_convert_to_json_empty_language(self, mock_path: MagicMock) -> None:
+ csv_data = "key,value\na,1\nb,2"
+ mock_file = StringIO(csv_data)
+
+ mock_path_obj = MagicMock(spec=Path)
+ mock_path.return_value = mock_path_obj
+ mock_path_obj.suffix = ".csv"
+ mock_path_obj.exists.return_value = True
+ mock_path_obj.open.return_value.__enter__.return_value = mock_file
+
+ with self.assertRaises(ValueError) as context:
+ convert_to_json(
+ language="",
+ data_types="nouns",
+ input_file=Path("input.csv"),
+ output_dir=Path("/output_dir"),
+ output_type="json",
+ overwrite=True,
+ )
+ self.assertIn("Language '' is not recognized.", str(context.exception))
+
+ @patch("scribe_data.cli.convert.to_json.Path", autospec=True)
+ def test_cli_convert_to_json_supported_file_extension_csv(
+ self, mock_path_class: MagicMock
+ ) -> None:
+ mock_path_instance = MagicMock(spec=Path)
+
+ mock_path_class.return_value = mock_path_instance
+
+ mock_path_instance.suffix = ".csv"
+ mock_path_instance.exists.return_value = True
+
+ convert_to_json(
+ language="English",
+ data_types="nouns",
+ input_file=Path("test.csv"),
+ output_dir=Path("/output_dir"),
+ output_type="json",
+ overwrite=True,
+ )
+
+ @patch("scribe_data.cli.convert.to_json.Path", autospec=True)
+ def test_cli_convert_to_json_supported_file_extension_tsv(
+ self, mock_path_class: MagicMock
+ ) -> None:
+ mock_path_instance = MagicMock(spec=Path)
+
+ mock_path_class.return_value = mock_path_instance
+
+ mock_path_instance.suffix = ".tsv"
+ mock_path_instance.exists.return_value = True
+
+ convert_to_json(
+ language="English",
+ data_types="nouns",
+ input_file=Path("test.tsv"),
+ output_dir=Path("/output_dir"),
+ output_type="json",
+ overwrite=True,
+ )
+
+ def test_cli_convert_to_json_unsupported_file_extension(self) -> None:
+ input_file = self.tmp_path / "test.txt"
+ input_file.write_text("Hello, world!", encoding="utf-8")
+ output_dir = self.tmp_path / "output"
+ output_dir.mkdir(parents=True, exist_ok=True)
+
+ with self.assertRaises(ValueError) as context:
+ convert_to_json(
+ language="English",
+ data_types="nouns",
+ input_file=input_file,
+ output_dir=output_dir,
+ output_type="json",
+ overwrite=True,
+ )
+
+ self.assertIn("Unsupported file extension", str(context.exception))
+ self.assertEqual(
+ str(context.exception),
+ f"Unsupported file extension '.txt' for {input_file}. Please provide a '.csv' or '.tsv' file.",
+ )
+
+ def test_cli_convert_to_json_standard_csv(self) -> None:
+ csv_data = "key,value\na,1\nb,2"
+ expected_json_output = {"a": "1", "b": "2"}
+
+ input_file = self.tmp_path / "test.csv"
+ input_file.write_text(csv_data, encoding="utf-8")
+ output_dir = self.tmp_path / "output"
+ output_dir.mkdir(parents=True, exist_ok=True)
+
+ convert_to_json(
+ language="English",
+ data_types="nouns",
+ input_file=input_file,
+ output_dir=output_dir,
+ output_type="json",
+ overwrite=True,
+ )
+
+ output_file = output_dir / "English" / "nouns.json"
+ with open(output_file, "r", encoding="utf-8") as f:
+ actual_content = json.load(f)
+
+ assert actual_content == expected_json_output
+
+ def test_cli_convert_to_json_with_multiple_keys(self) -> None:
+ csv_data = "key,value1,value2\na,1,x\nb,2,y\nc,3,z"
+ expected_json_output = {
+ "a": {"value1": "1", "value2": "x"},
+ "b": {"value1": "2", "value2": "y"},
+ "c": {"value1": "3", "value2": "z"},
+ }
+
+ input_file = self.tmp_path / "test.csv"
+ input_file.write_text(csv_data, encoding="utf-8")
+ output_dir = self.tmp_path / "output"
+ output_dir.mkdir(parents=True, exist_ok=True)
+
+ convert_to_json(
+ language="English",
+ data_types="nouns",
+ input_file=input_file,
+ output_dir=output_dir,
+ output_type="json",
+ overwrite=True,
+ )
+
+ output_file = output_dir / "English" / "nouns.json"
+ with open(output_file, "r", encoding="utf-8") as f:
+ actual_content = json.load(f)
+
+ assert actual_content == expected_json_output
+
+ def test_cli_convert_to_json_with_complex_structure(self) -> None:
+ csv_data = "key,emoji,is_base,rank\na,😀,true,1\nb,😅,false,2"
+ expected_json_output = {
+ "a": [{"emoji": "😀", "is_base": True, "rank": 1}],
+ "b": [{"emoji": "😅", "is_base": False, "rank": 2}],
+ }
+
+ input_file = self.tmp_path / "test.csv"
+ input_file.write_text(csv_data, encoding="utf-8")
+ output_dir = self.tmp_path / "output"
+ output_dir.mkdir(parents=True, exist_ok=True)
+
+ convert_to_json(
+ language="English",
+ data_types="nouns",
+ input_file=input_file,
+ output_dir=output_dir,
+ output_type="json",
+ overwrite=True,
+ )
+
+ output_file = output_dir / "English" / "nouns.json"
+ with open(output_file, "r", encoding="utf-8") as f:
+ actual_content = json.load(f)
+
+ assert actual_content == expected_json_output
diff --git a/tests/load/test_data_to_sqlite.py b/tests/cli/convert/test_cli_convert_to_sqlite.py
similarity index 90%
rename from tests/load/test_data_to_sqlite.py
rename to tests/cli/convert/test_cli_convert_to_sqlite.py
index 08701dd86..c088c6a54 100644
--- a/tests/load/test_data_to_sqlite.py
+++ b/tests/cli/convert/test_cli_convert_to_sqlite.py
@@ -1,6 +1,6 @@
# SPDX-License-Identifier: GPL-3.0-or-later
"""
-Test the data_to_sqlite function.
+Test the convert_to_sqlite function.
"""
import json
@@ -11,9 +11,9 @@
import pytest
-from scribe_data.load.data_to_sqlite import (
+from scribe_data.cli.convert.to_sqlite import (
+ convert_to_sqlite,
create_table,
- data_to_sqlite,
table_insert,
translations_to_sqlite,
wiktionary_translations_to_sqlite,
@@ -63,7 +63,10 @@ def temp_json_dir(tmp_path: Path) -> Path:
return json_dir
-def test_create_table(temp_db: Any) -> None:
+# MARK: Operations
+
+
+def test_cli_convert_to_sqlite_create_table(temp_db: Any) -> None:
"""
Test creating a table with both snake and camel case identifiers.
"""
@@ -83,7 +86,7 @@ def test_create_table(temp_db: Any) -> None:
assert "another_col" in columns
-def test_table_insert(temp_db: Any) -> None:
+def test_cli_convert_to_sqlite_table_insert(temp_db: Any) -> None:
"""
Test inserting data into a table.
"""
@@ -121,7 +124,10 @@ def translations_setup(tmp_path: Path) -> dict[str, Any]:
}
-def test_translations_to_sqlite(
+# MARK: Conversions
+
+
+def test_cli_convert_to_sqlite_translations(
temp_json_dir: Path, translations_setup: dict[str, Any]
) -> None:
"""
@@ -151,7 +157,7 @@ def test_translations_to_sqlite(
conn.close()
-def test_overwrite_existing_file_user_confirms(
+def test_cli_convert_to_sqlite_overwrite_existing_file_user_confirms(
temp_json_dir: Path, translations_setup: dict[str, Any]
) -> None:
"""
@@ -177,7 +183,7 @@ def test_overwrite_existing_file_user_confirms(
mock_remove.assert_called_once_with(translations_setup["expected_db_path"])
-def test_overwrite_existing_file_user_declines(
+def test_cli_convert_to_sqlite_overwrite_existing_file_user_declines(
temp_json_dir: Path, translations_setup: dict[str, Any]
) -> None:
"""
@@ -204,7 +210,7 @@ def test_overwrite_existing_file_user_declines(
mock_print.assert_called_with("Skipping translation DB creation.")
-def test_translations_to_sqlite_missing_json(
+def test_cli_convert_to_sqlite_translations_missing_json(
temp_json_dir: Path, translations_setup: dict[str, Any], capsys: Any
) -> None:
"""
@@ -242,7 +248,7 @@ def __getattr__(self, name: str) -> Any:
return getattr(self._conn, name)
-def test_translations_to_sqlite_commit_error(
+def test_cli_convert_to_sqlite_translations_commit_error(
temp_json_dir: Path, translations_setup: dict[str, Any], capsys: Any
) -> None:
"""
@@ -269,15 +275,15 @@ def mock_connect(*args: Any, **kwargs: Any) -> MockConnection:
assert "mock commit error" in captured.out
-def test_data_to_sqlite_invalid_language() -> None:
+def test_cli_convert_to_sqlite_convert_invalid_language() -> None:
"""
- Test data_to_sqlite with invalid language.
+ Test convert_to_sqlite with invalid language.
"""
with pytest.raises(ValueError):
- data_to_sqlite(languages=["invalid_language"])
+ convert_to_sqlite(languages=["invalid_language"])
-def test_create_table_duplicate_columns(temp_db: Any) -> None:
+def test_cli_convert_to_sqlite_create_table_duplicate_columns(temp_db: Any) -> None:
"""
Test creating a table with duplicate column names.
"""
@@ -294,7 +300,7 @@ def test_create_table_duplicate_columns(temp_db: Any) -> None:
assert len(set(columns)) == 3 # all columns should be unique
-def test_data_to_sqlite_translations_and_nouns(tmp_path: Path) -> None:
+def test_cli_convert_to_sqlite_translations_and_nouns(tmp_path: Path) -> None:
input_dir = tmp_path / "input"
output_dir = tmp_path / "output"
input_dir.mkdir()
@@ -320,7 +326,7 @@ def test_data_to_sqlite_translations_and_nouns(tmp_path: Path) -> None:
}
(english_dir / "nouns.json").write_text(json.dumps(nouns_data))
- data_to_sqlite(
+ convert_to_sqlite(
languages=["english"],
specific_tables=None,
input_file=str(input_dir),
@@ -352,7 +358,7 @@ def test_data_to_sqlite_translations_and_nouns(tmp_path: Path) -> None:
assert len(scribe_row) == 1
-def test_data_to_sqlite_skips_missing_json(tmp_path: Path) -> None:
+def test_cli_convert_to_sqlite_skips_missing_json(tmp_path: Path) -> None:
input_dir = tmp_path / "input"
input_dir.mkdir()
lang_dir = input_dir / "english"
@@ -364,15 +370,19 @@ def test_data_to_sqlite_skips_missing_json(tmp_path: Path) -> None:
mock.patch("scribe_data.utils.data_type_metadata", {"nouns": None}),
mock.patch("scribe_data.utils.language_metadata", {"english": {}}),
mock.patch("scribe_data.utils.list_all_languages", return_value=["english"]),
- mock.patch("scribe_data.load.data_to_sqlite.create_table") as mock_create_table,
- mock.patch("scribe_data.load.data_to_sqlite.table_insert") as mock_table_insert,
+ mock.patch(
+ "scribe_data.cli.convert.to_sqlite.create_table"
+ ) as mock_create_table,
+ mock.patch(
+ "scribe_data.cli.convert.to_sqlite.table_insert"
+ ) as mock_table_insert,
mock.patch(
"scribe_data.utils.get_language_iso",
side_effect=lambda lang: lang[:2].upper(),
),
):
- # Run data_to_sqlite for 'nouns' only, but JSON file missing.
- data_to_sqlite(
+ # Run convert_to_sqlite for 'nouns' only, but JSON file missing.
+ convert_to_sqlite(
languages=["english"],
specific_tables=["nouns"],
input_file=str(input_dir),
@@ -388,7 +398,7 @@ def test_data_to_sqlite_skips_missing_json(tmp_path: Path) -> None:
# MARK: Wiktionary translations to SQLite
-def test_wiktionary_translations_to_sqlite_basic(tmp_path):
+def test_cli_convert_to_sqlite_wiktionary_translations_basic(tmp_path):
"""
Test basic wiktionary_translations_to_sqlite conversion.
"""
@@ -474,7 +484,7 @@ def test_wiktionary_translations_to_sqlite_basic(tmp_path):
conn.close()
-def test_wiktionary_translations_to_sqlite_camel_case(tmp_path):
+def test_cli_convert_to_sqlite_wiktionary_translations_camel_case(tmp_path):
"""
Test wiktionary_translations_to_sqlite with camelCase identifiers.
"""
@@ -515,7 +525,7 @@ def test_wiktionary_translations_to_sqlite_camel_case(tmp_path):
conn.close()
-def test_wiktionary_translations_to_sqlite_missing_dir(tmp_path, capsys):
+def test_cli_convert_to_sqlite_wiktionary_translations_missing_dir(tmp_path, capsys):
"""
Test wiktionary_translations_to_sqlite with non-existent language directory.
"""
@@ -530,7 +540,7 @@ def test_wiktionary_translations_to_sqlite_missing_dir(tmp_path, capsys):
assert "Skipping Wiktionary translations" in captured.out
-def test_wiktionary_translations_to_sqlite_no_translation_files(tmp_path):
+def test_cli_convert_to_sqlite_wiktionary_translations_no_files(tmp_path):
"""
Test that no database is created when there are no translation files.
"""
@@ -555,7 +565,7 @@ def test_wiktionary_translations_to_sqlite_no_translation_files(tmp_path):
assert not db_path.exists()
-def test_wiktionary_translations_to_sqlite_multiple_files(tmp_path):
+def test_cli_convert_to_sqlite_wiktionary_translations_multiple_files(tmp_path):
"""
Test wiktionary_translations_to_sqlite with multiple translation files.
"""
diff --git a/tests/cli/convert/test_cli_convert_wrapper.py b/tests/cli/convert/test_cli_convert_wrapper.py
new file mode 100644
index 000000000..aba1b7cde
--- /dev/null
+++ b/tests/cli/convert/test_cli_convert_wrapper.py
@@ -0,0 +1,148 @@
+# SPDX-License-Identifier: GPL-3.0-or-later
+"""
+Tests for the CLI convert wrapper functionality.
+"""
+
+import unittest
+from pathlib import Path
+from unittest.mock import MagicMock, patch
+
+import pytest
+
+from scribe_data.cli.convert.wrapper import convert_wrapper
+
+# MARK: Wrapper
+
+
+class TestCLIConvertWrapper(unittest.TestCase):
+ @pytest.fixture(autouse=True)
+ def _setup_fixtures(self, tmp_path):
+ self.tmp_path = tmp_path
+
+ @patch("scribe_data.cli.convert.wrapper.Path", autospec=True)
+ @patch("scribe_data.cli.convert.wrapper.convert_to_sqlite", autospec=True)
+ @patch("shutil.copy")
+ def test_cli_convert_wrapper_to_sqlite(
+ self,
+ mock_shutil_copy: MagicMock,
+ mock_convert_to_sqlite: MagicMock,
+ mock_path: MagicMock,
+ ) -> None:
+ mock_path.return_value.exists.return_value = True
+
+ convert_wrapper(
+ languages=["english"],
+ data_types=["nouns"],
+ input_path=Path("file"),
+ output_dir=Path("/output"),
+ output_type="sqlite",
+ overwrite=True,
+ identifier_case="camel",
+ )
+
+ mock_convert_to_sqlite.assert_called_with(
+ languages=["english"],
+ specific_tables=["nouns"],
+ identifier_case="camel",
+ input_file=Path("file"),
+ output_file=Path("/output"),
+ overwrite=True,
+ )
+
+ @patch("scribe_data.cli.convert.wrapper.Path", autospec=True)
+ @patch("scribe_data.cli.convert.wrapper.convert_to_sqlite", autospec=True)
+ def test_cli_convert_wrapper_to_sqlite_no_output_dir(
+ self, mock_convert_to_sqlite: MagicMock, mock_path: MagicMock
+ ) -> None:
+ mock_input_file = MagicMock()
+ mock_input_file.exists.return_value = True
+
+ mock_path.return_value = mock_input_file
+
+ mock_input_file.parent = MagicMock()
+ mock_input_file.parent.__truediv__.return_value = MagicMock()
+ mock_input_file.parent.__truediv__.return_value.exists.return_value = False
+
+ convert_wrapper(
+ languages=["english"],
+ data_types=["nouns"],
+ input_path=Path(mock_input_file),
+ output_dir=None,
+ output_type="sqlite",
+ overwrite=True,
+ identifier_case="camel",
+ )
+
+ mock_convert_to_sqlite.assert_called_with(
+ languages=["english"],
+ specific_tables=["nouns"],
+ identifier_case="camel",
+ input_file=Path(mock_input_file),
+ output_file=Path("scribe_data_sqlite_export"),
+ overwrite=True,
+ )
+
+ @patch("scribe_data.cli.convert.wrapper.convert_to_sqlite", autospec=True)
+ def test_cli_convert_wrapper_german_wiktionary_translations_sqlite(
+ self, mock_convert_to_sqlite: MagicMock
+ ) -> None:
+ convert_wrapper(
+ languages=["german"],
+ data_types=["wiktionary_translations"],
+ input_path=Path("/input"),
+ output_dir=Path("/output"),
+ output_type="sqlite",
+ overwrite=False,
+ identifier_case="camel",
+ )
+
+ mock_convert_to_sqlite.assert_called_once_with(
+ languages=["german"],
+ specific_tables=["wiktionary_translations"],
+ identifier_case="camel",
+ input_file=Path("/input"),
+ output_file=Path("/output"),
+ overwrite=False,
+ )
+
+ @patch(
+ "scribe_data.cli.convert.wrapper.DEFAULT_WIKTIONARY_JSON_EXPORT_DIR",
+ new=Path("/mock_wiktionary_dir"),
+ )
+ @patch("scribe_data.cli.convert.wrapper.convert_to_sqlite", autospec=True)
+ def test_cli_convert_wrapper_wiktionary_no_input_path_uses_wiktionary_default(
+ self, mock_convert_to_sqlite: MagicMock
+ ) -> None:
+ convert_wrapper(
+ languages=["german"],
+ data_types=["wiktionary_translations"],
+ input_path=None,
+ output_dir=Path("/output"),
+ output_type="sqlite",
+ overwrite=False,
+ )
+
+ mock_convert_to_sqlite.assert_called_once_with(
+ languages=["german"],
+ specific_tables=["wiktionary_translations"],
+ identifier_case="camel",
+ input_file=Path("/mock_wiktionary_dir"),
+ output_file=Path("/output"),
+ overwrite=False,
+ )
+
+ def test_cli_convert_wrapper_unsupported_output_type(self) -> None:
+ with self.assertRaises(ValueError) as context:
+ convert_wrapper(
+ languages=["English"],
+ data_types=["nouns"],
+ input_path=Path("Data/ecode.csv"),
+ output_dir=Path("/output_dir"),
+ output_type="parquet",
+ overwrite=True,
+ )
+
+ self.assertEqual(
+ str(context.exception),
+ "Unsupported output type 'parquet'. Must be 'json', 'csv', 'tsv' or 'sqlite'.",
+ )
diff --git a/tests/cli/test_download.py b/tests/cli/download/test_cli_download_wikidata_lexeme_dump.py
similarity index 81%
rename from tests/cli/test_download.py
rename to tests/cli/download/test_cli_download_wikidata_lexeme_dump.py
index 5347a3a9f..c8012223c 100644
--- a/tests/cli/test_download.py
+++ b/tests/cli/download/test_cli_download_wikidata_lexeme_dump.py
@@ -10,8 +10,8 @@
import requests
-from scribe_data.cli.download import (
- available_closest_lexeme_dumpfile,
+from scribe_data.cli.download.wikidata_lexeme_dump import (
+ available_closest_lexeme_dump_file,
download_wd_lexeme_dump,
parse_date,
wd_lexeme_dump_download_wrapper,
@@ -35,8 +35,8 @@ def test_parse_date_invalid_format(self) -> None:
self.assertIsNone(parse_date("99-16-77"))
self.assertIsNone(parse_date("invalid-date"))
- @patch("scribe_data.cli.download.requests.get")
- def test_available_closest_lexeme_dumpfile(self, mock_get: MagicMock) -> None:
+ @patch("scribe_data.cli.download.wikidata_lexeme_dump.requests.get")
+ def test_available_closest_lexeme_dump_file(self, mock_get: MagicMock) -> None:
"""
Test finding closest available lexeme dump file.
@@ -50,14 +50,14 @@ def test_available_closest_lexeme_dumpfile(self, mock_get: MagicMock) -> None:
)
target_date = "20240103"
other_old_dumps = ["20240101", "20240105", "20240110"]
- closest = available_closest_lexeme_dumpfile(
+ closest = available_closest_lexeme_dump_file(
target_date, other_old_dumps, mock_check_func
)
self.assertEqual(closest, "20240101")
- @patch("scribe_data.cli.download.requests.get")
- @patch("scribe_data.cli.download.re.findall")
- def test_download_wd_lexeme_dump_latest(
+ @patch("scribe_data.cli.download.wikidata_lexeme_dump.requests.get")
+ @patch("scribe_data.cli.download.wikidata_lexeme_dump.re.findall")
+ def test_cli_download_wd_lexeme_dump_latest(
self, mock_findall: MagicMock, mock_get: MagicMock
) -> None:
"""
@@ -72,9 +72,9 @@ def test_download_wd_lexeme_dump_latest(
"https://dumps.wikimedia.org/wikidatawiki/entities/latest-lexemes.json.bz2",
)
- @patch("scribe_data.cli.download.requests.get")
- @patch("scribe_data.cli.download.re.findall")
- def test_download_wd_lexeme_dump_by_date(
+ @patch("scribe_data.cli.download.wikidata_lexeme_dump.requests.get")
+ @patch("scribe_data.cli.download.wikidata_lexeme_dump.re.findall")
+ def test_cli_download_wd_lexeme_dump_by_date(
self, mock_findall: MagicMock, mock_get: MagicMock
) -> None:
"""
@@ -89,14 +89,15 @@ def test_download_wd_lexeme_dump_by_date(
"https://dumps.wikimedia.org/wikidatawiki/entities/20241127/wikidata-20241127-lexemes.json.bz2",
)
- @patch("scribe_data.cli.download.requests.get")
+ @patch("scribe_data.cli.download.wikidata_lexeme_dump.requests.get")
@patch(
- "scribe_data.cli.download.check_lexeme_dump_prompt_download", return_value=False
+ "scribe_data.cli.download.wikidata_lexeme_dump.check_lexeme_dump_prompt_download",
+ return_value=False,
)
- @patch("scribe_data.cli.download.open", new_callable=mock_open)
- @patch("scribe_data.cli.download.tqdm")
- @patch("scribe_data.cli.download.os.makedirs")
- @patch("scribe_data.cli.download.questionary.confirm")
+ @patch("scribe_data.cli.download.wikidata_lexeme_dump.open", new_callable=mock_open)
+ @patch("scribe_data.cli.download.wikidata_lexeme_dump.tqdm")
+ @patch("scribe_data.cli.download.wikidata_lexeme_dump.os.makedirs")
+ @patch("scribe_data.cli.download.wikidata_lexeme_dump.questionary.confirm")
def test_wd_lexeme_dump_download_wrapper_latest(
self,
mock_confirm: MagicMock,
@@ -172,9 +173,9 @@ def test_check_lexeme_dump_prompt_download_delete(
self.assertTrue(mock_unlink.called)
self.assertTrue(result)
- @patch("scribe_data.cli.download.requests.get")
- @patch("scribe_data.cli.download.questionary.confirm")
- def test_download_wd_lexeme_dump_http_error(
+ @patch("scribe_data.cli.download.wikidata_lexeme_dump.requests.get")
+ @patch("scribe_data.cli.download.wikidata_lexeme_dump.questionary.confirm")
+ def test_cli_download_wd_lexeme_dump_http_error(
self, mock_confirm: MagicMock, mock_get: MagicMock
) -> None:
"""
@@ -199,8 +200,8 @@ def test_download_wd_lexeme_dump_http_error(
"We could not find your requested Wikidata lexeme dump."
)
- @patch("scribe_data.cli.download.requests.get")
- def test_download_wd_lexeme_dump_request_exception(
+ @patch("scribe_data.cli.download.wikidata_lexeme_dump.requests.get")
+ def test_cli_download_wd_lexeme_dump_request_exception(
self, mock_get: MagicMock
) -> None:
"""
@@ -213,9 +214,9 @@ def test_download_wd_lexeme_dump_request_exception(
self.assertIsNone(result)
mock_print.assert_called_with("An error occurred: Connection error")
- @patch("scribe_data.cli.download.requests.get")
- @patch("scribe_data.cli.download.questionary.confirm")
- def test_download_wd_lexeme_dump_find_closest(
+ @patch("scribe_data.cli.download.wikidata_lexeme_dump.requests.get")
+ @patch("scribe_data.cli.download.wikidata_lexeme_dump.questionary.confirm")
+ def test_cli_download_wd_lexeme_dump_find_closest(
self, mock_confirm: MagicMock, mock_get: MagicMock
) -> None:
"""
@@ -244,9 +245,9 @@ def test_download_wd_lexeme_dump_find_closest(
self.assertIsNotNone(result)
self.assertIn("20240101", result)
- @patch("scribe_data.cli.download.requests.get")
- @patch("scribe_data.cli.download.questionary.confirm")
- def test_download_wd_lexeme_dump_user_declines_closest(
+ @patch("scribe_data.cli.download.wikidata_lexeme_dump.requests.get")
+ @patch("scribe_data.cli.download.wikidata_lexeme_dump.questionary.confirm")
+ def test_cli_download_wd_lexeme_dump_user_declines_closest(
self, mock_confirm: MagicMock, mock_get: MagicMock
) -> None:
"""
@@ -269,14 +270,18 @@ def test_wd_lexeme_dump_download_wrapper_default_flag(self) -> None:
"""
Test wrapper function with default flag set to True.
"""
- with patch("scribe_data.cli.download.download_wd_lexeme_dump") as mock_download:
+ with patch(
+ "scribe_data.cli.download.wikidata_lexeme_dump.download_wd_lexeme_dump"
+ ) as mock_download:
mock_download.return_value = None
result = wd_lexeme_dump_download_wrapper(default=True)
self.assertFalse(result)
- @patch("scribe_data.cli.download.requests.get")
- def test_download_wd_lexeme_dump_invalid_date(self, mock_get: MagicMock) -> None:
+ @patch("scribe_data.cli.download.wikidata_lexeme_dump.requests.get")
+ def test_cli_download_wd_lexeme_dump_invalid_date(
+ self, mock_get: MagicMock
+ ) -> None:
"""
Test downloading with invalid date format.
"""
diff --git a/tests/cli/interactive/test_cli_interactive_config.py b/tests/cli/interactive/test_cli_interactive_config.py
new file mode 100644
index 000000000..c4f63f037
--- /dev/null
+++ b/tests/cli/interactive/test_cli_interactive_config.py
@@ -0,0 +1,124 @@
+# SPDX-License-Identifier: GPL-3.0-or-later
+"""
+Tests for the CLI interactive mode configuration functionality.
+"""
+
+import unittest
+from pathlib import Path
+from unittest.mock import MagicMock, patch
+
+from scribe_data.cli.interactive.config import ScribeDataConfig
+from scribe_data.cli.interactive.run import configure_settings
+
+
+class TestScribeDataCLIInteractiveConfig(unittest.TestCase):
+ def setUp(self) -> None:
+ """
+ Set up test fixtures before each test method.
+ """
+ self.config = ScribeDataConfig()
+ # Mock the language_metadata and data_type_metadata.
+ self.config.languages = ["english", "spanish", "french"]
+ self.config.data_types = ["nouns", "verbs"]
+
+ def test_cli_interactive_config_initialization(self) -> None:
+ """
+ Test ScribeDataConfig initialization.
+ """
+ self.assertEqual(self.config.selected_languages, [])
+ self.assertEqual(self.config.selected_data_types, [])
+ self.assertEqual(self.config.output_type, "json")
+ self.assertIsInstance(self.config.output_dir, Path)
+ self.assertFalse(self.config.overwrite)
+ self.assertFalse(self.config.configured)
+
+ @patch("scribe_data.cli.interactive.run.prompt_for_data_types")
+ @patch("scribe_data.cli.interactive.run.prompt_for_languages")
+ @patch("scribe_data.cli.interactive.run.prompt")
+ @patch("scribe_data.cli.interactive.run.rprint")
+ def test_cli_interactive_configure_settings_all_languages(
+ self,
+ mock_rprint: MagicMock,
+ mock_prompt: MagicMock,
+ mock_prompt_languages: MagicMock,
+ mock_prompt_data_types: MagicMock,
+ ) -> None:
+ """
+ Test configure_settings with 'All' languages selection.
+ """
+
+ # Simulate the internal changes made by the prompt_for_* functions.
+ def mock_lang():
+ self.config.selected_languages = self.config.languages
+
+ def mock_data():
+ self.config.selected_data_types = ["nouns"]
+
+ mock_prompt_languages.side_effect = mock_lang
+ mock_prompt_data_types.side_effect = mock_data
+
+ responses = iter(
+ [
+ "json", # output type
+ "", # output directory (default)
+ "y", # overwrite
+ ]
+ )
+ mock_prompt.side_effect = lambda *args, **kwargs: next(responses)
+
+ with patch(
+ "scribe_data.cli.interactive.run.interactive_mode_config", self.config
+ ):
+ with patch("scribe_data.cli.interactive.run.display_summary"):
+ configure_settings()
+
+ self.assertEqual(self.config.selected_languages, self.config.languages)
+ self.assertEqual(self.config.selected_data_types, ["nouns"])
+ self.assertEqual(self.config.output_type, "json")
+ self.assertTrue(self.config.configured)
+
+ @patch("scribe_data.cli.interactive.run.prompt_for_data_types")
+ @patch("scribe_data.cli.interactive.run.prompt_for_languages")
+ @patch("scribe_data.cli.interactive.run.prompt")
+ @patch("scribe_data.cli.interactive.run.rprint")
+ def test_cli_interactive_configure_settings_specific_languages(
+ self,
+ mock_rprint: MagicMock,
+ mock_prompt: MagicMock,
+ mock_prompt_languages: MagicMock,
+ mock_prompt_data_types: MagicMock,
+ ) -> None:
+ """
+ Test configure_settings with specific language selection.
+ """
+
+ # Simulate the internal changes made by the prompt_for_* functions.
+ def mock_lang():
+ self.config.selected_languages = ["english", "spanish"]
+
+ def mock_data():
+ self.config.selected_data_types = ["nouns", "verbs"]
+
+ mock_prompt_languages.side_effect = mock_lang
+ mock_prompt_data_types.side_effect = mock_data
+
+ responses = iter(
+ [
+ "csv", # output type
+ "/custom/path", # output directory
+ "n", # overwrite
+ ]
+ )
+ mock_prompt.side_effect = lambda *args, **kwargs: next(responses)
+
+ with patch(
+ "scribe_data.cli.interactive.run.interactive_mode_config", self.config
+ ):
+ with patch("scribe_data.cli.interactive.run.display_summary"):
+ configure_settings()
+
+ self.assertEqual(self.config.selected_languages, ["english", "spanish"])
+ self.assertEqual(self.config.selected_data_types, ["nouns", "verbs"])
+ self.assertEqual(self.config.output_type, "csv")
+ self.assertEqual(self.config.output_dir.as_posix(), "/custom/path")
+ self.assertFalse(self.config.overwrite)
diff --git a/tests/cli/interactive/test_cli_interactive_execute.py b/tests/cli/interactive/test_cli_interactive_execute.py
new file mode 100644
index 000000000..ad8ea61be
--- /dev/null
+++ b/tests/cli/interactive/test_cli_interactive_execute.py
@@ -0,0 +1,70 @@
+# SPDX-License-Identifier: GPL-3.0-or-later
+"""
+Tests for the CLI interactive mode execution functionality.
+"""
+
+import unittest
+from unittest.mock import MagicMock, patch
+
+from scribe_data.cli.interactive.config import ScribeDataConfig
+from scribe_data.cli.interactive.execute import (
+ display_summary,
+ execute_request,
+)
+
+
+class TestScribeDataCLIInteractiveExecute(unittest.TestCase):
+ def setUp(self) -> None:
+ """
+ Set up test fixtures before each test method.
+ """
+ self.config = ScribeDataConfig()
+ # Mock the language_metadata and data_type_metadata.
+ self.config.languages = ["english", "spanish", "french"]
+ self.config.data_types = ["nouns", "verbs"]
+
+ @patch("scribe_data.cli.interactive.execute.get_data")
+ @patch("scribe_data.cli.interactive.execute.tqdm")
+ @patch("scribe_data.cli.interactive.execute.logger")
+ def test_cli_interactive_execute_request(
+ self, mock_logger: MagicMock, mock_tqdm: MagicMock, mock_get_data: MagicMock
+ ) -> None:
+ """
+ Test execute_request functionality.
+ """
+ self.config.selected_languages = ["english"]
+ self.config.selected_data_types = ["nouns"]
+ self.config.configured = True
+
+ mock_get_data.return_value = True
+ mock_progress = MagicMock()
+ mock_tqdm.return_value.__enter__.return_value = mock_progress
+
+ with patch(
+ "scribe_data.cli.interactive.execute.interactive_mode_config", self.config
+ ):
+ execute_request()
+
+ mock_get_data.assert_called_once_with(
+ languages=["english"],
+ data_types=["nouns"],
+ output_type=self.config.output_type,
+ output_dir=self.config.output_dir,
+ overwrite=self.config.overwrite,
+ interactive=True,
+ )
+
+ @patch("rich.console.Console.print")
+ def test_cli_interactive_display_summary(self, mock_print: MagicMock) -> None:
+ """
+ Test display_summary functionality.
+ """
+ self.config.selected_languages = ["english"]
+ self.config.selected_data_types = ["nouns"]
+ self.config.output_type = "json"
+
+ with patch(
+ "scribe_data.cli.interactive.execute.interactive_mode_config", self.config
+ ):
+ display_summary()
+ mock_print.assert_called()
diff --git a/tests/cli/interactive/test_cli_interactive_prompt.py b/tests/cli/interactive/test_cli_interactive_prompt.py
new file mode 100644
index 000000000..1b73ed092
--- /dev/null
+++ b/tests/cli/interactive/test_cli_interactive_prompt.py
@@ -0,0 +1,109 @@
+# SPDX-License-Identifier: GPL-3.0-or-later
+"""
+Tests for the CLI interactive mode prompt functionality.
+"""
+
+import tempfile
+import unittest
+from pathlib import Path
+from unittest.mock import MagicMock, call, patch
+
+from prompt_toolkit.completion import WordCompleter
+
+from scribe_data.cli.interactive.config import ScribeDataConfig
+from scribe_data.cli.interactive.prompt import (
+ create_word_completer,
+ prompt_for_data_types,
+ prompt_for_languages,
+ resolve_wiktionary_dump_path,
+)
+
+
+class TestScribeDataCLIInteractivePrompt(unittest.TestCase):
+ def setUp(self) -> None:
+ """
+ Set up test fixtures before each test method.
+ """
+ self.config = ScribeDataConfig()
+ # Mock the language_metadata and data_type_metadata.
+ self.config.languages = ["english", "spanish", "french"]
+ self.config.data_types = ["nouns", "verbs"]
+
+ @patch("scribe_data.cli.interactive.prompt.prompt")
+ @patch("scribe_data.cli.interactive.prompt.rprint")
+ def test_cli_interactive_request_total_lexeme(
+ self, mock_rprint: MagicMock, mock_prompt: MagicMock
+ ) -> None:
+ """
+ Test prompt_for_languages and prompt_for_data_types functionality.
+ """
+ # Set up mock responses.
+ mock_prompt.side_effect = [
+ "english, french", # first call for languages
+ "nouns", # first call for data types
+ ]
+
+ with patch(
+ "scribe_data.cli.interactive.prompt.interactive_mode_config", self.config
+ ):
+ with patch(
+ "scribe_data.cli.interactive.config.list_all_languages",
+ return_value=["english", "french"],
+ ):
+ prompt_for_languages()
+ prompt_for_data_types()
+
+ # Verify the config was updated correctly.
+ self.assertEqual(self.config.selected_languages, ["english", "french"])
+ self.assertEqual(self.config.selected_data_types, ["nouns"])
+
+ # Verify prompt was called with correct arguments.
+ expected_calls = [
+ call(
+ "Select languages (comma-separated or 'All'): ",
+ completer=unittest.mock.ANY,
+ default="",
+ ),
+ call(
+ "Select data types (comma-separated or 'All'): ",
+ completer=unittest.mock.ANY,
+ default="",
+ ),
+ ]
+ mock_prompt.assert_has_calls(expected_calls, any_order=False)
+
+ def test_resolve_wiktionary_dump_path_from_subdirectory(self) -> None:
+ """
+ Find dumps when cwd is not the project root.
+ """
+ with patch("os.getcwd") as mock_getcwd:
+ with tempfile.TemporaryDirectory() as tmp:
+ root = Path(tmp)
+ dump_dir = root / "scribe_data_wiktionary_dumps_export"
+ json_dir = root / "scribe_data_json_export"
+ dump_dir.mkdir()
+ json_dir.mkdir()
+ dump_file = dump_dir / "dewiktionary-pages-articles.xml.bz2"
+ dump_file.write_bytes(b"x")
+
+ mock_getcwd.return_value = str(json_dir)
+ resolved = resolve_wiktionary_dump_path(
+ "german",
+ "scribe_data_wiktionary_dumps_export",
+ )
+
+ self.assertEqual(resolved, dump_file.resolve())
+
+ def test_cli_interactive_create_word_completer(self) -> None:
+ """
+ Test create_word_completer functionality.
+ """
+ # Test without 'All' option.
+ options = ["english", "spanish", "french"]
+ completer = create_word_completer(options, include_all=False)
+ self.assertIsInstance(completer, WordCompleter)
+ self.assertEqual(completer.words, options)
+
+ # Test with 'All' option.
+ completer_with_all = create_word_completer(options, include_all=True)
+ self.assertEqual(completer_with_all.words, ["All"] + options)
diff --git a/tests/cli/interactive/test_cli_interactive_run.py b/tests/cli/interactive/test_cli_interactive_run.py
new file mode 100644
index 000000000..77c00988f
--- /dev/null
+++ b/tests/cli/interactive/test_cli_interactive_run.py
@@ -0,0 +1,60 @@
+# SPDX-License-Identifier: GPL-3.0-or-later
+"""
+Tests for the CLI interactive mode runner functionality.
+"""
+
+import unittest
+from pathlib import Path
+from unittest.mock import patch
+
+from scribe_data.cli.interactive.config import ScribeDataConfig
+
+
+class TestScribeDataCLIInteractiveRun(unittest.TestCase):
+ def setUp(self) -> None:
+ """
+ Set up test fixtures before each test method.
+ """
+ self.config = ScribeDataConfig()
+ # Mock the language_metadata and data_type_metadata.
+ self.config.languages = ["english", "spanish", "french"]
+ self.config.data_types = ["nouns", "verbs"]
+
+ @patch(
+ "scribe_data.cli.interactive.run.resolve_wiktionary_dump_path",
+ return_value=Path("/dump/path"),
+ )
+ @patch("scribe_data.wiktionary.parse_translations.parse_wiktionary_translations")
+ @patch("scribe_data.cli.interactive.run.prompt")
+ @patch("scribe_data.cli.interactive.run.prompt_for_languages")
+ @patch("scribe_data.cli.interactive.run.questionary.select")
+ def test_cli_interactive_run_mode_translations(
+ self,
+ mock_select,
+ mock_prompt_languages,
+ mock_prompt,
+ mock_parse_wiktionary,
+ mock_resolve_dump,
+ ):
+ from scribe_data.cli.interactive.run import (
+ interactive_mode_config,
+ run_interactive_mode,
+ )
+
+ mock_select.return_value.ask.side_effect = ["translations"]
+ mock_prompt.side_effect = [
+ "german",
+ "/dump/path",
+ "scribe_data_wiktionary_json_export",
+ "false",
+ ]
+ interactive_mode_config.selected_languages = ["english"]
+
+ run_interactive_mode(operation="translations")
+
+ mock_parse_wiktionary.assert_called_once_with(
+ target_languages=["english"],
+ wiktionary_dump_path=Path("/dump/path"),
+ output_dir=Path("scribe_data_wiktionary_json_export"),
+ overwrite=False,
+ )
diff --git a/tests/cli/list/test_cli_list_data_types.py b/tests/cli/list/test_cli_list_data_types.py
new file mode 100644
index 000000000..c801968cb
--- /dev/null
+++ b/tests/cli/list/test_cli_list_data_types.py
@@ -0,0 +1,75 @@
+# SPDX-License-Identifier: GPL-3.0-or-later
+"""
+Tests for the CLI list data types functionality.
+"""
+
+import unittest
+from unittest.mock import MagicMock, call, patch
+
+from scribe_data.cli.list.data_types import list_data_types
+from scribe_data.cli.main import main
+
+
+class TestCLIListDataTypes(unittest.TestCase):
+ @patch("builtins.print")
+ def test_cli_list_data_types_all_languages(self, mock_print: MagicMock) -> None:
+ list_data_types()
+
+ expected_calls = [
+ call(),
+ call("Available data types: All languages"),
+ call("==================================="),
+ call("adjectives"),
+ call("adverbs"),
+ # call("articles"),
+ call("conjunctions"),
+ call("emoji-keywords"),
+ call("nouns"),
+ call("personal-pronouns"),
+ call("postpositions"),
+ call("prepositions"),
+ call("pronouns"),
+ call("proper-nouns"),
+ call("verbs"),
+ call(),
+ ]
+ mock_print.assert_has_calls(expected_calls)
+
+ @patch("builtins.print")
+ def test_cli_list_data_types_specific_language(
+ self, mock_print: MagicMock
+ ) -> None:
+ list_data_types("english")
+
+ expected_calls = [
+ call(),
+ call("Available data types: English"),
+ call("============================="),
+ call("adjectives"),
+ call("adverbs"),
+ call("emoji-keywords"),
+ call("nouns"),
+ call("personal-pronouns"),
+ call("prepositions"),
+ call("pronouns"),
+ call("proper-nouns"),
+ call("verbs"),
+ call(),
+ ]
+ mock_print.assert_has_calls(expected_calls)
+
+ def test_cli_list_data_types_invalid_language(self) -> None:
+ with self.assertRaises(ValueError):
+ list_data_types("InvalidLanguage")
+
+ def test_cli_list_data_types_no_data_types(self) -> None:
+ with self.assertRaises(ValueError):
+ list_data_types("Klingon")
+
+ @patch("scribe_data.cli.list.wrapper.list_data_types")
+ def test_cli_list_data_types_command(self, mock_list_data_types: MagicMock) -> None:
+ test_args = ["main.py", "list", "--data-type"]
+ with patch("sys.argv", test_args):
+ main()
+
+ mock_list_data_types.assert_called_once()
diff --git a/tests/cli/list/test_cli_list_languages.py b/tests/cli/list/test_cli_list_languages.py
new file mode 100644
index 000000000..fcf9b1d1f
--- /dev/null
+++ b/tests/cli/list/test_cli_list_languages.py
@@ -0,0 +1,93 @@
+# SPDX-License-Identifier: GPL-3.0-or-later
+"""
+Tests for the CLI list languages functionality.
+"""
+
+import unittest
+from unittest.mock import MagicMock, patch
+
+from scribe_data.cli.list.languages import list_languages, list_languages_for_data_type
+from scribe_data.cli.main import main
+from scribe_data.utils import (
+ get_language_iso,
+ get_language_qid,
+ list_all_languages,
+ list_languages_with_metadata_for_data_type,
+)
+
+
+class TestCLIListLanguages(unittest.TestCase):
+ @patch("builtins.print")
+ def test_cli_list_languages(self, mock_print: MagicMock) -> None:
+ list_languages()
+
+ # Verify the headers.
+ mock_print.assert_any_call("\nLanguage ISO QID ")
+ mock_print.assert_any_call("=================================")
+
+ # Dynamically get the first language from the metadata.
+ languages = list_all_languages()
+ first_language = languages[0]
+ first_iso = get_language_iso(first_language)
+ first_qid = get_language_qid(first_language)
+
+ # Verify the first language entry.
+ # Calculate column widths as in the actual function.
+ language_col_width = max(len(lang) for lang in languages) + 2
+ iso_col_width = max(len(get_language_iso(lang)) for lang in languages) + 2
+ qid_col_width = max(len(get_language_qid(lang)) for lang in languages) + 2
+
+ # Verify the first language entry with dynamic spacing.
+ mock_print.assert_any_call(
+ f"{first_language.capitalize():<{language_col_width}} {first_iso:<{iso_col_width}} {first_qid:<{qid_col_width}}"
+ )
+ # Total print calls: N (languages) + 3 (header, one separator, final line).
+ self.assertEqual(mock_print.call_count, len(languages) + 3)
+
+ @patch("builtins.print")
+ def test_cli_list_languages_for_data_type_valid(
+ self, mock_print: MagicMock
+ ) -> None:
+ # Call the function with a specific data type.
+ list_languages_for_data_type("nouns")
+
+ # Dynamically create the header based on column widths.
+ all_languages = list_languages_with_metadata_for_data_type()
+
+ # Calculate column widths as in the actual function.
+ language_col_width = max(len(lang["name"]) for lang in all_languages) + 2
+ iso_col_width = max(len(lang["iso"]) for lang in all_languages) + 2
+ qid_col_width = max(len(lang["qid"]) for lang in all_languages) + 2
+
+ # Generate the expected header (str.format avoids a backslash inside an f-string expression, a SyntaxError before Python 3.12).
+ expected_header = "{:<{}} {:<{}} {:<{}}".format("\nLanguage", language_col_width, "ISO", iso_col_width, "QID", qid_col_width)
+
+ # Verify the headers dynamically.
+ mock_print.assert_any_call(expected_header)
+ mock_print.assert_any_call(
+ "=" * (language_col_width + iso_col_width + qid_col_width)
+ )
+
+ # Get the first language entry from the metadata.
+
+ first_language = all_languages[0]["name"].capitalize()
+ first_iso = all_languages[0]["iso"]
+ first_qid = all_languages[0]["qid"]
+
+ # Verify the first language entry with dynamic spacing.
+ mock_print.assert_any_call(
+ f"{first_language:<{language_col_width}} {first_iso:<{iso_col_width}} {first_qid:<{qid_col_width}}"
+ )
+
+ # Check the total number of calls.
+ # Total calls = N (languages) + 3 (header, one separator, final line)
+ expected_calls = len(all_languages) + 3
+ self.assertEqual(mock_print.call_count, expected_calls)
+
+ @patch("scribe_data.cli.list.wrapper.list_languages")
+ def test_cli_list_languages_command(self, mock_list_languages: MagicMock) -> None:
+ test_args = ["main.py", "list", "--language"]
+ with patch("sys.argv", test_args):
+ main()
+
+ mock_list_languages.assert_called_once()
diff --git a/tests/cli/list/test_cli_list_wrapper.py b/tests/cli/list/test_cli_list_wrapper.py
new file mode 100644
index 000000000..65e2ca7d1
--- /dev/null
+++ b/tests/cli/list/test_cli_list_wrapper.py
@@ -0,0 +1,67 @@
+# SPDX-License-Identifier: GPL-3.0-or-later
+"""
+Tests for the CLI list wrapper functionality.
+"""
+
+import unittest
+from unittest.mock import MagicMock, patch
+
+from scribe_data.cli.list.wrapper import list_all, list_wrapper
+from scribe_data.cli.main import main
+
+
+class TestCLIListWrapper(unittest.TestCase):
+ @patch("scribe_data.cli.list.wrapper.list_languages")
+ @patch("scribe_data.cli.list.wrapper.list_data_types")
+ def test_cli_list_wrapper_list_all(
+ self, mock_list_data_types: MagicMock, mock_list_languages: MagicMock
+ ) -> None:
+ list_all()
+ mock_list_languages.assert_called_once()
+ mock_list_data_types.assert_called_once()
+
+ @patch("scribe_data.cli.list.wrapper.list_all")
+ def test_cli_list_wrapper_all(self, mock_list_all: MagicMock) -> None:
+ list_wrapper(all_bool=True)
+ mock_list_all.assert_called_once()
+
+ @patch("scribe_data.cli.list.wrapper.list_languages")
+ def test_cli_list_wrapper_languages(self, mock_list_languages: MagicMock) -> None:
+ list_wrapper(language=True)
+ mock_list_languages.assert_called_once()
+
+ @patch("scribe_data.cli.list.wrapper.list_data_types")
+ def test_cli_list_wrapper_data_types(self, mock_list_data_types: MagicMock) -> None:
+ list_wrapper(data_type=True)
+ mock_list_data_types.assert_called_once()
+
+ @patch("builtins.print")
+ def test_cli_list_wrapper_language_and_data_type(
+ self, mock_print: MagicMock
+ ) -> None:
+ list_wrapper(language=True, data_type=True)
+ mock_print.assert_called_with(
+ "Please specify either a language or a data type."
+ )
+
+ @patch("scribe_data.cli.list.wrapper.list_languages_for_data_type")
+ def test_cli_list_wrapper_languages_for_data_type(
+ self, mock_list_languages_for_data_type: MagicMock
+ ) -> None:
+ list_wrapper(language=True, data_type="example_data_type")
+ mock_list_languages_for_data_type.assert_called_with("example_data_type")
+
+ @patch("scribe_data.cli.list.wrapper.list_data_types")
+ def test_cli_list_wrapper_data_types_for_language(
+ self, mock_list_data_types: MagicMock
+ ) -> None:
+ list_wrapper(language="English", data_type=True)
+ mock_list_data_types.assert_called_with("English")
+
+ @patch("scribe_data.cli.list.wrapper.list_all")
+ def test_cli_list_wrapper_list_all_command(self, mock_list_all: MagicMock) -> None:
+ test_cli_list_wrapper_args = ["main.py", "list", "--all"]
+ with patch("sys.argv", test_cli_list_wrapper_args):
+ main()
+
+ mock_list_all.assert_called_once()
diff --git a/tests/cli/test_get.py b/tests/cli/test_cli_get.py
similarity index 94%
rename from tests/cli/test_get.py
rename to tests/cli/test_cli_get.py
index b1faa6732..4d89f7f5d 100644
--- a/tests/cli/test_get.py
+++ b/tests/cli/test_cli_get.py
@@ -32,7 +32,7 @@ class TestGetData(unittest.TestCase):
# MARK: Subprocess Patching
@patch("scribe_data.cli.get.generate_emoji")
- def test_get_emoji_keywords(self, generate_emoji: MagicMock) -> None:
+ def test_cli_get_emoji_keywords(self, generate_emoji: MagicMock) -> None:
"""
Test the generation of emoji keywords.
@@ -62,7 +62,7 @@ def test_invalid_arguments(self) -> None:
@patch("scribe_data.cli.get.query_data")
@patch("scribe_data.cli.get.parse_wd_lexeme_dump")
@patch("scribe_data.cli.get.questionary.confirm")
- def test_get_all_data_types_for_language_user_says_no(
+ def test_cli_get_all_data_types_for_language_user_says_no(
self,
mock_questionary_confirm: MagicMock,
mock_parse: MagicMock,
@@ -89,7 +89,7 @@ def test_get_all_data_types_for_language_user_says_no(
mock_query_data.assert_not_called()
@patch("scribe_data.cli.get.parse_wd_lexeme_dump")
- def test_get_all_languages_and_data_types(self, mock_parse: MagicMock) -> None:
+ def test_cli_get_all_languages_and_data_types(self, mock_parse: MagicMock) -> None:
"""
Test retrieving all languages for a specific data type.
@@ -109,7 +109,7 @@ def test_get_all_languages_and_data_types(self, mock_parse: MagicMock) -> None:
# MARK: Language and Data Type
@patch("scribe_data.cli.get.query_data")
- def test_get_specific_language_and_data_type(
+ def test_cli_get_specific_language_and_data_type(
self, mock_query_data: MagicMock
) -> None:
"""
@@ -133,7 +133,7 @@ def test_get_specific_language_and_data_type(
@patch("scribe_data.cli.get.query_data")
@patch("scribe_data.cli.get.Path.glob", return_value=[])
@patch("scribe_data.cli.get.check_index_exists")
- def test_get_data_with_capitalized_language(
+ def test_cli_get_data_with_capitalized_language(
self,
mock_check_index: MagicMock,
mock_glob: MagicMock,
@@ -159,7 +159,7 @@ def test_get_data_with_capitalized_language(
@patch("scribe_data.cli.get.query_data")
@patch("scribe_data.cli.get.Path.glob", return_value=[])
@patch("scribe_data.cli.get.check_index_exists", return_value=False)
- def test_get_data_with_lowercase_language(
+ def test_cli_get_data_with_lowercase_language(
self,
mock_check_index: MagicMock,
mock_glob: MagicMock,
@@ -182,7 +182,7 @@ def test_get_data_with_lowercase_language(
# MARK: Output Directory
@patch("scribe_data.cli.get.query_data")
- def test_get_data_with_different_output_directory(
+ def test_cli_get_data_with_different_output_directory(
self, mock_query_data: MagicMock
) -> None:
"""
@@ -207,7 +207,7 @@ def test_get_data_with_different_output_directory(
@patch("scribe_data.cli.get.query_data")
@patch("scribe_data.cli.get.Path.glob", return_value=[])
- def test_get_data_with_overwrite_true(
+ def test_cli_get_data_with_overwrite_true(
self, mock_glob: MagicMock, mock_query_data: MagicMock
) -> None:
"""
@@ -227,7 +227,9 @@ def test_get_data_with_overwrite_true(
# MARK: Overwrite is False
@patch("scribe_data.cli.get.query_data")
- def test_get_data_with_overwrite_false(self, mock_query_data: MagicMock) -> None:
+ def test_cli_get_data_with_overwrite_false(
+ self, mock_query_data: MagicMock
+ ) -> None:
get_data(
languages=["English"],
data_types=["verbs"],
@@ -313,7 +315,7 @@ def test_user_overwrites_existing_file(
# MARK: Translations
@patch("scribe_data.wiktionary.parse_translations.parse_wiktionary_translations")
- def test_get_translations_no_language_specified(self, mock_parse):
+ def test_cli_get_translations_no_language_specified(self, mock_parse):
get_data(data_types=["translations"])
mock_parse.assert_called_once_with(
target_languages=None,
@@ -323,7 +325,7 @@ def test_get_translations_no_language_specified(self, mock_parse):
)
@patch("scribe_data.wiktionary.parse_translations.parse_wiktionary_translations")
- def test_get_translations_with_specific_language(self, mock_parse):
+ def test_cli_get_translations_with_specific_language(self, mock_parse):
get_data(
languages=["Spanish"],
data_types=["translations"],
@@ -337,7 +339,7 @@ def test_get_translations_with_specific_language(self, mock_parse):
)
@patch("scribe_data.wiktionary.parse_translations.parse_wiktionary_translations")
- def test_get_translations_with_dump(self, mock_parse):
+ def test_cli_get_translations_with_dump(self, mock_parse):
get_data(
languages=["German"],
data_types=["translations"],
@@ -354,7 +356,7 @@ def test_get_translations_with_dump(self, mock_parse):
@patch("scribe_data.cli.get.parse_wd_lexeme_dump")
@patch("scribe_data.cli.get.questionary.confirm")
- def test_get_data_with_wikidata_identifier(
+ def test_cli_get_data_with_wikidata_identifier(
self, mock_questionary_confirm: MagicMock, mock_parse: MagicMock
) -> None:
"""
@@ -382,7 +384,7 @@ def test_get_data_with_wikidata_identifier(
)
@patch("scribe_data.cli.get.parse_wd_lexeme_dump")
- def test_get_data_with_wikidata_identifier_and_data_type(
+ def test_cli_get_data_with_wikidata_identifier_and_data_type(
self, mock_parse: MagicMock
) -> None:
"""
@@ -409,7 +411,7 @@ def test_get_data_with_wikidata_identifier_and_data_type(
# MARK: All Languages for Data Type
@patch("scribe_data.cli.get.parse_wd_lexeme_dump")
@patch("scribe_data.cli.get.questionary.confirm")
- def test_get_all_languages_for_data_type_user_says_no(
+ def test_cli_get_all_languages_for_data_type_user_says_no(
self, mock_questionary_confirm: MagicMock, mock_parse: MagicMock
) -> None:
"""
@@ -433,7 +435,7 @@ def test_get_all_languages_for_data_type_user_says_no(
@patch("scribe_data.cli.get.query_data")
@patch("scribe_data.cli.get.questionary.confirm")
- def test_get_all_languages_for_data_type_user_says_yes(
+ def test_cli_get_all_languages_for_data_type_user_says_yes(
self, mock_questionary_confirm: MagicMock, mock_query_data: MagicMock
) -> None:
"""
@@ -571,7 +573,7 @@ def test_default_output_directory_selection(
@patch("scribe_data.cli.get.query_data")
@patch("scribe_data.cli.get.check_index_exists")
- def test_get_data_with_interactive_mode(
+ def test_cli_get_data_with_interactive_mode(
self, mock_check_exists: MagicMock, mock_query_data: MagicMock
) -> None:
"""
@@ -589,7 +591,7 @@ def test_get_data_with_interactive_mode(
)
@patch("scribe_data.cli.get.parse_wd_lexeme_dump")
- def test_get_data_with_custom_dump_path(self, mock_parse: MagicMock) -> None:
+ def test_cli_get_data_with_custom_dump_path(self, mock_parse: MagicMock) -> None:
"""
Test retrieving data with a custom Wikidata dump path.
"""
@@ -607,7 +609,9 @@ def test_get_data_with_custom_dump_path(self, mock_parse: MagicMock) -> None:
)
@patch("scribe_data.cli.get.query_data")
- def test_get_data_with_multiple_languages(self, mock_query_data: MagicMock) -> None:
+ def test_cli_get_data_with_multiple_languages(
+ self, mock_query_data: MagicMock
+ ) -> None:
"""
Test retrieving data for multiple languages.
"""
@@ -641,7 +645,7 @@ def test_error_handling_value_error(self, mock_query_data: MagicMock) -> None:
@patch("scribe_data.cli.get.parse_wd_lexeme_dump")
@patch("scribe_data.cli.get.questionary.confirm")
- def test_get_data_with_all_and_specific_type(
+ def test_cli_get_data_with_all_and_specific_type(
self, mock_questionary: MagicMock, mock_parse: MagicMock
) -> None:
"""
@@ -661,7 +665,7 @@ def test_get_data_with_all_and_specific_type(
@patch("scribe_data.cli.get.query_data")
@patch("scribe_data.cli.get.check_index_exists")
- def test_get_data_case_insensitive_type(
+ def test_cli_get_data_case_insensitive_type(
self, mock_check_exists: MagicMock, mock_query_data: MagicMock
) -> None:
"""
diff --git a/tests/cli/test_upgrade.py b/tests/cli/test_cli_upgrade.py
similarity index 94%
rename from tests/cli/test_upgrade.py
rename to tests/cli/test_cli_upgrade.py
index 836608e94..8a3227bf9 100644
--- a/tests/cli/test_upgrade.py
+++ b/tests/cli/test_cli_upgrade.py
@@ -19,7 +19,7 @@ class TestUpgradeCLI:
@patch("scribe_data.cli.upgrade.get_local_version")
@patch("scribe_data.cli.upgrade.get_latest_version")
@patch("builtins.print")
- def test_upgrade_cli_unable_to_fetch_latest_version(
+ def test_cli_upgrade_unable_to_fetch_latest_version(
self,
mock_print: MagicMock,
mock_get_latest: MagicMock,
@@ -40,7 +40,7 @@ def test_upgrade_cli_unable_to_fetch_latest_version(
@patch("scribe_data.cli.upgrade.get_local_version")
@patch("scribe_data.cli.upgrade.get_latest_version")
@patch("builtins.print")
- def test_upgrade_cli_already_latest_version(
+ def test_cli_upgrade_already_latest_version(
self,
mock_print: MagicMock,
mock_get_latest: MagicMock,
@@ -61,7 +61,7 @@ def test_upgrade_cli_already_latest_version(
@patch("scribe_data.cli.upgrade.get_local_version")
@patch("scribe_data.cli.upgrade.get_latest_version")
@patch("builtins.print")
- def test_upgrade_cli_suggest_latest_version(
+ def test_cli_upgrade_suggest_latest_version(
self,
mock_print: MagicMock,
mock_get_latest: MagicMock,
@@ -86,7 +86,7 @@ def test_upgrade_cli_suggest_latest_version(
@patch("scribe_data.cli.upgrade.get_local_version")
@patch("scribe_data.cli.upgrade.get_latest_version")
@patch("builtins.print")
- def test_upgrade_cli_successful_upgrade(
+ def test_cli_upgrade_successful_upgrade(
self,
mock_print: MagicMock,
mock_get_latest: MagicMock,
@@ -117,7 +117,7 @@ def test_upgrade_cli_successful_upgrade(
@patch("scribe_data.cli.upgrade.get_local_version")
@patch("scribe_data.cli.upgrade.get_latest_version")
@patch("builtins.print")
- def test_upgrade_cli_subprocess_error(
+ def test_cli_upgrade_subprocess_error(
self,
mock_print: MagicMock,
mock_get_latest: MagicMock,
@@ -149,7 +149,7 @@ def test_upgrade_cli_subprocess_error(
@patch("scribe_data.cli.upgrade.get_local_version")
@patch("scribe_data.cli.upgrade.get_latest_version")
@patch("builtins.print")
- def test_upgrade_cli_version_parsing_edge_cases(
+ def test_cli_upgrade_version_parsing_edge_cases(
self,
mock_print: MagicMock,
mock_get_latest: MagicMock,
@@ -170,7 +170,7 @@ def test_upgrade_cli_version_parsing_edge_cases(
@patch("scribe_data.cli.upgrade.get_local_version")
@patch("scribe_data.cli.upgrade.get_latest_version")
@patch("builtins.print")
- def test_upgrade_cli_string_comparison_edge_case(
+ def test_cli_upgrade_string_comparison_edge_case(
self,
mock_print: MagicMock,
mock_get_latest: MagicMock,
@@ -197,7 +197,7 @@ def test_upgrade_cli_string_comparison_edge_case(
@patch("scribe_data.cli.upgrade.get_local_version")
@patch("scribe_data.cli.upgrade.get_latest_version")
@patch("builtins.print")
- def test_upgrade_cli_proper_higher_version_scenario(
+ def test_cli_upgrade_proper_higher_version_scenario(
self,
mock_print: MagicMock,
mock_get_latest: MagicMock,
@@ -221,7 +221,7 @@ def test_upgrade_cli_proper_higher_version_scenario(
@patch("scribe_data.cli.upgrade.get_local_version")
@patch("scribe_data.cli.upgrade.get_latest_version")
@patch("builtins.print")
- def test_upgrade_cli_different_version_formats(
+ def test_cli_upgrade_different_version_formats(
self,
mock_print: MagicMock,
mock_get_latest: MagicMock,
@@ -242,7 +242,7 @@ def test_upgrade_cli_different_version_formats(
@patch("scribe_data.cli.upgrade.get_local_version")
@patch("scribe_data.cli.upgrade.get_latest_version")
@patch("builtins.print")
- def test_upgrade_cli_semantic_version_upgrade_needed(
+ def test_cli_upgrade_semantic_version_upgrade_needed(
self,
mock_print: MagicMock,
mock_get_latest: MagicMock,
@@ -269,7 +269,7 @@ def test_upgrade_cli_semantic_version_upgrade_needed(
@patch("scribe_data.cli.upgrade.get_local_version")
@patch("scribe_data.cli.upgrade.get_latest_version")
@patch("builtins.print")
- def test_upgrade_cli_with_empty_version_strings(
+ def test_cli_upgrade_with_empty_version_strings(
self,
mock_print: MagicMock,
mock_get_latest: MagicMock,
@@ -299,7 +299,7 @@ def test_upgrade_cli_with_empty_version_strings(
@patch("scribe_data.cli.upgrade.get_local_version")
@patch("scribe_data.cli.upgrade.get_latest_version")
@patch("builtins.print")
- def test_upgrade_cli_invalid_local_version(
+ def test_cli_upgrade_invalid_local_version(
self,
mock_print: MagicMock,
mock_get_latest: MagicMock,
@@ -326,7 +326,7 @@ def test_upgrade_cli_invalid_local_version(
@patch("scribe_data.cli.upgrade.get_local_version")
@patch("scribe_data.cli.upgrade.get_latest_version")
@patch("builtins.print")
- def test_upgrade_cli_invalid_latest_version(
+ def test_cli_upgrade_invalid_latest_version(
self,
mock_print: MagicMock,
mock_get_latest: MagicMock,
diff --git a/tests/cli/test_utils.py b/tests/cli/test_cli_utils.py
similarity index 80%
rename from tests/cli/test_utils.py
rename to tests/cli/test_cli_utils.py
index 92cb8eee1..582562ba1 100644
--- a/tests/cli/test_utils.py
+++ b/tests/cli/test_cli_utils.py
@@ -16,29 +16,31 @@
class TestCLIUtils(unittest.TestCase):
- def test_correct_data_type(self) -> None:
+ def test_utils_correct_data_type(self) -> None:
self.assertEqual(correct_data_type("emoji_keyword"), "emoji_keywords")
self.assertEqual(correct_data_type("preposition"), "prepositions")
self.assertEqual(correct_data_type("invalid"), None)
- def test_correct_data_type_with_trailing_s(self) -> None:
+ def test_utils_correct_data_type_with_trailing_s(self) -> None:
self.assertEqual(correct_data_type("emoji_keywords"), "emoji_keywords")
self.assertEqual(correct_data_type("prepositions"), "prepositions")
- def test_correct_data_type_invalid_input(self) -> None:
+ def test_utils_correct_data_type_invalid_input(self) -> None:
self.assertIsNone(correct_data_type("invalid_data_type"))
self.assertIsNone(correct_data_type(""))
self.assertIsNone(correct_data_type(None))
@patch("builtins.print")
- def test_print_formatted_data_emoji_keywords(self, mock_print: MagicMock) -> None:
+ def test_utils_print_formatted_data_emoji_keywords(
+ self, mock_print: MagicMock
+ ) -> None:
data = {"key1": [{"emoji": "😀"}, {"emoji": "😁"}], "key2": [{"emoji": "😂"}]}
print_formatted_data(data, "emoji_keywords")
mock_print.assert_any_call("key1 : 😀 😁")
mock_print.assert_any_call("key2 : 😂")
@patch("builtins.print")
- def test_print_formatted_data_dict(self, mock_print: MagicMock) -> None:
+ def test_utils_print_formatted_data_dict(self, mock_print: MagicMock) -> None:
data = {
"key1": {"subkey1": "value1", "subkey2": "value2"},
"key2": ["item1", "item2"],
@@ -52,14 +54,14 @@ def test_print_formatted_data_dict(self, mock_print: MagicMock) -> None:
mock_print.assert_any_call(" item2")
@patch("builtins.print")
- def test_print_formatted_data_empty_data(self, mock_print: MagicMock) -> None:
+ def test_utils_print_formatted_data_empty_data(self, mock_print: MagicMock) -> None:
print_formatted_data({}, "emoji_keywords")
mock_print.assert_called_once_with(
"No data available for data type 'emoji_keywords'."
)
@patch("builtins.print")
- def test_print_formatted_data_invalid_data_type(
+ def test_utils_print_formatted_data_invalid_data_type(
self, mock_print: MagicMock
) -> None:
data = {"key1": "value1", "key2": "value2"}
@@ -68,7 +70,7 @@ def test_print_formatted_data_invalid_data_type(
mock_print.assert_any_call("key2 : value2")
@patch("builtins.print")
- def test_print_formatted_data_list(self, mock_print: MagicMock) -> None:
+ def test_utils_print_formatted_data_list(self, mock_print: MagicMock) -> None:
data = ["item1", "item2", "item3"]
print_formatted_data(data, "list_data")
mock_print.assert_any_call("item1")
@@ -76,20 +78,22 @@ def test_print_formatted_data_list(self, mock_print: MagicMock) -> None:
mock_print.assert_any_call("item3")
@patch("builtins.print")
- def test_print_formatted_data_list_of_dicts(self, mock_print: MagicMock) -> None:
+ def test_utils_print_formatted_data_list_of_dicts(
+ self, mock_print: MagicMock
+ ) -> None:
data = [{"key1": "value1"}, {"key2": "value2"}]
print_formatted_data(data, "list_of_dicts")
mock_print.assert_any_call("key1 : value1")
mock_print.assert_any_call("key2 : value2")
- def test_print_formatted_data_prepositions(self) -> None:
+ def test_utils_print_formatted_data_prepositions(self) -> None:
data = {"key1": "value1", "key2": "value2"}
with patch("builtins.print") as mock_print:
print_formatted_data(data, "prepositions")
mock_print.assert_any_call("key1 : value1")
mock_print.assert_any_call("key2 : value2")
- def test_print_formatted_data_nested_dict(self) -> None:
+ def test_utils_print_formatted_data_nested_dict(self) -> None:
data = {"key1": {"subkey1": "subvalue1", "subkey2": "subvalue2"}}
with patch("builtins.print") as mock_print:
print_formatted_data(data, "nested_dict")
@@ -97,14 +101,14 @@ def test_print_formatted_data_nested_dict(self) -> None:
mock_print.assert_any_call(" subkey1 : subvalue1")
mock_print.assert_any_call(" subkey2 : subvalue2")
- def test_print_formatted_data_list_of_dicts_with_different_keys(self) -> None:
+ def test_utils_print_formatted_data_list_of_dicts_with_different_keys(self) -> None:
data = [{"key1": "value1"}, {"key2": "value2"}]
with patch("builtins.print") as mock_print:
print_formatted_data(data, "list_of_dicts_different_keys")
mock_print.assert_any_call("key1 : value1")
mock_print.assert_any_call("key2 : value2")
- def test_print_formatted_data_unknown_type(self) -> None:
+ def test_utils_print_formatted_data_unknown_type(self) -> None:
data = "unknown data type"
with patch("builtins.print") as mock_print:
print_formatted_data(data, "unknown")
@@ -128,8 +132,8 @@ def mock_get_qid(self, input_value: str) -> str | None:
"""
return self.qid_mapping.get(input_value.lower())
- @patch("scribe_data.cli.total.get_qid_by_input")
- def test_validate_language_and_data_type_valid(
+ @patch("scribe_data.cli.total.query.get_qid_by_input")
+ def test_utils_validate_language_and_data_type_valid(
self, mock_get_qid: MagicMock
) -> None:
mock_get_qid.side_effect = self.mock_get_qid
@@ -143,8 +147,8 @@ def test_validate_language_and_data_type_valid(
except ValueError:
self.fail("validate_language_and_data_type raised ValueError unexpectedly!")
- @patch("scribe_data.cli.total.get_qid_by_input")
- def test_validate_language_and_data_type_invalid_language(
+ @patch("scribe_data.cli.total.query.get_qid_by_input")
+ def test_utils_validate_language_and_data_type_invalid_language(
self, mock_get_qid: MagicMock
) -> None:
mock_get_qid.side_effect = self.mock_get_qid
@@ -159,8 +163,8 @@ def test_validate_language_and_data_type_invalid_language(
self.assertEqual(str(context.exception), "Invalid language 'InvalidLanguage'.")
- @patch("scribe_data.cli.total.get_qid_by_input")
- def test_validate_language_and_data_type_invalid_data_type(
+ @patch("scribe_data.cli.total.query.get_qid_by_input")
+ def test_utils_validate_language_and_data_type_invalid_data_type(
self, mock_get_qid: MagicMock
) -> None:
mock_get_qid.side_effect = self.mock_get_qid
@@ -175,8 +179,8 @@ def test_validate_language_and_data_type_invalid_data_type(
self.assertEqual(str(context.exception), "Invalid data-type 'InvalidDataType'.")
- @patch("scribe_data.cli.total.get_qid_by_input")
- def test_validate_language_and_data_type_both_invalid(
+ @patch("scribe_data.cli.total.query.get_qid_by_input")
+ def test_utils_validate_language_and_data_type_both_invalid(
self, mock_get_qid: MagicMock
) -> None:
mock_get_qid.side_effect = lambda x: None # Simulate invalid inputs
@@ -194,7 +198,7 @@ def test_validate_language_and_data_type_both_invalid(
"Invalid language 'InvalidLanguage'.\nInvalid data-type 'InvalidDataType'.",
)
- def test_validate_language_and_data_type_with_list(self) -> None:
+ def test_utils_validate_language_and_data_type_with_list(self) -> None:
"""
Test validation with lists of languages and data types.
"""
@@ -207,7 +211,7 @@ def test_validate_language_and_data_type_with_list(self) -> None:
"validate_language_and_data_type raised ValueError unexpectedly with valid lists!"
)
- def test_validate_language_and_data_type_with_qids(self) -> None:
+ def test_utils_validate_language_and_data_type_with_qids(self) -> None:
"""
Test validation directly with QIDs.
"""
@@ -220,7 +224,9 @@ def test_validate_language_and_data_type_with_qids(self) -> None:
"validate_language_and_data_type raised ValueError unexpectedly with valid QIDs!"
)
- def test_validate_language_and_data_type_mixed_validity_in_lists(self) -> None:
+ def test_utils_validate_language_and_data_type_mixed_validity_in_lists(
+ self,
+ ) -> None:
"""
Test validation with mixed valid and invalid entries in lists.
"""
diff --git a/tests/cli/test_version.py b/tests/cli/test_cli_version.py
similarity index 86%
rename from tests/cli/test_version.py
rename to tests/cli/test_cli_version.py
index 5f7083f4d..8e3351fe3 100644
--- a/tests/cli/test_version.py
+++ b/tests/cli/test_cli_version.py
@@ -18,7 +18,7 @@
class TestVersionFunctions(unittest.TestCase):
@patch("scribe_data.cli.version.importlib.metadata.version")
- def test_get_local_version_installed(self, mock_version: MagicMock) -> None:
+ def test_cli_version_get_local_installed(self, mock_version: MagicMock) -> None:
mock_version.return_value = "1.0.0"
self.assertEqual(get_local_version(), "1.0.0")
@@ -26,24 +26,24 @@ def test_get_local_version_installed(self, mock_version: MagicMock) -> None:
"scribe_data.cli.version.importlib.metadata.version",
side_effect=importlib.metadata.PackageNotFoundError,
)
- def test_get_local_version_not_installed(self, mock_version: MagicMock) -> None:
+ def test_cli_version_get_local_not_installed(self, mock_version: MagicMock) -> None:
self.assertEqual(get_local_version(), UNKNOWN_VERSION_NOT_PIP)
@patch("requests.get")
- def test_get_latest_version(self, mock_get: MagicMock) -> None:
+ def test_cli_version_get_latest_version(self, mock_get: MagicMock) -> None:
mock_get.return_value.status_code = 200
mock_get.return_value.json.return_value = {"name": "v1.0.1"}
self.assertEqual(get_latest_version(), "v1.0.1")
@patch("requests.get", side_effect=Exception("Unable to fetch version"))
- def test_get_latest_version_failure(self, mock_get: MagicMock) -> None:
+ def test_cli_version_get_latest_failure(self, mock_get: MagicMock) -> None:
self.assertEqual(get_latest_version(), UNKNOWN_VERSION_NOT_FETCHED)
@patch("scribe_data.cli.version.get_local_version", return_value="X.Y.Z")
@patch(
"scribe_data.cli.version.get_latest_version", return_value="Scribe-Data X.Y.Z"
)
- def test_get_version_message_up_to_date(
+ def test_cli_version_get_message_up_to_date(
self, mock_latest_version: MagicMock, mock_local_version: MagicMock
) -> None:
"""
@@ -56,7 +56,7 @@ def test_get_version_message_up_to_date(
@patch(
"scribe_data.cli.version.get_latest_version", return_value="Scribe-Data X.Y.Z"
)
- def test_upgrade_available(
+ def test_cli_version_upgrade_available(
self, mock_latest_version: MagicMock, mock_local_version: MagicMock
) -> None:
"""
@@ -72,7 +72,7 @@ def test_upgrade_available(
@patch(
"scribe_data.cli.version.get_latest_version", return_value="Scribe-Data X.Y.Z"
)
- def test_local_version_unknown(
+ def test_cli_version_local_unknown(
self, mock_latest_version: MagicMock, mock_local_version: MagicMock
) -> None:
"""
@@ -85,7 +85,7 @@ def test_local_version_unknown(
"scribe_data.cli.version.get_latest_version",
return_value=UNKNOWN_VERSION_NOT_FETCHED,
)
- def test_latest_version_unknown(
+ def test_cli_version_latest_unknown(
self, mock_latest_version: MagicMock, mock_local_version: MagicMock
) -> None:
"""
diff --git a/tests/cli/test_convert.py b/tests/cli/test_convert.py
deleted file mode 100644
index 6df2b2702..000000000
--- a/tests/cli/test_convert.py
+++ /dev/null
@@ -1,500 +0,0 @@
-# SPDX-License-Identifier: GPL-3.0-or-later
-"""
-Tests for the CLI convert functionality.
-"""
-
-import json
-import unittest
-from io import StringIO
-from pathlib import Path
-from unittest.mock import MagicMock, patch
-
-import pytest
-
-from scribe_data.cli.convert import (
- convert_to_csv_or_tsv,
- convert_to_json,
- convert_wrapper,
-)
-
-
-class TestConvert(unittest.TestCase):
- # MARK: Helper Functions
-
- def normalize_line_endings(self, data: str) -> str:
- """
- Normalize line endings in a given string.
-
-
- Parameters
- ----------
- data: str
- The input string whose line endings are to be normalized.
-
- Returns
- ---------
- data: str
- The input string with normalized line endings.
- """
- return data.replace("\r\n", "\n").replace("\r", "\n")
-
- @pytest.fixture(autouse=True)
- def _setup_fixtures(self, tmp_path):
- self.tmp_path = tmp_path
-
- # MARK: JSON
-
- @patch("scribe_data.cli.convert.Path", autospec=True)
- def test_convert_to_json_empty_language(self, mock_path: MagicMock) -> None:
- csv_data = "key,value\na,1\nb,2"
- mock_file = StringIO(csv_data)
-
- mock_path_obj = MagicMock(spec=Path)
- mock_path.return_value = mock_path_obj
- mock_path_obj.suffix = ".csv"
- mock_path_obj.exists.return_value = True
- mock_path_obj.open.return_value.__enter__.return_value = mock_file
-
- with self.assertRaises(ValueError) as context:
- convert_to_json(
- language="",
- data_types="nouns",
- input_file=Path("input.csv"),
- output_dir=Path("/output_dir"),
- output_type="json",
- overwrite=True,
- )
- self.assertIn("Language '' is not recognized.", str(context.exception))
-
- @patch("scribe_data.cli.convert.Path", autospec=True)
- def test_convert_to_json_supported_file_extension_csv(
- self, mock_path_class: MagicMock
- ) -> None:
- mock_path_instance = MagicMock(spec=Path)
-
- mock_path_class.return_value = mock_path_instance
-
- mock_path_instance.suffix = ".csv"
- mock_path_instance.exists.return_value = True
-
- convert_to_json(
- language="English",
- data_types="nouns",
- input_file=Path("test.csv"),
- output_dir=Path("/output_dir"),
- output_type="json",
- overwrite=True,
- )
-
- @patch("scribe_data.cli.convert.Path", autospec=True)
- def test_convert_to_json_supported_file_extension_tsv(
- self, mock_path_class: MagicMock
- ) -> None:
- mock_path_instance = MagicMock(spec=Path)
-
- mock_path_class.return_value = mock_path_instance
-
- mock_path_instance.suffix = ".tsv"
- mock_path_instance.exists.return_value = True
-
- convert_to_json(
- language="English",
- data_types="nouns",
- input_file=Path("test.tsv"),
- output_dir=Path("/output_dir"),
- output_type="json",
- overwrite=True,
- )
-
- def test_convert_to_json_unsupported_file_extension(self) -> None:
- input_file = self.tmp_path / "test.txt"
- input_file.write_text("Hello, world!", encoding="utf-8")
- output_dir = self.tmp_path / "output"
- output_dir.mkdir(parents=True, exist_ok=True)
-
- with self.assertRaises(ValueError) as context:
- convert_to_json(
- language="English",
- data_types="nouns",
- input_file=input_file,
- output_dir=output_dir,
- output_type="json",
- overwrite=True,
- )
-
- self.assertIn("Unsupported file extension", str(context.exception))
- self.assertEqual(
- str(context.exception),
- f"Unsupported file extension '.txt' for {input_file}. Please provide a '.csv' or '.tsv' file.",
- )
-
- # MARK: JSON
-
- def test_convert_to_json_standard_csv(self) -> None:
- csv_data = "key,value\na,1\nb,2"
- expected_json_output = {"a": "1", "b": "2"}
-
- input_file = self.tmp_path / "test.csv"
- input_file.write_text(csv_data, encoding="utf-8")
- output_dir = self.tmp_path / "output"
- output_dir.mkdir(parents=True, exist_ok=True)
-
- convert_to_json(
- language="English",
- data_types="nouns",
- input_file=input_file,
- output_dir=output_dir,
- output_type="json",
- overwrite=True,
- )
-
- output_file = output_dir / "English" / "nouns.json"
- with open(output_file, "r", encoding="utf-8") as f:
- actual_content = json.load(f)
-
- assert actual_content == expected_json_output
-
- def test_convert_to_json_with_multiple_keys(self) -> None:
- csv_data = "key,value1,value2\na,1,x\nb,2,y\nc,3,z"
- expected_json_output = {
- "a": {"value1": "1", "value2": "x"},
- "b": {"value1": "2", "value2": "y"},
- "c": {"value1": "3", "value2": "z"},
- }
-
- input_file = self.tmp_path / "test.csv"
- input_file.write_text(csv_data, encoding="utf-8")
- output_dir = self.tmp_path / "output"
- output_dir.mkdir(parents=True, exist_ok=True)
-
- convert_to_json(
- language="English",
- data_types="nouns",
- input_file=input_file,
- output_dir=output_dir,
- output_type="json",
- overwrite=True,
- )
-
- output_file = output_dir / "English" / "nouns.json"
- with open(output_file, "r", encoding="utf-8") as f:
- actual_content = json.load(f)
-
- assert actual_content == expected_json_output
-
- def test_convert_to_json_with_complex_structure(self) -> None:
- csv_data = "key,emoji,is_base,rank\na,😀,true,1\nb,😅,false,2"
- expected_json_output = {
- "a": [{"emoji": "😀", "is_base": True, "rank": 1}],
- "b": [{"emoji": "😅", "is_base": False, "rank": 2}],
- }
-
- input_file = self.tmp_path / "test.csv"
- input_file.write_text(csv_data, encoding="utf-8")
- output_dir = self.tmp_path / "output"
- output_dir.mkdir(parents=True, exist_ok=True)
-
- convert_to_json(
- language="English",
- data_types="nouns",
- input_file=input_file,
- output_dir=output_dir,
- output_type="json",
- overwrite=True,
- )
-
- output_file = output_dir / "English" / "nouns.json"
- with open(output_file, "r", encoding="utf-8") as f:
- actual_content = json.load(f)
-
- assert actual_content == expected_json_output
-
- # MARK: CSV or TSV
-
- def test_convert_to_csv_or_json_empty_language(self) -> None:
- json_data = '{"key1": "value1", "key2": "value2"}'
-
- input_file = self.tmp_path / "test.json"
- input_file.write_text(json_data, encoding="utf-8")
- output_dir = self.tmp_path / "output"
- output_dir.mkdir(parents=True, exist_ok=True)
-
- with self.assertRaises(ValueError) as context:
- convert_to_csv_or_tsv(
- language="",
- data_types="nouns",
- input_file=input_file,
- output_dir=output_dir,
- output_type="csv",
- overwrite=True,
- )
-
- self.assertEqual(str(context.exception), "Language '' is not recognized.")
-
- def test_convert_to_csv_or_tsv_standard_dict_to_csv(self) -> None:
- json_data = '{"a": "1", "b": "2"}'
- expected_csv_output = "preposition,value\na,1\nb,2\n"
-
- input_file = self.tmp_path / "test.json"
- input_file.write_text(json_data, encoding="utf-8")
- output_dir = self.tmp_path / "output"
- output_dir.mkdir(parents=True, exist_ok=True)
-
- convert_to_csv_or_tsv(
- language="English",
- data_types="prepositions",
- input_file=input_file,
- output_dir=output_dir,
- output_type="csv",
- overwrite=True,
- )
-
- output_file = output_dir / "English" / "prepositions.csv"
- actual_content = output_file.read_text(encoding="utf-8")
- assert actual_content == expected_csv_output
-
- def test_convert_to_csv_or_tsv_standard_dict_to_tsv(self) -> None:
- json_data = '{"a": "1", "b": "2"}'
- expected_tsv_output = "preposition\tvalue\na\t1\nb\t2\n"
-
- input_file = self.tmp_path / "test.json"
- input_file.write_text(json_data, encoding="utf-8")
- output_dir = self.tmp_path / "output"
- output_dir.mkdir(parents=True, exist_ok=True)
-
- convert_to_csv_or_tsv(
- language="English",
- data_types="prepositions",
- input_file=input_file,
- output_dir=output_dir,
- output_type="tsv",
- overwrite=True,
- )
-
- output_file = output_dir / "English" / "prepositions.tsv"
- actual_content = output_file.read_text(encoding="utf-8")
- assert actual_content == expected_tsv_output
-
- def test_convert_to_csv_or_tsv_nested_dict_to_csv(self) -> None:
- json_data = (
- '{"a": {"value1": "1", "value2": "x"}, "b": {"value1": "2", "value2": "y"}}'
- )
- expected_csv_output = "noun,value1,value2\na,1,x\nb,2,y\n"
-
- input_file = self.tmp_path / "test.json"
- input_file.write_text(json_data, encoding="utf-8")
- output_dir = self.tmp_path / "output"
- output_dir.mkdir(parents=True, exist_ok=True)
-
- convert_to_csv_or_tsv(
- language="English",
- data_types="nouns",
- input_file=input_file,
- output_dir=output_dir,
- output_type="csv",
- overwrite=True,
- )
-
- output_file = output_dir / "English" / "nouns.csv"
- actual_content = output_file.read_text(encoding="utf-8")
- assert actual_content == expected_csv_output
-
- def test_convert_to_csv_or_tsv_nested_dict_to_tsv(self) -> None:
- json_data = (
- '{"a": {"value1": "1", "value2": "x"}, "b": {"value1": "2", "value2": "y"}}'
- )
- expected_tsv_output = "noun\tvalue1\tvalue2\na\t1\tx\nb\t2\ty\n"
-
- input_file = self.tmp_path / "test.json"
- input_file.write_text(json_data, encoding="utf-8")
- output_dir = self.tmp_path / "output"
- output_dir.mkdir(parents=True, exist_ok=True)
-
- convert_to_csv_or_tsv(
- language="English",
- data_types="nouns",
- input_file=input_file,
- output_dir=output_dir,
- output_type="tsv",
- overwrite=True,
- )
-
- output_file = output_dir / "English" / "nouns.tsv"
- actual_content = output_file.read_text(encoding="utf-8")
- assert actual_content == expected_tsv_output
-
- def test_convert_to_csv_or_tsv_list_of_dicts_to_csv(self) -> None:
- json_data = '{"a": [{"emoji": "😀", "is_base": true, "rank": 1}, {"emoji": "😅", "is_base": false, "rank": 2}]}'
- expected_csv_output = "word,emoji,is_base,rank\na,😀,True,1\na,😅,False,2\n"
-
- input_file = self.tmp_path / "test.json"
- input_file.write_text(json_data, encoding="utf-8")
- output_dir = self.tmp_path / "output"
- output_dir.mkdir(parents=True, exist_ok=True)
-
- convert_to_csv_or_tsv(
- language="English",
- data_types="emoji-keywords",
- input_file=input_file,
- output_dir=output_dir,
- output_type="csv",
- overwrite=True,
- )
-
- output_file = output_dir / "English" / "emoji-keywords.csv"
- actual_content = output_file.read_text(encoding="utf-8")
- assert actual_content == expected_csv_output
-
- def test_convert_to_csv_or_tsv_list_of_dicts_to_tsv(self) -> None:
- json_data = '{"a": [{"emoji": "😀", "is_base": true, "rank": 1}, {"emoji": "😅", "is_base": false, "rank": 2}]}'
- expected_tsv_output = (
- "word\temoji\tis_base\trank\na\t😀\tTrue\t1\na\t😅\tFalse\t2\n"
- )
-
- input_file = self.tmp_path / "test.json"
- input_file.write_text(json_data, encoding="utf-8")
- output_dir = self.tmp_path / "output"
- output_dir.mkdir(parents=True, exist_ok=True)
-
- convert_to_csv_or_tsv(
- language="English",
- data_types="emoji-keywords",
- input_file=input_file,
- output_dir=output_dir,
- output_type="tsv",
- overwrite=True,
- )
-
- output_file = output_dir / "English" / "emoji-keywords.tsv"
- actual_content = output_file.read_text(encoding="utf-8")
- assert actual_content == expected_tsv_output
-
- # MARK: SQLITE
-
- @patch("scribe_data.cli.convert.Path", autospec=True)
- @patch("scribe_data.cli.convert.data_to_sqlite", autospec=True)
- @patch("shutil.copy")
- def test_convert_to_sqlite(
- self,
- mock_shutil_copy: MagicMock,
- mock_data_to_sqlite: MagicMock,
- mock_path: MagicMock,
- ) -> None:
- mock_path.return_value.exists.return_value = True
-
- convert_wrapper(
- languages=["english"],
- data_types=["nouns"],
- input_path=Path("file"),
- output_dir=Path("/output"),
- output_type="sqlite",
- overwrite=True,
- identifier_case="camel",
- )
-
- mock_data_to_sqlite.assert_called_with(
- languages=["english"],
- specific_tables=["nouns"],
- identifier_case="camel",
- input_file=Path("file"),
- output_file=Path("/output"),
- overwrite=True,
- )
-
- @patch("scribe_data.cli.convert.Path", autospec=True)
- @patch("scribe_data.cli.convert.data_to_sqlite", autospec=True)
- def test_convert_to_sqlite_no_output_dir(
- self, mock_data_to_sqlite: MagicMock, mock_path: MagicMock
- ) -> None:
- mock_input_file = MagicMock()
- mock_input_file.exists.return_value = True
-
- mock_path.return_value = mock_input_file
-
- mock_input_file.parent = MagicMock()
- mock_input_file.parent.__truediv__.return_value = MagicMock()
- mock_input_file.parent.__truediv__.return_value.exists.return_value = False
-
- convert_wrapper(
- languages=["english"],
- data_types=["nouns"],
- input_path=Path(mock_input_file),
- output_dir=None,
- output_type="sqlite",
- overwrite=True,
- identifier_case="camel",
- )
-
- mock_data_to_sqlite.assert_called_with(
- languages=["english"],
- specific_tables=["nouns"],
- identifier_case="camel",
- input_file=Path(mock_input_file),
- output_file=Path("scribe_data_sqlite_export"),
- overwrite=True,
- )
-
- @patch("scribe_data.cli.convert.data_to_sqlite", autospec=True)
- def test_convert_wrapper_german_wiktionary_translations_sqlite(
- self, mock_data_to_sqlite: MagicMock
- ) -> None:
- convert_wrapper(
- languages=["german"],
- data_types=["wiktionary_translations"],
- input_path=Path("/input"),
- output_dir=Path("/output"),
- output_type="sqlite",
- overwrite=False,
- identifier_case="camel",
- )
-
- mock_data_to_sqlite.assert_called_once_with(
- languages=["german"],
- specific_tables=["wiktionary_translations"],
- identifier_case="camel",
- input_file=Path("/input"),
- output_file=Path("/output"),
- overwrite=False,
- )
-
- @patch(
- "scribe_data.cli.convert.DEFAULT_WIKTIONARY_JSON_EXPORT_DIR",
- new=Path("/mock_wiktionary_dir"),
- )
- @patch("scribe_data.cli.convert.data_to_sqlite", autospec=True)
- def test_convert_wrapper_wiktionary_no_input_path_uses_wiktionary_default(
- self, mock_data_to_sqlite: MagicMock
- ) -> None:
- convert_wrapper(
- languages=["german"],
- data_types=["wiktionary_translations"],
- input_path=None,
- output_dir=Path("/output"),
- output_type="sqlite",
- overwrite=False,
- )
-
- mock_data_to_sqlite.assert_called_once_with(
- languages=["german"],
- specific_tables=["wiktionary_translations"],
- identifier_case="camel",
- input_file=Path("/mock_wiktionary_dir"),
- output_file=Path("/output"),
- overwrite=False,
- )
-
- def test_convert(self) -> None:
- with self.assertRaises(ValueError) as context:
- convert_wrapper(
- languages=["English"],
- data_types=["nouns"],
- input_path=Path("Data/ecode.csv"),
- output_dir=Path("/output_dir"),
- output_type="parquet",
- overwrite=True,
- )
-
- self.assertEqual(
- str(context.exception),
- "Unsupported output type 'parquet'. Must be 'json', 'csv', 'tsv' or 'sqlite'.",
- )
diff --git a/tests/cli/test_interactive.py b/tests/cli/test_interactive.py
deleted file mode 100644
index 8f35171bf..000000000
--- a/tests/cli/test_interactive.py
+++ /dev/null
@@ -1,260 +0,0 @@
-# SPDX-License-Identifier: GPL-3.0-or-later
-"""
-Tests for the CLI interactive mode functionality.
-"""
-
-import tempfile
-import unittest
-from pathlib import Path
-from unittest.mock import MagicMock, call, patch
-
-from prompt_toolkit.completion import WordCompleter
-
-from scribe_data.cli.interactive import (
- ScribeDataConfig,
- configure_settings,
- display_summary,
- prompt_for_data_types,
- prompt_for_languages,
- run_request,
-)
-
-
-class TestScribeDataInteractive(unittest.TestCase):
- def setUp(self) -> None:
- """
- Set up test fixtures before each test method.
- """
- self.config = ScribeDataConfig()
- # Mock the language_metadata and data_type_metadata.
- self.config.languages = ["english", "spanish", "french"]
- self.config.data_types = ["nouns", "verbs"]
-
- def test_scribe_data_config_initialization(self) -> None:
- """
- Test ScribeDataConfig initialization.
- """
- self.assertEqual(self.config.selected_languages, [])
- self.assertEqual(self.config.selected_data_types, [])
- self.assertEqual(self.config.output_type, "json")
- self.assertIsInstance(self.config.output_dir, Path)
- self.assertFalse(self.config.overwrite)
- self.assertFalse(self.config.configured)
-
- @patch("scribe_data.cli.interactive.prompt")
- @patch("scribe_data.cli.interactive.rprint")
- def test_configure_settings_all_languages(
- self, mock_rprint: MagicMock, mock_prompt: MagicMock
- ) -> None:
- """
- Test configure_settings with 'All' languages selection.
- """
- # Set up mock responses.
- responses = iter(
- [
- "All", # languages
- "nouns", # data types
- "json", # output type
- "", # output directory (default)
- "y", # overwrite
- ]
- )
- mock_prompt.side_effect = lambda *args, **kwargs: next(responses)
-
- with patch("scribe_data.cli.interactive.config", self.config):
- with patch("scribe_data.cli.interactive.display_summary"):
- configure_settings()
-
- self.assertEqual(self.config.selected_languages, self.config.languages)
- self.assertEqual(self.config.selected_data_types, ["nouns"])
- self.assertEqual(self.config.output_type, "json")
- self.assertTrue(self.config.configured)
-
- @patch("scribe_data.cli.interactive.prompt")
- @patch("scribe_data.cli.interactive.rprint")
- def test_configure_settings_specific_languages(
- self, mock_rprint: MagicMock, mock_prompt: MagicMock
- ) -> None:
- """
- Test configure_settings with specific language selection.
- """
- # Set up mock responses.
- responses = iter(
- [
- "english, spanish", # languages
- "nouns, verbs", # data types
- "csv", # output type
- "/custom/path", # output directory
- "n", # overwrite
- ]
- )
- mock_prompt.side_effect = lambda *args, **kwargs: next(responses)
-
- with patch("scribe_data.cli.interactive.config", self.config):
- with patch("scribe_data.cli.interactive.display_summary"):
- configure_settings()
-
- self.assertEqual(self.config.selected_languages, ["english", "spanish"])
- self.assertEqual(self.config.selected_data_types, ["nouns", "verbs"])
- self.assertEqual(self.config.output_type, "csv")
- self.assertEqual(self.config.output_dir.as_posix(), "/custom/path")
- self.assertFalse(self.config.overwrite)
-
- @patch("scribe_data.cli.interactive.get_data")
- @patch("scribe_data.cli.interactive.tqdm")
- @patch("scribe_data.cli.interactive.logger")
- def test_run_request(
- self, mock_logger: MagicMock, mock_tqdm: MagicMock, mock_get_data: MagicMock
- ) -> None:
- """
- Test run_request functionality.
- """
- self.config.selected_languages = ["english"]
- self.config.selected_data_types = ["nouns"]
- self.config.configured = True
-
- mock_get_data.return_value = True
- mock_progress = MagicMock()
- mock_tqdm.return_value.__enter__.return_value = mock_progress
-
- with patch("scribe_data.cli.interactive.config", self.config):
- run_request()
-
- mock_get_data.assert_called_once_with(
- languages=["english"],
- data_types=["nouns"],
- output_type=self.config.output_type,
- output_dir=self.config.output_dir,
- overwrite=self.config.overwrite,
- interactive=True,
- )
-
- @patch("scribe_data.cli.interactive.prompt")
- @patch("scribe_data.cli.interactive.rprint")
- def test_request_total_lexeme(
- self, mock_rprint: MagicMock, mock_prompt: MagicMock
- ) -> None:
- """
- Test request_total_lexeme functionality.
- """
- # Set up mock responses.
- mock_prompt.side_effect = [
- "english, french", # first call for languages
- "nouns", # first call for data types
- ]
-
- with patch("scribe_data.cli.interactive.config", self.config):
- with patch(
- "scribe_data.cli.interactive.list_all_languages",
- return_value=["english", "french"],
- ):
- prompt_for_languages()
- prompt_for_data_types()
-
- # Verify the config was updated correctly.
- self.assertEqual(self.config.selected_languages, ["english", "french"])
- self.assertEqual(self.config.selected_data_types, ["nouns"])
-
- # Verify prompt was called with correct arguments.
- expected_calls = [
- call(
- "Select languages (comma-separated or 'All'): ",
- completer=unittest.mock.ANY,
- default="",
- ),
- call(
- "Select data types (comma-separated or 'All'): ",
- completer=unittest.mock.ANY,
- default="",
- ),
- ]
- mock_prompt.assert_has_calls(expected_calls, any_order=False)
-
- @patch("rich.console.Console.print")
- def test_display_summary(self, mock_print: MagicMock) -> None:
- """
- Test display_summary functionality.
- """
- self.config.selected_languages = ["english"]
- self.config.selected_data_types = ["nouns"]
- self.config.output_type = "json"
-
- with patch("scribe_data.cli.interactive.config", self.config):
- display_summary()
- mock_print.assert_called()
-
- def test_resolve_wiktionary_dump_path_from_subdirectory(self) -> None:
- """
- Find dumps when cwd is not the project root.
- """
- from scribe_data.cli.interactive import resolve_wiktionary_dump_path
-
- with patch("os.getcwd") as mock_getcwd:
- with tempfile.TemporaryDirectory() as tmp:
- root = Path(tmp)
- dump_dir = root / "scribe_data_wiktionary_dumps_export"
- json_dir = root / "scribe_data_json_export"
- dump_dir.mkdir()
- json_dir.mkdir()
- dump_file = dump_dir / "dewiktionary-pages-articles.xml.bz2"
- dump_file.write_bytes(b"x")
-
- mock_getcwd.return_value = str(json_dir)
- resolved = resolve_wiktionary_dump_path(
- "german",
- "scribe_data_wiktionary_dumps_export",
- )
-
- self.assertEqual(resolved, dump_file.resolve())
-
- def test_create_word_completer(self) -> None:
- """
- Test create_word_completer functionality.
- """
- from scribe_data.cli.interactive import create_word_completer
-
- # Test without 'All' option.
- options = ["english", "spanish", "french"]
- completer = create_word_completer(options, include_all=False)
- self.assertIsInstance(completer, WordCompleter)
- self.assertEqual(completer.words, options)
-
- # Test with 'All' option.
- completer_with_all = create_word_completer(options, include_all=True)
- self.assertEqual(completer_with_all.words, ["All"] + options)
-
- @patch(
- "scribe_data.cli.interactive.resolve_wiktionary_dump_path",
- return_value=Path("/dump/path"),
- )
- @patch("scribe_data.wiktionary.parse_translations.parse_wiktionary_translations")
- @patch("scribe_data.cli.interactive.prompt")
- @patch("scribe_data.cli.interactive.prompt_for_languages")
- @patch("scribe_data.cli.interactive.questionary.select")
- def test_start_interactive_mode_translations(
- self,
- mock_select,
- mock_prompt_languages,
- mock_prompt,
- mock_parse_wiktionary,
- mock_resolve_dump,
- ):
- from scribe_data.cli.interactive import config, start_interactive_mode
-
- mock_select.return_value.ask.side_effect = ["translations"]
- mock_prompt.side_effect = [
- "german",
- "/dump/path",
- "scribe_data_wiktionary_json_export",
- "false",
- ]
- config.selected_languages = ["english"]
-
- start_interactive_mode(operation="translations")
-
- mock_parse_wiktionary.assert_called_once_with(
- target_languages=["english"],
- wiktionary_dump_path=Path("/dump/path"),
- output_dir=Path("scribe_data_wiktionary_json_export"),
- overwrite=False,
- )
diff --git a/tests/cli/test_list.py b/tests/cli/test_list.py
deleted file mode 100644
index 5d45b50a4..000000000
--- a/tests/cli/test_list.py
+++ /dev/null
@@ -1,211 +0,0 @@
-# SPDX-License-Identifier: GPL-3.0-or-later
-"""
-Tests for the CLI list functionality.
-"""
-
-import unittest
-from unittest.mock import MagicMock, call, patch
-
-from scribe_data.cli.list import (
- get_language_iso,
- get_language_qid,
- list_all,
- list_data_types,
- list_languages,
- list_languages_for_data_type,
- list_wrapper,
-)
-from scribe_data.cli.main import main
-from scribe_data.utils import (
- list_all_languages,
- list_languages_with_metadata_for_data_type,
-)
-
-
-class TestListFunctions(unittest.TestCase):
- @patch("builtins.print")
- def test_list_languages(self, mock_print: MagicMock) -> None:
- list_languages()
-
- # Verify the headers.
- mock_print.assert_any_call("\nLanguage ISO QID ")
- mock_print.assert_any_call("=================================")
-
- # Dynamically get the first language from the metadata.
- languages = list_all_languages()
- first_language = languages[0]
- first_iso = get_language_iso(first_language)
- first_qid = get_language_qid(first_language)
-
- # Verify the first language entry.
- # Calculate column widths as in the actual function.
- language_col_width = max(len(lang) for lang in languages) + 2
- iso_col_width = max(len(get_language_iso(lang)) for lang in languages) + 2
- qid_col_width = max(len(get_language_qid(lang)) for lang in languages) + 2
-
- # Verify the first language entry with dynamic spacing.
- mock_print.assert_any_call(
- f"{first_language.capitalize():<{language_col_width}} {first_iso:<{iso_col_width}} {first_qid:<{qid_col_width}}"
- )
- # Total print calls: N (languages) + 3 (header, one separator, final line).
- self.assertEqual(mock_print.call_count, len(languages) + 3)
-
- @patch("builtins.print")
- def test_list_data_types_all_languages(self, mock_print: MagicMock) -> None:
- list_data_types()
- print(mock_print.mock_calls)
- expected_calls = [
- call(),
- call("Available data types: All languages"),
- call("==================================="),
- call("adjectives"),
- call("adverbs"),
- # call("articles"),
- call("conjunctions"),
- call("emoji-keywords"),
- call("nouns"),
- call("personal-pronouns"),
- call("postpositions"),
- call("prepositions"),
- call("pronouns"),
- call("proper-nouns"),
- call("verbs"),
- call(),
- ]
- mock_print.assert_has_calls(expected_calls)
-
- @patch("builtins.print")
- def test_list_data_types_specific_language(self, mock_print: MagicMock) -> None:
- list_data_types("english")
-
- expected_calls = [
- call(),
- call("Available data types: English"),
- call("============================="),
- call("adjectives"),
- call("adverbs"),
- call("emoji-keywords"),
- call("nouns"),
- call("personal-pronouns"),
- call("prepositions"),
- call("pronouns"),
- call("proper-nouns"),
- call("verbs"),
- call(),
- ]
- mock_print.assert_has_calls(expected_calls)
-
- def test_list_data_types_invalid_language(self) -> None:
- with self.assertRaises(ValueError):
- list_data_types("InvalidLanguage")
-
- def test_list_data_types_no_data_types(self) -> None:
- with self.assertRaises(ValueError):
- list_data_types("Klingon")
-
- @patch("scribe_data.cli.list.list_languages")
- @patch("scribe_data.cli.list.list_data_types")
- def test_list_all(
- self, mock_list_data_types: MagicMock, mock_list_languages: MagicMock
- ) -> None:
- list_all()
- mock_list_languages.assert_called_once()
- mock_list_data_types.assert_called_once()
-
- @patch("scribe_data.cli.list.list_all")
- def test_list_wrapper_all(self, mock_list_all: MagicMock) -> None:
- list_wrapper(all_bool=True)
- mock_list_all.assert_called_once()
-
- @patch("scribe_data.cli.list.list_languages")
- def test_list_wrapper_languages(self, mock_list_languages: MagicMock) -> None:
- list_wrapper(language=True)
- mock_list_languages.assert_called_once()
-
- @patch("scribe_data.cli.list.list_data_types")
- def test_list_wrapper_data_types(self, mock_list_data_types: MagicMock) -> None:
- list_wrapper(data_type=True)
- mock_list_data_types.assert_called_once()
-
- @patch("builtins.print")
- def test_list_wrapper_language_and_data_type(self, mock_print: MagicMock) -> None:
- list_wrapper(language=True, data_type=True)
- mock_print.assert_called_with(
- "Please specify either a language or a data type."
- )
-
- @patch("scribe_data.cli.list.list_languages_for_data_type")
- def test_list_wrapper_languages_for_data_type(
- self, mock_list_languages_for_data_type: MagicMock
- ) -> None:
- list_wrapper(language=True, data_type="example_data_type")
- mock_list_languages_for_data_type.assert_called_with("example_data_type")
-
- @patch("scribe_data.cli.list.list_data_types")
- def test_list_wrapper_data_types_for_language(
- self, mock_list_data_types: MagicMock
- ) -> None:
- list_wrapper(language="English", data_type=True)
- mock_list_data_types.assert_called_with("English")
-
- @patch("builtins.print")
- def test_list_languages_for_data_type_valid(self, mock_print: MagicMock) -> None:
- # Call the function with a specific data type.
- list_languages_for_data_type("nouns")
-
- # Dynamically create the header based on column widths.
- all_languages = list_languages_with_metadata_for_data_type()
-
- # Calculate column widths as in the actual function.
- language_col_width = max(len(lang["name"]) for lang in all_languages) + 2
- iso_col_width = max(len(lang["iso"]) for lang in all_languages) + 2
- qid_col_width = max(len(lang["qid"]) for lang in all_languages) + 2
-
- # Dynamically generate the expected header string.
- expected_header = f"{'\nLanguage':<{language_col_width}} {'ISO':<{iso_col_width}} {'QID':<{qid_col_width}}"
-
- # Verify the headers dynamically.
- mock_print.assert_any_call(expected_header)
- mock_print.assert_any_call(
- "=" * (language_col_width + iso_col_width + qid_col_width)
- )
-
- # Verify the first language entry if there are any languages.
-
- first_language = all_languages[0]["name"].capitalize()
- first_iso = all_languages[0]["iso"]
- first_qid = all_languages[0]["qid"]
-
- # Verify the first language entry with dynamic spacing.
- mock_print.assert_any_call(
- f"{first_language:<{language_col_width}} {first_iso:<{iso_col_width}} {first_qid:<{qid_col_width}}"
- )
-
- # Check the total number of calls.
- # Total calls = N (languages) + 3 (header, one separator, final line)
- expected_calls = len(all_languages) + 3
- self.assertEqual(mock_print.call_count, expected_calls)
-
- @patch("scribe_data.cli.list.list_languages")
- def test_list_languages_command(self, mock_list_languages: MagicMock) -> None:
- test_args = ["main.py", "list", "--language"]
- with patch("sys.argv", test_args):
- main()
-
- mock_list_languages.assert_called_once()
-
- @patch("scribe_data.cli.list.list_data_types")
- def test_list_data_types_command(self, mock_list_data_types: MagicMock) -> None:
- test_args = ["main.py", "list", "--data-type"]
- with patch("sys.argv", test_args):
- main()
-
- mock_list_data_types.assert_called_once()
-
- @patch("scribe_data.cli.list.list_all")
- def test_list_all_command(self, mock_list_all: MagicMock) -> None:
- test_args = ["main.py", "list", "--all"]
- with patch("sys.argv", test_args):
- main()
-
- mock_list_all.assert_called_once()
diff --git a/tests/cli/test_total.py b/tests/cli/test_total.py
deleted file mode 100644
index c999418cb..000000000
--- a/tests/cli/test_total.py
+++ /dev/null
@@ -1,610 +0,0 @@
-# SPDX-License-Identifier: GPL-3.0-or-later
-"""
-Tests for the CLI total functionality.
-"""
-
-import unittest
-from http.client import IncompleteRead
-from pathlib import Path
-from unittest.mock import MagicMock, call, patch
-from urllib.error import HTTPError
-
-import yaml
-
-from scribe_data.cli.total import (
- get_datatype_list,
- get_qid_by_input,
- get_total_lexemes,
- total_wrapper,
-)
-from scribe_data.utils import (
- DEFAULT_WIKIDATA_DUMP_EXPORT_DIR,
- WIKIDATA_QIDS_PIDS_FILE,
- check_qid_is_language,
-)
-
-try:
- with WIKIDATA_QIDS_PIDS_FILE.open("r", encoding="utf-8") as file:
- wikidata_qids_pids = yaml.safe_load(file)
-
-except (IOError, yaml.YAMLError) as e:
- print(f"Error reading wikidata QIDs/PIDs metadata: {e}")
-
-
-class TestTotalLexemes(unittest.TestCase):
- @patch("scribe_data.cli.total.get_qid_by_input")
- @patch("scribe_data.cli.total.sparql.query")
- def test_get_total_lexemes_valid(
- self, mock_query: MagicMock, mock_get_qid: MagicMock
- ) -> None:
- mock_get_qid.side_effect = lambda x: {"english": "Q1860", "nouns": "Q1084"}.get(
- x.lower()
- )
- mock_results = MagicMock()
- mock_results.convert.return_value = {
- "results": {"bindings": [{"total": {"value": "42"}}]}
- }
- mock_query.return_value = mock_results
-
- with patch("builtins.print") as mock_print:
- get_total_lexemes(language="English", data_type="nouns")
-
- mock_print.assert_called_once_with(
- "\nLanguage: English\nData type: nouns\nTotal number of lexemes: 42\n"
- )
-
- @patch("scribe_data.cli.total.get_qid_by_input")
- @patch("scribe_data.cli.total.sparql.query")
- def test_get_total_lexemes_no_results(
- self, mock_query: MagicMock, mock_get_qid: MagicMock
- ) -> None:
- mock_get_qid.side_effect = lambda x: {"english": "Q1860", "nouns": "Q1084"}.get(
- x.lower()
- )
- mock_results = MagicMock()
- mock_results.convert.return_value = {"results": {"bindings": []}}
- mock_query.return_value = mock_results
-
- with patch("builtins.print") as mock_print:
- get_total_lexemes(language="English", data_type="nouns")
-
- mock_print.assert_called_once_with("Total number of lexemes: Not found")
-
- @patch("scribe_data.cli.total.get_qid_by_input")
- @patch("scribe_data.cli.total.sparql.query")
- def test_get_total_lexemes_invalid_language(
- self, mock_query: MagicMock, mock_get_qid: MagicMock
- ) -> None:
- mock_get_qid.side_effect = lambda x: None
- mock_query.return_value = MagicMock()
-
- with patch("builtins.print") as mock_print:
- get_total_lexemes(language="InvalidLanguage", data_type="nouns")
-
- mock_print.assert_called_once_with("Total number of lexemes: Not found")
-
- @patch("scribe_data.cli.total.get_qid_by_input")
- @patch("scribe_data.cli.total.sparql.query")
- def test_get_total_lexemes_empty_and_none_inputs(
- self, mock_query: MagicMock, mock_get_qid: MagicMock
- ) -> None:
- mock_get_qid.return_value = None
- mock_query.return_value = MagicMock()
-
- # Call the function with empty and None inputs.
- with patch("builtins.print") as mock_print:
- get_total_lexemes(language="", data_type="nouns")
- get_total_lexemes(language=None, data_type="verbs")
-
- expected_calls = [
- call("Total number of lexemes: Not found"),
- call("Total number of lexemes: Not found"),
- ]
- mock_print.assert_has_calls(expected_calls, any_order=True)
-
- @patch("scribe_data.cli.total.get_qid_by_input")
- @patch("scribe_data.cli.total.sparql.query")
- def test_get_total_lexemes_nonexistent_language(
- self, mock_query: MagicMock, mock_get_qid: MagicMock
- ) -> None:
- mock_get_qid.return_value = None
- mock_query.return_value = MagicMock()
-
- with patch("builtins.print") as mock_print:
- get_total_lexemes(language="Martian", data_type="nouns")
-
- mock_print.assert_called_once_with("Total number of lexemes: Not found")
-
- @patch("scribe_data.cli.total.get_qid_by_input")
- @patch("scribe_data.cli.total.sparql.query")
- def test_get_total_lexemes_various_data_types(
- self, mock_query: MagicMock, mock_get_qid: MagicMock
- ) -> None:
- mock_get_qid.side_effect = lambda x: {
- "english": "Q1860",
- "verbs": "Q24905",
- "nouns": "Q1084",
- }.get(x.lower())
- mock_results = MagicMock()
- mock_results.convert.return_value = {
- "results": {"bindings": [{"total": {"value": "30"}}]}
- }
-
- mock_query.return_value = mock_results
-
- # Call the function with different data types.
- with patch("builtins.print") as mock_print:
- get_total_lexemes(language="English", data_type="verbs")
- get_total_lexemes(language="English", data_type="nouns")
-
- expected_calls = [
- call(
- "\nLanguage: English\nData type: verbs\nTotal number of lexemes: 30\n"
- ),
- call(
- "\nLanguage: English\nData type: nouns\nTotal number of lexemes: 30\n"
- ),
- ]
- mock_print.assert_has_calls(expected_calls)
-
- @patch("scribe_data.cli.total.get_qid_by_input")
- @patch("scribe_data.cli.total.sparql.query")
- @patch("scribe_data.cli.total.WIKIDATA_QUERIES_ALL_DATA_DIR")
- def test_get_total_lexemes_sub_languages(
- self, mock_dir: MagicMock, mock_query: MagicMock, mock_get_qid: MagicMock
- ) -> None:
- # Setup for sub-languages.
- mock_get_qid.side_effect = lambda x: {
- "bokmål": "Q25167",
- "nynorsk": "Q25164",
- }.get(x.lower())
- mock_results = MagicMock()
- mock_results.convert.return_value = {
- "results": {"bindings": [{"total": {"value": "30"}}]}
- }
- mock_query.return_value = mock_results
-
- # Mocking directory paths and contents.
- mock_dir.__truediv__.return_value.exists.return_value = True
- mock_dir.__truediv__.return_value.iterdir.return_value = [
- MagicMock(name="verbs", is_dir=lambda: True),
- MagicMock(name="nouns", is_dir=lambda: True),
- ]
-
- with patch("builtins.print") as mock_print:
- get_total_lexemes(language="Norwegian", data_type="verbs")
- get_total_lexemes(language="Norwegian", data_type="nouns")
-
- expected_calls = [
- call(
- "\nLanguage: Norwegian\nData type: verbs\nTotal number of lexemes: 30\n"
- ),
- call(
- "\nLanguage: Norwegian\nData type: nouns\nTotal number of lexemes: 30\n"
- ),
- ]
- mock_print.assert_has_calls(expected_calls)
-
-
-class TestGetQidByInput(unittest.TestCase):
- def setUp(self) -> None:
- self.valid_data_types = {
- "english": "Q1860",
- "nouns": "Q1084",
- "verbs": "Q24905",
- }
-
- @patch("scribe_data.cli.total.data_type_metadata", new_callable=dict)
- def test_get_qid_by_input_valid(self, mock_data_type_metadata: MagicMock) -> None:
- mock_data_type_metadata.update(self.valid_data_types)
-
- for data_type, expected_qid in self.valid_data_types.items():
- self.assertEqual(get_qid_by_input(data_type), expected_qid)
-
- @patch("scribe_data.cli.total.data_type_metadata", new_callable=dict)
- def test_get_qid_by_input_invalid(self, mock_data_type_metadata: MagicMock) -> None:
- mock_data_type_metadata.update(self.valid_data_types)
-
- self.assertIsNone(get_qid_by_input("invalid_data_type"))
-
-
-class TestGetDatatypeList(unittest.TestCase):
- @patch("scribe_data.cli.total.WIKIDATA_QUERIES_ALL_DATA_DIR")
- def test_get_datatype_list_invalid_language(self, mock_dir: MagicMock) -> None:
- mock_dir.__truediv__.return_value.exists.return_value = False
-
- with self.assertRaises(ValueError):
- get_datatype_list("InvalidLanguage")
-
- @patch("scribe_data.cli.total.WIKIDATA_QUERIES_ALL_DATA_DIR")
- def test_get_datatype_list_no_data_types(self, mock_dir: MagicMock) -> None:
- mock_dir.__truediv__.return_value.exists.return_value = True
- mock_dir.__truediv__.return_value.iterdir.return_value = []
-
- with self.assertRaises(ValueError):
- get_datatype_list("English")
-
-
-class TestCheckQidIsLanguage(unittest.TestCase):
- @patch("scribe_data.utils.requests.get")
- def test_check_qid_is_language_valid(self, mock_get: MagicMock) -> None:
- mock_response = MagicMock()
- mock_response.json.return_value = {
- "statements": {
- wikidata_qids_pids["instance_of"]: [{"value": {"content": "Q34770"}}]
- },
- "labels": {"en": "English"},
- }
- mock_get.return_value = mock_response
-
- with patch("builtins.print") as mock_print:
- result = check_qid_is_language("Q1860")
-
- self.assertEqual(result, "English")
- mock_print.assert_called_once_with("English (Q1860) is a language.\n")
-
- @patch("scribe_data.utils.requests.get")
- def test_check_qid_is_language_invalid(self, mock_get: MagicMock) -> None:
- mock_response = MagicMock()
- mock_response.json.return_value = {
- "statements": {
- wikidata_qids_pids["instance_of"]: [{"value": {"content": "Q5"}}]
- },
- "labels": {"en": "Human"},
- }
- mock_get.return_value = mock_response
-
- with self.assertRaises(ValueError):
- check_qid_is_language("Q5")
-
-
-class TestTotalWrapper(unittest.TestCase):
- @patch("scribe_data.cli.total.print_total_lexemes")
- def test_total_wrapper_all_bool(self, mock_print_total_lexemes: MagicMock) -> None:
- total_wrapper(all_bool=True)
- mock_print_total_lexemes.assert_called_once_with()
-
- @patch("scribe_data.cli.total.print_total_lexemes")
- def test_total_wrapper_language_only(
- self, mock_print_total_lexemes: MagicMock
- ) -> None:
- total_wrapper(languages=["English"])
- mock_print_total_lexemes.assert_called_once_with(language="English")
-
- @patch("scribe_data.cli.total.get_total_lexemes")
- def test_total_wrapper_language_and_data_type(
- self, mock_get_total_lexemes_lexemes: MagicMock
- ) -> None:
- total_wrapper(languages=["English"], data_types=["nouns"])
- mock_get_total_lexemes_lexemes.assert_called_once_with(
- language="English", data_type="nouns"
- )
-
- def test_total_wrapper_invalid_input(self) -> None:
- with self.assertRaises(ValueError):
- total_wrapper()
-
- # MARK: Using Dump
-
- @patch("scribe_data.cli.total.parse_wd_lexeme_dump")
- def test_total_wrapper_wikidata_dump_flag(self, mock_parse_dump: MagicMock) -> None:
- """
- Test when wikidata_dump is True (flag without path).
- """
- total_wrapper(wikidata_dump=True)
- mock_parse_dump.assert_called_once_with(
- languages=["all"],
- data_types=["all"],
- wikidata_dump_type=["total"],
- wikidata_dump_path=DEFAULT_WIKIDATA_DUMP_EXPORT_DIR,
- )
-
- @patch("scribe_data.cli.total.parse_wd_lexeme_dump")
- def test_total_wrapper_wikidata_dump_with_all(
- self, mock_parse_dump: MagicMock
- ) -> None:
- """
- Test when both wikidata_dump and all_bool are True.
- """
- total_wrapper(wikidata_dump=True, all_bool=True)
- mock_parse_dump.assert_called_once_with(
- languages=["all"],
- data_types=["all"],
- wikidata_dump_type=["total"],
- wikidata_dump_path=DEFAULT_WIKIDATA_DUMP_EXPORT_DIR,
- )
-
- @patch("scribe_data.cli.total.parse_wd_lexeme_dump")
- def test_total_wrapper_wikidata_dump_with_language_and_type(
- self, mock_parse_dump: MagicMock
- ) -> None:
- """
- Test wikidata_dump with specific language and data type.
- """
- total_wrapper(
- languages=["English"],
- data_types=["nouns"],
- wikidata_dump=Path("/path/to/dump.json"),
- )
- mock_parse_dump.assert_called_once_with(
- languages=["English"],
- data_types=["nouns"],
- wikidata_dump_type=["total"],
- wikidata_dump_path=Path("/path/to/dump.json"),
- )
-
- # MARK: Using QID
-
- @patch("scribe_data.cli.total.check_qid_is_language")
- @patch("scribe_data.cli.total.print_total_lexemes")
- def test_total_wrapper_with_qid(
- self, mock_print_total: MagicMock, mock_check_qid: MagicMock
- ) -> None:
- """
- Test when language is provided as a QID.
- """
- mock_check_qid.return_value = "Thai"
- total_wrapper(languages=["Q9217"])
- mock_print_total.assert_called_once_with(language="Q9217")
-
- @patch("scribe_data.cli.total.check_qid_is_language")
- @patch("scribe_data.cli.total.get_total_lexemes")
- def test_total_wrapper_with_qid_and_datatype(
- self, mock_get_total_lexemes: MagicMock, mock_check_qid: MagicMock
- ) -> None:
- """
- Test when language QID and data type are provided.
- """
- mock_check_qid.return_value = "Thai"
- total_wrapper(languages=["Q9217"], data_types=["nouns"])
- mock_get_total_lexemes.assert_called_once_with(
- language="Q9217", data_type="nouns"
- )
-
- @patch("scribe_data.cli.total.parse_wd_lexeme_dump")
- def test_total_wrapper_qid_with_wikidata_dump(
- self, mock_parse_dump: MagicMock
- ) -> None:
- """
- Test QID with wikidata dump.
- """
- total_wrapper(languages=["Q9217"], wikidata_dump=True, all_bool=True)
- mock_parse_dump.assert_called_once_with(
- languages=["Q9217"],
- data_types=["all"],
- wikidata_dump_type=["total"],
- wikidata_dump_path=DEFAULT_WIKIDATA_DUMP_EXPORT_DIR,
- )
-
- @patch("scribe_data.cli.total.get_total_lexemes")
- def test_get_total_lexemes_with_qid(
- self, mock_get_total_lexemes: MagicMock
- ) -> None:
- """
- Test get_total_lexemes with QID input.
- """
- total_wrapper(languages=["Q9217"], data_types=["Q1084"]) # Q1084 is noun QID
- mock_get_total_lexemes.assert_called_once_with(
- language="Q9217", data_type="Q1084"
- )
-
- # MARK: Multiple Languages and Data Types
-
- @patch("scribe_data.cli.total.get_total_lexemes")
- def test_total_wrapper_multiple_languages(
- self, mock_get_total_lexemes: MagicMock
- ) -> None:
- """
- Test retrieving totals for multiple languages.
- """
- # Mock return value to avoid formatting error.
- mock_get_total_lexemes.return_value = 100
-
- total_wrapper(languages=["English", "German"], data_types=["nouns"])
-
- expected_calls = [
- call(language="English", data_type="nouns", do_print=False),
- call(language="German", data_type="nouns", do_print=False),
- ]
- mock_get_total_lexemes.assert_has_calls(expected_calls)
-
- @patch("scribe_data.cli.total.get_total_lexemes")
- def test_total_wrapper_multiple_data_types(
- self, mock_get_total_lexemes: MagicMock
- ) -> None:
- """
- Test retrieving totals for multiple data types.
- """
- # Mock return value to avoid formatting error.
- mock_get_total_lexemes.return_value = 100
-
- total_wrapper(languages=["English"], data_types=["nouns", "verbs"])
-
- expected_calls = [
- call(language="English", data_type="nouns", do_print=False),
- call(language="English", data_type="verbs", do_print=False),
- ]
- mock_get_total_lexemes.assert_has_calls(expected_calls)
-
- @patch("scribe_data.cli.total.get_total_lexemes")
- def test_total_wrapper_multiple_languages_and_types(
- self, mock_get_total_lexemes: MagicMock
- ) -> None:
- """
- Test retrieving totals for multiple languages and data types.
- """
- # Mock return value to avoid formatting error.
- mock_get_total_lexemes.return_value = 100
-
- total_wrapper(languages=["English", "German"], data_types=["nouns", "verbs"])
-
- expected_calls = [
- call(language="English", data_type="nouns", do_print=False),
- call(language="English", data_type="verbs", do_print=False),
- call(language="German", data_type="nouns", do_print=False),
- call(language="German", data_type="verbs", do_print=False),
- ]
- mock_get_total_lexemes.assert_has_calls(expected_calls)
-
- # MARK: Error Handling
-
- @patch("scribe_data.cli.total.sparql.query")
- def test_get_total_lexemes_http_error(self, mock_query: MagicMock) -> None:
- """
- Test handling of HTTPError when querying totals.
- """
- # Set up mock to return None for results after max retries.
- mock_query.side_effect = [
- HTTPError(url="test", code=500, msg="error", hdrs={}, fp=None),
- HTTPError(url="test", code=500, msg="error", hdrs={}, fp=None),
- HTTPError(url="test", code=500, msg="error", hdrs={}, fp=None),
- ]
-
- with patch("builtins.print") as mock_print:
- result = get_total_lexemes(language="English", data_type="nouns")
-
- self.assertIsNone(result)
- mock_print.assert_any_call("Query failed after retries.")
-
- @patch("scribe_data.cli.total.sparql.query")
- def test_get_total_lexemes_incomplete_read(self, mock_query: MagicMock) -> None:
- """
- Test handling of IncompleteRead error when querying totals.
- """
- # Set up mock to return None for results after max retries.
- mock_query.side_effect = [
- IncompleteRead(partial=b""),
- IncompleteRead(partial=b""),
- IncompleteRead(partial=b""),
- ]
-
- with patch("builtins.print") as mock_print:
- result = get_total_lexemes(language="English", data_type="nouns")
-
- self.assertIsNone(result)
- mock_print.assert_any_call("Query failed after retries.")
-
- # MARK: Sub-language Handling
-
- @patch("scribe_data.cli.total.get_datatype_list")
- @patch("scribe_data.cli.total.get_total_lexemes")
- def test_print_total_lexemes_with_sublanguages(
- self, mock_get_total_lexemes: MagicMock, mock_get_datatypes: MagicMock
- ) -> None:
- """
- Test printing totals for a language with sub-languages.
- """
- mock_get_datatypes.return_value = ["nouns", "verbs"]
- mock_get_total_lexemes.return_value = 100
-
- with patch("builtins.print") as mock_print:
- total_wrapper(languages=["Norwegian"], data_types=["nouns", "verbs"])
-
- # Verify header was printed.
- mock_print.assert_any_call(
- f"{'Language':<20} {'Data Type':<25} {'Total Wikidata Lexemes':<25}"
- )
- mock_print.assert_any_call("=" * 70)
-
- # Verify data was printed for each data type.
- mock_get_total_lexemes.assert_any_call(
- language="Norwegian", data_type="nouns", do_print=False
- )
- mock_get_total_lexemes.assert_any_call(
- language="Norwegian", data_type="verbs", do_print=False
- )
-
- # MARK: Data Type List Handling
-
- @patch("scribe_data.cli.total.language_metadata")
- @patch("scribe_data.cli.total.list_all_languages")
- @patch("scribe_data.cli.total.WIKIDATA_QUERIES_ALL_DATA_DIR")
- def test_get_datatype_list_with_sublanguages(
- self,
- mock_dir: MagicMock,
- mock_list_languages: MagicMock,
- mock_metadata: MagicMock,
- ) -> None:
- """
- Test getting data type list for a language with sub-languages.
- """
- # Mock language metadata and list_all_languages.
- mock_metadata_dict = {
- "norwegian": {
- "sub_languages": {"bokmal": {"iso": "nb"}, "nynorsk": {"iso": "nn"}}
- }
- }
-
- # Mock dictionary-like behavior for language_metadata.
- mock_metadata.__iter__.return_value = mock_metadata_dict.items()
- mock_metadata.items.return_value = mock_metadata_dict.items()
- mock_metadata.get.return_value = mock_metadata_dict["norwegian"]
- mock_metadata.__getitem__.return_value = mock_metadata_dict["norwegian"]
-
- mock_list_languages.return_value = ["norwegian"]
-
- # Create mock directory entries with proper string names.
- mock_nouns = MagicMock()
- mock_nouns.name = "nouns"
- mock_nouns.is_dir.return_value = True
-
- mock_verbs = MagicMock()
- mock_verbs.name = "verbs"
- mock_verbs.is_dir.return_value = True
-
- # Mock directory structure for both sub-languages.
- def mock_path_handler(path: str) -> MagicMock:
- mock_path = MagicMock()
- mock_path.exists.return_value = True
- mock_path.iterdir.return_value = [mock_nouns, mock_verbs]
- return mock_path
-
- mock_dir.__truediv__.side_effect = mock_path_handler
-
- result = get_datatype_list("norwegian") # note: lowercase
- self.assertEqual(sorted(result), ["nouns", "verbs"])
-
- @patch("scribe_data.cli.total.language_metadata")
- @patch("scribe_data.cli.total.WIKIDATA_QUERIES_ALL_DATA_DIR")
- def test_get_datatype_list_empty_directory(
- self, mock_dir: MagicMock, mock_metadata: MagicMock
- ) -> None:
- """
- Test getting data type list from an empty directory.
- """
- # Mock language metadata.
- mock_metadata.get.return_value = {}
-
- mock_dir.__truediv__.return_value.exists.return_value = True
- mock_dir.__truediv__.return_value.iterdir.return_value = []
-
- with self.assertRaises(ValueError):
- get_datatype_list("English")
-
- @patch("scribe_data.cli.total.get_total_lexemes")
- def test_total_wrapper_with_invalid_language(
- self, mock_get_total_lexemes: MagicMock
- ) -> None:
- """
- Test total wrapper with invalid language.
- """
- mock_get_total_lexemes.side_effect = ValueError("Invalid language")
-
- with self.assertRaises(ValueError):
- total_wrapper(languages=["invalid_lang"], data_types=["nouns"])
-
- mock_get_total_lexemes.assert_called_once()
-
- @patch("scribe_data.cli.total.get_total_lexemes")
- def test_total_wrapper_with_invalid_data_type(
- self, mock_get_total_lexemes: MagicMock
- ) -> None:
- """
- Test total wrapper with invalid data type.
- """
- mock_get_total_lexemes.side_effect = ValueError("Invalid data type")
-
- with self.assertRaises(ValueError):
- total_wrapper(languages=["English"], data_types=["invalid_type"])
-
- mock_get_total_lexemes.assert_called_once()
diff --git a/tests/cli/total/test_cli_total_query.py b/tests/cli/total/test_cli_total_query.py
new file mode 100644
index 000000000..7f3d0cf1a
--- /dev/null
+++ b/tests/cli/total/test_cli_total_query.py
@@ -0,0 +1,249 @@
+# SPDX-License-Identifier: GPL-3.0-or-later
+"""
+Tests for the CLI total query functionality.
+"""
+
+import unittest
+from unittest.mock import MagicMock, call, patch
+
+import yaml
+
+from scribe_data.cli.total.print_values import get_datatype_list
+from scribe_data.cli.total.query import get_qid_by_input, query_total_lexemes
+from scribe_data.utils import WIKIDATA_QIDS_PIDS_FILE, check_qid_is_language
+
+try:
+ with WIKIDATA_QIDS_PIDS_FILE.open("r", encoding="utf-8") as file:
+ wikidata_qids_pids = yaml.safe_load(file)
+
+except (IOError, yaml.YAMLError) as e:
+ print(f"Error reading wikidata QIDs/PIDs metadata: {e}")
+
+# MARK: Query
+
+
+class TestCLITotalQuery(unittest.TestCase):
+ @patch("scribe_data.cli.total.query.get_qid_by_input")
+ @patch("scribe_data.cli.total.query.sparql.query")
+ def test_cli_total_query_lexemes_valid(
+ self, mock_query: MagicMock, mock_get_qid: MagicMock
+ ) -> None:
+ mock_get_qid.side_effect = lambda x: {"english": "Q1860", "nouns": "Q1084"}.get(
+ x.lower()
+ )
+ mock_results = MagicMock()
+ mock_results.convert.return_value = {
+ "results": {"bindings": [{"total": {"value": "42"}}]}
+ }
+ mock_query.return_value = mock_results
+
+ with patch("builtins.print") as mock_print:
+ query_total_lexemes(language="English", data_type="nouns")
+
+ mock_print.assert_called_once_with(
+ "\nLanguage: English\nData type: nouns\nTotal number of lexemes: 42\n"
+ )
+
+ @patch("scribe_data.cli.total.query.get_qid_by_input")
+ @patch("scribe_data.cli.total.query.sparql.query")
+ def test_cli_total_query_lexemes_no_results(
+ self, mock_query: MagicMock, mock_get_qid: MagicMock
+ ) -> None:
+ mock_get_qid.side_effect = lambda x: {"english": "Q1860", "nouns": "Q1084"}.get(
+ x.lower()
+ )
+ mock_results = MagicMock()
+ mock_results.convert.return_value = {"results": {"bindings": []}}
+ mock_query.return_value = mock_results
+
+ with patch("builtins.print") as mock_print:
+ query_total_lexemes(language="English", data_type="nouns")
+
+ mock_print.assert_called_once_with("Total number of lexemes: Not found")
+
+ @patch("scribe_data.cli.total.query.get_qid_by_input")
+ @patch("scribe_data.cli.total.query.sparql.query")
+ def test_cli_total_query_lexemes_invalid_language(
+ self, mock_query: MagicMock, mock_get_qid: MagicMock
+ ) -> None:
+ mock_get_qid.side_effect = lambda x: None
+ mock_query.return_value = MagicMock()
+
+ with patch("builtins.print") as mock_print:
+ query_total_lexemes(language="InvalidLanguage", data_type="nouns")
+
+ mock_print.assert_called_once_with("Total number of lexemes: Not found")
+
+ @patch("scribe_data.cli.total.query.get_qid_by_input")
+ @patch("scribe_data.cli.total.query.sparql.query")
+ def test_cli_total_query_lexemes_empty_and_none_inputs(
+ self, mock_query: MagicMock, mock_get_qid: MagicMock
+ ) -> None:
+ mock_get_qid.return_value = None
+ mock_query.return_value = MagicMock()
+
+ # Call the function with empty and None inputs.
+ with patch("builtins.print") as mock_print:
+ query_total_lexemes(language="", data_type="nouns")
+ query_total_lexemes(language=None, data_type="verbs")
+
+ expected_calls = [
+ call("Total number of lexemes: Not found"),
+ call("Total number of lexemes: Not found"),
+ ]
+ mock_print.assert_has_calls(expected_calls, any_order=True)
+
+ @patch("scribe_data.cli.total.query.get_qid_by_input")
+ @patch("scribe_data.cli.total.query.sparql.query")
+ def test_cli_total_query_lexemes_nonexistent_language(
+ self, mock_query: MagicMock, mock_get_qid: MagicMock
+ ) -> None:
+ mock_get_qid.return_value = None
+ mock_query.return_value = MagicMock()
+
+ with patch("builtins.print") as mock_print:
+ query_total_lexemes(language="Martian", data_type="nouns")
+
+ mock_print.assert_called_once_with("Total number of lexemes: Not found")
+
+ @patch("scribe_data.cli.total.query.get_qid_by_input")
+ @patch("scribe_data.cli.total.query.sparql.query")
+ def test_cli_total_query_lexemes_various_data_types(
+ self, mock_query: MagicMock, mock_get_qid: MagicMock
+ ) -> None:
+ mock_get_qid.side_effect = lambda x: {
+ "english": "Q1860",
+ "verbs": "Q24905",
+ "nouns": "Q1084",
+ }.get(x.lower())
+ mock_results = MagicMock()
+ mock_results.convert.return_value = {
+ "results": {"bindings": [{"total": {"value": "30"}}]}
+ }
+
+ mock_query.return_value = mock_results
+
+ # Call the function with different data types.
+ with patch("builtins.print") as mock_print:
+ query_total_lexemes(language="English", data_type="verbs")
+ query_total_lexemes(language="English", data_type="nouns")
+
+ expected_calls = [
+ call(
+ "\nLanguage: English\nData type: verbs\nTotal number of lexemes: 30\n"
+ ),
+ call(
+ "\nLanguage: English\nData type: nouns\nTotal number of lexemes: 30\n"
+ ),
+ ]
+ mock_print.assert_has_calls(expected_calls)
+
+ @patch("scribe_data.cli.total.query.get_qid_by_input")
+ @patch("scribe_data.cli.total.query.sparql.query")
+ @patch("scribe_data.cli.total.print_values.WIKIDATA_QUERIES_ALL_DATA_DIR")
+ def test_cli_total_query_lexemes_sub_languages(
+ self, mock_dir: MagicMock, mock_query: MagicMock, mock_get_qid: MagicMock
+ ) -> None:
+ # Setup for sub-languages.
+ mock_get_qid.side_effect = lambda x: {
+ "bokmål": "Q25167",
+ "nynorsk": "Q25164",
+ }.get(x.lower())
+ mock_results = MagicMock()
+ mock_results.convert.return_value = {
+ "results": {"bindings": [{"total": {"value": "30"}}]}
+ }
+ mock_query.return_value = mock_results
+
+ # Mock directory paths and contents. Note: MagicMock(name=...) sets the mock's repr name, not a `.name` attribute.
+ mock_dir.__truediv__.return_value.exists.return_value = True
+ mock_dir.__truediv__.return_value.iterdir.return_value = [
+ MagicMock(name="verbs", is_dir=lambda: True),
+ MagicMock(name="nouns", is_dir=lambda: True),
+ ]
+
+ with patch("builtins.print") as mock_print:
+ query_total_lexemes(language="Norwegian", data_type="verbs")
+ query_total_lexemes(language="Norwegian", data_type="nouns")
+
+ expected_calls = [
+ call(
+ "\nLanguage: Norwegian\nData type: verbs\nTotal number of lexemes: 30\n"
+ ),
+ call(
+ "\nLanguage: Norwegian\nData type: nouns\nTotal number of lexemes: 30\n"
+ ),
+ ]
+ mock_print.assert_has_calls(expected_calls)
+
+
+class TestGetQidByInput(unittest.TestCase):
+ def setUp(self) -> None:
+ self.valid_data_types = {
+ "english": "Q1860",
+ "nouns": "Q1084",
+ "verbs": "Q24905",
+ }
+
+ @patch("scribe_data.cli.total.query.data_type_metadata", new_callable=dict)
+ def test_get_qid_by_input_valid(self, mock_data_type_metadata: MagicMock) -> None:
+ mock_data_type_metadata.update(self.valid_data_types)
+
+ for data_type, expected_qid in self.valid_data_types.items():
+ self.assertEqual(get_qid_by_input(data_type), expected_qid)
+
+ @patch("scribe_data.cli.total.query.data_type_metadata", new_callable=dict)
+ def test_get_qid_by_input_invalid(self, mock_data_type_metadata: MagicMock) -> None:
+ mock_data_type_metadata.update(self.valid_data_types)
+
+ self.assertIsNone(get_qid_by_input("invalid_data_type"))
+
+
+class TestGetDatatypeList(unittest.TestCase):
+ @patch("scribe_data.cli.total.print_values.WIKIDATA_QUERIES_ALL_DATA_DIR")
+ def test_get_datatype_list_invalid_language(self, mock_dir: MagicMock) -> None:
+ mock_dir.__truediv__.return_value.exists.return_value = False
+
+ with self.assertRaises(ValueError):
+ get_datatype_list("InvalidLanguage")
+
+ @patch("scribe_data.cli.total.print_values.WIKIDATA_QUERIES_ALL_DATA_DIR")
+ def test_get_datatype_list_no_data_types(self, mock_dir: MagicMock) -> None:
+ mock_dir.__truediv__.return_value.exists.return_value = True
+ mock_dir.__truediv__.return_value.iterdir.return_value = []
+
+ with self.assertRaises(ValueError):
+ get_datatype_list("English")
+
+
+class TestCheckQidIsLanguage(unittest.TestCase):
+ @patch("scribe_data.utils.requests.get")
+ def test_check_qid_is_language_valid(self, mock_get: MagicMock) -> None:
+ mock_response = MagicMock()
+ mock_response.json.return_value = {
+ "statements": {
+ wikidata_qids_pids["instance_of"]: [{"value": {"content": "Q34770"}}]
+ },
+ "labels": {"en": "English"},
+ }
+ mock_get.return_value = mock_response
+
+ with patch("builtins.print") as mock_print:
+ result = check_qid_is_language("Q1860")
+
+ self.assertEqual(result, "English")
+ mock_print.assert_called_once_with("English (Q1860) is a language.\n")
+
+ @patch("scribe_data.utils.requests.get")
+ def test_check_qid_is_language_invalid(self, mock_get: MagicMock) -> None:
+ mock_response = MagicMock()
+ mock_response.json.return_value = {
+ "statements": {
+ wikidata_qids_pids["instance_of"]: [{"value": {"content": "Q5"}}]
+ },
+ "labels": {"en": "Human"},
+ }
+ mock_get.return_value = mock_response
+
+ with self.assertRaises(ValueError):
+ check_qid_is_language("Q5")
diff --git a/tests/cli/total/test_cli_total_wrapper.py b/tests/cli/total/test_cli_total_wrapper.py
new file mode 100644
index 000000000..3f5e1a7fa
--- /dev/null
+++ b/tests/cli/total/test_cli_total_wrapper.py
@@ -0,0 +1,382 @@
+# SPDX-License-Identifier: GPL-3.0-or-later
+"""
+Tests for the CLI total wrapper functionality.
+"""
+
+import unittest
+from http.client import IncompleteRead
+from pathlib import Path
+from unittest.mock import MagicMock, call, patch
+from urllib.error import HTTPError
+
+import yaml
+
+from scribe_data.cli.total.print_values import get_datatype_list
+from scribe_data.cli.total.query import query_total_lexemes
+from scribe_data.cli.total.wrapper import total_wrapper
+from scribe_data.utils import DEFAULT_WIKIDATA_DUMP_EXPORT_DIR, WIKIDATA_QIDS_PIDS_FILE
+
+try:
+ with WIKIDATA_QIDS_PIDS_FILE.open("r", encoding="utf-8") as file:
+ wikidata_qids_pids = yaml.safe_load(file)
+
+except (IOError, yaml.YAMLError) as e:
+ print(f"Error reading wikidata QIDs/PIDs metadata: {e}")
+
+# MARK: Wrapper
+
+
+class TestCLITotalWrapper(unittest.TestCase):
+ @patch("scribe_data.cli.total.wrapper.print_total_lexemes")
+ def test_cli_total_wrapper_all_bool(
+ self, mock_print_total_lexemes: MagicMock
+ ) -> None:
+ total_wrapper(all_bool=True)
+ mock_print_total_lexemes.assert_called_once_with()
+
+ @patch("scribe_data.cli.total.wrapper.print_total_lexemes")
+ def test_cli_total_wrapper_language_only(
+ self, mock_print_total_lexemes: MagicMock
+ ) -> None:
+ total_wrapper(languages=["English"])
+ mock_print_total_lexemes.assert_called_once_with(language="English")
+
+ @patch("scribe_data.cli.total.wrapper.query_total_lexemes")
+ def test_cli_total_wrapper_language_and_data_type(
+ self, mock_query_total_lexemes: MagicMock
+ ) -> None:
+ total_wrapper(languages=["English"], data_types=["nouns"])
+ mock_query_total_lexemes.assert_called_once_with(
+ language="English", data_type="nouns"
+ )
+
+ def test_cli_total_wrapper_invalid_input(self) -> None:
+ with self.assertRaises(ValueError):
+ total_wrapper()
+
+ # MARK: Using Dump
+
+ @patch("scribe_data.cli.total.wrapper.parse_wd_lexeme_dump")
+ def test_cli_total_wrapper_wikidata_dump_flag(
+ self, mock_parse_dump: MagicMock
+ ) -> None:
+ """
+ Test when wikidata_dump is True (flag without path).
+ """
+ total_wrapper(wikidata_dump=True)
+ mock_parse_dump.assert_called_once_with(
+ languages=["all"],
+ data_types=["all"],
+ wikidata_dump_type=["total"],
+ wikidata_dump_path=DEFAULT_WIKIDATA_DUMP_EXPORT_DIR,
+ )
+
+ @patch("scribe_data.cli.total.wrapper.parse_wd_lexeme_dump")
+ def test_cli_total_wrapper_wikidata_dump_with_all(
+ self, mock_parse_dump: MagicMock
+ ) -> None:
+ """
+ Test when both wikidata_dump and all_bool are True.
+ """
+ total_wrapper(wikidata_dump=True, all_bool=True)
+ mock_parse_dump.assert_called_once_with(
+ languages=["all"],
+ data_types=["all"],
+ wikidata_dump_type=["total"],
+ wikidata_dump_path=DEFAULT_WIKIDATA_DUMP_EXPORT_DIR,
+ )
+
+ @patch("scribe_data.cli.total.wrapper.parse_wd_lexeme_dump")
+ def test_cli_total_wrapper_wikidata_dump_with_language_and_type(
+ self, mock_parse_dump: MagicMock
+ ) -> None:
+ """
+ Test wikidata_dump with specific language and data type.
+ """
+ total_wrapper(
+ languages=["English"],
+ data_types=["nouns"],
+ wikidata_dump=Path("/path/to/dump.json"),
+ )
+ mock_parse_dump.assert_called_once_with(
+ languages=["English"],
+ data_types=["nouns"],
+ wikidata_dump_type=["total"],
+ wikidata_dump_path=Path("/path/to/dump.json"),
+ )
+
+ # MARK: Using QID
+
+ @patch("scribe_data.cli.total.print_values.check_qid_is_language")
+ @patch("scribe_data.cli.total.wrapper.print_total_lexemes")
+ def test_cli_total_wrapper_with_qid(
+ self, mock_print_total: MagicMock, mock_check_qid: MagicMock
+ ) -> None:
+ """
+ Test when language is provided as a QID.
+ """
+ mock_check_qid.return_value = "Thai"
+ total_wrapper(languages=["Q9217"])
+ mock_print_total.assert_called_once_with(language="Q9217")
+
+ @patch("scribe_data.cli.total.print_values.check_qid_is_language")
+ @patch("scribe_data.cli.total.wrapper.query_total_lexemes")
+ def test_cli_total_wrapper_with_qid_and_datatype(
+ self, mock_query_total_lexemes: MagicMock, mock_check_qid: MagicMock
+ ) -> None:
+ """
+ Test when language QID and data type are provided.
+ """
+ mock_check_qid.return_value = "Thai"
+ total_wrapper(languages=["Q9217"], data_types=["nouns"])
+ mock_query_total_lexemes.assert_called_once_with(
+ language="Q9217", data_type="nouns"
+ )
+
+ @patch("scribe_data.cli.total.wrapper.parse_wd_lexeme_dump")
+ def test_cli_total_wrapper_qid_with_wikidata_dump(
+ self, mock_parse_dump: MagicMock
+ ) -> None:
+ """
+ Test QID with wikidata dump.
+ """
+ total_wrapper(languages=["Q9217"], wikidata_dump=True, all_bool=True)
+ mock_parse_dump.assert_called_once_with(
+ languages=["Q9217"],
+ data_types=["all"],
+ wikidata_dump_type=["total"],
+ wikidata_dump_path=DEFAULT_WIKIDATA_DUMP_EXPORT_DIR,
+ )
+
+ @patch("scribe_data.cli.total.wrapper.query_total_lexemes")
+ def test_query_total_lexemes_with_qid(
+ self, mock_query_total_lexemes: MagicMock
+ ) -> None:
+ """
+ Test query_total_lexemes with QID input.
+ """
+ total_wrapper(languages=["Q9217"], data_types=["Q1084"]) # Q1084 is the QID for nouns.
+ mock_query_total_lexemes.assert_called_once_with(
+ language="Q9217", data_type="Q1084"
+ )
+
+ # MARK: Multiple Languages and Data Types
+
+ @patch("scribe_data.cli.total.wrapper.query_total_lexemes")
+ def test_cli_total_wrapper_multiple_languages(
+ self, mock_query_total_lexemes: MagicMock
+ ) -> None:
+ """
+ Test retrieving totals for multiple languages.
+ """
+ # Mock return value to avoid formatting error.
+ mock_query_total_lexemes.return_value = 100
+
+ total_wrapper(languages=["English", "German"], data_types=["nouns"])
+
+ expected_calls = [
+ call(language="English", data_type="nouns", do_print=False),
+ call(language="German", data_type="nouns", do_print=False),
+ ]
+ mock_query_total_lexemes.assert_has_calls(expected_calls)
+
+ @patch("scribe_data.cli.total.wrapper.query_total_lexemes")
+ def test_cli_total_wrapper_multiple_data_types(
+ self, mock_query_total_lexemes: MagicMock
+ ) -> None:
+ """
+ Test retrieving totals for multiple data types.
+ """
+ # Mock return value to avoid formatting error.
+ mock_query_total_lexemes.return_value = 100
+
+ total_wrapper(languages=["English"], data_types=["nouns", "verbs"])
+
+ expected_calls = [
+ call(language="English", data_type="nouns", do_print=False),
+ call(language="English", data_type="verbs", do_print=False),
+ ]
+ mock_query_total_lexemes.assert_has_calls(expected_calls)
+
+ @patch("scribe_data.cli.total.wrapper.query_total_lexemes")
+ def test_cli_total_wrapper_multiple_languages_and_types(
+ self, mock_query_total_lexemes: MagicMock
+ ) -> None:
+ """
+ Test retrieving totals for multiple languages and data types.
+ """
+ # Mock return value to avoid formatting error.
+ mock_query_total_lexemes.return_value = 100
+
+ total_wrapper(languages=["English", "German"], data_types=["nouns", "verbs"])
+
+ expected_calls = [
+ call(language="English", data_type="nouns", do_print=False),
+ call(language="English", data_type="verbs", do_print=False),
+ call(language="German", data_type="nouns", do_print=False),
+ call(language="German", data_type="verbs", do_print=False),
+ ]
+ mock_query_total_lexemes.assert_has_calls(expected_calls)
+
+ # MARK: Error Handling
+
+ @patch("scribe_data.cli.total.query.sparql.query")
+ def test_query_total_lexemes_http_error(self, mock_query: MagicMock) -> None:
+ """
+ Test handling of HTTPError when querying totals.
+ """
+ # Raise HTTPError on every attempt so that the query fails after its retries.
+ mock_query.side_effect = [
+ HTTPError(url="test", code=500, msg="error", hdrs={}, fp=None),
+ HTTPError(url="test", code=500, msg="error", hdrs={}, fp=None),
+ HTTPError(url="test", code=500, msg="error", hdrs={}, fp=None),
+ ]
+
+ with patch("builtins.print") as mock_print:
+ result = query_total_lexemes(language="English", data_type="nouns")
+
+ self.assertIsNone(result)
+ mock_print.assert_any_call("Query failed after retries.")
+
+ @patch("scribe_data.cli.total.query.sparql.query")
+ def test_query_total_lexemes_incomplete_read(self, mock_query: MagicMock) -> None:
+ """
+ Test handling of IncompleteRead error when querying totals.
+ """
+ # Raise IncompleteRead on every attempt so that the query fails after its retries.
+ mock_query.side_effect = [
+ IncompleteRead(partial=b""),
+ IncompleteRead(partial=b""),
+ IncompleteRead(partial=b""),
+ ]
+
+ with patch("builtins.print") as mock_print:
+ result = query_total_lexemes(language="English", data_type="nouns")
+
+ self.assertIsNone(result)
+ mock_print.assert_any_call("Query failed after retries.")
+
+ # MARK: Data Type List Handling
+
+ @patch("scribe_data.cli.total.print_values.language_metadata")
+ @patch("scribe_data.cli.total.print_values.list_all_languages")
+ @patch("scribe_data.cli.total.print_values.WIKIDATA_QUERIES_ALL_DATA_DIR")
+ def test_get_datatype_list_with_sublanguages(
+ self,
+ mock_dir: MagicMock,
+ mock_list_languages: MagicMock,
+ mock_metadata: MagicMock,
+ ) -> None:
+ """
+ Test getting data type list for a language with sub-languages.
+ """
+ # Mock language metadata and list_all_languages.
+ mock_metadata_dict = {
+ "norwegian": {
+ "sub_languages": {"bokmal": {"iso": "nb"}, "nynorsk": {"iso": "nn"}}
+ }
+ }
+
+ # Mock dictionary-like behavior for language_metadata.
+ mock_metadata.__iter__.return_value = mock_metadata_dict.items()
+ mock_metadata.items.return_value = mock_metadata_dict.items()
+ mock_metadata.get.return_value = mock_metadata_dict["norwegian"]
+ mock_metadata.__getitem__.return_value = mock_metadata_dict["norwegian"]
+
+ mock_list_languages.return_value = ["norwegian"]
+
+ # Create mock directory entries with proper string names.
+ mock_nouns = MagicMock()
+ mock_nouns.name = "nouns"
+ mock_nouns.is_dir.return_value = True
+
+ mock_verbs = MagicMock()
+ mock_verbs.name = "verbs"
+ mock_verbs.is_dir.return_value = True
+
+ # Mock directory structure for both sub-languages.
+ def mock_path_handler(path: str) -> MagicMock:
+ mock_path = MagicMock()
+ mock_path.exists.return_value = True
+ mock_path.iterdir.return_value = [mock_nouns, mock_verbs]
+ return mock_path
+
+ mock_dir.__truediv__.side_effect = mock_path_handler
+
+ result = get_datatype_list("norwegian") # note: lowercase
+ self.assertEqual(sorted(result), ["nouns", "verbs"])
+
+ @patch("scribe_data.cli.total.print_values.language_metadata")
+ @patch("scribe_data.cli.total.print_values.WIKIDATA_QUERIES_ALL_DATA_DIR")
+ def test_get_datatype_list_empty_directory(
+ self, mock_dir: MagicMock, mock_metadata: MagicMock
+ ) -> None:
+ """
+ Test getting data type list from an empty directory.
+ """
+ # Mock language metadata.
+ mock_metadata.get.return_value = {}
+
+ mock_dir.__truediv__.return_value.exists.return_value = True
+ mock_dir.__truediv__.return_value.iterdir.return_value = []
+
+ with self.assertRaises(ValueError):
+ get_datatype_list("English")
+
+ @patch("scribe_data.cli.total.wrapper.query_total_lexemes")
+ def test_cli_total_wrapper_with_invalid_language(
+ self, mock_query_total_lexemes: MagicMock
+ ) -> None:
+ """
+ Test total wrapper with invalid language.
+ """
+ mock_query_total_lexemes.side_effect = ValueError("Invalid language")
+
+ with self.assertRaises(ValueError):
+ total_wrapper(languages=["invalid_lang"], data_types=["nouns"])
+
+ mock_query_total_lexemes.assert_called_once()
+
+ @patch("scribe_data.cli.total.wrapper.query_total_lexemes")
+ def test_cli_total_wrapper_with_invalid_data_type(
+ self, mock_query_total_lexemes: MagicMock
+ ) -> None:
+ """
+ Test total wrapper with invalid data type.
+ """
+ mock_query_total_lexemes.side_effect = ValueError("Invalid data type")
+
+ with self.assertRaises(ValueError):
+ total_wrapper(languages=["English"], data_types=["invalid_type"])
+
+ mock_query_total_lexemes.assert_called_once()
+
+ # MARK: Sub-language Handling
+
+ @patch("scribe_data.cli.total.print_values.get_datatype_list")
+ @patch("scribe_data.cli.total.wrapper.query_total_lexemes")
+ def test_print_total_lexemes_with_sublanguages(
+ self, mock_query_total_lexemes: MagicMock, mock_get_datatypes: MagicMock
+ ) -> None:
+ """
+ Test printing totals for a language with sub-languages.
+ """
+ mock_get_datatypes.return_value = ["nouns", "verbs"]
+ mock_query_total_lexemes.return_value = 100
+
+ with patch("builtins.print") as mock_print:
+ total_wrapper(languages=["Norwegian"], data_types=["nouns", "verbs"])
+
+ # Verify header was printed.
+ mock_print.assert_any_call(
+ f"{'Language':<20} {'Data Type':<25} {'Total Wikidata Lexemes':<25}"
+ )
+ mock_print.assert_any_call("=" * 70)
+
+ # Verify data was printed for each data type.
+ mock_query_total_lexemes.assert_any_call(
+ language="Norwegian", data_type="nouns", do_print=False
+ )
+ mock_query_total_lexemes.assert_any_call(
+ language="Norwegian", data_type="verbs", do_print=False
+ )
diff --git a/tests/resources/test_metadata.py b/tests/resources/test_resources_metadata.py
similarity index 93%
rename from tests/resources/test_metadata.py
rename to tests/resources/test_resources_metadata.py
index 70f490c7d..4909cacdb 100644
--- a/tests/resources/test_metadata.py
+++ b/tests/resources/test_resources_metadata.py
@@ -39,13 +39,13 @@ def check_file_readable(self, file_path: pathlib.Path) -> None:
# Catching any other file reading error
self.fail(f"Failed to read {file_path}: {str(e)}")
- def test_language_metadata_file_exists(self) -> None:
+ def test_resources_language_metadata_file_exists(self) -> None:
"""
Check if the language_metadata.yaml file exists.
"""
self.check_file_exists(LANGUAGE_METADATA_PATH)
- def test_language_metadata_file_readable(self) -> None:
+ def test_resources_language_metadata_file_readable(self) -> None:
"""
Check if the language_metadata.yaml file is readable.
"""
diff --git a/tests/load/test_update_utils.py b/tests/test_utils.py
similarity index 83%
rename from tests/load/test_update_utils.py
rename to tests/test_utils.py
index 2d107ad7a..47d736980 100644
--- a/tests/load/test_update_utils.py
+++ b/tests/test_utils.py
@@ -8,9 +8,15 @@
import pytest
-sys.path.append(Path(__file__).parent.parent.parent)
-
-from scribe_data import utils
+sys.path.append(str(Path(__file__).parent.parent))
+
+from scribe_data.utils import (
+ format_sublanguage_name,
+ get_language_from_iso,
+ get_language_iso,
+ get_language_qid,
+ list_all_languages,
+)
@pytest.mark.parametrize(
@@ -28,12 +34,12 @@
],
)
def test_get_language_qid_positive(language: str, qid_code: str) -> None:
- assert utils.get_language_qid(language) == qid_code
+ assert get_language_qid(language) == qid_code
def test_get_language_qid_negative() -> None:
with pytest.raises(ValueError) as excp:
- _ = utils.get_language_qid("Newspeak")
+ _ = get_language_qid("Newspeak")
assert (
str(excp.value)
@@ -56,12 +62,12 @@ def test_get_language_qid_negative() -> None:
],
)
def test_get_language_iso_positive(language: str, iso_code: str) -> None:
- assert utils.get_language_iso(language) == iso_code
+ assert get_language_iso(language) == iso_code
def test_get_language_iso_negative() -> None:
with pytest.raises(ValueError) as excp:
- _ = utils.get_language_iso("Gibberish")
+ _ = get_language_iso("Gibberish")
assert (
str(excp.value)
@@ -84,12 +90,12 @@ def test_get_language_iso_negative() -> None:
],
)
def test_get_language_from_iso_positive(iso_code: str, language: str) -> None:
- assert utils.get_language_from_iso(iso_code) == language
+ assert get_language_from_iso(iso_code) == language
def test_get_language_from_iso_negative() -> None:
with pytest.raises(ValueError) as excp:
- _ = utils.get_language_from_iso("ixi")
+ _ = get_language_from_iso("ixi")
assert str(excp.value) == "IXI is currently not a supported ISO language."
@@ -103,7 +109,7 @@ def test_get_language_from_iso_negative() -> None:
],
)
def test_format_sublanguage_name_positive(lang: str, expected_output: str) -> None:
- assert utils.format_sublanguage_name(lang) == expected_output
+ assert format_sublanguage_name(lang) == expected_output
@pytest.mark.parametrize(
@@ -114,12 +120,12 @@ def test_format_sublanguage_name_positive(lang: str, expected_output: str) -> No
],
)
def test_format_sublanguage_name_qid_positive(lang: str, expected_output: str) -> None:
- assert utils.format_sublanguage_name(lang) == expected_output
+ assert format_sublanguage_name(lang) == expected_output
def test_format_sublanguage_name_negative() -> None:
with pytest.raises(ValueError) as excp:
- _ = utils.format_sublanguage_name("Newspeak")
+ _ = format_sublanguage_name("Newspeak")
assert str(excp.value) == "Newspeak is not a valid language or sub-language."
@@ -174,4 +180,4 @@ def test_list_all_languages() -> None:
"yoruba",
]
- assert utils.list_all_languages() == expected_languages
+ assert list_all_languages() == expected_languages
diff --git a/tests/unicode/test_generate_emoji_keywords.py b/tests/unicode/test_unicode_generate_emoji_keywords.py
similarity index 94%
rename from tests/unicode/test_generate_emoji_keywords.py
rename to tests/unicode/test_unicode_generate_emoji_keywords.py
index 858f89848..980542e5c 100644
--- a/tests/unicode/test_generate_emoji_keywords.py
+++ b/tests/unicode/test_unicode_generate_emoji_keywords.py
@@ -46,7 +46,7 @@ def mock_process_unicode() -> Iterator[MagicMock]:
yield mock_lexicon
-def test_generate_emoji_success(
+def test_unicode_generate_emoji_success(
mock_pyicu: tuple[MagicMock, MagicMock],
mock_utils: tuple[MagicMock, MagicMock],
mock_process_unicode: MagicMock,
@@ -70,7 +70,7 @@ def test_generate_emoji_success(
mock_export.assert_called_once()
-def test_generate_emoji_pyicu_not_installed(
+def test_unicode_generate_emoji_pyicu_not_installed(
mock_pyicu: tuple[MagicMock, MagicMock],
) -> None:
mock_check_install, mock_check_installed = mock_pyicu
@@ -83,7 +83,7 @@ def test_generate_emoji_pyicu_not_installed(
mock_check_installed.assert_called_once()
-def test_generate_emoji_unsupported_language(
+def test_unicode_generate_emoji_unsupported_language(
mock_pyicu: tuple[MagicMock, MagicMock],
mock_utils: tuple[MagicMock, MagicMock],
tmp_path: Path,
@@ -102,7 +102,7 @@ def test_generate_emoji_unsupported_language(
mock_iso.assert_called_once_with(language="xx")
-def test_generate_emoji_output_dir_handling(
+def test_unicode_generate_emoji_output_dir_handling(
mock_pyicu: tuple[MagicMock, MagicMock],
mock_utils: tuple[MagicMock, MagicMock],
mock_process_unicode: MagicMock,
diff --git a/tests/wikidata/test_check_query.py b/tests/wikidata/test_wikidata_check_query.py
similarity index 83%
rename from tests/wikidata/test_check_query.py
rename to tests/wikidata/test_wikidata_check_query.py
index 882f844d8..65b7c4eeb 100755
--- a/tests/wikidata/test_check_query.py
+++ b/tests/wikidata/test_wikidata_check_query.py
@@ -46,42 +46,42 @@ def a_query() -> QueryFile:
# MARK: Query
-def test_full_path(a_query: QueryFile) -> None:
+def test_wikidata_full_path(a_query: QueryFile) -> None:
assert a_query.path == A_PATH
@patch("builtins.open", new_callable=mock_open, read_data="QUERY")
-def test_query_load(_: MagicMock, a_query: QueryFile) -> None:
+def test_wikidata_query_load(_: MagicMock, a_query: QueryFile) -> None:
assert a_query.load(12) == "QUERY\nLIMIT 12\n"
-def test_query_equals(a_query: QueryFile) -> None:
+def test_wikidata_query_equals(a_query: QueryFile) -> None:
assert a_query == QueryFile(A_PATH)
-def test_query_not_equals(a_query: QueryFile) -> None:
+def test_wikidata_query_not_equals(a_query: QueryFile) -> None:
assert a_query != QueryFile(normalize_path("/root/project/src/Dir/query.sparql"))
-def test_query_not_equals_object(a_query: QueryFile) -> None:
+def test_wikidata_query_not_equals_object(a_query: QueryFile) -> None:
assert a_query != object()
-def test_query_str(a_query: QueryFile) -> None:
+def test_wikidata_query_str(a_query: QueryFile) -> None:
assert (
str(a_query)
== f"QueryFile(path={normalize_path('/root/project/src/dir/query.sparql')})"
)
-def test_query_repr(a_query: QueryFile) -> None:
+def test_wikidata_query_repr(a_query: QueryFile) -> None:
assert (
repr(a_query)
== f"QueryFile(path={normalize_path('/root/project/src/dir/query.sparql')})"
)
-def test_query_execution_exception(a_query: QueryFile) -> None:
+def test_wikidata_query_execution_exception(a_query: QueryFile) -> None:
exception = QueryExecutionException("failure", a_query)
assert str(exception) == f"{S_PATH} : failure"
@@ -90,7 +90,7 @@ def test_query_execution_exception(a_query: QueryFile) -> None:
@patch("urllib.request.urlopen")
-def test_ping_pass(mock_urlopen: MagicMock) -> None:
+def test_wikidata_ping_pass(mock_urlopen: MagicMock) -> None:
mock_urlopen.return_value.__enter__.return_value.getcode.return_value = (
HTTPStatus.OK
)
@@ -98,19 +98,19 @@ def test_ping_pass(mock_urlopen: MagicMock) -> None:
@patch("urllib.request.urlopen")
-def test_ping_httperror_fail(mock_urlopen: MagicMock) -> None:
+def test_wikidata_ping_httperror_fail(mock_urlopen: MagicMock) -> None:
mock_urlopen.return_value.__enter__.side_effect = HTTPError
assert not ping("http://www.python.org", 0)
@patch("urllib.request.urlopen")
-def test_ping_exception_fail(mock_urlopen: MagicMock) -> None:
+def test_wikidata_ping_exception_fail(mock_urlopen: MagicMock) -> None:
mock_urlopen.return_value.__enter__.side_effect = Exception
assert not ping("http://www.python.org", 0)
@patch("urllib.request.urlopen")
-def test_ping_fail(mock_urlopen: MagicMock) -> None:
+def test_wikidata_ping_fail(mock_urlopen: MagicMock) -> None:
mock_urlopen.return_value.__enter__.return_value.getcode.return_value = (
HTTPStatus.BAD_REQUEST
)
@@ -121,12 +121,12 @@ def test_ping_fail(mock_urlopen: MagicMock) -> None:
@patch.object(Path, "is_file", return_value=True)
-def test_check_sparql_file_exists(_: MagicMock) -> None:
+def test_wikidata_check_sparql_file_exists(_: MagicMock) -> None:
assert check_sparql_file(S_PATH) == A_PATH
@patch.object(Path, "is_file", return_value=False)
-def test_check_sparql_file_not_exists(_: MagicMock) -> None:
+def test_wikidata_check_sparql_file_not_exists(_: MagicMock) -> None:
with pytest.raises(argparse.ArgumentTypeError) as err:
_ = check_sparql_file(S_PATH)
@@ -134,7 +134,7 @@ def test_check_sparql_file_not_exists(_: MagicMock) -> None:
@patch.object(Path, "is_file", return_value=True)
-def test_check_sparql_file_not_sparql_extension(_: MagicMock) -> None:
+def test_wikidata_check_sparql_file_not_sparql_extension(_: MagicMock) -> None:
fpath = Path("/root/query.txt")
with pytest.raises(argparse.ArgumentTypeError) as err:
_ = check_sparql_file(fpath)
@@ -162,7 +162,7 @@ def test_check_sparql_file_not_sparql_extension(_: MagicMock) -> None:
],
)
@patch("subprocess.run")
-def test_changed_queries(
+def test_wikidata_changed_queries(
mock_run: MagicMock, git_status: str, expected: list[Any]
) -> None:
mock_result = MagicMock()
@@ -173,7 +173,7 @@ def test_changed_queries(
@patch("subprocess.run")
-def test_changed_queries_failure(
+def test_wikidata_changed_queries_failure(
mock_run: MagicMock, capsys: pytest.CaptureFixture
) -> None:
mock_result = MagicMock()
@@ -208,7 +208,7 @@ def test_changed_queries_failure(
),
],
)
-def test_all_queries(tree: list[Any], expected: list[Any]) -> None:
+def test_wikidata_all_queries(tree: list[Any], expected: list[Any]) -> None:
with patch("os.walk") as mock_walk:
mock_walk.return_value = tree
@@ -216,7 +216,7 @@ def test_all_queries(tree: list[Any], expected: list[Any]) -> None:
# MARK: execute
-def test_execute(a_query: QueryFile) -> None:
+def test_wikidata_execute(a_query: QueryFile) -> None:
with pytest.raises(QueryExecutionException) as err:
_ = execute(a_query, 1, None, 0)
@@ -232,7 +232,7 @@ def test_execute(a_query: QueryFile) -> None:
("1000", 1000),
],
)
-def test_check_limit_pos(candidate: str, limit: int) -> None:
+def test_wikidata_check_limit_pos(candidate: str, limit: int) -> None:
assert check_limit(candidate) == limit
@@ -245,7 +245,7 @@ def test_check_limit_pos(candidate: str, limit: int) -> None:
"word",
],
)
-def test_check_limit_neg(candidate: str) -> None:
+def test_wikidata_check_limit_neg(candidate: str) -> None:
with pytest.raises(argparse.ArgumentTypeError) as err:
_ = check_limit(candidate)
@@ -263,7 +263,7 @@ def test_check_limit_neg(candidate: str) -> None:
("8888", 8888),
],
)
-def test_check_timeout_pos(candidate: str, timeout: int) -> None:
+def test_wikidata_check_timeout_pos(candidate: str, timeout: int) -> None:
assert check_timeout(candidate) == timeout
@@ -276,7 +276,7 @@ def test_check_timeout_pos(candidate: str, timeout: int) -> None:
"ten",
],
)
-def test_check_timeout_neg(candidate: str) -> None:
+def test_wikidata_check_timeout_neg(candidate: str) -> None:
with pytest.raises(argparse.ArgumentTypeError) as err:
_ = check_timeout(candidate)
@@ -287,7 +287,7 @@ def test_check_timeout_neg(candidate: str) -> None:
@pytest.mark.parametrize("arg", ["-h", "--help"])
-def test_main_help(arg: str) -> None:
+def test_wikidata_main_help(arg: str) -> None:
with pytest.raises(SystemExit) as err:
_ = main(arg)
assert err.code == 0
@@ -304,7 +304,7 @@ def test_main_help(arg: str) -> None:
["-c", "-f", "-a"],
],
)
-def test_main_mutex_opts(args: list[str]) -> None:
+def test_wikidata_main_mutex_opts(args: list[str]) -> None:
"""
Some options cannot be used together.
"""
@@ -313,7 +313,9 @@ def test_main_mutex_opts(args: list[str]) -> None:
assert err.code == 2
-def test_error_report_single(a_query: QueryFile, capsys: pytest.CaptureFixture) -> None:
+def test_wikidata_error_report_single(
+ a_query: QueryFile, capsys: pytest.CaptureFixture
+) -> None:
failures = [QueryExecutionException("timeout", a_query)]
error_report(failures)
err_out = capsys.readouterr().err
@@ -323,7 +325,7 @@ def test_error_report_single(a_query: QueryFile, capsys: pytest.CaptureFixture)
)
-def test_error_report_multiple(
+def test_wikidata_error_report_multiple(
a_query: QueryFile, capsys: pytest.CaptureFixture
) -> None:
failures = [
@@ -339,12 +341,12 @@ def test_error_report_multiple(
)
-def test_error_report_no_errors(capsys: pytest.CaptureFixture) -> None:
+def test_wikidata_error_report_no_errors(capsys: pytest.CaptureFixture) -> None:
error_report([])
assert capsys.readouterr().err == ""
-def test_success_report_single_display_set(
+def test_wikidata_success_report_single_display_set(
a_query: QueryFile, capsys: pytest.CaptureFixture
) -> None:
successes = [(a_query, {"a": 23})]
@@ -356,7 +358,9 @@ def test_success_report_single_display_set(
)
-def test_success_report_no_success_display_set(capsys: pytest.CaptureFixture) -> None:
+def test_wikidata_success_report_no_success_display_set(
+ capsys: pytest.CaptureFixture,
+) -> None:
success_report([], display=True)
assert capsys.readouterr().out == ""
@@ -365,7 +369,7 @@ def test_success_report_no_success_display_set(capsys: pytest.CaptureFixture) ->
"successes",
[[], [(a_query, {"a": 23})], [(a_query, {"a": 23}), (a_query, {"b": 53})]],
)
-def test_success_report_display_not_set(
+def test_wikidata_success_report_display_not_set(
successes: list[Any], capsys: pytest.CaptureFixture
) -> None:
success_report(successes, display=False)
@@ -373,7 +377,7 @@ def test_success_report_display_not_set(
assert out == ""
-def test_success_report_multiple_display_set(
+def test_wikidata_success_report_multiple_display_set(
a_query: QueryFile, capsys: pytest.CaptureFixture
) -> None:
successes = [(a_query, {"a": 23}), (a_query, {"b": 57})]
@@ -389,14 +393,14 @@ def test_success_report_multiple_display_set(
# MARK: check_query_forms
-def test_qid_label_dict_not_empty() -> None:
+def test_wikidata_qid_label_dict_not_empty() -> None:
assert check_query_forms.qid_label_dict, "qid_label_dict should not be empty"
# MARK: extract_forms_from_sparql
-def test_extract_forms_from_sparql_valid_file(tmp_path: Path) -> None:
+def test_wikidata_extract_forms_from_sparql_valid_file(tmp_path: Path) -> None:
sparql_file = tmp_path / "test.sparql"
# The pattern r"\s\sOPTIONAL\s*\{([^}]*)\}" requires exactly two spaces before OPTIONAL.
sparql_file.write_text(" OPTIONAL { form1 } OPTIONAL { form2 }")
@@ -404,7 +408,7 @@ def test_extract_forms_from_sparql_valid_file(tmp_path: Path) -> None:
assert result == [" form1 ", " form2 "]
-def test_extract_forms_from_sparql_no_matches(tmp_path: Path) -> None:
+def test_wikidata_extract_forms_from_sparql_no_matches(tmp_path: Path) -> None:
sparql_file = tmp_path / "test.sparql"
sparql_file.write_text("SELECT * WHERE { }")
result = check_query_forms.extract_forms_from_sparql(sparql_file)
@@ -412,7 +416,7 @@ def test_extract_forms_from_sparql_no_matches(tmp_path: Path) -> None:
@patch("builtins.open", side_effect=Exception("File error"))
-def test_extract_forms_from_sparql_exception(
+def test_wikidata_extract_forms_from_sparql_exception(
mock_open: MagicMock, capsys: pytest.CaptureFixture
) -> None:
result = check_query_forms.extract_forms_from_sparql(Path("nonexistent.sparql"))
@@ -424,13 +428,13 @@ def test_extract_forms_from_sparql_exception(
# MARK: extract_form_rep_label
-def test_extract_form_rep_label_valid() -> None:
+def test_wikidata_extract_form_rep_label_valid() -> None:
form_text = "ontolex:representation ?testLabel ;"
result = check_query_forms.extract_form_rep_label(form_text)
assert result == "testLabel"
-def test_extract_form_rep_label_no_match() -> None:
+def test_wikidata_extract_form_rep_label_no_match() -> None:
form_text = "invalid text"
result = check_query_forms.extract_form_rep_label(form_text)
assert result is None
@@ -439,7 +443,7 @@ def test_extract_form_rep_label_no_match() -> None:
# MARK: decompose_label_features
-def test_decompose_label_features_valid() -> None:
+def test_wikidata_decompose_label_features_valid() -> None:
label = "nominativeSingular"
with patch.object(
check_query_forms, "lexeme_form_labels_order", ["Nominative", "Singular"]
@@ -448,7 +452,7 @@ def test_decompose_label_features_valid() -> None:
assert result == ["Nominative", "Singular"]
-def test_decompose_label_features_invalid() -> None:
+def test_wikidata_decompose_label_features_invalid() -> None:
label = "unknownFeature"
with patch.object(
check_query_forms, "lexeme_form_labels_order", ["Nominative", "Singular"]
@@ -457,7 +461,7 @@ def test_decompose_label_features_invalid() -> None:
assert result == ["UnknownFeature"]
-def test_decompose_label_features_empty() -> None:
+def test_wikidata_decompose_label_features_empty() -> None:
label = ""
result = check_query_forms.decompose_label_features(label)
assert result == []
@@ -466,13 +470,13 @@ def test_decompose_label_features_empty() -> None:
# MARK: extract_form_qids
-def test_extract_form_qids_valid() -> None:
+def test_wikidata_extract_form_qids_valid() -> None:
form_text = "wikibase:grammaticalFeature wd:Q123, wd:Q456 ."
result = check_query_forms.extract_form_qids(form_text)
assert result == ["Q123", "Q456"]
-def test_extract_form_qids_no_match() -> None:
+def test_wikidata_extract_form_qids_no_match() -> None:
form_text = "invalid text"
result = check_query_forms.extract_form_qids(form_text)
assert result is None
@@ -481,25 +485,25 @@ def test_extract_form_qids_no_match() -> None:
# MARK: check_form_label
-def test_check_form_label_match() -> None:
+def test_wikidata_check_form_label_match() -> None:
form_text = "?lexeme ontolex:lexicalForm ?testForm .\n?testForm ontolex:representation ?test ;"
result = check_query_forms.check_form_label(form_text)
assert result is True
-def test_check_form_label_no_form_label() -> None:
+def test_wikidata_check_form_label_no_form_label() -> None:
form_text = "invalid text"
result = check_query_forms.check_form_label(form_text)
assert result is False
-def test_check_form_label_no_rep_label() -> None:
+def test_wikidata_check_form_label_no_rep_label() -> None:
form_text = "?lexeme ontolex:lexicalForm ?testForm ."
result = check_query_forms.check_form_label(form_text)
assert result is False
-def test_check_form_label_mismatch() -> None:
+def test_wikidata_check_form_label_mismatch() -> None:
form_text = "?lexeme ontolex:lexicalForm ?testForm .\n?testForm ontolex:representation ?other ;"
result = check_query_forms.check_form_label(form_text)
assert result is False
@@ -508,19 +512,19 @@ def test_check_form_label_mismatch() -> None:
# MARK: check_query_formatting
-def test_check_query_formatting_valid() -> None:
+def test_wikidata_check_query_formatting_valid() -> None:
form_text = "valid . text ;"
result = check_query_forms.check_query_formatting(form_text)
assert result is True
-def test_check_query_formatting_space_before_comma() -> None:
+def test_wikidata_check_query_formatting_space_before_comma() -> None:
form_text = "invalid , text"
result = check_query_forms.check_query_formatting(form_text)
assert result is False
-def test_check_query_formatting_nonspace_before_period() -> None:
+def test_wikidata_check_query_formatting_nonspace_before_period() -> None:
form_text = "invalid.text"
result = check_query_forms.check_query_formatting(form_text)
assert result is False
@@ -529,7 +533,7 @@ def test_check_query_formatting_nonspace_before_period() -> None:
# MARK: return_correct_form_label
-def test_return_correct_form_label_valid() -> None:
+def test_wikidata_return_correct_form_label_valid() -> None:
qids = ["Q123"]
with patch.object(check_query_forms, "lexeme_form_qid_order", ["Q123"]):
with patch.object(
@@ -541,12 +545,12 @@ def test_return_correct_form_label_valid() -> None:
assert result == "nominative"
-def test_return_correct_form_label_empty() -> None:
+def test_wikidata_return_correct_form_label_empty() -> None:
result = check_query_forms.return_correct_form_label([])
assert result == "Invalid query formatting found"
-def test_return_correct_form_label_not_included() -> None:
+def test_wikidata_return_correct_form_label_not_included() -> None:
qids = ["Q999"]
with patch.object(check_query_forms, "lexeme_form_qid_order", ["Q123"]):
result = check_query_forms.return_correct_form_label(qids)
@@ -620,7 +624,7 @@ def validate_forms(query_text: str) -> str:
# MARK: validate_forms
-def test_validate_forms_valid() -> None:
+def test_wikidata_validate_forms_valid() -> None:
# Ensure all variables in SELECT are defined in WHERE and order matches.
# Use ontolex:representation to define ?form so it matches forms_pattern.
query_text = """
@@ -640,13 +644,13 @@ def test_validate_forms_valid() -> None:
assert result == ""
-def test_validate_forms_no_select() -> None:
+def test_wikidata_validate_forms_no_select() -> None:
query_text = "WHERE { }"
result = check_query_forms.validate_forms(query_text)
assert result == "Invalid query format: no SELECT match"
-def test_validate_forms_duplicates() -> None:
+def test_wikidata_validate_forms_duplicates() -> None:
query_text = """
SELECT
?lexeme
@@ -665,7 +669,7 @@ def test_validate_forms_duplicates() -> None:
assert "Duplicate forms found in SELECT: form" in result
-def test_validate_forms_undefined() -> None:
+def test_wikidata_validate_forms_undefined() -> None:
query_text = """
SELECT
?lexeme
@@ -681,7 +685,7 @@ def test_validate_forms_undefined() -> None:
assert "Undefined forms found in SELECT: form" in result
-def test_validate_forms_unreturned() -> None:
+def test_wikidata_validate_forms_unreturned() -> None:
query_text = """
SELECT
?lexeme
@@ -699,7 +703,7 @@ def test_validate_forms_unreturned() -> None:
assert "Defined but unreturned forms found: formRep" in result
-def test_validate_forms_order_mismatch() -> None:
+def test_wikidata_validate_forms_order_mismatch() -> None:
# Ensure variables are defined, then create an order mismatch.
# Both ?form and ?formRep must be captured by forms_pattern.
query_text = """
@@ -728,13 +732,13 @@ def test_validate_forms_order_mismatch() -> None:
# MARK: check_docstring
-def test_check_docstring_valid() -> None:
+def test_wikidata_check_docstring_valid() -> None:
query_text = "# tool: scribe-data\n# All nouns (Q123) and verbs (Q456) and the given forms.\n# Enter this query at https://query.wikidata.org/.\n"
result = check_query_forms.check_docstring(query_text)
assert result is True
-def test_check_docstring_invalid_line1() -> None:
+def test_wikidata_check_docstring_invalid_line1() -> None:
query_text = "# wrong tool\n# All nouns (Q123) and verbs (Q456) and the given forms.\n# Enter this query at https://query.wikidata.org/.\n"
result = check_query_forms.check_docstring(query_text)
assert result == (False, "Error in line 1: # wrong tool")
@@ -743,7 +747,7 @@ def test_check_docstring_invalid_line1() -> None:
# MARK: check_forms_order
-def test_check_forms_order_valid() -> None:
+def test_wikidata_check_forms_order_valid() -> None:
query_text = """
SELECT
?lexeme
@@ -763,7 +767,7 @@ def test_check_forms_order_valid() -> None:
assert result is True
-def test_check_forms_order_invalid(capsys: pytest.CaptureFixture) -> None:
+def test_wikidata_check_forms_order_invalid(capsys: pytest.CaptureFixture) -> None:
query_text = """
SELECT
?lexeme
@@ -788,7 +792,7 @@ def test_check_forms_order_invalid(capsys: pytest.CaptureFixture) -> None:
# MARK: check_optional_qid_order
-def test_check_optional_qid_order_valid(tmp_path: Path) -> None:
+def test_wikidata_check_optional_qid_order_valid(tmp_path: Path) -> None:
sparql_file = tmp_path / "test.sparql"
sparql_file.write_text(
"  OPTIONAL { ?lexeme ontolex:lexicalForm ?form . ?form ontolex:representation ?nominative ; wikibase:grammaticalFeature wd:Q123 . }"
@@ -798,7 +802,7 @@ def test_check_optional_qid_order_valid(tmp_path: Path) -> None:
assert result == ""
-def test_check_optional_qid_order_invalid(tmp_path: Path) -> None:
+def test_wikidata_check_optional_qid_order_invalid(tmp_path: Path) -> None:
sparql_file = tmp_path / "test.sparql"
sparql_file.write_text(
"  OPTIONAL { ?lexeme ontolex:lexicalForm ?form . ?form ontolex:representation ?nominative ; wikibase:grammaticalFeature wd:Q456 . }"
@@ -814,7 +818,7 @@ def test_check_optional_qid_order_invalid(tmp_path: Path) -> None:
@patch("pathlib.Path.glob", return_value=[])
-def test_check_query_forms_no_files(
+def test_wikidata_check_query_forms_no_files(
mock_glob: MagicMock, capsys: pytest.CaptureFixture
) -> None:
# Mock WIKIDATA_QUERIES_ALL_DATA_DIR as a Path object with the patched glob.
@@ -827,7 +831,7 @@ def test_check_query_forms_no_files(
@patch("pathlib.Path.glob")
-def test_check_query_forms_with_errors(
+def test_wikidata_check_query_forms_with_errors(
mock_glob: MagicMock, tmp_path: Path, capsys: pytest.CaptureFixture
) -> None:
sparql_file = tmp_path / "test.sparql"
diff --git a/tests/cli/test_dump.py b/tests/wikidata/test_wikidata_dump.py
similarity index 93%
rename from tests/cli/test_dump.py
rename to tests/wikidata/test_wikidata_dump.py
index 70218c80a..f56be809c 100644
--- a/tests/cli/test_dump.py
+++ b/tests/wikidata/test_wikidata_dump.py
@@ -59,7 +59,9 @@ def lexeme_processor() -> LexemeProcessor:
)
-def test_lexeme_processor_initialization(lexeme_processor: LexemeProcessor) -> None:
+def test_wikidata_lexeme_processor_initialization(
+ lexeme_processor: LexemeProcessor,
+) -> None:
"""
Test LexemeProcessor initialization with basic parameters.
"""
@@ -71,7 +73,7 @@ def test_lexeme_processor_initialization(lexeme_processor: LexemeProcessor) -> N
@patch("builtins.open", new_callable=mock_open, read_data=Sample_Lexeme_Line)
@patch("bz2.open")
-def test_process_file(
+def test_wikidata_process_file(
mock_bz2_open: MagicMock, mock_file: MagicMock, lexeme_processor: LexemeProcessor
) -> None:
"""
@@ -89,7 +91,7 @@ def test_process_file(
@patch("scribe_data.wikidata.parse_dump.LexemeProcessor")
-def test_parse_dump(mock_processor: MagicMock) -> None:
+def test_wikidata_parse_dump(mock_processor: MagicMock) -> None:
"""
Test the parse_dump function.
"""
@@ -105,7 +107,7 @@ def test_parse_dump(mock_processor: MagicMock) -> None:
@patch("scribe_data.wikidata.wikidata_utils.Path")
@patch("scribe_data.wikidata.wikidata_utils.wd_lexeme_dump_download_wrapper")
@patch("scribe_data.wikidata.wikidata_utils.parse_dump")
-def test_parse_wd_lexeme_dump(
+def test_wikidata_parse_wd_lexeme_dump(
mock_parse_dump: MagicMock, mock_download: MagicMock, mock_path_class: MagicMock
) -> None:
"""
@@ -159,7 +161,7 @@ def test_parse_wd_lexeme_dump(
assert kwargs["data_types"] == ["nouns"]
-def test_parse_wd_lexeme_dump_no_file() -> None:
+def test_wikidata_parse_wd_lexeme_dump_no_file() -> None:
"""
Test parse_wd_lexeme_dump when no file is found.
"""
@@ -186,7 +188,7 @@ def test_parse_wd_lexeme_dump_no_file() -> None:
({"total": True}, True),
],
)
-def test_parse_types(test_input: dict[str, bool], expected: bool) -> None:
+def test_wikidata_parse_types(test_input: dict[str, bool], expected: bool) -> None:
"""
Test different parse types.
"""
diff --git a/tests/wikidata/test_query_data.py b/tests/wikidata/test_wikidata_query_data.py
similarity index 98%
rename from tests/wikidata/test_query_data.py
rename to tests/wikidata/test_wikidata_query_data.py
index f04298c8e..329d7551e 100644
--- a/tests/wikidata/test_query_data.py
+++ b/tests/wikidata/test_wikidata_query_data.py
@@ -16,7 +16,7 @@
class TestQueryData(unittest.TestCase):
@patch("subprocess.run")
@patch("sys.executable", return_value="python")
- def test_execute_formatting_script(
+ def test_wikidata_execute_formatting_script(
self, mock_executable: MagicMock, mock_run: MagicMock
) -> None:
"""
@@ -55,7 +55,7 @@ def test_execute_formatting_script(
"/output/dir", "German", "nouns"
) # should print error but not raise exceptions
- def test_query_data_multiple_intervals(self) -> None:
+ def test_wikidata_query_data_multiple_intervals(self) -> None:
"""
Test query_data with multiple query intervals.
"""
@@ -166,7 +166,7 @@ def test_query_data_multiple_intervals(self) -> None:
out.getvalue(),
)
- def test_query_data_single_query_error(self) -> None:
+ def test_wikidata_query_data_single_query_error(self) -> None:
"""
Test that query_data handles a single query returning None.
"""
@@ -242,7 +242,7 @@ def test_query_data_single_query_error(self) -> None:
# Check that execute_formatting_script is not called.
mock_exec.assert_not_called()
- def test_query_data_multiple_intervals_error(self) -> None:
+ def test_wikidata_query_data_multiple_intervals_error(self) -> None:
"""
Test query_data with multiple query intervals where the second query throws an HTTPError
and subsequent queries return None.
diff --git a/tests/wiktionary/test_parse_translations.py b/tests/wiktionary/test_wiktionary_parse_translations.py
similarity index 92%
rename from tests/wiktionary/test_parse_translations.py
rename to tests/wiktionary/test_wiktionary_parse_translations.py
index 27650f2db..cebb0b1a3 100644
--- a/tests/wiktionary/test_parse_translations.py
+++ b/tests/wiktionary/test_wiktionary_parse_translations.py
@@ -33,7 +33,7 @@ def setUp(self):
self.ru_config = get_wiktionary_config(source_iso="ru")
self.bn_config = get_wiktionary_config(source_iso="bn")
- def test_bangla_lang_header_pattern_matches_vasha_template(self):
+ def test_wiktionary_bangla_lang_header_pattern_matches_vasha_template(self):
"""
bnwiktionary marks the Bangla block with ``== {{ভাষা|bn}} ==``, not ``==বাংলা==``.
"""
@@ -48,7 +48,7 @@ def test_bangla_lang_header_pattern_matches_vasha_template(self):
self.assertIn("বিশেষ্য", section)
self.assertNotIn("other", section)
- def test_extract_translation_word_junk_filter(self):
+ def test_wiktionary_extract_translation_word_junk_filter(self):
"""
Words like 'literally' that appear in ignored_strings are filtered out.
"""
@@ -58,7 +58,7 @@ def test_extract_translation_word_junk_filter(self):
)
self.assertIsNone(word)
- def test_extract_junk_prefixes(self):
+ def test_wiktionary_extract_junk_prefixes(self):
"""
Words starting with an ignored prefix like 'see: ' are filtered out.
"""
@@ -68,7 +68,7 @@ def test_extract_junk_prefixes(self):
)
self.assertIsNone(word)
- def test_extract_named_parameters(self):
+ def test_wiktionary_extract_named_parameters(self):
"""
Named template params (1=lang, 2=word) are resolved the same as positional ones.
"""
@@ -80,7 +80,7 @@ def test_extract_named_parameters(self):
)
self.assertEqual(word, "Mädchen (n)")
- def test_grammar_trailing_tags(self):
+ def test_wiktionary_grammar_trailing_tags(self):
"""
Grammar tags from trailing positional params are appended in parentheses.
"""
@@ -94,7 +94,7 @@ def test_grammar_trailing_tags(self):
)
self.assertEqual(word, "Blitz (m)")
- def test_full_page_parse(self):
+ def test_wiktionary_full_page_parse(self):
"""
Multiple POS sections with trans-top blocks are each parsed into separate sense entries.
"""
@@ -129,7 +129,7 @@ def test_full_page_parse(self):
self.assertEqual(res["de"]["verb"]["1"]["translation"], "prüfen")
self.assertEqual(res["de"]["verb"]["1"]["description"], "to test")
- def test_french_template_headers_parse(self):
+ def test_wiktionary_french_template_headers_parse(self):
"""
French-style {{S|nom|fr}} headers inside section titles are resolved to the right POS.
"""
@@ -149,7 +149,7 @@ def test_french_template_headers_parse(self):
self.assertEqual(res["en"]["noun"]["1"]["translation"], "word")
self.assertEqual(res["en"]["noun"]["1"]["description"], "un type de mot")
- def test_spanish_eswiktionary_t1_and_pos_heading(self):
+ def test_wiktionary_spanish_eswiktionary_t1_and_pos_heading(self):
"""
eswiktionary uses ``{{lengua|es}}``, POS in template names (``sustantivo masculino``),
and ``{{t|lang|t1=…|g1=…}}`` without a positional lemma parameter.
@@ -177,7 +177,7 @@ def test_spanish_eswiktionary_t1_and_pos_heading(self):
"Bare {{trad-arriba}} has no gloss; description is still emitted empty.",
)
- def test_swedish_o_topp_parse(self):
+ def test_wiktionary_swedish_o_topp_parse(self):
"""
svwiktionary uses ``==Svenska==``, ``{{ö-topp}}`` / ``{{ö-botten}}``, and ``{{ö+|lang|word}}``.
"""
@@ -195,7 +195,7 @@ def test_swedish_o_topp_parse(self):
self.assertEqual(res["en"]["noun"]["1"]["translation"], "book")
self.assertEqual(res["en"]["noun"]["1"]["description"], "större mängd text")
- def test_portuguese_h1_tradini_parse(self):
+ def test_wiktionary_portuguese_h1_tradini_parse(self):
"""
ptwiktionary uses ``={{-pt-}}=`` (H1) and ``{{tradini}}`` / ``{{tradfim}}`` with ``{{trad|}}`` / ``{{t|}}``.
"""
@@ -215,7 +215,7 @@ def test_portuguese_h1_tradini_parse(self):
self.assertIn("de", res)
self.assertEqual(res["de"]["noun"]["1"]["translation"], "Buch")
- def test_italian_trad1_wikilink_format(self):
+ def test_wiktionary_italian_trad1_wikilink_format(self):
"""
itwiktionary uses ``Trad1`` / ``Trad2`` blocks where each line is
``:* {{lang_code}}: [[word1]], [[word2]]`` — bare wikilinks, NOT {{t|}} templates.
@@ -240,7 +240,7 @@ def test_italian_trad1_wikilink_format(self):
self.assertIn("de", res)
self.assertEqual(res["de"]["noun"]["1"]["translation"], "Buch")
- def test_italian_trad1_multi_sense(self):
+ def test_wiktionary_italian_trad1_multi_sense(self):
"""
Multiple Trad1/Trad2 blocks for the same POS produce separate sense indices.
"""
@@ -263,7 +263,7 @@ def test_italian_trad1_multi_sense(self):
self.assertEqual(noun_senses["1"]["translation"], "book")
self.assertEqual(noun_senses["2"]["translation"], "reservation, booking")
- def test_russian_h1_lang_section_boundary(self):
+ def test_wiktionary_russian_h1_lang_section_boundary(self):
"""
ruwiktionary language sections start with ``= {{-ru-}} =``; the next H1 closes the section.
"""
@@ -279,7 +279,7 @@ def test_russian_h1_lang_section_boundary(self):
self.assertIn("inside ru", section)
self.assertNotIn("english side", section)
- def test_german_ast_u_tabelle_parse(self):
+ def test_wiktionary_german_ast_u_tabelle_parse(self):
"""
The Ü-Tabelle format used by German Wiktionary is parsed correctly.
"""
@@ -298,7 +298,7 @@ def test_german_ast_u_tabelle_parse(self):
self.assertEqual(res["en"]["noun"]["1"]["translation"], "word")
self.assertEqual(res["fr"]["noun"]["1"]["translation"], "mot")
- def test_extract_source_lang_section(self):
+ def test_wiktionary_extract_source_lang_section(self):
"""
The correct language section is extracted and neighbouring sections are excluded.
"""
@@ -316,7 +316,7 @@ def test_extract_source_lang_section(self):
self.assertIsNotNone(section2)
self.assertIn("===Verb===", section2)
- def test_parse_page_worker_edge_cases(self):
+ def test_wiktionary_parse_page_worker_edge_cases(self):
"""
Worker returns None for empty or untranslated pages.
"""
@@ -325,7 +325,7 @@ def test_parse_page_worker_edge_cases(self):
_parse_page_worker(("test", "no translations", frozenset(), self.en_config))
)
- def test_parse_xml_dump_with_dummy_file(self):
+ def test_wiktionary_parse_xml_dump_with_dummy_file(self):
"""
Both single-process and multi-process paths produce correct output from a dummy XML file.
"""
@@ -385,7 +385,7 @@ def test_parse_xml_dump_with_dummy_file(self):
finally:
Path(tmp_path).unlink()
- def test_parse_xml_dump_not_found(self):
+ def test_wiktionary_parse_xml_dump_not_found(self):
with self.assertRaises(FileNotFoundError):
parse_xml_dump(
"does_not_exist.xml.bz2",
@@ -393,7 +393,7 @@ def test_parse_xml_dump_not_found(self):
progress=False,
)
- def test_empty_xml_parsing(self):
+ def test_wiktionary_empty_xml_parsing(self):
"""
An empty XML file returns an empty result without raising.
"""
@@ -412,7 +412,7 @@ def test_empty_xml_parsing(self):
finally:
Path(tmp_path).unlink()
- def test_resolve_dump_path(self):
+ def test_wiktionary_resolve_dump_path(self):
"""
Explicit paths are returned as-is; missing paths return None with a sensible ISO.
"""
@@ -443,7 +443,7 @@ def test_resolve_dump_path(self):
finally:
Path(tmp_path).unlink()
- def test_get_output_subdir(self):
+ def test_wiktionary_get_output_subdir(self):
"""
Top-level languages map to their lowercase name; sub-languages include their parent.
"""
@@ -460,7 +460,7 @@ def test_get_output_subdir(self):
self.assertEqual(_get_output_subdir("Mandarin", meta), "chinese/mandarin")
self.assertEqual(_get_output_subdir("German", meta), "german")
- def test_parse_wiktionary_translations_mock(self):
+ def test_wiktionary_parse_wiktionary_translations_mock(self):
"""
Translations are written to the expected JSON file on disk.
"""