Python tool to extract sentences from po files and create language datasets for NLP machine learning
Project description
PO2Dataset
po2dataset is a Python command-line tool that extracts sentences from .po files and builds language datasets for machine translation.
It is intended to create dataset packages suitable for Argos Train.
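To make the extraction step concrete, here is a minimal sketch of pulling source/target sentence pairs out of PO text using only the standard library. It is not po2dataset's actual implementation: the function name `extract_pairs` is hypothetical, and the parser assumes simple entries (it ignores `msgctxt` and plural forms, and does not filter fuzzy entries).

```python
import ast


def extract_pairs(po_text):
    """Return (msgid, msgstr) pairs for translated, non-header entries.

    Hypothetical helper for illustration; not part of po2dataset.
    """
    pairs = []
    entry = {"msgid": "", "msgstr": ""}
    field = None  # keyword that the current quoted lines belong to

    def flush():
        # Skip the header entry (empty msgid) and untranslated entries.
        if entry["msgid"] and entry["msgstr"]:
            pairs.append((entry["msgid"], entry["msgstr"]))

    for raw in po_text.splitlines():
        line = raw.strip()
        if line.startswith("msgid "):
            flush()
            entry = {"msgid": "", "msgstr": ""}
            field, line = "msgid", line[len("msgid "):]
        elif line.startswith("msgstr "):
            field, line = "msgstr", line[len("msgstr "):]
        elif not line.startswith('"'):
            # Comments, blank lines, and keywords this sketch does not handle.
            field = None
            continue
        if field is not None:
            # PO strings use C-style escapes; literal_eval decodes the common ones.
            entry[field] += ast.literal_eval(line)
    flush()
    return pairs


sample = 'msgid "Hello"\nmsgstr "Kaixo"\n\nmsgid "Untranslated"\nmsgstr ""\n'
print(extract_pairs(sample))  # [('Hello', 'Kaixo')]
```

For real projects a dedicated PO parser is likely a better choice, since it also covers plural forms, contexts, and fuzzy flags.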
How to install
From pip
pip install po2dataset
Manual installation
Clone the repository and create a virtual environment inside it using virtualenv:
git clone https://github.com/urtzai/po2dataset.git
virtualenv po2dataset
cd po2dataset
source ./bin/activate
Quick start guide
Create a dataset suitable for Argos Train
po2dataset <path_to_po_file> --name <project_name> --source_code <source_lang_code> --target_code <target_lang_code> --ref "Some reference information of the project"
Where:
name
: The name of the project
source_code
: Source language code (ISO 639)
target_code
: Target language code (ISO 639)
ref
: Some reference information of the project
Optional arguments:
format
: Extension name of the zip file (default argosdata)
license
: License to add into the package (default CC0). Options are: CC0, CC-BY, CC-BY-SA
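The argument list above can be mirrored with `argparse` as a rough sketch. This is not the tool's actual source: the function name `build_parser` and the destination name `po_file` are hypothetical, and which flags are enforced as required is an assumption (here `--ref` is left optional).

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the documented flags; not po2dataset's code."""
    parser = argparse.ArgumentParser(prog="po2dataset")
    parser.add_argument("po_file", help="Path to the .po file")
    # Assumption: these three flags are required.
    parser.add_argument("--name", required=True, help="The name of the project")
    parser.add_argument("--source_code", required=True,
                        help="Source language code (ISO 639)")
    parser.add_argument("--target_code", required=True,
                        help="Target language code (ISO 639)")
    # Assumption: --ref may be omitted.
    parser.add_argument("--ref", default=None,
                        help="Some reference information of the project")
    # Defaults and choices as documented above.
    parser.add_argument("--format", default="argosdata",
                        help="Extension name of the zip file")
    parser.add_argument("--license", default="CC0",
                        choices=["CC0", "CC-BY", "CC-BY-SA"],
                        help="License to add into the package")
    return parser


args = build_parser().parse_args(
    ["yourfile.po", "--name", "MyProject", "--source_code", "en", "--target_code", "eu"]
)
print(args.format, args.license)  # argosdata CC0
```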
Usage Examples
Basic Dataset Creation
To create a dataset from a .po file for an English-Basque translation project, run:
po2dataset path/to/yourfile.po --name "MyProject" --source_code en --target_code eu --ref "Translation dataset for project X"
Specifying Format and License
For a custom file format and license, use:
po2dataset path/to/yourfile.po --name "MyProject" --source_code en --target_code eu --format "customzip" --license "CC-BY"
These commands create language dataset packages, with customizable file formats and licensing options.
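When the tool is driven from a script rather than typed by hand, the same command line can be assembled programmatically. The sketch below only builds the argument list from the flags documented above; the helper name `build_argv` is hypothetical and nothing here is part of po2dataset itself.

```python
import shlex


def build_argv(po_path, name, source_code, target_code, ref,
               fmt="argosdata", license_name="CC0"):
    """Assemble a po2dataset command line (hypothetical helper)."""
    return [
        "po2dataset", str(po_path),
        "--name", name,
        "--source_code", source_code,
        "--target_code", target_code,
        "--ref", ref,
        "--format", fmt,
        "--license", license_name,
    ]


argv = build_argv("path/to/yourfile.po", "MyProject", "en", "eu",
                  "Translation dataset for project X")
print(shlex.join(argv))  # shell-quoted form, for logging or copy-paste
```

With po2dataset installed, the list can be passed to `subprocess.run(argv, check=True)`. Passing a list rather than a single string avoids shell quoting problems in the `--ref` text.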
Support
If you experience any issues, do not hesitate to open an issue or contribute to the project with a pull request.
File details
Details for the file po2dataset-0.3.1.tar.gz.
File metadata
- Download URL: po2dataset-0.3.1.tar.gz
- Upload date:
- Size: 6.8 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.1.1 CPython/3.11.5
File hashes
Algorithm | Hash digest
---|---
SHA256 | dffd99172ad9c8b506274832f402c0d8b272e692fe6f83ab8b31cd31722e7c47
MD5 | d636d6dad6325e8728a4ff3b3a473598
BLAKE2b-256 | b651133741414a8db472b1494280a686e1c7e52b3ec8ab46e0250faed04f6356
File details
Details for the file po2dataset-0.3.1-py3-none-any.whl.
File metadata
- Download URL: po2dataset-0.3.1-py3-none-any.whl
- Upload date:
- Size: 6.2 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.1.1 CPython/3.11.5
File hashes
Algorithm | Hash digest
---|---
SHA256 | f2b3cca4fbe37823cfd612d61d22b83bfd54effcdb21dc94e42bc31263b7dd8d
MD5 | f77ca7bff61e9cd517367aae5772d978
BLAKE2b-256 | a6de38dcfb62f83bcb58839eaae3f902cf40c40159869d65d09491ac952144dc
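A downloaded file can be checked against the SHA256 digests listed above with Python's standard `hashlib` module. This is a minimal sketch; the function names `sha256_of` and `matches_published_hash` are hypothetical, and the local file path is whatever location the file was saved to.

```python
import hashlib


def sha256_of(path, chunk_size=65536):
    """Return the hex SHA256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path, expected_hex):
    """Compare a file's SHA256 digest with a published hex digest."""
    return sha256_of(path) == expected_hex.strip().lower()
```

For example, `matches_published_hash("po2dataset-0.3.1-py3-none-any.whl", "f2b3cca4fbe37823cfd612d61d22b83bfd54effcdb21dc94e42bc31263b7dd8d")` should return `True` for an intact copy of the wheel.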