A tool for managing and automating translations with OpenAI and PO files
ai18n - AI-Powered PO Translation Toolkit
ai18n is a toolkit for managing translations built on the GNU gettext framework, working with and augmenting your .po files. It integrates with the OpenAI API to automate translations, either run manually or as part of your CI/CD pipeline.
Key Features
- AI-Powered Translations: Automatically generate missing translations using OpenAI models like gpt-4.
- PO/YAML Syncing: Manage translations centrally in a YAML file while keeping traditional .po files in sync.
- Custom Reports: Generate reports showing translation coverage across languages and files.
- CLI Interface: Easy-to-use command-line interface for translating, syncing, and reporting on translations.
- Non-Destructive Workflow: Original translations remain intact, AI suggestions are added on top, and you control merging back into .po files.
How It Works
Through a simple CLI, ai18n loads your existing .po files into memory, calls the OpenAI API to generate translations where they are missing or incorrect, and saves the results to a central YAML file. You can then choose to merge these AI-generated translations back into your .po files as needed.
The framework is non-destructive: your original translations are preserved, and AI-generated translations are added incrementally. You have full control over merging them into your .po files based on your needs.
From an operational standpoint, we recommend that you commit the ai18n.yml file along with your PO files. It contains the latest state of everything ai18n-related, including what's in your .po files, AI-generated translations from previous runs, and all sorts of relevant metadata.
Usage
Configuration
While you can pass some of these as parameters to the commands, we suggest you prepare your environment by setting these environment variables:
# Set your target locales
export AI18N_TARGET_LANGUAGES=ar,de,es,fr,it,ja,ko,nl,pt,pt_BR,ru,sk,sl,tr,uk,zh,zh_TW
# Tell ai18n where your PO files are located
export AI18N_PO_FOLDER_ROOT="./translations/"
# This will be embedded into our prompt for extra context about your application / use cases
export AI18N_PROMPT_EXTRA_CONTEXT="These translations are part of Apache Superset, a Business Intelligence data exploration, visualization and dashboard open source application"
# If you want to alter/customize the Jinja template used to prompt the AI
export AI18N_TEMPLATE_FOLDER="./templates/"
# Where the big YAML file with all your translations lives
export AI18N_YAML_FILE=$AI18N_PO_FOLDER_ROOT/ai18n.yml
General Commands
# Set your OPENAI_API_KEY so that ai18n can use it
export OPENAI_API_KEY=<your-openai-api-key>
# Populate/sync the latest translations from PO files into the YAML file
ai18n po-pull
# Use OpenAI to fill in missing translations
ai18n translate
# Show a report of translation coverage for each locale
ai18n report
# Push translations from the YAML back into PO files while prioritizing AI
# over original translations (default is to prefer original PO translations)
ai18n po-push --prefer-ai
# Start clean: flush all existing AI-generated translations out of your YAML file
ai18n flush-ai
Force a PO translation
When a string has both a translation in your PO file and one contributed by AI, you'll have to decide on a precedence rule when pushing translations back into your PO files. By default, ai18n po-push will prefer the PO translation over the AI one. If you prefer using the AI translations (where available), you can simply run ai18n po-push --prefer-ai, which we believe people will want to use as AI translations get better.
Even when preferring AI translations, there may be particular strings where AI simply cannot do a great job, and you may want to force the PO file string over the AI-contributed one for those. For this, you simply add the ai18n-force flag to the relevant entry in your PO file, as such:
#, ai18n-force
msgid "Dataset Name"
msgstr "Nom de l’ensemble de données"
This PO file comment will be extracted from your PO file and will force this string to be used, even while running ai18n po-push --prefer-ai.
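By contrast, an entry without the flag stays eligible for an AI override. For instance (this entry is purely illustrative):
msgid "Dashboard"
msgstr "Tableau de bord"
Here, ai18n po-push --prefer-ai may replace the msgstr with the AI-generated translation stored in the YAML file, if one exists, while the default ai18n po-push would keep the PO translation.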
Common workflow
While PO files are used as an interface in and out of ai18n, we recommend that you:
- check in/commit the ai18n.yml file as a companion file in your code repository. Yes, it is large, and yes, it duplicates some i18n information in your repository, but this is how the state and the precious AI-generated translations get persisted and version-controlled as part of your app
- keep your PO files intact in your repository, at least while figuring out your i18n workflows; this enables contributors to interface with them and contribute new strings where required
- add an ai18n po-pull && ai18n translate step to your release workflows to fill in gaps in your PO files (see the sketch after this list)
- add an ai18n po-push step somewhere in your release pipeline, so that your software can use these "merged" PO files to feed your app with translations
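To make this concrete, here is a minimal sketch of what the release-time steps could look like, assuming the environment variables from the Configuration section are already set; the extraction step is a placeholder for whatever your project already uses to regenerate its .po files and is not an ai18n command:
# 1. regenerate/refresh your .po files with your project's usual gettext extraction tooling
# 2. pull the latest strings from the PO files into the YAML file
ai18n po-pull
# 3. fill in missing translations with OpenAI
ai18n translate
# 4. optional: check coverage
ai18n report
# 5. write merged translations back into the PO files your app consumes
ai18n po-push            # or: ai18n po-push --prefer-ai, if you want AI translations to win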
Eventually, as you validate this approach and mature your i18n workflow to take ai18n into account, you can decide to either:
- push the merged AI-translated strings back into your app's PO files, and into your repository. This enables humans working with PO files to iterate on the latest/greatest merged translations, either to review them or to fix the bad ones using their favorite PO file editor. Translators should be instructed to use the #, ai18n-force flag where they want to override AI-generated translations.
- OR, if you don't need to interface with PO files for data entry, get rid of PO files in your repository altogether and pivot to using the YAML file as the source of truth for all your translations. To merge new strings into the YAML file, you would, as part of your release process, run the PO file extraction, run ai18n po-pull to augment it with newly discovered/altered strings, and NOT persist the PO files in your code repository. In this scenario, PO files are simply an ephemeral construct used to interface between your app and the ai18n framework during your release process (see the sketch below).
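A rough sketch of that second, PO-free scenario during a release, again with the extraction step standing in for your project's own tooling:
# extract .po files from source as a temporary, uncommitted artifact (project-specific step)
# fold newly discovered/altered strings into the YAML source of truth
ai18n po-pull
ai18n translate
# generate merged .po files for the build to consume, preferring AI translations
ai18n po-push --prefer-ai
# the temporary .po files feed the build and are never committed back to the repository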
Example/default prompt
To better understand what's happening behind the scenes, here's the prompt that we dynamically build and send to the OpenAI API. Note that you can alter/customize this template to better fit your needs.
Translate the following text for the UI of a software application using the GNU gettext framework.
This is in the context of a .po file, so please follow the appropriate formatting for pluralization if needed.
Other language translations are provided as a reference where available, but they may need improvement or correction.
Ensure the translation is appropriate for a technical audience and aligns with common UI/UX terminology.
Instructions:
- Provide the output in JSON format (no markdown) with the language code as a key and the translated string as the value.
- Provide translations for the following locales: ar, de, es, fr, it, ja, ko, nl, pt, pt_BR, ru, sk, sl, tr, uk, zh, zh_TW
- Follow the pluralization rules for the target language if applicable.
- Only pass the key to overwrite if your translation is significantly better than the existing one.
Original string to translate: 'A comma separated list of columns that should be parsed as dates'
Existing translations for reference:
ar: 'قائمة مفصولة بفواصل من الأعمدة التي يجب تحليلها كتواريخ'
de: 'Eine durch Kommas getrennte Liste von Spalten, die als Datumsangaben interpretiert werden sollen'
es: 'Una lista separada por comas de columnas que deben ser parseadas como fechas.'
fr: 'Une liste de colonnes séparées par des virgules qui doivent être analysées comme des dates.'
ja: '日付として解析する必要がある列のカンマ区切りのリスト'
nl: 'Een door komma's gescheiden lijst van kolommen die als datums moeten worden geïnterpreteerd'
pt_BR: 'Uma lista separada por vírgulas de colunas que devem ser analisadas como datas'
ru: 'Разделенный запятыми список столбцов, которые должны быть интерпретированы как даты.'
sl_SI: 'Z vejico ločen seznam stolpcev, v katerih bodo prepoznani datumi'
uk: 'Кома -розділений список стовпців, які слід проаналізувати як дати'
zh: '应作为日期解析的列的逗号分隔列表。'
zh_TW: '應作為日期解析的列的逗號分隔列表。'
Example response from OpenAI
{
"ar": "قائمة مفصولة بفواصل من المخططات التي يسمح للملفات بالتحميل إليها.",
"de": "Eine durch Kommas getrennte Liste von Schemata, in die Dateien hochgeladen werden dürfen.",
"es": "Una lista separada por comas de esquemas que permiten la subida de archivos.",
"fr": "Une liste de schémas séparés par des virgules autorisés pour le téléversement.",
"it": "Un elenco di schemi separati da virgole a cui è consentito caricare file.",
"ja": "ファイルのアップロードを許可するスキーマのカンマ区切りのリスト。",
"ko": "파일 업로드가 허용되는 스키마의 쉼표로 구분된 목록입니다.",
"nl": "Een komma gescheiden lijst van schema's waar bestanden naar mogen uploaden.",
"pt": "Uma lista separada por vírgulas de esquemas para os quais é permitido fazer upload de arquivos.",
"pt_BR": "Uma lista separada por vírgulas de esquemas para os quais os arquivos têm permissão para fazer upload.",
"ru": "Разделенный запятыми список схем, в которые можно загружать файлы.",
"sk": "Zoznam schém oddelených čiarkami, do ktorých je povolené nahrávanie súborov.",
"sl": "Seznam shem, ločenih z vejicami, na katere je dovoljeno nalagati datoteke.",
"tr": "Dosyaların yüklenebileceği virgülle ayrılmış şema listesi.",
"uk": "Список схем, відокремлений комами, до яких файли дозволяють завантажувати.",
"zh": "允许文件上传的逗号分隔的模式列表。",
"zh_TW": "允許文件上傳的逗號分隔的模式列表。"
}
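If you point AI18N_TEMPLATE_FOLDER at your own folder, the prompt above can be reshaped through a Jinja template. The following is only an illustrative sketch: the variable names (extra_context, target_locales, msgid, existing_translations) are assumptions for the example and may not match what the bundled template actually exposes, so check the default template shipped with ai18n before adapting it.
Translate the following text for the UI of a software application using the GNU gettext framework.
{# extra application context, e.g. the value of AI18N_PROMPT_EXTRA_CONTEXT #}
{{ extra_context }}
Instructions:
- Provide the output in JSON format (no markdown) with the language code as a key and the translated string as the value.
- Provide translations for the following locales: {{ target_locales | join(', ') }}
Original string to translate: '{{ msgid }}'
Existing translations for reference:
{% for locale, translation in existing_translations.items() %}{{ locale }}: '{{ translation }}'
{% endfor %}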
Sample of the ai18n.yml
messages:
- trimmed_msgid: '!= (Is not equal)'
  msgid: '!= (Is not equal)'
  occurances: []
  po_translations:
    ar: '!= (ﻎﻳﺭ ﻢﺘﺳﺍﻭ)'
    de: '!= (Ist nicht gleich)'
    es: '!= (No es igual)'
    fr: '!= (N''est pas égal)'
    it: ''
    ja: '!= (等しくない)'
    ko: ''
    nl: '!= (Is niet gelijk)'
    pt: ''
    pt_BR: '!= (diferente)'
    ru: '!= (не равно)'
    sk: ''
    sl_SI: '!= (ni enako)'
    tr: ''
    uk: '! = (Не рівний)'
    zh: 不等于(!=)
    zh_TW: 不等於(!=)
    en: '!= (Is not equal)'
  ai_translations:
    ar: '!= (ﻎﻳﺭ ﻢﺘﺳﺍﻭ)'
    de: '!= (Ist nicht gleich)'
    es: '!= (No es igual)'
    fr: '!= (N''est pas égal)'
    it: '!= (Non è uguale)'
    ja: '!= (等しくない)'
    ko: '!= (같지 않음)'
    nl: '!= (Is niet gelijk)'
    pt: '!= (não é igual)'
    pt_BR: '!= (diferente)'
    ru: '!= (не равно)'
    sk: '!= (nie je rovné)'
    sl: '!= (ni enako)'
    tr: '!= (eşit değil)'
    uk: '!= (Не рівний)'
    zh: 不等于(!=)
    zh_TW: 不等於(!=)
  metadata:
    model_used: gpt-4-turbo
    last_executed: '2024-09-30T18:07:14.542663'
Development Setup & Contributing
Clone the repository and install dependencies for both runtime and development:
pip install -r requirements.txt
pip install -r requirements-dev.txt
pip install -e .
Run unit tests
pytest tests/
Author
Maxime Beauchemin is the original creator of Apache Superset and Apache Airflow, and the founder and CEO of Preset.