Export UNIHAN data of Chinese, Japanese, Korean to CSV, JSON or YAML
unihan-etl
ETL tool for Unicode's Han Unification (UNIHAN) database releases. unihan-etl retrieves (downloads), extracts (unzips), and transforms the database from Unicode's website into either a flat, tabular format or a structured, tree-like format.
unihan-etl can be used as a Python library through its API, to retrieve data as a Python object, or through the CLI to retrieve a CSV, JSON, or YAML file.
Part of the cihai project. Similar project: libUnihan.
UNIHAN Version compatibility (as of unihan-etl v0.10.0): 11.0.0 (released 2018-05-08, revision 25).
UNIHAN's data is dispersed across multiple files, in the following format:
U+3400 kCantonese jau1
U+3400 kDefinition (same as U+4E18 丘) hillock or mound
U+3400 kMandarin qiū
U+3401 kCantonese tim2
U+3401 kDefinition to lick; to taste, a mat, bamboo bark
U+3401 kHanyuPinyin 10019.020:tiàn
U+3401 kMandarin tiàn
Values vary in shape and structure depending on their field type.
kHanyuPinyin maps Unicode codepoints to Hànyǔ Dà Zìdiǎn, where 10019.020:tiàn represents an entry. Complicating things further, there are more variations:
U+5EFE kHanyuPinyin 10513.110,10514.010,10514.020:gǒng
U+5364 kHanyuPinyin 10093.130:xī,lǔ 74609.020:lǔ,xī
kHanyuPinyin supports multiple entries delimited by spaces. A ":" (colon) separates locations in the work from pinyin readings, and a "," (comma) separates multiple entries/readings. This is just one of 90 fields contained in the database.
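To illustrate how the delimiters layer, here is a minimal, self-contained Python sketch that splits a raw kHanyuPinyin value into locations and readings. It only mirrors the delimiter rules described above; it is not unihan-etl's actual expansion code (that lives in expansion.py, covered under Code layout below).

def split_khanyupinyin(value):
    """Split a raw kHanyuPinyin value into location/reading groups.

    Illustrative only -- unihan-etl's real expansion also parses the
    volume, page, and position digits out of each location.
    """
    entries = []
    for entry in value.split(" "):  # entries are space-delimited
        locations, _, readings = entry.partition(":")  # colon separates locations from readings
        entries.append({
            "locations": locations.split(","),  # commas separate multiple locations
            "readings": readings.split(","),  # commas separate multiple readings
        })
    return entries

print(split_khanyupinyin("10093.130:xī,lǔ 74609.020:lǔ,xī"))
# [{'locations': ['10093.130'], 'readings': ['xī', 'lǔ']},
#  {'locations': ['74609.020'], 'readings': ['lǔ', 'xī']}]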
Tabular, "Flat" output
CSV (default), $ unihan-etl:
char,ucn,kCantonese,kDefinition,kHanyuPinyin,kMandarin
㐀,U+3400,jau1,(same as U+4E18 丘) hillock or mound,,qiū
㐁,U+3401,tim2,"to lick; to taste, a mat, bamboo bark",10019.020:tiàn,tiàn
With $ unihan-etl -F yaml --no-expand:
- char: 㐀
kCantonese: jau1
kDefinition: (same as U+4E18 丘) hillock or mound
kHanyuPinyin: null
kMandarin: qiū
ucn: U+3400
- char: 㐁
kCantonese: tim2
kDefinition: to lick; to taste, a mat, bamboo bark
kHanyuPinyin: 10019.020:tiàn
kMandarin: tiàn
ucn: U+3401
With $ unihan-etl -F json --no-expand:
[
{
"char": "㐀",
"ucn": "U+3400",
"kDefinition": "(same as U+4E18 丘) hillock or mound",
"kCantonese": "jau1",
"kHanyuPinyin": null,
"kMandarin": "qiū"
},
{
"char": "㐁",
"ucn": "U+3401",
"kDefinition": "to lick; to taste, a mat, bamboo bark",
"kCantonese": "tim2",
"kHanyuPinyin": "10019.020:tiàn",
"kMandarin": "tiàn"
}
]
"Structured" output
Codepoints can pack in a lot more detail, so unihan-etl carefully extracts these values in a uniform manner. Empty values are pruned.
To make this possible, unihan-etl exports to JSON, YAML, and Python lists/dicts.
Why not CSV?
Unfortunately, CSV is only suitable for storing flat, table-like information. File formats such as JSON and YAML can represent key-value pairs and hierarchical entries.
JSON, $ unihan-etl -F json:
[
{
"char": "㐀",
"ucn": "U+3400",
"kDefinition": ["(same as U+4E18 丘) hillock or mound"],
"kCantonese": ["jau1"],
"kMandarin": {
"zh-Hans": "qiū",
"zh-Hant": "qiū"
}
},
{
"char": "㐁",
"ucn": "U+3401",
"kDefinition": ["to lick", "to taste, a mat, bamboo bark"],
"kCantonese": ["tim2"],
"kHanyuPinyin": [
{
"locations": [
{
"volume": 1,
"page": 19,
"character": 2,
"virtual": 0
}
],
"readings": ["tiàn"]
}
],
"kMandarin": {
"zh-Hans": "tiàn",
"zh-Hant": "tiàn"
}
}
]
YAML, $ unihan-etl -F yaml:
- char: 㐀
kCantonese:
- jau1
kDefinition:
- (same as U+4E18 丘) hillock or mound
kMandarin:
zh-Hans: qiū
zh-Hant: qiū
ucn: U+3400
- char: 㐁
kCantonese:
- tim2
kDefinition:
- to lick
- to taste, a mat, bamboo bark
kHanyuPinyin:
- locations:
- character: 2
page: 19
virtual: 0
volume: 1
readings:
- tiàn
kMandarin:
zh-Hans: tiàn
zh-Hant: tiàn
ucn: U+3401
Features
- automatically downloads UNIHAN from the internet
- strives for accuracy with the specifications described in UNIHAN's database design
- export to JSON, CSV and YAML (requires pyyaml) via -F
- configurable to export specific fields via -f
- accounts for encoding conflicts due to the Unicode-heavy content
- designed as a technical proof for future CJK (Chinese, Japanese, Korean) datasets
- core component and dependency of cihai, a CJK library
- data package support
- expansion of multi-value delimited fields in YAML, JSON and python dictionaries
- supports Python >= 3.6 and PyPy
If you encounter a problem or have a question, please create an issue.
Usage
unihan-etl offers customizable builds via its command line arguments.
See unihan-etl CLI arguments for information on how you can specify columns, files, download URLs, and output destination.
To download and build your own UNIHAN export, first install unihan-etl:
$ pip install --user unihan-etl
To output CSV, the default format:
$ unihan-etl
To output JSON:
$ unihan-etl -F json
To output YAML:
$ pip install --user pyyaml
$ unihan-etl -F yaml
To output only the kDefinition field in a CSV:
$ unihan-etl -f kDefinition
To output multiple fields, separate with spaces:
$ unihan-etl -f kCantonese kDefinition
To output to a custom file:
$ unihan-etl --destination ./exported.csv
To output to a custom file (templated file extension):
$ unihan-etl --destination ./exported.{ext}
See unihan-etl CLI arguments for advanced usage examples.
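The data can also be retrieved as a Python object through the library API mentioned above. Below is a minimal sketch, assuming a Packager class in unihan_etl.process whose options dict mirrors the CLI flags; these names are assumptions for illustration and may differ between releases, so consult the API documentation for the exact interface.

from unihan_etl.process import Packager  # assumed entry point (see unihan_etl/process.py below)

# Option keys here are assumptions that mirror the CLI flags.
packager = Packager({
    "fields": ["kCantonese", "kDefinition"],  # like `-f kCantonese kDefinition`
    "format": "python",  # return a list of dicts instead of writing CSV/JSON/YAML
})
packager.download()       # retrieve and unzip Unihan.zip into the cache dir
data = packager.export()  # transform into the expanded, structured form
print(data[0]["char"], data[0]["ucn"])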
Code layout
# cache dir (Unihan.zip is downloaded, contents extracted)
{XDG cache dir}/unihan_etl/
# output dir
{XDG data dir}/unihan_etl/
unihan.json
unihan.csv
unihan.yaml # (requires pyyaml)
# package dir
unihan_etl/
process.py # argparse, download, extract, transform UNIHAN's data
constants.py # immutable data vars (field to filename mappings, etc)
expansion.py # extracting details baked inside of fields
_compat.py # python 2/3 compatibility module
util.py # utility / helper functions
# test suite
tests/*
Developing
poetry is required to develop.
git clone https://github.com/cihai/unihan-etl.git
cd unihan-etl
poetry install -E "docs test coverage lint format"
Makefile commands prefixed with watch_ will watch files and rerun.
Tests
poetry run py.test
Helpers: make test
Rerun tests on file change: make watch_test (requires entr(1))
Documentation
Default preview server: http://localhost:8039
cd docs/ and make html to build. make serve to start the HTTP server.
Helpers: make build_docs, make serve_docs
Rebuild docs on file change: make watch_docs (requires entr(1))
Rebuild docs and run server via one terminal: make dev_docs (requires the above, and a make(1) with -J support, e.g. GNU Make)
Formatting / Linting
The project uses black and isort (one after the other) and runs flake8 via CI. See the configuration in pyproject.toml and setup.cfg.
make black isort: Run black first, then isort to handle import nuances.
make flake8: Run flake8. To watch (requires entr(1)): make watch_flake8
Releasing
As of 0.11, poetry handles virtualenv creation, package requirements, versioning, building, and publishing. Therefore there is no setup.py and there are no requirements files.
Update __version__ in __about__.py and pyproject.toml:
$ git commit -m 'build(unihan-etl): Tag v0.1.1'
$ git tag v0.1.1
$ git push
$ git push --tags
$ poetry build
$ poetry publish