Export UNIHAN to Python, Data Package, CSV, JSON and YAML
Project description
unihan-etl - ETL tool for Unicode’s Han Unification (UNIHAN) database releases. unihan-etl retrieves (downloads), extracts (unzips), and transforms the database from Unicode’s website to a flat, tabular or structured, tree-like format.
unihan-etl can be used as a python library through its API, to retrieve data as a python object, or through the CLI to retrieve a CSV, JSON, or YAML file.
Part of the cihai project. Similar project: libUnihan.
UNIHAN Version compatibility (as of unihan-etl v0.10.0): 11.0.0 (released 2018-05-08, revision 25).
UNIHAN’s data is dispersed across multiple files in the format of:
U+3400 kCantonese jau1 U+3400 kDefinition (same as U+4E18 丘) hillock or mound U+3400 kMandarin qiū U+3401 kCantonese tim2 U+3401 kDefinition to lick; to taste, a mat, bamboo bark U+3401 kHanyuPinyin 10019.020:tiàn U+3401 kMandarin tiàn
Values vary in shape and structure depending on their field type. kHanyuPinyin maps Unicode codepoints to Hànyǔ Dà Zìdiǎn, where 10019.020:tiàn represents an entry. Complicating it further, more variations:
U+5EFE kHanyuPinyin 10513.110,10514.010,10514.020:gǒng U+5364 kHanyuPinyin 10093.130:xī,lǔ 74609.020:lǔ,xī
kHanyuPinyin supports multiple entries delimited by spaces. “:” (colon) separate locations in the work from pinyin readings. “,” (comma) separate multiple entries/readings. This is just one of 90 fields contained in the database.
Tabular, “Flat” output
CSV (default), $ unihan-etl:
char,ucn,kCantonese,kDefinition,kHanyuPinyin,kMandarin 㐀,U+3400,jau1,(same as U+4E18 丘) hillock or mound,,qiū 㐁,U+3401,tim2,"to lick; to taste, a mat, bamboo bark",10019.020:tiàn,tiàn
With $ unihan-etl -F yaml --no-expand:
- char: 㐀
kCantonese: jau1
kDefinition: (same as U+4E18 丘) hillock or mound
kHanyuPinyin: null
kMandarin: qiū
ucn: U+3400
- char: 㐁
kCantonese: tim2
kDefinition: to lick; to taste, a mat, bamboo bark
kHanyuPinyin: 10019.020:tiàn
kMandarin: tiàn
ucn: U+3401
With $ unihan-etl -F json --no-expand:
[
{
"char": "㐀",
"ucn": "U+3400",
"kDefinition": "(same as U+4E18 丘) hillock or mound",
"kCantonese": "jau1",
"kHanyuPinyin": null,
"kMandarin": "qiū"
},
{
"char": "㐁",
"ucn": "U+3401",
"kDefinition": "to lick; to taste, a mat, bamboo bark",
"kCantonese": "tim2",
"kHanyuPinyin": "10019.020:tiàn",
"kMandarin": "tiàn"
}
]
“Structured” output
Codepoints can pack a lot more detail, unihan-etl carefully extracts these values in a uniform manner. Empty values are pruned.
To make this possible, unihan-etl exports to JSON, YAML, and python list/dicts.
JSON, $ unihan-etl -F json:
[
{
"char": "㐀",
"ucn": "U+3400",
"kDefinition": [
"(same as U+4E18 丘) hillock or mound"
],
"kCantonese": [
"jau1"
],
"kMandarin": {
"zh-Hans": "qiū",
"zh-Hant": "qiū"
}
},
{
"char": "㐁",
"ucn": "U+3401",
"kDefinition": [
"to lick",
"to taste, a mat, bamboo bark"
],
"kCantonese": [
"tim2"
],
"kHanyuPinyin": [
{
"locations": [
{
"volume": 1,
"page": 19,
"character": 2,
"virtual": 0
}
],
"readings": [
"tiàn"
]
}
],
"kMandarin": {
"zh-Hans": "tiàn",
"zh-Hant": "tiàn"
}
}
]
YAML $ unihan-etl -F yaml:
- char: 㐀
kCantonese:
- jau1
kDefinition:
- (same as U+4E18 丘) hillock or mound
kMandarin:
zh-Hans: qiū
zh-Hant: qiū
ucn: U+3400
- char: 㐁
kCantonese:
- tim2
kDefinition:
- to lick
- to taste, a mat, bamboo bark
kHanyuPinyin:
- locations:
- character: 2
page: 19
virtual: 0
volume: 1
readings:
- tiàn
kMandarin:
zh-Hans: tiàn
zh-Hant: tiàn
ucn: U+3401
Features
automatically downloads UNIHAN from the internet
strives for accuracy with the specifications described in UNIHAN’s database design
export to JSON, CSV and YAML (requires pyyaml) via -F
configurable to export specific fields via -f
accounts for encoding conflicts due to the Unicode-heavy content
designed as a technical proof for future CJK (Chinese, Japanese, Korean) datasets
core component and dependency of cihai, a CJK library
data package support
expansion of multi-value delimited fields in YAML, JSON and python dictionaries
supports python 2.7, >= 3.5 and pypy
If you encounter a problem or have a question, please create an issue.
Usage
unihan-etl offers customizable builds via its command line arguments.
See unihan-etl CLI arguments for information on how you can specify columns, files, download URL’s, and output destination.
To download and build your own UNIHAN export:
$ pip install --user unihan-etl
To output CSV, the default format:
$ unihan-etl
To output JSON:
$ unihan-etl -F json
To output YAML:
$ pip install --user pyyaml $ unihan-etl -F yaml
To only output the kDefinition field in a csv:
$ unihan-etl -f kDefinition
To output multiple fields, separate with spaces:
$ unihan-etl -f kCantonese kDefinition
To output to a custom file:
$ unihan-etl --destination ./exported.csv
To output to a custom file (templated file extension):
$ unihan-etl --destination ./exported.{ext}
See unihan-etl CLI arguments for advanced usage examples.
Code layout
# cache dir (Unihan.zip is downloaded, contents extracted)
{XDG cache dir}/unihan_etl/
# output dir
{XDG data dir}/unihan_etl/
unihan.json
unihan.csv
unihan.yaml # (requires pyyaml)
# package dir
unihan_etl/
process.py # argparse, download, extract, transform UNIHAN's data
constants.py # immutable data vars (field to filename mappings, etc)
expansion.py # extracting details baked inside of fields
_compat.py # python 2/3 compatibility module
util.py # utility / helper functions
# test suite
tests/*
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
File details
Details for the file unihan-etl-0.10.1.tar.gz
.
File metadata
- Download URL: unihan-etl-0.10.1.tar.gz
- Upload date:
- Size: 23.7 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/1.11.0 pkginfo/1.4.2 requests/2.19.1 setuptools/40.0.0 requests-toolbelt/0.8.0 tqdm/4.24.0 CPython/3.6.6
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | ed8278379756c1f3b5e8d902a4c2b43788853a0e0467b8668aee8628ae593b3e |
|
MD5 | 2e732ba1b969c9d8cac83ee77d66f107 |
|
BLAKE2b-256 | 382b0f90895a3b1be22eb9496592efbb3ec1b0fd56aa55e86b3dc41bd488b40c |