Builds a multisource English lexicon
🗽 CityLex: a free multisource English lexical database
CityLex is an English lexical database intended to replace or enhance databases like CELEX. It combines data from up to seven unique sources, including frequency norms, morphological analyses, and pronunciations. Since these have varying license conditions (some are proprietary, others restrict redistribution), we do not provide the database as is. Rather, the user must generate a personal copy by executing a Python script, enabling whichever sources they wish to use.
Building your own CityLex
To install CityLex execute
pip install git+https://github.com/kylebgorman/citylex.git
To see the available data sources and options, execute citylex --help.
To generate the lexicon, execute citylex with at least one source enabled using command-line flags. As most of the data is downloaded from online sources, an internet connection is normally required. The process takes roughly four minutes with all sources enabled; much of the time is spent downloading large files.
To generate a lexicon with all the sources that don't require manual downloads, execute
citylex --cmudict \
--elp \
--subtlex_uk \
--subtlex_us \
--udlexicons \
--unimorph \
--wikipron
File formats
Two files are produced. The first, by default citylex.tsv, is a standard wide-format "tab-separated values" (TSV) file, of the sort that can be read into Excel or R. Some fields (particularly pronunciations and morphological analyses) can have multiple entries per wordform. In this case, the entries are separated using the ^ character.
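As a rough illustration of handling the ^-separated fields, the following sketch reads a wide-format TSV and splits selected columns into lists. The column names (wordform, freq, pron) and the sample rows are made up for this example and are not CityLex's actual schema; read_wide_tsv is likewise a hypothetical helper.

```python
import csv
import io

# Hypothetical sample: the column names and values below are invented for
# illustration and do not reflect the real columns of citylex.tsv.
SAMPLE_TSV = (
    "wordform\tfreq\tpron\n"
    "read\t100\tr iy d^r eh d\n"
    "cat\t50\tk ae t\n"
)

MULTI_VALUE_SEPARATOR = "^"


def read_wide_tsv(handle, multi_value_columns):
    """Yields one dict per row, splitting the named columns on '^'."""
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        for column in multi_value_columns:
            value = row[column]
            # An empty cell becomes an empty list rather than [""].
            row[column] = value.split(MULTI_VALUE_SEPARATOR) if value else []
        yield row


rows = list(read_wide_tsv(io.StringIO(SAMPLE_TSV), ["pron"]))
for row in rows:
    print(row["wordform"], row["pron"])
```

To try this on a generated lexicon, one would open citylex.tsv with newline="" and pass the file handle in place of the StringIO object, naming whichever columns actually hold multiple entries.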
Advanced users may wish to make use of the second file, by default citylex.textproto, a text-format protocol buffer which provides a better representation of the repeated fields. To parse this file in Python, use the following snippet:
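from google.protobuf import text_format
import citylex_pb2

# Creates an empty Lexicon message to be filled from the file.
lexicon = citylex_pb2.Lexicon()
with open("citylex.textproto", "r") as source:
    # Parses the text-format data line by line into the message.
    text_format.ParseLines(source, lexicon)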
This will parse the text-format data and populate lexicon. One can then iterate over lexicon.entry like a Python dictionary.
Non-redistributable data sources
Not all CityLex data can be obtained automatically from online sources. If you wish to enable CELEX features, follow the instructions below.
This proprietary resource must be obtained from the Linguistic Data Consortium as LDC96L14.tgz. The file should be decompressed using
tar -xzf LDC96L14.tgz
This will produce a directory named celex2. To enable CELEX2 features, use --celex and pass the local path of this directory as an argument to --celex_path.
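For scripted builds, the flags described above can be assembled programmatically. The sketch below only constructs and prints the command line; it does not run citylex. build_citylex_command is a hypothetical helper written for this example, and it assumes the path is given to --celex_path as a separate argument.

```python
import shlex

# Source flags taken from the commands shown in this document.
AUTOMATIC_SOURCES = (
    "cmudict", "elp", "subtlex_uk", "subtlex_us",
    "udlexicons", "unimorph", "wikipron",
)


def build_citylex_command(sources, celex_path=None):
    """Returns the argument list for a citylex run (hypothetical helper)."""
    unknown = sorted(set(sources) - set(AUTOMATIC_SOURCES))
    if unknown:
        raise ValueError(f"Unknown sources: {unknown}")
    command = ["citylex"] + [f"--{source}" for source in sources]
    if celex_path is not None:
        # CELEX needs both the enabling flag and the local directory.
        command += ["--celex", "--celex_path", celex_path]
    return command


command = build_citylex_command(["cmudict", "wikipron"], celex_path="celex2")
print(shlex.join(command))
```

The resulting list can be handed to subprocess.run(command, check=True) once CityLex is installed and the celex2 directory exists.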
For more information
- citylex.proto for the protocol buffer data structure
- citylex.bib for references to the data sources used
For contributors
To regenerate citylex_pb2.py you will need to install the Protocol Buffers C++ runtime for your platform, making sure the version number (e.g., the one returned by protoc --version) matches that of protobuf in requirements.txt. Then, run protoc --python_out=. citylex.proto.
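The version comparison can be checked mechanically. The following is a minimal sketch under two assumptions: that protoc --version prints a line of the form "libprotoc X.Y.Z", and that requirements.txt pins the package as protobuf==X.Y.Z. The actual pin format in this repository may differ, both function names are hypothetical, and the version numbers are placeholders.

```python
import re


def protoc_version(version_output):
    """Extracts the version string from output like 'libprotoc 3.11.4'."""
    match = re.search(r"libprotoc\s+(\S+)", version_output)
    if match is None:
        raise ValueError(f"Unrecognized protoc output: {version_output!r}")
    return match.group(1)


def pinned_protobuf_version(requirements_text):
    """Returns the version pinned as 'protobuf==X', or None if absent.

    Assumes an exact '==' pin; other specifiers are not handled.
    """
    for line in requirements_text.splitlines():
        match = re.fullmatch(r"\s*protobuf\s*==\s*(\S+)\s*", line)
        if match:
            return match.group(1)
    return None


# Placeholder inputs; real values would come from running protoc --version
# and reading requirements.txt.
compiler = protoc_version("libprotoc 3.11.4\n")
pinned = pinned_protobuf_version("nltk\nprotobuf==3.11.4\n")
print("match" if compiler == pinned else "mismatch")
```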
License
The CityLex codebase is distributed under the Apache 2.0 license. Please see License.txt for details.
All other data sources bear their original licenses chosen by their creators; see citylex --help for more information.
Author
CityLex was created by Kyle Gorman.