Tools for working with word frequencies from various corpora.

Author: Rob Speer

## Installation

wordfreq requires Python 3 and depends on a few other Python modules (msgpack-python, langcodes, and ftfy). You can install it and its dependencies in the usual way, either by getting it from pip:

pip3 install wordfreq

or by getting the repository and running its setup.py:

python3 setup.py install

To handle word frequency lookups in Japanese, you need to additionally install mecab-python3, which itself depends on libmecab-dev. These commands will install them on Ubuntu:

sudo apt-get install mecab-ipadic-utf8 libmecab-dev

pip3 install mecab-python3
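After installing, a quick check of which dependencies Python can actually find may save some debugging. The following is a minimal sketch, not part of wordfreq: the helper `missing_modules` is hypothetical, and the import names (`msgpack` for msgpack-python, `MeCab` for mecab-python3) are assumptions, since a package's name on PyPI can differ from the module name you import.

```python
from importlib.util import find_spec

# Mapping of PyPI distribution name -> assumed import name.
REQUIRED = {"msgpack-python": "msgpack", "langcodes": "langcodes", "ftfy": "ftfy"}
OPTIONAL = {"mecab-python3": "MeCab"}  # only needed for Japanese lookups


def missing_modules(packages):
    """Return the distribution names whose module cannot be located."""
    return [dist for dist, module in packages.items() if find_spec(module) is None]


if __name__ == "__main__":
    print("missing required:", missing_modules(REQUIRED))
    print("missing optional:", missing_modules(OPTIONAL))
```

An empty list for the required packages suggests the installation is complete; a non-empty optional list only matters if you need Japanese.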

## Unicode data

The tokenizers that split non-Japanese phrases rely on regexes built with the unicodedata module from Python 3.4, which supports Unicode version 6.3.0. To update these regexes, run scripts/gen_regex.py.
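To illustrate the general idea of deriving a regex from Unicode data, here is a rough sketch. It is not the code in scripts/gen_regex.py, and the function `category_class` is hypothetical. It scans code points with `unicodedata.category` and collapses matching runs into a character class; `unicodedata.unidata_version` reports which Unicode version the running interpreter provides.

```python
import re
import sys
import unicodedata


def category_class(prefix, limit=sys.maxunicode + 1):
    """Build a regex character class covering code points below `limit`
    whose Unicode general category starts with `prefix` (e.g. "L")."""
    ranges = []
    start = None
    for cp in range(limit):
        if unicodedata.category(chr(cp)).startswith(prefix):
            if start is None:
                start = cp  # a new run of matching code points begins
        elif start is not None:
            ranges.append((start, cp - 1))
            start = None
    if start is not None:
        ranges.append((start, limit - 1))
    if not ranges:
        raise ValueError("no code points match category prefix %r" % prefix)

    parts = []
    for lo, hi in ranges:
        if lo == hi:
            parts.append(re.escape(chr(lo)))
        else:
            parts.append(re.escape(chr(lo)) + "-" + re.escape(chr(hi)))
    return "[" + "".join(parts) + "]"


if __name__ == "__main__":
    print("Unicode data version:", unicodedata.unidata_version)
    print(category_class("L", 0x80))
```

Because the result depends on the interpreter's Unicode tables, a class generated under a newer Python can cover characters that a class generated under Python 3.4 does not.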

## License

wordfreq is freely redistributable under the MIT license (see MIT-LICENSE.txt), and it includes data files that may be redistributed under a Creative Commons Attribution-ShareAlike 4.0 license (https://creativecommons.org/licenses/by-sa/4.0/).

wordfreq contains data extracted from Google Books Ngrams (http://books.google.com/ngrams) and Google Books Syntactic Ngrams (http://commondatastorage.googleapis.com/books/syntactic-ngrams/index.html). The terms of use of this data are:

Ngram Viewer graphs and data may be freely used for any purpose, although acknowledgement of Google Books Ngram Viewer as the source, and inclusion of a link to http://books.google.com/ngrams, would be appreciated.

It also contains data derived from the following Creative Commons-licensed sources:

Some additional data was collected by a custom application that watches the streaming Twitter API, in accordance with Twitter’s Developer Agreement & Policy. This software only gives statistics about words that are very commonly used on Twitter; it does not display or republish any Twitter content.
