wordfreq

Tools for working with word frequencies from various corpora.

Author: Rob Speer

## Installation

wordfreq requires Python 3 and depends on a few other Python modules (msgpack-python, langcodes, and ftfy). You can install it and its dependencies in the usual way, either with pip:

    pip3 install wordfreq

or by getting the repository and running its setup.py:

    python3 setup.py install

To handle word frequency lookups in Japanese, you also need to install mecab-python3, which itself depends on libmecab-dev. These commands will install them on Ubuntu:

    sudo apt-get install mecab-ipadic-utf8 libmecab-dev
    pip3 install mecab-python3

## Tokenization

wordfreq uses the Python package regex, a more advanced implementation of regular expressions than the standard library's re module, to separate text into tokens that can be counted consistently. regex produces tokens that follow the recommendations in [Unicode Annex #29, Text Segmentation][uax29].
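
To illustrate what "tokens that can be counted consistently" means, here is a minimal sketch that uses only the standard library. It is not wordfreq's tokenizer: the pattern is a rough stand-in for the UAX #29 rules, case-folding is an assumption made for this example, and the names `TOKEN_RE`, `simple_tokenize`, and `count_tokens` are hypothetical.

```python
import re
from collections import Counter

# Rough approximation of word segmentation: runs of word characters,
# optionally joined by internal apostrophes. This is NOT the UAX #29
# algorithm that the regex package provides; it is only an illustration.
TOKEN_RE = re.compile(r"\w+(?:'\w+)*")


def simple_tokenize(text):
    """Split text into case-folded tokens (hypothetical helper)."""
    return [match.group(0).casefold() for match in TOKEN_RE.finditer(text)]


def count_tokens(text):
    """Count how often each token occurs in the text."""
    return Counter(simple_tokenize(text))


if __name__ == "__main__":
    print(count_tokens("Don't panic, don't PANIC!"))
```

Because every token passes through the same splitting and case-folding step, "Don't" and "don't" land in the same bucket when counted.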

There are language-specific exceptions:

- In Arabic, it additionally normalizes ligatures and removes combining marks.
- In Japanese, instead of using the regex library, it uses the external library mecab-python3. This is an optional dependency of wordfreq, and compiling it requires the libmecab-dev system package to be installed.
- It does not yet attempt to tokenize Chinese ideograms.
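
The Arabic exception above can be approximated with the standard library's `unicodedata` module. This is a sketch under stated assumptions, not wordfreq's actual code: it assumes that NFKC normalization is an acceptable way to expand ligatures and that "combining marks" means characters in the Unicode mark categories. The function name `normalize_arabic_sketch` is hypothetical.

```python
import unicodedata


def normalize_arabic_sketch(text):
    """Expand ligatures and drop combining marks (hypothetical helper)."""
    # NFKC replaces compatibility characters, including Arabic
    # presentation-form ligatures, with their plain letter sequences.
    expanded = unicodedata.normalize("NFKC", text)
    # Unicode general categories starting with "M" (Mn, Mc, Me) are
    # combining marks; Arabic vowel diacritics fall in Mn.
    return "".join(
        ch for ch in expanded if not unicodedata.category(ch).startswith("M")
    )


if __name__ == "__main__":
    # U+FEFB is the isolated lam-alef ligature.
    print(normalize_arabic_sketch("\ufefb"))
    # A word written with kasra (U+0650) and fatha (U+064E) diacritics.
    print(normalize_arabic_sketch("\u0643\u0650\u062a\u064e\u0627\u0628"))
```

With this kind of normalization, a word written with or without vowel marks maps to the same token, so its occurrences are counted together.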

[uax29]: http://unicode.org/reports/tr29/

## License

wordfreq is freely redistributable under the MIT license (see MIT-LICENSE.txt), and it includes data files that may be redistributed under a Creative Commons Attribution-ShareAlike 4.0 license (https://creativecommons.org/licenses/by-sa/4.0/).

wordfreq contains data extracted from Google Books Ngrams (http://books.google.com/ngrams) and Google Books Syntactic Ngrams (http://commondatastorage.googleapis.com/books/syntactic-ngrams/index.html). The terms of use of this data are:

Ngram Viewer graphs and data may be freely used for any purpose, although acknowledgement of Google Books Ngram Viewer as the source, and inclusion of a link to http://books.google.com/ngrams, would be appreciated.

It also contains data derived from the following Creative Commons-licensed sources:

- The Leeds Internet Corpus, from the University of Leeds Centre for Translation Studies (http://corpus.leeds.ac.uk/list.html)
- The OpenSubtitles Frequency Word Lists, by Invoke IT Limited (https://invokeit.wordpress.com/frequency-word-lists/)
- Wikipedia, the free encyclopedia (http://www.wikipedia.org)

Some additional data was collected by a custom application that watches the streaming Twitter API, in accordance with Twitter’s Developer Agreement & Policy. This software gives statistics about words that are commonly used on Twitter; it does not display or republish any Twitter content.

This version: 1.1, released Aug 27, 2015