A lightweight Python package to work with ngrams and skipgrams

nskipgrams is a lightweight Python package to work with ngrams and skipgrams. It is aimed at fields that extract ngrams and skipgrams from sequential data, especially computational linguistics and natural language processing.

Highlights:

  • Simple: Store, access, and count ngrams and skipgrams – that’s it!

  • Memory-efficient: Tries are used for internal storage.

  • Hassle-free: No dependencies. Written in pure Python.

Download and Install

To download and install the most recent version:

$ pip install --upgrade nskipgrams

Usage

The following are defined:

  • Ngrams
    • The class Ngrams handles a collection of ngrams.

    • The function ngrams_from_seq yields ngrams for a given sequence.

  • Skipgrams
    • The class Skipgrams handles a collection of skipgrams.

    • The function skipgrams_from_seq yields skipgrams for a given sequence.

Getting Ngrams from a Sequence

If you simply need ngrams from a sequence, ngrams_from_seq is what you’re looking for:

>>> from nskipgrams import ngrams_from_seq
>>> for ngram in ngrams_from_seq("abcdef", n=2):
...     print(ngram)
('a', 'b')
('b', 'c')
('c', 'd')
('d', 'e')
('e', 'f')
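
For intuition about what such a function does, the sliding-window idea can be sketched in a few lines of pure Python. This is an illustration only, not the package's implementation; the helper name sliding_ngrams is hypothetical.

```python
from collections import deque


def sliding_ngrams(seq, n):
    """Yield contiguous n-element tuples from any iterable (illustrative sketch)."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    # A bounded deque keeps only the most recent n elements.
    window = deque(maxlen=n)
    for element in seq:
        window.append(element)
        if len(window) == n:
            yield tuple(window)


bigrams = list(sliding_ngrams("abcdef", n=2))
# bigrams == [('a', 'b'), ('b', 'c'), ('c', 'd'), ('d', 'e'), ('e', 'f')]
```

Because the sketch consumes its input one element at a time, it also works on generators and other iterables that cannot be indexed.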

Initializing an Ngram Collection

>>> from nskipgrams import Ngrams
>>> char_ngrams = Ngrams(n=3)  # handles unigrams, bigrams, and trigrams

Adding Ngrams

>>> char_ngrams.add_from_seq("my cats")
>>> char_ngrams.add_from_seq("your cat", count=2)

Here, a sequence is anything that can be iterated over, and the corresponding ngrams are extracted from the individual elements of the sequence. For example, if the sequence is a text string like "my cats" above, then the ngrams are character-based (hence the chosen variable name char_ngrams).

To add a single ngram:

>>> char_ngrams.add(("y", "o", "u"))

As a best practice, it is recommended that an ngram be represented as a tuple regardless of what the individual elements are, e.g., ("y", "o", "u") for character-based ngrams. As output examples show below, the tuple data type is also what this package uses to represent ngrams.

Accessing Ngrams

>>> for ngram, count in char_ngrams.ngrams_with_counts(n=1):  # unigrams
...     print(ngram, count)
...
('m',) 1
('y',) 3
(' ',) 3
('c',) 3
('a',) 3
('t',) 3
('s',) 1
('o',) 2
('u',) 2
('r',) 2
>>>
>>> for ngram, count in char_ngrams.ngrams_with_counts(n=2):  # bigrams
...     print(ngram, count)
...
('m', 'y') 1
('y', ' ') 1
('y', 'o') 2
(' ', 'c') 3
('c', 'a') 3
('a', 't') 3
('t', 's') 1
('o', 'u') 2
('u', 'r') 2
('r', ' ') 2
>>>
>>> for ngram, count in char_ngrams.ngrams_with_counts(n=3):  # trigrams
...     print(ngram, count)
...
('m', 'y', ' ') 1
('y', ' ', 'c') 1
('y', 'o', 'u') 3
(' ', 'c', 'a') 3
('c', 'a', 't') 3
('a', 't', 's') 1
('o', 'u', 'r') 2
('u', 'r', ' ') 2
('r', ' ', 'c') 2

Accessing Ngrams with a Specific Prefix

>>> for ngram, count in char_ngrams.ngrams_with_counts(n=3, prefix=("y",)):
...     print(ngram, count)
...
('y', ' ', 'c') 1
('y', 'o', 'u') 3
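
Prefix queries like this one fit naturally with the trie storage mentioned in the highlights: all ngrams sharing a prefix sit under the same node. The following is a toy sketch of that idea, assuming one dictionary of children per node. The class CountTrie and its methods are hypothetical and do not describe the package's internals.

```python
class CountTrie:
    """Toy trie that maps tuples to counts (illustration only)."""

    def __init__(self):
        self.children = {}  # element -> child CountTrie
        self.count = 0      # count of the ngram ending at this node

    def add(self, ngram, count=1):
        node = self
        for element in ngram:
            node = node.children.setdefault(element, CountTrie())
        node.count += count

    def with_prefix(self, prefix, n):
        """Yield (ngram, count) pairs of length n that start with prefix."""
        node = self
        for element in prefix:
            node = node.children.get(element)
            if node is None:
                return  # no stored ngram starts with this prefix
        yield from node._walk(tuple(prefix), n)

    def _walk(self, path, n):
        if len(path) >= n:
            if len(path) == n and self.count:
                yield path, self.count
            return
        for element, child in self.children.items():
            yield from child._walk(path + (element,), n)


trie = CountTrie()
trie.add(("y", " ", "c"))
trie.add(("y", "o", "u"), count=3)
trie.add(("c", "a", "t"), count=3)
matches = list(trie.with_prefix(("y",), n=3))
# matches == [(('y', ' ', 'c'), 1), (('y', 'o', 'u'), 3)]
```

Shared prefixes are stored once, which is where the memory savings of a trie come from, and a prefix lookup only visits the subtree below the prefix.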

Accessing the Count of a Specific Ngram

>>> char_ngrams.count(("c", "a", "t"))
3

Checking Membership

To check if an ngram has an exact match in the collection so far:

>>> ("c", "a", "t") in char_ngrams
True

Combining Collections of Ngrams

To combine collections of ngrams (e.g., when you process data sources in parallel and have multiple Ngrams objects):

>>> char_ngrams1 = Ngrams(n=2)
>>> char_ngrams1.add_from_seq("my cat")
>>> set(char_ngrams1.ngrams_with_counts(n=2))
{((' ', 'c'), 1),
 (('a', 't'), 1),
 (('c', 'a'), 1),
 (('m', 'y'), 1),
 (('y', ' '), 1)}
>>>
>>> char_ngrams2 = Ngrams(n=2)
>>> char_ngrams2.add_from_seq("your cats")
>>> set(char_ngrams2.ngrams_with_counts(n=2))
{((' ', 'c'), 1),
 (('a', 't'), 1),
 (('c', 'a'), 1),
 (('o', 'u'), 1),
 (('r', ' '), 1),
 (('t', 's'), 1),
 (('u', 'r'), 1),
 (('y', 'o'), 1)}
>>>
>>> char_ngrams3 = Ngrams(n=2)
>>> char_ngrams3.add_from_seq("her cats")
>>> set(char_ngrams3.ngrams_with_counts(n=2))
{((' ', 'c'), 1),
 (('a', 't'), 1),
 (('c', 'a'), 1),
 (('e', 'r'), 1),
 (('h', 'e'), 1),
 (('r', ' '), 1),
 (('t', 's'), 1)}
>>>
>>> char_ngrams1.combine(char_ngrams2, char_ngrams3)  # `combine` takes as many Ngrams objects as desired
>>> set(char_ngrams1.ngrams_with_counts(n=2))
{((' ', 'c'), 3),
 (('a', 't'), 3),
 (('c', 'a'), 3),
 (('e', 'r'), 1),
 (('h', 'e'), 1),
 (('m', 'y'), 1),
 (('o', 'u'), 1),
 (('r', ' '), 2),
 (('t', 's'), 2),
 (('u', 'r'), 1),
 (('y', ' '), 1),
 (('y', 'o'), 1)}

If you don’t want to mutate any of the Ngrams instances (the combine method works in-place and mutates these_ngrams when these_ngrams.combine is called), then you can create an empty ngram collection and combine into it all of your ngrams:

>>> collections = [char_ngrams1, char_ngrams2, char_ngrams3]
>>> all_ngrams = Ngrams(n=2)  # A new, empty collection of ngrams
>>> all_ngrams.combine(*collections)
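
The difference between merging in place and merging into a fresh collection can be illustrated with the standard library's collections.Counter, used here only as a stand-in for an ngram collection. The helper bigram_counts is hypothetical and not part of this package.

```python
from collections import Counter


def bigram_counts(text):
    """Count character bigrams in a string (stand-in for an ngram collection)."""
    return Counter(zip(text, text[1:]))


counts1 = bigram_counts("my cat")
counts2 = bigram_counts("your cats")
counts3 = bigram_counts("her cats")

# Merge into a new, empty counter so that none of the inputs is mutated.
merged = Counter()
for counts in (counts1, counts2, counts3):
    merged.update(counts)  # Counter.update adds counts rather than replacing them

# merged[(' ', 'c')] == 3, while counts1[(' ', 'c')] is still 1
```

Calling counts1.update(counts2) instead would change counts1 itself, which mirrors the in-place behavior described above.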

Any “Sequences” and their Corresponding “Ngrams” Work

While the examples above use text strings as sequences and character-based ngrams, another common usage in computational linguistics and NLP is to have segmented phrases/sentences as sequences and word-based ngrams:

>>> from nskipgrams import Ngrams
>>> word_ngrams = Ngrams(n=2)
>>> word_ngrams.add_from_seq(("in", "the", "beginning"))
>>> word_ngrams.add_from_seq(("in", "the", "end"))
>>> for ngram, count in word_ngrams.ngrams_with_counts(n=2):
...     print(ngram, count)
...
('in', 'the') 2
('the', 'beginning') 1
('the', 'end') 1

Skipgrams

Ngrams are a special case of skipgrams, with skip = 0. The class Skipgrams works the same as Ngrams, with the following differences:

  • Skipgrams has the method skipgrams_with_counts rather than ngrams_with_counts. skipgrams_with_counts also has the keyword argument skip (in addition to n and prefix).

  • For Skipgrams, the methods add and count, as well as collection instantiation (i.e., __init__), also have a meaningful skip keyword argument.

The function skipgrams_from_seq works the same as ngrams_from_seq, but has the skip keyword argument (in addition to seq and n).
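
To make the role of skip concrete, here is a minimal pure-Python sketch. It assumes the common reading in which a skipgram of length n may skip at most skip elements in total between its first and last element. The helper name skipgrams_sketch is hypothetical, and the package's own skipgrams_from_seq may differ in details such as output order.

```python
from itertools import combinations


def skipgrams_sketch(seq, n, skip):
    """Yield n-element tuples that skip at most `skip` elements (assumed semantics)."""
    seq = tuple(seq)
    for start in range(len(seq) - n + 1):
        # Positions that may follow `start` without exceeding the skip budget.
        window_end = min(start + n + skip, len(seq))
        for rest in combinations(range(start + 1, window_end), n - 1):
            yield (seq[start],) + tuple(seq[i] for i in rest)


with_skips = list(skipgrams_sketch("abcd", n=2, skip=1))
# with_skips == [('a', 'b'), ('a', 'c'), ('b', 'c'), ('b', 'd'), ('c', 'd')]

no_skips = list(skipgrams_sketch("abcd", n=2, skip=0))
# no_skips == [('a', 'b'), ('b', 'c'), ('c', 'd')]
```

With skip=0 the sketch yields exactly the contiguous bigrams, which is the sense in which ngrams are a special case of skipgrams.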

License

MIT License. Please see LICENSE.txt in the GitHub source code for details.

Changelog

Please see CHANGELOG.md in the GitHub source code.
