Skip to main content

Super-fast, efficiently stored Trie for Python.

Project description

datrie

Super-fast, efficiently stored Trie for Python (2.x and 3.x). Uses libdatrie.

Installation

pip install datrie

Usage

Create a new trie capable of storing items with lower-case ascii keys:

>>> import string
>>> import datrie
>>> trie = datrie.Trie(string.ascii_lowercase)

trie variable is a dict-like object that can have unicode keys of certain ranges and Python objects as values.

In addition to implementing the mapping interface, tries facilitate finding the items for a given prefix, and vice versa, finding the items whose keys are prefixes of a given string. As a common special case, finding the longest-prefix item is also supported.

Add some values to it (datrie keys must be unicode; the examples are for Python 2.x):

>>> trie[u'foo'] = 5
>>> trie[u'foobar'] = 10
>>> trie[u'bar'] = 'bar value'
>>> trie.setdefault(u'foobar', 15)
10

Check if u’foo’ is in trie:

>>> u'foo' in trie
True

Get a value:

>>> trie[u'foo']
5

Find all prefixes of a word:

>>> trie.prefixes(u'foobarbaz')
[u'foo', u'foobar']

>>> trie.prefix_items(u'foobarbaz')
[(u'foo', 5), (u'foobar', 10)]

>>> trie.iter_prefixes(u'foobarbaz')
<generator object ...>

>>> trie.iter_prefix_items(u'foobarbaz')
<generator object ...>

Find the longest prefix of a word:

>>> trie.longest_prefix(u'foo')
u'foo'

>>> trie.longest_prefix(u'foobarbaz')
u'foobar'

>>> trie.longest_prefix(u'gaz')
KeyError: u'gaz'

>>> trie.longest_prefix(u'gaz', default=u'vasia')
u'vasia'

>>> trie.longest_prefix_item(u'foobarbaz')
(u'foobar', 10)

Check if the trie has keys with a given prefix:

>>> trie.has_keys_with_prefix(u'fo')
True

>>> trie.has_keys_with_prefix(u'FO')
False

Get all items with a given prefix from a trie:

>>> trie.keys(u'fo')
[u'foo', u'foobar']

>>> trie.items(u'ba')
[(u'bar', 'bar value')]

>>> trie.values(u'foob')
[10]

Get all suffixes of certain word starting with a given prefix from a trie:

>>> trie.suffixes()
[u'pro', u'producer', u'producers', u'product', u'production', u'productivity', u'prof']
>>> trie.suffixes(u'prod')
[u'ucer', u'ucers', u'uct', u'uction', u'uctivity']

Save & load a trie (values must be picklable):

>>> trie.save('my.trie')
>>> trie2 = datrie.Trie.load('my.trie')

Trie and BaseTrie

There are two Trie classes in datrie package: datrie.Trie and datrie.BaseTrie. datrie.BaseTrie is slightly faster and uses less memory but it can store only integer numbers -2147483648 <= x <= 2147483647. datrie.Trie is a bit slower but can store any Python object as a value.

If you don’t need values or integer values are OK then use datrie.BaseTrie:

import datrie
import string
trie = datrie.BaseTrie(string.ascii_lowercase)

Custom iteration

If the built-in trie methods don’t fit you can use datrie.State and datrie.Iterator to implement custom traversal.

For example, let’s find all suffixes of 'fo' for our trie and get the values:

>>> state = datrie.State(trie)
>>> state.walk(u'foo')
>>> it = datrie.Iterator(state)
>>> while it.next():
...     print(it.key())
...     print(it.data))
o
5
obar
10

Performance

Performance is measured for datrie.Trie against Python’s dict with 100k unique unicode words (English and Russian) as keys and ‘1’ numbers as values.

datrie.Trie uses about 5M memory for 100k words; Python’s dict uses about 22M for this according to my unscientific tests.

This trie implementation is 2-6 times slower than python’s dict on __getitem__. Benchmark results (macbook air i5 1.8GHz, “1.000M ops/sec” == “1 000 000 operations per second”):

Python 2.6:
dict __getitem__: 7.107M ops/sec
trie __getitem__: 2.478M ops/sec

Python 2.7:
dict __getitem__: 6.550M ops/sec
trie __getitem__: 2.474M ops/sec

Python 3.2:
dict __getitem__: 8.185M ops/sec
trie __getitem__: 2.684M ops/sec

Python 3.3:
dict __getitem__: 7.050M ops/sec
trie __getitem__: 2.755M ops/sec

Looking for prefixes of a given word is almost as fast as __getitem__ (results are for Python 3.3):

trie.iter_prefix_items (hits):      0.461M ops/sec
trie.prefix_items (hits):           0.743M ops/sec
trie.prefix_items loop (hits):      0.629M ops/sec
trie.iter_prefixes (hits):          0.759M ops/sec
trie.iter_prefixes (misses):        1.538M ops/sec
trie.iter_prefixes (mixed):         1.359M ops/sec
trie.has_keys_with_prefix (hits):   1.896M ops/sec
trie.has_keys_with_prefix (misses): 2.590M ops/sec
trie.longest_prefix (hits):         1.710M ops/sec
trie.longest_prefix (misses):       1.506M ops/sec
trie.longest_prefix (mixed):        1.520M ops/sec
trie.longest_prefix_item (hits):    1.276M ops/sec
trie.longest_prefix_item (misses):  1.292M ops/sec
trie.longest_prefix_item (mixed):   1.379M ops/sec

Looking for all words starting with a given prefix is mostly limited by overall result count (this can be improved in future because a lot of time is spent decoding strings from utf_32_le to Python’s unicode):

trie.items(prefix="xxx"), avg_len(res)==415:        0.609K ops/sec
trie.keys(prefix="xxx"), avg_len(res)==415:         0.642K ops/sec
trie.values(prefix="xxx"), avg_len(res)==415:       4.974K ops/sec
trie.items(prefix="xxxxx"), avg_len(res)==17:       14.781K ops/sec
trie.keys(prefix="xxxxx"), avg_len(res)==17:        15.766K ops/sec
trie.values(prefix="xxxxx"), avg_len(res)==17:      96.456K ops/sec
trie.items(prefix="xxxxxxxx"), avg_len(res)==3:     75.165K ops/sec
trie.keys(prefix="xxxxxxxx"), avg_len(res)==3:      77.225K ops/sec
trie.values(prefix="xxxxxxxx"), avg_len(res)==3:    320.755K ops/sec
trie.items(prefix="xxxxx..xx"), avg_len(res)==1.4:  173.591K ops/sec
trie.keys(prefix="xxxxx..xx"), avg_len(res)==1.4:   180.678K ops/sec
trie.values(prefix="xxxxx..xx"), avg_len(res)==1.4: 503.392K ops/sec
trie.items(prefix="xxx"), NON_EXISTING:             2023.647K ops/sec
trie.keys(prefix="xxx"), NON_EXISTING:              1976.928K ops/sec
trie.values(prefix="xxx"), NON_EXISTING:            2060.372K ops/sec

Random insert time is very slow compared to dict, this is the limitation of double-array tries; updates are quite fast. If you want to build a trie, consider sorting keys before the insertion:

dict __setitem__ (updates):            6.497M ops/sec
trie __setitem__ (updates):            2.633M ops/sec
dict __setitem__ (inserts, random):    5.808M ops/sec
trie __setitem__ (inserts, random):    0.053M ops/sec
dict __setitem__ (inserts, sorted):    5.749M ops/sec
trie __setitem__ (inserts, sorted):    0.624M ops/sec
dict setdefault (updates):             3.455M ops/sec
trie setdefault (updates):             1.910M ops/sec
dict setdefault (inserts):             3.466M ops/sec
trie setdefault (inserts):             0.053M ops/sec

Other results (note that len(trie) is currently implemented using trie traversal):

dict __contains__ (hits):    6.801M ops/sec
trie __contains__ (hits):    2.816M ops/sec
dict __contains__ (misses):  5.470M ops/sec
trie __contains__ (misses):  4.224M ops/sec
dict __len__:                334336.269 ops/sec
trie __len__:                22.900 ops/sec
dict values():               406.507 ops/sec
trie values():               20.864 ops/sec
dict keys():                 189.298 ops/sec
trie keys():                 2.773 ops/sec
dict items():                48.734 ops/sec
trie items():                2.611 ops/sec

Please take this benchmark results with a grain of salt; this is a very simple benchmark and may not cover your use case.

Current Limitations

  • keys must be unicode (no implicit conversion for byte strings under Python 2.x, sorry);

  • there are no iterator versions of keys/values/items (this is not implemented yet);

  • it is painfully slow and maybe buggy under pypy;

  • library is not tested with narrow Python builds.

Contributing

Development happens at github and bitbucket:

The main issue tracker is at github.

Feel free to submit ideas, bugs, pull requests (git or hg) or regular patches.

Running tests and benchmarks

Make sure tox is installed and run

$ tox

from the source checkout. Tests should pass under python 2.6, 2.7 and 3.2.

$ tox -c tox-bench.ini

runs benchmarks.

If you’ve changed anything in the source code then make sure cython is installed and run

$ update_c.sh

before each tox command.

Please note that benchmarks are not included in the release tar.gz’s because benchmark data is large and this saves a lot of bandwidth; use source checkouts from github or bitbucket for the benchmarks.

Authors & Contributors

This module is based on libdatrie C library by Theppitak Karoonboonyanan and is inspired by fast_trie Ruby bindings, PyTrie pure Python implementation and Tree::Trie Perl implementation; some docs and API ideas are borrowed from these projects.

License

Licensed under LGPL v2.1.

CHANGES

0.7.1 (2016-03-12)

  • updated the bundled C library to version 0.2.9;

  • implemented Trie.__len__ in terms of trie_enumerate;

  • rebuilt Cython wrapper with Cython 0.23.4;

  • changed Trie to implement collections.abc.MutableMapping;

  • fixed Trie pickling, which segfaulted on Python2.X.

0.7 (2014-02-18)

  • bundled libdatrie C library is updated to version 0.2.8;

  • new .suffixes() method (thanks Ahmed T. Youssef);

  • wrapper is rebuilt with Cython 0.20.1.

0.6.1 (2013-09-21)

  • fixed build for Visual Studio (thanks Gabi Davar).

0.6 (2013-07-09)

  • datrie is rebuilt with Cython 0.19.1;

  • iter_prefix_values, prefix_values and longest_prefix_value methods for datrie.BaseTrie and datrie.Trie (thanks Jared Suttles).

0.5.1 (2013-01-30)

  • Recently introduced memory leak in longest_prefix and longest_prefix_item is fixed.

0.5 (2013-01-29)

  • longest_prefix and longest_prefix_item methods are fixed;

  • datrie is rebuilt with Cython 0.18;

  • misleading benchmark results in README are fixed;

  • State._walk is renamed to State.walk_char.

0.4.2 (2012-09-02)

  • Update to latest libdatrie; this makes .keys() method a bit slower but removes a keys length limitation.

0.4.1 (2012-07-29)

  • cPickle is used for saving/loading datrie.Trie if it is available.

0.4 (2012-07-27)

  • libdatrie improvements and bugfixes, including C iterator API support;

  • custom iteration support using datrie.State and datrie.Iterator.

  • speed improvements: __length__, keys, values and items methods should be up to 2x faster.

  • keys longer than 32768 are not supported in this release.

0.3 (2012-07-21)

There are no new features or speed improvements in this release.

  • datrie.new is deprecated; use datrie.Trie with the same arguments;

  • small test & benchmark improvements.

0.2 (2012-07-16)

  • datrie.Trie items can have any Python object as a value (Trie from 0.1.x becomes datrie.BaseTrie);

  • longest_prefix and longest_prefix_items are fixed;

  • save & load are rewritten;

  • setdefault method.

0.1.1 (2012-07-13)

  • Windows support (upstream libdatrie changes are merged);

  • license is changed from LGPL v3 to LGPL v2.1 to match the libdatrie license.

0.1 (2012-07-12)

Initial release.

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

datrie-0.7.1.tar.gz (193.5 kB view details)

Uploaded Source

Built Distributions

datrie-0.7.1-cp35-cp35m-win_amd64.whl (99.7 kB view details)

Uploaded CPython 3.5m Windows x86-64

datrie-0.7.1-cp35-cp35m-win32.whl (84.0 kB view details)

Uploaded CPython 3.5m Windows x86

datrie-0.7.1-cp35-cp35m-macosx_10_6_intel.macosx_10_9_intel.macosx_10_9_x86_64.macosx_10_10_intel.macosx_10_10_x86_64.whl (267.5 kB view details)

Uploaded CPython 3.5m macOS 10.10+ intel macOS 10.10+ x86-64 macOS 10.6+ intel macOS 10.9+ intel macOS 10.9+ x86-64

datrie-0.7.1-cp34-cp34m-win_amd64.whl (98.5 kB view details)

Uploaded CPython 3.4m Windows x86-64

datrie-0.7.1-cp34-cp34m-win32.whl (86.7 kB view details)

Uploaded CPython 3.4m Windows x86

datrie-0.7.1-cp34-cp34m-macosx_10_6_intel.macosx_10_9_intel.macosx_10_9_x86_64.macosx_10_10_intel.macosx_10_10_x86_64.whl (272.7 kB view details)

Uploaded CPython 3.4m macOS 10.10+ intel macOS 10.10+ x86-64 macOS 10.6+ intel macOS 10.9+ intel macOS 10.9+ x86-64

datrie-0.7.1-cp33-cp33m-macosx_10_6_intel.macosx_10_9_intel.macosx_10_9_x86_64.macosx_10_10_intel.macosx_10_10_x86_64.whl (272.4 kB view details)

Uploaded CPython 3.3m macOS 10.10+ intel macOS 10.10+ x86-64 macOS 10.6+ intel macOS 10.9+ intel macOS 10.9+ x86-64

datrie-0.7.1-cp27-cp27m-win_amd64.whl (98.6 kB view details)

Uploaded CPython 2.7m Windows x86-64

datrie-0.7.1-cp27-cp27m-win32.whl (85.1 kB view details)

Uploaded CPython 2.7m Windows x86

datrie-0.7.1-cp27-cp27m-macosx_10_6_intel.macosx_10_9_intel.macosx_10_9_x86_64.macosx_10_10_intel.macosx_10_10_x86_64.whl (269.2 kB view details)

Uploaded CPython 2.7m macOS 10.10+ intel macOS 10.10+ x86-64 macOS 10.6+ intel macOS 10.9+ intel macOS 10.9+ x86-64

File details

Details for the file datrie-0.7.1.tar.gz.

File metadata

  • Download URL: datrie-0.7.1.tar.gz
  • Upload date:
  • Size: 193.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No

File hashes

Hashes for datrie-0.7.1.tar.gz
Algorithm Hash digest
SHA256 7a11371cc2dbbad71d6dfef57ced6e8b384bb377eeb847c63d58f8dc8e8b2023
MD5 d61280eff0ead9cdf03d286ce4228057
BLAKE2b-256 445fbf7e4711f6aa95edb2216b3487eeac719645802259643d341668e65636db

See more details on using hashes here.

File details

Details for the file datrie-0.7.1-cp35-cp35m-win_amd64.whl.

File metadata

File hashes

Hashes for datrie-0.7.1-cp35-cp35m-win_amd64.whl
Algorithm Hash digest
SHA256 81d8059fdeb548f1c3e5c3d68f7768be124730d65ca68a6bb49b17a2d8d72375
MD5 591bf5bbec20bcbca64923f2e0e0ea0c
BLAKE2b-256 7b603ea140361c9fb4cf633153fb9f2a9de33c0c869eb8a5bdb5805790dc0782

See more details on using hashes here.

File details

Details for the file datrie-0.7.1-cp35-cp35m-win32.whl.

File metadata

File hashes

Hashes for datrie-0.7.1-cp35-cp35m-win32.whl
Algorithm Hash digest
SHA256 caf32926bf822cd0213c0b7f2d20179f638eb8845e3b97ad15c4df35d5b3b504
MD5 7b9e326368500c2a84eb70d1ef99dae2
BLAKE2b-256 0afab2e9f90d7ce99679ae931b3f050502af2e841a56a7615932efe2364ef75e

See more details on using hashes here.

File details

Details for the file datrie-0.7.1-cp35-cp35m-macosx_10_6_intel.macosx_10_9_intel.macosx_10_9_x86_64.macosx_10_10_intel.macosx_10_10_x86_64.whl.

File metadata

File hashes

Hashes for datrie-0.7.1-cp35-cp35m-macosx_10_6_intel.macosx_10_9_intel.macosx_10_9_x86_64.macosx_10_10_intel.macosx_10_10_x86_64.whl
Algorithm Hash digest
SHA256 740f510869536fe05b35ca5ae82a107a77ebb298a9829650d4c776d418737454
MD5 d0315c9e6cc8fb72350d86cbe0c5c202
BLAKE2b-256 e03771ef5fe15958ac172f3ab2e07be1b94da86c7a33f86a067e52aafcb4fbb4

See more details on using hashes here.

File details

Details for the file datrie-0.7.1-cp34-cp34m-win_amd64.whl.

File metadata

File hashes

Hashes for datrie-0.7.1-cp34-cp34m-win_amd64.whl
Algorithm Hash digest
SHA256 ec3405496cc859b573b3c1c082229a7559fdab54a6a3d5a457e49a6e39df0e3e
MD5 88169390be3da6d3df733d6dd84f1a3f
BLAKE2b-256 f63abcd695ce304734f95e7bb1da50a86924fc3e54aa6c401cd599b771e01d30

See more details on using hashes here.

File details

Details for the file datrie-0.7.1-cp34-cp34m-win32.whl.

File metadata

File hashes

Hashes for datrie-0.7.1-cp34-cp34m-win32.whl
Algorithm Hash digest
SHA256 21a07b4754284cba3f45f3fd618a5d7d2b52d9609258a32177be553d900eb482
MD5 153fc323a3cb6aed24fef7ce253dd2c4
BLAKE2b-256 5a1e8a9125b394e2c782ddfd4cf590e3966041e22046ad4a654917baec3bf497

See more details on using hashes here.

File details

Details for the file datrie-0.7.1-cp34-cp34m-macosx_10_6_intel.macosx_10_9_intel.macosx_10_9_x86_64.macosx_10_10_intel.macosx_10_10_x86_64.whl.

File metadata

File hashes

Hashes for datrie-0.7.1-cp34-cp34m-macosx_10_6_intel.macosx_10_9_intel.macosx_10_9_x86_64.macosx_10_10_intel.macosx_10_10_x86_64.whl
Algorithm Hash digest
SHA256 7edf889caf0afa625a797c5e19aea3290f202493838119e7ce0015801c8ad38a
MD5 5ecb6a452bedd3f32a3eca20c75c60ea
BLAKE2b-256 c7710e82bd9ea799d4a280face290036604d9ed37e305049bbd3718ca3dd542c

See more details on using hashes here.

File details

Details for the file datrie-0.7.1-cp33-cp33m-macosx_10_6_intel.macosx_10_9_intel.macosx_10_9_x86_64.macosx_10_10_intel.macosx_10_10_x86_64.whl.

File metadata

File hashes

Hashes for datrie-0.7.1-cp33-cp33m-macosx_10_6_intel.macosx_10_9_intel.macosx_10_9_x86_64.macosx_10_10_intel.macosx_10_10_x86_64.whl
Algorithm Hash digest
SHA256 0f5ff34c21c1d04b19d2fb1f07a1eff8790f29887ed19e5ddd3ebcf3c4727c12
MD5 56e6ec090a64359f4cfc55227e7945d7
BLAKE2b-256 1e9470da9a58022d6982754bfa0467627a6faf437fdbf6cec8974170092216ed

See more details on using hashes here.

File details

Details for the file datrie-0.7.1-cp27-cp27m-win_amd64.whl.

File metadata

File hashes

Hashes for datrie-0.7.1-cp27-cp27m-win_amd64.whl
Algorithm Hash digest
SHA256 03c5a8516fc878aeed313c3d3ab71a3c254140fe7d43721809f5b260ed4a7cae
MD5 7f80142a4bfacf337ebdb80e136952b6
BLAKE2b-256 55da7109264acfef5e4365fbeb806f4cdac3fe87d942356518cf0bc1161a4085

See more details on using hashes here.

File details

Details for the file datrie-0.7.1-cp27-cp27m-win32.whl.

File metadata

File hashes

Hashes for datrie-0.7.1-cp27-cp27m-win32.whl
Algorithm Hash digest
SHA256 84dd2f2f390ff477f341feb6e71fad5bb5782b03df350135fde8698f9acecf89
MD5 7bd116906c192a7702cebe327cebfe6a
BLAKE2b-256 606657a3cf558cdc946e179d148fc1827c5769e2bbd622bab613b7073de9c318

See more details on using hashes here.

File details

Details for the file datrie-0.7.1-cp27-cp27m-macosx_10_6_intel.macosx_10_9_intel.macosx_10_9_x86_64.macosx_10_10_intel.macosx_10_10_x86_64.whl.

File metadata

File hashes

Hashes for datrie-0.7.1-cp27-cp27m-macosx_10_6_intel.macosx_10_9_intel.macosx_10_9_x86_64.macosx_10_10_intel.macosx_10_10_x86_64.whl
Algorithm Hash digest
SHA256 7f2caf5a4f96af984d01b75b625bb6239eef0cab7e0ae99b6b9461bb97b25730
MD5 088164e51b337a1ccd8ac18f18d9e8e6
BLAKE2b-256 4ad8db88c863ba730b462823b7cf9020c3981a9cb13698d5453268719220bd4c

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page