Super-fast, efficiently stored Trie for Python
Project description
datrie
Super-fast, efficiently stored Trie for Python (2.x and 3.x). Uses libdatrie.
Installation
pip install datrie
Usage
Create a new trie capable of storing items with lower-case ascii keys:
>>> import string >>> import datrie >>> trie = datrie.Trie(string.ascii_lowercase)
trie variable is a dict-like object that can have unicode keys of certain ranges and Python objects as values.
In addition to implementing the mapping interface, tries facilitate finding the items for a given prefix, and vice versa, finding the items whose keys are prefixes of a given string. As a common special case, finding the longest-prefix item is also supported.
Add some values to it (datrie keys must be unicode; the examples are for Python 2.x):
>>> trie[u'foo'] = 5 >>> trie[u'foobar'] = 10 >>> trie[u'bar'] = 'bar value' >>> trie.setdefault(u'foobar', 15) 10
Check if u’foo’ is in trie:
>>> u'foo' in trie True
Get a value:
>>> trie[u'foo'] 5
Find all prefixes of a word:
>>> trie.prefixes(u'foobarbaz') [u'foo', u'foobar'] >>> trie.prefix_items(u'foobarbaz') [(u'foo', 5), (u'foobar', 10)] >>> trie.iter_prefixes(u'foobarbaz') <generator object ...> >>> trie.iter_prefix_items(u'foobarbaz') <generator object ...>
Find the longest prefix of a word:
>>> trie.longest_prefix(u'foo') u'foo' >>> trie.longest_prefix(u'foobarbaz') u'foobar' >>> trie.longest_prefix(u'gaz') KeyError: u'gaz' >>> trie.longest_prefix(u'gaz', default=u'vasia') u'vasia' >>> trie.longest_prefix_item(u'foobarbaz') (u'foobar', 10)
Check if the trie has keys with a given prefix:
>>> trie.has_keys_with_prefix(u'fo') True >>> trie.has_keys_with_prefix(u'FO') False
Get all items with a given prefix from a trie:
>>> trie.keys(u'fo') [u'foo', u'foobar'] >>> trie.items(u'ba') [(u'bar', 'bar value')] >>> trie.values(u'foob') [10]
Save & load a trie (values must be picklable):
>>> trie.save('my.trie') >>> trie2 = datrie.Trie.load('my.trie')
Trie and BaseTrie
There are two Trie classes in datrie package: datrie.Trie and datrie.BaseTrie. datrie.BaseTrie is slightly faster and uses less memory but it can store only integer numbers 0 <= x <= 2147483647. datrie.Trie is a bit slower but can store any Python object as a value; it is the trie created by datrie.new by default.
If you don’t need values or integer values are OK then use datrie.BaseTrie:
import datrie import string trie = datrie.BaseTrie(string.ascii_lowercase)
Custom iteration
If the built-in trie methods don’t fit you can use datrie.State and datrie.Iterator to implement custom traversal.
For example, let’s find all suffixes of 'fo' for our trie and get the values:
>>> state = datrie.State(trie) >>> state.walk(u'foo') >>> it = datrie.Iterator(state) >>> while it.next(): ... print(it.key()) ... print(it.data)) o 5 obar 10
Performance
Performance is measured for datrie.Trie against Python’s dict with 100k unique unicode words (English and Russian) as keys and ‘1’ numbers as values.
datrie.Trie uses about 5M memory for 100k words; Python’s dict uses about 22M for this according to my unscientific tests.
This trie implementation is 2-6 times slower than python’s dict on __getitem__. Benchmark results (macbook air i5 1.7GHz, “1.000M ops/sec” == “1 000 000 operations per second”):
Python 2.6: dict __getitem__: 6.024M ops/sec trie __getitem__: 2.272M ops/sec Python 2.7: dict __getitem__: 6.693M ops/sec trie __getitem__: 2.357M ops/sec Python 3.2: dict __getitem__: 3.628M ops/sec trie __getitem__: 1.980M ops/sec Python 3.3b1: dict __getitem__: 6.721M ops/sec trie __getitem__: 2.584M ops/sec
Looking for prefixes of a given word is almost as fast as __getitem__ (results are for Python 3.2, this is the slowest supported Python, results are better for 2.6, 2.7 and 3.3):
trie.iter_prefix_items (hits): 0.461M ops/sec trie.prefix_items (hits): 0.743M ops/sec trie.prefix_items loop (hits): 0.629M ops/sec trie.iter_prefixes (hits): 0.759M ops/sec trie.iter_prefixes (misses): 1.538M ops/sec trie.iter_prefixes (mixed): 1.359M ops/sec trie.has_keys_with_prefix (hits): 1.896M ops/sec trie.has_keys_with_prefix (misses): 2.590M ops/sec trie.longest_prefix (hits): 1.710M ops/sec trie.longest_prefix (misses): 1.506M ops/sec trie.longest_prefix (mixed): 1.520M ops/sec trie.longest_prefix_item (hits): 1.276M ops/sec trie.longest_prefix_item (misses): 1.292M ops/sec trie.longest_prefix_item (mixed): 1.379M ops/sec
Looking for all words starting with a given prefix is mostly limited by overall result count (this can be improved in future because a lot of time is spent decoding strings from utf_32_le to Python’s unicode):
trie.items(prefix="xxx"), avg_len(res)==415: 0.721K ops/sec trie.keys(prefix="xxx"), avg_len(res)==415: 0.723K ops/sec trie.values(prefix="xxx"), avg_len(res)==415: 4.870K ops/sec trie.items(prefix="xxxxx"), avg_len(res)==17: 18.084K ops/sec trie.keys(prefix="xxxxx"), avg_len(res)==17: 18.279K ops/sec trie.values(prefix="xxxxx"), avg_len(res)==17: 98.668K ops/sec trie.items(prefix="xxxxxxxx"), avg_len(res)==3: 87.141K ops/sec trie.keys(prefix="xxxxxxxx"), avg_len(res)==3: 90.251K ops/sec trie.values(prefix="xxxxxxxx"), avg_len(res)==3: 346.981K ops/sec trie.items(prefix="xxxxx..xx"), avg_len(res)==1.4: 202.346K ops/sec trie.keys(prefix="xxxxx..xx"), avg_len(res)==1.4: 216.588K ops/sec trie.values(prefix="xxxxx..xx"), avg_len(res)==1.4: 532.858K ops/sec trie.items(prefix="xxx"), NON_EXISTING: 1864.411K ops/sec trie.keys(prefix="xxx"), NON_EXISTING: 1857.531K ops/sec trie.values(prefix="xxx"), NON_EXISTING: 1822.818K ops/sec
Insert time is very slow compared to dict, this is the limitation of double-array tries; updates are quite fast:
dict __setitem__ (updates): 3.489M ops/sec trie __setitem__ (updates): 1.862M ops/sec dict __setitem__ (inserts): 3.628M ops/sec trie __setitem__ (inserts): 0.050M ops/sec dict setdefault (updates): 2.575M ops/sec trie setdefault (updates): 1.600M ops/sec dict setdefault (inserts): 2.596M ops/sec trie setdefault (inserts): 0.050M ops/sec
Other results (note that len(trie) is currently implemented using trie traversal):
dict __contains__ (hits): 3.905M ops/sec trie __contains__ (hits): 2.017M ops/sec dict __contains__ (misses): 3.296M ops/sec trie __contains__ (misses): 2.633M ops/sec dict __len__: 199728.762 ops/sec trie __len__: 22.007 ops/sec dict values(): 360.243 ops/sec trie values(): 19.703 ops/sec dict keys(): 180.307 ops/sec trie keys(): 3.361 ops/sec dict items(): 49.029 ops/sec trie items(): 3.172 ops/sec
Please take this benchmark results with a grain of salt; this is a very simple benchmark and may not cover your use case.
Current Limitations
keys must be unicode (no implicit conversion for byte strings under Python 2.x, sorry);
keys longer than 32768 are unsupported;
there are no iterator versions of keys/values/items (this is not implemented yet);
it doesn’t work under pypy+MacOS X (some obscure error which I don’t understand);
library is not tested with narrow Python builds.
Contributing
Development happens at github and bitbucket:
The main issue tracker is at github.
Feel free to submit ideas, bugs, pull requests (git or hg) or regular patches.
Running tests and benchmarks
Make sure tox is installed and run
$ tox
from the source checkout. Tests should pass under python 2.6, 2.7, 3.2 and 3.3.
$ tox -c tox-bench.ini
runs benchmarks.
If you’ve changed anything in the source code then make sure cython is installed and run
$ update_c.sh
before each tox command.
Please note that benchmarks are not included in the release tar.gz’s because benchmark data is large and this saves a lot of bandwidth; use source checkouts from github or bitbucket for the benchmarks.
License
Licensed under LGPL v2.1.
CHANGES
0.4 (2012-07-27)
libdatrie improvements and bugfixes, including C iterator API support;
custom iteration support using datrie.State and datrie.Iterator.
speed improvements: __length__, keys, values and items methods should be up to 2x faster.
keys longer than 32768 are not supported in this release.
0.3 (2012-07-21)
There are no new features or speed improvements in this release.
datrie.new is deprecated; use datrie.Trie with the same arguments;
small test & benchmark improvements.
0.2 (2012-07-16)
datrie.Trie items can have any Python object as a value (Trie from 0.1.x becomes datrie.BaseTrie);
longest_prefix and longest_prefix_items are fixed;
save & load are rewritten;
setdefault method.
0.1.1 (2012-07-13)
Windows support (upstream libdatrie changes are merged);
license is changed from LGPL v3 to LGPL v2.1 to match the libdatrie license.
0.1 (2012-07-12)
Initial release.
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
File details
Details for the file datrie-0.4.tar.gz
.
File metadata
- Download URL: datrie-0.4.tar.gz
- Upload date:
- Size: 141.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | 6e7b8d0c37af7c46f528659031f0bda9f7a0d5fb193767bb01c24cf7afafb433 |
|
MD5 | b09984e60add19cbb92dbeb9c07cc49f |
|
BLAKE2b-256 | c89f0988135482b26795c4dbdd9930d2557f73561eb29e07bca31470af6ec451 |