Static memory-efficient & fast Trie structures for Python (based on marisa-trie C++ library)
Project description
marisa-trie
Static memory-efficient Trie structures for Python (2.x and 3.x).
String data in a MARISA-trie may take up to 50x-100x less memory than in a standard Python dict; the raw lookup speed is comparable; trie also provides fast advanced methods like prefix search.
Based on marisa-trie C++ library.
Installation
pip install marisa-trie
Usage
There are several Trie classes in this package:
marisa_trie.Trie - read-only trie-based data structure that maps unicode keys to auto-generated unique IDs;
marisa_trie.RecordTrie - read-only trie-based data structure that maps unicode keys to lists of data tuples. All tuples must be of the same format (the data is packed with using python struct module).
marisa_trie.BytesTrie - read-only Trie that maps unicode keys to lists of bytes objects.
marisa_trie.Trie
Create a new trie:
>>> import marisa_trie >>> trie = marisa_trie.Trie([u'key1', u'key2', u'key12'])
Check if key is in trie:
>>> u'key1' in trie True >>> u'key20' in trie False
Each key is assigned an unique ID from 0 to (n - 1), where n is the number of keys; you can use this ID to store a value in a separate structure (e.g. python list):
>>> trie.key_id(u'key2') 1
Key can be reconstructed from the ID:
>>> trie.restore_key(1) u'key2'
Find all prefixes of a given key:
>>> trie.prefixes(u'key12') [u'key1', u'key12']
There is also a generator version of .prefixes method called .iter_prefixes.
Find all keys from this trie that starts with a given prefix:
>> trie.keys(u'key1') [u'key1', u'key12']
(iterator version .iterkeys(prefix) is also available).
marisa_trie.RecordTrie
Create a new trie:
>>> keys = [u'foo', u'bar', u'foobar', u'foo'] >>> values = [(1, 2), (2, 1), (3, 3), (2, 1)] >>> fmt = "<HH" # a tuple with 2 short integers >>> trie = marisa_trie.RecordTrie(fmt, zip(keys, values))
Trie initial data must be an iterable of tuples (unicode_key, data_tuple). Data tuples will be converted to bytes with struct.pack(fmt, *data_tuple).
Take a look at http://docs.python.org/library/struct.html#format-strings for the format string specification.
Duplicate keys are allowed.
Check if key is in trie:
>>> u'foo' in trie True >>> u'spam' in trie False
Get a values list:
>>> trie[u'bar'] [(2, 1)] >>> trie[u'foo'] [(1, 2), (2, 1)] >>> trie.get(u'bar', 123) [(2, 1)] >>> trie.get(u'BAAR', 123) # default value 123
Find all prefixes of a given key:
>>> trie.prefixes(u'foobarz') [u'foo', u'foobar']
Find all keys from this trie that starts with a given prefix:
>> trie.keys(u'fo') [u'foo', u'foo', u'foobar']
Find all items from this trie that starts with a given prefix:
>> trie.items(u'fo') [(u'foo', (1, 2)), (u'foo', (2, 1), (u'foobar', (3, 3))]
marisa_trie.BytesTrie
BytesTrie is similar to RecordTrie, but the values are raw bytes, not tuples:
>>> keys = [u'foo', u'bar', u'foobar', u'foo'] >>> values = [b'foo-value', b'bar-value', b'foobar-value', b'foo-value2'] >>> trie = marisa_trie.BytesTrie(zip(keys, values)) >>> trie[u'bar'] [b'bar-value']
Persistence
Trie objects supports saving/loading, pickling/unpickling and memory mapped I/O.
Write trie to a stream:
>>> with open('my_trie.marisa', 'w') as f: ... trie.write(f)
Save trie to a file:
>>> trie.save('my_trie_copy.marisa')
Read trie from stream:
>>> trie2 = marisa_trie.Trie() >>> with open('my_trie.marisa', 'r') as f: ... trie.read(f)
Load trie from file:
>>> trie2.load('my_trie.marisa')
Trie objects are picklable:
>>> import pickle >>> data = pickle.dumps(trie) >>> trie3 = pickle.loads(data)
You may also build a trie using marisa-build command-line utility (provided by underlying C++ library; it should be downloaded and compiled separately) and then load the trie from the resulting file using .load() method.
Memory mapped I/O
It is possible to use memory mapped file as data source:
>>> trie = marisa_trie.RecordTrie(fmt).mmap('my_record_trie.marisa')
This way the whole dictionary won’t be loaded to memory; memory mapped I/O is an easy way to share dictionary data among processes.
Trie storage options
marisa-trie C++ library provides some configuration options for trie storage; check http://marisa-trie.googlecode.com/svn/trunk/docs/readme.en.html page (scroll down to “Enumeration Constants” section) to get an idea.
These options are exposed as order, num_tries, cache_size and binary keyword arguments for trie constructors.
For example, set order to marisa_trie.LABEL_ORDER in order to make trie functions return results in alphabetical oder:
>>> trie = marisa_trie.RecordTrie(fmt, data, order=marisa_trie.LABEL_ORDER)
Benchmarks
My quick tests show that memory usage is quite decent. For a list of 3000000 (3 million) Russian words memory consumption with different data structures (under Python 2.7):
dict(unicode words -> word lenghts): about 600M
list(unicode words) : about 300M
BaseTrie from datrie library: about 70M
marisa_trie.RecordTrie : 11M
marisa_trie.Trie: 7M
Benchmark results (100k unicode words, integer values (lenghts of the words), Python 3.2, macbook air i5 1.8 Ghz):
dict __getitem__ (hits): 4.090M ops/sec Trie __getitem__ (hits): not supported BytesTrie __getitem__ (hits): 0.469M ops/sec RecordTrie __getitem__ (hits): 0.373M ops/sec dict get() (hits): 2.792M ops/sec Trie get() (hits): not supported BytesTrie get() (hits): 0.434M ops/sec RecordTrie get() (hits): 0.369M ops/sec dict get() (misses): 2.867M ops/sec Trie get() (misses): not supported BytesTrie get() (misses): 0.817M ops/sec RecordTrie get() (misses): 0.824M ops/sec dict __contains__ (hits): 4.036M ops/sec Trie __contains__ (hits): 0.910M ops/sec BytesTrie __contains__ (hits): 0.573M ops/sec RecordTrie __contains__ (hits): 0.591M ops/sec dict __contains__ (misses): 3.346M ops/sec Trie __contains__ (misses): 1.643M ops/sec BytesTrie __contains__ (misses): 0.976M ops/sec RecordTrie __contains__ (misses): 1.017M ops/sec dict items(): 58.316 ops/sec Trie items(): not supported BytesTrie items(): 2.456 ops/sec RecordTrie items(): 2.254 ops/sec dict keys(): 211.194 ops/sec Trie keys(): 3.341 ops/sec BytesTrie keys(): 2.308 ops/sec RecordTrie keys(): 2.184 ops/sec Trie.prefixes (hits): 0.176M ops/sec Trie.prefixes (mixed): 0.956M ops/sec Trie.prefixes (misses): 1.035M ops/sec RecordTrie.prefixes (hits): 0.106M ops/sec RecordTrie.prefixes (mixed): 0.451M ops/sec RecordTrie.prefixes (misses): 0.173M ops/sec Trie.iter_prefixes (hits): 0.170M ops/sec Trie.iter_prefixes (mixed): 0.799M ops/sec Trie.iter_prefixes (misses): 0.898M ops/sec Trie.keys(prefix="xxx"), avg_len(res)==415: 0.825K ops/sec Trie.keys(prefix="xxxxx"), avg_len(res)==17: 19.934K ops/sec Trie.keys(prefix="xxxxxxxx"), avg_len(res)==3: 85.239K ops/sec Trie.keys(prefix="xxxxx..xx"), avg_len(res)==1.4: 136.476K ops/sec Trie.keys(prefix="xxx"), NON_EXISTING: 1073.719K ops/sec
Tries from marisa_trie uses less memory, tries from datrie are faster.
Please take this benchmark results with a grain of salt; this is a very simple benchmark on a single data set.
Current limitations
The library is known not to install on Windows with mingw32 compiler;
.prefixes() method implementation is sub-optimal;
read() and write() methods don’t work with file-like objects (they work only with real files; pickling works fine for file-like objects);
iterator versions of methods are not always implemented;
there are keys() and items() methods but no values() method.
Contributions are welcome!
Contributing
Development happens at github and bitbucket:
The main issue tracker is at github: https://github.com/kmike/marisa-trie/issues
Feel free to submit ideas, bugs, pull requests (git or hg) or regular patches.
If you found a bug in a C++ part please report it to the original bug tracker.
How is source code organized
There are 4 folders in repository:
bench - benchmarks & benchmark data;
lib - original unmodified marisa-trie C++ library which is bundled for easier distribution; if something is have to be fixed in this library consider fixing it in the original repo ;
src - wrapper code; src/marisa_trie.pyx is a wrapper implementation; src/*.pxd files are Cython headers for corresponding C++ headers; src/*.cpp files are the pre-built extension code and shouldn’t be modified directly (they should be updated via update_cpp.sh script).
tests - the test suite.
Running tests and benchmarks
Make sure tox is installed and run
$ tox
from the source checkout. Tests should pass under python 2.6, 2.7, 3.2.
In order to run benchmarks, type
$ tox -c bench.ini
License
Wrapper code is licensed under MIT License. Bundled marisa-trie C++ library is licensed under BSD license.
CHANGES
0.3.5 (2012-08-30)
Pickling of RecordTrie is fixed (thanks lazarou for the report);
error messages should become more useful.
0.3.4 (2012-08-29)
Issues with mingw32 should be resolved (thanks Susumu Yata).
0.3.3 (2012-08-27)
.get(key, default=None) method for BytesTrie and RecordTrie;
small README improvements.
0.3.2 (2012-08-26)
Small code cleanup;
load, read and mmap methods returns ‘self’;
I can’t run tests (via tox) under Python 3.3 so it is removed from supported versions for now.
0.3.1 (2012-08-23)
.prefixes() support for RecordTrie and BytesTrie.
0.3 (2012-08-23)
RecordTrie and BytesTrie are introduced;
IntTrie class is removed (probably temporary?);
dumps/loads methods are renamed to tobytes/frombytes;
benchmark & tests improvements;
support for MARISA-trie config options is added.
0.2 (2012-08-19)
Pickling/unpickling support;
dumps/loads methods;
python 3.3 workaround;
improved tests;
benchmarks.
0.1 (2012-08-17)
Initial release.
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
File details
Details for the file marisa-trie-0.3.5.tar.gz
.
File metadata
- Download URL: marisa-trie-0.3.5.tar.gz
- Upload date:
- Size: 167.9 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | fc23a421d451f219a293bda9a2fae14059031dfac286c8f247cddf240c2c22f1 |
|
MD5 | 439a99b9737ee02d4b9d17c4747c3f04 |
|
BLAKE2b-256 | 1868c40b1d9431176663cc1b04578940fc329a26703f2119a4ab79857593b817 |