Python bindings to marisa-trie (unofficial)
Project description
marisa-trie
MARISA-Trie structure for Python (2.x and 3.x). Uses marisa-trie C++ library.
MARISA-Trie is a read-only trie that is very memory efficient.
There are official SWIG-based Python bindings included in C++ library distribution; this package provides an alternative unofficial Cython-based pip-installable Python bindings.
Installation
pip install marisa-trie
Usage
Create a new trie:
>>> import marisa_trie >>> trie = marisa_trie.Trie()
Build a trie:
>>> trie.build([u'key1', u'key2', u'key12']) <marisa_trie.Trie at ...>
Check if key is in trie:
>>> u'key1' in trie True >>> u'key20' in trie False
Each key is assigned an unique ID from 0 to (n - 1), where n is the number of keys; you can use this ID to store a value in a separate structure (e.g. python list):
>>> trie.key_id(u'key2') 1
Key can be reconstructed from the ID:
>>> trie.restore_key(1) u'key2'
Find all prefixes of a given key:
>>> trie.prefixes(u'key12') [u'key1', u'key12']
There is also a generator version of .prefixes method called .iter_prefixes.
Find all keys from this trie that starts with a given prefix:
>> trie.keys(u'key1') [u'key1', u'key12']
(iterator version .iterkeys(prefix) is also available).
It is possible to save a trie to a file:
>>> with open('my_trie.marisa', 'w') as f: ... trie.write(f)
or:
>>> trie.save('my_trie_copy.marisa')
Load a trie:
>>> trie2 = marisa.Trie() >>> with open('my_trie.marisa', 'r') as f: ... trie.load(f)
or:
>>> trie2.load('my_trie.marisa')
Alternatively, you could build a trie using marisa-build command-line utility (provided by underlying C library; it should be downloaded and compiled separately) and then load it from a file.
Benchmarks
There are no dedicated benchmarks for this package yet.
My quick tests show that memory usage is quite decent. For a list of 3000000 (3 million) Russian words memory consumption with different data structures (under Python 2.7):
list(unicode words) : about 300M
Trie from datrie library: about 70M
marisa_trie.Trie: 7M
Lookup speed seems to be about 2x slower than with datrie, but I haven’t checked this with a good benchmark suite.
Contributing
Development happens at github and bitbucket:
The main issue tracker is at github: https://github.com/kmike/marisa-trie/issues
Feel free to submit ideas, bugs, pull requests (git or hg) or regular patches.
If you found a bug in a C++ part please report it to the original bug tracker.
Running tests and benchmarks
Make sure tox is installed and run
$ tox
from the source checkout. Tests should pass under python 2.6, 2.7, 3.2 and 3.3.
$ tox -c bench.ini
runs benchmarks.
License
Wrapper code is licensed under MIT License. Bundled marisa-trie C++ library is licensed under BSD license.
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
File details
Details for the file marisa-trie-0.1.tar.gz
.
File metadata
- Download URL: marisa-trie-0.1.tar.gz
- Upload date:
- Size: 105.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | c8125d63e37942d7cf714d68e37e43ba4bb0137bf42841b838f7de04bd17d278 |
|
MD5 | 5ba8c21f0ae60e321b728a604fd3f302 |
|
BLAKE2b-256 | 81f6051ef465d728ab2997ccc0031acb286b21befb1c5ff126b04399d357cbfe |