Pure-python reader for DAWGs created by dawgdic C++ library or DAWG Python extension.
Project description
DAWG-Python
This pure-python package provides read-only access for files created by dawgdic C++ library and DAWG python package.
This package is not capable of creating DAWGs. It works with DAWGs built by dawgdic C++ library or DAWG Python extension module. The main purpose of DAWG-Python is to provide an access to DAWGs without requiring compiled extensions. It is also quite fast under PyPy (see benchmarks).
Installation
pip install DAWG-Python
Usage
The aim of DAWG-Python is to be API- and binary-compatible with DAWG when it is possible.
First, you have to create a dawg using DAWG module:
import dawg d = dawg.DAWG(data) d.save('words.dawg')
And then this dawg can be loaded without requiring C extensions:
import dawg_python d = dawg_python.DAWG().load('words.dawg')
Please consult DAWG docs for detailed usage. Some features (like constructor parameters or save method) are intentionally unsupported.
Benchmarks
Benchmark results (100k unicode words, integer values (lenghts of the words), PyPy 1.9, macbook air i5 1.8 Ghz):
dict __getitem__ (hits): 11.090M ops/sec DAWG __getitem__ (hits): not supported BytesDAWG __getitem__ (hits): 0.493M ops/sec RecordDAWG __getitem__ (hits): 0.376M ops/sec dict get() (hits): 10.127M ops/sec DAWG get() (hits): not supported BytesDAWG get() (hits): 0.481M ops/sec RecordDAWG get() (hits): 0.402M ops/sec dict get() (misses): 14.885M ops/sec DAWG get() (misses): not supported BytesDAWG get() (misses): 1.259M ops/sec RecordDAWG get() (misses): 1.337M ops/sec dict __contains__ (hits): 11.100M ops/sec DAWG __contains__ (hits): 1.317M ops/sec BytesDAWG __contains__ (hits): 1.107M ops/sec RecordDAWG __contains__ (hits): 1.095M ops/sec dict __contains__ (misses): 10.567M ops/sec DAWG __contains__ (misses): 1.902M ops/sec BytesDAWG __contains__ (misses): 1.873M ops/sec RecordDAWG __contains__ (misses): 1.862M ops/sec dict items(): 44.401 ops/sec DAWG items(): not supported BytesDAWG items(): 3.226 ops/sec RecordDAWG items(): 2.987 ops/sec dict keys(): 426.250 ops/sec DAWG keys(): not supported BytesDAWG keys(): 6.050 ops/sec RecordDAWG keys(): 6.363 ops/sec DAWG.prefixes (hits): 0.756M ops/sec DAWG.prefixes (mixed): 1.965M ops/sec DAWG.prefixes (misses): 1.773M ops/sec RecordDAWG.keys(prefix="xxx"), avg_len(res)==415: 1.429K ops/sec RecordDAWG.keys(prefix="xxxxx"), avg_len(res)==17: 36.994K ops/sec RecordDAWG.keys(prefix="xxxxxxxx"), avg_len(res)==3: 121.897K ops/sec RecordDAWG.keys(prefix="xxxxx..xx"), avg_len(res)==1.4: 265.015K ops/sec RecordDAWG.keys(prefix="xxx"), NON_EXISTING: 2450.898K ops/sec
Under CPython expect it to be about 50x slower. Memory consumption of DAWG-Python should be the same as of DAWG.
Current limitations
This package is not capable of creating DAWGs;
IntDAWG is not implemented;
all the limitations of DAWG apply.
Contributions are welcome!
Contributing
Development happens at github and bitbucket:
The main issue tracker is at github: https://github.com/kmike/DAWG-Python/issues
Feel free to submit ideas, bugs, pull requests (git or hg) or regular patches.
Running tests and benchmarks
Make sure tox is installed and run
$ tox
from the source checkout. Tests should pass under python 2.6, 2.7, 3.2, 3.3 and PyPy >= 1.9.
In order to run benchmarks, type
$ tox -c bench.ini -e pypy
This runs benchmarks under PyPy (they are about 50x slower under CPython).
License
This package is licensed under MIT License.
Changes
0.6 (2013-03-23)
Use less shared state internally. This should fix thread-safety bugs and make iterkeys/iteritems reenterant.
0.5.1 (2013-03-01)
Internal tweaks: memory usage is reduced; something is a bit faster, something is a bit slower.
0.5 (2012-10-08)
Storage scheme is updated to match DAWG==0.5. This enables the alphabetical ordering of BytesDAWG and RecordDAWG items.
In order to read BytesDAWG or RecordDAWG created with versions of DAWG < 0.5 use payload_separator constructor argument:
>>> BytesDAWG(payload_separator=b'\xff').load('old.dawg')
0.3.1 (2012-10-01)
Bug with empty DAWGs is fixed.
0.3 (2012-09-26)
iterkeys and iteritems methods.
0.2 (2012-09-24)
prefixes support.
0.1 (2012-09-20)
Initial release.
Project details
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
File details
Details for the file DAWG-Python-0.6.tar.gz
.
File metadata
- Download URL: DAWG-Python-0.6.tar.gz
- Upload date:
- Size: 8.1 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | ce29b0b870ea5167f8ddffb09b7ead61490e992fe772bf774a93870e883f6a00 |
|
MD5 | ffc3afaa1c0f7c8436eea7eef5dfcb43 |
|
BLAKE2b-256 | 0303784cacaa92a89b398d4200d4caaea813697a641aa73617723142f5233f57 |