Skip to main content

Fast and memory efficient DAWG for Python

Project description

DAWG

This package provides DAWG-based dictionary-like read-only objects for Python (2.x and 3.x).

String data in a DAWG (Directed Acyclic Word Graph) may take 200x less memory than in a standard Python dict or list and the raw lookup speed is comparable; it also provides fast advanced methods like prefix search.

Based on dawgdic C++ library.

Installation

pip install DAWG

Usage

There are several DAWG classes in this package:

  • dawg.DAWG - basic DAWG wrapper; it can store unicode keys and do exact lookups;

  • dawg.CompletionDAWG - dawg.DAWG subclass that supports key completion and prefix lookups (but requires more memory);

  • dawg.BytesDAWG - dawg.CompletionDAWG subclass that maps unicode keys to lists of bytes objects.

  • dawg.RecordDAWG - dawg.BytesDAWG subclass that maps unicode keys to lists of data tuples. All tuples must be of the same format (the data is packed using python struct module).

  • dawg.IntDAWG - dawg.DAWG subclass that maps unicode keys to integer values.

DAWG and CompletionDAWG

DAWG and CompletionDAWG are useful when you need fast & memory efficient simple string storage. These classes does not support assigning values to keys.

DAWG and CompletionDAWG constructors accept an iterable with keys:

>>> import dawg
>>> words = [u'foo', u'bar', u'foobar', u'foö', u'bör']
>>> base_dawg = dawg.DAWG(words)
>>> completion_dawg = dawg.CompletionDAWG(words)

It is then possible to check if the key is in a DAWG:

>>> u'foo' in base_dawg
True
>>> u'baz' in completion_dawg
False

It is possible to find all keys that starts with a given prefix in a CompletionDAWG:

>>> completion_dawg.keys(u'foo')
>>> [u'foo', u'foobar']

and to find all prefixes of a given key:

>>> base_dawg.prefixes(u'foobarz')
[u'foo', u'foobar']

Iterator versions are also available:

>>> for key in completion_dawg.iterkeys(u'foo'):
...     print(key)
foo
foobar
>>> for prefix in base_dawg.iterprefixes(u'foobarz'):
...     print(prefix)
foo
foobar

It is possible to find all keys similar to a given key (using a one-way char translation table):

>>> replaces = dawg.DAWG.compile_replaces({u'o': u'ö'})
>>> base_dawg.similar_keys(u'foo', replaces)
[u'foo', u'foö']
>>> base_dawg.similar_keys(u'foö', replaces)
[u'foö']
>>> base_dawg.similar_keys(u'bor', replaces)
[u'bör']

BytesDAWG

BytesDAWG is a CompletionDAWG subclass that can store binary data for each key.

BytesDAWG constructor accepts an iterable with (unicode_key, bytes_value) tuples:

>>> data = [(u'key1', b'value1'), (u'key2', b'value2'), (u'key1', b'value3')]
>>> bytes_dawg = dawg.BytesDAWG(data)

There can be duplicate keys; all unique values are stored in this case:

>>> bytes_dawg[u'key1']
[b'value1, b'value3']

For unique keys a list with a single value is returned for consistency:

>>> bytes_dawg[u'key2']
[b'value2']

KeyError is raised for missing keys; use get method if you need a default value instead:

>>> bytes_dawg.get(u'foo', None)
None

BytesDAWG support keys, items, iterkeys and iteritems methods (they all accept optional key prefix). There is also support for similar_keys, similar_items and similar_item_values methods.

RecordDAWG

RecordDAWG is a BytesDAWG subclass that automatically packs & unpacks the binary data from/to Python objects using struct module from the standard library.

First, you have to define a format of the data. Consult Python docs (http://docs.python.org/library/struct.html#format-strings) for the format string specification.

For example, let’s store 3 short unsigned numbers (in a Big-Endian byte order) as values:

>>> format = ">HHH"

RecordDAWG constructor accepts an iterable with (unicode_key, value_tuple). Let’s create such iterable using zip function:

>>> keys = [u'foo', u'bar', u'foobar', u'foo']
>>> values = [(1, 2, 3), (2, 1, 0), (3, 3, 3), (2, 1, 5)]
>>> data = zip(keys, values)
>>> record_dawg = RecordDAWG(format, data)

As with BytesDAWG, there can be several values for the same key:

>>> record_dawg['foo']
[(1, 2, 3), (2, 1, 5)]
>>> record_dawg['foobar']
[(3, 3, 3)]

BytesDAWG and RecordDAWG implementation details

BytesDAWG and RecordDAWG stores data at the end of the keys:

<utf8-encoded key><separator><base64-encoded data>

Data is encoded to base64 because dawgdic C++ library doesn’t allow zero bytes in keys (it uses null-terminated strings) and such keys are very likely in binary data.

In DAWG versions prior to 0.5 <separator> was chr(255) byte. It was chosen because keys are stored as UTF8-encoded strings and chr(255) is guaranteed not to appear in valid UTF8, so the end of text part of the key is not ambiguous.

But chr(255) was proven to be problematic: it changes the order of the keys. Keys are naturally returned in lexicographical order by DAWG. But if chr(255) appears at the end of each text part of a key then the visible order would change. Imagine 'foo' key with some payload and 'foobar' key with some payload. 'foo' key would be greater than 'foobar' key: values compared would be 'foo<sep>' and 'foobar<sep>' and ord(<sep>)==255 is greater than ord(<any other character>).

So now the default <separator> is chr(1). This is the lowest allowed character and so it preserves the alphabetical order.

It is not strictly correct to use chr(1) as a separator because chr(1) is a valid UTF8 character. But I think in practice this won’t be an issue: such control character is very unlikely in text keys, and binary keys are not supported anyway because dawgdic doesn’t support keys containing chr(0).

If you can’t guarantee chr(1) is not a part of keys, lexicographical order is not important to you or there is a need to read a BytesDAWG/RecordDAWG created by DAWG < 0.5 then pass payload_separator argument to the constructor:

>>> BytesDAWG(payload_separator=b'\xff').load('old.dawg')

The storage scheme has one more implication: values of BytesDAWG and RecordDAWG are also sorted lexicographically.

For RecordDAWG there is a gotcha: in order to have meaningful ordering of numeric values store them in big-endian format:

>>> data = [('foo', (3, 2, 256)), ('foo', (3, 2, 1)), ('foo', (3, 2, 3))]
>>> d = RecordDAWG("3H", data)
>>> d.items()
[(u'foo', (3, 2, 256)), (u'foo', (3, 2, 1)), (u'foo', (3, 2, 3))]

>>> d2 = RecordDAWG(">3H", data)
>>> d2.items()
[(u'foo', (3, 2, 1)), (u'foo', (3, 2, 3)), (u'foo', (3, 2, 256))]

IntDAWG

IntDAWG is a {unicode -> int} mapping. It is possible to use RecordDAWG for this, but IntDAWG is natively supported by dawgdic C++ library and so __getitem__ is much faster.

Unlike BytesDAWG and RecordDAWG, IntDAWG doesn’t support having several values for the same key.

IntDAWG constructor accepts an iterable with (unicode_key, integer_value) tuples:

>>> data = [ (u'foo', 1), (u'bar', 2) ]
>>> int_dawg = dawg.IntDAWG(data)

It is then possible to get a value from the IntDAWG:

>>> int_dawg[u'foo']
1

Persistence

All DAWGs support saving/loading and pickling/unpickling.

Write DAWG to a stream:

>>> with open('words.dawg', 'wb') as f:
...     d.write(f)

Save DAWG to a file:

>>> d.save('words.dawg')

Load DAWG from a file:

>>> d = dawg.DAWG()
>>> d.load('words.dawg')

Read DAWG from a stream:

>>> d = dawg.RecordDAWG(format_string)
>>> with open('words.record-dawg', 'rb') as f:
...     d.read(f)

DAWG objects are picklable:

>>> import pickle
>>> data = pickle.dumps(d)
>>> d2 = pickle.loads(data)

Benchmarks

For a list of 3000000 (3 million) Russian words memory consumption with different data structures (under Python 2.7):

  • dict(unicode words -> word lenghts): about 600M

  • list(unicode words) : about 300M

  • marisa_trie.RecordTrie : 11M

  • marisa_trie.Trie: 7M

  • dawg.DAWG: 2M

  • dawg.CompletionDAWG: 3M

  • dawg.IntDAWG: 2.7M

  • dawg.RecordDAWG: 4M

Benchmark results (100k unicode words, integer values (lenghts of the words), Python 3.2, macbook air i5 1.8 Ghz):

dict __getitem__ (hits)           8.427M ops/sec
DAWG __getitem__ (hits)           not supported
BytesDAWG __getitem__ (hits)      1.897M ops/sec
RecordDAWG __getitem__ (hits)     1.036M ops/sec
IntDAWG __getitem__ (hits)        3.990M ops/sec
dict get() (hits)                 4.409M ops/sec
DAWG get() (hits)                 not supported
BytesDAWG get() (hits)            1.566M ops/sec
RecordDAWG get() (hits)           0.910M ops/sec
IntDAWG get() (hits)              3.070M ops/sec
dict get() (misses)               4.998M ops/sec
DAWG get() (misses)               not supported
BytesDAWG get() (misses)          3.300M ops/sec
RecordDAWG get() (misses)         3.194M ops/sec
IntDAWG get() (misses)            3.752M ops/sec

dict __contains__ (hits)          8.270M ops/sec
DAWG __contains__ (hits)          4.419M ops/sec
BytesDAWG __contains__ (hits)     3.762M ops/sec
RecordDAWG __contains__ (hits)    3.743M ops/sec
IntDAWG __contains__ (hits)       4.374M ops/sec

dict __contains__ (misses)        6.596M ops/sec
DAWG __contains__ (misses)        5.530M ops/sec
BytesDAWG __contains__ (misses)   5.411M ops/sec
RecordDAWG __contains__ (misses)  5.418M ops/sec
IntDAWG __contains__ (misses)     5.563M ops/sec

dict items()                      56.471 ops/sec
DAWG items()                      not supported
BytesDAWG items()                 16.129 ops/sec
RecordDAWG items()                10.370 ops/sec
IntDAWG items()                   not supported

dict keys()                       207.690 ops/sec
DAWG keys()                       not supported
BytesDAWG keys()                  23.898 ops/sec
RecordDAWG keys()                 23.504 ops/sec
IntDAWG keys()                    not supported

DAWG.prefixes (hits)              0.242M ops/sec
DAWG.prefixes (mixed)             1.627M ops/sec
DAWG.prefixes (misses)            2.890M ops/sec
DAWG.iterprefixes (hits)          0.159M ops/sec
DAWG.iterprefixes (mixed)         0.457M ops/sec
DAWG.iterprefixes (misses)        0.523M ops/sec

RecordDAWG.keys(prefix="xxx"), avg_len(res)==415        5.826K ops/sec
RecordDAWG.keys(prefix="xxxxx"), avg_len(res)==17       128.452K ops/sec
RecordDAWG.keys(prefix="xxxxxxxx"), avg_len(res)==3     535.808K ops/sec
RecordDAWG.keys(prefix="xxxxx..xx"), avg_len(res)==1.4  832.864K ops/sec
RecordDAWG.keys(prefix="xxx"), NON_EXISTING             4038.162K ops/sec

Please take this benchmark results with a grain of salt; this is a very simple benchmark on a single data set.

Current limitations

  • IntDAWG is currently a subclass of DAWG and so it doesn’t support keys() and items() methods;

  • read() method reads the whole stream (DAWG must be the last or the only item in a stream if it is read with read() method) - pickling doesn’t have this limitation;

  • DAWGs loaded with read() and unpickled DAWGs uses 3x-4x memory compared to DAWGs loaded with load() method;

  • there are keys() and items() methods but no values() method;

  • iterator versions of methods are not always implemented;

  • BytesDAWG and RecordDAWG has a limitation: values larger than 8KB are unsupported;

  • the maximum number of DAWG units is limited: number of DAWG units (and thus transitions - but not elements) should be less than 2^29; this mean that it may be impossible to build an especially huge DAWG (you may split your data into several DAWGs or try marisa-trie in this case).

Contributions are welcome!

Contributing

Development happens at github and bitbucket:

The main issue tracker is at github: https://github.com/kmike/DAWG/issues

Feel free to submit ideas, bugs, pull requests (git or hg) or regular patches.

If you found a bug in a C++ part please report it to the original bug tracker.

How is source code organized

There are 4 folders in repository:

  • bench - benchmarks & benchmark data;

  • lib - original unmodified dawgdic C++ library and a customized version of libb64 library. They are bundled for easier distribution; if something is have to be fixed in these libraries consider fixing it in the original repositories;

  • src - wrapper code; src/dawg.pyx is a wrapper implementation; src/*.pxd files are Cython headers for corresponding C++ headers; src/*.cpp files are the pre-built extension code and shouldn’t be modified directly (they should be updated via update_cpp.sh script).

  • tests - the test suite.

Running tests and benchmarks

Make sure tox is installed and run

$ tox

from the source checkout. Tests should pass under python 2.6, 2.7, 3.2 and 3.3.

In order to run benchmarks, type

$ tox -c bench.ini

Authors & Contributors

This module is based on dawgdic C++ library by Susumu Yata & contributors.

base64 decoder is based on libb64 by Chris Venter.

License

Wrapper code is licensed under MIT License. Bundled dawgdic C++ library is licensed under BSD license. libb64 is Public Domain. 0.5.2 (2013-01-02) ——————

  • tests are included in source distribution;

  • benchmark results in README was nonrepresentative because of my broken (slow) Python 3.2 install;

  • installation is fixed under Python 3.x with LC_ALL=C (thanks Jakub Wilk).

0.5.1 (2012-10-11)

  • better error reporting while building DAWGs;

  • __contains__ is fixed for keys with zero bytes;

  • dawg.Error exception class;

  • building of BytesDAWG and RecordDAWG fails instead of producing incorrect results if some of the keys has unsupported characters.

0.5 (2012-10-08)

The storage scheme of BytesDAWG and RecordDAWG is changed in this release in order to provide the alphabetical ordering of items.

This is a backwards-incompatible release. In order to read BytesDAWG or RecordDAWG created with previous versions of DAWG use payload_separator constructor argument:

>>> BytesDAWG(payload_separator=b'\xff').load('old.dawg')

0.4.1 (2012-10-01)

  • Segfaults with empty DAWGs are fixed by updating dawgdic to latest svn.

0.4 (2012-09-26)

  • iterkeys, iteritems and iterprefixes methods (thanks Dan Blanchard).

0.3.2 (2012-09-24)

  • prefixes method for finding all prefixes of a given key.

0.3.1 (2012-09-20)

  • bundled dawgdic C++ library is updated to the latest version.

0.3 (2012-09-13)

  • similar_keys, similar_items and similar_item_values methods for more permissive lookups (they may be useful e.g. for umlaut handling);

  • load method returns self;

  • Python 3.3 support.

0.2 (2012-09-08)

Greatly improved memory usage for DAWGs loaded with load method.

There is currently a bug somewhere in a wrapper so DAWGs loaded with read() method or unpickled DAWGs uses 3x-4x memory compared to DAWGs loaded with load() method. load() is fixed in this release but other methods are not.

0.1 (2012-09-08)

Initial release.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

DAWG-0.5.2.tar.gz (219.0 kB view details)

Uploaded Source

File details

Details for the file DAWG-0.5.2.tar.gz.

File metadata

  • Download URL: DAWG-0.5.2.tar.gz
  • Upload date:
  • Size: 219.0 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No

File hashes

Hashes for DAWG-0.5.2.tar.gz
Algorithm Hash digest
SHA256 8f51e39b362c1233c97fb460e95c9ad668c13a898f1012b338acd7bbeb72aa9d
MD5 f698ae9c04d0d336f31f754d6a891526
BLAKE2b-256 a7b0274e3ef1a8c4c5a1497170fa42cd4121ef4ac135784ae12227e0eb41d18d

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page