pyahocorasick is a fast and memory-efficient library for exact or approximate multi-pattern string search. With the ``ahocorasick.Automaton`` class, you can find multiple key string occurrences at once in some input text. You can use it as a plain dict-like Trie or convert a Trie to an automaton for efficient Aho-Corasick search. You can also pickle large automatons to disk for easy reuse. Implemented in C and tested on Python 3.8+. Works on Linux, macOS and Windows. BSD-3-Clause license.
Project description
pyahocorasick is a fast and memory-efficient library for exact or approximate multi-pattern string search, meaning that you can find occurrences of multiple key strings at once in some input text. The string “index” can be built ahead of time and saved (as a pickle) to disk to reload and reuse later. The library provides an ahocorasick Python module that you can use as a plain dict-like Trie or convert a Trie to an automaton for efficient Aho-Corasick search.
pyahocorasick is implemented in C and tested on Python 3.8 and up. It works on 64-bit Linux, macOS and Windows.
The license is BSD-3-Clause. Some utilities, such as the tests and the pure Python automaton, are dedicated to the Public Domain.
Testimonials
Many thanks for this package. Wasn’t sure where to leave a thank you note but this package is absolutely fantastic in our application where we have a library of 100k+ CRISPR guides that we have to count in a stream of millions of DNA sequencing reads. This package does it faster than the previous C program we used for the purpose and helps us stick to just Python code in our pipeline.
Miika (AstraZeneca Functional Genomics Centre) https://github.com/WojciechMula/pyahocorasick/issues/145
Download and source code
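You can fetch pyahocorasick from GitHub at https://github.com/WojciechMula/pyahocorasick/ or from PyPI at https://pypi.org/project/pyahocorasick/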
The documentation is published at https://pyahocorasick.readthedocs.io/
Quick start
This module is written in C. You need a C compiler installed to compile native CPython extensions. To install:
pip install pyahocorasick
Then create an Automaton:
>>> import ahocorasick
>>> automaton = ahocorasick.Automaton()
You can use the Automaton class as a trie. Add some string keys and their associated value to this trie. Here we associate a tuple of (insertion index, original string) as a value to each key string we add to the trie:
>>> for idx, key in enumerate('he her hers she'.split()):
...     automaton.add_word(key, (idx, key))
Then check if some string exists in the trie:
>>> 'he' in automaton
True
>>> 'HER' in automaton
False
And play with the get() dict-like method:
>>> automaton.get('he')
(0, 'he')
>>> automaton.get('she')
(3, 'she')
>>> automaton.get('cat', 'not exists')
'not exists'
>>> automaton.get('dog')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
KeyError
Now convert the trie to an Aho-Corasick automaton to enable Aho-Corasick search:
>>> automaton.make_automaton()
Then search all occurrences of the keys (the needles) in an input string (our haystack).
Here we print the results and just check that they are correct. The Automaton.iter() method returns the results as two-tuples of the end index where a trie key was found in the input string and the associated value for this key. Here we had stored as values a tuple with the trie insertion order and the original string:
>>> haystack = '_hershe_'  # sample input, chosen to match the output below
>>> for end_index, (insert_order, original_value) in automaton.iter(haystack):
...     start_index = end_index - len(original_value) + 1
...     print((start_index, end_index, (insert_order, original_value)))
...     assert haystack[start_index:start_index + len(original_value)] == original_value
...
(1, 2, (0, 'he'))
(1, 3, (1, 'her'))
(1, 4, (2, 'hers'))
(4, 6, (3, 'she'))
(5, 6, (0, 'he'))
You can also create a potentially large automaton ahead of time and pickle it to reload later. Here we just pickle to a bytes string. You would typically pickle to a file instead:
>>> import pickle
>>> pickled = pickle.dumps(automaton)
>>> B = pickle.loads(pickled)
>>> B.get('he')
(0, 'he')
See also:
FAQ and Who is using pyahocorasick? https://github.com/WojciechMula/pyahocorasick/wiki/FAQ#who-is-using-pyahocorasick
Documentation
The full documentation including the API overview and reference is published on readthedocs.
Overview
With an Aho-Corasick automaton you can efficiently search all occurrences of multiple strings (the needles) in an input string (the haystack) making a single pass over the input string. With pyahocorasick you can build potentially large automatons and pickle them to reuse them over and over as an indexed structure for fast multi-pattern string matching.
One of the advantages of an Aho-Corasick automaton is that the typical worst-case and best-case runtimes are about the same and depend primarily on the size of the input string and secondarily on the number of matches returned. While this may not be the fastest string search algorithm in all cases, it can search for multiple strings at once and its runtime guarantees make it rather unique. Because pyahocorasick is based on a Trie, it stores redundant key prefixes only once, using memory efficiently.
A drawback is that it needs to be constructed and “finalized” ahead of time before you can search strings. In several applications where you search for several pre-defined “needles” in variable “haystacks” this is actually an advantage.
Aho-Corasick automatons are commonly used for fast multi-pattern matching in intrusion detection systems (such as snort), anti-viruses and many other applications that need fast matching against a pre-defined set of string keys.
Internally an Aho-Corasick automaton is typically based on a Trie with extra data for failure links and an implementation of the Aho-Corasick search procedure.
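To make the trie, the failure links and the search procedure concrete, here is a minimal pure Python sketch. It is an illustration only and not pyahocorasick's C implementation; the function names ``build`` and ``search`` and the node layout are assumptions made for this example.

```python
from collections import deque


def build(keys):
    """Build a trie with failure links from an iterable of non-empty strings."""
    # Each node holds its outgoing edges, its failure link and the keys ending here.
    nodes = [{'next': {}, 'fail': 0, 'out': []}]
    for key in keys:
        state = 0
        for char in key:
            if char not in nodes[state]['next']:
                nodes.append({'next': {}, 'fail': 0, 'out': []})
                nodes[state]['next'][char] = len(nodes) - 1
            state = nodes[state]['next'][char]
        nodes[state]['out'].append(key)

    # Breadth-first pass: a node's failure link points to the longest proper
    # suffix of its path that is also a path in the trie.
    queue = deque(nodes[0]['next'].values())
    while queue:
        state = queue.popleft()
        for char, child in nodes[state]['next'].items():
            fallback = nodes[state]['fail']
            while fallback and char not in nodes[fallback]['next']:
                fallback = nodes[fallback]['fail']
            nodes[child]['fail'] = nodes[fallback]['next'].get(char, 0)
            # Inherit the keys that end at the failure target.
            nodes[child]['out'] += nodes[nodes[child]['fail']]['out']
            queue.append(child)
    return nodes


def search(nodes, text):
    """Yield (end_index, key) for every occurrence, in one pass over text."""
    state = 0
    for index, char in enumerate(text):
        # Follow failure links until the character can be consumed.
        while state and char not in nodes[state]['next']:
            state = nodes[state]['fail']
        state = nodes[state]['next'].get(char, 0)
        for key in nodes[state]['out']:
            yield index, key


nodes = build(['he', 'her', 'hers', 'she'])
results = list(search(nodes, '_hershe_'))
print(results)
```

The search loop never moves backwards in the input, which is why the runtime depends mainly on the haystack length plus the number of matches reported.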
Behind the scenes the pyahocorasick Python library implements these two data structures: a Trie and an Aho-Corasick string matching automaton. Both are exposed through the Automaton class.
In addition to Trie-like and Aho-Corasick methods and data structures, pyahocorasick also implements dict-like methods: the pyahocorasick Automaton is a Trie, a dict-like structure indexed by string keys, each associated with a value object. You can use this to retrieve an associated value in a time proportional to the length of a string key.
pyahocorasick is available in two flavors:
a CPython C-based extension, compatible with Python 3 only. Use the older version 1.4.x for Python 2.7.x and 32-bit support.
a simpler pure Python module, compatible with Python 2 and 3. This is only available in the source repository (not on PyPI) under the etc/py/ directory and has a slightly different API.
Unicode and bytes
The type of strings accepted and returned by Automaton methods is either unicode or bytes, depending on a compile-time setting (the preprocessor definition of AHOCORASICK_UNICODE as set in setup.py).
The Automaton.unicode attribute tells you how the library was built. On Python 3, unicode is the default.
Build and install from PyPI
To install for common operating systems, use pip. Pre-built wheels are published on PyPI for common platforms; on other platforms pip builds from the source distribution:
pip install pyahocorasick
To build from sources you need to have a C compiler installed and configured, which should be standard on Linux and easy to get on macOS.
To build from sources, clone the git repository or download and extract the source archive.
Install pip (and its setuptools companion) and then run (in a virtualenv of course!):
pip install .
If compilation succeeds, the module is ready to use.
Support
Support is available through the GitHub issue tracker to report bugs or ask questions.
Contributing
You can submit contributions through GitHub pull requests.
There is a Makefile with a default target that builds and runs tests.
The tests can be run with pip install -e .[testing] && pytest -vvs
See also the .github directory for CI tests and workflows.
License
This library is licensed under the very liberal BSD-3-Clause license. Some portions of the code are dedicated to the public domain, such as the pure Python automaton and test code.
The full text of the license is available in the LICENSE file.
Other Aho-Corasick implementations for Python you can consider
While pyahocorasick tries to be the finest and fastest Aho-Corasick library for Python, you may consider these other libraries:
py_aho_corasick by Jan
Written in pure Python.
Poor performance.
ahocorapy by abusix
Written in pure Python.
Better performance than py-aho-corasick.
Using pypy, ahocorapy’s search performance is only slightly worse than pyahocorasick’s.
Performs additional suffix shortcutting (more setup overhead, less search overhead for suffix lookups).
Includes visualization tool for resulting automaton (using pygraphviz).
MIT-licensed, 100% test coverage, tested on all major python versions (+ pypy)
noaho by Jeff Donner
Written in C. Does not return overlapping matches.
Does not compile on Windows (July 2016).
No support for the pickle protocol.
acora by Stefan Behnel
Written in Cython.
Large automatons may take a long time to build (July 2016).
No support for a dict-like protocol to associate a value to a string key.
ahocorasick by Danny Yoo
Written in C.
Seems unmaintained (last update in 2005).
GPL-licensed.