pyahocorasick is a fast and memory efficient library for fast exact or approximate multi-pattern string search. It is implemented in C and tested on Python 3.4+.
Project description
Introduction
pyahocorasick is a Python module implements two kinds of data structures: trie and Aho-Corasick string matching automaton.
Trie is a dictionary indexed by strings, which allow to retrieve associated items in a time proportional to string length. Aho-Corasick automaton allows to find all occurrences of strings from given set in a single run over text.
(BTW in order to use Aho-Corasick automaton, a trie have to be created; this is the reason why these two distinct entities exist in a single module.)
There are two versions:
C extension, compatible only with Python3;
pure python module, compatible with Python 2 and 3.
Python module API is similar, but isn’t exactly the same as C extension.
License
Library is licensed under very liberal two-clauses BSD license. Some portions has been realased into public domain.
Full text of license is available in LICENSE file.
See also
Module ahocorasick by Danny Yoo — seems unmaintained (last update in 2005) and is licensed under GPL.
Article about different trie representations — this is an effect of experiments I made while working on this module.
Installation
Just run:
python setup.py install
If compilation succed, module is ready to use.
API
Module
Module ahocorasick contains several constants and class Automaton.
Unicode and bytes
Type of strings accepted and returned by Automaton methods can be either unicode or bytes, depending on compile time settings (preprocessor definition AHOCORASICK_UNICODE). Value of module member unicode informs about chosen type.
Constants
unicode — see Unicode and bytes
STORE_ANY, STORE_INTS, STORE_LENGTH — see Constructor
EMPTY, TRIE, AHOCORASICK — see Members
MATCH_EXACT_LENGTH, MATCH_AT_MOST_PREFIX, MATCH_AT_LEAST_PREFIX — see description of method keys
Automaton class
Automaton class is pickable (implements __reduce__()).
Members
- kind [readonly]
One of values:
- EMPTY
There are no words saved in automaton.
- TRIE
There are some words, but methods related to Aho-Corasick algorithm (find_all, iter) won’t work.
- AHOCORASICK
Aho-Corasick automaton has been constructed, full functionality is available for user.
Kind is maintained internally by Automaton object. Some methods are not available when automaton kind is EMPTY or isn’t an AHOCORASICK. When called then exception is raised, however testing this property could be better (faster, more elegant).
- store [readonly]
Type of values stored in trie. By default STORE_ANY is used, thus any python object could be used. When STORE_INTS or STORE_LENGTH is used then values are 32-bit integers and do not occupy additional memory. See add_word description for details.
Constructor
Constructor accepts just one argument, a type of values, one of constants:
- STORE_ANY
Any Python object (default).
- STORE_LENGTH
Length of string.
- STORE_INTS
32-bit integers.
Dictionary methods
- get(word[, default])
Returns value associated with word. Raises KeyError or returns default value if word isn’t present in dictionary.
- keys([prefix, [wildcard, [how]]]) => yield strings
Returns iterator that iterate through words.
If prefix (a string) is given, then only words sharing this prefix are yielded.
If wildcard (single character) is given, then prefix is treated as a simple pattern with selected wildcard. Optional parameter how controls which strings are matched:
- MATCH_EXACT_LENGTH [default]
Only strings with the same length as a pattern’s length are yielded. In other words, literally match a pattern.
- MATCH_AT_LEAST_PREFIX
Strings that have length greater or equal to a pattern’s length are yielded.
- MATCH_AT_MOST_PREFIX
Strings that have length less or equal to a pattern’s length are yielded.
See Example 2 and the section below.
- values([prefix, [wildcard, [how]]]) => yield object
Return iterator that iterate through values associated with words. Words are matched as in keys method.
- items([prefix, [wildcard, [how]]]) => yield tuple (string, object)
Return iterator that iterate through words and associated values. Words are matched as in keys method.
- iter() protocol
Equivalent to obj.keys()
- len() protocol
Returns number of distinct words.
Wildcards
Methods keys, values and items accept variant with wildcard. A wildcard character is equivalent to a question mark used in glob patterns (?) or a dot from regular expressions (.). In case of these function a programmer can pick any character.
It is not possible to escape a wildcard and thus match it exactly — simply select another char, not present in the pattern. For example:
automaton.keys("hi?", "?") # would match "him", "his" automaton.keys("XX?", "X") # would match "me?", "he?" or "it?"
Trie
- add_word(word, [value]) => bool
Add new word, a key, to dictionary and associate with value. Returns True if word didn’t exists earlier in dictionary.
If store == STORE_LENGTH then value is not allowed — len(word) is saved.
If store == STORE_INTS then value is optional. If present, then have to be an integer, otherwise defaults to len(automaton).
If store == STORE_ANY then value is required and could be any object.
This method invalidates all iterators only if new word was added (i.e. method returned True).
- clear() => None
Removes all words from dictionary.
This method invalidates all iterators.
- exists(word) => bool or word in ...
Returns if word is present in dictionary.
- match(word) => bool
Returns if there is a prefix (or word) equal to word. For example if word “example” is present in dictionary, then all match("e"), match("ex"), …, match("exampl"), match("example") are True. But exists() is True just for the last word.
- longest_prefix(word) => integer
Returns length of the longest prefix of word that exists in a dictionary.
Aho-Corasick
- make_automaton()
Creates Aho-Corasick automaton based on trie. This doesn’t require additional memory. After successful creation kind become AHOCORASICK.
This method invalidates all iterators.
- find_all(string, callback, [start, [end]])
Perform Aho-Corsick on string; start/end can be used to reduce string range. Callback is called with two arguments:
index of end of matched string
value associated with that string
(Method called with start/end does similar job as find_all(string[start:end], callback), except index values).
- iter(string, [start, [end]])
Returns iterator (object of class AutomatonSearchIter) that does the same thing as find_all, yielding tuples instead of calling a user function.
find_all method could be expressed as:
def find_all(self, string, callback): for index, value in self.iter(string): callback(index, value)
Other
- dump() => (list of nodes, list of edges, list of fail links)
Returns 3 lists describing a graph:
nodes: each item is a pair (node id, end of word marker)
edges: each item is a triple (node id, label char, child node id)
fail: each item is a pair (source node id, node if connected by fail node)
ID is a unique number and a label is a single byte.
Module package contains also program dump2dot.py that shows how to convert dump results to input file for graphviz tools.
- get_stats() => dict
Returns dictionary containing some statistics about underlaying trie:
nodes_count — total number of nodes
words_count — same as len(automaton)
longest_word — length of the longest word
links_count — number of edges
sizeof_node — size of single node in bytes
total_size — total size of trie in bytes (about nodes_count * size_of node + links_count * size of pointer). The real size occupied by structure could be larger, because of internal memory fragmentation occurred in memory manager.
AutomatonSearchIter class
Class isn’t available directly, object of this class is returned by iter method of Automaton. Iterator has additional method.
- set(string, [reset]) => None
Sets new string to process. When reset is False (default), then processing is continued, i.e internal state of automaton and index aren’t touched. This allow to process larger strings in chunks, for example:
it = automaton.iter(b"") while True: buffer = receive(server_address, 4096) if not buffer: break it.set(buffer) for index, value in it: print(index, '=>', value)
When reset is True then processing is restarted. For example this code:
for string in set: for index, value in automaton.iter(string) print(index, '=>', value)
Does the same job as:
it = automaton.iter(b"") for string in set: it.set(it, True) for index, value in it: print(index, '=>', value)
Example
>>> import ahocorasick >>> A = ahocorasick.Automaton() # add some words to trie >>> for index, word in enumerate("he her hers she".split()): ... A.add_word(word, (index, word)) # test is word exists in set >>> "he" in A True >>> "HER" in A False >>> A.get("he") (0, 'he') >>> A.get("she") (3, 'she') >>> A.get("cat", "<not exists>") '<not exists>' >>> A.get("dog") Traceback (most recent call last): File "<stdin>", line 1, in <module> KeyError >>> # convert trie to Aho-Corasick automaton A.make_automaton() # then find all occurrences in string for item in A.iter("_hershe_"): ... print(item) ... (2, (0, 'he')) (3, (1, 'her')) (4, (2, 'hers')) (6, (3, 'she')) (6, (0, 'he'))
Example 2
Demonstration of keys behaviour.
>>> import ahocorasick >>> A = ahocorasick.Automaton() # add some words to trie >>> for index, word in enumerate("cat catastropha rat rate bat".split()): ... A.add_word(word, (index, word)) # prefix >>> list(A.keys("cat")) ["cat", "catastropha"] # pattern >>> list(A.keys("?at", "?", ahocorasick.MATCH_EXACT_LENGTH)) ["bat", "cat", "rat"] >>> list(A.keys("?at?", "?", ahocorasick.MATCH_AT_MOST_PREFIX)) ["bat", "cat", "rat", "rate"] >>> list(A.keys("?at?", "?", ahocorasick.MATCH_AT_LEAST_PREFIX)) ["rate"]
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.