Fast, non-overlapping simultaneous multiple keyword search
Project description
Python Aho-Corasick implementation
noahong
is a Python implementation of the Aho-Corasick algorithm for string matching, based on a fork of the NoAho C++ implementation.
noahong
supports macOS, Linux, Windows on Python 3.6+.
API
The first thing to do is to instantiate a NoAho
object and add some keys to it (optionally with different payloads for each).
from noahong import NoAho
trie = NoAho()
# fill with .add()
trie.add("foo", "id_foo")
trie.add("foobar", "id_foobar")
# or fill with __setitem__
trie["bar"] = "id_bar"
Once you have added the different keys and their payloads, the NoAho
object needs to be compiled:
trie.compile()
Once it is compiled, keys can no longer be added to the NoAho
object.
noahong
then exposes four functions to find matching substrings in text:
find_short
trie.find_short(text)
finds the first substring of text
that is matched by a key added to the trie
.
It returns a tuple (start, stop, payload)
such that:
payload
is the object inserted withtrie.add()
start
andstop
are indices of the match in thetext
:text[start:stop] == key
For example, using the above trie
:
trie.find_short("something foo")
# returns (10, 13, 'id_foo')
# "something foo"[10:13] == "foo"
and returns the first match even though a longer match may start at the same position:
trie.find_short("something foobar")
# returns (10, 13, 'id_foo')
find_long
trie.find_long(text)
finds the first longest substring of text
that is matched by a key added to the trie
.
For example, using the above trie
:
trie.find_long("something foobar")
# returns (10, 16, 'id_foobar')
findall_*
Both find_short
and find_long
have a findall_short
and findall_long
counterparts that allow you to iterate on all non-overlapping matches found
in the text:
for x in trie.findall_long("something foo bar foobar"):
print(x)
# prints
# (10, 13, 'id_foo')
# (14, 17, 'id_bar')
# (18, 24, 'id_foobar')
Because matches are non-overlapping:
list(trie.findall_short("foobar")) == [(0, 3, "id_foo"), (3, 6, "id_bar")]
whereas:
list(trie.findall_long("foobar")) == [(0, 6, "id_foobar")]
Payloads
NoAho
tries accept any Python object as a payload:
trie = NoAho()
trie.add("foo", 0)
trie.add("bar", CustomClass())
trie.add("baz", lambda x: x)
The same payload can be associated with different keys.
Notice that the non-pickable lambda x: x
payload works because
there is no serialization involved here.
Length and inclusion
NoAho
trie objects also expose the number of keys with len
:
len(trie)
And, when they are compiled, they can be used to test for key inclusion:
"foo" in trie
The number of nodes in the underlying Trie can be recovered with
trie.nodes_count()
Mapped NoAho
In order to save memory, noahong
exposes a Mapped
matching object which can be written to disk and later loaded directly to memory to perform matches with a smaller memory footprint.
The Mapped
object exposes different finding methods and only supports integer payloads.
Construct it by adding keys and payloads to a NoAho
object:
from noahong import NoAho, Mapped
trie = NoAho()
trie.add("baz", "id_baz")
trie.compile()
trie.write("./test.matcher")
mapped_trie = Mapped("./test.matcher")
The mapped_trie
object exposes a findall_anchored
function that iterates over anchored matches, matches that can be found within boundaries set with a special "anchor" character \u001F
.
This is useful to restrict matches to be found only between, say, spaces:
trie = NoAho()
trie.add("foo", 0)
trie.add("bar", 1)
trie.compile()
trie.write("./test.matcher")
mapped_trie = Mapped("./test.matcher")
mapped_trie.findall_anchored("\u001Fbar\u001F\u001Ffoo\u001F\u001Ffoobar\u001F")
# returns [(1, 4, 1), (6, 9, 0), (11, 14, 0)]
Notice how "bar"
is not found in the final "foobar"
because it is not present between "anchor" characters.
It is possible to place anchor characters in the keys:
trie = NoAho()
trie.add("foo\u001F\u001Fbar", 0)
trie.add("foo", 1)
trie.add("bar", 2)
trie.compile()
trie.write("./test.matcher")
mapped_trie = Mapped("./test.matcher")
mapped_trie.findall_anchored("\u001Ffoo\u001F\u001Fbar\u001F")
# returns [(1, 9, 0)]
In this case, the longest key found between anchors is returned.
Installation
Devpi
noahong
is available on devpi:
pip install noahong
Python 3
noahong
can be installed manually:
python3 setup.py install
Legacy README
You can find more information on the package and C++ implementation by reading the legacy README found here.
Project details
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
File details
Details for the file noahong-0.11.2.tar.gz
.
File metadata
- Download URL: noahong-0.11.2.tar.gz
- Upload date:
- Size: 118.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.1.1 CPython/3.11.2
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | d5960be6f24deb88ecffc3fd7679c7e995befcbfc715a22bacd3dd5841aedf8a |
|
MD5 | 52cc8e7766c00886b1b21efcd9e97147 |
|
BLAKE2b-256 | 21226e499b3e666a702c372cf22f0a138b4b714e6ec8fe8ccc749fe23e51f40b |