Skip to main content

Fast, non-overlapping simultaneous multiple keyword search

Project description

Python Aho-Corasick implementation

noahong is a Python implementation of the Aho-Corasick algorithm for string matching, based on a fork of the NoAho C++ implementation.

noahong supports macOS, Linux, Windows on Python 3.6+.

API

The first thing to do is to instantiate a NoAho object and add some keys to it (optionally with different payloads for each).

from noahong import NoAho

trie = NoAho()

# fill with .add()
trie.add("foo", "id_foo")
trie.add("foobar", "id_foobar")

# or fill with __setitem__
trie["bar"] = "id_bar"

Once you have added the different keys and their payloads, the NoAho object needs to be compiled:

trie.compile()

Once it is compiled, keys can no longer be added to the NoAho object.

noahong then exposes four functions to find matching substrings in text:

find_short

trie.find_short(text) finds the first substring of text that is matched by a key added to the trie.

It returns a tuple (start, stop, payload) such that:

  • payload is the object inserted with trie.add()
  • start and stop are indices of the match in the text: text[start:stop] == key

For example, using the above trie:

trie.find_short("something foo")
# returns (10, 13, 'id_foo')
# "something foo"[10:13] == "foo"

and returns the first match even though a longer match may start at the same position:

trie.find_short("something foobar")
# returns (10, 13, 'id_foo')

find_long

trie.find_long(text) finds the first longest substring of text that is matched by a key added to the trie.

For example, using the above trie:

trie.find_long("something foobar")
# returns (10, 16, 'id_foobar')

findall_*

Both find_short and find_long have a findall_short and findall_long counterparts that allow you to iterate on all non-overlapping matches found in the text:

for x in trie.findall_long("something foo bar foobar"): 
    print(x)       

# prints                          
# (10, 13, 'id_foo')
# (14, 17, 'id_bar')
# (18, 24, 'id_foobar')

Because matches are non-overlapping:

list(trie.findall_short("foobar")) == [(0, 3, "id_foo"), (3, 6, "id_bar")]

whereas:

list(trie.findall_long("foobar")) == [(0, 6, "id_foobar")]

Payloads

NoAho tries accept any Python object as a payload:

trie = NoAho()
trie.add("foo", 0)
trie.add("bar", CustomClass())
trie.add("baz", lambda x: x)

The same payload can be associated with different keys.

Notice that the non-pickable lambda x: x payload works because there is no serialization involved here.

Length and inclusion

NoAho trie objects also expose the number of keys with len:

len(trie)

And, when they are compiled, they can be used to test for key inclusion:

"foo" in trie

The number of nodes in the underlying Trie can be recovered with

trie.nodes_count()

Mapped NoAho

In order to save memory, noahong exposes a Mapped matching object which can be written to disk and later loaded directly to memory to perform matches with a smaller memory footprint.

The Mapped object exposes different finding methods and only supports integer payloads.

Construct it by adding keys and payloads to a NoAho object:

from noahong import NoAho, Mapped

trie = NoAho()
trie.add("baz", "id_baz")

trie.compile()
trie.write("./test.matcher")

mapped_trie = Mapped("./test.matcher")

The mapped_trie object exposes a findall_anchored function that iterates over anchored matches, matches that can be found within boundaries set with a special "anchor" character \u001F.

This is useful to restrict matches to be found only between, say, spaces:

trie = NoAho()
trie.add("foo", 0)
trie.add("bar", 1)

trie.compile()
trie.write("./test.matcher")

mapped_trie = Mapped("./test.matcher")
mapped_trie.findall_anchored("\u001Fbar\u001F\u001Ffoo\u001F\u001Ffoobar\u001F")

# returns [(1, 4, 1), (6, 9, 0), (11, 14, 0)]

Notice how "bar" is not found in the final "foobar" because it is not present between "anchor" characters.

It is possible to place anchor characters in the keys:

trie = NoAho()
trie.add("foo\u001F\u001Fbar", 0)
trie.add("foo", 1)
trie.add("bar", 2)

trie.compile()
trie.write("./test.matcher")

mapped_trie = Mapped("./test.matcher")
mapped_trie.findall_anchored("\u001Ffoo\u001F\u001Fbar\u001F")
# returns [(1, 9, 0)]

In this case, the longest key found between anchors is returned.

Installation

Devpi

noahong is available on devpi:

pip install noahong

Python 3

noahong can be installed manually:

python3 setup.py install

Legacy README

You can find more information on the package and C++ implementation by reading the legacy README found here.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

noahong-0.11.0.tar.gz (118.3 kB view details)

Uploaded Source

File details

Details for the file noahong-0.11.0.tar.gz.

File metadata

  • Download URL: noahong-0.11.0.tar.gz
  • Upload date:
  • Size: 118.3 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.7.1 importlib_metadata/4.10.1 pkginfo/1.8.2 requests/2.27.1 requests-toolbelt/0.9.1 tqdm/4.62.3 CPython/3.10.1

File hashes

Hashes for noahong-0.11.0.tar.gz
Algorithm Hash digest
SHA256 d637557d2704c8c6180d32ed873db65403e139f79e47a5493c8539b0fc479dc3
MD5 ea49a42e1da3451ea7f4674ed9020e4c
BLAKE2b-256 65dc7d895e01bdabcfe6d912d24b9bb14b550c2613b1e12fd68da2dddbe81227

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page