
library for fast approximate string matching using Jaro and Jaro-Winkler similarity


JaroWinkler


JaroWinkler is a library for calculating the Jaro and Jaro-Winkler similarity. It is easy to use, is designed to integrate seamlessly with RapidFuzz, and, according to the benchmarks below, is significantly faster than jellyfish and python-Levenshtein.

⚡ Quickstart

>>> from jarowinkler import jaro_similarity, jarowinkler_similarity

>>> jaro_similarity("Johnathan", "Jonathan")
0.8796296296296297

>>> jarowinkler_similarity("Johnathan", "Jonathan")
0.9037037037037037

🚀 Benchmarks

The implementation is based on a novel approach that calculates the Jaro-Winkler similarity using bit-parallelism. This is significantly faster than the original approach used in other libraries. The following benchmark compares its performance with jellyfish and python-Levenshtein.

[Figure: JaroWinkler benchmark]
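For comparison, the classic quadratic computation that the bit-parallel approach replaces can be sketched in pure Python. This is only a minimal reference sketch, not JaroWinkler's actual code: the names `reference_jaro` and `reference_jaro_winkler` are hypothetical, and it assumes the common conventions of a match window of `max(len) // 2 - 1`, a prefix weight of 0.1, and a prefix capped at four elements. Some implementations only apply the prefix bonus above a similarity threshold; that detail is omitted here.

```python
def reference_jaro(s1, s2):
    """Textbook Jaro similarity for two sequences (quadratic time)."""
    len1, len2 = len(s1), len(s2)
    if len1 == 0 and len2 == 0:
        return 1.0  # assumption: two empty sequences count as identical
    if len1 == 0 or len2 == 0:
        return 0.0

    # Elements only count as a match if they are at most `window` positions apart.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, item in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == item:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0

    # Matched elements that line up in a different order are transpositions.
    seq1 = [s1[i] for i in range(len1) if matched1[i]]
    seq2 = [s2[j] for j in range(len2) if matched2[j]]
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) // 2

    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3


def reference_jaro_winkler(s1, s2, prefix_weight=0.1):
    """Jaro similarity plus a bonus for a shared prefix of up to 4 elements."""
    jaro = reference_jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return jaro + prefix * prefix_weight * (1.0 - jaro)


print(reference_jaro("Johnathan", "Jonathan"))          # about 0.8796
print(reference_jaro_winkler("Johnathan", "Jonathan"))  # about 0.9037
```

The nested loop is what makes this approach slow on long inputs: every element of the first sequence scans a window of the second.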

⚙️ Installation

You can install this library from PyPI with pip:

pip install jarowinkler

JaroWinkler provides binary wheels for all common platforms.

Source builds

A source build (for example, from an sdist package) only requires a C++14-compatible compiler. You can also install directly from GitHub:

pip install git+https://github.com/maxbachmann/JaroWinkler.git@main

📖 Usage

All algorithms in JaroWinkler work not only with strings but also with arbitrary sequences of hashable objects:

from jarowinkler import jarowinkler_similarity


jarowinkler_similarity("this is an example".split(), ["this", "is", "a", "example"])
# 0.8666666666666667

As long as two objects have the same hash, they are treated as equal. You can provide a __hash__ method for instances of your own classes:

class MyObject:
    def __init__(self, hash_value):
        # Avoid shadowing the built-in hash() with the parameter name.
        self.hash_value = hash_value

    def __hash__(self):
        return self.hash_value

jarowinkler_similarity([MyObject(1), MyObject(2)], [MyObject(1), MyObject(2), MyObject(3)])
# 0.9111111111111111
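The hash-based matching described above can be used to control what counts as "the same" element. The sketch below is illustrative only: `CaseInsensitiveToken` is a hypothetical class, not part of JaroWinkler, and it also defines `__eq__` so that it stays consistent with `__hash__` under Python's data model.

```python
class CaseInsensitiveToken:
    """Hypothetical wrapper whose hash ignores letter case."""

    def __init__(self, text):
        self.text = text

    def __hash__(self):
        # Tokens that differ only by case share a hash.
        return hash(self.text.lower())

    def __eq__(self, other):
        return (isinstance(other, CaseInsensitiveToken)
                and self.text.lower() == other.text.lower())


left = [CaseInsensitiveToken(word) for word in "This Is An Example".split()]
right = [CaseInsensitiveToken(word) for word in ["this", "is", "a", "example"]]

# True wherever the paired tokens differ only by case.
same_hash = [hash(a) == hash(b) for a, b in zip(left, right)]
```

Lists of such objects can then be passed to jarowinkler_similarity in place of strings.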

All algorithms accept a score_cutoff parameter. Similarities below this threshold are returned as 0, which makes it easy to filter out bad matches. Internally, the cutoff also allows JaroWinkler to select faster implementations in some places:

jaro_similarity("Johnathan", "Jonathan", score_cutoff=0.9)
# 0.0

jaro_similarity("Johnathan", "Jonathan", score_cutoff=0.85)
# 0.8796296296296297
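The observable effect of score_cutoff in the two calls above can be modelled with a few lines of plain Python. This is a sketch of the semantics only, under the assumption that a score exactly equal to the cutoff is kept; `apply_score_cutoff` is a hypothetical helper, and the library applies the cutoff internally rather than through a wrapper like this.

```python
def apply_score_cutoff(score, score_cutoff=None):
    """Return `score`, or 0.0 when it falls below `score_cutoff`."""
    if score_cutoff is not None and score < score_cutoff:
        return 0.0
    return score


# Jaro similarity of "Johnathan" and "Jonathan", taken from the example above.
jaro_score = 0.8796296296296297

rejected = apply_score_cutoff(jaro_score, score_cutoff=0.9)   # 0.0
accepted = apply_score_cutoff(jaro_score, score_cutoff=0.85)  # 0.8796296296296297
```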

JaroWinkler can be used with RapidFuzz, which provides multiple methods for computing string metrics on collections of inputs. JaroWinkler implements the RapidFuzz C-API, which allows RapidFuzz to call its functions without the usual overhead of Python function calls, making this even faster.

from jarowinkler import jarowinkler_similarity
from rapidfuzz import process

process.cdist(["Johnathan", "Jonathan"], ["Johnathan", "Jonathan"], scorer=jarowinkler_similarity)
# array([[1.       , 0.9037037],
#        [0.9037037, 1.       ]], dtype=float32)
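Conceptually, the matrix above holds one score per (query, choice) pair, with one row per query. A minimal pure-Python sketch of that idea follows; `pairwise_scores` and `exact_match` are hypothetical names used for illustration, and the toy scorer stands in for jarowinkler_similarity so the snippet runs without any dependencies.

```python
def pairwise_scores(queries, choices, scorer):
    """Score every query against every choice, one row per query."""
    return [[scorer(query, choice) for choice in choices] for query in queries]


def exact_match(a, b):
    # Toy stand-in scorer: 1.0 for identical inputs, 0.0 otherwise.
    return 1.0 if a == b else 0.0


names = ["Johnathan", "Jonathan"]
matrix = pairwise_scores(names, names, scorer=exact_match)
# matrix == [[1.0, 0.0], [0.0, 1.0]]
```

Each call in this loop goes through the Python interpreter, which is the overhead the C-API integration is described as avoiding.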

👍 Contributing

PRs are welcome!

  • Found a bug? Report it as an issue or, even better, fix it!
  • Can you make something faster? Great! Just avoid external dependencies and remember that existing functionality should still work.
  • Have another idea that you think is good? Do it! Just make sure that CI passes and that everything in the README still applies (interface, features, and so on).
  • Have no time to code? Tell your friends and subscribers about JaroWinkler. More users, more contributions, more amazing features.

Thank you ❤️

⚠️ License

Copyright 2021 - present maxbachmann. JaroWinkler is free and open-source software licensed under the MIT License.

