Skip to main content

library for fast approximate string matching using Jaro and Jaro-Winkler similarity

Project description

JaroWinkler

Continous Integration PyPI package version Python versions
GitHub license

JaroWinkler is a library to calculate the Jaro and Jaro-Winkler similarity. It is easy to use, is far more performant than all alternatives and is designed to integrate seemingless with RapidFuzz.

:zap: Quickstart

>>> from jarowinkler import *

>>> jaro_similarity("Johnathan", "Jonathan")
0.8796296296296297

>>> jarowinkler_similarity("Johnathan", "Jonathan")
0.9037037037037037

🚀 Benchmarks

The implementation is based on a novel approach to calculate the Jaro-Winkler similarity using bitparallelism. This is significantly faster than the original approach used in other libraries. The following benchmark shows the performance difference to jellyfish and python-Levenshtein.

Benchmark JaroWinkler

⚙️ Installation

You can install this library from PyPI with pip:

pip install jarowinkler

JaroWinkler provides binary wheels for all common platforms.

Source builds

For a source build (for example from a SDist packaged) you only require a C++14 compatible compiler. You can install directly from GitHub if you would like.

pip install git+https://github.com/maxbachmann/JaroWinkler.git@main

📖 Usage

Any algorithms in JaroWinkler can not only be used with strings, but with any arbitary sequences of hashable objects:

from jarowinkler import jarowinkler_similarity


jarowinkler_similarity("this is an example".split(), ["this", "is", "a", "example"])
# 0.8666666666666667

So as long as two objects have the same hash they are treated as similar. You can provide a __hash__ method for your own object instances.

class MyObject:
    def __init__(self, hash):
        self.hash = hash

    def __hash__(self):
        return self.hash

jarowinkler_similarity([MyObject(1), MyObject(2)], [MyObject(1), MyObject(2), MyObject(3)])
# 0.9111111111111111

All algorithms provide a score_cutoff parameter. This parameter can be used to filter out bad matches. Internally this allows JaroWinkler to select faster implementations in some places:

jaro_similarity("Johnathan", "Jonathan", score_cutoff=0.9)
# 0.0

jaro_similarity("Johnathan", "Jonathan", score_cutoff=0.85)
# 0.8796296296296297

JaroWinkler can be used with RapidFuzz, which provides multiple methods to compute string metrics on collections of inputs. JaroWinkler implements the RapidFuzz C-API which allows RapidFuzz to call the functions without any of the usual overhead of python, which makes this even faster.

from rapidfuzz import process

process.cdist(["Johnathan", "Jonathan"], ["Johnathan", "Jonathan"], scorer=jarowinkler_similarity)
array([[1.       , 0.9037037],
       [0.9037037, 1.       ]], dtype=float32)

👍 Contributing

PRs are welcome!

  • Found a bug? Report it in form of an issue or even better fix it!
  • Can make something faster? Great! Just avoid external dependencies and remember that existing functionality should still work.
  • Something else that do you think is good? Do it! Just make sure that CI passes and everything from the README is still applicable (interface, features, and so on).
  • Have no time to code? Tell your friends and subscribers about JaroWinkler. More users, more contributions, more amazing features.

Thank you :heart:

⚠️ License

Copyright 2021 - present maxbachmann. JaroWinkler is free and open-source software licensed under the MIT License.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

jarowinkler-2.0.0.tar.gz (6.4 kB view details)

Uploaded Source

Built Distribution

jarowinkler-2.0.0-py3-none-any.whl (5.6 kB view details)

Uploaded Python 3

File details

Details for the file jarowinkler-2.0.0.tar.gz.

File metadata

  • Download URL: jarowinkler-2.0.0.tar.gz
  • Upload date:
  • Size: 6.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.11.6

File hashes

Hashes for jarowinkler-2.0.0.tar.gz
Algorithm Hash digest
SHA256 7cfb5e5caef5d22bae44e9cadcd1ff7313970e70303feff31ff78e8b69c92648
MD5 65a1869fef152bb84738ae7ae4b5e1c7
BLAKE2b-256 b873076b3951578f89dbd2b8972421fed922613b63d43dfb794286c576da9940

See more details on using hashes here.

File details

Details for the file jarowinkler-2.0.0-py3-none-any.whl.

File metadata

  • Download URL: jarowinkler-2.0.0-py3-none-any.whl
  • Upload date:
  • Size: 5.6 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.11.6

File hashes

Hashes for jarowinkler-2.0.0-py3-none-any.whl
Algorithm Hash digest
SHA256 94c113030e73cf1f052c69ec100b1bc6fa3b90cd834a61df4c13d627f7678d20
MD5 27cd72bdba85e5dfc171c81c0a77dc8d
BLAKE2b-256 9c50a168042f7f0ee5f2a656a97099aa91e0fabbecd18c8a9a0cc14b3c0d0788

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page