textdistance

Compute distance between the two texts.

TextDistance
============

**TextDistance** -- a Python library for comparing the distance between
two or more sequences using many algorithms.

Features:

- 30+ algorithms
- Pure Python implementation
- Simple usage
- Comparison of more than two sequences
- Some algorithms have more than one implementation in one class.
- Optional NumPy usage for maximum speed.

Algorithms
----------

Edit based
~~~~~~~~~~

.. list-table::
   :header-rows: 1

   * - Algorithm
     - Class
     - Functions
   * - `Hamming <https://en.wikipedia.org/wiki/Hamming_distance>`__
     - ``Hamming``
     - ``hamming``
   * - `MLIPNS <http://www.sial.iias.spb.su/files/386-386-1-PB.pdf>`__
     - ``Mlipns``
     - ``mlipns``
   * - `Levenshtein <https://en.wikipedia.org/wiki/Levenshtein_distance>`__
     - ``Levenshtein``
     - ``levenshtein``
   * - `Damerau-Levenshtein <https://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_distance>`__
     - ``DamerauLevenshtein``
     - ``damerau_levenshtein``
   * - `Jaro-Winkler <https://en.wikipedia.org/wiki/Jaro%E2%80%93Winkler_distance>`__
     - ``JaroWinkler``
     - ``jaro_winkler``, ``jaro``
   * - `Strcmp95 <http://cpansearch.perl.org/src/SCW/Text-JaroWinkler-0.1/strcmp95.c>`__
     - ``StrCmp95``
     - ``strcmp95``
   * - `Needleman-Wunsch <https://en.wikipedia.org/wiki/Needleman%E2%80%93Wunsch_algorithm>`__
     - ``NeedlemanWunsch``
     - ``needleman_wunsch``
   * - `Gotoh <https://www.cs.umd.edu/class/spring2003/cmsc838t/papers/gotoh1982.pdf>`__
     - ``Gotoh``
     - ``gotoh``
   * - `Smith-Waterman <https://en.wikipedia.org/wiki/Smith%E2%80%93Waterman_algorithm>`__
     - ``SmithWaterman``
     - ``smith_waterman``

Token based
~~~~~~~~~~~

.. list-table::
   :header-rows: 1

   * - Algorithm
     - Class
     - Functions
   * - `Jaccard index <https://en.wikipedia.org/wiki/Jaccard_index>`__
     - ``Jaccard``
     - ``jaccard``
   * - `Sørensen–Dice coefficient <https://en.wikipedia.org/wiki/S%C3%B8rensen%E2%80%93Dice_coefficient>`__
     - ``Sorensen``
     - ``sorensen``, ``sorensen_dice``, ``dice``
   * - `Tversky index <https://en.wikipedia.org/wiki/Tversky_index>`__
     - ``Tversky``
     - ``tversky``
   * - `Overlap coefficient <https://en.wikipedia.org/wiki/Overlap_coefficient>`__
     - ``Overlap``
     - ``overlap``
   * - `Tanimoto distance <https://en.wikipedia.org/wiki/Jaccard_index#Tanimoto_similarity_and_distance>`__
     - ``Tanimoto``
     - ``tanimoto``
   * - `Cosine similarity <https://en.wikipedia.org/wiki/Cosine_similarity>`__
     - ``Cosine``
     - ``cosine``
   * - `Monge-Elkan <https://www.academia.edu/200314/Generalized_Monge-Elkan_Method_for_Approximate_Text_String_Comparison>`__
     - ``MongeElkan``
     - ``monge_elkan``
   * - `Bag distance <https://github.com/Yomguithereal/talisman/blob/master/src/metrics/distance/bag.js>`__
     - ``Bag``
     - ``bag``

Sequence based
~~~~~~~~~~~~~~

.. list-table::
   :header-rows: 1

   * - Algorithm
     - Class
     - Functions
   * - `Longest common subsequence similarity <https://en.wikipedia.org/wiki/Longest_common_subsequence_problem>`__
     - ``LCSSeq``
     - ``lcsseq``
   * - `Longest common substring similarity <https://docs.python.org/2/library/difflib.html#difflib.SequenceMatcher>`__
     - ``LCSStr``
     - ``lcsstr``
   * - `Ratcliff-Obershelp similarity <http://collaboration.cmc.ec.gc.ca/science/rpn/biblio/ddj/Website/articles/DDJ/1988/8807/8807c/8807c.htm>`__
     - ``RatcliffObershelp``
     - ``ratcliff_obershelp``

Compression based
~~~~~~~~~~~~~~~~~

Work in progress. Currently, all algorithms compare two strings as arrays
of bits, not character by character.

``NCD`` - normalized compression distance.

Functions:

1. ``bz2_ncd``
2. ``lzma_ncd``
3. ``arith_ncd``
4. ``rle_ncd``
5. ``bwtrle_ncd``
6. ``zlib_ncd``
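
For orientation, the commonly cited definition of normalized compression
distance is ``NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y))``,
where ``C`` is the compressed size. The sketch below applies that formula
with the standard-library ``zlib`` module. It is an illustration only: the
helper names ``compressed_size`` and ``ncd`` are hypothetical, it works on
bytes rather than bits, and it is not claimed to match how the functions
listed above are implemented.

.. code:: python

    import random
    import zlib


    def compressed_size(data):
        """Length in bytes of the zlib-compressed input."""
        return len(zlib.compress(data))


    def ncd(x, y):
        """Normalized compression distance between two byte strings."""
        size_x, size_y = compressed_size(x), compressed_size(y)
        size_xy = compressed_size(x + y)
        return (size_xy - min(size_x, size_y)) / max(size_x, size_y)


    repetitive = b'abcdefgh' * 50
    # Pseudo-random bytes from a fixed seed: hard to compress and
    # unrelated to the repetitive text.
    noise = bytes(random.Random(0).randrange(256) for _ in range(400))

    print(ncd(repetitive, repetitive))  # small: the second copy adds little
    print(ncd(repetitive, noise))       # large: the inputs share nothing

Because real compressors add fixed overhead, the value for identical
inputs is small but usually not exactly zero.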

Phonetic
~~~~~~~~

.. list-table::
   :header-rows: 1

   * - Algorithm
     - Class
     - Functions
   * - `MRA <https://en.wikipedia.org/wiki/Match_rating_approach>`__
     - ``MRA``
     - ``mra``
   * - `Editex <https://anhaidgroup.github.io/py_stringmatching/v0.3.x/Editex.html>`__
     - ``Editex``
     - ``editex``

Simple
~~~~~~

.. list-table::
   :header-rows: 1

   * - Algorithm
     - Class
     - Functions
   * - Prefix similarity
     - ``Prefix``
     - ``prefix``
   * - Postfix similarity
     - ``Postfix``
     - ``postfix``
   * - Length distance
     - ``Length``
     - ``length``
   * - Identity similarity
     - ``Identity``
     - ``identity``
   * - Matrix similarity
     - ``Matrix``
     - ``matrix``

Usage
-----

Every algorithm is available through two interfaces:

1. A class that accepts algorithm-specific parameters on initialization.
2. An instance of that class, created with default parameters, for quick
   and simple usage.

All algorithms share some common methods:

1. ``.distance(*sequences)`` -- calculate the distance between sequences.
2. ``.similarity(*sequences)`` -- calculate the similarity of sequences.
3. ``.maximum(*sequences)`` -- the maximum possible value for distance and
   similarity. For any input, ``distance + similarity == maximum``.
4. ``.normalized_distance(*sequences)`` -- normalized distance between
   sequences. The return value is a float between 0 and 1, where 0 means
   equal and 1 means totally different.
5. ``.normalized_similarity(*sequences)`` -- normalized similarity of
   sequences. The return value is a float between 0 and 1, where 0 means
   totally different and 1 means equal.
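
How these methods relate to one another can be illustrated with a small
pure-Python sketch. It does not use the library's code: the function names
are hypothetical, and it assumes a Hamming-style comparison in which
positions present in only one sequence count as mismatches.

.. code:: python

    def hamming_distance(a, b):
        """Count the positions where the two sequences differ."""
        mismatches = sum(x != y for x, y in zip(a, b))
        # Positions that exist in only one sequence count as differences.
        return mismatches + abs(len(a) - len(b))


    def hamming_maximum(a, b):
        """Largest possible distance: every position differs."""
        return max(len(a), len(b))


    def hamming_similarity(a, b):
        """Defined so that distance + similarity == maximum."""
        return hamming_maximum(a, b) - hamming_distance(a, b)


    def normalized_distance(a, b):
        """Scale the distance into [0, 1]; 0 means equal."""
        maximum = hamming_maximum(a, b)
        return hamming_distance(a, b) / maximum if maximum else 0.0


    def normalized_similarity(a, b):
        """Complement of the normalized distance; 1 means equal."""
        return 1.0 - normalized_distance(a, b)


    print(hamming_distance('test', 'text'))       # 1
    print(hamming_similarity('test', 'text'))     # 3
    print(hamming_maximum('test', 'text'))        # 4
    print(normalized_distance('test', 'text'))    # 0.25
    print(normalized_similarity('test', 'text'))  # 0.75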

Most common init arguments:

1. ``qval`` -- q-value used to split sequences into q-grams. Possible
   values:

   - 1 (default) -- compare sequences character by character.
   - 2 or more -- transform sequences into q-grams.
   - None -- split sequences into words.

2. ``as_set`` -- for token-based algorithms:

   - True -- ``t`` and ``ttt`` are equal.
   - False (default) -- ``t`` and ``ttt`` are different.
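
The effect of these two arguments can be sketched in plain Python. The
helpers ``split_tokens`` and ``jaccard_similarity`` below are hypothetical
and only approximate the behaviour described above: they assume q-grams
are overlapping substrings and that ``as_set=False`` means tokens are
counted with multiplicity.

.. code:: python

    from collections import Counter


    def split_tokens(sequence, qval=1):
        """Split a string into tokens according to a q-value."""
        if qval is None:
            return sequence.split()      # words
        if qval == 1:
            return list(sequence)        # single characters
        # Overlapping q-grams: 'test' with qval=2 gives 'te', 'es', 'st'.
        return [sequence[i:i + qval]
                for i in range(len(sequence) - qval + 1)]


    def jaccard_similarity(a, b, qval=1, as_set=False):
        """Jaccard index over token multisets, or sets if as_set is True."""
        tokens_a, tokens_b = split_tokens(a, qval), split_tokens(b, qval)
        if as_set:
            # Drop repeated tokens so 't' and 'ttt' look identical.
            tokens_a, tokens_b = set(tokens_a), set(tokens_b)
        count_a, count_b = Counter(tokens_a), Counter(tokens_b)
        intersection = sum((count_a & count_b).values())
        union = sum((count_a | count_b).values())
        return intersection / union if union else 1.0


    print(split_tokens('test', qval=2))                  # ['te', 'es', 'st']
    print(split_tokens('hello world', qval=None))        # ['hello', 'world']
    print(jaccard_similarity('t', 'ttt', as_set=True))   # 1.0
    print(jaccard_similarity('t', 'ttt', as_set=False))  # one third

With ``qval=2``, ``test`` and ``text`` become ``te, es, st`` and
``te, ex, xt``, which differ in two positions.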

Example
-------

.. code:: python

    import textdistance

    # Calling the default instance directly returns the distance.
    textdistance.hamming('test', 'text')
    # 1

    textdistance.hamming.distance('test', 'text')
    # 1

    textdistance.hamming.similarity('test', 'text')
    # 3

    textdistance.hamming.normalized_distance('test', 'text')
    # 0.25

    textdistance.hamming.normalized_similarity('test', 'text')
    # 0.75

    # Use the class to pass algorithm-specific parameters such as qval.
    textdistance.Hamming(qval=2).distance('test', 'text')
    # 2
