textdistance

Compute distance between the two texts.

These details have not been verified by PyPI

Project links

Development Status
- 5 - Production/Stable
Environment
- Plugins
Intended Audience
- Developers
License
- OSI Approved :: MIT License
Programming Language
- Python
Topic
- Scientific/Engineering :: Human Machine Interfaces

Project description

TextDistance is a Python library for comparing the distance between two or more sequences using many algorithms.

Features:

30+ algorithms
Pure Python implementation
Simple usage
Comparison of more than two sequences
Some algorithms have more than one implementation in one class.
Optional numpy usage for maximum speed.

Algorithms

Edit based

Algorithm	Class	Functions
Hamming	Hamming	hamming
MLIPNS	Mlipns	mlipns
Levenshtein	Levenshtein	levenshtein
Damerau-Levenshtein	DamerauLevenshtein	damerau_levenshtein
Jaro-Winkler	JaroWinkler	jaro_winkler, jaro
Strcmp95	StrCmp95	strcmp95
Needleman-Wunsch	NeedlemanWunsch	needleman_wunsch
Gotoh	Gotoh	gotoh
Smith-Waterman	SmithWaterman	smith_waterman
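
To give a feel for what an edit-based measure computes, here is a minimal sketch of the classic dynamic-programming Levenshtein distance. The function name `levenshtein_sketch` is hypothetical and this is not the library's own implementation, only an illustration of the technique under the usual unit-cost assumptions.

```python
def levenshtein_sketch(a, b):
    """Unit-cost edit distance between two sequences (illustrative only)."""
    # previous[j] is the distance between the processed prefix of a and b[:j]
    previous = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        current = [i]  # distance between a[:i] and the empty prefix of b
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


print(levenshtein_sketch("test", "text"))  # 1
```

Keeping only two rows of the table at a time makes memory use proportional to the length of the second sequence.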

Token based

Algorithm	Class	Functions
Jaccard index	Jaccard	jaccard
Sørensen–Dice coefficient	Sorensen	sorensen, sorensen_dice, dice
Tversky index	Tversky	tversky
Overlap coefficient	Overlap	overlap
Tanimoto distance	Tanimoto	tanimoto
Cosine similarity	Cosine	cosine
Monge-Elkan	MongeElkan	monge_elkan
Bag distance	Bag	bag
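
As an illustration of the token-based family, the sketch below computes a Jaccard index over multisets of characters using only the standard library. The name `jaccard_sketch` is hypothetical; it shows the general idea rather than the code textdistance actually runs.

```python
from collections import Counter


def jaccard_sketch(a, b):
    """Jaccard index over multisets of tokens (illustrative only)."""
    count_a, count_b = Counter(a), Counter(b)
    # Counter & Counter keeps the minimum count per token, | keeps the maximum
    intersection = sum((count_a & count_b).values())
    union = sum((count_a | count_b).values())
    # Two empty sequences are treated as identical here (an assumption)
    return intersection / union if union else 1.0


print(jaccard_sketch("test", "text"))  # 0.6
```

For 'test' and 'text' the shared tokens are t, t, e (3 items) out of a union of 5, which gives 0.6.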

Sequence based

Algorithm	Class	Functions
longest common subsequence similarity	LCSSeq	lcsseq
longest common substring similarity	LCSStr	lcsstr
Ratcliff-Obershelp similarity	RatcliffObershelp	ratcliff_obershelp
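
For a quick feel of sequence-based similarity, the standard library's `difflib.SequenceMatcher` implements a closely related "gestalt pattern matching" approach. The wrapper below (hypothetical name `ratcliff_obershelp_sketch`) is an assumption-laden stand-in, not textdistance's own code, and the two may disagree on some inputs.

```python
from difflib import SequenceMatcher


def ratcliff_obershelp_sketch(a, b):
    """Similarity in [0, 1]: 2 * matched items / total items (illustrative only)."""
    # autojunk=False disables difflib's heuristic that ignores very frequent items
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


print(ratcliff_obershelp_sketch("test", "text"))  # 0.75
```

Here the matched blocks are "te" and the final "t", so 3 matched items out of 8 total give 2 * 3 / 8 = 0.75.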

Compression based

Normalized compression distance with different compression algorithms.

Classic compression algorithms:

Algorithm	Class	Function
Arithmetic coding	ArithNCD	arith_ncd
RLE	RLENCD	rle_ncd
BWT RLE	BWTRLENCD	bwtrle_ncd

Normal compression algorithms:

Algorithm	Class	Function
Square Root	SqrtNCD	sqrt_ncd
Entropy	EntropyNCD	entropy_ncd

Work-in-progress algorithms that compare two strings as arrays of bits:

Algorithm	Class	Function
BZ2	BZ2NCD	bz2_ncd
LZMA	LZMANCD	lzma_ncd
ZLib	ZLIBNCD	zlib_ncd

See blog post for more details about NCD.
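
The idea behind normalized compression distance can be sketched with the standard library's `zlib`. This uses the common textbook formula NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C is the compressed size; the function names are hypothetical, and the library's own NCD classes may differ in details such as how the inputs are concatenated.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Size in bytes of the zlib-compressed input."""
    return len(zlib.compress(data))


def ncd_sketch(x: bytes, y: bytes) -> float:
    """Textbook normalized compression distance (illustrative only)."""
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


repetitive = b"the quick brown fox jumps over the lazy dog " * 20
unrelated = bytes(range(256)) * 4

# Similar inputs compress well together, so their distance is smaller
print(ncd_sketch(repetitive, repetitive) < ncd_sketch(repetitive, unrelated))
```

Because real compressors add headers and are not perfectly idempotent, the value for identical inputs is small but usually not exactly 0.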

Phonetic

Algorithm	Class	Functions
MRA	MRA	mra
Editex	Editex	editex

Simple

Algorithm	Class	Functions
Prefix similarity	Prefix	prefix
Postfix similarity	Postfix	postfix
Length distance	Length	length
Identity similarity	Identity	identity
Matrix similarity	Matrix	matrix

Installation

Stable

Pure Python implementation only:

pip install textdistance

With extra libraries for maximum speed:

pip install "textdistance[extras]"

With all libraries (required for benchmarking and testing):

pip install "textdistance[benchmark]"

With algorithm-specific extras:

pip install "textdistance[Hamming]"

Algorithms with available extras: DamerauLevenshtein, Hamming, Jaro, JaroWinkler, Levenshtein.

Dev

Via pip:

pip install -e git+https://github.com/life4/textdistance.git#egg=textdistance

Or clone repo and install with some extras:

git clone https://github.com/life4/textdistance.git
cd textdistance
pip install -e ".[benchmark]"

Usage

All algorithms have 2 interfaces:

Class with algorithm-specific params for customizing.
Class instance with default params for quick and simple usage.

All algorithms have some common methods:

.distance(*sequences) – calculate distance between sequences.
.similarity(*sequences) – calculate similarity for sequences.
.maximum(*sequences) – maximum possible value for distance and similarity. For any sequence: distance + similarity == maximum.
.normalized_distance(*sequences) – normalized distance between sequences. The return value is a float between 0 and 1, where 0 means equal, and 1 totally different.
.normalized_similarity(*sequences) – normalized similarity for sequences. The return value is a float between 0 and 1, where 0 means totally different, and 1 equal.
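
How the five methods relate to each other can be sketched with a toy class. `HammingSketch` below is a hypothetical stand-in written for illustration, not the library's class; it assumes that positions missing from a shorter sequence count as differences.

```python
_MISSING = object()  # padding marker for sequences of unequal length


class HammingSketch:
    """Toy stand-in showing how the common methods fit together."""

    def maximum(self, *sequences):
        # Largest possible value: every position differs
        return max(len(s) for s in sequences)

    def distance(self, *sequences):
        total = 0
        for i in range(self.maximum(*sequences)):
            items = {s[i] if i < len(s) else _MISSING for s in sequences}
            if len(items) > 1:  # at least two sequences disagree here
                total += 1
        return total

    def similarity(self, *sequences):
        # distance + similarity == maximum
        return self.maximum(*sequences) - self.distance(*sequences)

    def normalized_distance(self, *sequences):
        maximum = self.maximum(*sequences)
        return self.distance(*sequences) / maximum if maximum else 0.0

    def normalized_similarity(self, *sequences):
        return 1.0 - self.normalized_distance(*sequences)


sketch = HammingSketch()
print(sketch.distance("test", "text"), sketch.similarity("test", "text"))  # 1 3
```

The normalized variants simply divide by the maximum, which is why they always fall between 0 and 1.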

Most common init arguments:

qval – q-value for splitting sequences into q-grams. Possible values:
- 1 (default) – compare sequences by chars.
- 2 or more – transform sequences to q-grams.
- None – split sequences by words.
as_set – for token-based algorithms:
- True – t and ttt are equal.
- False (default) – t and ttt are different.
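
What these two arguments mean can be sketched in a few lines of plain Python. The helper `split_sketch` is hypothetical and only mirrors the behaviour described above; the library's internal tokenization may differ.

```python
from collections import Counter


def split_sketch(sequence, qval=1):
    """Split a string the way the qval values above are described."""
    if qval is None:
        return sequence.split()  # words
    # qval == 1 yields single chars, larger values yield overlapping q-grams
    return [sequence[i:i + qval] for i in range(len(sequence) - qval + 1)]


print(split_sketch("test", qval=2))            # ['te', 'es', 'st']
print(split_sketch("hello world", qval=None))  # ['hello', 'world']

# as_set=False keeps counts, so 't' and 'ttt' differ; as_set=True drops them
print(Counter(split_sketch("t")) == Counter(split_sketch("ttt")))  # False
print(set(split_sketch("t")) == set(split_sketch("ttt")))          # True
```

With qval=2, 'test' becomes te, es, st and 'text' becomes te, ex, xt, so two of the three positions differ.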

Example

For example, Hamming distance:

import textdistance

textdistance.hamming('test', 'text')
# 1

textdistance.hamming.distance('test', 'text')
# 1

textdistance.hamming.similarity('test', 'text')
# 3

textdistance.hamming.normalized_distance('test', 'text')
# 0.25

textdistance.hamming.normalized_similarity('test', 'text')
# 0.75

textdistance.Hamming(qval=2).distance('test', 'text')
# 2

All other algorithms have the same interface.

Extra libraries

For the main algorithms, textdistance tries to call known external libraries (fastest first) if they are available (installed on your system) and applicable (the external implementation can compare the given type of sequences). Install textdistance with extras to enable this feature.

You can disable this by passing the external=False argument on init:

import textdistance
hamming = textdistance.Hamming(external=False)
hamming('text', 'testit')
# 3
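
The "fastest first, fall back to pure Python" behaviour can be pictured roughly as follows. This is a sketch of the general pattern, not the library's actual lookup code: the `CANDIDATES` list, `pick_hamming`, and `pure_python_hamming` are hypothetical names, and the module and function pairs are taken from the benchmark table on this page.

```python
import importlib
from itertools import zip_longest


def pure_python_hamming(a, b):
    """Fallback: count differing positions, padding the shorter sequence."""
    return sum(x != y for x, y in zip_longest(a, b))


# (module, function) pairs in assumed priority order, fastest first
CANDIDATES = [("Levenshtein", "hamming"), ("jellyfish", "hamming_distance")]


def pick_hamming(external=True):
    """Return the first importable external function, else the fallback."""
    if external:
        for module_name, func_name in CANDIDATES:
            try:
                module = importlib.import_module(module_name)
                return getattr(module, func_name)
            except (ImportError, AttributeError):
                continue  # library missing or function renamed: try the next one
    return pure_python_hamming


print(pick_hamming(external=False)("text", "testit"))  # 3
```

With external=False the loop is skipped entirely, so the pure Python fallback is always used.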

Supported libraries:

Algorithms:

DamerauLevenshtein
Hamming
Jaro
JaroWinkler
Levenshtein

Benchmarks

Without extras installation:

algorithm	library	function	time
DamerauLevenshtein	jellyfish	damerau_levenshtein_distance	0.00965294
DamerauLevenshtein	pyxdameraulevenshtein	damerau_levenshtein_distance	0.151378
DamerauLevenshtein	pylev	damerau_levenshtein	0.766461
DamerauLevenshtein	textdistance	DamerauLevenshtein	4.13463
DamerauLevenshtein	abydos	damerau_levenshtein	4.3831
Hamming	Levenshtein	hamming	0.0014428
Hamming	jellyfish	hamming_distance	0.00240262
Hamming	distance	hamming	0.036253
Hamming	abydos	hamming	0.0383933
Hamming	textdistance	Hamming	0.176781
Jaro	Levenshtein	jaro	0.00313561
Jaro	jellyfish	jaro_distance	0.0051885
Jaro	py_stringmatching	jaro	0.180628
Jaro	textdistance	Jaro	0.278917
JaroWinkler	Levenshtein	jaro_winkler	0.00319735
JaroWinkler	jellyfish	jaro_winkler	0.00540443
JaroWinkler	textdistance	JaroWinkler	0.289626
Levenshtein	Levenshtein	distance	0.00414404
Levenshtein	jellyfish	levenshtein_distance	0.00601647
Levenshtein	py_stringmatching	levenshtein	0.252901
Levenshtein	pylev	levenshtein	0.569182
Levenshtein	distance	levenshtein	1.15726
Levenshtein	abydos	levenshtein	3.68451
Levenshtein	textdistance	Levenshtein	8.63674

Total: 24 libs.

Yes, the pure Python implementations are slow. Use TextDistance in production only with extras installed.

TextDistance uses the benchmark results to optimize algorithm selection and tries to call the fastest external library first (if possible).

You can run benchmark manually on your system:

pip install "textdistance[benchmark]"
python3 -m textdistance.benchmark

TextDistance shows a benchmark results table for your system and saves the library priorities to a libraries.json file in TextDistance's folder. textdistance then uses this file to call the fastest algorithm implementation. A default libraries.json is already included in the package.

Test

You can run tests via tox:

sudo pip3 install tox
tox

Release history

4.6.3

Jul 16, 2024

4.6.2

Apr 24, 2024

4.6.1

Dec 29, 2023

4.6.0

Sep 28, 2023

4.5.0

Sep 18, 2022

4.4.0

Aug 21, 2022

4.3.0

Jun 29, 2022

4.2.2

Oct 27, 2021

4.2.1

Jan 29, 2021

This version

4.2.0

Apr 13, 2020

4.1.5

Oct 3, 2019

4.1.4

Aug 6, 2019

4.1.3

Apr 18, 2019

4.1.2

Mar 18, 2019

4.1.1

Mar 15, 2019

4.1.0

Mar 9, 2019

4.0.0

Mar 3, 2019

3.1.0

Jan 22, 2019

3.0.3

Apr 3, 2018

3.0.2

Mar 31, 2018

3.0.1

Mar 31, 2018

3.0.0

Mar 31, 2018

2.0.5

Apr 3, 2018

2.0.4

Apr 3, 2018

2.0.3

Mar 28, 2018

2.0.1

Feb 10, 2018

2.0.0

Feb 10, 2018

1.0.0

May 5, 2017

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

textdistance-4.2.0.tar.gz (34.5 kB)

Uploaded Apr 13, 2020 Source

Built Distribution

textdistance-4.2.0-py3-none-any.whl (29.1 kB)

Uploaded Apr 13, 2020 Python 3

File details

Details for the file textdistance-4.2.0.tar.gz.

File metadata

Download URL: textdistance-4.2.0.tar.gz
Upload date: Apr 13, 2020
Size: 34.5 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: DepHell/0.8.2

File hashes

Hashes for textdistance-4.2.0.tar.gz
Algorithm	Hash digest
SHA256	`6d2a398815aeed453cfb38a3b62da74e33fa6a5f4e42845fd1d2c9611836befd`
MD5	`3d31f3930b0ce295f74c307040451ead`
BLAKE2b-256	`a94c96d7ff24f1bee11ade34b1daea9f70fc4c115781bbf380089470c053ef4d`

File details

Details for the file textdistance-4.2.0-py3-none-any.whl.

File metadata

Download URL: textdistance-4.2.0-py3-none-any.whl
Upload date: Apr 13, 2020
Size: 29.1 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: DepHell/0.8.2

File hashes

Hashes for textdistance-4.2.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`61ddcdd9a78da99eff11dc1219d444f72915212cf36947de3266a356f5e934f7`
MD5	`60ab7c6f34b2c53901b2d5772cb92d5a`
BLAKE2b-256	`357187133323736b9b0180f600d477507318dae0abde613a54df33bfd0248614`
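
A downloaded file can be checked against the published SHA256 digest with the standard library. The sketch below is one possible way to do it; the function names are hypothetical, and the path in the commented usage line assumes the file was saved to the current directory.

```python
import hashlib


def sha256_of_stream(stream, chunk_size=65536):
    """Hex SHA256 digest of a binary file-like object, read in chunks."""
    digest = hashlib.sha256()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path, expected_hex):
    """True if the file at path has the expected SHA256 digest."""
    with open(path, "rb") as handle:
        return sha256_of_stream(handle) == expected_hex.strip().lower()


# Usage (assumes the sdist sits in the current directory):
# matches_published_hash(
#     "textdistance-4.2.0.tar.gz",
#     "6d2a398815aeed453cfb38a3b62da74e33fa6a5f4e42845fd1d2c9611836befd",
# )
```

Reading in chunks keeps memory use flat regardless of file size.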