Hidden alignment conditional random field, a discriminative string edit distance

Hidden alignment conditional random field for classifying string pairs - a learnable edit distance.

This package aims to implement the HACRF machine learning model with a scikit-learn-like interface. It includes ways to fit a model to training examples and to score new examples.

The model takes string pairs as input and classifies them into any number of classes. In McCallum’s original paper the model was applied to the database deduplication problem: each database entry was paired with every other entry, and the model then classified each pair as a ‘match’ or a ‘mismatch’ based on training examples of matches and mismatches.
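
The pairing step described above can be sketched with the standard library. This is only an illustration of "every entry paired with every other entry"; the `records` list is a hypothetical toy dataset, and nothing here is part of pyhacrf itself.

```python
from itertools import combinations

# Hypothetical toy records, invented for illustration only.
records = ["Acme Corp.", "ACME Corporation", "Apex Ltd", "Acme Corp"]

# Every unordered pair of distinct entries: n * (n - 1) / 2 pairs in total.
candidate_pairs = list(combinations(records, 2))

print(len(candidate_pairs))
for left, right in candidate_pairs:
    print((left, right))
```

Each tuple in `candidate_pairs` has the same shape as the string pairs used as model input in the example further down. The number of pairs grows quadratically with the number of records, so this exhaustive pairing is only practical for small datasets.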

I also tried to use it as a learnable string edit distance for normalizing noisy text. See "A Conditional Random Field for Discriminatively-trained Finite-state String Edit Distance" by McCallum, Bellare, and Pereira, and the report "Conditional Random Fields for Noisy text normalisation" by Dirko Coetsee.
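
For contrast, here is a minimal sketch of the classic fixed-cost (Levenshtein) edit distance, where every insertion, deletion, and substitution costs 1. It is not pyhacrf code and not the HACRF model; it only shows the hand-set baseline that a learnable edit distance replaces with weights fitted from training pairs. The function name `levenshtein` is chosen here for illustration.

```python
def levenshtein(a, b):
    """Fixed-cost edit distance between strings a and b (unit costs)."""
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty string
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,                      # delete char_a
                current[j - 1] + 1,                   # insert char_b
                previous[j - 1] + substitution_cost,  # match or substitute
            ))
        previous = current
    return previous[-1]


for pair in [("helloooo", "hello"), ("h0me", "home"), ("krazii", "crazy")]:
    print(pair, levenshtein(*pair))
```

With unit costs, swapping `0` for `o` costs exactly as much as any other substitution. A learned model can instead treat such common noisy-text substitutions as cheap, which is the motivation for training the edit costs.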

Example

from pyhacrf import StringPairFeatureExtractor, Hacrf

training_X = [('helloooo', 'hello'), # Matching examples
              ('h0me', 'home'),
              ('krazii', 'crazy'),
              ('non matching string example', 'no really'), # Non-matching examples
              ('and another one', 'yep')]
training_y = ['match',
              'match',
              'match',
              'non-match',
              'non-match']

# Extract features
feature_extractor = StringPairFeatureExtractor(match=True, numeric=True)
training_X_extracted = feature_extractor.fit_transform(training_X)

# Train model
model = Hacrf(l2_regularization=1.0)
model.fit(training_X_extracted, training_y)

# Evaluate
from sklearn.metrics import confusion_matrix
predictions = model.predict(training_X_extracted)

print(confusion_matrix(training_y, predictions))
# > [[0 3]
# >  [2 0]]

print(model.predict_proba(training_X_extracted))
# > [[ 0.94914812  0.05085188]
# >  [ 0.92397711  0.07602289]
# >  [ 0.86756034  0.13243966]
# >  [ 0.05438812  0.94561188]
# >  [ 0.02641275  0.97358725]]

Dependencies

This package depends on numpy. The LBFGS optimizer in pylbfgs is used, but alternative optimizers can be passed.

Install

Install by running:

python setup.py install

or from PyPI:

pip install pyhacrf

Developing

Clone the repository, then run:

pip install -r requirements.txt
cython pyhacrf/*.pyx
python setup.py install

To deploy to PyPI, make sure you have compiled the *.pyx files to *.c first.
