Hidden alignment conditional random field, a discriminative string edit distance

Project description

Hidden alignment conditional random field for classifying string pairs - a learnable edit distance.

This package aims to implement the HACRF machine learning model with a sklearn-like interface. It includes ways to fit a model to training examples and score new examples.

The model takes string pairs as input and classifies them into any number of classes. In McCallum's original paper the model was applied to the database deduplication problem. Each database entry was paired with every other entry, and the model then classified each pair as a 'match' or a 'mismatch' based on training examples of matches and mismatches.
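The pairing step described above can be sketched with the standard library. This is only an illustration of how candidate pairs might be generated before feature extraction; the helper name candidate_pairs and the sample records are hypothetical and not part of pyhacrf.

```python
from itertools import combinations


def candidate_pairs(records):
    """Pair every record with every other record exactly once.

    Hypothetical helper, not part of pyhacrf. For n records it yields
    n * (n - 1) / 2 unordered pairs.
    """
    return list(combinations(records, 2))


# Made-up records for illustration only.
records = ['home', 'h0me', 'hello', 'helloooo']
pairs = candidate_pairs(records)

print(len(pairs))   # 6
print(pairs[0])     # ('home', 'h0me')
```

The resulting list of tuples has the same shape as training_X in the example below, so it could be passed to a fitted feature extractor. The number of pairs grows quadratically with the number of records, so large databases usually need some way of limiting which pairs are compared.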

I also tried to use it as a learnable string edit distance for normalizing noisy text. See "A Conditional Random Field for Discriminatively-trained Finite-state String Edit Distance" by McCallum, Bellare, and Pereira, and the report "Conditional Random Fields for Noisy text normalisation" by Dirko Coetsee.
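For contrast, here is a minimal sketch of the classic, non-learnable edit distance (Levenshtein distance), in which every insertion, deletion, and substitution has a fixed cost of 1. This is not the HACRF model and not code from this package; it only shows the fixed-cost baseline that a learnable edit distance replaces with weights fitted from labelled pairs.

```python
def levenshtein(source, target):
    """Classic edit distance with unit cost per insert, delete, or substitute.

    Baseline for comparison only; this is not how pyhacrf scores pairs.
    """
    # previous[j] holds the distance between the processed prefix of
    # `source` and the first j characters of `target`.
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        current = [i]
        for j, t_char in enumerate(target, start=1):
            substitution = previous[j - 1] + (s_char != t_char)
            insertion = current[j - 1] + 1
            deletion = previous[j] + 1
            current.append(min(substitution, insertion, deletion))
        previous = current
    return previous[-1]


print(levenshtein('h0me', 'home'))        # 1
print(levenshtein('helloooo', 'hello'))   # 3
print(levenshtein('krazii', 'crazy'))     # 3
```

With fixed costs, 'helloooo' and 'krazii' are equally far from their targets. A trained model can instead learn that some edits, such as dropping repeated letters, are weaker evidence of a mismatch than others.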

Example

from pyhacrf import StringPairFeatureExtractor, Hacrf

training_X = [('helloooo', 'hello'), # Matching examples
              ('h0me', 'home'),
              ('krazii', 'crazy'),
              ('non matching string example', 'no really'), # Non-matching examples
              ('and another one', 'yep')]
training_y = ['match',
              'match',
              'match',
              'non-match',
              'non-match']

# Extract features
feature_extractor = StringPairFeatureExtractor(match=True, numeric=True)
training_X_extracted = feature_extractor.fit_transform(training_X)

# Train model
model = Hacrf(l2_regularization=1.0)
model.fit(training_X_extracted, training_y)

# Evaluate
from sklearn.metrics import confusion_matrix
predictions = model.predict(training_X_extracted)

print(confusion_matrix(training_y, predictions))
> [[0 3]
>  [2 0]]

print(model.predict_proba(training_X_extracted))
> [[ 0.94914812  0.05085188]
>  [ 0.92397711  0.07602289]
>  [ 0.86756034  0.13243966]
>  [ 0.05438812  0.94561188]
>  [ 0.02641275  0.97358725]]

Dependencies

This package depends on numpy. The LBFGS optimizer in pylbfgs is used, but alternative optimizers can be passed.

Install

Install by running:

python setup.py install

or from pypi:

pip install pyhacrf

Developing

Clone the repository, then run:

pip install -r requirements-dev.txt
cython pyhacrf/*.pyx
python setup.py install

To deploy to PyPI, make sure you have compiled the *.pyx files to *.c first.
