Hidden alignment conditional random field, a discriminative string edit distance

Hidden alignment conditional random field for classifying string pairs - a learnable edit distance.

This package aims to implement the HACRF machine learning model with a sklearn-like interface. It includes ways to fit a model to training examples and to score new examples.

The model takes string pairs as input and classifies them into any number of classes. In McCallum’s original paper, the model was applied to the database deduplication problem: each database entry was paired with every other entry, and the model classified each pair as a ‘match’ or a ‘mismatch’ based on training examples of matches and mismatches.
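
The pairing step described above can be sketched in a few lines. This is only an illustration and not part of pyhacrf: the `records` list and the `candidate_pairs` helper are hypothetical names, and a real deduplication job would read its entries from a database.

```python
from itertools import combinations

# Hypothetical toy records, made up for this sketch.
records = ["Jon Smith", "John Smith", "Jane Doe", "J. Doe"]

def candidate_pairs(entries):
    """Return every unordered pair of distinct entries, each pair once."""
    return list(combinations(entries, 2))

pairs = candidate_pairs(records)
print(len(pairs))  # 4 entries give 4 * 3 / 2 = 6 pairs
```

Note that the number of pairs grows quadratically with the number of entries, so large databases usually need some way of limiting which pairs are scored.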

I also tried to use it as a learnable string edit distance for normalizing noisy text. See "A Conditional Random Field for Discriminatively-trained Finite-state String Edit Distance" by McCallum, Bellare, and Pereira, and the report "Conditional Random Fields for Noisy text normalisation" by Dirko Coetsee.
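
For contrast, here is a minimal sketch of text normalization with the classic fixed-cost (Levenshtein) edit distance. It is not the HACRF model: every insertion, deletion, and substitution costs exactly 1, whereas a learnable edit distance fits such weights from labeled pairs. The `levenshtein` and `normalize` helpers and the small `vocabulary` are assumptions made for this illustration.

```python
def levenshtein(a, b):
    """Classic edit distance with unit costs for insert, delete and substitute."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # delete char_a
                               current[j - 1] + 1,       # insert char_b
                               previous[j - 1] + cost))  # substitute or match
        previous = current
    return previous[-1]

def normalize(token, vocabulary):
    """Return the vocabulary word closest to token under the fixed-cost distance."""
    return min(vocabulary, key=lambda word: levenshtein(token, word))

# Hypothetical vocabulary, reusing strings from the example in this README.
vocabulary = ["hello", "home", "crazy"]
print(normalize("h0me", vocabulary))
print(normalize("helloooo", vocabulary))
```

With fixed costs, a substitution such as '0' for 'o' is penalized exactly like any other substitution. A model trained on examples can instead learn that some edits are more plausible than others.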

Example

from pyhacrf import StringPairFeatureExtractor, Hacrf

training_X = [('helloooo', 'hello'), # Matching examples
              ('h0me', 'home'),
              ('krazii', 'crazy'),
              ('non matching string example', 'no really'), # Non-matching examples
              ('and another one', 'yep')]
training_y = ['match',
              'match',
              'match',
              'non-match',
              'non-match']

# Extract features
feature_extractor = StringPairFeatureExtractor(match=True, numeric=True)
training_X_extracted = feature_extractor.fit_transform(training_X)

# Train model
model = Hacrf(l2_regularization=1.0)
model.fit(training_X_extracted, training_y)

# Evaluate (on the training data, for illustration only)
from sklearn.metrics import confusion_matrix
predictions = model.predict(training_X_extracted)

print(confusion_matrix(training_y, predictions))
# Output:
# [[0 3]
#  [2 0]]

print(model.predict_proba(training_X_extracted))
# Output:
# [[ 0.94914812  0.05085188]
#  [ 0.92397711  0.07602289]
#  [ 0.86756034  0.13243966]
#  [ 0.05438812  0.94561188]
#  [ 0.02641275  0.97358725]]

Dependencies

This package depends on NumPy. The L-BFGS optimizer from pylbfgs is used by default, but alternative optimizers can be passed.

Install

Install by running:

python setup.py install

or from PyPI, where this distribution is published as pyhacrf-datamade:

pip install pyhacrf-datamade

Developing

Clone the repository, then run:

pip install -r requirements.txt
cython pyhacrf/*.pyx
python setup.py install

To deploy to PyPI, make sure you have compiled the *.pyx files to *.c first.
