
Machine learning with dirty categories.



dirty_cat is a Python library that facilitates machine learning on dirty categorical variables.

For a detailed description of the problem of encoding dirty categorical data, see Similarity encoding for learning with dirty categorical variables [1] and Encoding high-cardinality string categorical variables [2].

If you like the package, please spread the word, and ⭐ the repository!

What can dirty_cat do?

dirty_cat provides tools (TableVectorizer, fuzzy_join…) and encoders (GapEncoder, MinHashEncoder…) for handling morphological similarities between categories. We usually identify three common cases: similarities, typos and variations.
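To make "morphological similarity" concrete, here is a minimal sketch that compares strings by the character 3-grams they share. It is only an illustration written with the standard library, not dirty_cat's implementation; the helper names (char_ngrams, ngram_similarity) and the sample entries are hypothetical.

```python
def char_ngrams(text: str, n: int = 3) -> set:
    """Return the set of character n-grams of a lowercased, space-padded string."""
    padded = f" {text.lower().strip()} "
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def ngram_similarity(a: str, b: str, n: int = 3) -> float:
    """Jaccard similarity between the n-gram sets of two strings (0.0 to 1.0)."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    if not grams_a and not grams_b:
        return 1.0  # two empty strings are treated as identical
    return len(grams_a & grams_b) / len(grams_a | grams_b)


# Hypothetical dirty entries: a typo, a variation, and an unrelated category.
reference = "police officer"
entries = ["police officer", "police oficer", "Police Officer II", "firefighter"]
for entry in entries:
    print(f"{entry!r}: {ngram_similarity(reference, entry):.2f}")
```

Entries that differ only by a typo or a suffix keep most of their n-grams in common, so they score much closer to the reference than an unrelated category does.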

The first example notebook goes in-depth on how to identify and deal with dirty data using the dirty_cat library.

What dirty_cat does not do

Semantic similarities are currently not supported. For example, the similarity between car and automobile is outside the reach of the methods implemented here.

This kind of problem is tackled by Natural Language Processing methods.

dirty_cat can still help with handling typos and variations in this kind of setting.
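The car and automobile example can be checked with any purely string-based measure. The sketch below uses difflib from the Python standard library as a stand-in; it is not one of dirty_cat's encoders, and the helper name string_ratio is hypothetical.

```python
from difflib import SequenceMatcher


def string_ratio(a: str, b: str) -> float:
    """Character-level similarity ratio between two strings (0.0 to 1.0)."""
    return SequenceMatcher(None, a, b).ratio()


# Same meaning, almost no shared characters: a string-based score stays low.
print(f"car vs automobile:       {string_ratio('car', 'automobile'):.2f}")

# A typo of the same word: the score stays high.
print(f"automobile vs automobil: {string_ratio('automobile', 'automobil'):.2f}")
```

A string-based measure sees the typo as a near match but has no way to relate the two synonyms, which is why semantic similarity needs a different family of methods.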

Installation

dirty_cat can be easily installed via pip:

pip install dirty_cat
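After installing, one way to confirm that the package is visible to the current interpreter is to query its metadata. This is a small sketch using only the standard library (Python 3.8 or later); the helper name installed_version is hypothetical.

```python
from importlib import metadata


def installed_version(package: str):
    """Return the installed version string of a package, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Prints the version string if dirty_cat is installed, otherwise None.
print(installed_version("dirty_cat"))
```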

Dependencies

Dependencies and minimal versions are listed in the setup file.

Contributing

If you want to encourage development of dirty_cat, the best thing to do is to spread the word!

If you encounter an issue while using dirty_cat, please open an issue and/or submit a pull request. Don’t hesitate: you’re helping to make this project better for everyone!

Additional resources

References
