Machine learning with dirty categories.
dirty_cat is a Python library that facilitates machine learning on dirty categorical variables.
For a detailed description of the problem of encoding dirty categorical data, see Similarity encoding for learning with dirty categorical variables [1] and Encoding high-cardinality string categorical variables [2].
If you like the package, please spread the word, and ⭐ the repository!
What can dirty_cat do?
dirty_cat provides tools (TableVectorizer, fuzzy_join…) and encoders (GapEncoder, MinHashEncoder…) for handling morphological similarities between category strings. We usually identify three common cases: similarities, typos and variations.
The first example notebook goes in-depth on how to identify and deal with dirty data using the dirty_cat library.
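To make "morphological similarity" concrete, here is a minimal standard-library sketch that compares two strings by the overlap of their character 3-grams. It is an illustration only and not dirty_cat's implementation: the helper names `ngrams` and `ngram_similarity`, the space padding and the choice of Jaccard overlap are all assumptions made for this example.

```python
def ngrams(text, n=3):
    """Return the set of character n-grams of a string (illustrative helper)."""
    # Pad with one space on each side so word boundaries produce n-grams too.
    padded = f" {text.lower()} "
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def ngram_similarity(a, b, n=3):
    """Jaccard overlap between the n-gram sets of two strings, in [0, 1]."""
    grams_a, grams_b = ngrams(a, n), ngrams(b, n)
    if not grams_a and not grams_b:
        # Two strings too short to yield any n-gram are treated as identical.
        return 1.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)


if __name__ == "__main__":
    # A typo keeps part of the n-grams; an unrelated city name shares none.
    print(ngram_similarity("London", "Londn"))
    print(ngram_similarity("London", "Paris"))
```

A score of this kind lets a typo such as "Londn" stay close to "London" instead of being treated as an unrelated category, which is the intuition behind encoding dirty categories by string similarity.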
What dirty_cat does not do
Semantic similarities are currently not supported. For example, the similarity between car and automobile is outside the reach of the methods implemented here.
This kind of problem is tackled by Natural Language Processing methods.
dirty_cat can still help with handling typos and variations in this kind of setting.
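The limitation described above can be seen with any purely string-based measure. The sketch below uses `difflib.SequenceMatcher` from the Python standard library as a stand-in, which is an assumption made for illustration and not what dirty_cat uses internally; `string_similarity` is a hypothetical helper name.

```python
from difflib import SequenceMatcher


def string_similarity(a, b):
    """Character-level similarity ratio in [0, 1], based on matching blocks."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


if __name__ == "__main__":
    # Synonyms with different spellings look unrelated to a string measure.
    print(string_similarity("car", "automobile"))
    # A misspelling of the same word stays very close.
    print(string_similarity("automobile", "automobille"))
```

The synonym pair scores low because the two words share almost no characters, while the misspelled pair scores high. Capturing the first kind of relation requires semantic methods; the second kind is what string-based encoders are suited for.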
Installation
dirty_cat can be easily installed via pip:
pip install dirty_cat
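After installing, one way to confirm that the package is visible to the current Python environment is to query its installed metadata. This is a small sketch using only the standard library; the helper name `installed_version` is hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package_name):
    """Return the installed version string of a package, or None if absent."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_version("dirty_cat")
    if found is None:
        print("dirty_cat is not installed in this environment")
    else:
        print(f"dirty_cat {found} is installed")
```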
Dependencies
Dependencies and minimal versions are listed in the setup file.
Contributing
If you want to encourage development of dirty_cat, the best thing to do is to spread the word!
If you encounter an issue while using dirty_cat, please open an issue and/or submit a pull request. Don’t hesitate, you’re helping to make this project better for everyone!
Additional resources
References