Prepping tables for machine learning
skrub (formerly dirty_cat) is a Python library that facilitates prepping your tables for machine learning.
If you like the package, spread the word and ⭐ this repository! You can also join the Discord server.
What can skrub do?
skrub provides data assembling tools (TableVectorizer, fuzzy_join…) and encoders (GapEncoder, MinHashEncoder…) that capture morphological similarities between strings. We usually identify three common cases of such similarities: similarities, typos and variations.
See our examples.
What skrub cannot do
Semantic similarities are currently not supported. For example, the similarity between car and automobile is outside the reach of the methods implemented here.
This kind of problem is tackled by Natural Language Processing methods.
skrub can still help with handling typos and variations in this kind of setting.
For a detailed description of the problem of encoding dirty categorical data, see Similarity encoding for learning with dirty categorical variables [1] and Encoding high-cardinality string categorical variables [2].
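To make the distinction between morphological and semantic similarity concrete, here is a minimal sketch that compares strings by the character 3-grams they share (Jaccard overlap). This is an illustration only, not skrub's implementation; the helper names char_ngrams and ngram_similarity are hypothetical and do not exist in the library.

```python
from __future__ import annotations


def char_ngrams(text: str, n: int = 3) -> set[str]:
    """Return the set of character n-grams of a lowercased, space-padded string."""
    padded = f" {text.lower()} "
    return {padded[i : i + n] for i in range(len(padded) - n + 1)}


def ngram_similarity(a: str, b: str, n: int = 3) -> float:
    """Jaccard overlap between the n-gram sets of two strings (0.0 to 1.0)."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    union = grams_a | grams_b
    if not union:
        # Both strings are too short to produce any n-gram.
        return 0.0
    return len(grams_a & grams_b) / len(union)


# A typo keeps most n-grams, so the two strings remain measurably similar.
typo_score = ngram_similarity("London", "Londn")

# Synonyms share no n-grams here, so this purely morphological measure
# cannot see that they mean the same thing.
synonym_score = ngram_similarity("car", "automobile")

print(f"London / Londn:   {typo_score:.3f}")
print(f"car / automobile: {synonym_score:.3f}")
```

The typo pair scores well above zero while the synonym pair scores zero, which is the kind of gap described above: string-based methods can handle typos and variations, but not meaning.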
Installation
The easiest way to install skrub is via pip:
pip install skrub -U
or conda:
conda install -c conda-forge skrub
The documentation includes more detailed installation instructions.
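After installing, one way to confirm which version is available in the current environment is to query the package metadata with the Python standard library. This is a small sketch; the helper name installed_version is hypothetical and not part of skrub.

```python
from __future__ import annotations

from importlib.metadata import PackageNotFoundError, version


def installed_version(package: str) -> str | None:
    """Return the installed version of a package, or None if it is not installed."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


found = installed_version("skrub")
if found is None:
    print("skrub is not installed in this environment")
else:
    print(f"skrub {found} is installed")
```

Run it with the same interpreter you used for pip or conda, since each environment keeps its own set of installed packages.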
Dependencies
Dependencies and minimal versions are listed in the setup file.
Contributing
The best way to support the development of skrub is to spread the word!
Also, if you are already a skrub user, we would love to hear about your use cases and challenges in the Discussions section.
To report a bug or suggest enhancements, please open an issue and/or submit a pull request.
Additional resources
References
File details
Details for the file skrub-0.2.0.tar.gz.
File metadata
- Download URL: skrub-0.2.0.tar.gz
- Upload date:
- Size: 5.9 MB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.1.1 CPython/3.11.8
File hashes
Algorithm | Hash digest
---|---
SHA256 | 851454ee55a6b3108a6d737825cdc061ee22e751647067e3bebcc6441c9a6b26
MD5 | 0c4ed1adcfc58123d76a374658435090
BLAKE2b-256 | ac9a98f3db33f6a2a62e403dc9d594ab0aa7f3bdeaa7a04880f09b7ff336a412
File details
Details for the file skrub-0.2.0-py3-none-any.whl.
File metadata
- Download URL: skrub-0.2.0-py3-none-any.whl
- Upload date:
- Size: 207.6 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.1.1 CPython/3.11.8
File hashes
Algorithm | Hash digest
---|---
SHA256 | 7631396d96ae4a85d92a617bafcead30f663ced8567a9f20313e01a437ef42c3
MD5 | f4ee88b5bbe0790e62d7b4828e3151df
BLAKE2b-256 | 65584b8883d493955c13afbe5fc39e58d7f3fdd21017a24c6c53522408783a83