Prepping tables for machine learning

Project description

skrub (formerly dirty_cat) is a Python library that facilitates prepping your tables for machine learning.

If you like the package, spread the word and ⭐ this repository! You can also join the Discord server.

What can skrub do?

skrub provides data-assembling tools (TableVectorizer, fuzzy_join…) and encoders (GapEncoder, MinHashEncoder…) that handle morphological similarities between strings. We usually identify three common cases: similarities, typos and variations.
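
To make "morphological similarity" concrete, here is a minimal sketch that scores two strings by the overlap of their character 3-grams. It illustrates the general idea only and is not skrub's implementation; the helper names `char_ngrams` and `ngram_similarity` and the sample strings are hypothetical.

```python
def char_ngrams(text, n=3):
    """Return the set of character n-grams of a lowercased, space-padded string."""
    padded = f" {text.lower().strip()} "
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def ngram_similarity(a, b, n=3):
    """Jaccard similarity between the character n-gram sets of two strings."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    if not grams_a and not grams_b:
        return 1.0  # two empty strings are treated as identical
    return len(grams_a & grams_b) / len(grams_a | grams_b)


# Hypothetical examples of the cases named above.
typo = ngram_similarity("London", "Londno")                         # typo
variation = ngram_similarity("police officer", "Police Officer II")  # variation
unrelated = ngram_similarity("London", "Paris")                      # no shared n-grams
```

Strings that differ by a typo or a suffix keep most of their n-grams and score well above unrelated strings, which is what lets an encoder place them close together.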

See our examples.

What skrub cannot do

Semantic similarities are currently not supported. For example, the similarity between car and automobile is outside the reach of the methods implemented here.

This kind of problem is tackled by Natural Language Processing methods.

skrub can still help with handling typos and variations in this kind of setting.
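
The limitation can be seen with a purely string-based matcher. The sketch below uses the standard library's `difflib` rather than skrub, and the vocabulary and the `closest` helper are hypothetical: a typo is matched to its intended entry, while a synonym with no shared spelling is not.

```python
import difflib

# Hypothetical reference vocabulary.
vocabulary = ["automobile", "bicycle", "motorcycle"]


def closest(word, candidates, cutoff=0.6):
    """Return the best string-similarity match above the cutoff, or None."""
    matches = difflib.get_close_matches(word, candidates, n=1, cutoff=cutoff)
    return matches[0] if matches else None


typo_match = closest("automobil", vocabulary)  # spelling is close to "automobile"
synonym_match = closest("car", vocabulary)     # same meaning, unrelated spelling
```

Here `typo_match` resolves to "automobile" and `synonym_match` is `None`: linking "car" to "automobile" requires knowledge of meaning, which string similarity alone does not provide.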

For a detailed description of the problem of encoding dirty categorical data, see Similarity encoding for learning with dirty categorical variables [1] and Encoding high-cardinality string categorical variables [2].

Installation

The easiest way to install skrub is via pip:

pip install skrub -U

or conda:

conda install -c conda-forge skrub

The documentation includes more detailed installation instructions.

Dependencies

Dependencies and minimal versions are listed in the setup file.

Contributing

The best way to support the development of skrub is to spread the word!

Also, if you are already a skrub user, we would love to hear about your use cases and challenges in the Discussions section.

To report a bug or suggest enhancements, please open an issue and/or submit a pull request.

Additional resources

References

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

skrub-0.2.0rc1.tar.gz (5.9 MB view details)

Uploaded Source

Built Distribution

skrub-0.2.0rc1-py3-none-any.whl (207.6 kB view details)

Uploaded Python 3

File details

Details for the file skrub-0.2.0rc1.tar.gz.

File metadata

  • Download URL: skrub-0.2.0rc1.tar.gz
  • Upload date:
  • Size: 5.9 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.1 CPython/3.11.8

File hashes

Hashes for skrub-0.2.0rc1.tar.gz
Algorithm    Hash digest
SHA256       75cf93a30a20db0bfa5edbde07306bb5d47319960c6679fa7a2beeb0445c2c8d
MD5          41fea382201968a86a63833e3b64370c
BLAKE2b-256  15e689ad39408c1b2903dc539d3cda79d68489da2f5b1f0bc6da321e1b8ce53e

See more details on using hashes here.
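
As a minimal sketch of how a listed digest could be checked after a manual download, the snippet below hashes a local file with the standard library and compares the result to the SHA256 value shown for the source distribution. The function names and the idea of passing a local path are assumptions for illustration.

```python
import hashlib

# SHA256 digest listed above for skrub-0.2.0rc1.tar.gz.
EXPECTED_SDIST_SHA256 = (
    "75cf93a30a20db0bfa5edbde07306bb5d47319960c6679fa7a2beeb0445c2c8d"
)


def sha256_of_file(path, chunk_size=1 << 20):
    """Compute the SHA256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected=EXPECTED_SDIST_SHA256):
    """Return True if the file at `path` matches the expected digest."""
    return sha256_of_file(path) == expected
```

Calling `verify_download` with the path to the downloaded archive returns `False` if the file was truncated or altered.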

File details

Details for the file skrub-0.2.0rc1-py3-none-any.whl.

File metadata

  • Download URL: skrub-0.2.0rc1-py3-none-any.whl
  • Upload date:
  • Size: 207.6 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.1 CPython/3.11.8

File hashes

Hashes for skrub-0.2.0rc1-py3-none-any.whl
Algorithm    Hash digest
SHA256       b50be428c83659d4416a8ad79a1bb41665cda080e71a6d527ebdd6d55c5182e4
MD5          af83a3ca6de74d53c3237dd26dc59677
BLAKE2b-256  eb6d0e78d028591bedd9580e49ae1060d9faf7fb4503e1db54227616db8f359d

