Tools for cleaning pandas DataFrames
Project description
pyjanitor is a Python implementation of the R package janitor, and provides a clean API for cleaning data.
Quick start
- Installation: conda install -c conda-forge pyjanitor. Read more installation instructions here.
- Check out the collection of general functions.
Why janitor?
Originally a port of the R package, pyjanitor has evolved from a set of convenient data cleaning routines into an experiment with the method chaining paradigm.
Data preprocessing usually consists of a series of steps that transform raw data into an understandable, usable format. These steps need to run in a certain sequence to succeed. We take a base data file as the starting point and perform actions on it, such as removing null or empty rows, replacing null values with other values, adding, renaming, or removing columns, filtering rows, and others. More formally, these steps, along with their relationships and dependencies, are commonly referred to as a Directed Acyclic Graph (DAG).
The pandas API has been invaluable for the Python data science ecosystem, and implements method chaining of a subset of methods as part of the API. For example, resetting indexes (.reset_index()), dropping null values (.dropna()), and more, are accomplished via the appropriate pd.DataFrame method calls.
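As a minimal sketch of the built-in chaining described above, the following uses only pandas. The DataFrame contents and the variable names raw and cleaned are made up for illustration.

```python
import numpy as np
import pandas as pd

# Illustrative data: two of the four rows contain a null value.
raw = pd.DataFrame(
    {
        "name": ["a", None, "c", "d"],
        "score": [1.0, 2.0, np.nan, 4.0],
    }
)

# Each method returns a new DataFrame, so the calls can be chained.
cleaned = (
    raw
    .dropna()                # drop rows that contain any null value
    .reset_index(drop=True)  # renumber the remaining rows from 0
)

print(cleaned)
```

Because neither call modifies raw in place, the original data stays available for inspection after the chain has run.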
Inspired by the ease of use and expressiveness of the dplyr package of the R statistical language ecosystem, we have evolved pyjanitor into a language for expressing the data processing DAG for pandas users.
Installation
pyjanitor is currently installable from PyPI:
pip install pyjanitor
pyjanitor can also be installed with the conda package manager:
conda install pyjanitor -c conda-forge
pyjanitor can be installed with the pipenv environment manager too. This requires enabling prerelease dependencies:
pipenv install --pre pyjanitor
pyjanitor requires Python 3.6+.
Functionality
Current functionality includes:
- Cleaning column names (multi-indexes are possible!)
- Removing empty rows and columns
- Identifying duplicate entries
- Encoding columns as categorical
- Splitting your data into features and targets (for machine learning)
- Adding, removing, and renaming columns
- Coalescing multiple columns into a single column
- Date conversions (from MATLAB, Excel, Unix) to Python datetime format
- Expanding a single column that has delimited, categorical values into dummy-encoded variables
- Concatenating and deconcatenating columns, based on a delimiter
- Syntactic sugar for filtering the dataframe based on queries on a column
- Experimental submodules for finance, biology, chemistry, engineering, and pyspark
Project details
File details
Details for the file pyjanitor-0.25.0.tar.gz.
File metadata
- Download URL: pyjanitor-0.25.0.tar.gz
- Upload date:
- Size: 157.3 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.9.17
File hashes
Algorithm | Hash digest
---|---
SHA256 | c051976896aed3bd2ac7c5d0eb7f3607a45af923266eeb95edd6824610bd4b74
MD5 | 9850bfb2470c2712b98a242ccd7f39ae
BLAKE2b-256 | 39e71701fece0ffca29b1a0e4a3853f50ee934879726ffdb7f0d576c62104f2f
File details
Details for the file pyjanitor-0.25.0-py3-none-any.whl.
File metadata
- Download URL: pyjanitor-0.25.0-py3-none-any.whl
- Upload date:
- Size: 171.3 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.9.17
File hashes
Algorithm | Hash digest
---|---
SHA256 | 77f8ebe47e26a62ab58b0d46b7448093c3d5824aa29c9b6bbc69662b685f98d8
MD5 | 92936a09e96f20700e05feacb06012a9
BLAKE2b-256 | a8ae0a0287a4bbc8be4158c25d86a096feff13e8ae1564e9fcdb3c4cb997b712