Tools for cleaning pandas DataFrames

Project description

pyjanitor is a Python implementation of the R package janitor, and provides a clean API for cleaning data.

Quick start

  • Installation: conda install -c conda-forge pyjanitor. See the Installation section for pip and pipenv alternatives.
  • Check out the collection of general functions.

Why janitor?

Originally a port of the R package, pyjanitor has evolved from a set of convenient data cleaning routines into an experiment with the method chaining paradigm.

Data preprocessing usually consists of a series of steps that transform raw data into an understandable, usable format. These steps must run in a particular order to succeed. We take a base data file as the starting point and perform actions on it, such as removing null/empty rows, replacing null values with other values, adding/renaming/removing columns of data, filtering rows, and others. More formally, these steps, together with their relationships and dependencies, are commonly represented as a Directed Acyclic Graph (DAG).

The pandas API has been invaluable for the Python data science ecosystem, and it supports method chaining for a subset of its methods. For example, resetting indexes (.reset_index()), dropping null values (.dropna()), and more are accomplished via the appropriate pd.DataFrame method calls.
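As a minimal sketch of the built-in chaining described above, the snippet below drops rows with null values and then resets the index in a single expression. The DataFrame contents and the names df and chained are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two of the four rows contain a null value.
df = pd.DataFrame(
    {
        "name": ["ann", "bob", None, "dee"],
        "score": [90.0, np.nan, 70.0, 85.0],
    },
    index=[10, 11, 12, 13],
)

# Each method returns a new DataFrame, so the calls can be chained.
chained = (
    df
    .dropna()                 # drop rows that contain any null value
    .reset_index(drop=True)   # renumber the remaining rows from 0
)

print(chained)
```

Wrapping the chain in parentheses lets each step sit on its own line, which keeps the order of operations easy to read.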

Inspired by the ease-of-use and expressiveness of the dplyr package of the R statistical language ecosystem, we have evolved pyjanitor into a language for expressing the data processing DAG for pandas users.

Installation

pyjanitor is currently installable from PyPI:

pip install pyjanitor

pyjanitor can also be installed with the conda package manager:

conda install pyjanitor -c conda-forge

pyjanitor can be installed by the pipenv environment manager too. This requires enabling prerelease dependencies:

pipenv install --pre pyjanitor

pyjanitor requires Python 3.6+.

Functionality

Current functionality includes:

  • Cleaning column names (multi-indexes are possible!)
  • Removing empty rows and columns
  • Identifying duplicate entries
  • Encoding columns as categorical
  • Splitting your data into features and targets (for machine learning)
  • Adding, removing, and renaming columns
  • Coalescing multiple columns into a single column
  • Date conversions (from MATLAB, Excel, Unix) to Python datetime format
  • Expanding a single column that has delimited, categorical values into dummy-encoded variables
  • Concatenating and deconcatenating columns, based on a delimiter
  • Syntactic sugar for filtering the dataframe based on queries on a column
  • Experimental submodules for finance, biology, chemistry, engineering, and pyspark

Download files

Source Distribution

pyjanitor-0.29.0.tar.gz (204.9 kB)

  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.1 CPython/3.9.20
  • SHA256: b4ebf8dfbd2c50cadff81be7b909f1eaa649b21a084876b3a4d84682fc452790
  • MD5: c7a18c75b4ceaf639ba1bbf8251eab83
  • BLAKE2b-256: ec40ad4e17a3d4d4c5cc5be7934958548a229235579bb0b3a84fbd4bfa7a2e61

Built Distribution

pyjanitor-0.29.0-py3-none-any.whl (204.2 kB)

  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.1 CPython/3.9.20
  • SHA256: ff2835cf8b309a190a7097c79d9b6018650a2b7d9816c53994872115e822175f
  • MD5: 092f5ad7e3a7c0dce71b92632198f185
  • BLAKE2b-256: 0c09011c23ea8b39b6b787b4b617bd0ee0e3dc70f7e36ba5454f8dc6c213360e
