datapatch
A Python library for defining rule-based overrides on messy data. Imagine, for example,
importing a dataset in which each row is associated with a country, and the country names
have been entered by humans. You might find values like Northkorea or Greet Britain
that you want to normalise. datapatch provides a mechanism for building a flexible lookup
table (usually stored as a YAML file) to catch and repair these data issues.
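To make the idea of a rule-based lookup table concrete, here is a minimal sketch in plain Python. It does not use datapatch and is not how the library is implemented; the `RULES` list and the `patch` function are hypothetical names chosen for illustration, loosely modelled on the `match` and `contains` keys used in the YAML example.

```python
from typing import Optional

# Hypothetical rule table: each rule has either a list of exact matches
# or a substring to look for, plus the value to substitute.
RULES = [
    {"match": ["Northkorea", "NKorea", "DPRK"], "value": "North Korea"},
    {"contains": "Britain", "value": "Great Britain"},
]


def patch(raw: str, default: Optional[str] = None) -> Optional[str]:
    """Return the value of the first rule that applies to `raw`, else `default`."""
    needle = raw.strip().lower()
    for rule in RULES:
        # Exact (case-insensitive) comparison against every listed spelling
        exact = [m.lower() for m in rule.get("match", [])]
        if needle in exact:
            return rule["value"]
        # Substring comparison for "contains" rules
        fragment = rule.get("contains")
        if fragment is not None and fragment.lower() in needle:
            return rule["value"]
    return default


print(patch("Northkorea"))                # North Korea
print(patch("Greet Britain"))             # Great Britain
print(patch("France", default="France"))  # France
```

Keeping the rules as data rather than as code is what allows them to live in a separate file that can be edited without touching the import script.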
Installation
You can install datapatch from the Python package index:
    pip install datapatch
Example
Given a YAML file like this:
    countries:
      normalize: true
      lowercase: true
      asciify: true
      options:
        - match: Frankreich
          value: France
        - match:
            - Northkorea
            - Nordkorea
            - Northern Korea
            - NKorea
            - DPRK
          value: North Korea
        - contains: Britain
          value: Great Britain
The file can be used to apply the data patches against raw input:
    from datapatch import read_lookups, LookupException
    lookups = read_lookups("countries.yml")
    countries = lookups.get("countries")
    # This will apply the patch or default to the original string if none exists:
    for row in iter_data():
        raw = row.get("Country")
        row["Country"] = countries.get_value(raw, default=raw)
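The snippet above calls `iter_data()` without defining it. As a sketch of how the loop might be wired up end to end, the example below supplies a hypothetical `iter_data` that reads CSV rows, and an `apply_patches` helper (also hypothetical) that accepts any callable with the same `get_value(raw, default=...)` call shape. A small stub stands in for the lookup so that the example runs without datapatch or a YAML file.

```python
import csv
import io
from typing import Callable, Dict, Iterable, Iterator, Optional

# Stand-in for a real input file; in practice this would be an open CSV file.
SAMPLE_CSV = "Name,Country\nAlice,Frankreich\nBob,Canada\n"


def iter_data(text: str = SAMPLE_CSV) -> Iterator[Dict[str, str]]:
    """Hypothetical row source: yield each CSV row as a dict."""
    yield from csv.DictReader(io.StringIO(text))


def apply_patches(
    rows: Iterable[Dict[str, str]],
    get_value: Callable[..., Optional[str]],
) -> Iterator[Dict[str, str]]:
    """Rewrite the "Country" field of each row, keeping unknown values as-is."""
    for row in rows:
        raw = row.get("Country")
        row["Country"] = get_value(raw, default=raw)
        yield row


def stub_get_value(raw: Optional[str], default: Optional[str] = None) -> Optional[str]:
    """Stub lookup with a single rule, used here instead of a real lookup."""
    return {"Frankreich": "France"}.get(raw, default)


patched = list(apply_patches(iter_data(), stub_get_value))
print([row["Country"] for row in patched])  # ['France', 'Canada']
```

With the library installed, `countries.get_value` from the example above could presumably be passed in place of `stub_get_value`.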
Extended options
There's a host of options available to configure the application of the data patches:
    countries:
      # If you mark a lookup as required, a value that matches no options will
      # throw a `datapatch.exc:LookupException`.
      required: true
      # Normalisation will remove many special characters and collapse multiple spaces.
      normalize: false
      # By default, normalize performs transliteration across alphabets (Путин -> Putin);
      # set asciify to false if you want to keep non-ASCII alphabets as is.
      asciify: false
      options:
        - match: Francois
          value: France
      # This is a shorthand for defining options that have just one `match` and
      # one `value` defined:
      map:
        Luxemborg: Luxembourg
        Lux: Luxembourg
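To give a feel for what normalising, lowercasing and asciifying a value means before it is compared against the rules, here is a rough standard-library approximation. It is an assumption-laden sketch, not datapatch's implementation: `normalize_key` is a hypothetical helper, and its `asciify` step only strips accents from Latin letters, whereas the transliteration described above also converts other alphabets (Путин -> Putin).

```python
import re
import unicodedata


def normalize_key(text: str, lowercase: bool = True, asciify: bool = True) -> str:
    """Approximate a normalised comparison key for a raw input string."""
    if asciify:
        # Decompose accented letters and drop the combining marks (ç -> c).
        # This does NOT transliterate non-Latin alphabets.
        decomposed = unicodedata.normalize("NFKD", text)
        text = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    if lowercase:
        text = text.lower()
    # Replace punctuation (anything other than word characters and
    # whitespace) with a space.
    text = re.sub(r"[^\w\s]", " ", text)
    # Collapse runs of whitespace and trim both ends.
    return " ".join(text.split())


print(normalize_key("  François  "))     # francois
print(normalize_key("North-Korea!!"))    # north korea
print(normalize_key("Côte   d'Ivoire"))  # cote d ivoire
```

Comparing keys produced this way, rather than raw strings, is why a rule written as `Francois` can plausibly also catch `FRANÇOIS` when these options are switched on.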
Result objects
You can also have more details associated with a result and access them:
    countries:
      options:
        - match: Frankreich
          # These can be arbitrary attributes:
          label: France
          code: FR
This can be accessed as a result object with attributes:
    from datapatch import read_lookups, LookupException
    lookups = read_lookups("countries.yml")
    countries = lookups.get("countries")
    result = countries.match("Frankreich")
    print(result.label, result.code)
    assert result.capital is None, result.capital
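The behaviour shown above, where defined attributes are returned and an undefined one such as `capital` evaluates to `None`, can be sketched with a small class. This `Result` is an illustrative stand-in written for this example, not the library's own class, and how datapatch implements its result objects is not claimed here.

```python
from typing import Any, Dict


class Result:
    """Illustrative stand-in for a result object with arbitrary attributes."""

    def __init__(self, attributes: Dict[str, Any]) -> None:
        self._attributes = dict(attributes)

    def __getattr__(self, name: str) -> Any:
        # __getattr__ is only called when normal attribute lookup fails.
        # Private names raise as usual, which also avoids infinite recursion
        # if `_attributes` has not been set yet.
        if name.startswith("_"):
            raise AttributeError(name)
        # Unknown attributes yield None instead of raising.
        return self._attributes.get(name)


result = Result({"label": "France", "code": "FR"})
print(result.label, result.code)  # France FR
assert result.capital is None, result.capital
```

Returning `None` for unknown attributes keeps calling code short, at the cost of hiding typos in attribute names, so a stricter variant might raise `AttributeError` instead.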
License
datapatch is licensed under the terms of the MIT license, which is included as LICENSE.