datapatch

A Python library for defining rule-based overrides on messy data. Imagine, for example, importing a dataset in which each row is associated with a country, and the country names have been entered by humans. You might find names like Northkorea or Greet Britain that you want to normalise. datapatch provides a mechanism for building a flexible lookup table (usually stored as a YAML file) to catch and repair these data issues.

Installation

You can install datapatch from the Python package index:

pip install datapatch

Example

Given a YAML file like this:

countries:
  normalize: true
  lowercase: true
  asciify: true
  options:
    - match: Frankreich
      value: France
    - match:
        - Northkorea
        - Nordkorea
        - Northern Korea
        - NKorea
        - DPRK
      value: North Korea
    - contains: Britain
      value: Great Britain

The file can be used to apply the data patches against raw input:

from datapatch import read_lookups, LookupException

lookups = read_lookups("countries.yml")
countries = lookups.get("countries")

# This will apply the patch, or fall back to the original string if none exists.
# iter_data() stands in for whatever yields your rows as dictionaries:
for row in iter_data():
    raw = row.get("Country")
    row["Country"] = countries.get_value(raw, default=raw)
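
To make the matching rules above more concrete, here is a rough standard-library sketch of what a lookup with `match` and `contains` options does. It is not datapatch's implementation: the helper names `normalize` and `get_value`, the `OPTIONS` list, and the assumption that `contains` is a substring test on the normalised text are all illustrative, and the accent stripping shown here is much cruder than real transliteration.

```python
import re
import unicodedata


def normalize(text, lowercase=True, asciify=True):
    """Rough stand-in for the normalisation step (hypothetical helper)."""
    if asciify:
        # Strip accents only; transliteration across alphabets is not attempted.
        text = unicodedata.normalize("NFKD", text)
        text = text.encode("ascii", "ignore").decode("ascii")
    if lowercase:
        text = text.lower()
    # Replace punctuation with spaces, then collapse runs of whitespace.
    text = re.sub(r"[^\w\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()


# The same rules as the YAML example, written as plain Python data.
OPTIONS = [
    {"match": ["Frankreich"], "value": "France"},
    {
        "match": ["Northkorea", "Nordkorea", "Northern Korea", "NKorea", "DPRK"],
        "value": "North Korea",
    },
    {"contains": ["Britain"], "value": "Great Britain"},
]


def get_value(raw, default=None):
    """Return the patched value for raw, or default when no rule applies."""
    needle = normalize(raw)
    for option in OPTIONS:
        # `match` compares the whole normalised string.
        if needle in {normalize(m) for m in option.get("match", [])}:
            return option["value"]
        # `contains` is assumed to be a substring test on the normalised string.
        if any(normalize(c) in needle for c in option.get("contains", [])):
            return option["value"]
    return default


for raw in ["  NORTHERN   korea ", "Greet Britain", "Germany"]:
    print(repr(raw), "->", get_value(raw, default=raw))
```

Because both the rule and the input go through the same normalisation, differences in case and spacing do not need their own entries in the lookup table.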

Extended options

There's a host of options available to configure the application of the data patches:

countries:
  # If you mark a lookup as required, a value that matches no options will
  # throw a `datapatch.exc:LookupException`.
  required: true
  # Normalisation will remove many special characters and collapse multiple spaces
  normalize: false
  # By default, normalize performs transliteration across alphabets (Путин -> Putin).
  # Set asciify to false if you want to keep non-ASCII alphabets as they are.
  asciify: false
  options:
    - match: Francois
      value: France
  # This is a shorthand for defining options that have just one `match` and
  # one `value` defined:
  map:
    Luxemborg: Luxembourg
    Lux: Luxembourg
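
As an illustration of two of these options, the sketch below shows how a `map` block can be read as shorthand for a list of `match`/`value` options, and how a `required` lookup differs from one with a default. It uses only plain Python; `NoMatch`, `expand_map` and `lookup` are hypothetical names chosen for this example and are not part of datapatch.

```python
class NoMatch(Exception):
    """Hypothetical stand-in for datapatch's LookupException."""


def expand_map(mapping):
    """Turn a `map` block into the equivalent list of match/value options."""
    return [{"match": key, "value": value} for key, value in mapping.items()]


def lookup(options, raw, required=False, default=None):
    """Return the value of the first option matching raw exactly."""
    for option in options:
        if raw == option["match"]:
            return option["value"]
    if required:
        # A required lookup refuses to pass unmatched values through silently.
        raise NoMatch(f"no option matches {raw!r}")
    return default


options = [{"match": "Francois", "value": "France"}]
options += expand_map({"Luxemborg": "Luxembourg", "Lux": "Luxembourg"})

print(lookup(options, "Lux"))
print(lookup(options, "Spain", default="Spain"))
```

Marking a lookup as required is useful when every value is expected to be covered by the table, since an unmatched input then surfaces as an error instead of flowing into the output unchanged.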

Result objects

An option can also carry additional details, which you can read from the result:

countries:
  options:
    - match: Frankreich
      # These can be arbitrary attributes:
      label: France
      code: FR

This can be accessed as a result object with attributes:

from datapatch import read_lookups, LookupException

lookups = read_lookups("countries.yml")
countries = lookups.get("countries")

result = countries.match("Frankreich")
print(result.label, result.code)
assert result.capital is None, result.capital
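
The assertion above relies on attributes that were never defined in the YAML reading as `None` rather than raising an error. One way such behaviour could be modelled in plain Python is sketched below; the `Result` class here is a hypothetical illustration and says nothing about how datapatch builds its own result objects.

```python
class Result:
    """Hypothetical result object exposing arbitrary option attributes."""

    def __init__(self, **attributes):
        self._attributes = attributes

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails.
        if name.startswith("_"):
            # Avoid recursion and keep private/dunder lookups strict.
            raise AttributeError(name)
        # Unknown attributes read as None instead of raising.
        return self._attributes.get(name)


result = Result(label="France", code="FR")
print(result.label, result.code)
assert result.capital is None, result.capital
```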

License

datapatch is licensed under the terms of the MIT license, which is included as LICENSE.
