
Wrapper for Great Expectations to fit the requirements of the Gemeente Amsterdam.


About dq-suite-amsterdam

This repository aims to be an easy-to-use wrapper for the data quality library Great Expectations (GX). All that is needed to get started is an in-memory Spark dataframe and a set of data quality rules - specified in a JSON file of a particular format.

While the results of all validations are written to a data_quality schema in Unity Catalog, users can also choose to get notified via Slack or Microsoft Teams.

DISCLAIMER: The package is in MVP phase, so watch your step.

How to contribute

Want to help out? Great! Feel free to create a pull request addressing one of the open issues. Some notes for developers are located here.

Found a bug, or need a new feature? Add a new issue describing what you need.

Getting started

Following GX, we recommend installing dq-suite-amsterdam in a virtual environment. This could be locally via your IDE, on your compute via a notebook in Databricks, or as part of a workflow.

  1. Run the following command:
pip install dq-suite-amsterdam
  2. Create the data_quality schema (and the tables all results will be written to) by running the SQL notebook located here. All it needs is the name of the catalog - and the rights to create a schema within that catalog :)
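
Conceptually, the schema creation boils down to something like the line below - a sketch only, since the actual notebook also creates the result tables, and my_catalog is a placeholder for your own catalog name:

spark.sql("CREATE SCHEMA IF NOT EXISTS my_catalog.data_quality")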

  3. Get ready to validate your first table. To do so, define the following (see the sketch after this list):

  • dq_rule_json_path as a path to a JSON file, formatted in this way (a rough sketch of the format is shown below, after the validation example)
  • df as a Spark dataframe containing the table that needs to be validated (e.g. via spark.read.csv or spark.read.table)
  • spark as a SparkSession object (in Databricks notebooks, this is by default called spark)
  • catalog_name as the name of your catalog ('dpxx_dev' or 'dpxx_prd')
  • table_name as the name of the table for which a data quality check is required. This name should also occur in the JSON file at dq_rule_json_path
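
For example, the definitions could look like this - a minimal sketch in which the file path, schema name, and table name are placeholders:

from pyspark.sql import SparkSession

# Placeholders - replace with your own path and names
dq_rule_json_path = "/Workspace/path/to/dq_rules.json"
catalog_name = "dpxx_dev"
table_name = "my_table"  # must also occur in the JSON file at dq_rule_json_path

# In Databricks notebooks, 'spark' is already defined; shown here for completeness
spark = SparkSession.builder.getOrCreate()
df = spark.read.table(f"{catalog_name}.my_schema.{table_name}")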
  4. Finally, perform the validation by running
from dq_suite.validation import run_validation

run_validation(
    json_path=dq_rule_json_path,
    df=df, 
    spark_session=spark,
    catalog_name=catalog_name,
    table_name=table_name,
    validation_name="my_validation_name",
)

See the documentation of dq_suite.validation.run_validation for what other parameters can be passed.
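
For illustration, the rules file at dq_rule_json_path has roughly the shape sketched below. Treat this as a sketch rather than the authoritative format (that is the example linked in step 3): the dataset, table, and column names are placeholders, and rule names follow Great Expectations' expectation names.

{
    "dataset": {
        "name": "my_dataset",
        "layer": "bronze"
    },
    "tables": [
        {
            "unique_identifier": "id",
            "table_name": "my_table",
            "rules": [
                {
                    "rule_name": "ExpectColumnValuesToNotBeNull",
                    "parameters": {"column": "id"}
                }
            ]
        }
    ]
}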

Known exceptions / issues

  • The functions can run on Databricks using a Personal Compute Cluster or using a Job Cluster. Using a Shared Compute Cluster will result in an error, as it does not have the permissions that Great Expectations requires.

  • Since this project requires Python >= 3.10, the use of Databricks Runtime (DBR) >= 13.3 is needed. Older versions of DBR will result in errors upon installation of the dq-suite-amsterdam library.

  • At the time of writing (late Aug 2024), Great Expectations v1.0.0 has just been released, and is not (yet) compatible with Python 3.12. Hence, make sure you are using the correct version of Python as the interpreter for your project.

  • The run_time value is defined separately from Great Expectations in validation.py. We plan on fixing this when Great Expectations has documented how to access it from the RunIdentifier object.
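
Given the version constraints above, a quick guard at the top of a notebook fails fast on an unsupported interpreter - a minimal sketch:

import sys

# dq-suite-amsterdam requires Python >= 3.10; at time of writing, GX 1.0.0 does not yet support 3.12
assert (3, 10) <= sys.version_info[:2] < (3, 12), f"Unsupported Python version: {sys.version}"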

Updates

Version 0.1: Run a DQ check for a dataframe

Version 0.2: Run a DQ check for multiple dataframes

Version 0.3: Refactored I/O

Version 0.4: Added schema validation with Amsterdam Schema per table

Version 0.5: Export schema from Unity Catalog

Version 0.6: The results are written to tables in the "dataquality" schema

Version 0.7: Refactored the solution

Version 0.8: Implemented output historization

Version 0.9: Added dataset descriptions

Version 0.10: Switched to GX 1.0

Version 0.11: Stability and testability improvements
