Wrapper for Great Expectations to fit the requirements of the Gemeente Amsterdam.
Introduction
This repository contains functions that will ease the use of Great Expectations. Users can input data and data quality rules and get results in return.
DISCLAIMER: The package is in MVP phase
Getting started
Install the dq suite on your compute, for example by running the following code in your workspace:
pip install dq-suite-amsterdam
To validate your first table:
- define json_path as a path to a JSON file, similar to the one shown in dq_rules_example.json in this repo
- load the table requiring a data quality check into a PySpark dataframe df (e.g. via spark.read.csv or spark.read.table)
import dq_suite
validation_settings_obj = dq_suite.ValidationSettings(spark_session=spark, catalog_name="dpxx_dev",
table_name="showcase_table",
check_name="showcase_check")
dq_suite.run(json_path=json_path, df=df, validation_settings_obj=validation_settings_obj)
Looping over multiple dataframes may require redefining the json_path and validation_settings_obj variables for each dataframe.
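The loop described above can be sketched as follows. This is only an illustration under stated assumptions: the helper plan_checks and the convention of one rules file per table (named after the table) are hypothetical and not part of dq_suite. The dq_suite calls are shown as comments because they need a cluster with a SparkSession named spark.

```python
from pathlib import PurePosixPath


def plan_checks(rules_dir, table_names):
    """Pair each table name with the rules file assumed to belong to it.

    Hypothetical helper: assumes one JSON rules file per table,
    stored as <rules_dir>/<table_name>.json.
    """
    return [
        (name, str(PurePosixPath(rules_dir) / f"{name}.json"))
        for name in table_names
    ]


# On a cluster with dq_suite installed, the plan could drive the loop:
#
# import dq_suite
# for table_name, json_path in plan_checks("rules", ["showcase_table"]):
#     df = spark.read.table(table_name)
#     validation_settings_obj = dq_suite.ValidationSettings(
#         spark_session=spark,
#         catalog_name="dpxx_dev",
#         table_name=table_name,
#         check_name="showcase_check",
#     )
#     dq_suite.run(json_path=json_path, df=df,
#                  validation_settings_obj=validation_settings_obj)
```

Both json_path and validation_settings_obj are rebuilt inside the loop, so every dataframe is checked against its own rules and settings.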
Create data quality schema and tables (in respective catalog of data team)
For a first-time installation, create the data quality schema and tables using the notebook at the repo path scripts/data_quality_tables.sql:
- open the notebook and connect to a cluster
- select the catalog of the data team and execute the notebook. It checks whether the schema exists and creates it if it does not; the same applies to the tables.
Export the schema from Unity Catalog to the Input Form
In order to output the schema from Unity Catalog, use the following commands (using the required schema name):
schema_output = dq_suite.export_schema('schema_name', spark)
print(schema_output)
Copy the string to the Input Form to quickly ingest the schema in Excel.
Validate the schema of a table
It is possible to validate the schema of an entire table against a schema definition from Amsterdam Schema in one go. This is done by adding two fields to the "dq_rules" JSON when describing the table (see: https://github.com/Amsterdam/dq-suite-amsterdam/blob/main/dq_rules_example.json).
You will need:
- validate_table_schema: the id field of the table from Amsterdam Schema
- validate_table_schema_url: the url of the table or dataset from Amsterdam Schema
The schema definition is converted into column-level expectations (expect_column_values_to_be_of_type) at run time.
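To give a feel for this conversion, here is a minimal sketch. It is not the code dq_suite uses: the function schema_to_expectations, the TYPE_MAP mapping, and the shape of example_table (a JSON-Schema-style definition with column types under schema.properties) are all assumptions made for illustration.

```python
# Hypothetical mapping from JSON-Schema-style type names to Spark type names.
# An assumption for illustration, not the mapping used by dq_suite.
TYPE_MAP = {
    "string": "StringType",
    "integer": "IntegerType",
    "number": "DoubleType",
    "boolean": "BooleanType",
}


def schema_to_expectations(table_schema: dict) -> list:
    """Turn a table definition into one type expectation per column."""
    properties = table_schema.get("schema", {}).get("properties", {})
    expectations = []
    for column, definition in properties.items():
        spark_type = TYPE_MAP.get(definition.get("type"))
        if spark_type is None:
            continue  # skip types this sketch does not cover
        expectations.append(
            {
                "expectation_type": "expect_column_values_to_be_of_type",
                "kwargs": {"column": column, "type_": spark_type},
            }
        )
    return expectations


# Hypothetical table definition, loosely modelled on a JSON Schema document.
example_table = {
    "id": "showcase_table",
    "schema": {
        "properties": {
            "id": {"type": "integer"},
            "name": {"type": "string"},
        }
    },
}

expectations = schema_to_expectations(example_table)
```

Each entry in expectations names the column and the type it is expected to have, which is the information expect_column_values_to_be_of_type needs.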
Known exceptions
- The functions can run on Databricks using a Personal Compute Cluster or using a Job Cluster. Using a Shared Compute Cluster will result in an error, as it does not have the permissions that Great Expectations requires.
- Since this project requires Python >= 3.10, the use of Databricks Runtime (DBR) >= 13.3 is needed. Older versions of DBR will result in errors upon install of the dq-suite-amsterdam library.
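If you are unsure whether a cluster meets the Python requirement, a quick check before installing can help. The sketch below is a hypothetical pre-flight helper (python_is_supported is not part of dq_suite) that compares the interpreter version with the 3.10 minimum stated above.

```python
import sys

# Minimum Python version stated in this README.
MINIMUM_PYTHON = (3, 10)


def python_is_supported(version_info=None) -> bool:
    """Return True if the (major, minor) version meets the minimum.

    Defaults to the running interpreter when no version is passed.
    """
    if version_info is None:
        version_info = sys.version_info
    return tuple(version_info[:2]) >= MINIMUM_PYTHON


if not python_is_supported():
    print("Python >= 3.10 is required; consider a newer Databricks Runtime.")
```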
Contributing to this library
See the separate developers' readme.
Updates
Version 0.1: Run a DQ check for a dataframe
Version 0.2: Run a DQ check for multiple dataframes
Version 0.3: Refactored I/O
Version 0.4: Added schema validation with Amsterdam Schema per table
Version 0.5: Export schema from Unity Catalog
Version 0.6: The results are written to tables in the "dataquality" schema
Version 0.7: Refactored the solution