
Wrapper for Great Expectations to fit the requirements of the Gemeente Amsterdam.


Introduction

This repository contains functions that will ease the use of Great Expectations. Users can input data and data quality rules and get results in return.

DISCLAIMER: The package is in the MVP phase.

Getting started

Install the dq suite on your compute, for example by running the install command below in your workspace and then importing the package in Python:

pip install dq-suite-amsterdam
import dq_suite

Load your data into dataframes, give each one a table_name, and collect them in a list:

df = spark.read.csv(csv_path + file_name, header=True, inferSchema=True)  # example using CSV
df.table_name = "showcase_table"
dfs = [df]

Then run the check with the following inputs:

  • Define 'dfs' as a list of dataframes that require a dq check
  • Define 'dq_rules' as a JSON as shown in dq_rules_example.json in this repo
  • Define a name for your dq check, in this case "showcase"

dq_suite.df_check(dfs, dq_rules, "dpxx_dev", "showcase", spark)
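
The preparation steps above can be sketched as two small helpers. This is a minimal sketch and not part of dq_suite: the names tag_dataframes and load_rules are hypothetical. It only assumes that each dataframe object accepts a table_name attribute, as in the snippet above. This description does not say whether df_check expects the rules as raw JSON text or as a parsed object, so the second helper returns both.

```python
import json
from pathlib import Path


def tag_dataframes(named_dfs):
    """Attach a table_name attribute to each dataframe and return them as a list.

    `named_dfs` maps table names to dataframe objects (hypothetical helper).
    """
    dfs = []
    for table_name, df in named_dfs.items():
        df.table_name = table_name  # same attribute as in the example above
        dfs.append(df)
    return dfs


def load_rules(path):
    """Read a rules file and fail early if it is not valid JSON.

    Returns a (raw_text, parsed) pair (hypothetical helper).
    """
    raw_text = Path(path).read_text(encoding="utf-8")
    parsed = json.loads(raw_text)  # raises json.JSONDecodeError on malformed input
    return raw_text, parsed
```

Parsing the rules file before the check runs surfaces a malformed file as a plain JSON error instead of a failure deeper in the run.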

Create the dataquality schema and tables (in the data team's catalog)

For a first-time installation, create the data quality schema and tables by running the notebook at scripts/data_quality_tables.sql in this repo:

  • Open the notebook and connect to a cluster.
  • Select the catalog of the data team and execute the notebook. It checks whether the schema and the tables exist and creates any that are missing.
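
The "create only if missing" behaviour described above is what SQL's IF NOT EXISTS clause provides. The sketch below only builds such a statement as a string. It does not reproduce scripts/data_quality_tables.sql, whose contents are not shown here. The function name is hypothetical, and the default schema name "dataquality" is an assumption taken from the version notes. On a cluster, the resulting string could be passed to spark.sql(...).

```python
def create_schema_statement(catalog_name, schema_name="dataquality"):
    """Build an idempotent CREATE SCHEMA statement for the given catalog.

    Hypothetical helper; the default schema name is an assumption.
    """
    for part in (catalog_name, schema_name):
        # Reject anything that is not a plain identifier to avoid malformed SQL.
        if not part.isidentifier():
            raise ValueError(f"not a valid identifier: {part!r}")
    return f"CREATE SCHEMA IF NOT EXISTS {catalog_name}.{schema_name}"
```

Because the statement is idempotent, running it a second time against a catalog that already has the schema changes nothing.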

Export the schema from Unity Catalog to the Input Form

To output the schema from Unity Catalog, use the following commands, replacing 'schema_name' with the name of the required schema:

schema_output = dq_suite.export_schema('schema_name', spark)
print(schema_output)

Copy the string to the Input Form to quickly ingest the schema in Excel.

Validate the schema of a table

It is possible to validate the schema of an entire table against a schema definition from Amsterdam Schema in one go. This is done by adding two fields to the "dq_rules" JSON when describing the table (see: https://github.com/Amsterdam/dq-suite-amsterdam/blob/main/dq_rules_example.json).

You will need:

  • validate_table_schema: the id field of the table from Amsterdam Schema
  • validate_table_schema_url: the url of the table or dataset from Amsterdam Schema

The schema definition is converted into column-level expectations (expect_column_values_to_be_of_type) at runtime.
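
To illustrate the idea of this conversion, here is a minimal sketch that turns a table definition into one type expectation per column. It is not the library's implementation. The function name, the layout of the input (a "schema" object with "properties"), the example table, and the mapping from type names to Spark type names are all assumptions made for illustration. The output uses plain dictionaries that loosely follow the layout of a Great Expectations expectation configuration.

```python
# Assumed mapping from JSON-Schema-style type names to Spark type names.
ASSUMED_TYPE_MAP = {
    "string": "StringType",
    "integer": "IntegerType",
    "number": "DoubleType",
    "boolean": "BooleanType",
}


def schema_to_expectations(table_definition):
    """Turn a table definition into one type expectation per column.

    Hypothetical helper; the input layout is an assumption.
    """
    properties = table_definition["schema"]["properties"]
    expectations = []
    for column, spec in properties.items():
        spark_type = ASSUMED_TYPE_MAP.get(spec.get("type"))
        if spark_type is None:
            continue  # skip columns whose type this sketch does not cover
        expectations.append(
            {
                "expectation_type": "expect_column_values_to_be_of_type",
                "kwargs": {"column": column, "type_": spark_type},
            }
        )
    return expectations


# Made-up table definition, used only to exercise the function.
example_table = {
    "id": "showcase_table",
    "schema": {
        "properties": {
            "id": {"type": "integer"},
            "name": {"type": "string"},
            "geometry": {"type": "object"},  # not in the assumed mapping
        }
    },
}

expectations = schema_to_expectations(example_table)
```

In this sketch, columns with a type outside the assumed mapping are skipped silently. A real implementation might raise an error or log a warning instead.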

Known exceptions

  • The functions can run on Databricks using a Personal Compute Cluster or using a Job Cluster. Using a Shared Compute Cluster will result in an error, as it does not have the permissions that Great Expectations requires.

  • Since this project requires Python >= 3.10, Databricks Runtime (DBR) >= 13.3 is needed. Older versions of DBR will result in errors when installing the dq-suite-amsterdam library.
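
The Python requirement above can be checked before installing. The snippet below is a minimal sketch using only the standard library. The function name is hypothetical, and the check covers the interpreter version only, not the DBR version or the cluster type.

```python
import sys

MIN_PYTHON = (3, 10)  # minimum version stated above


def python_is_supported(version_info=None):
    """Return True if the given (or current) Python version is at least 3.10."""
    source = sys.version_info if version_info is None else version_info
    current = tuple(source)[:2]  # compare major and minor only
    return current >= MIN_PYTHON
```

Calling python_is_supported() with no argument checks the running interpreter. Passing a tuple such as (3, 9, 18) checks a specific version.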

Contributing to this library

See the separate developers' readme.

Updates

Version 0.1: Run a DQ check for a dataframe

Version 0.2: Run a DQ check for multiple dataframes

Version 0.3: Refactored I/O

Version 0.4: Added schema validation with Amsterdam Schema per table

Version 0.5: Export schema from Unity Catalog

Version 0.6: The results are written to tables in the "dataquality" schema
