
Wrapper for Great Expectations to fit the requirements of the Gemeente Amsterdam.

Project description

Introduction

This repository contains functions that will ease the use of Great Expectations. Users can input data and data quality rules and get results in return.

DISCLAIMER: This package is in the MVP phase.

Getting started

Install the dq suite on your compute, for example by running the following code in your workspace:

pip install dq-suite-amsterdam

Then import the package in your workspace:

import dq_suite

Load your data into dataframes, assign each a table_name, and collect all dataframes in a list:

df = spark.read.csv(csv_path + file_name, header=True, inferSchema=True)  # example using CSV
df.table_name = "showcase_table"
dfs = [df]
  • Define 'dfs' as a list of dataframes that require a dq check
  • Define 'dq_rules' as a JSON object, following dq_rules_example.json in this repo
  • Define a name for your dq check, in this case "showcase"
results, brontabel_df, bronattribute_df, dqRegel_df = dq_suite.df_check(dfs, dq_rules, "showcase")
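For illustration, 'dq_rules' might be built by parsing a JSON document like the one below. The keys shown are hypothetical placeholders only; dq_rules_example.json in this repo remains the authoritative reference for the expected structure:

```python
import json

# A minimal, hypothetical rules document. The authoritative structure is
# shown in dq_rules_example.json in this repo; the keys below are
# illustrative only, not the package's exact format.
rules_json = """
{
  "tables": [
    {
      "table_name": "showcase_table",
      "rules": [
        {
          "rule_name": "expect_column_values_to_not_be_null",
          "parameters": [{"column": "id"}]
        }
      ]
    }
  ]
}
"""
dq_rules = json.loads(rules_json)
```

Parsing the rules once up front also lets you reuse the same dictionary across several df_check calls.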

Validate the schema of a table

It is possible to validate the schema of an entire table against a schema definition from Amsterdam Schema in one go. This is done by adding two fields to the "dq_rules" JSON when describing the table (see: https://github.com/Amsterdam/dq-suite-amsterdam/blob/main/dq_rules_example.json).

You will need:

  • validate_table_schema: the id field of the table from Amsterdam Schema
  • validate_table_schema_url: the url of the table or dataset from Amsterdam Schema
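With those two fields added, the table entry in the rules JSON might look roughly like this (the id and URL values are placeholders; see dq_rules_example.json for the authoritative layout):

```json
{
  "table_name": "showcase_table",
  "validate_table_schema": "showcase_table",
  "validate_table_schema_url": "https://schemas.data.amsterdam.nl/datasets/..."
}
```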

The schema definition is converted into column-level expectations (expect_column_values_to_be_of_type) at runtime.
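Conceptually, that conversion looks something like the sketch below. The function name and the simplified schema layout are illustrative assumptions, not the package's internals:

```python
# Sketch: turn a (simplified, hypothetical) schema definition into one
# Great Expectations type expectation per column. The real conversion is
# done inside dq-suite-amsterdam at runtime and may differ in detail.
schema = {
    "id": "showcase_table",
    "properties": {"id": "string", "amount": "number"},
}

def schema_to_expectations(schema: dict) -> list:
    """Build one expect_column_values_to_be_of_type entry per column."""
    return [
        {
            "expectation_type": "expect_column_values_to_be_of_type",
            "kwargs": {"column": column, "type_": column_type},
        }
        for column, column_type in schema["properties"].items()
    ]

expectations = schema_to_expectations(schema)
```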

Known exceptions

The functions can run on Databricks using a Personal Compute Cluster or a Job Cluster. Using a Shared Compute Cluster will result in an error, as it lacks the permissions that Great Expectations requires.

Updates

Version 0.1: Run a DQ check for a dataframe

Version 0.2: Run a DQ check for multiple dataframes

Version 0.3: Refactored I/O

Version 0.4: Added schema validation with Amsterdam Schema per table

Version 0.5: Export schema from Unity Catalog



Download files

Download the file for your platform.

Source Distribution

dq_suite_amsterdam-0.5.1.tar.gz (7.6 kB)

Built Distribution

dq_suite_amsterdam-0.5.1-py3-none-any.whl (7.8 kB)

File details

Details for the file dq_suite_amsterdam-0.5.1.tar.gz.

File metadata

  • Download URL: dq_suite_amsterdam-0.5.1.tar.gz
  • Size: 7.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/5.1.0 CPython/3.12.4

File hashes

Hashes for dq_suite_amsterdam-0.5.1.tar.gz:

  • SHA256: fc981e155372941727d0a8240f2ddb23e9f63501d86c0b866df5a0aa512ac261
  • MD5: a46c143a3fe9b8528751e845e5a8f74e
  • BLAKE2b-256: 8f9b26e49efad66694aee253e2a8aee04f7699e6bcb620ebd4d708eb3e82a1f5


File details

Details for the file dq_suite_amsterdam-0.5.1-py3-none-any.whl.

File hashes

Hashes for dq_suite_amsterdam-0.5.1-py3-none-any.whl:

  • SHA256: 3bd7cccbd363d16db791e88005fad214e93539e5fcc06a07164ef8eb3a2fc1f3
  • MD5: 61c9b474e33315430a7abbd860bf2d13
  • BLAKE2b-256: aef219fb84050af5fa7259d4b26ae0d84816b9f1bde900f3d05380e545d6a7c1

