Converts a dataset based on a specific schema

ckanext-transmute

The extension helps validate and convert a dataset based on a specific schema.

Working with transmute

ckanext-transmute provides an action, tsm_transmute, which transmutes data according to the provided conversion schema. The action doesn't change the original data; it creates a new data dict. There are two mandatory arguments: data and schema. data is the data dict you have, and schema describes how to validate and change the data in it.

Example: We have a data dict:

{
            "title": "Test-dataset",
            "email": "test@test.ua",
            "metadata_created": "",
            "metadata_modified": "",
            "metadata_reviewed": "",
            "resources": [
                {
                    "title": "test-res",
                    "extension": "xml",
                    "web": "https://stackoverflow.com/",
                    "sub-resources": [
                        {
                            "title": "sub-res",
                            "extension": "csv",
                            "extra": "should-be-removed",
                        }
                    ],
                },
                {
                    "title": "test-res2",
                    "extension": "csv",
                    "web": "https://stackoverflow.com/",
                },
            ],
        }

And we want to achieve this:

{
            "name": "test-dataset",
            "email": "test@test.ua",
            "metadata_created": datetime.datetime(2022, 2, 3, 15, 54, 26, 359453),
            "metadata_modified": datetime.datetime(2022, 2, 3, 15, 54, 26, 359453),
            "metadata_reviewed": datetime.datetime(2022, 2, 3, 15, 54, 26, 359453),
            "attachments": [
                {
                    "name": "test-res",
                    "format": "XML",
                    "url": "https://stackoverflow.com/",
                    "sub-resources": [{"name": "SUB-RES", "format": "CSV"}],
                },
                {
                    "name": "test-res2",
                    "format": "CSV",
                    "url": "https://stackoverflow.com/",
                },
            ],
        }

Then our schema must look something like this:

{
        "root": "Dataset",
        "types": {
            "Dataset": {
                "fields": {
                    "title": {
                        "validators": [
                            "tsm_string_only",
                            "tsm_to_lowercase",
                            "tsm_name_validator",
                        ],
                        "map": "name",
                    },
                    "resources": {
                        "type": "Resource",
                        "multiple": True,
                        "map": "attachments",
                    },
                    "metadata_created": {
                        "validators": ["tsm_isodate"],
                        "default": "2022-02-03T15:54:26.359453",
                    },
                    "metadata_modified": {
                        "validators": ["tsm_isodate"],
                        "default_from": "metadata_created",
                    },
                    "metadata_reviewed": {
                        "validators": ["tsm_isodate"],
                        "replace_from": "metadata_modified",
                    },
                }
            },
            "Resource": {
                "fields": {
                    "title": {
                        "validators": ["tsm_string_only"],
                        "map": "name",
                    },
                    "extension": {
                        "validators": ["tsm_string_only", "tsm_to_uppercase"],
                        "map": "format",
                    },
                    "web": {
                        "validators": ["tsm_string_only"],
                        "map": "url",
                    },
                    "sub-resources": {
                        "type": "Sub-Resource",
                        "multiple": True,
                    },
                },
            },
            "Sub-Resource": {
                "fields": {
                    "title": {
                        "validators": ["tsm_string_only", "tsm_to_uppercase"],
                        "map": "name",
                    },
                    "extension": {
                        "validators": ["tsm_string_only", "tsm_to_uppercase"],
                        "map": "format",
                    },
                    "extra": {
                        "remove": True,
                    },
                }
            },
        },
    }

This is an example of a schema with nested types. The root field is mandatory: it must contain the name of the main type, from which the schema starts. As you can see, the Dataset type contains the Resource type, which in turn contains Sub-Resource.
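To make the nested-type traversal concrete, here is a minimal, self-contained sketch of how a schema with root and types could be walked. It is not the extension's implementation: the function name walk is hypothetical, only the map, remove, type and multiple keywords are handled, and copying fields that have no rules unchanged is an assumption.

```python
from typing import Optional


def walk(data: dict, schema: dict, type_name: Optional[str] = None) -> dict:
    """Return a new dict built from `data` by following `schema` (toy version)."""
    # Start from the type named by the mandatory "root" key.
    type_name = type_name or schema["root"]
    fields = schema["types"][type_name]["fields"]

    result = {}
    for key, value in data.items():
        # Assumption: fields without rules are copied unchanged.
        rules = fields.get(key, {})
        if rules.get("remove"):
            continue  # drop the field from the result

        nested_type = rules.get("type")
        if nested_type and rules.get("multiple"):
            # Transmute every item of a list-valued field in turn.
            value = [walk(item, schema, nested_type) for item in value]
        elif nested_type:
            value = walk(value, schema, nested_type)

        # "map" renames the field in the result dict.
        result[rules.get("map", key)] = value
    return result


schema = {
    "root": "Dataset",
    "types": {
        "Dataset": {
            "fields": {
                "title": {"map": "name"},
                "resources": {
                    "type": "Resource",
                    "multiple": True,
                    "map": "attachments",
                },
            }
        },
        "Resource": {
            "fields": {
                "title": {"map": "name"},
                "extra": {"remove": True},
            }
        },
    },
}

data = {
    "title": "Test-dataset",
    "resources": [{"title": "test-res", "extra": "should-be-removed"}],
}

result = walk(data, schema)
print(result)
```

Because the sketch builds a fresh dict at every level, the input data is left untouched, which mirrors the "creates a new data dict" behaviour described earlier.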

Transmutators

There are a few default transmutators you can use in your schema. You can also define a custom transmutator with the ITransmute interface.

tsm_name_validator - Wrapper around CKAN's default name_validator validator
tsm_to_lowercase - Casts a string value to lowercase
tsm_to_uppercase - Casts a string value to uppercase
tsm_string_only - Validates that field.value is a string
tsm_isodate - Validates a datetime string. Converts an ISO-like string to a datetime object
tsm_to_string - Casts field.value to str
tsm_get_nested - Picks a value from a nested structure. Example:

data = {
    "title_translated": [
        {"nested_field": {"en": "en title", "ar": "العنوان ar"}},
    ],
}

schema = ...
    "title": {
        "replace_from": "title_translated",
        "validators": [
            ["tsm_get_nested", 0, "nested_field", "en"],
            "tsm_to_uppercase",
        ],
    },
    ...

This takes the value for the title field from the title_translated field. Because title_translated is an array with nested objects, we use the tsm_get_nested transmutator to extract the value from it.

tsm_trim_string - Trims a string to a maximum length. Example of trimming hello world to hello:

data = {"field_name": "hello world"}

schema = ...
    "field_name": {
        "validators": [
            ["tsm_trim_string", 5]
        ],
    },
    ...

tsm_concat - Concatenates the given arguments into one string. Use $self to refer to the field's own value. Example:

data = {"id": "dataset-1"}

schema = ...
    "package_url": {
        "replace_from": "id",
        "validators": [
            [
                "tsm_concat",
                "https://site.url/dataset/",
                "$self",
            ]
        ],
    },
    ...

tsm_unique_only - Preserve only unique values from a list. Works only with lists.
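
The argument-taking transmutators listed above can be approximated in plain Python to show what each one is expected to do. The functions below are hypothetical stand-ins that operate on raw values, not the extension's own code; in particular, keeping the first occurrence of each item in unique_only is an assumption, since the description does not say whether order is preserved.

```python
from typing import Any


def get_nested(value: Any, *path: Any) -> Any:
    """Follow `path` (list indexes or dict keys) into a nested structure."""
    for step in path:
        value = value[step]
    return value


def trim_string(value: str, max_length: int) -> str:
    """Cut the string down to at most `max_length` characters."""
    return value[:max_length]


def concat(value: Any, *parts: Any) -> str:
    """Join `parts` into one string, replacing "$self" with the field value."""
    return "".join(str(value) if part == "$self" else str(part) for part in parts)


def unique_only(value: list) -> list:
    """Keep the first occurrence of each item (ordering is an assumption)."""
    seen = []
    for item in value:
        if item not in seen:
            seen.append(item)
    return seen


title_translated = [{"nested_field": {"en": "en title", "ar": "العنوان ar"}}]

print(get_nested(title_translated, 0, "nested_field", "en").upper())  # EN TITLE
print(trim_string("hello world", 5))  # hello
print(concat("dataset-1", "https://site.url/dataset/", "$self"))
print(unique_only([1, 2, 2, 3, 1]))  # [1, 2, 3]
```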

Every default transmutator receives at least one mandatory argument: the field object. A field has a few properties: field_name, value and type.
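
The contract described above (a transmutator receives a field object carrying field_name, value and type) can be pictured with a small stand-in. The Field dataclass and the isodate function below are hypothetical illustrations, not the classes shipped with ckanext-transmute; returning the field from the transmutator is likewise an assumption.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Any


@dataclass
class Field:
    """Hypothetical stand-in for the field object passed to a transmutator."""

    field_name: str
    value: Any
    type: str


def isodate(field: Field) -> Field:
    """Convert an ISO-like string in field.value into a datetime object."""
    if isinstance(field.value, datetime):
        return field  # already a datetime, nothing to do
    try:
        field.value = datetime.fromisoformat(field.value)
    except (TypeError, ValueError) as err:
        # Non-strings raise TypeError, malformed strings raise ValueError.
        raise ValueError(
            f"{field.field_name}: invalid ISO date {field.value!r}"
        ) from err
    return field


field = isodate(Field("metadata_created", "2022-02-03T15:54:26.359453", "Dataset"))
print(field.value)
```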

You can pass additional arguments to a transmutator, as with tsm_get_nested. To do so, use a nested array whose first item is the transmutator name and whose remaining items are its arguments.
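
As a rough illustration of the two entry forms (a bare name, or a nested array of name plus arguments), the sketch below normalizes each entry before applying it. The REGISTRY dict and both helper functions are hypothetical; the toy callables work on raw values rather than field objects, and the extension resolves transmutator names in its own way.

```python
# Hypothetical registry mapping schema names to toy callables.
REGISTRY = {
    "tsm_to_uppercase": lambda value: value.upper(),
    "tsm_trim_string": lambda value, max_length: value[:max_length],
}


def parse_validator(entry):
    """Split a schema entry into (name, args)."""
    if isinstance(entry, str):
        return entry, []  # bare name, no extra arguments
    name, *args = entry  # first item is the name, the rest are arguments
    return name, args


def apply_validators(value, validators):
    """Apply each transmutator in order, feeding the result to the next one."""
    for entry in validators:
        name, args = parse_validator(entry)
        value = REGISTRY[name](value, *args)
    return value


result = apply_validators("hello world", [["tsm_trim_string", 5], "tsm_to_uppercase"])
print(result)  # HELLO
```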

Keywords

map_to (str) - changes the field.name in the result dict. Note that the example schema above spells this keyword as map.
validators (list[str]) - a list of transmutators that will be applied to field.value. A transmutator can be a string or a list whose first item is the transmutator name and whose remaining items are arbitrary values. Example:
```
...
"validators": [
    ["tsm_get_nested", "nested_field", "en"],
    "tsm_to_uppercase",
],
...
```
There are two transmutators: tsm_get_nested and tsm_to_uppercase.
multiple (bool, default: False) - if the field can hold multiple items, e.g. the resources field in a dataset, mark it as multiple to transmute all the items successively.
```
...
"resources": {
    "type": "Resource",
    "multiple": True
},
...
```
remove (bool, default: False) - removes a field from a result dict if True.
default (Any) - the default value that will be used if the original field.value evaluates to False.
default_from (str | list) - acts similarly to default but accepts the field.name of a sibling field whose value should be used. A sibling field is a field located in the same type. The current implementation doesn't allow pointing to fields from other types. Accepts either a string with the field.name or an array of strings to use multiple fields. See the inherit_mode keyword for details.
```
...
"metadata_modified": {
    "validators": ["tsm_isodate"],
    "default_from": "metadata_created",
},
...
```
replace_from (str | list) - acts similarly to default_from but replaces the original value whether it's empty or not.
inherit_mode (str, default: combine) - defines the mode for default_from and replace_from. By default the values from all the listed fields are combined, but you can instead use just the first non-false value, in case a field might be empty.
value (Any) - a value that will be used for the field. This keyword has the highest priority. It can be used to create a new field with an arbitrary value.
update (bool, default: False) - if the original value is mutable (array, object), you can update it. You can only update field values of the same type.
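
The interplay of value, replace_from, default_from and default can be sketched as a small resolution function. This is only an interpretation of the descriptions above, not the extension's code. The helper names, the mode name first_filled, the reading of combine as "collect every non-empty value into a list", and checking default_from before default are all assumptions.

```python
from typing import Any


def inherit(siblings: dict, names, mode: str = "combine") -> Any:
    """Collect values from the sibling fields named by `names` (str or list)."""
    if isinstance(names, str):
        return siblings.get(names)
    values = [siblings.get(name) for name in names]
    if mode == "combine":
        # Assumption: "combine" gathers every non-empty value into a list.
        return [value for value in values if value]
    # Assumption: the other mode (called "first_filled" here) picks the
    # first non-false value.
    return next((value for value in values if value), None)


def resolve(name: str, rules: dict, siblings: dict) -> Any:
    """Pick the value for field `name` according to its schema `rules`."""
    mode = rules.get("inherit_mode", "combine")
    if "value" in rules:
        return rules["value"]  # highest priority
    if "replace_from" in rules:
        # Replaces the original whether or not it is empty.
        return inherit(siblings, rules["replace_from"], mode)
    current = siblings.get(name)
    if not current and "default_from" in rules:
        current = inherit(siblings, rules["default_from"], mode)
    if not current and "default" in rules:
        current = rules["default"]
    return current


record = {"created": "2022-02-03", "modified": "", "reviewed": "2020-01-01"}

print(resolve("modified", {"default_from": "created"}, record))  # 2022-02-03
print(resolve("reviewed", {"default_from": "created"}, record))  # 2020-01-01
print(resolve("reviewed", {"replace_from": "created"}, record))  # 2022-02-03
```

The second call keeps the original value because it is not empty, while the third overwrites it regardless, which is the difference between default_from and replace_from.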

Installation

To install ckanext-transmute:

Activate your CKAN virtual environment, for example:

. /usr/lib/ckan/default/bin/activate
Clone the source and install it on the virtualenv

git clone https://github.com/mutantsan/ckanext-transmute.git
cd ckanext-transmute
pip install -e .
pip install -r requirements.txt
Add transmute to the ckan.plugins setting in your CKAN config file (by default the config file is located at /etc/ckan/default/ckan.ini).
Restart CKAN. For example if you've deployed CKAN with Apache on Ubuntu:

sudo service apache2 reload

Developer installation

To install ckanext-transmute for development, activate your CKAN virtualenv and do:

git clone https://github.com/mutantsan/ckanext-transmute.git
cd ckanext-transmute
python setup.py develop
pip install -r dev-requirements.txt

Tests

I've used TDD to write this extension, so if you change something, make sure that all the tests still pass. To run the tests, do:

pytest --ckan-ini=test.ini

License

AGPL
