Converts a dataset based on a specific schema
ckanext-transmute
The extension helps to validate and convert a dataset based on a specific schema.
Working with transmute
ckanext-transmute provides an action, tsm_transmute. It transmutes data according to the provided conversion schema. The action doesn't change the original data; it creates a new data dict. There are two mandatory arguments: data and schema. data is the data dict you have, and schema describes how to validate and change the data in it.
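A minimal sketch of calling the action from Python. It assumes the standard CKAN action calling convention, `get_action(name)(context, data_dict)`, and that the two mandatory arguments are passed as the `data` and `schema` keys of the data dict. The helper name `transmute` and the `ignore_auth` context are illustrative and not part of the extension; `get_action` is passed in as a parameter so the snippet itself does not import CKAN.

```python
from typing import Any, Callable


def transmute(get_action: Callable[[str], Callable], data: dict, schema: dict) -> Any:
    """Call tsm_transmute through a CKAN-style get_action callable.

    `transmute` is a hypothetical helper. Inside a CKAN process you would
    pass `ckan.plugins.toolkit.get_action` as the first argument.
    """
    action = get_action("tsm_transmute")
    # Assumption: auth checks are skipped only to keep the sketch short.
    context = {"ignore_auth": True}
    # The two mandatory arguments travel in the action's data dict.
    return action(context, {"data": data, "schema": schema})
```

Inside CKAN this could be used as `transmute(toolkit.get_action, data, schema)`, with `toolkit` imported from `ckan.plugins`.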
Example: We have a data dict:
{
    "title": "Test-dataset",
    "email": "test@test.ua",
    "metadata_created": "",
    "metadata_modified": "",
    "metadata_reviewed": "",
    "resources": [
        {
            "title": "test-res",
            "extension": "xml",
            "web": "https://stackoverflow.com/",
            "sub-resources": [
                {
                    "title": "sub-res",
                    "extension": "csv",
                    "extra": "should-be-removed",
                }
            ],
        },
        {
            "title": "test-res2",
            "extension": "csv",
            "web": "https://stackoverflow.com/",
        },
    ],
}
And we want to achieve this:
{
    "name": "test-dataset",
    "email": "test@test.ua",
    "metadata_created": datetime.datetime(2022, 2, 3, 15, 54, 26, 359453),
    "metadata_modified": datetime.datetime(2022, 2, 3, 15, 54, 26, 359453),
    "metadata_reviewed": datetime.datetime(2022, 2, 3, 15, 54, 26, 359453),
    "attachments": [
        {
            "name": "test-res",
            "format": "XML",
            "url": "https://stackoverflow.com/",
            "sub-resources": [{"name": "SUB-RES", "format": "CSV"}],
        },
        {
            "name": "test-res2",
            "format": "CSV",
            "url": "https://stackoverflow.com/",
        },
    ],
}
Then our schema should look something like this:
{
    "root": "Dataset",
    "types": {
        "Dataset": {
            "fields": {
                "title": {
                    "validators": [
                        "tsm_string_only",
                        "tsm_to_lowercase",
                        "tsm_name_validator",
                    ],
                    "map": "name",
                },
                "resources": {
                    "type": "Resource",
                    "multiple": True,
                    "map": "attachments",
                },
                "metadata_created": {
                    "validators": ["tsm_isodate"],
                    "default": "2022-02-03T15:54:26.359453",
                },
                "metadata_modified": {
                    "validators": ["tsm_isodate"],
                    "default_from": "metadata_created",
                },
                "metadata_reviewed": {
                    "validators": ["tsm_isodate"],
                    "replace_from": "metadata_modified",
                },
            }
        },
        "Resource": {
            "fields": {
                "title": {
                    "validators": ["tsm_string_only"],
                    "map": "name",
                },
                "extension": {
                    "validators": ["tsm_string_only", "tsm_to_uppercase"],
                    "map": "format",
                },
                "web": {
                    "validators": ["tsm_string_only"],
                    "map": "url",
                },
                "sub-resources": {
                    "type": "Sub-Resource",
                    "multiple": True,
                },
            },
        },
        "Sub-Resource": {
            "fields": {
                "title": {
                    "validators": ["tsm_string_only", "tsm_to_uppercase"],
                    "map": "name",
                },
                "extension": {
                    "validators": ["tsm_string_only", "tsm_to_uppercase"],
                    "map": "format",
                },
                "extra": {
                    "remove": True,
                },
            }
        },
    },
}
This is an example of a schema with nested types. The root field is mandatory: it must contain the name of the main type, from which the schema starts. As you can see, the Dataset type contains the Resource type, which in turn contains the Sub-Resource type.
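To make the nesting concrete, here is a toy walker that starts at the root type and recurses into nested types. It is a sketch of the idea only, not the extension's implementation: it handles just the `map`, `remove`, `type` and `multiple` keys as they appear in the schema example, ignores validators, and assumes that fields missing from the schema are copied unchanged. The name `toy_transmute` is hypothetical.

```python
def toy_transmute(data: dict, schema: dict, type_name: str = "") -> dict:
    """Toy walk over a nested schema; NOT the extension's implementation."""
    # Traversal starts at the type named by the mandatory "root" key.
    type_name = type_name or schema["root"]
    fields = schema["types"][type_name]["fields"]
    result = {}  # a new dict is built, the input is left untouched
    for name, value in data.items():
        definition = fields.get(name, {})
        if definition.get("remove"):
            continue  # the field is dropped from the result
        if "type" in definition:
            # Nested type: recurse into every item, or into the single item.
            if definition.get("multiple"):
                value = [toy_transmute(item, schema, definition["type"]) for item in value]
            else:
                value = toy_transmute(value, schema, definition["type"])
        result[definition.get("map", name)] = value
    return result


schema = {
    "root": "Dataset",
    "types": {
        "Dataset": {"fields": {
            "title": {"map": "name"},
            "resources": {"type": "Resource", "multiple": True, "map": "attachments"},
        }},
        "Resource": {"fields": {
            "title": {"map": "name"},
            "extra": {"remove": True},
        }},
    },
}
data = {"title": "t", "resources": [{"title": "r", "extra": "x"}]}
print(toy_transmute(data, schema))
# {'name': 't', 'attachments': [{'name': 'r'}]}
```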
Transmutators
There are a few default transmutators you can use in your schema. You can also define a custom transmutator with the ITransmute interface.
tsm_name_validator - Wrapper over the CKAN default name_validator validator
tsm_to_lowercase - Casts a string value to lowercase
tsm_to_uppercase - Casts a string value to uppercase
tsm_string_only - Validates that field.value is a string
tsm_isodate - Validates a datetime string. Mutates an ISO-like string into a datetime object
tsm_to_string - Casts field.value to str
tsm_get_nested - Allows you to pick up a value from a nested structure. Example:
    data = {
        "title_translated": [
            {"nested_field": {"en": "en title", "ar": "العنوان ar"}},
        ]
    }
    schema = ...
        "title": {
            "replace_from": "title_translated",
            "validators": [
                ["tsm_get_nested", 0, "nested_field", "en"],
                "tsm_to_uppercase",
            ],
        },
    ...
This takes the value for the title field from the title_translated field. Because title_translated is an array with nested objects, we use the tsm_get_nested transmutator to retrieve the value from it.
tsm_trim_string - Trims a string to a maximum length. Example that trims hello world to hello:
    data = {"field_name": "hello world"}
    schema = ...
        "field_name": {
            "validators": [
                ["tsm_trim_string", 5]
            ],
        },
    ...
tsm_concat - Concatenates strings. Use $self to refer to the field value. Example:
    data = {"id": "dataset-1"}
    schema = ...
        "package_url": {
            "replace_from": "id",
            "validators": [
                [
                    "tsm_concat",
                    "https://site.url/dataset/",
                    "$self",
                ]
            ],
        },
    ...
tsm_unique_only - Preserves only the unique values from a list. Works only with lists.
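The behaviour described for tsm_get_nested, tsm_trim_string, tsm_concat and tsm_unique_only can be sketched in plain Python. These are illustrative stand-ins that operate on bare values rather than on the extension's field object; the function names are hypothetical, and keeping first-seen order in `unique_only` is an assumption of this sketch.

```python
from typing import Any


def get_nested(value: Any, *path: Any) -> Any:
    """Follow list indexes and dict keys one step at a time."""
    for step in path:
        value = value[step]
    return value


def trim_string(value: str, max_length: int) -> str:
    """Cut the string down to at most max_length characters."""
    return value[:max_length]


def concat(value: Any, *parts: Any) -> str:
    """Join the parts into one string, substituting "$self" with the value."""
    return "".join(str(value) if part == "$self" else str(part) for part in parts)


def unique_only(value: list) -> list:
    """Drop duplicates; first-seen order is kept (an assumption of this sketch)."""
    seen = []
    for item in value:
        if item not in seen:
            seen.append(item)
    return seen


title_translated = [{"nested_field": {"en": "en title", "ar": "العنوان ar"}}]
print(get_nested(title_translated, 0, "nested_field", "en").upper())  # EN TITLE
print(trim_string("hello world", 5))  # hello
print(concat("dataset-1", "https://site.url/dataset/", "$self"))
# https://site.url/dataset/dataset-1
print(unique_only([1, 2, 2, 3, 1]))  # [1, 2, 3]
```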
A default transmutator must receive at least one mandatory argument: the field object. A field has a few properties: field_name, value and type.
It is possible to pass additional arguments to a validator, as with tsm_get_nested. To do this, use a nested array whose first item is the transmutator name and whose remaining items are the arguments passed to it.
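A minimal sketch of how a validators list with plain names and nested arrays could be dispatched. The `Field` dataclass below is a stand-in carrying only the three properties named above, and `REGISTRY` and `apply_validators` are hypothetical names. That each transmutator returns the field it was given is an assumption of this sketch, not a documented contract.

```python
import dataclasses
from datetime import datetime
from typing import Any


@dataclasses.dataclass
class Field:
    """Illustrative stand-in for the field object."""
    field_name: str
    value: Any
    type: str


def to_uppercase(field: Field) -> Field:
    field.value = field.value.upper()
    return field


def isodate(field: Field) -> Field:
    # datetime.fromisoformat raises ValueError for a malformed string.
    field.value = datetime.fromisoformat(field.value)
    return field


def get_nested(field: Field, *path: Any) -> Field:
    for step in path:
        field.value = field.value[step]
    return field


# Hypothetical lookup table from transmutator name to function.
REGISTRY = {
    "tsm_to_uppercase": to_uppercase,
    "tsm_isodate": isodate,
    "tsm_get_nested": get_nested,
}


def apply_validators(field: Field, validators: list) -> Field:
    for entry in validators:
        if isinstance(entry, str):
            name, args = entry, []   # plain name: no extra arguments
        else:
            name, *args = entry      # nested array: name first, then arguments
        field = REGISTRY[name](field, *args)
    return field


title = Field("title", [{"nested_field": {"en": "en title"}}], "Dataset")
validators = [["tsm_get_nested", 0, "nested_field", "en"], "tsm_to_uppercase"]
print(apply_validators(title, validators).value)  # EN TITLE
```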
Keywords
map_to (str) - changes the field.name in the result dict.
validators (list[str]) - a list of transmutators that will be applied to the field.value. A transmutator can be a string or a list whose first item is the transmutator name and whose other items are arbitrary values. Example:
    ...
    "validators": [
        ["tsm_get_nested", "nested_field", "en"],
        "tsm_to_uppercase",
    ],
    ...
  There are two transmutators here: tsm_get_nested and tsm_to_uppercase.
multiple (bool, default: False) - if the field can have multiple items, e.g. the resources field in a dataset, mark it as multiple to transmute all the items successively.
    ...
    "resources": {
        "type": "Resource",
        "multiple": True
    },
    ...
remove (bool, default: False) - removes the field from the result dict if True.
default (Any) - the default value that will be used if the original field.value evaluates to False.
default_from (str | list) - acts similarly to default, but accepts the field.name of a sibling field from which to take the value. A sibling field is a field located in the same type. The current implementation doesn't allow pointing to fields from other types. It can take a string that represents the field.name, or an array of strings to use multiple fields. See the inherit_mode keyword for details.
    ...
    "metadata_modified": {
        "validators": ["tsm_isodate"],
        "default_from": "metadata_created",
    },
    ...
replace_from (str | list) - acts similarly to default_from, but replaces the original value whether or not it is empty.
inherit_mode (str, default: combine) - defines the mode for default_from and replace_from. By default the values from all the listed fields are combined, but it is also possible to use just the first non-false value, in case a field might be empty.
value (Any) - a value that will be used for the field. This keyword has the highest priority. It can be used to create a new field with an arbitrary value.
update (bool, default: False) - if the original value is mutable (array, object), you can update it. You can only update field values of the same type.
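The value-related keywords can be read as a precedence order. The sketch below is one plausible reading of the descriptions above, not the extension's code: `value` wins, then `replace_from`, then the field's own non-empty value, then `default_from`, then `default`. The relative order of `default_from` and `default`, and resolving fields in schema order so that a sibling's already-resolved value is visible, are assumptions. Only the string form of `default_from` and `replace_from` is handled, and `inherit_mode` is left out. `resolve_value` is a hypothetical name.

```python
from typing import Any


def resolve_value(name: str, definition: dict, siblings: dict) -> Any:
    """Toy precedence for value-related keywords; NOT the extension's code."""
    if "value" in definition:
        return definition["value"]  # "value" has the highest priority
    if "replace_from" in definition:
        # Replaces the original value whether or not it is empty.
        return siblings.get(definition["replace_from"])
    current = siblings.get(name)
    if current:
        return current  # a non-empty original value is kept
    if "default_from" in definition:
        return siblings.get(definition["default_from"])
    return definition.get("default", current)


data = {"metadata_created": "", "metadata_modified": "", "metadata_reviewed": "old"}
fields = {
    "metadata_created": {"default": "2022-02-03T15:54:26.359453"},
    "metadata_modified": {"default_from": "metadata_created"},
    "metadata_reviewed": {"replace_from": "metadata_modified"},
}

result = dict(data)  # work on a copy so the input stays unchanged
for name, definition in fields.items():
    result[name] = resolve_value(name, definition, result)

print(result)
# all three fields end up as "2022-02-03T15:54:26.359453"
```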
Installation
To install ckanext-transmute:
- Activate your CKAN virtual environment, for example:
    . /usr/lib/ckan/default/bin/activate
- Clone the source and install it in the virtualenv:
    git clone https://github.com/mutantsan/ckanext-transmute.git
    cd ckanext-transmute
    pip install -e .
    pip install -r requirements.txt
- Add transmute to the ckan.plugins setting in your CKAN config file (by default the config file is located at /etc/ckan/default/ckan.ini).
- Restart CKAN. For example, if you've deployed CKAN with Apache on Ubuntu:
    sudo service apache2 reload
Developer installation
To install ckanext-transmute for development, activate your CKAN virtualenv and do:
git clone https://github.com/mutantsan/ckanext-transmute.git
cd ckanext-transmute
python setup.py develop
pip install -r dev-requirements.txt
Tests
I've used TDD to write this extension, so if you change something, make sure that all the tests still pass. To run the tests, do:
pytest --ckan-ini=test.ini
License