Skip to main content

A resource normalizer for dataflows

Project description

Dataflows Resource write to db normalized

This library provides some dataflows processing for normalizing a resource.

It has special support for storing normalized data into DB tables.

What is normalization?

In short, it is the process of reducing duplication in a dataset.

More can be read about this concept here.

Example

Let's take, as an example, this world cities dataset (we shall call it the fact resource):

from dataflows import Flow, load, printer

Flow(
    load('https://datahub.io/core/world-cities/r/world-cities.csv', name='cities'),
    printer(num_rows=1)
).process()

cities:

# name country subcountry geonameid
1 les Escaldes Andorra Escaldes-Engordany 3040051
2 Andorra la Vella Andorra Andorra la Vella 3041563
...
23018 Chitungwiza Zimbabwe Harare 1106542

It seems that the country and subcountry columns are quite repetitive - let's extract them into a separate, deduplicated resource (we will call that a dimension resource).

To do that we use the normalize processor.

This processor receives a single resource name, and a list of NormGroup instances. Each of these groups specifies one new dimension resource to be extracted and deduplicated.

Let's see it in action:

from dataflows_normalize import normalize, NormGroup

Flow(
    load('https://datahub.io/core/world-cities/r/world-cities.csv', name='cities'),
    normalize([
       NormGroup(['country', 'subcountry'], 'country_id', 'id') 
    ], resource='cities'),
    printer()
).process()

cities:

# name geonameid country_id
1 les Escaldes 3040051 0
2 Andorra la Vella 3041563 1
3 Umm al Qaywayn 290594 2
4 Ras al-Khaimah 291074 3
5 Khawr Fakkān 291696 4
...
23014 Bulawayo 894701 2677
23015 Bindura 895061 2678
23016 Beitbridge 895269 2679
23017 Epworth 1085510 2676
23018 Chitungwiza 1106542 2676

cities_country_id:

# id country subcountry
1 30 Afghanistan Badakhshan
2 27 Afghanistan Badghis
3 21 Afghanistan Balkh
4 33 Afghanistan Bāmīān
5 31 Afghanistan Farah
6 19 Afghanistan Faryab
7 28 Afghanistan Ghaznī
8 13 Afghanistan Ghowr
9 22 Afghanistan Helmand
10 11 Afghanistan Herat
...
2671 2677 Zimbabwe Bulawayo
2672 2676 Zimbabwe Harare
2673 2673 Zimbabwe Manicaland
2674 2678 Zimbabwe Mashonaland Central
2675 2675 Zimbabwe Mashonaland East
2676 2674 Zimbabwe Mashonaland West
2677 2670 Zimbabwe Masvingo
2678 2671 Zimbabwe Matabeleland North
2679 2679 Zimbabwe Matabeleland South
2680 2672 Zimbabwe Midlands

If we follow the last line in the dataset (Chitungwiza), we can see that an entry for its region (Zimbabwe/Harare) was created with id 2676, and that id was added to the original row instead of the original values.

How much did we gain?

The original CSV file has a size of 895,586 bytes.

If we save the two new resources as CSVs, we would get

542,299 bytes for the fact resource and 68,023 for the regions dimension resource - a total of 610,322 bytes (or a reduction of 31% in size).

Not only this helps with size, it also improves greatly DB performance to store data in normalized form.

DB Normalization

Running similar code to above, only using normalize_to_db will do the following:

  • Load existing values from database dimension tables (in case these tables exist)
  • Normalize the input data, and split into fact and dimension resources
  • Update the DB tables with new values, while reusing existing references

The main difference in usage from normalize is that the names of DB tables are provided.

from dataflows_normalize import normalize_to_db, NormGroup

Flow(
    load('https://datahub.io/core/world-cities/r/world-cities.csv', name='cities'),
    normalize_to_db(
        [
            NormGroup(['country', 'subcountry'], 'country_id', 'id', db_table='countries_db_table') 
        ], 
        'cities_db_table', 'cities',
        db_connection_str='...'
    ),
).process()

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

dataflows-normalize-0.0.10.tar.gz (6.5 kB view details)

Uploaded Source

Built Distribution

dataflows_normalize-0.0.10-py2.py3-none-any.whl (6.8 kB view details)

Uploaded Python 2 Python 3

File details

Details for the file dataflows-normalize-0.0.10.tar.gz.

File metadata

  • Download URL: dataflows-normalize-0.0.10.tar.gz
  • Upload date:
  • Size: 6.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/1.13.0 pkginfo/1.5.0.1 requests/2.21.0 setuptools/39.0.1 requests-toolbelt/0.9.1 tqdm/4.31.1 CPython/3.7.0

File hashes

Hashes for dataflows-normalize-0.0.10.tar.gz
Algorithm Hash digest
SHA256 faaca96b297eb28f0f0e0acd4a12ccc2243da1ba9294bf2ce96b9b5429ce74aa
MD5 8b8b7fceaa57ac8834a68a79b9447b18
BLAKE2b-256 8423fb7e41099f4962e9667a57745d00633ccde24be453e75dd8b8ecea705a70

See more details on using hashes here.

Provenance

File details

Details for the file dataflows_normalize-0.0.10-py2.py3-none-any.whl.

File metadata

  • Download URL: dataflows_normalize-0.0.10-py2.py3-none-any.whl
  • Upload date:
  • Size: 6.8 kB
  • Tags: Python 2, Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/1.13.0 pkginfo/1.5.0.1 requests/2.21.0 setuptools/39.0.1 requests-toolbelt/0.9.1 tqdm/4.31.1 CPython/3.7.0

File hashes

Hashes for dataflows_normalize-0.0.10-py2.py3-none-any.whl
Algorithm Hash digest
SHA256 34c93d0fcc37ab00f434ce9ab0bfe7b6cf1cd4d120cddc5fdac1ab039420ef16
MD5 5d3bcf99a6a77a8ce6a049575d2f560e
BLAKE2b-256 c80ec7bd8e7369eb5d7700dc530fda4e1f52b1cf20d819176dfd2ed4627eb9d5

See more details on using hashes here.

Provenance

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page