
# ![logo](logo-s.png) DataFlows

[![Travis](https://img.shields.io/travis/datahq/dataflows/master.svg)](https://travis-ci.org/datahq/dataflows)
[![Coveralls](http://img.shields.io/coveralls/datahq/dataflows.svg?branch=master)](https://coveralls.io/r/datahq/dataflows?branch=master)
![PyPI - Python Version](https://img.shields.io/pypi/pyversions/dataflows.svg)
[![Gitter chat](https://badges.gitter.im/dataflows-chat/Lobby.png)](https://gitter.im/dataflows-chat/Lobby)

DataFlows is a novel and intuitive way of building data processing flows.

- It's built for medium-data processing - data that fits on your hard drive, but is too big to load in Excel or as-is into Python, and not big enough to require spinning up a Hadoop cluster...
- It's built upon the foundation of the Frictionless Data project - which means that all data produced by these flows is easily reusable by others.
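
For a taste, a flow is just a chain of steps. A minimal sketch, assuming the `Flow`, `load` and `printer` helpers exported by the library:

```python
from dataflows import Flow, load, printer

# Load a remote CSV and print its rows; process() runs the whole flow
Flow(
    load('https://raw.githubusercontent.com/datahq/dataflows/master/data/academy.csv'),
    printer(),
).process()
```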

## QuickStart

Install `dataflows` via `pip install dataflows`.

Then use the command-line interface to bootstrap a basic processing script for any remote data file:

```bash

# Install from PyPI
$ pip install dataflows

# Inspect a remote CSV file
$ dataflows init https://raw.githubusercontent.com/datahq/dataflows/master/data/academy.csv
Writing processing code into academy_csv.py
Running academy_csv.py
academy:
  #  Year       Ceremony   Award          Winner    Name                     Film
     (string)   (integer)  (string)       (string)  (string)                 (string)
---  ---------  ---------  -------------  --------  -----------------------  ------------------
  1  1927/1928          1  Actor                    Richard Barthelmess      The Noose
  2  1927/1928          1  Actor          1         Emil Jannings            The Last Command
  3  1927/1928          1  Actress                  Louise Dresser           A Ship Comes In
  4  1927/1928          1  Actress        1         Janet Gaynor             7th Heaven
  5  1927/1928          1  Actress                  Gloria Swanson           Sadie Thompson
  6  1927/1928          1  Art Direction            Rochus Gliese            Sunrise
  7  1927/1928          1  Art Direction  1         William Cameron Menzies  The Dove; Tempest
...

# dataflows creates a local package of the data and a reusable processing script, which you can tinker with
$ tree
.
├── academy_csv
│   ├── academy.csv
│   └── datapackage.json
└── academy_csv.py

1 directory, 3 files

# Resulting 'Data Package' is super easy to use in Python
$ python
Python 3.6.1 (default, Mar 27 2017, 00:25:54)
[GCC 4.2.1 Compatible Apple LLVM 8.0.0 (clang-800.0.42.1)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from datapackage import Package
>>> pkg = Package('academy_csv/datapackage.json')
>>> it = pkg.resources[0].iter(keyed=True)
>>> next(it)
{'Year': '1927/1928', 'Ceremony': 1, 'Award': 'Actor', 'Winner': None, 'Name': 'Richard Barthelmess', 'Film': 'The Noose'}
>>> next(it)
{'Year': '1927/1928', 'Ceremony': 1, 'Award': 'Actor', 'Winner': '1', 'Name': 'Emil Jannings', 'Film': 'The Last Command'}

# You can now run `academy_csv.py` to repeat the process,
# and modify it to add data processing steps
```
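
For example, a row-level step can be slotted into the generated flow. A minimal sketch (the exact code emitted by `dataflows init` may differ, and `titlecase_name` is a made-up helper):

```python
from dataflows import Flow, load, dump_to_path

def titlecase_name(row):
    # Row-level processor: receives each row as a dict and may edit it in place
    if row.get('Name'):
        row['Name'] = row['Name'].title()

Flow(
    load('https://raw.githubusercontent.com/datahq/dataflows/master/data/academy.csv'),
    titlecase_name,
    dump_to_path('academy_csv'),
).process()
```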

## Features

* Trivial to get started and easy to scale up
* Set up and run from the command line in seconds ...
    * `dataflows init` => `flow.py`
    * `python flow.py`
* Validate input (and especially source) data quickly (non-zero length, right structure, etc.)
* Supports caching data from the source, and even between steps
    * so that we can run and test quickly (retrieving is slow)
* Immediate test run: run the flow and look at the output ...
    * Log, debug, rerun
* Degrades to simple Python
* Conventions over configuration
* Log exceptions and / or terminate
* The input to each stage is a Data Package or Data Resource (not a previous task)
    * Data package based and compatible
* Processors can be a function (or a class) processing row-by-row, resource-by-resource or a full package (see the sketch after this list)
* A pre-existing, decent contrib library of Readers (Collectors), Processors and Writers
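
As a sketch of the three processor shapes, following the conventions described in the tutorial (the parameter name, `row`, `rows` or `package`, tells DataFlows how to call the function; `checkpoint` is the built-in caching step):

```python
from dataflows import Flow, load, checkpoint, printer

def set_title(package):
    # Package processor (parameter named `package`): may edit the
    # datapackage descriptor, then must re-yield it and the resources
    package.pkg.descriptor['title'] = 'Academy Awards'
    yield package.pkg
    yield from package

def uppercase_award(row):
    # Row processor (parameter named `row`): edits one row at a time
    row['Award'] = row['Award'].upper()

def drop_unnamed(rows):
    # Resource processor (parameter named `rows`): a generator over one resource
    for row in rows:
        if row.get('Name'):
            yield row

Flow(
    load('https://raw.githubusercontent.com/datahq/dataflows/master/data/academy.csv'),
    checkpoint('academy'),   # cache the downloaded data between runs
    set_title,
    uppercase_award,
    drop_unnamed,
    printer(),
).process()
```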

## Learn more

Dive into the [Tutorial](TUTORIAL.md) for a deeper look at everything that `dataflows` can do.
Also review the list of [Built-in Processors](PROCESSORS.md), which includes an API reference for each one.
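
For instance, a couple of the built-in processors chained together, as a sketch (check PROCESSORS.md for the exact signatures and options):

```python
from dataflows import Flow, load, filter_rows, set_type, dump_to_path

Flow(
    load('academy_csv/academy.csv'),
    # Keep only the acting awards
    filter_rows(equals=[{'Award': 'Actor'}, {'Award': 'Actress'}]),
    # Ensure Ceremony is typed as an integer in the output schema
    set_type('Ceremony', type='integer'),
    dump_to_path('actors'),
).process()
```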
