pandalone: process data-trees with relocatable-paths
pandalone is a collection of utilities for working with hierarchical-data using relocatable-paths.
- Release: 0.1.12.dev0
- Date: 2016-04-11 12:48:04
- Documentation:
- Source:
- PyPI repo:
- Keywords: calculation, data, dependencies, engineering, excel, library, numpy, pandas, processing, python, resolution, scientific, simulink, tree, utility
- Copyright: 2015 European Commission (JRC-IET)
- License:
Currently only 2 portions of the envisioned functionality are ready for use:
mod(pandalone.xleash): A mini-language for “throwing the rope” around rectangular areas of Excel-sheets.
mod(pandalone.mappings): Hierarchical string-like objects that may be used for indexing, facilitating renaming keys and column-names at a later stage.
Our goal is to facilitate the composition of engineering-models from loosely-coupled components. Initially envisioned as an indirection-framework around pandas coupled with a dependency-resolver, every such model should auto-adapt and process only values available, and allow remapping of the paths accessing them, to run on renamed/relocated value-trees without component-code modifications.
It is an open source library written for python-3.4 but tested under both python-2.7 and python-3.3+, for Windows and Linux.
Introduction
Overview
At the most fundamental level, an “execution” or a “run” of any data-processing can be thought of like this:
 .--------------.      _____________      .--------------.
 ;   DataTree   ;     |             |     ;   DataTree   ;
 ;--------------; ==> |  <cfunc_1>  | ==> ;--------------;
 ; /some/data   ;     |  <cfunc_2>  |     ; /some/data   ;
 ; /some/other  ;     |     ...     |     ; /some/other  ;
 ; /foo/bar     ;     |_____________|     ; /foo/bar     ;
 '--------------'                         '--------------'
The data-tree might come from json, hdf5, excel-workbooks, or plain dictionaries and lists. Its values are strings and numbers, numpy-lists, pandas or xray-datasets, etc.
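As a rough illustration of the idea (not pandalone's own API), a data-tree loaded from JSON is just nested dictionaries and lists, and every leaf can be named by a slash-separated path. The helper name `iter_leaf_paths` and the sample document are assumptions made up for this sketch:

```python
import json


def iter_leaf_paths(tree, prefix=""):
    """Yield (path, value) pairs for every leaf of a nested dict/list tree.

    Hypothetical helper for illustration; not part of pandalone.
    """
    if isinstance(tree, dict):
        items = tree.items()
    elif isinstance(tree, list):
        items = enumerate(tree)
    else:
        # Anything that is not a container is treated as a leaf value.
        yield prefix, tree
        return
    for key, value in items:
        yield from iter_leaf_paths(value, "{}/{}".format(prefix, key))


# A tiny data-tree mirroring the paths drawn in the diagram.
datatree = json.loads('{"some": {"data": 1, "other": [2, 3]}, "foo": {"bar": "x"}}')
leaves = dict(iter_leaf_paths(datatree))

for path, value in sorted(leaves.items()):
    print(path, value)
```

List elements are addressed by their index, so `/some/other/0` and `/some/other/1` name the two items of the list.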
The component-functions must abide by the following simple signature:
cfunc_do_something(pandelone, datatree)
and must not return any value, just read and write into the data-tree.
Here is a simple component-function:
def cfunc_standardize(pandelone, datatree):
    pin, pon = pandelone.paths()    # relocatable paths: one for inputs, one for outputs
    df = datatree.get(pin.A)
    df[pon.A.B_std] = df[pin.A.B] / df[pin.A.B].std()
Notice the use of the relocatable-paths marked specifically as input or output.
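The relocation idea can be sketched without pandalone at all. In the stand-in below, a plain dictionary of names plays the role of the relocatable paths, and the component only ever reads and writes through it. The `paths` dictionary, the `run` helper and the sample trees are all hypothetical; the library's real path objects are not shown here.

```python
import statistics


def cfunc_standardize(paths, datatree):
    """Read one column and write its standardized copy back into the tree."""
    table = datatree[paths["table"]]
    values = table[paths["in_col"]]
    std = statistics.stdev(values)      # sample standard deviation
    table[paths["out_col"]] = [v / std for v in values]


def run(cfuncs, paths, datatree):
    """Apply each component in order; components return nothing."""
    for cfunc in cfuncs:
        cfunc(paths, datatree)


# Original layout of the data-tree.
tree1 = {"A": {"B": [1.0, 2.0, 3.0]}}
run([cfunc_standardize], {"table": "A", "in_col": "B", "out_col": "B_std"}, tree1)

# Renamed layout: only the path mapping changes, not the component code.
tree2 = {"measurements": {"speed": [1.0, 2.0, 3.0]}}
run(
    [cfunc_standardize],
    {"table": "measurements", "in_col": "speed", "out_col": "speed_std"},
    tree2,
)
```

The same component runs unchanged on both trees, which is the property the relocatable-paths are meant to provide.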
TODO: continue rough example in tutorial…
Quick-start
Assuming that you have a working python-environment, open a command-shell (in Windows use program(cmd.exe) BUT ensure program(python.exe) is in its envvar(PATH)) and try the following commands:
- Install:
$ pip install pandalone ## Use `--pre` if version-string has a build-suffix.
Or, in case you need the very latest from the master branch:
$ pip install git+https://github.com/pandalone/pandalone.git
See: doc(install)
- Run:
$ pandalone --version
Install
Current version(x.x.x) runs on Python-2.7+ and Python-3.3+ and requires numpy/scipy, pandas and win32 libraries along with their native backends to be installed.
It has been tested under Windows and Linux, and Python-3.3+ is the preferred interpreter, i.e., the Excel interface and desktop-UI run only with it.
It is distributed on Wheels.
Python installation
As explained above, this project depends on packages with native-backends that require the use of C and Fortran compilers to build from sources. To avoid this hassle, you should choose one of the user-friendly distributions suggested below.
Below is a matrix of the two suggested self-contained python distributions for running this program (we excluded here the default python included in linux). Both distributions:
are free (as in freedom),
do not require admin-rights for installation in Windows, and
have been tested to run this program successfully (also tested on default linux distros).
Distributions | | |
---|---|---|
Platform | Windows | Windows, Mac OS, Linux |
Ease of Installation | Fair (requires fiddling with the envvar(PATH) and the Registry after install) | |
Ease of Use | Easy | Moderate (should use command(conda) and/or command(pip) depending on whether a package contains native libraries) |
# of Packages | Only what’s included in the downloaded-archive | Many 3rd-party packages uploaded by users |
Notes | After installation, see ref:faq for: | Check also installation instructions from the pandas site. |
Package installation
Before installing it, make sure that there are no older versions left over on the python installation you are using. To cleanly uninstall it, run this command until you cannot find any project installed:
$ pip uninstall pandalone ## Use `pip3` if both python-2 & 3 are in PATH.
You can install the project directly from the PyPI repo the “standard” way, by typing the command(pip) in the console:
$ pip install pandalone
If you want to install a pre-release version (the version-string is not plain numbers, but ends with alpha, beta.2 or something else), additionally use option --pre:
$ pip install pandalone --pre
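The rule of thumb above ("not plain numbers") can be written down as a small check. This is only a sketch of that rule, not a full parser of Python version-strings, and the function name `needs_pre_flag` is invented for the example:

```python
import re

# A "plain" release is one or more numbers separated by dots, e.g. 0.0.1
_PLAIN_RELEASE = re.compile(r"\d+(\.\d+)*")


def needs_pre_flag(version):
    """Return True when the version-string is not made of plain dotted numbers."""
    return _PLAIN_RELEASE.fullmatch(version) is None


for candidate in ("0.0.1", "0.1.12.dev0", "0.0.9-alpha.3.1"):
    print(candidate, needs_pre_flag(candidate))
```

Note that any suffix trips this check, including ones that are not pre-releases, so treat a `True` result as a hint to look at the version-string rather than as a definitive answer.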
Also you can install the very latest version straight from the sources:
$ pip install git+https://github.com/pandalone/pandalone.git --pre
If you want to upgrade an existing installation along with all its dependencies, add also option --upgrade (or option -U equivalently), but then the build might take some considerable time to finish. Also there is the possibility the upgraded libraries might break existing programs(!) so use it with caution, or from within a virtualenv (isolated Python environment).
To install it for different Python environments, repeat the procedure using the appropriate program(python.exe) interpreter for each environment.
After installation, it is important that you check which version is visible in your envvar(PATH):
$ pndlcmd --version
0.1.12.dev0
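The same check can be made from inside the interpreter you intend to use, which helps when several Python environments coexist. The sketch below relies on the standard-library `importlib.metadata` module, so it assumes Python 3.8 or later (newer than the interpreters this release targets); the helper name `installed_version` is made up for the example:

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name):
    """Return the installed version-string of a distribution, or None if absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


# Prints the version visible to *this* interpreter, or None when not installed.
print(installed_version("pandalone"))
```

A `None` result after a repeated `pip uninstall` is one way to confirm that no older copy is left in that environment.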
To install for different Python versions, repeat the procedure for every required version.
Older versions
To install an older released version issue the console command:
$ pip install pandalone==0.0.1 ## Use `--pre` if version-string has a build-suffix.
or alternatively straight from the sources:
$ pip install git+https://github.com/pandalone/pandalone.git@v0.0.9-alpha.3.1 --pre
Of course you can substitute v0.0.9-alpha.3.1 with any slug from “commits”, “branches” or “releases” that you will find on the project’s github-repo.
Installing sources
If you download the sources you have more options for installation. There are various methods to get hold of them:
Download the source distribution from the PyPI repo.
Download a release-snapshot from github.
Clone the git-repository at github.
Assuming you have a working installation of git you can fetch and install the latest version of the project with the following series of commands:
$ git clone "https://github.com/pandalone/pandalone.git" pandalone.git
$ cd pandalone.git
$ python setup.py install ## Use `python3` if both python-2 & 3 installed.
When working with sources, you need to have installed all libraries that the project depends on:
$ pip install -r requirements/execution.pip .
The previous command installs a “snapshot” of the project as it is found in the sources. If you wish to link the project’s sources with your python environment, install the project in development mode:
$ python setup.py develop
Project files and folders
The files and folders of the project are listed below:
+--pandalone/       ## (package) Python-code
+--tests/           ## (package) Test-cases
+--doc/             ## Documentation folder
+--setup.py         ## (script) The entry point for `setuptools`, installing, testing, etc
+--requirements/    ## (txt-files) Various pip and conda dependencies.
+--README.rst
+--CHANGES.rst
+--AUTHORS.rst
+--CONTRIBUTING.rst
+--LICENSE.txt
Usage
Currently 2 portions of this library are ready for use: mod(pandalone.xleash) and mod(pandalone.mappings)
Cmd-line usage
The command-line usage below requires the Python environment to be installed, and provides for executing an experiment directly from the OS’s shell (i.e. program(cmd) in windows or program(bash) in POSIX), and in a single command.
[TBD]
GUI usage
For a quick-‘n-dirty method to explore the structure of the data-tree and run an experiment, just run:
$ pandalone gui
Excel usage
In Windows and OS X you may utilize the excellent xlwings library to use Excel files for providing input and output to the experiment.
To create the necessary template-files in your current-directory you should enter:
$ pandalone excel
You could type instead pandalone excel {file_path} to specify a different destination path.
[TBD]
Python usage
Example commands for the python REPL (Read-Eval-Print Loop) are given below; they set up and run an experiment.
First run command(python) or command(ipython) and try to import the project to check its version:
>>> import pandalone
>>> pandalone.__version__           ## Check version once more.
'0.1.12.dev0'
>>> pandalone.__file__              ## To check where it was installed.         # doctest: +SKIP
/usr/local/lib/site-package/pandalone-...
If everything works, create the data-tree to hold the input-data (strings and numbers). You assemble the data-tree from:
sequences,
dictionaries,
class(pandas.DataFrame),
class(pandas.Series), and
URI-references to other data-trees.
[TBD]
Getting Involved
This project is hosted in github. To provide feedback about bugs and errors or questions and requests for enhancements, use github’s Issue-tracker.
Sources & Dependencies
To get involved with development, you need a POSIX environment to fully build it (Linux, OSX or Cygwin on Windows).
First you need to download the latest sources:
$ git clone https://github.com/pandalone/pandalone.git pandalone.git
$ cd pandalone.git
LiClipse IDE
Within the sources there are two sample files for the comprehensive LiClipse IDE:
file(eclipse.project)
file(eclipse.pydevproject)
Remove the eclipse prefix (but leave the dot(.)) and import it as “existing project” from Eclipse’s File menu.
LiClipse ships its own implementation of Git, EGit, which interacts badly with unix symbolic-links, such as the file(docs/docs), and detects working-directory changes even after a fresh checkout. To work around this, right-click on the above file and select Properties --> Team --> Advanced --> Assume Unchanged.
Then you can install all of the project’s dependencies in development mode using the file(setup.py) script:
$ python setup.py --help ## Get help for this script.
Common commands: (see '--help-commands' for more)
setup.py build will build the package underneath 'build/'
setup.py install will install the package
Global options:
--verbose (-v) run verbosely (default)
--quiet (-q) run quietly (turns verbosity off)
--dry-run (-n) don't actually do anything
...
$ python setup.py develop ## Also installs dependencies into project's folder.
$ python setup.py build ## Check that the project indeed builds ok.
You should now run the test-cases to check that the sources are in good shape:
$ python setup.py test
Note
The above commands installed the dependencies inside the project folder and for the virtual-environment. That is why all build and testing actions have to go through python setup.py {some_cmd}.
If you are dealing with installation problems and/or you want to permanently install dependent packages, you have to deactivate the virtual-environment and start installing them into your base python environment:
$ deactivate
$ python setup.py develop
or even try the more permanent installation-mode:
$ python setup.py install # May require admin-rights
Design
FAQ
Why another XXX? What about YYY?
These are the known related python projects:
OpenMDAO: It has influenced pandalone’s design. It is planned to interoperate by converting to and from its data-types. But it works on python-2 only and its architecture needs attention from programmers (no setup.py, no official test-cases).
PyDSTool: It does not overlap, since it does not cover IO and dependencies of data. Also planned to interoperate with it (as soon as we have a better grasp of it :-). It has some issues with the documentation, but they are working on it.
xray: Pandas for higher dimensions; data-trees should in principle work with “xray”.
Blaze: NumPy and Pandas interface to Big Data; data-trees should in principle work with “blaze”.
netCDF4: Hierarchical file-data-format similar to hdf5; a data-tree may derive in principle from “netCDF4”.
hdf5: Hierarchical file-data-format, supported natively by pandas; a data-tree may derive in principle from “hdf5”.
Which other projects/ideas have you reviewed when building this library?
bubbles ETL: Processing-pipelines for (mostly) categorical data.
-
JTSKit, A utility library for working with JSON Table Schema in Python.
Celery: Execute distributed asynchronous tasks using message passing on a single or more worker servers using multiprocessing, Eventlet, or gevent.
Fuzzywuzzy and Jellyfish: Fuzzy string matching in python. Use it for writing code that can read coarsely-known column-names.
“Other people’s messy data (and how not to hate it)”, PyCon 2015 (Canada) presentation by Mali Akmanalp.
Glossary
data-tree
    The *container* of data consumed and produced by a :term:`model`, which may also contain the model. Its values are accessed using **path** s. It is implemented by class(`pandalone.pandata.Pandel`) as a mergeable stack of **JSON-schema** abiding trees of strings and numbers, formed with:

    - sequences,
    - dictionaries,
    - mod(`pandas`) instances, and
    - URI-references.

value-tree
    That part of the **data-tree** that relates only to the I/O data processed.

model
    A collection of **component** s and accompanying **mappings**.

component
    Encapsulates a data-transformation function, using **path** to refer to its inputs/outputs within the **value-tree**.

path
    A `/file/like` string functioning as the *id* of data-values in the **data-tree**. It is composed of **step** s, and it follows the syntax of the **JSON-pointer**.

step
pstep
path-step
    The parts between two consecutive slashes(`/`) within a **path**. The class(`Pstep`) facilitates their manipulation.

pmod
pmods
pmods-hierarchy
mapping
mappings
    Specifies a transformation of an "origin" path to a "destination" one (also called "from" and "to" paths). The mapping always transforms the *final* path-step, and it can either *rename* or *relocate* that step, like this::

        ORIGIN          DESTINATION       RESULT_PATH
        ------          -----------       -----------
        /rename/path    foo           --> /rename/foo        ## renaming
        /relocate/path  foo/bar       --> /relocate/foo/bar  ## relocation
        /root           a/b/c         --> /a/b/c             ## Relocates all /root sub-paths.

    The hierarchy is formed by class(`Pmod`) instances, which are built when parsing the **mappings** list, above.

JSON-schema
    The `JSON schema <http://json-schema.org/>`_ is an `IETF draft <http://tools.ietf.org/html/draft-zyp-json-schema-03>`_ that provides a *contract* for what JSON-data is required for a given application and how to interact with it. JSON Schema is intended to define validation, documentation, hyperlink navigation, and interaction control of JSON data. You can learn more about it from this `excellent guide <http://spacetelescope.github.io/understanding-json-schema/>`_, and experiment with this `on-line validator <http://www.jsonschema.net/>`_.

JSON-pointer
    JSON Pointer(rfc(`6901`)) defines a string syntax for identifying a specific value within a JavaScript Object Notation (JSON) document. It aims to serve the same purpose as *XPath* from the XML world, but it is much simpler.
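To make the **path** and **mappings** entries concrete, here is a minimal sketch in plain Python. It does not use or reproduce the library's class(`Pstep`) or class(`Pmod`); the function names `resolve_pointer` and `remap` are hypothetical, and `remap` only mirrors the three rows of the mapping table above::

```python
def resolve_pointer(doc, pointer):
    """Return the value that a JSON-pointer (RFC 6901) identifies inside doc."""
    if pointer == "":
        return doc                      # the empty pointer is the whole document
    if not pointer.startswith("/"):
        raise ValueError("a non-empty JSON-pointer must start with '/'")
    for step in pointer[1:].split("/"):
        # Unescape in the order RFC 6901 requires: '~1' -> '/', then '~0' -> '~'.
        step = step.replace("~1", "/").replace("~0", "~")
        doc = doc[int(step)] if isinstance(doc, list) else doc[step]
    return doc


def remap(path, origin, destination):
    """Rename or relocate the final step of origin; its sub-paths follow along."""
    if path != origin and not path.startswith(origin + "/"):
        return path                     # the mapping does not apply to this path
    parent = origin.rsplit("/", 1)[0]   # everything before the final step
    return parent + "/" + destination + path[len(origin):]


# Component code asks for '/root/x/1'; the mapping relocates '/root' to 'a/b/c'.
tree = {"a": {"b": {"c": {"x": [10, 20]}}}}
new_path = remap("/root/x/1", "/root", "a/b/c")
value = resolve_pointer(tree, new_path)
print(new_path, value)
```

The relocated path is resolved against the tree exactly like any other pointer, so the code that asked for `/root/x/1` needs no change when the data moves.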