Library for data integration using a JSON/RDF object graph.
# jsongraph [![Build Status](https://travis-ci.org/pudo/jsongraph.svg?branch=master)](https://travis-ci.org/pudo/jsongraph)
This library provides tools to integrate data from multiple sources into a
coherent data model. Given a heterogeneous set of source records, it will
generate a set of composite entities with merged information from all
available sources. Further, it allows querying the resulting graph using a
simple, JSON-based graph query language.

The intent of this tool is to make a graph-based data integration system
(based on RDF) seamlessly available through simple JSON objects.
## Usage
This is what using the library looks like in a simplified scenario:
```python
from jsongraph import Graph
# Create a graph for all project information. This can be backed by a
# triple store or an in-memory construct.
graph = Graph(base_uri='file:///path/to/schema/files')
graph.register('person', 'person_schema.json')
# Load data about a person. ``data`` is a JSON-style dict matching the
# registered schema (not shown here).
context = graph.context()
context.add('person', data)
context.save()
# Repeat data loading for a variety of source files.
# This will integrate data from all source files into a single representation
# of the data.
context = graph.consolidate('urn:prod')
# Metaweb-style queries:
for item in context.query([{"name": None, "limit": 5}]):
    print(item['name'])
```
## Design
A ``jsongraph`` application is centered on a ``Graph``, which stores a set of
data. A ``Graph`` can either exist only in memory, or be stored in a backend
database.

All data in a ``Graph`` is structured as collections of JSON objects (i.e.
nested dictionaries, lists and values). The structure of all stored objects
must be defined using a [JSON Schema](http://json-schema.org/). Some limits
apply to such schemas: for example, they may not allow additional or pattern
properties.
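To illustrate the schema limits mentioned above, here is a minimal sketch of a check that reports where a schema permits additional or pattern properties. The helper `find_unsupported` and both example schemas are hypothetical and not part of ``jsongraph``; the library's own validation may work differently.

```python
def find_unsupported(schema, path="#"):
    """Return paths at which a schema allows additional or pattern properties.

    Hypothetical helper for illustration only, not a jsongraph API.
    """
    problems = []
    if schema.get("patternProperties"):
        problems.append(path + "/patternProperties")
    # Assumption: only an explicit, non-False ``additionalProperties`` is flagged.
    if schema.get("additionalProperties", False) is not False:
        problems.append(path + "/additionalProperties")
    # Recurse into nested object properties and array items.
    for name, sub in schema.get("properties", {}).items():
        problems.extend(find_unsupported(sub, path + "/properties/" + name))
    if isinstance(schema.get("items"), dict):
        problems.extend(find_unsupported(schema["items"], path + "/items"))
    return problems


# A schema that stays within the limits.
person_schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "tags": {"type": "array", "items": {"type": "string"}},
    },
}

# A schema that uses both flagged keywords.
loose_schema = {
    "type": "object",
    "patternProperties": {"^x_": {"type": "string"}},
    "properties": {
        "address": {"type": "object", "additionalProperties": True},
    },
}

clean_report = find_unsupported(person_schema)
loose_report = find_unsupported(loose_schema)
```

Running such a check before registering a schema would surface the offending keywords early, rather than at load time.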
### Contexts and Metadata
The objects in each ``Graph`` are grouped into a set of ``Contexts``. Each
``Context`` also carries metadata, such as the source of the data and the
level of trust that the system should place in it. A ``Context`` will usually
correspond to a source data file or a user interaction.
### Consolidated Contexts
When working with ``jsongraph``, a user will first load data into a variety of
``Contexts``. They can then generate a consolidated version of the data, in a
separate ``Context``.

This consolidated version applies entity de-duplication. For object properties
with multiple available values across several ``Contexts``, the value from the
most trustworthy ``Context`` will be chosen.
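The trust-based selection described above can be sketched in plain Python. This is only an illustration under assumptions: the numeric `trust` field, the dict layout of each context, and the function `consolidate_by_trust` are hypothetical and do not reflect how ``jsongraph`` stores context metadata internally.

```python
def consolidate_by_trust(contexts):
    """Merge property values, keeping the value from the most trusted context.

    Hypothetical sketch: each context is a dict with a numeric ``trust``
    and a ``data`` dict of property values for one entity.
    """
    best = {}  # property name -> (trust, value)
    for ctx in contexts:
        for prop, value in ctx["data"].items():
            # Replace a value only if this context is strictly more trusted;
            # on a tie, the first context seen keeps its value.
            if prop not in best or ctx["trust"] > best[prop][0]:
                best[prop] = (ctx["trust"], value)
    return {prop: value for prop, (_, value) in best.items()}


contexts = [
    {"source": "scraped.csv", "trust": 0.3,
     "data": {"name": "A. Smith", "email": "as@example.org"}},
    {"source": "registry.json", "trust": 0.9,
     "data": {"name": "Alice Smith"}},
]

consolidated = consolidate_by_trust(contexts)
```

Here `name` comes from the more trusted source, while `email` survives from the less trusted one because no other context provides it.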
### Queries
``jsongraph`` includes a query language implementation, which is heavily
inspired by Google's [Metaweb Query Language](http://mql.freebaseapps.com/ch03.html).
Queries are written as JSON, and search proceeds by example. Searches can also
be deeply nested, traversing the links between objects stored in the ``Graph``
to an arbitrary depth.

Queries on the data can be run either against any of the source ``Contexts``,
or against the consolidated ``Context``. Queries against the consolidated
``Context`` will produce responses which reflect the best available information
based on data from a variety of sources.
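To make the idea of query by example more concrete, the following sketch matches plain Python dicts against a pattern. It follows the general Metaweb convention that a ``None`` value asks for a property to be returned, a concrete value acts as a filter, and a nested dict descends into a linked object. The functions `match` and `run_query` are hypothetical stand-ins, not the ``jsongraph`` query engine, and the library's actual semantics may differ in detail.

```python
def match(obj, pattern):
    """Return the projection of ``obj`` requested by ``pattern``, or None."""
    result = {}
    for key, expected in pattern.items():
        if key == "limit":          # directive, not a property
            continue
        if key not in obj:          # assumption: a missing property means no match
            return None
        actual = obj[key]
        if expected is None:        # None: return whatever value is stored
            result[key] = actual
        elif isinstance(expected, dict):   # nested pattern: descend
            nested = match(actual, expected) if isinstance(actual, dict) else None
            if nested is None:
                return None
            result[key] = nested
        elif actual == expected:    # concrete value: filter
            result[key] = actual
        else:
            return None
    return result


def run_query(objects, query):
    """Evaluate a list-wrapped pattern against an iterable of dicts."""
    pattern = query[0]
    limit = pattern.get("limit")
    results = []
    for obj in objects:
        if limit is not None and len(results) >= limit:
            break
        hit = match(obj, pattern)
        if hit is not None:
            results.append(hit)
    return results


people = [
    {"name": "Alice", "city": {"name": "Berlin", "country": "DE"}},
    {"name": "Bob", "city": {"name": "Paris", "country": "FR"}},
    {"name": "Carol", "city": {"name": "Hamburg", "country": "DE"}},
]

german = run_query(people, [{"name": None, "city": {"country": "DE"}, "limit": 5}])
first = run_query(people, [{"name": None, "limit": 1}])
```

The nested `city` pattern plays the role of a link traversal: only people whose linked city matches the inner pattern are returned.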
### De-duplication
A key function of this library will be the application of de-duplication
rules. This will take place in three steps:

* Generating a set of de-duplication candidates for all entities in a given
  ``Graph``. These will be simplified representations of objects which can be
  fed into a comparison tool (either automated or interactive with the user).
* Once the candidates have been decided, they are transformed into a mapping of
  the type (``original_fingerprint`` -> ``same_as_fingerprint``). Such mappings
  are applied to a context.
* Upon graph consolidation (see above), the entities which have been mapped to
  another entity are not included. All their properties are inherited by the
  destination entity.
A data comparison candidate may look like this:
```json
{
  "fingerprint": "...",
  "entity": "...",
  "data": {},
  "source": {
    "label": "...",
    "url": "http://..."
  }
}
```
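The last two steps of the de-duplication process can be sketched as follows. This is an illustration under assumptions rather than the library's implementation: entities are modeled as a dict keyed by fingerprint, and the helpers `resolve` and `apply_mappings` are hypothetical names. The rule that a destination's own values take precedence over inherited ones is also an assumption.

```python
def resolve(fingerprint, same_as):
    """Follow mapping chains (a -> b -> c) to the final destination."""
    seen = set()  # guards against cyclic mappings
    while fingerprint in same_as and fingerprint not in seen:
        seen.add(fingerprint)
        fingerprint = same_as[fingerprint]
    return fingerprint


def apply_mappings(entities, same_as):
    """Drop mapped entities and let their destinations inherit properties."""
    merged = {}
    # First keep every entity that is not mapped to another one.
    for fingerprint, data in entities.items():
        if resolve(fingerprint, same_as) == fingerprint:
            merged[fingerprint] = dict(data)
    # Then fold mapped entities into their destination.
    for fingerprint, data in entities.items():
        target = resolve(fingerprint, same_as)
        if target != fingerprint:
            destination = merged.setdefault(target, {})
            for prop, value in data.items():
                # Assumption: existing destination values are not overwritten.
                destination.setdefault(prop, value)
    return merged


entities = {
    "fp-1": {"name": "Alice Smith", "email": "alice@example.org"},
    "fp-2": {"name": "A. Smith", "phone": "555-0100"},
    "fp-3": {"name": "Bob Jones"},
}
same_as = {"fp-2": "fp-1"}  # original_fingerprint -> same_as_fingerprint

deduplicated = apply_mappings(entities, same_as)
```

After applying the mapping, `fp-2` no longer appears as a separate entity, and its `phone` property is carried over to `fp-1`.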
## Tests
The test suite will usually be executed in its own ``virtualenv`` and performs
a coverage check in addition to running the tests. To execute it on a system
with ``virtualenv`` and ``make`` installed, type:
```bash
$ make test
```