A standalone web service that parses the contents of a CKAN site's data files and pushes them into its DataStore. Accelerated by qsv.
DataPusher+
DataPusher+ is a fork of Datapusher that combines the speed and robustness of ckanext-xloader with the data type guessing of Datapusher.
Datapusher+ is built using CKAN Service Provider, with Messytables replaced by qsv.
TNRIS/TWDB provided the use cases that informed and supported the development of Datapusher+, specifically, to support a Resource-first upload workflow.
It features:
- "Bullet-proof", ultra-fast data type inferencing with qsv

  Unlike Messytables, which scans only the first few rows to guess a column's type, qsv scans the entire table, so its data type inferences are guaranteed[^1].

  Despite scanning the whole file, qsv is still exponentially faster; it not only infers data types, it also calculates descriptive statistics along the way. For example, scanning a 2.7 million row, 124MB CSV file for types and stats took 0.16 seconds[^2]. qsv is this fast because it is written in Rust, is multithreaded, and uses all kinds of performance techniques designed specifically for data-wrangling.
- Exponentially faster loading speed

  Like xloader, we use PostgreSQL COPY to pipe the data directly into the datastore, short-circuiting the additional processing/transformation/API calls used by Datapusher.

  But unlike xloader, we load everything using the proper data types, not as text, so there's no need to reload the data after adjusting the Data Dictionary, as you would with xloader.
- Production-ready Robustness

  In production, the number one source of support issues is Datapusher - primarily because of data quality issues and Datapusher's inability to correctly infer data types, gracefully handle errors[^3], and give the Data Publisher actionable information to correct the data. Datapusher+'s design directly addresses all these issues.
- More informative datastore loading messages

  Datapusher+ messages are designed to be more verbose and actionable, giving the data publisher a far better user experience and making a resource-first upload workflow possible.
- Extended data-wrangling with qsv

  Apart from bullet-proof data type inferences, Datapusher+ leverages qsv to convert XLS/ODS files, count the number of rows, transcode to UTF-8 if required, validate that a CSV conforms to the RFC 4180 standard, optionally create a preview subset, and optionally deduplicate rows in this initial version.

  Future versions of Datapusher+ will further leverage qsv's 70+ commands to do additional data wrangling, preprocessing and validation. The Roadmap is available here. Ideas, suggestions and your feedback are most welcome!
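To illustrate the kind of qsv integration involved, here is a minimal sketch (not DP+'s actual implementation) of driving qsv's type inferencing from Python. It assumes the qsv binary is on the PATH and that `qsv stats` emits `field` and `type` columns in its output:

```python
# A sketch of qsv-based type inference: shell out to `qsv stats`,
# then read the per-column inferred types from the CSV it prints.
# Assumes the qsv binary is installed and on the PATH.
import csv
import io
import subprocess

def infer_column_types(csv_path):
    """Return a {column_name: inferred_type} mapping using `qsv stats`."""
    result = subprocess.run(
        ["qsv", "stats", csv_path],
        capture_output=True, text=True, check=True,
    )
    reader = csv.DictReader(io.StringIO(result.stdout))
    return {row["field"]: row["type"] for row in reader}

print(infer_column_types("sample.csv"))
# e.g. {'id': 'Integer', 'price': 'Float', 'created': 'DateTime'}
```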
[^1]: Why use qsv instead of a "proper" python data analysis library like pandas?
[^2]: It takes 0.16 seconds with an index to run qsv stats against the qsv whirlwind tour sample file on a Ryzen 4800H (8 physical/16 logical cores) with 32 GB memory and a 1 TB SSD. Without an index, it takes 1.3 seconds.
[^3]: Imagine you have a 1M row CSV, and the last row has an invalid value for a numeric column (e.g. "N/A" instead of a number). After spending hours pushing the data very slowly, legacy Datapusher will abort on the last row and the ENTIRE job is invalid. Ok, that's bad, but what makes it worse is that the old table has already been deleted, and Datapusher doesn't tell you what caused the job to fail! YIKES!!!!
Resource-first Upload Workflow
In traditional CKAN, the dataset package upload workflow is as follows:
- Enter package metadata
- Upload resource/s
- Check if the datapusher uploaded the dataset correctly.
  - With the Datapusher, this may take a while, and when it fails, it doesn't really give you actionable information on why it failed.
  - With xloader, it's 10x faster. But that speed comes at the cost of all columns being defined as text, and the Data Publisher will need to manually change the data types in the Data Dictionary and reload the data again.
In TNRIS/TWDB's extensive user research, one of the key usability gaps they found with CKAN is this workflow. Why can't the data publisher upload the primary resource first, before entering the metadata? And more importantly, why can't some of the metadata be automatically inferred and populated based on the attributes of the dataset?
This is why qsv's speed is critical for a Resource-first upload workflow. By the time the data publisher uploads the resource and starts populating the rest of the form a few seconds later, a lot of inferred metadata (Data Dictionary for this initial version) should be available for pre-populating the rest of the form.
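Once DP+ has pushed a resource, that inferred metadata is visible through CKAN's standard datastore_search action. A minimal sketch of reading the inferred field types back, assuming a reachable CKAN site (the URL and resource id below are placeholders, not values from this project):

```python
# Sketch: read back the field types DP+ inferred for a pushed resource,
# using CKAN's datastore_search action with limit=0 (schema only).
# CKAN_URL and RESOURCE_ID are placeholders.
import requests

CKAN_URL = "https://demo.ckan.org"
RESOURCE_ID = "my-resource-id"

resp = requests.get(
    f"{CKAN_URL}/api/3/action/datastore_search",
    params={"resource_id": RESOURCE_ID, "limit": 0},
)
resp.raise_for_status()
for field in resp.json()["result"]["fields"]:
    print(field["id"], "->", field["type"])  # e.g. price -> numeric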
See this discussion and this issue about the "Multi-pass Datapusher" from May 2015 for additional context.
Development installation
Datapusher+ is a drop-in replacement for Datapusher, so it's installed the same way.
Create a virtual environment for Datapusher+ using at least python 3.7:

```bash
python -m venv dpplus_venv
. dpplus_venv/bin/activate
cd dpplus_venv
```
ℹ️ NOTE: DP+ requires at least python 3.7. However, Ubuntu 18.04 LTS only comes with python 3.6. To install python 3.7 on Ubuntu 18.04 (or even a higher version, as DP+ works with python 3.7 and above), follow the procedure below:

```bash
sudo add-apt-repository ppa:deadsnakes/ppa
# we use 3.7 here, but you can get a higher version by changing the version suffix of the packages below
sudo apt install python3.7 python3.7-venv python3.7-dev
```
Note that DP+ still works with CKAN<=2.8, which uses older versions of python.
Install the required packages:

```bash
sudo apt-get install python-dev python-virtualenv build-essential libxslt1-dev libxml2-dev zlib1g-dev git libffi-dev
```
Get the code:

```bash
git clone https://github.com/datHere/datapusher-plus
cd datapusher-plus
```
Install the dependencies:

```bash
pip install -r requirements-dev.txt
pip install -e .
```
Install qsv:

Download the appropriate precompiled binaries for your platform and copy them to the appropriate directory, e.g. for Linux:

```bash
wget https://github.com/jqnatividad/qsv/releases/download/0.46.1/qsv-0.46.1-x86_64-unknown-linux-gnu.zip
unzip qsv-0.46.1-x86_64-unknown-linux-gnu.zip
sudo mv qsv /usr/local/bin
sudo mv qsvlite /usr/local/bin
sudo mv qsvdp /usr/local/bin
```
Alternatively, if you want to install qsv from source, follow the instructions here. Note that when compiling from source, you may want to look into the Performance Tuning section to squeeze even more performance from qsv.
ℹ️ NOTE: qsv is a general purpose CSV data-wrangling toolkit that gets regular updates. To update to the latest version, just run qsv with the --update option:

```bash
sudo qsv --update
sudo qsvlite --update
```

and it will check the repo for the latest version and update as required.
Copy datapusher/settings.py to a new file like settings_local.py and modify your configuration as required. Make sure to create the datapusher PostgreSQL user first (see DataPusher+ Database Setup).

```bash
cp datapusher/settings.py settings_local.py
nano settings_local.py
```
Run DataPusher+:

```bash
python datapusher/main.py settings_local.py
```

By default, DataPusher+ should be running at http://localhost:8800/
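To verify the service is up, you can hit the /status endpoint that ckan-service-provider based services expose (a sketch; assumes the default host and port above):

```python
# Sketch: smoke-test a running DP+ instance. ckan-service-provider
# services expose /status, which returns JSON including the registered
# job types (push_to_datastore for DP+).
import requests

resp = requests.get("http://localhost:8800/status")
resp.raise_for_status()
print(resp.json())
```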
Production deployment
These instructions assume you already have CKAN installed on this server in the default location described in the CKAN install documentation (/usr/lib/ckan/default). If this is correct you should be able to run the following commands directly; if not, you will need to adapt the paths to your needs.
These instructions set up the DataPusher web service on uWSGI running on port 8800, but can be easily adapted to other WSGI servers like Gunicorn. You'll probably need to set up Nginx as a reverse proxy in front of it and something like Supervisor to keep the process up.
```bash
# Install requirements for DataPusher+
sudo apt install python3-venv python3-dev build-essential libxslt1-dev libxml2-dev libffi-dev

# Create a virtualenv for DataPusher+. DP+ requires python 3.7+.
# If you are on Ubuntu 18.04 LTS and installed python3.7 manually as noted above
sudo python3.7 -m venv /usr/lib/ckan/datapusher-plus
# If you already have Python 3.7+
sudo python3 -m venv /usr/lib/ckan/datapusher-plus

# Install qsvdp binary, if required
wget https://github.com/jqnatividad/qsv/releases/download/0.46.1/qsv-0.46.1-x86_64-unknown-linux-gnu.zip
unzip qsv-0.46.1-x86_64-unknown-linux-gnu.zip
sudo mv qsvdp /usr/local/bin

# Install DataPusher-plus and uwsgi for production
sudo /usr/lib/ckan/datapusher-plus/bin/pip install datapusher-plus uwsgi

# Generate a settings file and tune it, as well as a uwsgi ini file
sudo mkdir -p /etc/ckan/datapusher
sudo curl https://raw.githubusercontent.com/dathere/datapusher-plus/master/datapusher/settings.py -o /etc/ckan/datapusher/settings.py
sudo curl https://raw.githubusercontent.com/dathere/datapusher-plus/master/deployment/datapusher-uwsgi.ini -o /etc/ckan/datapusher/uwsgi.ini

# Initialize the database
/usr/lib/ckan/datapusher-plus/bin/datapusher_initdb /etc/ckan/datapusher/settings.py

# Create a user to run the web service (if necessary)
sudo addgroup www-data
sudo adduser -G www-data www-data
```
At this point you can run DataPusher+ with the following command:

```bash
/usr/lib/ckan/datapusher-plus/bin/uwsgi --enable-threads -i /etc/ckan/datapusher/uwsgi.ini
```

You might need to change the uid and gid settings when using a different user.
To deploy it using supervisor:

```bash
sudo curl https://raw.githubusercontent.com/dathere/datapusher-plus/master/deployment/datapusher-uwsgi.conf -o /etc/supervisor/conf.d/datapusher-uwsgi.conf
sudo service supervisor restart
```
Configuring
CKAN Configuration
Add datapusher to the plugins in your CKAN configuration file (generally located at /etc/ckan/default/production.ini or /etc/ckan/default/ckan.ini):

```ini
ckan.plugins = <other plugins> datapusher
```

In order to tell CKAN where this webservice is located, the following must be added to the [app:main] section of your CKAN configuration file:

```ini
ckan.datapusher.url = http://127.0.0.1:8800/
```
There are other CKAN configuration options that allow you to customize the CKAN - DataPusher integration. Please refer to the DataPusher Settings section in the CKAN documentation for more details.
ℹ️ NOTE: DP+ recognizes some additional TSV and spreadsheet subformats - xlsm and xlsb for Excel Spreadsheets, and tab for TSV files. To process these subformats, set ckan.datapusher.formats as follows in your CKAN.INI file:

```ini
ckan.datapusher.formats = csv xls xlsx xlsm xlsb tsv tab application/csv application/vnd.ms-excel application/vnd.openxmlformats-officedocument.spreadsheetml.sheet ods application/vnd.oasis.opendocument.spreadsheet
```

and add this entry to your CKAN's resource_formats.json file:

```json
["TAB", "Tab Separated Values File", "text/tab-separated-values", []],
```
DataPusher+ Configuration
The DataPusher+ instance is configured in the deployment/datapusher_settings.py file. The location of this file can be adjusted using the JOB_CONFIG environment variable, which should provide an absolute path to a python-formatted config file.
Here's a summary of the options available.
Name | Default | Description
---|---|---
HOST | '0.0.0.0' | Web server host
PORT | 8800 | Web server port
SQLALCHEMY_DATABASE_URI | 'postgresql://datapusher_jobs:YOURPASSWORD@localhost/datapusher_jobs' | SQLAlchemy Database URL. See note below about setting up the datapusher_jobs db beforehand.
MAX_CONTENT_LENGTH | '1024000' | Max size of files to process in bytes
CHUNK_SIZE | '16384' | Chunk size when processing the data file
DOWNLOAD_TIMEOUT | '30' | Download timeout for requesting the file
SSL_VERIFY | False | Do not validate SSL certificates when requesting the data file (Warning: Do not use this setting in production)
TYPES | 'String', 'Float', 'Integer', 'DateTime', 'Date', 'NULL' | The data types that qsv can infer
TYPE_MAPPING | {'String': 'text', 'Integer': 'numeric', 'Float': 'numeric', 'DateTime': 'timestamp', 'Date': 'timestamp', 'NULL': 'text'} | Internal qsv type mapping to PostgreSQL types
LOG_FILE | /tmp/ckan_service.log | Where to write the logs. Use an empty string to disable
STDERR | True | Log to stderr?
QSV_BIN | /usr/local/bin/qsvdp | The location of the qsv binary to use. qsvdp is the DP+-optimized version of qsv. It only has the commands used by DP+, has the self-update engine removed, and is 6x smaller than qsv and 3x smaller than qsvlite. You may also want to look into using qsvdp_nightly for even more performance.
PREVIEW_ROWS | 1000 | The number of rows to insert into the data store. Set to 0 to insert all rows
QSV_DEDUP | True | Automatically deduplicate rows?
DEFAULT_EXCEL_SHEET | 0 | The zero-based index of the Excel sheet to export to CSV and insert into the Datastore. Negative values are accepted, i.e. -1 is the last sheet, -2 is the 2nd to the last, etc.
AUTO_ALIAS | True | Automatically create a resource alias (RESOURCE_NAME-PACKAGE_NAME-OWNER_ORG) that's easier to use in API calls and with the scheming datastore_choices helper
WRITE_ENGINE_URL | | The Postgres connection string to use to write to the Datastore using Postgres COPY. This should be similar to your ckan.datastore.write_url, except you'll need to use the datapusher user
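As an illustration, a local settings override might look like the following. This is a minimal sketch, not a complete config; the password, log path, and connection strings are placeholders:

```python
# settings_local.py - a hedged sketch of a DP+ config override.
# All credentials and paths below are placeholders.

HOST = "0.0.0.0"
PORT = 8800

# Job store database (see DataPusher+ Database Setup below)
SQLALCHEMY_DATABASE_URI = (
    "postgresql://datapusher_jobs:YOURPASSWORD@localhost/datapusher_jobs"
)

# Direct COPY connection to the Datastore, using the datapusher user
WRITE_ENGINE_URL = (
    "postgresql://datapusher:thepassword@localhost/datastore_default"
)

QSV_BIN = "/usr/local/bin/qsvdp"
PREVIEW_ROWS = 1000  # 0 inserts all rows
QSV_DEDUP = True
LOG_FILE = "/tmp/ckan_service.log"
```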
All of the configuration options above can also be provided as environment variables by prepending the name with DATAPUSHER_, e.g. DATAPUSHER_SQLALCHEMY_DATABASE_URI, DATAPUSHER_PORT, etc. For variables with boolean values you must use 1 or 0.
DataPusher+ Database Setup
DP+ requires a dedicated PostgreSQL account named datapusher to connect to the CKAN Datastore. To create the datapusher user and give it the required privileges to the datastore_default database:
```
su - postgres
psql -d datastore_default

CREATE ROLE datapusher LOGIN PASSWORD 'thepassword';
GRANT CREATE, CONNECT, TEMPORARY ON DATABASE datastore_default TO datapusher;
GRANT SELECT, INSERT, UPDATE, DELETE, TRUNCATE ON ALL TABLES IN SCHEMA public TO datapusher;
\q
```
DP+ also requires its own job_store database to keep track of all DP+ jobs. In the original Datapusher, this was a sqlite database by default. Though DP+ can still use a sqlite database, we discourage its use.
To set up the datapusher_jobs database and its user:

```bash
sudo -u postgres createuser -S -D -R -P datapusher_jobs
sudo -u postgres createdb -O datapusher_jobs datapusher_jobs -E utf-8
```
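Before starting DP+, you can sanity-check both connections with a short script (a sketch using psycopg2; the passwords are the placeholders used above):

```python
# Sketch: verify the datapusher and datapusher_jobs roles can connect.
# Passwords are placeholders; requires `pip install psycopg2-binary`.
import psycopg2

for dsn in (
    "postgresql://datapusher:thepassword@localhost/datastore_default",
    "postgresql://datapusher_jobs:YOURPASSWORD@localhost/datapusher_jobs",
):
    with psycopg2.connect(dsn) as conn:
        with conn.cursor() as cur:
            cur.execute("SELECT current_user, current_database()")
            user, db = cur.fetchone()
            print(f"{user} connected to {db} OK")
```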
Usage
DataPusher+ will attempt to load any file that has one of the supported formats (defined in ckan.datapusher.formats) into the DataStore.
You can also manually trigger resources to be resubmitted. When editing a resource in CKAN (clicking the "Manage" button on a resource page), a new tab named "DataStore" will appear. This will contain a log of the last attempted upload and a button to retry the upload.
Command line
Run the following command to submit all resources to datapusher (files whose data file hash has not changed will be skipped):

```bash
ckan -c /etc/ckan/default/ckan.ini datapusher resubmit
```

On CKAN<=2.8:

```bash
paster --plugin=ckan datapusher resubmit -c /etc/ckan/default/ckan.ini
```

To resubmit a specific resource, whether or not the hash of the data file has changed:

```bash
ckan -c /etc/ckan/default/ckan.ini datapusher submit {dataset_id}
```

On CKAN<=2.8:

```bash
paster --plugin=ckan datapusher submit <pkgname> -c /etc/ckan/default/ckan.ini
```
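You can also trigger a push for a single resource programmatically through CKAN's action API. A sketch, assuming the datapusher plugin is enabled; the site URL, API token, and resource id are placeholders:

```python
# Sketch: trigger DP+ for one resource via CKAN's datapusher_submit action.
# CKAN_URL, API_TOKEN, and RESOURCE_ID are placeholders.
import requests

CKAN_URL = "https://demo.ckan.org"
API_TOKEN = "your-api-token"
RESOURCE_ID = "my-resource-id"

resp = requests.post(
    f"{CKAN_URL}/api/3/action/datapusher_submit",
    json={"resource_id": RESOURCE_ID},
    headers={"Authorization": API_TOKEN},
)
resp.raise_for_status()
print(resp.json()["success"])
```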
License
This material is copyright (c) 2020 Open Knowledge Foundation and other contributors.

It is open and licensed under the GNU Affero General Public License (AGPL) v3.0, whose full text may be found at https://www.gnu.org/licenses/agpl-3.0.html