
This project contains the Universal Transfer Operator, which can transfer any data that can be read from a source Dataset into a destination Dataset. From a DAG author's standpoint, all transfers are performed through the invocation of a single operator: the Universal Transfer Operator.

Project description

Universal Transfer Operator

transfers made easy


The Universal Transfer Operator simplifies how users transfer data from a source to a destination using Apache Airflow. It offers a consistent, provider-agnostic interface, improving the user experience so that users do not need to explicitly work with specific providers or operators.

At the moment, it supports transferring data between file locations and databases (in both directions) and cross-database transfers.

This project is maintained by Astronomer.

Installation

pip install apache-airflow-provider-transfers

Example DAGs

Check out the example_dags folder for examples of how the UniversalTransferOperator can be used.

How Universal Transfer Operator Works

Approach

With Universal Transfer Operator, users can perform data transfers using the following transfer modes:

  1. Non-native
  2. Native
  3. Third-party

Non-native transfer

Non-native transfers rely on moving the data through the Airflow worker node. Chunking is applied where possible. This method can be suitable for datasets smaller than 2 GB, depending on the source and target. The performance of this method depends heavily on the worker node's memory, disk, processor, and network configuration.

Internally, the steps involved are:

  • Retrieve the dataset data in chunks from dataset storage to the worker node.
  • Send data to the cloud dataset from the worker node.
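The two steps above can be sketched in plain Python. This is an illustrative, hypothetical sketch, not the library's implementation: the function name, chunk size, and in-memory stand-ins are assumptions, chosen only to show how chunking keeps the worker's memory footprint bounded.

```python
from io import BytesIO

CHUNK_SIZE = 1024 * 1024  # 1 MiB per chunk; hypothetical default


def transfer_in_chunks(source, destination, chunk_size=CHUNK_SIZE):
    """Move data from source to destination via the worker, one chunk at a time.

    `source` and `destination` are any binary file-like objects; the worker
    never holds more than one chunk in memory at once.
    """
    total = 0
    while True:
        chunk = source.read(chunk_size)
        if not chunk:
            break
        destination.write(chunk)
        total += len(chunk)
    return total


# In-memory stand-ins for the source dataset storage and the cloud dataset.
src = BytesIO(b"x" * (3 * 1024 * 1024 + 7))
dst = BytesIO()
moved = transfer_in_chunks(src, dst)
```

Because only one chunk is resident at a time, the worker's memory requirement stays constant regardless of dataset size; throughput, however, is still bounded by the worker's network and disk.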

Following is an example of a non-native transfer between Google Cloud Storage and SQLite:

https://github.com/astronomer/apache-airflow-provider-transfers/blob/main/example_dags/example_universal_transfer_operator.py#L37-L41

Improving bottlenecks by using native transfer

An alternative to the non-native transfer method is the native method. Native transfers rely on mechanisms and tools offered by the data source or data target providers. When moving from object storage to a Snowflake database, for instance, a native transfer consists of using the built-in COPY INTO command. When loading data from S3 to BigQuery, the Universal Transfer Operator uses the GCP Storage Transfer Service.

The benefit of native transfers is that they will likely perform better for larger datasets (over 2 GB) and do not rely on the Airflow worker node's hardware configuration. With this approach, the Airflow worker nodes act only as orchestrators and do not perform the transfer themselves. The speed depends exclusively on the service being used and the bandwidth between the source and destination.

Steps:

  • Request destination dataset to ingest data from the source dataset.
  • Destination dataset requests source dataset for data.
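For the Snowflake case mentioned above, the worker only issues a statement and the warehouse pulls the data itself. A minimal sketch of building such a COPY INTO statement follows; the table, stage, and builder function names are hypothetical, shown only to illustrate the orchestrate-don't-move pattern.

```python
def build_copy_into(table: str, stage: str, file_format: str = "CSV") -> str:
    """Build a Snowflake COPY INTO statement that makes the destination
    ingest data directly from a stage pointing at object storage.

    The Airflow worker only orchestrates: it sends this statement, and
    Snowflake performs the transfer itself.
    """
    return (
        f"COPY INTO {table} "
        f"FROM @{stage} "
        f"FILE_FORMAT = (TYPE = {file_format})"
    )


# Hypothetical names; an external stage would point at e.g. an S3 bucket.
sql = build_copy_into("my_db.public.orders", "my_s3_stage")
```

In a real DAG, the statement would be executed through a Snowflake connection; the data never passes through the worker.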

NOTE: The Native method implementation is in progress and will be available in future releases.

Transfer using a third-party tool

The Universal Transfer Operator can also offer an interface to third-party data transfer services, such as Fivetran.

Here is an example of how to use Fivetran for transfers:

https://github.com/astronomer/apache-airflow-provider-transfers/blob/main/example_dags/example_dag_fivetran.py#L52-L58
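With a third-party tool, the operator's job reduces to triggering the external service and letting it move the data. The sketch below builds (but does not send) the HTTP request that asks Fivetran to run a sync for one connector; the endpoint path is based on Fivetran's public REST API, and the function name, connector ID, and credentials are hypothetical.

```python
import base64
import urllib.request


def build_fivetran_sync_request(connector_id: str, api_key: str, api_secret: str):
    """Build (but do not send) the HTTP request that triggers a Fivetran
    sync for one connector. The transfer itself is then performed
    entirely by Fivetran, not by the Airflow worker.
    """
    # Fivetran's REST API uses HTTP Basic auth with the API key and secret.
    token = base64.b64encode(f"{api_key}:{api_secret}".encode()).decode()
    return urllib.request.Request(
        url=f"https://api.fivetran.com/v1/connectors/{connector_id}/sync",
        method="POST",
        headers={"Authorization": f"Basic {token}"},
    )


# Hypothetical credentials and connector ID, for illustration only.
req = build_fivetran_sync_request("my_connector", "key", "secret")
```

An operator using this pattern would send the request and then poll the service for sync completion, keeping the worker node out of the data path entirely.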

Supported technologies

Documentation

The documentation is a work in progress; we aim to follow the Diátaxis system.

Changelog

The Universal Transfer Operator follows semantic versioning for releases. Check the changelog for the latest changes.

Release management

See Managing Releases to learn more about our release philosophy and steps.

Contribution guidelines

All contributions, bug reports, bug fixes, documentation improvements, enhancements, and ideas are welcome.

Read the Contribution Guideline for a detailed overview of how to contribute.

Contributors and maintainers should abide by the Contributor Code of Conduct.

License

Apache License 2.0
