
Kedro-Airflow makes it easy to deploy Kedro projects to Airflow

Project description

Kedro-Airflow


Apache Airflow is a tool for orchestrating complex workflows and data processing pipelines. The Kedro-Airflow plugin can be used for:

  • Rapid pipeline creation in the prototyping phase. You can write Python functions in Kedro without worrying about schedulers, daemons, services or having to recreate the Airflow DAG file.
  • Automatic dependency resolution in Kedro. This allows you to bypass Airflow's need to specify the order of your tasks.
  • Distributing Kedro tasks across many workers. You can also enable monitoring and scheduling of the tasks' runtimes.

How do I install Kedro-Airflow?

kedro-airflow is a Python plugin. To install it:

pip install kedro-airflow

How do I use Kedro-Airflow?

The Kedro-Airflow plugin adds a kedro airflow create CLI command that generates an Airflow DAG file in the airflow_dags folder of your project. At runtime, this file translates your Kedro pipeline into Airflow Python operators. This DAG object can be modified according to your needs and you can then deploy your project to Airflow by running kedro airflow deploy.

Prerequisites

The following conditions must be true for Airflow to run your pipeline:

  • Your project directory must be available to the Airflow runners in the directory listed at the top of the DAG file.
  • Your source code must be on the Python path (by default the DAG file takes care of this).
  • All datasets must be explicitly listed in catalog.yml and reachable for the Airflow workers. Kedro-Airflow does not support MemoryDataSet or datasets that require Spark.
  • All local paths in configuration files (notably in catalog.yml and logging.yml) should be absolute paths and not relative paths.
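For example, a minimal catalog.yml entry with an absolute filepath might look like the sketch below. The dataset type and paths shown are placeholders rather than anything Kedro-Airflow prescribes; use whichever file-based dataset type your Kedro version provides.

example_table:
  type: CSVLocalDataSet  # placeholder: any file-based dataset type from your Kedro version
  filepath: /Users/<user-name>/new-kedro-project/data/01_raw/example_table.csv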

Process

  1. Run kedro airflow create to generate a DAG file for your project.
  2. If needed, customize the DAG file as described below.
  3. Run kedro airflow deploy which will copy the DAG file from the airflow_dags folder in your Kedro project into the dags folder in the Airflow home directory.

Note: The generated DAG file will be placed in $AIRFLOW_HOME/dags/ when kedro airflow deploy is run, where AIRFLOW_HOME is an environment variable. If the environment variable is not defined, Kedro-Airflow will create ~/airflow and ~/airflow/dags (if required) and copy the DAG file into it.
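Putting the steps together, a typical session from the root of your Kedro project looks something like this (the generated file name depends on your project):

kedro airflow create   # generates the DAG file in the airflow_dags folder
# ... review and customize the generated DAG file if needed ...
kedro airflow deploy   # copies the DAG file into $AIRFLOW_HOME/dags/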

Customization

There are a number of items in the DAG file that you may want to customize, including:

  • Source location
  • Project location
  • DAG construction
  • Default operator arguments
  • Operator-specific arguments
  • Airflow context and execution date

The following sections guide you to the appropriate location within the file.

Source location

The line sys.path.append("/Users/<user-name>/new-kedro-project/src") enables Python and Airflow to find your project source.

Project location

The line project_path = "/Users/<user-name>/new-kedro-project" sets the location for your project directory. This is passed to your get_config method.
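Taken together, and with placeholder paths, these two settings at the top of the generated DAG file look roughly like this:

import sys

# Point both paths at your own project checkout.
sys.path.append("/Users/<user-name>/new-kedro-project/src")
project_path = "/Users/<user-name>/new-kedro-project"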

DAG construction

The construction of the actual DAG object can be altered as needed. You can learn more about how to do this by going through the Airflow tutorial.
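For orientation, a bare-bones Airflow DAG construction looks something like the sketch below; the generated file will differ, and any of these arguments (dag_id, start_date, schedule_interval, and so on) can be adjusted there. The values shown are made up.

from datetime import datetime, timedelta
from airflow import DAG

dag = DAG(
    dag_id="new-kedro-project",           # hypothetical id
    start_date=datetime(2019, 1, 1),      # illustrative start date
    schedule_interval=timedelta(days=1),  # run once a day
)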

Default operator arguments

The default arguments for the Airflow operators are contained in the default_args dictionary.
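The keys in the generated dictionary may differ, but a typical Airflow default_args dictionary looks something like this (values are illustrative only):

from datetime import timedelta

default_args = {
    "owner": "airflow",        # illustrative owner
    "depends_on_past": False,  # do not wait for the previous run
    "retries": 1,              # retry each task once on failure
    "retry_delay": timedelta(minutes=5),
}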

Operator-specific arguments

The operator_specific_arguments callback is called to retrieve any additional arguments specific to individual operators. It is passed the Airflow task_id and should return a dictionary of additional arguments. For example, to change the number of retries on the node named analysis to 5, you might write:

def operator_specific_arguments(task_id):
    # Give the Airflow task generated from the "analysis" node five retries.
    if task_id == "analysis":
        return {"retries": 5}
    return {}

The easiest way to find the correct task_id is to use Airflow's list_tasks command (for example, airflow list_tasks <dag-id>).

Airflow context and execution date

The process_context callback provides a hook for ingesting Airflow's Jinja context. It is called before every node, receives the context and catalog and must return a catalog. A common use of this is to pick up the execution date and either insert it into the catalog or modify the catalog based on it.

The list of default context variables is available in the Airflow documentation.
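As a sketch only, and assuming the callback receives the catalog followed by the Airflow context as keyword arguments (check the generated DAG file for the exact signature), you could pick up the execution date and expose it to your nodes through the catalog:

def process_context(catalog, **airflow_context):
    # "ds" is Airflow's execution date rendered as YYYY-MM-DD.
    execution_date = airflow_context["ds"]
    # Make the date available to any node that declares "execution_date" as an input.
    catalog.add_feed_dict({"execution_date": execution_date}, replace=True)
    return catalog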

What licence do you use?

Kedro-Airflow is licensed under the Apache 2.0 License.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

kedro-airflow-0.1.0.tar.gz (8.0 kB)

Built Distribution

kedro_airflow-0.1.0-py3-none-any.whl (11.5 kB)

File details

Details for the file kedro-airflow-0.1.0.tar.gz.

File metadata

  • Download URL: kedro-airflow-0.1.0.tar.gz
  • Upload date:
  • Size: 8.0 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/1.13.0 pkginfo/1.5.0.1 requests/2.22.0 setuptools/41.0.1 requests-toolbelt/0.9.1 tqdm/4.32.2 CPython/3.6.9

File hashes

Hashes for kedro-airflow-0.1.0.tar.gz:

  • SHA256: 1e22dcac5426210bf8b0fb637fd749f81a33de00d01e3c478c692fac0e1c9ed2
  • MD5: 0a37045f9bd46506f7293e50b3ee9e4f
  • BLAKE2b-256: 0c69e3dbea513d81e5a156b6276c58d7eb9cd5e0467a481d0d8deff1348824b7


File details

Details for the file kedro_airflow-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: kedro_airflow-0.1.0-py3-none-any.whl
  • Upload date:
  • Size: 11.5 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/1.13.0 pkginfo/1.5.0.1 requests/2.22.0 setuptools/41.0.1 requests-toolbelt/0.9.1 tqdm/4.32.2 CPython/3.6.9

File hashes

Hashes for kedro_airflow-0.1.0-py3-none-any.whl:

  • SHA256: 475201a4555e2fdd60f2efc22ef3951d5184f1489b844b35cb690a5d695dad7e
  • MD5: 10e4bc8494226960b3fe7b19b522c32e
  • BLAKE2b-256: 3678a9383daca26a2e00ad24e44f4eeee63c9a9bbd8201fdf90f2cbda71701a4

