Affordable Databricks Workflows in Apache Airflow
Astro Databricks
Orchestrate your Databricks notebooks in Airflow and execute them as Databricks Workflows
The Astro Databricks Provider is an Apache Airflow provider created by Astronomer to run your Databricks notebooks as Databricks Workflows while maintaining Airflow as the authoring interface. When using the DatabricksWorkflowTaskGroup and DatabricksNotebookOperator, notebooks run as a Databricks Workflow, which can result in a 75% cost reduction ($0.40/DBU for all-purpose compute versus $0.10/DBU for Jobs compute).
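The 75% figure follows directly from the two DBU rates quoted above. A minimal sketch of the arithmetic, where the 100-DBU workload is a hypothetical figure chosen only for illustration and actual Databricks pricing may differ by plan and region:

```python
from decimal import Decimal

# Rates in USD per DBU, as quoted in the project description.
ALL_PURPOSE_RATE = Decimal("0.40")
JOBS_RATE = Decimal("0.10")


def cost_reduction(old_rate, new_rate):
    """Fraction of the cost saved when moving from old_rate to new_rate."""
    return (old_rate - new_rate) / old_rate


def workload_cost(dbus, rate):
    """Cost in USD of a workload that consumes `dbus` DBUs at `rate`."""
    return Decimal(dbus) * rate


if __name__ == "__main__":
    hypothetical_dbus = 100  # illustrative workload size, not from the source
    print("Reduction:", cost_reduction(ALL_PURPOSE_RATE, JOBS_RATE))
    print("All-purpose cost:", workload_cost(hypothetical_dbus, ALL_PURPOSE_RATE))
    print("Jobs cost:", workload_cost(hypothetical_dbus, JOBS_RATE))
```

Decimal is used instead of float so that the ratio comes out as exactly 0.75 rather than a rounded binary approximation.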
Prerequisites
- Apache Airflow >= 2.2.4
- Python >= 3.7
- Databricks account
- Previously created Databricks Notebooks
Install
pip install astro-provider-databricks
Quickstart
- Use two pre-existing Databricks Notebooks or create two simple ones. Their identifiers will be used in step (5). The original example DAG uses:
Shared/Notebook_1
Shared/Notebook_2
- Generate a Databricks personal access token. This will be used in step (4).
- Ensure that your Airflow environment is set up correctly by running the following commands:
export AIRFLOW_HOME=`pwd`
airflow db init
- Create a Databricks connection in Airflow by running one of the following commands, replacing the login, host, and password (use your personal access token as the password):
# If using Airflow 2.3 or higher:
airflow connections add 'databricks_conn' \
    --conn-json '{
        "conn_type": "databricks",
        "login": "some.email@yourcompany.com",
        "host": "https://dbc-c9390870-65ef.cloud.databricks.com/",
        "password": "personal-access-token"
    }'
# If using Airflow 2.2.4 or higher but lower than 2.3:
airflow connections add 'databricks_conn' \
    --conn-type 'databricks' \
    --conn-login 'some.email@yourcompany.com' \
    --conn-host 'https://dbc-c9390870-65ef.cloud.databricks.com/' \
    --conn-password 'personal-access-token'
- Copy the example workflow into a file named example_databricks_workflow.py and add it to the dags directory of your Airflow project. Alternatively, you can download example_databricks_workflow.py:
curl -O https://raw.githubusercontent.com/astronomer/astro-provider-databricks/main/example_dags/example_databricks_workflow.py
- Run the example DAG:
airflow dags test example_databricks_workflow `date -Iseconds`
The command logs, among other lines, the Databricks job run URL:
[2023-03-13 15:27:09,934] {notebook.py:158} INFO - Check the job run in Databricks: https://dbc-c9390870-65ef.cloud.databricks.com/?o=4256138892007661#job/950578808520081/run/14940832
This will create a Databricks Workflow with two Notebook jobs. The workflow may take about two minutes to complete if the cluster is already running, or approximately five minutes if the cluster has to start first, depending on its initialisation time.
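The connection command in step (4) embeds a JSON document inside a shell string, which is easy to misquote. As a hedged sketch, the same command can be assembled in Python so the JSON is generated and quoted for you. The helper name build_connection_command and the DATABRICKS_TOKEN environment variable are assumptions made for this example and are not part of the provider; shlex.join requires Python 3.8 or later.

```python
import json
import os
import shlex


def build_connection_command(conn_id, login, host, token):
    """Return an `airflow connections add ... --conn-json ...` command line."""
    payload = {
        "conn_type": "databricks",
        "login": login,
        "host": host,
        "password": token,  # the personal access token goes in the password field
    }
    args = [
        "airflow", "connections", "add", conn_id,
        "--conn-json", json.dumps(payload),
    ]
    # shlex.join quotes each argument so the JSON survives the shell intact.
    return shlex.join(args)


if __name__ == "__main__":
    command = build_connection_command(
        conn_id="databricks_conn",
        login="some.email@yourcompany.com",
        host="https://dbc-c9390870-65ef.cloud.databricks.com/",
        # DATABRICKS_TOKEN is a hypothetical variable name used for this sketch.
        token=os.environ.get("DATABRICKS_TOKEN", "personal-access-token"),
    )
    print(command)
```

Reading the token from the environment keeps it out of the script itself. The --conn-json form applies to Airflow 2.3 or higher, as noted in step (4).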
Available features
- DatabricksWorkflowTaskGroup: Airflow task group that allows users to create a Databricks Workflow.
- DatabricksNotebookOperator: Airflow operator which abstracts a pre-existing Databricks Notebook. Can be used independently to run the Notebook, or within a Databricks Workflow Task Group.
- AstroDatabricksPlugin: An Airflow plugin which is installed by default. It allows users to view a Databricks job in the UI and retry it in case of failure.
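To show how the task group and the operator fit together, here is a minimal sketch of a two-notebook workflow. It is not the project's example_databricks_workflow.py. The parameter names reflect our understanding of the provider's API and should be checked against the released example file. The cluster settings (spark_version, node_type_id, num_workers), the job cluster key, and the leading slash in the notebook paths are assumptions that must be adapted to your workspace.

```python
from datetime import datetime

from airflow.models.dag import DAG
from astro_databricks import DatabricksNotebookOperator, DatabricksWorkflowTaskGroup

# Connection ID created in the Quickstart.
DATABRICKS_CONN_ID = "databricks_conn"

# Hypothetical job cluster definition in the Databricks Jobs API format.
job_cluster_spec = [
    {
        "job_cluster_key": "astro_databricks",
        "new_cluster": {
            "spark_version": "11.3.x-scala2.12",
            "node_type_id": "i3.xlarge",
            "num_workers": 1,
        },
    }
]

with DAG(
    dag_id="example_databricks_workflow",
    start_date=datetime(2022, 1, 1),
    schedule_interval=None,  # run only when triggered manually
    catchup=False,
) as dag:
    # Every task inside this group is meant to run as part of one Databricks Workflow.
    workflow = DatabricksWorkflowTaskGroup(
        group_id="databricks_workflow",
        databricks_conn_id=DATABRICKS_CONN_ID,
        job_clusters=job_cluster_spec,
    )
    with workflow:
        notebook_1 = DatabricksNotebookOperator(
            task_id="notebook_1",
            databricks_conn_id=DATABRICKS_CONN_ID,
            notebook_path="/Shared/Notebook_1",
            source="WORKSPACE",
            job_cluster_key="astro_databricks",
        )
        notebook_2 = DatabricksNotebookOperator(
            task_id="notebook_2",
            databricks_conn_id=DATABRICKS_CONN_ID,
            notebook_path="/Shared/Notebook_2",
            source="WORKSPACE",
            job_cluster_key="astro_databricks",
        )
        # Notebook_2 starts only after Notebook_1 succeeds.
        notebook_1 >> notebook_2
```

Both notebooks reference the same job_cluster_key, so they should share the job cluster declared in job_cluster_spec rather than each starting their own.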
Documentation
The documentation is a work in progress; we aim to follow the Diátaxis system.
Changelog
Astro Databricks follows semantic versioning for releases. Read the changelog to learn about the changes introduced in each version.
Contribution guidelines
All contributions, bug reports, bug fixes, documentation improvements, enhancements, and ideas are welcome.
Read the Contribution Guidelines for a detailed overview on how to contribute.
Contributors and maintainers should abide by the Contributor Code of Conduct.
License
Source Distribution
Built Distribution
File details
Details for the file astro_provider_databricks-0.1.1.tar.gz.
File metadata
- Download URL: astro_provider_databricks-0.1.1.tar.gz
- Upload date:
- Size: 1.8 MB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: python-httpx/0.23.3
File hashes
Algorithm | Hash digest
---|---
SHA256 | 1948dbad60e207fb2e1499303be9b896e1abaa9cb1811e7cdbf3599b85fb931c
MD5 | 3faf938bbb47bfd3b6f42b435c75d767
BLAKE2b-256 | 76b102c8c38d4269b3d4aaa059578f06d6e5eb324dc4bc5cd1bf8cee68f72d4d
File details
Details for the file astro_provider_databricks-0.1.1-py3-none-any.whl.
File metadata
- Download URL: astro_provider_databricks-0.1.1-py3-none-any.whl
- Upload date:
- Size: 21.6 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: python-httpx/0.23.3
File hashes
Algorithm | Hash digest
---|---
SHA256 | 4612575e367b6cd60a460a3150e4d982e51d8b93011f8efa2c2aa26bc4caa8ad
MD5 | 65c644a44f1d2759319e09baeaccc4a0
BLAKE2b-256 | 93b84be8054533af948c75bf30b7ca8c5d6f32bcf0ed13a69c3b15fa68d6f602
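The published digests can be used to check a downloaded file before installing it. A minimal sketch using only the standard library follows; the helper names sha256_of_file and matches_published_digest are hypothetical, and the constant below is the SHA256 value listed for the wheel.

```python
import hashlib

# SHA256 digest listed above for astro_provider_databricks-0.1.1-py3-none-any.whl.
WHEEL_SHA256 = "4612575e367b6cd60a460a3150e4d982e51d8b93011f8efa2c2aa26bc4caa8ad"


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of the file at `path`, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_digest(path, expected_hex):
    """True if the file's SHA256 equals the published hex digest (case-insensitive)."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

For example, matches_published_digest("astro_provider_databricks-0.1.1-py3-none-any.whl", WHEEL_SHA256) should return True for an intact download of that wheel.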