Databricks SQLAlchemy plugin for Python

Project description

Databricks dialect for SQLAlchemy 2.0

The Databricks dialect for SQLAlchemy serves as a bridge between SQLAlchemy and the Databricks SQL Python driver. A working example demonstrating usage can be found in examples/sqlalchemy.py.

Usage with SQLAlchemy 1.x

A SQLAlchemy 1.4 compatible dialect was first released in connector version 2.4. Support for SQLAlchemy 1.4 was dropped from the dialect as part of databricks-sql-connector==3.0.0. To continue using the dialect with SQLAlchemy 1.x, you can use databricks-sql-connector^2.4.0.
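
For example, with pip, the caret constraint above roughly translates to the following pin (a sketch; adjust to your environment):

pip install "databricks-sql-connector>=2.4.0,<3.0.0"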

Installation

To install the dialect and its dependencies:

pip install databricks-sqlalchemy

If you also plan to use alembic, install it as well:

pip install alembic

Connection String

Every SQLAlchemy application that connects to a database needs to use an Engine, which you can create by passing a connection string to create_engine. The connection string must include these components:

  1. Host
  2. HTTP Path for a compute resource
  3. API access token
  4. Initial catalog for the connection
  5. Initial schema for the connection

Note: Our dialect is built and tested on workspaces with Unity Catalog enabled. Support for the hive_metastore catalog is untested.

For example:

import os
from sqlalchemy import create_engine

host = os.getenv("DATABRICKS_SERVER_HOSTNAME")
http_path = os.getenv("DATABRICKS_HTTP_PATH")
access_token = os.getenv("DATABRICKS_TOKEN")
catalog = os.getenv("DATABRICKS_CATALOG")
schema = os.getenv("DATABRICKS_SCHEMA")

engine = create_engine(
    f"databricks://token:{access_token}@{host}?http_path={http_path}&catalog={catalog}&schema={schema}"
)

Types

The SQLAlchemy type hierarchy contains backend-agnostic type implementations (represented in CamelCase) and backend-specific types (represented in UPPERCASE). The majority of SQLAlchemy's CamelCase types are supported. This means that a SQLAlchemy application using these types should "just work" with Databricks.

SQLAlchemy Type     Databricks SQL Type
BigInteger          BIGINT
LargeBinary         (not supported)
Boolean             BOOLEAN
Date                DATE
DateTime            TIMESTAMP_NTZ
Double              DOUBLE
Enum                (not supported)
Float               FLOAT
Integer             INT
Numeric             DECIMAL
PickleType          (not supported)
SmallInteger        SMALLINT
String              STRING
Text                STRING
Time                STRING
Unicode             STRING
UnicodeText         STRING
Uuid                STRING
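
As a rough sketch of the table above in practice (the model, table name, and columns here are hypothetical, and engine is assumed to be the Engine from the Connection String example), a model declared with CamelCase types can be created with metadata.create_all():

from sqlalchemy import BigInteger, Boolean, Column, Date, Numeric, String
from sqlalchemy.orm import DeclarativeBase

class Base(DeclarativeBase):
    pass

class Order(Base):
    # Hypothetical example table, used only to illustrate the type mapping above.
    __tablename__ = "orders_example"

    id = Column(BigInteger, primary_key=True)   # BIGINT
    description = Column(String())              # STRING
    amount = Column(Numeric(10, 2))             # DECIMAL
    placed_on = Column(Date())                  # DATE
    fulfilled = Column(Boolean())               # BOOLEAN

# `engine` is the Engine created in the Connection String example above.
Base.metadata.create_all(engine)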

In addition, the dialect exposes three UPPERCASE SQLAlchemy types which are specific to Databricks:

  • databricks.sqlalchemy.TINYINT
  • databricks.sqlalchemy.TIMESTAMP
  • databricks.sqlalchemy.TIMESTAMP_NTZ

LargeBinary() and PickleType()

Databricks Runtime doesn't currently support binding of binary values in SQL queries, which is a prerequisite for this functionality in SQLAlchemy.

Enum() and CHECK constraints

Support for CHECK constraints is not implemented in this dialect. Support is planned for a future release.

SQLAlchemy's Enum() type depends on CHECK constraints and is therefore not yet supported.

DateTime(), TIMESTAMP_NTZ(), and TIMESTAMP()

Databricks Runtime provides two datetime-like types: TIMESTAMP which is always timezone-aware and TIMESTAMP_NTZ which is timezone agnostic. Both types can be imported from databricks.sqlalchemy and used in your models.

The SQLAlchemy documentation indicates that DateTime() is not timezone-aware by default. So our dialect maps this type to TIMESTAMP_NTZ(). In practice, you should never need to use TIMESTAMP_NTZ() directly. Just use DateTime().

If you need your field to be timezone-aware, you can import TIMESTAMP() and use it instead.

Note that SQLAlchemy documentation suggests that you can declare a DateTime() with timezone=True on supported backends. However, if you do this with the Databricks dialect, the timezone argument will be ignored.

from sqlalchemy import Column, DateTime
from databricks.sqlalchemy import TIMESTAMP

class SomeModel(Base):
    some_date_without_timezone  = Column(DateTime())
    some_date_with_timezone     = Column(TIMESTAMP())

String(), Text(), Unicode(), and UnicodeText()

Databricks Runtime doesn't support length limitations for STRING fields. Therefore String(), String(1), and String(255) all produce identical DDL. Since Text(), Unicode(), and UnicodeText() all use the same underlying type in Databricks SQL, they will generate equivalent DDL.
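
One way to see this (a sketch only; the table below is hypothetical and engine is assumed to be the Engine from the Connection String example) is to compile the CREATE TABLE statement without executing it:

from sqlalchemy import Column, MetaData, String, Table
from sqlalchemy.schema import CreateTable

metadata_obj = MetaData()
strings_example = Table(
    "strings_example",
    metadata_obj,
    Column("short", String(1)),
    Column("capped", String(255)),
    Column("unbounded", String()),
)

# Per the mapping above, all three columns should render as STRING in the DDL.
print(CreateTable(strings_example).compile(engine))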

Time()

Databricks Runtime doesn't have a native time-like data type. To implement this type in SQLAlchemy, our dialect stores SQLAlchemy Time() values in a STRING field. Unlike DateTime above, this type can optionally support timezone awareness (since the dialect is in complete control of the strings that we write to the Delta table).

from sqlalchemy import Column, Time

class SomeModel(Base):
    time_tz     = Column(Time(timezone=True))
    time_ntz    = Column(Time())

Usage Notes

Identity() and autoincrement

Identity and generated value support is currently limited in this dialect.

When defining models, SQLAlchemy types can accept an autoincrement argument. In our dialect, this argument is currently ignored. To create an auto-incrementing field in your model you can pass in an explicit Identity() instead.

Furthermore, in Databricks Runtime, only BIGINT fields can be configured to auto-increment. So in SQLAlchemy, you must use the BigInteger() type.

from sqlalchemy import BigInteger, Column, Identity, String

class SomeModel(Base):
    id      = Column(BigInteger, Identity())
    value   = Column(String())

When calling Base.metadata.create_all(), the executed DDL will include GENERATED ALWAYS AS IDENTITY for the id column. This is useful when using SQLAlchemy to generate tables. However, as of this writing, Identity() constructs are not captured when SQLAlchemy reflects a table's metadata (support for this is planned).

Parameters

databricks-sql-connector supports two approaches to parameterizing SQL queries: native and inline. Our SQLAlchemy 2.0 dialect always uses the native approach and is therefore limited to DBR 14.2 and above. If you are writing parameterized queries to be executed by SQLAlchemy, you must use the "named" paramstyle (:param). Read more about parameterization in docs/parameters.md.
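
For example, a parameterized query with the named paramstyle might look like the following sketch, where engine is assumed to be the Engine from the Connection String example and some_table is a hypothetical table:

from sqlalchemy import text

with engine.connect() as conn:
    # The :id placeholder uses the "named" paramstyle and is bound natively.
    result = conn.execute(
        text("SELECT * FROM some_table WHERE id = :id"),
        {"id": 42},
    )
    for row in result:
        print(row)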

Usage with pandas

Use pandas.DataFrame.to_sql and pandas.read_sql to write and read from Databricks SQL. These methods both accept a SQLAlchemy connection to interact with Databricks.

Read from Databricks SQL into pandas

from sqlalchemy import create_engine
import pandas as pd

engine = create_engine("databricks://token:dapi***@***.cloud.databricks.com?http_path=***&catalog=main&schema=test")
with engine.connect() as conn:
    # This will read the contents of `main.test.some_table`
    df = pd.read_sql("some_table", conn)

Write to Databricks SQL from pandas

from sqlalchemy import create_engine
import pandas as pd

engine = create_engine("databricks://token:dapi***@***.cloud.databricks.com?http_path=***&catalog=main&schema=test")
squares = [(i, i * i) for i in range(100)]
df = pd.DataFrame(data=squares, columns=['x', 'x_squared'])

with engine.connect() as conn:
    # This will write the contents of `df` to `main.test.squares`
    df.to_sql('squares', conn)

PrimaryKey() and ForeignKey()

Unity Catalog workspaces in Databricks support PRIMARY KEY and FOREIGN KEY constraints. Note that Databricks Runtime does not enforce the integrity of FOREIGN KEY constraints. You can establish a primary key by setting primary_key=True when defining a column.

When building ForeignKey or ForeignKeyConstraint objects, you must specify a name for the constraint.

If your model definition requires a self-referential FOREIGN KEY constraint, you must include use_alter=True when defining the relationship.

from sqlalchemy import Table, Column, ForeignKey, BigInteger, MetaData, String

metadata_obj = MetaData()

users = Table(
    "users",
    metadata_obj,
    Column("id", BigInteger, primary_key=True),
    Column("name", String(), nullable=False),
    Column("email", String()),
    Column("manager_id", ForeignKey("users.id", name="fk_users_manager_id_x_users_id", use_alter=True))
)

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

databricks_sqlalchemy-2.0.1.tar.gz (109.8 kB view details)

Uploaded Source

Built Distribution

databricks_sqlalchemy-2.0.1-py3-none-any.whl (114.9 kB view details)

Uploaded Python 3

File details

Details for the file databricks_sqlalchemy-2.0.1.tar.gz.

File metadata

  • Download URL: databricks_sqlalchemy-2.0.1.tar.gz
  • Upload date:
  • Size: 109.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.1 CPython/3.9.6

File hashes

Hashes for databricks_sqlalchemy-2.0.1.tar.gz
  • SHA256: ce18879b4d84bd46ee3fdc864f097bdd573acc7310156d68049b0e17cfe9a6f9
  • MD5: f0c6acb7c04ba45d6f44d55d51ac148f
  • BLAKE2b-256: a7835478b486dfcb3a8347b4e1a2a33bd983d0ec6990f9efa6f2f48965be252f

See more details on using hashes here.

File details

Details for the file databricks_sqlalchemy-2.0.1-py3-none-any.whl.

File metadata

File hashes

Hashes for databricks_sqlalchemy-2.0.1-py3-none-any.whl
  • SHA256: b8e5aa7ef9add762a8ead039fe94e0f3a6e073ae4e644c88ebf29c97ec160998
  • MD5: b5420abe9af1db5bda355269eb834553
  • BLAKE2b-256: f9008288e810cf0d0879281e0e42b11a383c36e29c47f9c8744aec7b1ae66bf7

See more details on using hashes here.
