
Unification of data connectors for distributed data tasks


Tentaclio


Python library that simplifies:

  • Handling streams from different protocols such as file:, ftp:, sftp:, s3:, ...
  • Opening database connections.
  • Managing the credentials in distributed systems.

Main considerations in the design:

  • Easy to use: all streams are opened via tentaclio.open, all database connections through tentaclio.db.
  • URLs are the basic resource locator and db connection string.
  • Automagic authentication for protected resources.
  • Extensible: you can add your own handlers for other schemes.
  • Pandas interaction.

Quick Examples.

Read and write streams.

import tentaclio
contents = "👋 🐙"

with tentaclio.open("ftp://localhost:2021/upload/file.txt", mode="w") as writer:
    writer.write(contents)

# Using boto3 authentication under the hood.
bucket = "s3://my-bucket/octopus/hello.txt"
with tentaclio.open(bucket) as reader:
    print(reader.read())

Copy streams

import tentaclio

tentaclio.copy("/home/constantine/data.csv", "sftp://constantine:tentacl3@sftp.octoenergy.com/uploads/data.csv")

Delete resources

import tentaclio

tentaclio.remove("s3://my-bucket/octopus/the-9th-tentacle.txt")

List resources

import tentaclio

for entry in tentaclio.listdir("s3://mybucket/path/to/dir"):
    print("Entry", entry)

Authenticated resources.

import os

import tentaclio

print("env ftp credentials", os.getenv("OCTOIO__CONN__OCTOENERGY_FTP"))
# This prints `sftp://constantine:tentacl3@sftp.octoenergy.com/`

# Credentials get automatically injected.

with tentaclio.open("sftp://sftp.octoenergy.com/uploads/data.csv") as reader:
    print(reader.read())

Database connections.

import os

import tentaclio

print("env TENTACLIO__CONN__DB", os.getenv("TENTACLIO__CONN__DB"))

# This prints `postgresql://octopus:tentacle@localhost:5444/example`

# hostname is a wildcard, the credentials get injected.
with tentaclio.db("postgresql://hostname/example") as pg:
    results = pg.query("select * from my_table")

Pandas interaction.

import pandas as pd  # 🐼🐼
import tentaclio  # 🐙

df = pd.DataFrame([[1, 2, 3], [10, 20, 30]], columns=["col_1", "col_2", "col_3"])

bucket = "s3://my-bucket/data/pandas.csv"

with tentaclio.open(bucket, mode="w") as writer:  # supports more pandas readers
    df.to_csv(writer, index=False)

with tentaclio.open(bucket) as reader:
    new_df = pd.read_csv(reader)

# another example: using pandas.DataFrame.to_sql() with tentaclio to upload
connection_info = "postgresql://hostname/example"  # any supported db url; credentials get injected
with tentaclio.db(
        connection_info,
        connect_args={'options': '-csearch_path=schema_name'}
    ) as client:
    df.to_sql(
        name='observations',  # table name
        con=client.conn,
    )

Installation

You can get tentaclio using pip

pip install tentaclio

or pipenv

pipenv install tentaclio

Developing.

Clone this repo and install pipenv.

In the Makefile you'll find some useful targets for linting, testing, etc., e.g.:

make test

How to use

This is how to use tentaclio for your daily data ingestion and storage needs.

Streams

In order to open streams to load or store data, the universal function is:

import tentaclio

with tentaclio.open("/path/to/my/file") as reader:
    contents = reader.read()

with tentaclio.open("s3://bucket/file", mode='w') as writer:
    writer.write(contents)

Allowed modes are r, w, rb, and wb. You can use t instead of b to indicate text streams, but that's the default.
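
For example, a minimal sketch of binary and explicit-text modes (the URLs are just placeholders):

import tentaclio

# Binary read, e.g. for parquet or pickled payloads.
with tentaclio.open("s3://my-bucket/data/blob.bin", mode="rb") as reader:
    payload = reader.read()

# Explicit text write; mode="wt" behaves the same as mode="w".
with tentaclio.open("ftp://localhost:2021/upload/notes.txt", mode="wt") as writer:
    writer.write("some text")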

In order to keep tentaclio as light as possible, it only includes file, ftp, sftp, http and https schemes by default. However, many more are easily available by installing extra packages:

Default:

  • /local/file
  • file:///local/file
  • ftp://path/to/file
  • sftp://path/to/file
  • http://host.com/path/to/resource
  • https://host.com/path/to/resource

tentaclio-s3

  • s3://bucket/file

tentaclio-gs

  • gs://bucket/file
  • gsc://bucket/file

tentaclio-gdrive

  • gdrive:/My Drive/file
  • googledrive:/My Drive/file

tentaclio-postgres

  • postgresql://host/database::table will allow you to write a CSV-formatted stream into a table with the same column names (⚠️ note that the table name goes after ::); see the sketch below.
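
A minimal sketch of writing a CSV stream into a table, assuming tentaclio-postgres is installed, the table already exists, and the column names match (the host, database and table names are placeholders):

import tentaclio

# Stream an existing CSV file into the my_table table (placeholder names).
with tentaclio.open("/path/to/data.csv") as csv_reader:
    with tentaclio.open("postgresql://hostname/database::my_table", mode="w") as db_writer:
        db_writer.write(csv_reader.read())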

You can add credentials to any of these URLs in order to access protected resources.

You can use these readers and writers with pandas functions like:

import pandas as pd
import tentaclio

with tentaclio.open("/path/to/my/file") as reader:
    df = pd.read_csv(reader)

[...]

with tentaclio.open("s3::/path/to/my/file", mode='w') as writer:
    df.to_parquet(writer)

Readers, Writers and their closeable versions can be used anywhere a file-like object is expected; pandas and pickle are examples of libraries that accept them.
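
For example, a pickled object can be round-tripped through a binary stream (the bucket path is only illustrative):

import pickle

import tentaclio

data = {"tentacles": 8}

with tentaclio.open("s3://my-bucket/objects/data.pickle", mode="wb") as writer:
    pickle.dump(data, writer)

with tentaclio.open("s3://my-bucket/objects/data.pickle", mode="rb") as reader:
    restored = pickle.loads(reader.read())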

Notes on writing files for Spark, Presto, and similar downstream systems

The default behaviour of Python's open context manager is to create an empty file as soon as it is opened in writable mode. This is a nuisance when the process inside the with block produces an empty dataframe and writes nothing: the empty file still gets created, which makes Spark and Presto panic.

To avoid this, you can make the stream empty safe: if no writes are performed, the empty buffer is not flushed and no empty file is created.

import tentaclio as tio  # df is the pandas DataFrame produced by the process above

with tio.make_empty_safe(tio.open("s3://bucket/file.parquet", mode="wb")) as writer:
    if not df.empty:
        df.to_parquet(writer)

File system like operations to resources

Listing resources

Some URL schemes allow listing resources in a pythonic way:

import tentaclio

for entry in tentaclio.listdir("s3://mybucket/path/to/dir"):
    print("Entry", entry)

While listdir is convenient, tentaclio also offers scandir, which returns a list of DirEntry objects, and walk. All of these functions follow their standard library counterparts as closely as possible.
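
A rough sketch of the two, based on the description above (the bucket path is a placeholder):

import tentaclio

# scandir yields DirEntry objects instead of plain URL strings.
for entry in tentaclio.scandir("s3://mybucket/path/to/dir"):
    print("DirEntry", entry)

# walk descends into subdirectories, mirroring the standard library behaviour
# as closely as the underlying scheme allows.
for walked in tentaclio.walk("s3://mybucket/path/to/dir"):
    print("Walked", walked)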

Database access

In order to open db connections you can use tentaclio.db and have instant access to postgres, sqlite, athena and mssql.

import tentaclio

[...]

query = "select 1";
with tentaclio.db(POSTGRES_TEST_URL) as client:
    result =client.query(query)
[...]

The supported db schemes are:

Default:

  • sqlite://
  • mssql://
  • Any other scheme supported by sqlalchemy.

tentaclio-postgres

  • postgresql://

tentaclio-athena

  • awsathena+rest://

tentaclio-databricks

  • databricks+thrift://

tentaclio-snowflake

  • snowflake://

Extras for databases

For postgres, you can set the environment variable TENTACLIO__PG_APPLICATION_NAME and its value will be injected when connecting to the database.
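
For instance, a minimal sketch setting it from Python before opening a connection (the application name and connection string are just examples):

import os

import tentaclio

os.environ["TENTACLIO__PG_APPLICATION_NAME"] = "nightly-etl"  # example value

with tentaclio.db("postgresql://hostname/example") as client:
    client.query("select 1")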

Automatic credentials injection

  1. Configure credentials by using environment variables prefixed with TENTACLIO__CONN__ (e.g. TENTACLIO__CONN__DATA_FTP=sftp://real_user:132ldsf@ftp.octoenergy.com).

  2. Open a stream:

with tentaclio.open("sftp://ftp.octoenergy.com/file.csv") as reader:
    reader.read()

The credentials get injected into the URL.

  3. Open a db client:
import tentaclio

with tentaclio.db("postgresql://hostname/my_data_base") as client:
    client.query("select 1")

Note that hostname in the URL to be authenticated is a wildcard that matches any hostname. So authenticating http://hostname/file.txt will resolve to http://user:pass@octo.co/file.txt if a credential for http://user:pass@octo.co/ exists.

Different components of the URL are set differently:

  • Scheme and path will be set from the URL, and null if missing.
  • Username, password and hostname will be set from the stored credentials.
  • Port will be set from the stored credentials if it exists, otherwise from the URL.
  • Query will be set from the URL if it exists, otherwise from the stored credentials (so it can be overridden).
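
A worked illustration of these rules (the credential and hostnames are made up):

# Stored credential: postgresql://real_user:real_pass@db.example.com:5432/my_data_base
# URL passed in:     postgresql://hostname/my_data_base?sslmode=require
#
# Scheme and path come from the URL, user/password/hostname from the credential,
# port from the credential (it is set there), and query from the URL, giving:
#
# postgresql://real_user:real_pass@db.example.com:5432/my_data_base?sslmode=require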

Credentials file

You can also set a credentials file that looks like:

secrets:
    db_1: postgresql://user1:pass1@myhost.com/database_1
    db_2: mssql://user2:pass2@otherhost.com/database_2?driver=ODBC+Driver+17+for+SQL+Server
    ftp_server: ftp://fuser:fpass@ftp.myhost.com

Then make it accessible to tentaclio by setting the environment variable TENTACLIO__SECRETS_FILE. The name given to each URL is only for traceability and has no effect on functionality.

(Note that you may need to add ?driver={driver from /usr/local/etc/odbcinst.ini} to mssql connection strings; see the example above.)

Alternatively you can run curl https://raw.githubusercontent.com/octoenergy/tentaclio/master/extras/init_tentaclio.sh to create a secrets file in ~/.tentaclio.yml and automatically configure your environment.

Environment variables can be included in the credentials file by using ${ENV_VARIABLE}, as follows:

secrets:
    db: postgresql://${DB_USER}:${DB_PASS}@myhost.com/database

Tentaclio will look up DB_USER and DB_PASS in the environment and interpolate their values into the secrets file content.

Quick note on protocols (structural subtyping).

In order to abstract concrete dependencies from the implementation of data-related functions (or any other part of the system, really), we use typed protocols. This allows more flexible dependency injection than subclassing or more complex approaches. The idea is heavily inspired by how the same thing is done in Go. Learn more about this principle in our tech blog.
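
A minimal sketch of the idea using typing.Protocol (the names below are illustrative, not tentaclio's actual classes):

from typing import Protocol


class Reader(Protocol):
    """Anything with a structurally matching read method satisfies this protocol."""

    def read(self, size: int = -1) -> str:
        ...


def load_contents(reader: Reader) -> str:
    # A tentaclio stream, an open file or an io.StringIO all work here:
    # no subclassing required, only a matching read() method.
    return reader.read()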
