Unification of data connectors for distributed data tasks
Project description
Tentaclio
Python library that simplifies:
- Handling streams from different protocols such as file:, ftp:, sftp:, s3:, ...
- Opening database connections.
- Managing the credentials in distributed systems.
Main considerations in the design:
- Easy to use: all streams are opened via tentaclio.open, all database connections through tentaclio.db.
- URLs are the basic resource locator and db connection string.
- Automagic authentication for protected resources.
- Extensible: you can add your own handlers for other schemes.
- Pandas interaction.
Quick Examples.
Read and write streams.
import tentaclio
contents = "👋 🐙"
with tentaclio.open("ftp://localhost:2021/upload/file.txt", mode="w") as writer:
    writer.write(contents)
# Using boto3 authentication under the hood.
bucket = "s3://my-bucket/octopus/hello.txt"
with tentaclio.open(bucket) as reader:
    print(reader.read())
Copy streams
import tentaclio
tentaclio.copy("/home/constantine/data.csv", "sftp://constantine:tentacl3@sftp.octoenergy.com/uploads/data.csv")
Delete resources
import tentaclio
tentaclio.remove("s3://my-bucket/octopus/the-9th-tentacle.txt")
List resources
import tentaclio
for entry in tentaclio.listdir("s3://mybucket/path/to/dir"):
    print("Entry", entry)
Authenticated resources.
import os
import tentaclio
print("env ftp credentials", os.getenv("TENTACLIO__CONN__OCTOENERGY_FTP"))
# This prints `sftp://constantine:tentacl3@sftp.octoenergy.com/`
# Credentials get automatically injected.
with tentaclio.open("sftp://sftp.octoenergy.com/uploads/data.csv") as reader:
    print(reader.read())
Database connections.
import os
import tentaclio
print("env TENTACLIO__CONN__DB", os.getenv("TENTACLIO__CONN__DB"))
# This prints `postgresql://octopus:tentacle@localhost:5444/example`
# hostname is a wildcard, the credentials get injected.
with tentaclio.db("postgresql://hostname/example") as pg:
    results = pg.query("select * from my_table")
Pandas interaction.
import pandas as pd # 🐼🐼
import tentaclio # 🐙
df = pd.DataFrame([[1, 2, 3], [10, 20, 30]], columns=["col_1", "col_2", "col_3"])
bucket = "s3://my-bucket/data/pandas.csv"
with tentaclio.open(bucket, mode="w") as writer: # supports more pandas readers
    df.to_csv(writer, index=False)
with tentaclio.open(bucket) as reader:
    new_df = pd.read_csv(reader)
Installation
You can get tentaclio using pip
pip install tentaclio
or pipenv
pipenv install tentaclio
Developing.
Clone this repo and install pipenv.
In the Makefile you'll find some useful targets for linting, testing, etc., e.g.:
make test
How to use
This is how to use tentaclio for your daily data ingestion and storing needs.
Streams
In order to open streams to load or store data the universal function is:
import tentaclio
with tentaclio.open("/path/to/my/file") as reader:
    contents = reader.read()
with tentaclio.open("s3://bucket/file", mode='w') as writer:
    writer.write(contents)
Allowed modes are r, w, rb, and wb. You can use t instead of b to indicate text streams, but that's the default.
The supported url protocols are:
- /local/file
- file:///local/file
- s3://bucket/file
- gs://bucket/file
- gsc://bucket/file
- gdrive:/My Drive/file
- googledrive:/My Drive/file
- ftp://path/to/file
- sftp://path/to/file
- http://host.com/path/to/resource
- https://host.com/path/to/resource
- postgresql://host/database::table will allow you to write from a csv format into a database with the same column names (warning: note that the table name goes after ::).
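The :: convention in the last entry can be illustrated with a small parser. This is only a sketch of how such a URL might be split into host, database and table using the standard library; `split_table_url` is a hypothetical helper and not part of tentaclio, whose own parsing may differ.

```python
from typing import Optional, Tuple
from urllib.parse import urlsplit


def split_table_url(url: str) -> Tuple[str, str, Optional[str]]:
    """Split 'scheme://host/database::table' into (host, database, table).

    Hypothetical helper for illustration only.
    """
    parts = urlsplit(url)
    # The path looks like '/database::table'; drop the leading slash.
    path = parts.path.lstrip("/")
    database, separator, table = path.partition("::")
    # Without a '::' separator there is no table component.
    return parts.hostname or "", database, (table if separator else None)


host, database, table = split_table_url("postgresql://host/database::table")
print(host, database, table)
```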
You can add the credentials for any of the urls in order to access protected resources.
You can use these readers and writers with pandas functions like:
import pandas as pd
import tentaclio
with tentaclio.open("/path/to/my/file") as reader:
    df = pd.read_csv(reader)
[...]
# Parquet is a binary format, so the stream is opened in binary mode.
with tentaclio.open("s3://path/to/my/file", mode='wb') as writer:
    df.to_parquet(writer)
Readers, Writers and their closeable versions can be used anywhere expecting a file-like object; pandas or pickle are examples of such functions.
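As a sketch of what "anywhere expecting a file-like object" means in practice, the functions below only rely on the write and read methods of whatever they are handed. An in-memory `io.BytesIO` stands in for a binary stream here; the helper names `save` and `load` are made up for this example.

```python
import io
import pickle
from typing import Any, BinaryIO


def save(obj: Any, writer: BinaryIO) -> None:
    """Pickle obj into any binary file-like object."""
    pickle.dump(obj, writer)


def load(reader: BinaryIO) -> Any:
    """Unpickle an object from any binary file-like object."""
    return pickle.load(reader)


buffer = io.BytesIO()  # stand-in for a binary writer/reader
save({"tentacles": 8}, buffer)
buffer.seek(0)  # rewind before reading back
restored = load(buffer)
print(restored)
```

With tentaclio, the same functions would presumably accept a stream opened with mode="wb" or mode="rb" in place of the buffer.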
File system like operations to resources
Listing resources
Some URL schemes allow listing resources in a pythonic way:
import tentaclio
for entry in tentaclio.listdir("s3://mybucket/path/to/dir"):
    print("Entry", entry)
Whereas listdir might be convenient, we also offer scandir, which returns a list of DirEntrys, and walk. All functions follow their standard library definitions as closely as possible.
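Since these functions are modelled on the standard library, the sketch below shows the three standard library counterparts (os.listdir, os.scandir and os.walk) on a temporary local directory. It does not call tentaclio; the directory layout and file names are invented for the example.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as root:
    # Build a tiny tree: root/a.csv and root/sub/b.csv
    os.makedirs(os.path.join(root, "sub"))
    for name in ("a.csv", os.path.join("sub", "b.csv")):
        with open(os.path.join(root, name), "w") as f:
            f.write("x\n")

    # listdir: plain names of the direct children.
    names = sorted(os.listdir(root))

    # scandir: entry objects that also know whether they are directories.
    kinds = {entry.name: entry.is_dir() for entry in os.scandir(root)}

    # walk: recursive traversal yielding (top, dirs, files) tuples.
    walked = sorted(
        os.path.relpath(os.path.join(top, filename), root)
        for top, _dirs, files in os.walk(root)
        for filename in files
    )

print(names)
print(kinds)
print(walked)
```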
Database access
In order to open db connections you can use tentaclio.db and have instant access to postgres, sqlite, athena and mssql.
import tentaclio
[...]
query = "select 1"
with tentaclio.db(POSTGRES_TEST_URL) as client:
    result = client.query(query)
[...]
The supported db schemes are:
- postgresql://
- sqlite://
- awsathena+rest://
- mssql://
- Any other scheme supported by sqlalchemy.
Extras for databases
For postgres you can set the variable TENTACLIO__PG_APPLICATION_NAME and the value will be injected when connecting to the database.
Automatic credentials injection
- Configure credentials by using environment variables prefixed with TENTACLIO__CONN__ (e.g. TENTACLIO__CONN__DATA_FTP=sftp://real_user:132ldsf@ftp.octoenergy.com).
- Open a stream:
    with tentaclio.open("sftp://ftp.octoenergy.com/file.csv") as reader:
        reader.read()
  The credentials get injected into the url.
- Open a db client:
    import tentaclio
    with tentaclio.db("postgresql://hostname/my_data_base") as client:
        client.query("select 1")
Note that hostname in the url to be authenticated is a wildcard that will match any hostname. So authenticate("http://hostname/file.txt") will resolve to http://user:pass@octo.co/file.txt if the credential for http://user:pass@octo.co/ exists.
Different components of the URL are set differently:
- Scheme and path will be set from the URL, and null if missing.
- Username, password and hostname will be set from the stored credentials.
- Port will be set from the stored credentials if it exists, otherwise from the URL.
- Query will be set from the URL if it exists, otherwise from the stored credentials (so it can be overridden).
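The component rules above can be sketched with the standard library. This is a simplified illustration, not tentaclio's implementation: `load_credentials` and `inject_credentials` are hypothetical names, and the matching rule used here (same scheme, and same host unless the url uses the wildcard hostname) is an assumption.

```python
from typing import Iterable, List, Mapping
from urllib.parse import SplitResult, urlsplit

ENV_PREFIX = "TENTACLIO__CONN__"  # prefix named in the text above
WILDCARD_HOST = "hostname"        # wildcard host named in the text above


def load_credentials(env: Mapping[str, str]) -> List[SplitResult]:
    """Parse every variable whose name starts with ENV_PREFIX as a URL."""
    return [urlsplit(value) for name, value in env.items() if name.startswith(ENV_PREFIX)]


def inject_credentials(url: str, credentials: Iterable[SplitResult]) -> str:
    """Return url with user, password, host and port taken from a match."""
    target = urlsplit(url)
    for cred in credentials:
        # Assumed matching rule: same scheme, same host or wildcard host.
        if cred.scheme != target.scheme:
            continue
        if target.hostname not in (WILDCARD_HOST, cred.hostname):
            continue
        # Username, password and hostname come from the stored credentials.
        userinfo = cred.username or ""
        if cred.password:
            userinfo += f":{cred.password}"
        netloc = f"{userinfo}@{cred.hostname}" if userinfo else (cred.hostname or "")
        # Port: stored credentials first, otherwise the URL.
        port = cred.port or target.port
        if port:
            netloc += f":{port}"
        # Query: the URL first, otherwise the stored credentials.
        query = target.query or cred.query
        # Scheme and path always come from the URL.
        return SplitResult(target.scheme, netloc, target.path, query, "").geturl()
    return url  # no matching credentials: leave the url untouched


env = {  # stand-in for os.environ
    "TENTACLIO__CONN__OCTO": "http://user:pass@octo.co/",
    "TENTACLIO__CONN__DB": "postgresql://octopus:tentacle@localhost:5444/example",
}
creds = load_credentials(env)
print(inject_credentials("http://hostname/file.txt", creds))
print(inject_credentials("postgresql://hostname/example", creds))
```

Passing os.environ instead of the dictionary would pick up real variables with the same prefix.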
Credentials file
You can also set a credentials file that looks like:
secrets:
    db_1: postgresql://user1:pass1@myhost.com/database_1
    db_2: postgresql://user2:pass2@otherhost.com/database_2
    ftp_server: ftp://fuser:fpass@ftp.myhost.com
And make it accessible to tentaclio by setting the environment variable TENTACLIO__SECRETS_FILE. The actual name of each url is for traceability and has no effect on the functionality.
Alternatively you can run curl https://raw.githubusercontent.com/octoenergy/tentaclio/master/extras/init_tentaclio.sh to create a secrets file in ~/.tentaclio.yml and automatically configure your environment.
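As a minimal sketch of the manual route, the snippet below writes a secrets file in the format shown above and points TENTACLIO__SECRETS_FILE at it. The temporary location and the restrictive file permissions are choices made for this example, and it is an assumption that the variable should be set before tentaclio loads its credentials.

```python
import os
import tempfile
import textwrap

# Same layout as the example file above, with placeholder credentials.
SECRETS = textwrap.dedent(
    """\
    secrets:
      db_1: postgresql://user1:pass1@myhost.com/database_1
      ftp_server: ftp://fuser:fpass@ftp.myhost.com
    """
)

# Hypothetical location; any readable path would do.
secrets_path = os.path.join(tempfile.mkdtemp(), "tentaclio.yml")
with open(secrets_path, "w") as f:
    f.write(SECRETS)

# The file holds passwords, so restrict it to the owner.
os.chmod(secrets_path, 0o600)

# Tell tentaclio where the file lives.
os.environ["TENTACLIO__SECRETS_FILE"] = secrets_path
print(os.environ["TENTACLIO__SECRETS_FILE"])
```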
Configuring access to google drive.
Google drive support is experimental and should be used at your own risk. Also, due to google drive itself it's rather slow.
- Get the credentials. First we need a credentials file in order to be able to generate tokens. The easiest way to do this is to go to this example and click on enable drive api. Give the project a name of your choosing (e.g. tentaclio), set the OAuth client selector to "Desktop app", and download the generated JSON file.
- Generate the token file:
    pipenv install tentaclio && pipenv run python -m tentaclio google-token generate
  This will open a browser with a google auth page; log in and accept the authorisation request. The token file is saved in a default location. You can also configure this via the env variable TENTACLIO__GOOGLE_DRIVE_TOKEN_FILE.
- Get rid of credentials.json. The credentials.json file is no longer needed; feel free to delete it.
Quick note on protocols structural subtyping.
In order to abstract concrete dependencies from the implementation of data related functions (or in any part of the system really) we use typed protocols. This allows more flexible dependency injection than subclassing or more complex approaches. This idea is heavily inspired by how this exact thing is done in Go. Learn more about this principle in our tech blog.
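A minimal sketch of structural subtyping with typing.Protocol (Python 3.8+) follows. The `Reader` protocol, `count_lines` and `InMemoryReader` are illustrative names and are not taken from tentaclio's code base.

```python
import io
from typing import Protocol


class Reader(Protocol):
    """Anything with a compatible read() method counts as a Reader."""

    def read(self, size: int = -1) -> str: ...


def count_lines(reader: Reader) -> int:
    """Depends only on the protocol, not on a concrete class."""
    return len(reader.read().splitlines())


class InMemoryReader:
    """Does not inherit from Reader; it matches by shape alone."""

    def __init__(self, text: str) -> None:
        self._text = text

    def read(self, size: int = -1) -> str:
        return self._text


print(count_lines(io.StringIO("a\nb\nc\n")))
print(count_lines(InMemoryReader("x\ny")))
```

Both calls type-check because io.StringIO and InMemoryReader each provide a suitable read method, which is the same idea as Go's implicitly satisfied interfaces.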
File details
Details for the file tentaclio-0.0.11.tar.gz.
File metadata
- Download URL: tentaclio-0.0.11.tar.gz
- Upload date:
- Size: 38.2 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.2.0 pkginfo/1.5.0.1 requests/2.24.0 setuptools/49.6.0 requests-toolbelt/0.9.1 tqdm/4.48.2 CPython/3.7.9
File hashes
Algorithm | Hash digest
---|---
SHA256 | 4a09f9472f3e6444de652aeb98e4d1707d4bff6f3aca4b2c435308dae7e01619
MD5 | 9191aa01130a33b32d9c3b2513b11c86
BLAKE2b-256 | 5828c72518500824ba93219ada795ff1af5a6a850d4a2b25b68debab50d767cd
File details
Details for the file tentaclio-0.0.11-py2.py3-none-any.whl.
File metadata
- Download URL: tentaclio-0.0.11-py2.py3-none-any.whl
- Upload date:
- Size: 50.0 kB
- Tags: Python 2, Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.2.0 pkginfo/1.5.0.1 requests/2.24.0 setuptools/49.6.0 requests-toolbelt/0.9.1 tqdm/4.48.2 CPython/3.7.9
File hashes
Algorithm | Hash digest
---|---
SHA256 | f6a3368c8cf215a46dab23f7c431dca991c9c7b414039a11ee310dfef8d9d147
MD5 | d6db6c1d2a609785505a6f1ac9ed76cc
BLAKE2b-256 | ea17cc170a32288cadb838980868834664402c73c0e24300c9b579da4a34ce7f
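A downloaded file can be checked against the SHA256 digest listed above with the standard library. This is a small sketch; it assumes the wheel has been saved in the current directory under its original name, and the helper names are made up for the example.

```python
import hashlib
import os


def sha256_of(path: str, chunk_size: int = 65536) -> str:
    """Hash a file in chunks so large files do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path: str, expected_hex: str) -> bool:
    """Compare the file's digest with the published one, ignoring case."""
    return sha256_of(path) == expected_hex.lower()


WHEEL = "tentaclio-0.0.11-py2.py3-none-any.whl"  # assumed local file name
EXPECTED = "f6a3368c8cf215a46dab23f7c431dca991c9c7b414039a11ee310dfef8d9d147"

if os.path.exists(WHEEL):
    print("hash ok" if verify(WHEEL, EXPECTED) else "hash mismatch")
```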