Access Azure Datalake Gen1 with fsspec and dask

Project description

Dask interface to Azure-Datalake Gen1 and Gen2 Storage

Quickstart


This package can be installed using:

pip install adlfs

or

conda install -c conda-forge adlfs

The adl:// and abfs:// protocols are included in fsspec's known_implementations registry for fsspec > 0.6.1; with older versions, users must explicitly register the adlfs protocols with fsspec.
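
A minimal sketch of that explicit registration, assuming an fsspec version that provides fsspec.register_implementation (the string class paths refer to adlfs's filesystem classes):

import fsspec

# Map the URL protocols to the adlfs implementations so that
# fsspec/dask can resolve adl:// and abfs:// paths.
fsspec.register_implementation("adl", "adlfs.AzureDatalakeFileSystem")
fsspec.register_implementation("abfs", "adlfs.AzureBlobFileSystem")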

To use the Gen1 filesystem:

import dask.dataframe as dd

storage_options = {'tenant_id': TENANT_ID, 'client_id': CLIENT_ID, 'client_secret': CLIENT_SECRET}

ddf = dd.read_csv('adl://{STORE_NAME}/{FOLDER}/*.csv', storage_options=storage_options)

To use the Gen2 filesystem, you can use either the abfs or the az protocol:

import dask.dataframe as dd

storage_options = {'account_name': ACCOUNT_NAME, 'account_key': ACCOUNT_KEY}

ddf = dd.read_csv('abfs://{CONTAINER}/{FOLDER}/*.csv', storage_options=storage_options)
ddf = dd.read_parquet('az://{CONTAINER}/folder.parquet', storage_options=storage_options)

To read from a public storage blob, you are still required to specify 'account_name'. For example, you can access the NYC Taxi & Limousine Commission open data as:

storage_options = {'account_name': 'azureopendatastorage'}
ddf = dd.read_parquet('az://nyctlc/green/puYear=2019/puMonth=*/*.parquet', storage_options=storage_options)

Details

The package includes pythonic filesystem implementations for both Azure Datalake Gen1 and Azure Datalake Gen2, facilitating interactions between both Azure Datalake implementations and Dask. This is done by leveraging the intake/filesystem_spec base class and the Azure Python SDKs.

Operations against the Gen1 Datalake currently work only with an Azure ServicePrincipal that has suitable credentials to perform operations on the resources of choice.
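
As a minimal sketch, the Gen1 filesystem can also be instantiated directly with those service-principal credentials (the parameter names mirror the storage_options shown in the Quickstart; STORE_NAME is a placeholder for your datalake store):

from adlfs import AzureDatalakeFileSystem

fs = AzureDatalakeFileSystem(
    tenant_id=TENANT_ID,          # Azure AD tenant of the service principal
    client_id=CLIENT_ID,          # application (client) id
    client_secret=CLIENT_SECRET,  # client secret for the service principal
    store_name=STORE_NAME,        # name of the Gen1 datalake store
)
fs.ls("/")  # list the root of the store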

Operations against the Gen2 Datalake are implemented by leveraging multi-protocol access, using the Azure Blob Storage Python SDK. The AzureBlobFileSystem accepts all of the BlockBlobService arguments.
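
For example, a minimal sketch of constructing the Gen2 filesystem directly and listing a container (CONTAINER is a placeholder):

from adlfs import AzureBlobFileSystem

# Credentials are passed the same way as in storage_options above.
fs = AzureBlobFileSystem(account_name=ACCOUNT_NAME, account_key=ACCOUNT_KEY)
fs.ls("CONTAINER")  # list the blobs at the top level of the container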

By default, write operations create BlockBlobs in Azure, which cannot be appended once written. It is possible to create an AppendBlob instead by passing `mode="ab"` when creating the blob, and then using the same mode when operating on it. Currently, AppendBlobs are not available if hierarchical namespaces are enabled.
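
A minimal sketch of appending, assuming fs is the AzureBlobFileSystem from above and the account does not have hierarchical namespaces enabled (the blob path is a placeholder):

# Opening in "ab" mode creates an AppendBlob if the blob does not exist yet.
with fs.open("CONTAINER/logs/app.log", mode="ab") as f:
    f.write(b"first line\n")

# Reopening in "ab" mode appends to the same AppendBlob.
with fs.open("CONTAINER/logs/app.log", mode="ab") as f:
    f.write(b"second line\n")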
