
Access Azure Datalake Gen1 with fsspec and dask

Project description

Filesystem interface to Azure-Datalake Gen1 and Gen2 Storage


Quickstart

This package can be installed using:

pip install adlfs

or

conda install -c conda-forge adlfs

The adl:// and abfs:// protocols are included in fsspec's known_implementations registry for fsspec versions newer than 0.6.1; with older versions, users must explicitly register the adlfs protocols with fsspec.
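With an older fsspec, registration can be done by hand. A minimal sketch, assuming a modern `fsspec.register_implementation` signature; the string form defers importing adlfs until the protocol is first used:

```python
import fsspec

# Register the adlfs filesystem classes under their protocol names.
# clobber=True overwrites any entry fsspec already knows about.
fsspec.register_implementation("adl", "adlfs.AzureDatalakeFileSystem", clobber=True)
fsspec.register_implementation("abfs", "adlfs.AzureBlobFileSystem", clobber=True)
fsspec.register_implementation("az", "adlfs.AzureBlobFileSystem", clobber=True)
```

After this, `fsspec.open("abfs://...")` and Dask URLs using these protocols resolve to the adlfs implementations.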

To use the Gen1 filesystem:

import dask.dataframe as dd

storage_options = {'tenant_id': TENANT_ID, 'client_id': CLIENT_ID, 'client_secret': CLIENT_SECRET}

dd.read_csv('adl://{STORE_NAME}/{FOLDER}/*.csv', storage_options=storage_options)

To use the Gen2 filesystem you can use either the abfs or az protocol:

import dask.dataframe as dd

storage_options = {'account_name': ACCOUNT_NAME, 'account_key': ACCOUNT_KEY}

ddf = dd.read_csv('abfs://{CONTAINER}/{FOLDER}/*.csv', storage_options=storage_options)
ddf = dd.read_parquet('az://{CONTAINER}/folder.parquet', storage_options=storage_options)

Alternatively, if AZURE_STORAGE_ACCOUNT_NAME and one of the AZURE_STORAGE_<CREDENTIAL> variables are set as environment variables, then storage_options will be read from the environment.
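To make the environment-variable lookup concrete, here is an illustrative stdlib-only sketch; adlfs performs this lookup itself, so a helper like this (the name `storage_options_from_env` is hypothetical) is only useful for inspecting or overriding the values:

```python
import os

def storage_options_from_env():
    """Collect Azure storage credentials from environment variables.

    Returns a dict shaped like the `storage_options` argument used above,
    containing only the credentials that are actually set.
    """
    mapping = {
        "account_name": "AZURE_STORAGE_ACCOUNT_NAME",
        "account_key": "AZURE_STORAGE_ACCOUNT_KEY",
        "sas_token": "AZURE_STORAGE_SAS_TOKEN",
        "connection_string": "AZURE_STORAGE_CONNECTION_STRING",
        "tenant_id": "AZURE_STORAGE_TENANT_ID",
        "client_id": "AZURE_STORAGE_CLIENT_ID",
        "client_secret": "AZURE_STORAGE_CLIENT_SECRET",
    }
    return {opt: os.environ[var] for opt, var in mapping.items() if var in os.environ}

os.environ["AZURE_STORAGE_ACCOUNT_NAME"] = "myaccount"  # demo value
opts = storage_options_from_env()
```

The resulting dict can be passed directly as `storage_options` to `dd.read_csv` or `dd.read_parquet`.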

To read from a public storage blob, you must still specify the 'account_name'. For example, you can access the NYC Taxi & Limousine Commission open dataset as:

storage_options = {'account_name': 'azureopendatastorage'}
ddf = dd.read_parquet('az://nyctlc/green/puYear=2019/puMonth=*/*.parquet', storage_options=storage_options)

Details

The package includes Pythonic filesystem implementations for both Azure Datalake Gen1 and Azure Datalake Gen2, facilitating interactions between both Azure Datalake implementations and Dask. This is done by leveraging the intake/filesystem_spec base class and the Azure Python SDKs.

Operations against the Gen1 Datalake currently work only with an Azure ServicePrincipal that has suitable credentials to perform operations on the resources of choice.

Operations against the Gen2 Datalake are implemented by leveraging the Azure Blob Storage Python SDK.

The filesystem can be instantiated with a variety of credentials, including:
    account_name
    account_key
    sas_token
    connection_string
    Azure ServicePrincipal credentials (which requires tenant_id, client_id, client_secret)
    location_mode: valid values are "primary" or "secondary"; applies to RA-GRS accounts
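The credential options above combine into `storage_options` dicts in a few standard shapes. A sketch with placeholder values (the variable names are only for illustration):

```python
# Each dict below is a valid shape for `storage_options`; values are placeholders.

# Shared key authentication
key_auth = {"account_name": "myaccount", "account_key": "<key>"}

# SAS token authentication
sas_auth = {"account_name": "myaccount", "sas_token": "<token>"}

# Connection string (embeds both the account name and a credential)
conn_auth = {"connection_string": "<connection-string>"}

# Azure ServicePrincipal: all three credential fields are required together
sp_auth = {
    "account_name": "myaccount",
    "tenant_id": "<tenant-id>",
    "client_id": "<client-id>",
    "client_secret": "<client-secret>",
}

# Read from the secondary endpoint of an RA-GRS account
ragrs_auth = {
    "account_name": "myaccount",
    "account_key": "<key>",
    "location_mode": "secondary",
}
```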

The following environment variables can also be set and picked up for authentication:
    "AZURE_STORAGE_CONNECTION_STRING"
    "AZURE_STORAGE_ACCOUNT_NAME"
    "AZURE_STORAGE_ACCOUNT_KEY"
    "AZURE_STORAGE_SAS_TOKEN"
    "AZURE_STORAGE_CLIENT_SECRET"
    "AZURE_STORAGE_CLIENT_ID"
    "AZURE_STORAGE_TENANT_ID"

The AzureBlobFileSystem accepts all of the Async BlobServiceClient arguments.

By default, write operations create BlockBlobs in Azure, which, once written, cannot be appended. It is possible to create an AppendBlob by passing `mode="ab"` when creating the blob, and then again when operating on it. Currently, AppendBlobs are not available if hierarchical namespaces are enabled.

Project details


Release history

This version

0.7.6

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

adlfs-0.7.6.tar.gz (49.9 kB)

Uploaded Source

Built Distribution

adlfs-0.7.6-py3-none-any.whl (20.7 kB)

Uploaded Python 3

File details

Details for the file adlfs-0.7.6.tar.gz.

File metadata

  • Download URL: adlfs-0.7.6.tar.gz
  • Upload date:
  • Size: 49.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.4.1 importlib_metadata/3.10.1 pkginfo/1.7.0 requests/2.25.1 requests-toolbelt/0.9.1 tqdm/4.61.0 CPython/3.8.8

File hashes

Hashes for adlfs-0.7.6.tar.gz
Algorithm Hash digest
SHA256 50b178e3d504fa02bf0e41d9f9ef3529fea3bbf18f272c377babff2f4ee361ef
MD5 0ed4d1bc22a055c0062c9b1d4cc16590
BLAKE2b-256 323b759a2f057a768b1e174389e3e5efe6c1627512c5dbf448c4e1af28973e12


File details

Details for the file adlfs-0.7.6-py3-none-any.whl.

File metadata

  • Download URL: adlfs-0.7.6-py3-none-any.whl
  • Upload date:
  • Size: 20.7 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.4.1 importlib_metadata/3.10.1 pkginfo/1.7.0 requests/2.25.1 requests-toolbelt/0.9.1 tqdm/4.61.0 CPython/3.8.8

File hashes

Hashes for adlfs-0.7.6-py3-none-any.whl
Algorithm Hash digest
SHA256 37b0762844d308aacf074ce9bdab095f69a0eb81cd9e94c262b1d15800912937
MD5 5a37a91c79b76156b8f9635e62ec99e0
BLAKE2b-256 3cea92740a8a13b74329ebb20e8d6c0eb3fe5f22f8f17d94e12579c9616ef047

