Skip to main content

Access Azure Datalake Gen1 with fsspec and dask

Project description

Filesystem interface to Azure-Datalake Gen1 and Gen2 Storage

PyPI version shields.io Latest conda-forge version

Quickstart

This package can be installed using:

pip install adlfs

or

conda install -c conda-forge adlfs

The adl:// and abfs:// protocols are included in fsspec's known_implementations registry in fsspec > 0.6.1, otherwise users must explicitly inform fsspec about the supported adlfs protocols.

To use the Gen1 filesystem:

import dask.dataframe as dd

storage_options={'tenant_id': TENANT_ID, 'client_id': CLIENT_ID, 'client_secret': CLIENT_SECRET}

dd.read_csv('adl://{STORE_NAME}/{FOLDER}/*.csv', storage_options=storage_options)

To use the Gen2 filesystem you can use the protocol abfs or az:

import dask.dataframe as dd

storage_options={'account_name': ACCOUNT_NAME, 'account_key': ACCOUNT_KEY}

ddf = dd.read_csv('abfs://{CONTAINER}/{FOLDER}/*.csv', storage_options=storage_options)
ddf = dd.read_parquet('az://{CONTAINER}/folder.parquet', storage_options=storage_options)

Accepted protocol / uri formats include:
'PROTOCOL://container/path-part/file'
'PROTOCOL://container@account.dfs.core.windows.net/path-part/file'

or optionally, if AZURE_STORAGE_ACCOUNT_NAME and an AZURE_STORAGE_<CREDENTIAL> is 
set as an environmental variable, then storage_options will be read from the environmental
variables

To read from a public storage blob you are required to specify the 'account_name'. For example, you can access NYC Taxi & Limousine Commission as:

storage_options = {'account_name': 'azureopendatastorage'}
ddf = dd.read_parquet('az://nyctlc/green/puYear=2019/puMonth=*/*.parquet', storage_options=storage_options)

Details

The package includes pythonic filesystem implementations for both Azure Datalake Gen1 and Azure Datalake Gen2, that facilitate interactions between both Azure Datalake implementations and Dask. This is done leveraging the intake/filesystem_spec base class and Azure Python SDKs.

Operations against both Gen1 Datalake currently only work with an Azure ServicePrincipal with suitable credentials to perform operations on the resources of choice.

Operations against the Gen2 Datalake are implemented by leveraging Azure Blob Storage Python SDK.

Setting credentials

The storage_options can be instantiated with a variety of keyword arguments depending on the filesystem. The most commonly used arguments are:

  • connection_string
  • account_name
  • account_key
  • sas_token
  • tenant_id, client_id, and client_secret are combined for an Azure ServicePrincipal e.g. storage_options={'account_name': ACCOUNT_NAME, 'tenant_id': TENANT_ID, 'client_id': CLIENT_ID, 'client_secret': CLIENT_SECRET}
  • anon: boo, optional. The value to use for whether to attempt anonymous access if no other credential is passed. By default (None), the AZURE_STORAGE_ANON environment variable is checked. False values (false, 0, f) will resolve to False and anonymous access will not be attempted. Otherwise the value for anon resolves to True.
  • location_mode: valid values are "primary" or "secondary" and apply to RA-GRS accounts

For more argument details see all arguments for AzureBlobFileSystem here and AzureDatalakeFileSystem here.

The following environmental variables can also be set and picked up for authentication:

  • "AZURE_STORAGE_CONNECTION_STRING"
  • "AZURE_STORAGE_ACCOUNT_NAME"
  • "AZURE_STORAGE_ACCOUNT_KEY"
  • "AZURE_STORAGE_SAS_TOKEN"
  • "AZURE_STORAGE_TENANT_ID"
  • "AZURE_STORAGE_CLIENT_ID"
  • "AZURE_STORAGE_CLIENT_SECRET"

The filesystem can be instantiated for different use cases based on a variety of storage_options combinations. The following list describes some common use cases utilizing AzureBlobFileSystem, i.e. protocols abfsor az. Note that all cases require the account_name argument to be provided:

  1. Anonymous connection to public container: storage_options={'account_name': ACCOUNT_NAME, 'anon': True} will assume the ACCOUNT_NAME points to a public container, and attempt to use an anonymous login. Note, the default value for anon is True.
  2. Auto credential solving using Azure's DefaultAzureCredential() library: storage_options={'account_name': ACCOUNT_NAME, 'anon': False} will use DefaultAzureCredential to get valid credentials to the container ACCOUNT_NAME. DefaultAzureCredential attempts to authenticate via the mechanisms and order visualized here.
  3. Auto credential solving without requiring storage_options: Set AZURE_STORAGE_ANON to false, resulting in automatic credential resolution. Useful for compatibility with fsspec.
  4. Azure ServicePrincipal: tenant_id, client_id, and client_secret are all used as credentials for an Azure ServicePrincipal: e.g. storage_options={'account_name': ACCOUNT_NAME, 'tenant_id': TENANT_ID, 'client_id': CLIENT_ID, 'client_secret': CLIENT_SECRET}.

Append Blob

The AzureBlobFileSystem accepts all of the Async BlobServiceClient arguments.

By default, write operations create BlockBlobs in Azure, which, once written can not be appended. It is possible to create an AppendBlob using mode="ab" when creating and operating on blobs. Currently, AppendBlobs are not available if hierarchical namespaces are enabled.

Project details


Release history Release notifications | RSS feed

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

adlfs-2024.4.0.tar.gz (48.0 kB view details)

Uploaded Source

Built Distribution

adlfs-2024.4.0-py3-none-any.whl (40.3 kB view details)

Uploaded Python 3

File details

Details for the file adlfs-2024.4.0.tar.gz.

File metadata

  • Download URL: adlfs-2024.4.0.tar.gz
  • Upload date:
  • Size: 48.0 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.0.0 CPython/3.9.19

File hashes

Hashes for adlfs-2024.4.0.tar.gz
Algorithm Hash digest
SHA256 7002b93ec6ee7cc61d11db7ceef1a1133f0115dacc2a5c98a5c0d0af93ebff7e
MD5 5b8f4d6e4d7ac6ef6483adebdbc4f8d5
BLAKE2b-256 fa025863a4dab31f4a5c1f0b9e8298edae3764fed64b5daa1123344c85bb6f57

See more details on using hashes here.

Provenance

File details

Details for the file adlfs-2024.4.0-py3-none-any.whl.

File metadata

  • Download URL: adlfs-2024.4.0-py3-none-any.whl
  • Upload date:
  • Size: 40.3 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.0.0 CPython/3.9.19

File hashes

Hashes for adlfs-2024.4.0-py3-none-any.whl
Algorithm Hash digest
SHA256 66dda007545cf96a943587739d35da4cacd60cb9c505a3bc5a83abee3d41d8d3
MD5 aaaebb00d3758ba7711d0e6400d35ef1
BLAKE2b-256 bd6ade50f9ff69a0dcdd17b17a2c4b1b1fa10ed93024cfad663dab71214d1ddf

See more details on using hashes here.

Provenance

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page