Convenient filesystem interface to Azure Data Lake Store

azure-datalake-store is a Python filesystem interface for Azure Data Lake Store.

To install from source (for local testing and development) rather than from PyPI:

> pip install -r dev_requirements.txt
> python setup.py develop

To run the tests, set the following environment variables: azure_tenant_id, azure_username, azure_password, and azure_data_lake_store_name.
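
If you prefer to set them from Python (for example, before launching the test runner in-process), a minimal sketch with placeholder values:

import os

# Placeholder credentials; substitute values for your own subscription
os.environ['azure_tenant_id'] = '<tenant-id>'
os.environ['azure_username'] = '<username>'
os.environ['azure_password'] = '<password>'
os.environ['azure_data_lake_store_name'] = '<store-name>'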

To play with the code, here is a starting point:

from azure.datalake.store import core, lib, multithread
token = lib.auth(tenant_id, username, password)   # authenticate with Azure AD credentials
adl = core.AzureDLFileSystem(store_name, token)   # filesystem object for the named store

# typical operations
adl.ls('')
adl.ls('tmp/', detail=True)
adl.cat('littlefile')
adl.head('gdelt20150827.csv')

# file-like object
with adl.open('gdelt20150827.csv', blocksize=2**20) as f:
    print(f.readline())
    print(f.readline())
    print(f.readline())
    # could have passed f to any function requiring a file object:
    # pandas.read_csv(f)
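
Because the handle behaves like a standard file object, it can be passed to any library that expects one. A minimal sketch using pandas, as suggested above (assumes pandas is installed and continues the session):

import pandas as pd

with adl.open('gdelt20150827.csv', blocksize=2**20) as f:
    df = pd.read_csv(f)  # pandas reads directly from the remote file handle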

with adl.open('anewfile', 'wb') as f:
    # data is written on flush/close, or when buffer is bigger than
    # blocksize
    f.write(b'important data')

adl.du('anewfile')
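
To confirm the write, the new file can be read back (continuing the session above):

adl.cat('anewfile')  # b'important data' once the file has been closed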

# recursively download the whole directory tree with 5 threads and
# 16MB chunks
multithread.ADLDownloader(adl, "", 'my_temp_dir', 5, 2**24)
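
Uploads work the same way in the other direction via multithread.ADLUploader, which takes the same thread-count and chunk-size arguments (the remote path 'uploaded_dir' here is a placeholder; check your installed version for the exact signature):

# recursively upload a local directory tree with 5 threads and
# 16MB chunks
multithread.ADLUploader(adl, 'uploaded_dir', 'my_temp_dir', 5, 2**24)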

To interact with the API at a higher level, you can use the provided command-line interface in “azure/datalake/store/cli.py”. You will need to set the environment variables described above to connect to the Azure Data Lake Store.

To start the CLI in interactive mode, run “python azure/datalake/store/cli.py” and then type “help” to see all available commands (similar to Unix utilities):

> python azure/datalake/store/cli.py
azure> help

Documented commands (type help <topic>):
========================================
cat    chmod  close  du      get   help  ls     mv   quit  rmdir  touch
chgrp  chown  df     exists  head  info  mkdir  put  rm    tail

azure>

While still in interactive mode, you can run “ls -l” to list the entries in the home directory (“help ls” shows the command’s usage details). If you’re not familiar with the Unix/Linux “ls” command, the columns represent 1) permissions, 2) file owner, 3) file group, 4) file size in bytes, 5-7) modification time, and 8) file name.

> python azure/datalake/store/cli.py
azure> ls -l
drwxrwx--- 0123abcd 0123abcd         0 Aug 02 12:44 azure1
-rwxrwx--- 0123abcd 0123abcd   1048576 Jul 25 18:33 abc.csv
-r-xr-xr-x 0123abcd 0123abcd        36 Jul 22 18:32 xyz.csv
drwxrwx--- 0123abcd 0123abcd         0 Aug 03 13:46 tmp
azure> ls -l --human-readable
drwxrwx--- 0123abcd 0123abcd   0B Aug 02 12:44 azure1
-rwxrwx--- 0123abcd 0123abcd   1M Jul 25 18:33 abc.csv
-r-xr-xr-x 0123abcd 0123abcd  36B Jul 22 18:32 xyz.csv
drwxrwx--- 0123abcd 0123abcd   0B Aug 03 13:46 tmp
azure>

To download a remote file, run “get remote-file [local-file]”. The second argument, “local-file”, is optional. If not provided, the local file will be named after the remote file minus the directory path.
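
For example, to save the remote file under a different local name (hypothetical names, log output omitted):

> python azure/datalake/store/cli.py
azure> get xyz.csv my_copy.csv

A full session using the default local name looks like this: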

> python azure/datalake/store/cli.py
azure> ls -l
drwxrwx--- 0123abcd 0123abcd         0 Aug 02 12:44 azure1
-rwxrwx--- 0123abcd 0123abcd   1048576 Jul 25 18:33 abc.csv
-r-xr-xr-x 0123abcd 0123abcd        36 Jul 22 18:32 xyz.csv
drwxrwx--- 0123abcd 0123abcd         0 Aug 03 13:46 tmp
azure> get xyz.csv
2016-08-04 18:57:48,603 - ADLFS - DEBUG - Creating empty file xyz.csv
2016-08-04 18:57:48,604 - ADLFS - DEBUG - Fetch: xyz.csv, 0-36
2016-08-04 18:57:49,726 - ADLFS - DEBUG - Downloaded to xyz.csv, byte offset 0
2016-08-04 18:57:49,734 - ADLFS - DEBUG - File downloaded (xyz.csv -> xyz.csv)
azure>

The CLI can also be run in command-line mode, executing a single command and exiting, without entering the interpreter.

For example, listing the entries in the home directory:

> python azure/datalake/store/cli.py ls -l
drwxrwx--- 0123abcd 0123abcd         0 Aug 02 12:44 azure1
-rwxrwx--- 0123abcd 0123abcd   1048576 Jul 25 18:33 abc.csv
-r-xr-xr-x 0123abcd 0123abcd        36 Jul 22 18:32 xyz.csv
drwxrwx--- 0123abcd 0123abcd         0 Aug 03 13:46 tmp
>

Also, downloading a remote file:

> python azure/datalake/store/cli.py get xyz.csv
2016-08-04 18:57:48,603 - ADLFS - DEBUG - Creating empty file xyz.csv
2016-08-04 18:57:48,604 - ADLFS - DEBUG - Fetch: xyz.csv, 0-36
2016-08-04 18:57:49,726 - ADLFS - DEBUG - Downloaded to xyz.csv, byte offset 0
2016-08-04 18:57:49,734 - ADLFS - DEBUG - File downloaded (xyz.csv -> xyz.csv)
>
