Azure Data Lake Store Filesystem Client Library for Python

azure-datalake-store

azure-datalake-store is a Python filesystem client library for the Azure Data Lake Store.

To install from source instead of pip (for local testing and development):

> pip install -r dev_requirements.txt
> python setup.py develop

To run the tests, set the following environment variables: azure_tenant_id, azure_username, azure_password, and azure_data_lake_store_name.
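
A minimal sketch of collecting these four settings in one place before running the tests or the samples. The helper name `load_adl_settings` is hypothetical and not part of the library; it only reads the variable names listed above and reports any that are unset.

```python
import os

# The four variable names listed above
REQUIRED_VARS = (
    "azure_tenant_id",
    "azure_username",
    "azure_password",
    "azure_data_lake_store_name",
)


def load_adl_settings(environ=None):
    """Return the required settings as a dict, or raise if any are unset."""
    environ = os.environ if environ is None else environ
    missing = [name for name in REQUIRED_VARS if not environ.get(name)]
    if missing:
        raise RuntimeError("missing environment variables: " + ", ".join(missing))
    return {name: environ[name] for name in REQUIRED_VARS}
```

Failing early with the list of missing names tends to be easier to diagnose than an authentication error later on.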

To play with the code, here is a starting point:

from azure.datalake.store import core, lib, multithread

# tenant_id, username, password and store_name are placeholders for your own values
token = lib.auth(tenant_id, username, password)
adl = core.AzureDLFileSystem(token, store_name=store_name)

# typical operations
adl.ls('')
adl.ls('tmp/', detail=True)
adl.ls('tmp/', detail=True, invalidate_cache=True)
adl.cat('littlefile')
adl.head('gdelt20150827.csv')

# file-like object
with adl.open('gdelt20150827.csv', blocksize=2**20) as f:
    print(f.readline())
    print(f.readline())
    print(f.readline())
    # could have passed f to any function requiring a file object:
    # pandas.read_csv(f)

with adl.open('anewfile', 'wb') as f:
    # data is written on flush/close, or when buffer is bigger than
    # blocksize
    f.write(b'important data')

adl.du('anewfile')

# recursively download the whole directory tree with 5 threads and
# 16MB chunks
multithread.ADLDownloader(adl, '', 'my_temp_dir', nthreads=5, chunksize=2**24)

Command Line Sample Usage

To interact with the API at a higher level, you can use the command-line interface provided in “samples/cli.py”. To connect to the Azure Data Lake Store, set the environment variables described above. A simple example follows, with more detail in the rest of this section.

> python samples/cli.py ls -l

Execute the program without arguments to access documentation.

To start the CLI in interactive mode, run “python samples/cli.py” and then type “help” to see all available commands (similar to Unix utilities):

> python samples/cli.py
azure> help

Documented commands (type help <topic>):
========================================
cat    chmod  close  du      get   help  ls     mv   quit  rmdir  touch
chgrp  chown  df     exists  head  info  mkdir  put  rm    tail

azure>

While still in interactive mode, you can run “ls -l” to list the entries in the home directory (“help ls” will show the command’s usage details). If you’re not familiar with the Unix/Linux “ls” command, the columns represent 1) permissions, 2) file owner, 3) file group, 4) file size, 5-7) file’s modification time, and 8) file name.

> python samples/cli.py
azure> ls -l
drwxrwx--- 0123abcd 0123abcd         0 Aug 02 12:44 azure1
-rwxrwx--- 0123abcd 0123abcd   1048576 Jul 25 18:33 abc.csv
-r-xr-xr-x 0123abcd 0123abcd        36 Jul 22 18:32 xyz.csv
drwxrwx--- 0123abcd 0123abcd         0 Aug 03 13:46 tmp
azure> ls -l --human-readable
drwxrwx--- 0123abcd 0123abcd   0B Aug 02 12:44 azure1
-rwxrwx--- 0123abcd 0123abcd   1M Jul 25 18:33 abc.csv
-r-xr-xr-x 0123abcd 0123abcd  36B Jul 22 18:32 xyz.csv
drwxrwx--- 0123abcd 0123abcd   0B Aug 03 13:46 tmp
azure>
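
If you need these columns programmatically, a line of the listing can be split on whitespace. The sketch below is only an illustration of the column layout described above: `Entry` and `parse_ls_line` are hypothetical names, not part of the library or the CLI, and the size is kept as text because `--human-readable` prints values such as `1M`.

```python
from collections import namedtuple

# One field per column group described above
Entry = namedtuple("Entry", "permissions owner group size modified name")


def parse_ls_line(line):
    """Split one `ls -l` output line into its columns."""
    # maxsplit=7 keeps a file name that contains spaces in one piece
    perms, owner, group, size, month, day, clock, name = line.split(None, 7)
    return Entry(perms, owner, group, size, " ".join((month, day, clock)), name)


entry = parse_ls_line("-rwxrwx--- 0123abcd 0123abcd   1048576 Jul 25 18:33 abc.csv")
# A leading "d" in the permissions column marks a directory
print(entry.name, entry.size, entry.permissions.startswith("d"))
# prints: abc.csv 1048576 False
```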

To download a remote file, run “get remote-file [local-file]”. The second argument, “local-file”, is optional. If not provided, the local file will be named after the remote file minus the directory path.

> python samples/cli.py
azure> ls -l
drwxrwx--- 0123abcd 0123abcd         0 Aug 02 12:44 azure1
-rwxrwx--- 0123abcd 0123abcd   1048576 Jul 25 18:33 abc.csv
-r-xr-xr-x 0123abcd 0123abcd        36 Jul 22 18:32 xyz.csv
drwxrwx--- 0123abcd 0123abcd         0 Aug 03 13:46 tmp
azure> get xyz.csv
2016-08-04 18:57:48,603 - ADLFS - DEBUG - Creating empty file xyz.csv
2016-08-04 18:57:48,604 - ADLFS - DEBUG - Fetch: xyz.csv, 0-36
2016-08-04 18:57:49,726 - ADLFS - DEBUG - Downloaded to xyz.csv, byte offset 0
2016-08-04 18:57:49,734 - ADLFS - DEBUG - File downloaded (xyz.csv -> xyz.csv)
azure>
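
The default naming rule for the optional second argument can be sketched as follows. This is an illustration of the rule as described, not the CLI's actual code; `default_local_name` is a hypothetical helper, and it assumes remote paths use forward slashes.

```python
import posixpath


def default_local_name(remote_path, local_path=None):
    """Return local_path if given, else the remote file name without its directory."""
    if local_path:
        return local_path
    # Remote paths are assumed to be "/"-separated, so posixpath is used
    # regardless of the local operating system
    return posixpath.basename(remote_path.rstrip("/"))


print(default_local_name("tmp/data/xyz.csv"))     # prints: xyz.csv
print(default_local_name("xyz.csv", "copy.csv"))  # prints: copy.csv
```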

It is also possible to run in command-line mode, allowing any available command to be executed separately without remaining in the interpreter.

For example, listing the entries in the home directory:

> python samples/cli.py ls -l
drwxrwx--- 0123abcd 0123abcd         0 Aug 02 12:44 azure1
-rwxrwx--- 0123abcd 0123abcd   1048576 Jul 25 18:33 abc.csv
-r-xr-xr-x 0123abcd 0123abcd        36 Jul 22 18:32 xyz.csv
drwxrwx--- 0123abcd 0123abcd         0 Aug 03 13:46 tmp
>

Also, downloading a remote file:

> python samples/cli.py get xyz.csv
2016-08-04 18:57:48,603 - ADLFS - DEBUG - Creating empty file xyz.csv
2016-08-04 18:57:48,604 - ADLFS - DEBUG - Fetch: xyz.csv, 0-36
2016-08-04 18:57:49,726 - ADLFS - DEBUG - Downloaded to xyz.csv, byte offset 0
2016-08-04 18:57:49,734 - ADLFS - DEBUG - File downloaded (xyz.csv -> xyz.csv)
>

Release History

0.0.12 (2017-06-20)

  • Fix a regression with ls returning the top level folder if it has no contents. It now properly returns an empty array if a folder has no children.

0.0.11 (2017-06-02)

  • Update to name incomplete file downloads with a .inprogress suffix. This suffix is removed when the download completes successfully.
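
The general pattern behind this change can be sketched as below. This is not the library's implementation; `write_with_inprogress` is a hypothetical helper (Python 3) that writes to a `.inprogress` name and renames the file only after every chunk has been written.

```python
import os
import tempfile


def write_with_inprogress(dest, chunks):
    """Write chunks to dest + '.inprogress', renaming to dest only on success."""
    partial = dest + ".inprogress"
    with open(partial, "wb") as f:
        for chunk in chunks:
            f.write(chunk)
    # Reached only if no exception was raised while writing
    os.replace(partial, dest)
    return dest


# Usage with in-memory chunks standing in for downloaded data
target = os.path.join(tempfile.mkdtemp(), "xyz.csv")
write_with_inprogress(target, [b"first,", b"second"])
```

If writing fails partway, the `.inprogress` file is left behind, so an incomplete download is never mistaken for a finished one.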

0.0.10 (2017-05-24)

  • Allow users to explicitly use or invalidate the internal, local cache of the filesystem that is built up from previous ls calls. It is now set to always call the service instead of the cache by default.

  • Update to properly create the wheel package during build to ensure all pip packages are available.

  • Update folder upload/download to properly throw early in the event that the destination files exist and overwrite was not specified. NOTE: target folder existence (or sub folder existence) does not automatically cause failure. Only leaf node existence will result in failure.

  • Fix a bug that caused file not found errors when attempting to get information about the root folder.
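
The early overwrite check described in this release can be illustrated with a local-filesystem sketch. It is not the library's code; `check_no_overwrite` is a hypothetical helper (Python 3) that fails only when a destination file (a leaf node) already exists, while existing folders are accepted.

```python
import os


def check_no_overwrite(dest_files, overwrite=False):
    """Raise before any transfer starts if a destination file already exists."""
    if overwrite:
        return
    # os.path.isfile is False for directories, so existing target folders
    # and sub-folders do not cause a failure
    existing = [path for path in dest_files if os.path.isfile(path)]
    if existing:
        raise FileExistsError("destination files exist: " + ", ".join(existing))
```

Running the check over the full list of destination paths before starting avoids leaving a transfer half done when a conflict is found late.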

0.0.9 (2017-05-09)

  • Enforce basic SSL utilization to ensure performance due to GitHub issue 625 <https://github.com/pyca/pyopenssl/issues/625>

0.0.8 (2017-04-26)

  • Fix server-side throttling retry support. This is not a guarantee that if the server is throttling the upload (or download) it will eventually succeed, but there is now a back-off retry in place to make it more likely.
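
A back-off retry of this kind can be sketched generically as below. This is an illustration of the technique, not the library's implementation: `ThrottledError`, `call_with_backoff`, and the default retry count and delay are all assumptions chosen for the example.

```python
import time


class ThrottledError(Exception):
    """Hypothetical stand-in for a server-side throttling response."""


def call_with_backoff(operation, retries=4, base_delay=0.5, sleep=time.sleep):
    """Call operation(); on ThrottledError wait base_delay * 2**attempt, then retry."""
    for attempt in range(retries + 1):
        try:
            return operation()
        except ThrottledError:
            if attempt == retries:
                # Out of retries: eventual success is not guaranteed
                raise
            sleep(base_delay * 2 ** attempt)
```

Doubling the delay after each throttled attempt gives the server progressively more time to recover, which makes success more likely without guaranteeing it.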

0.0.7 (2017-04-19)

  • Update the build process to more efficiently handle multi-part namespaces for pip.

0.0.6 (2017-03-15)

  • Fix an issue with path caching that should drastically improve performance for download

0.0.5 (2017-03-01)

  • Fix for downloader to ensure there is access to the source path before creating destination files

  • Fix for credential objects to inherit from msrest.authentication for more universal authentication support

  • Add support for the following:

    • set_expiry: allows for setting expiration on files

    • ACL management:

      • set_acl: allows for the full replacement of an ACL on a file or folder

      • set_acl_entries: allows for “patching” an existing ACL on a file or folder

      • get_acl_status: retrieves the ACL information for a file or folder

      • remove_acl_entries: removes the specified entries from an ACL on a file or folder

      • remove_acl: removes all non-default ACL entries from a file or folder

      • remove_default_acl: removes all default ACL entries from a folder

  • Remove unsupported and unused “TRUNCATE” operation.

  • Added API-Version support with a default of the latest api version (2016-11-01)

0.0.4 (2017-02-07)

  • Fix for folder upload to properly delete folders with contents when overwrite specified.

  • Fix to set verbose output to False/Off by default. This removes progress tracking output by default but drastically improves performance.

0.0.3 (2017-02-02)

  • Fix to setup.py to include the HISTORY.rst file. No other changes

0.0.2 (2017-01-30)

  • Addresses an issue with lib.auth() not properly defaulting to 2FA

  • Fixes an issue with Overwrite for ADLUploader sometimes not being honored.

  • Fixes an issue with empty files not properly being uploaded and resulting in a hang in progress tracking.

  • Addition of a samples directory showcasing examples of how to use the client and upload and download logic.

  • General cleanup of documentation and comments.

  • This is still based on API version 2016-11-01

0.0.1 (2016-11-21)

  • Initial preview release. Based on API version 2016-11-01.

  • Includes initial ADLS filesystem functionality and extended upload and download support.

Download files

Source Distribution

azure-datalake-store-0.0.12.tar.gz (50.5 kB view details)

Uploaded Source

Built Distribution

azure_datalake_store-0.0.12-py2.py3-none-any.whl (46.7 kB view details)

Uploaded Python 2 Python 3

File details

Details for the file azure-datalake-store-0.0.12.tar.gz.

File hashes

Hashes for azure-datalake-store-0.0.12.tar.gz
Algorithm Hash digest
SHA256 d718262bed5439621ed3d61f3ea3f40334e8e6889966d3828491177a0960f7b3
MD5 fc9c6537a6bdbc281bd63a766704fe7b
BLAKE2b-256 10dd2f5cc9ed2ac4d6d889475c73196fb124d0e78fd64fc6badc5db58e49a024

File details

Details for the file azure_datalake_store-0.0.12-py2.py3-none-any.whl.

File hashes

Hashes for azure_datalake_store-0.0.12-py2.py3-none-any.whl
Algorithm Hash digest
SHA256 602a3f8ac9839886524f749ead203ad41f185a6356099df472b5b0a23ccdeb19
MD5 619fa96bf76a6b996e212aeef64573c6
BLAKE2b-256 78ebfe57caa2d3c77421bb59b537c4ae7e336e20e5f91b7acaf7204fcfe58126
