
Python utilities used by Deep Procedural Intelligence


DPU Utilities


This repository contains a set of utilities used across the projects of the DPU team.

Python

Stored in the python subdirectory, published as the dpu-utils package.

Installation

pip install dpu-utils

Overview

Below is an overview of the included utilities. Detailed documentation is provided in the docstring of each class.

Generic Utilities dpu_utils.utils
  • ChunkWriter provides a convenient API for writing output in multiple parts (chunks).
  • RichPath an API that abstracts local and Azure Blob paths in your code.
  • *Iterator Wrappers that can parallelize and shuffle iterators.
  • {load,save}_json[l]_gz convenience API for loading and writing .json[l].gz files.
  • git_tag_run tags the state of the code in the current working directory's git repository.
  • run_and_debug starts a debug session when an exception happens. Usually used as a wrapper around __main__.
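To make the .jsonl.gz format concrete, here is a minimal sketch using only the standard library. It is not the dpu_utils implementation: the function names save_jsonl_gz_sketch, load_jsonl_gz_sketch and roundtrip_demo are hypothetical and exist only for illustration. The sketch assumes the usual convention of one JSON document per line inside a gzip-compressed text file.

```python
import gzip
import json
import os
import tempfile
from typing import Any, Iterable, Iterator, List


def save_jsonl_gz_sketch(records: Iterable[Any], path: str) -> None:
    """Write one JSON document per line into a gzip-compressed text file."""
    with gzip.open(path, "wt", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record))
            f.write("\n")


def load_jsonl_gz_sketch(path: str) -> Iterator[Any]:
    """Lazily yield one parsed JSON document per non-empty line."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            if line.strip():
                yield json.loads(line)


def roundtrip_demo() -> List[Any]:
    """Save two records to a temporary file and read them back."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "sample.jsonl.gz")
        save_jsonl_gz_sketch([{"tokens": ["a", "b"]}, {"tokens": []}], path)
        # Consume the generator before the temporary directory is removed.
        return list(load_jsonl_gz_sketch(path))
```

Reading lazily keeps memory use flat on large files, which is the main reason to prefer line-delimited JSON over a single JSON array.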
General Machine Learning Utilities dpu_utils.mlutils
  • Vocabulary maps elements into unique integer ids and back. Commonly used in machine learning models that work over discrete data (e.g. words in NLP). Contains methods for converting a list of tokens into their "tensorized" form of integer ids.
  • BpeVocabulary a vocabulary for machine learning models that employs BPE (via sentencepiece).
  • CharTensorizer converts character sequences into tensors, commonly used in machine learning models whose input is a list of characters.
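The idea behind a vocabulary can be sketched in a few lines. The class below is a hypothetical illustration, not the dpu_utils Vocabulary API: the class name, method names, and the %PAD% / %UNK% marker strings are all assumptions made for this example. It shows the two directions of the mapping and a fixed-length "tensorized" form with padding.

```python
from typing import Dict, Iterable, List, Sequence


class VocabularySketch:
    """Hypothetical token <-> integer id mapping with padding and unknown ids."""

    PAD = "%PAD%"  # assumed marker for padding positions
    UNK = "%UNK%"  # assumed marker for out-of-vocabulary tokens

    def __init__(self, tokens: Iterable[str]) -> None:
        self.id_to_token: List[str] = [self.PAD, self.UNK]
        self.token_to_id: Dict[str, int] = {self.PAD: 0, self.UNK: 1}
        for token in tokens:
            if token not in self.token_to_id:
                self.token_to_id[token] = len(self.id_to_token)
                self.id_to_token.append(token)

    def get_id(self, token: str) -> int:
        """Return the id of a token, or the UNK id if it was never seen."""
        return self.token_to_id.get(token, self.token_to_id[self.UNK])

    def tensorize(self, tokens: Sequence[str], max_len: int) -> List[int]:
        """Truncate or pad a token sequence to exactly max_len integer ids."""
        ids = [self.get_id(t) for t in tokens[:max_len]]
        ids += [self.token_to_id[self.PAD]] * (max_len - len(ids))
        return ids

    def get_tokens(self, ids: Iterable[int]) -> List[str]:
        """Map integer ids back to their tokens."""
        return [self.id_to_token[i] for i in ids]


vocab = VocabularySketch(["def", "foo", "(", ")"])
ids = vocab.tensorize(["def", "bar", "("], max_len=5)
```

Reserving fixed ids for padding and unknown tokens lets a model batch sequences of different lengths and handle tokens that never appeared when the vocabulary was built.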
Code-related Utilities dpu_utils.codeutils
TensorFlow 1.x Utilities dpu_utils.tfutils

Unsorted segment operations following TensorFlow's unsorted_segment_sum operations:

TensorFlow 2.x Utilities dpu_utils.tf2utils

Unsorted segment operations following TensorFlow's unsorted_segment_sum operations:
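For readers unfamiliar with the operation these utilities are modelled on, the following pure-Python sketch illustrates what an unsorted segment sum computes: every value is added to the bucket named by its segment id, and the ids may appear in any order. The function name unsorted_segment_sum_sketch is hypothetical, and this is not the dpu_utils or TensorFlow implementation. As a simplification, the sketch rejects out-of-range ids with an error.

```python
from typing import List, Sequence


def unsorted_segment_sum_sketch(
    data: Sequence[float], segment_ids: Sequence[int], num_segments: int
) -> List[float]:
    """Sum the values of `data` that share a segment id."""
    if len(data) != len(segment_ids):
        raise ValueError("data and segment_ids must have the same length")
    totals = [0.0] * num_segments
    for value, segment in zip(data, segment_ids):
        if not 0 <= segment < num_segments:
            raise ValueError(f"segment id {segment} is out of range")
        totals[segment] += value
    return totals


# Segment 1 receives no values, so it keeps the initial 0.0.
sums = unsorted_segment_sum_sketch([1.0, 2.0, 3.0, 4.0], [0, 2, 0, 2], 3)
```

This pattern is typically used to aggregate per-element values into per-group values, for example per-node values into per-graph values, when the elements of a group are not stored contiguously.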

TensorFlow Models dpu_utils.tfmodels

These models have not been tested with TF 2.0.

PyTorch Utilities dpu_utils.ptutils
  • BaseComponent a wrapper abstract class around nn.Module that takes care of essential elements of most neural network components.
  • ComponentTrainer a training loop for BaseComponents.

Command-line tools

Approximate Duplicate Code Detection

You can use the deduplicationcli command to detect duplicates in pre-processed source code, by invoking

deduplicationcli DATA_PATH OUT_JSON

where DATA_PATH is the path to the tokenized .jsonl.gz files and OUT_JSON is the target output file. For more options, see --help.
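As an illustration of what approximate duplicate detection over tokenized code can look like, here is a minimal sketch based on Jaccard similarity between token sets. This is an assumption about the general technique, not a description of how deduplicationcli works internally. The function names, the in-memory input format, and the 0.8 default threshold are all hypothetical.

```python
from itertools import combinations
from typing import Dict, List, Sequence, Set, Tuple


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Size of the intersection divided by size of the union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def find_near_duplicates(
    files: Dict[str, Sequence[str]], threshold: float = 0.8
) -> List[Tuple[str, str]]:
    """Return pairs of file names whose token sets are at least `threshold` similar."""
    token_sets = {name: set(tokens) for name, tokens in files.items()}
    return [
        (first, second)
        for first, second in combinations(sorted(token_sets), 2)
        if jaccard(token_sets[first], token_sets[second]) >= threshold
    ]


files = {
    "a.py": ["def", "f", "(", "x", ")", ":", "return", "x"],
    "b.py": ["def", "f", "(", "x", ")", ":", "return", "x", "+", "1"],
    "c.py": ["import", "os"],
}
pairs = find_near_duplicates(files, threshold=0.7)
```

Comparing all pairs is quadratic in the number of files, so this sketch only suits small collections. Larger corpora usually need some form of candidate filtering before pairwise comparison.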

An exact (but usually slower) version of this can be found here along with code to tokenize Java, C#, Python and JavaScript into the relevant formats.

Tests

Run the unit tests

python setup.py test

Generate code coverage reports

# pip install coverage
coverage run --source dpu_utils/ setup.py test && \
  coverage html

The resulting HTML file will be in htmlcov/index.html.

.NET

Stored in the dotnet subdirectory.

Generic Utilities:

  • Microsoft.Research.DPU.Utils.RichPath: a convenient way of using both local paths and Azure paths in your code.

Code-related Utilities:

  • Microsoft.Research.DPU.CSharpSourceGraphExtraction: infrastructure to extract Program Graphs from C# projects.

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.microsoft.com.

When you submit a pull request, a CLA-bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., label, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.


