Python utilities used by Deep Procedural Intelligence

DPU Utilities

This contains a set of utilities used across projects of the DPU team.

Python

Stored in the python subdirectory, published as the dpu-utils package.

Installation

pip install dpu-utils

OR via the community-maintained Conda recipe:

conda install -c conda-forge dpu-utils

Overview

Below is an overview of the included utilities. Detailed documentation is provided in the docstring of each class.

Generic Utilities dpu_utils.utils
  • ChunkWriter: provides a convenient API for writing output in multiple parts (chunks).
  • RichPath: an API that abstracts local and Azure Blob paths in your code.
  • *Iterator: wrappers that can parallelize and shuffle iterators.
  • {load,save}_json[l]_gz: convenience API for loading and writing .json[l].gz files.
  • git_tag_run: tags the state of the code in the current working directory's git repository.
  • run_and_debug: starts a debug session when an exception happens. Usually used as a wrapper around __main__.
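To make the .jsonl.gz format concrete, here is a minimal standard-library sketch of a round trip: one JSON document per line, inside a gzip-compressed text file. The names write_jsonl_gz, read_jsonl_gz and roundtrip_demo are hypothetical stand-ins for illustration; they are not the dpu_utils functions, whose exact signatures are documented in their docstrings.

```python
import gzip
import json
import os
import tempfile
from typing import Any, Iterable, Iterator, List


def write_jsonl_gz(records: Iterable[Any], path: str) -> None:
    """Write one JSON document per line into a gzip-compressed text file."""
    with gzip.open(path, "wt", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record))
            f.write("\n")


def read_jsonl_gz(path: str) -> Iterator[Any]:
    """Lazily yield one parsed JSON document per non-empty line."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            if line.strip():
                yield json.loads(line)


def roundtrip_demo() -> List[Any]:
    """Write two records to a temporary file and read them back."""
    records = [{"id": 1, "tokens": ["a", "b"]}, {"id": 2, "tokens": []}]
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "sample.jsonl.gz")
        write_jsonl_gz(records, path)
        return list(read_jsonl_gz(path))
```

Reading lazily, line by line, is what makes the JSONL variant attractive for large datasets: a consumer never needs the whole file in memory.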
General Machine Learning Utilities dpu_utils.mlutils
  • Vocabulary: maps elements into unique integer ids and back. Commonly used in machine learning models that work over discrete data (e.g. words in NLP). Contains methods for converting a list of tokens into their "tensorized" form of integer ids.
  • BpeVocabulary: a vocabulary for machine learning models that employs BPE (via sentencepiece).
  • CharTensorizer: converts character sequences into tensors, commonly used in machine learning models whose input is a list of characters.
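The idea behind a vocabulary can be sketched in a few lines of plain Python. The class below is a toy illustration only: ToyVocabulary, its method names and the %PAD%/%UNK% marker strings are assumptions made for this example and do not describe the dpu_utils Vocabulary API.

```python
from typing import Dict, Iterable, List


class ToyVocabulary:
    """Hypothetical minimal vocabulary: token <-> integer id, with padding."""

    PAD = "%PAD%"  # assumed padding marker, always id 0
    UNK = "%UNK%"  # assumed out-of-vocabulary marker, always id 1

    def __init__(self, tokens: Iterable[str]) -> None:
        self.id_to_token: List[str] = [self.PAD, self.UNK]
        self.token_to_id: Dict[str, int] = {self.PAD: 0, self.UNK: 1}
        for token in tokens:
            if token not in self.token_to_id:
                self.token_to_id[token] = len(self.id_to_token)
                self.id_to_token.append(token)

    def tensorize(self, tokens: List[str], size: int) -> List[int]:
        """Map tokens to ids, truncating or padding to exactly `size` ids."""
        unk_id = self.token_to_id[self.UNK]
        ids = [self.token_to_id.get(t, unk_id) for t in tokens[:size]]
        return ids + [self.token_to_id[self.PAD]] * (size - len(ids))

    def tokens_for(self, ids: Iterable[int]) -> List[str]:
        """Map ids back to their tokens."""
        return [self.id_to_token[i] for i in ids]


vocab = ToyVocabulary(["def", "foo", "(", ")", ":"])
ids = vocab.tensorize(["def", "bar", "(", ")"], size=6)
print(ids)                    # [2, 1, 4, 5, 0, 0]
print(vocab.tokens_for(ids))  # ['def', '%UNK%', '(', ')', '%PAD%', '%PAD%']
```

The fixed output length is what lets a batch of token sequences be stacked into a single rectangular tensor.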
Code-related Utilities dpu_utils.codeutils
TensorFlow 1.x Utilities dpu_utils.tfutils

Unsorted segment operations following TensorFlow's unsorted_segment_sum operations:
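For readers unfamiliar with the convention, the semantics of an unsorted segment sum can be sketched in plain Python: every value is added to the output slot named by its segment id, and the ids need not be sorted or contiguous. This is a reference sketch of the convention only, not the dpu_utils or TensorFlow implementation; unlike a tensor op it works on flat lists and rejects ids outside [0, num_segments).

```python
from typing import List, Sequence


def unsorted_segment_sum(
    data: Sequence[float], segment_ids: Sequence[int], num_segments: int
) -> List[float]:
    """Return out where out[s] is the sum of data[i] with segment_ids[i] == s."""
    if len(data) != len(segment_ids):
        raise ValueError("data and segment_ids must have the same length")
    out = [0.0] * num_segments  # segments that receive no value stay at 0.0
    for value, seg in zip(data, segment_ids):
        if not 0 <= seg < num_segments:
            raise ValueError(f"segment id {seg} is out of range")
        out[seg] += value
    return out


print(unsorted_segment_sum([1.0, 2.0, 3.0, 4.0], [0, 2, 0, 2], 3))  # [4.0, 0.0, 6.0]
```

Other segment-wise operations (for example a maximum or a softmax per segment) follow the same calling pattern of data, segment ids and a segment count.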

TensorFlow 2.x Utilities dpu_utils.tf2utils

Unsorted segment operations following TensorFlow's unsorted_segment_sum operations:

TensorFlow Models dpu_utils.tfmodels

These models have not been tested with TF 2.0.

PyTorch Utilities dpu_utils.ptutils
  • BaseComponent: an abstract wrapper class around nn.Module that takes care of essential elements of most neural network components.
  • ComponentTrainer: a training loop for BaseComponents.

Command-line tools

Approximate Duplicate Code Detection

You can use the deduplicationcli command to detect duplicates in pre-processed source code, by invoking

deduplicationcli DATA_PATH OUT_JSON

where DATA_PATH is a file containing tokenized .jsonl.gz files and OUT_JSON is the target output file. For more options look at --help.
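As an illustration of what approximate duplicate detection means, the sketch below flags pairs of files whose token sets have a high Jaccard similarity. It is a simplified stand-in under stated assumptions, not the algorithm deduplicationcli runs: the function names, the 0.8 threshold and the brute-force pairwise comparison are all choices made for this example.

```python
from itertools import combinations
from typing import Dict, List, Set, Tuple


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Size of the intersection divided by size of the union."""
    if not a and not b:
        return 1.0  # two empty token sets are treated as identical
    return len(a & b) / len(a | b)


def near_duplicate_pairs(
    files: Dict[str, List[str]], threshold: float = 0.8  # assumed threshold
) -> List[Tuple[str, str]]:
    """Return name pairs whose token-set similarity reaches the threshold."""
    token_sets = {name: set(tokens) for name, tokens in files.items()}
    return [
        (x, y)
        for x, y in combinations(sorted(token_sets), 2)
        if jaccard(token_sets[x], token_sets[y]) >= threshold
    ]


files = {
    "a.py": ["def", "f", "(", "x", ")", ":", "return", "x"],
    "b.py": ["def", "f", "(", "x", ")", ":", "return", "x", "1"],
    "c.py": ["class", "C", ":", "pass"],
}
print(near_duplicate_pairs(files))  # [('a.py', 'b.py')]
```

Comparing every pair is quadratic in the number of files, which is fine for a demonstration but is the reason practical tools rely on approximate methods for large corpora.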

An exact (but usually slower) version of this can be found here along with code to tokenize Java, C#, Python and JavaScript into the relevant formats.

Tests

Run the unit tests

python setup.py test

Generate code coverage reports

# pip install coverage
coverage run --source dpu_utils/ setup.py test && \
  coverage html

The resulting HTML file will be in htmlcov/index.html.

.NET

Stored in the dotnet subdirectory.

Generic Utilities:

  • Microsoft.Research.DPU.Utils.RichPath: a convenient way of using both local paths and Azure paths in your code.

Code-related Utilities:

  • Microsoft.Research.DPU.CSharpSourceGraphExtraction: infrastructure to extract Program Graphs from C# projects.

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.microsoft.com.

When you submit a pull request, a CLA-bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., label, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.
