Skip to main content

Python utilities used by Deep Procedural Intelligence

Project description

DPU Utilities

Build Status

This contains a set of utilities used across projects of the DPU team.

Python

Stored in the python subdirectory, published as the dpu-utils package.

Installation

pip install dpu-utils

Overview

Below you can find an overview of the utilities included. Detailed documentation is provided at the docstring of each class.

Generic Utilities dpu_utils.utils
  • ChunkWriter provides a convenient API for writing output in multiple parts (chunks).
  • RichPath an API that abstract local and Azure Blob paths in your code.
  • *Iterator Wrappers that can parallelize and shuffle iterators.
  • {load,save}_json[l]_gz convenience API for loading and writing .json[l].gz files.
  • git_tag_run tags the current working directory git the state of the code.
  • run_and_debug when an exception happens, start a debug session. Usually a wrapper of __main__.
General Machine Learning Utilities dpu_utils.mlutils
  • Vocabulary map elements into unique integer ids and back. Commonly used in machine learning models that work over discrete data (e.g. words in NLP). Contains methods for converting an list of tokens into their "tensorized" for of integer ids.
  • BpeVocabulary a vocabulary for machine learning models that employs BPE (via sentencepiece).
  • CharTensorizer convert character sequences into into tensors, commonly used in machine learning models whose input is a list of characters.
Code-related Utilities dpu_utils.codeutils
TensorFlow 1.x Utilities dpu_utils.tfutils

Unsorted segment operations following TensorFlow's unsorted_segment_sum operations:

TensorFlow 2.x Utilities dpu_utils.tf2utils

Unsorted segment operations following TensorFlow's unsorted_segment_sum operations:

TensorFlow Models dpu_utils.tfmodels

These models have not been tested with TF 2.0.

PyTorch Utilities dpu_utils.ptutils
  • BaseComponent a wrapper abstract class around nn.Module that takes care of essential elements of most neural network components.
  • ComponentTrainer a training loop for BaseComponents.

Command-line tools

Approximate Duplicate Code Detection

You can use the deduplicationcli command to detect duplicates in pre-processed source code, by invoking

deduplicationcli DATA_PATH OUT_JSON

where DATA_PATH is a file containing tokenized .jsonl.gz files and OUT_JSON is the target output file. For more options look at --help.

An exact (but usually slower) version of this can be found here along with code to tokenize Java, C#, Python and JavaScript into the relevant formats.

Tests

Run the unit tests

python setup.py test

Generate code coverage reports

# pip install coverage
coverage run --source dpu_utils/ setup.py test && \
  coverage html

The resulting HTML file will be in htmlcov/index.html.

.NET

Stored in the dotnet subdirectory.

Generic Utilities:

  • Microsoft.Research.DPU.Utils.RichPath: a convenient way of using both paths and Azure paths in your code.

Code-related Utilities:

  • Microsoft.Research.DPU.CSharpSourceGraphExtraction: infrastructure to extract Program Graphs from C# projects.

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.microsoft.com.

When you submit a pull request, a CLA-bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., label, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

dpu_utils-0.2.13.tar.gz (48.3 kB view details)

Uploaded Source

Built Distribution

dpu_utils-0.2.13-py2.py3-none-any.whl (63.9 kB view details)

Uploaded Python 2 Python 3

File details

Details for the file dpu_utils-0.2.13.tar.gz.

File metadata

  • Download URL: dpu_utils-0.2.13.tar.gz
  • Upload date:
  • Size: 48.3 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.23.0 setuptools/41.2.0 requests-toolbelt/0.9.1 tqdm/4.45.0 CPython/3.7.6

File hashes

Hashes for dpu_utils-0.2.13.tar.gz
Algorithm Hash digest
SHA256 eb1986dbbf2053c3fc7de203afcf0e8d8c0926a13567f3eb01de315824c3ceda
MD5 88d178ca942dd242152a3d74439992a1
BLAKE2b-256 48807bb6195ba7a3d81729c1f452350a4668b37c309d5b2d1ccca2a508fa7a39

See more details on using hashes here.

File details

Details for the file dpu_utils-0.2.13-py2.py3-none-any.whl.

File metadata

  • Download URL: dpu_utils-0.2.13-py2.py3-none-any.whl
  • Upload date:
  • Size: 63.9 kB
  • Tags: Python 2, Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.23.0 setuptools/41.2.0 requests-toolbelt/0.9.1 tqdm/4.45.0 CPython/3.7.6

File hashes

Hashes for dpu_utils-0.2.13-py2.py3-none-any.whl
Algorithm Hash digest
SHA256 08a346a885ba2e3a04ff2b5bcb21390161490639354577ffe3b4322697a583e0
MD5 a2abaa39f40136ca665996f72c52e059
BLAKE2b-256 18c94916a7f24255d2bc3b5f9abadbdad43954cf1f15c4209095eece3e1c5f11

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page