Skip to main content

Python utilities used by Deep Procedural Intelligence

Project description

DPU Utilities PyPI - Python Version

Build Status

This contains a set of utilities used across projects of the DPU team.

Python

Stored in the python subdirectory, published as the dpu-utils package.

Installation

pip install dpu-utils

Overview

Below you can find an overview of the utilities included. Detailed documentation is provided at the docstring of each class.

Generic Utilities dpu_utils.utils
  • ChunkWriter provides a convenient API for writing output in multiple parts (chunks).
  • RichPath an API that abstract local and Azure Blob paths in your code.
  • *Iterator Wrappers that can parallelize and shuffle iterators.
  • {load,save}_json[l]_gz convenience API for loading and writing .json[l].gz files.
  • git_tag_run tags the current working directory git the state of the code.
  • run_and_debug when an exception happens, start a debug session. Usually a wrapper of __main__.
General Machine Learning Utilities dpu_utils.mlutils
  • Vocabulary map elements into unique integer ids and back. Commonly used in machine learning models that work over discrete data (e.g. words in NLP). Contains methods for converting an list of tokens into their "tensorized" for of integer ids.
  • BpeVocabulary a vocabulary for machine learning models that employs BPE (via sentencepiece).
  • CharTensorizer convert character sequences into into tensors, commonly used in machine learning models whose input is a list of characters.
Code-related Utilities dpu_utils.codeutils
TensorFlow 1.x Utilities dpu_utils.tfutils

Unsorted segment operations following TensorFlow's unsorted_segment_sum operations:

TensorFlow 2.x Utilities dpu_utils.tf2utils

Unsorted segment operations following TensorFlow's unsorted_segment_sum operations:

TensorFlow Models dpu_utils.tfmodels

These models have not been tested with TF 2.0.

PyTorch Utilities dpu_utils.ptutils
  • BaseComponent a wrapper abstract class around nn.Module that takes care of essential elements of most neural network components.
  • ComponentTrainer a training loop for BaseComponents.

Command-line tools

Approximate Duplicate Code Detection

You can use the deduplicationcli command to detect duplicates in pre-processed source code, by invoking

deduplicationcli DATA_PATH OUT_JSON

where DATA_PATH is a file containing tokenized .jsonl.gz files and OUT_JSON is the target output file. For more options look at --help.

An exact (but usually slower) version of this can be found here along with code to tokenize Java, C#, Python and JavaScript into the relevant formats.

Tests

Run the unit tests

python setup.py test

Generate code coverage reports

# pip install coverage
coverage run --source dpu_utils/ setup.py test && \
  coverage html

The resulting HTML file will be in htmlcov/index.html.

.NET

Stored in the dotnet subdirectory.

Generic Utilities:

  • Microsoft.Research.DPU.Utils.RichPath: a convenient way of using both paths and Azure paths in your code.

Code-related Utilities:

  • Microsoft.Research.DPU.CSharpSourceGraphExtraction: infrastructure to extract Program Graphs from C# projects.

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.microsoft.com.

When you submit a pull request, a CLA-bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., label, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

dpu_utils-0.3.2.tar.gz (51.6 kB view details)

Uploaded Source

Built Distribution

dpu_utils-0.3.2-py2.py3-none-any.whl (67.6 kB view details)

Uploaded Python 2 Python 3

File details

Details for the file dpu_utils-0.3.2.tar.gz.

File metadata

  • Download URL: dpu_utils-0.3.2.tar.gz
  • Upload date:
  • Size: 51.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.3.0 pkginfo/1.7.0 requests/2.25.1 setuptools/47.1.0 requests-toolbelt/0.9.1 tqdm/4.58.0 CPython/3.7.9

File hashes

Hashes for dpu_utils-0.3.2.tar.gz
Algorithm Hash digest
SHA256 f1dff826879971e6ef27ea296216146035daf10f317b272dbf9f17d654c195a9
MD5 9b3518cc962d2a50ae5a1493410df296
BLAKE2b-256 dbe43dc0a48db6e1612444ee183284b3d511cad18d626a69e715adbeee05aa6f

See more details on using hashes here.

File details

Details for the file dpu_utils-0.3.2-py2.py3-none-any.whl.

File metadata

  • Download URL: dpu_utils-0.3.2-py2.py3-none-any.whl
  • Upload date:
  • Size: 67.6 kB
  • Tags: Python 2, Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.3.0 pkginfo/1.7.0 requests/2.25.1 setuptools/47.1.0 requests-toolbelt/0.9.1 tqdm/4.58.0 CPython/3.7.9

File hashes

Hashes for dpu_utils-0.3.2-py2.py3-none-any.whl
Algorithm Hash digest
SHA256 0f73c8fae329db3bbb718a578bbd5c5f05f70ac19d53ce3f638b2538659e6b98
MD5 5f2e80651a522919b089c28204096f09
BLAKE2b-256 1d09421408281c5556a8b50c6c6e4ef78cf83237cd034dc113079542b9bf1259

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page