
Python package to sanitize ML-related labels in a standard way.


Sanitize ML Labels


Sanitize ML Labels is a Python package designed to standardize and sanitize ML-related labels. It currently supports over 100 labels, including metric and model names.

If you keep renaming ML-related labels by hand to give them consistent names and proper capitalization, this package does it for you and ensures they are always sanitized in the same standard way.

How do I install this package?

You can install it using pip:

pip install sanitize_ml_labels

Usage examples

Here are some common use cases for normalizing labels:

Example for metrics

from sanitize_ml_labels import sanitize_ml_labels

labels = [
    "acc",
    "loss",
    "auroc",
    "lr"
]

assert sanitize_ml_labels(labels) == [
    "Accuracy",
    "Loss",
    "AUROC",
    "Learning rate"
]

Example for models

from sanitize_ml_labels import sanitize_ml_labels

labels = [
    "mlp",
    "cnn",
    "ffNN",
    "Feed-forward neural network",
    "perceptron",
    "recurrent neural network",
    "LStM"
]

assert sanitize_ml_labels(labels) == [
    "MLP",
    "CNN",
    "FFNN",
    "FFNN",
    "Perceptron",
    "RNN",
    "LSTM"
]

assert sanitize_ml_labels("vanilla mlp") == "MLP"
assert sanitize_ml_labels("vanilla cnn") == "CNN"

assert sanitize_ml_labels([
    "Large Language Model",
    "transe",
    "Generative Pre-trained Transformer",
    "Graph Convolutional Neural Network",
    "Convolutional Graph Neural Network",
    "Graph Neural Network",
    "Graph Attention Network",
    "Graph Attention Neural Network",
]) == ["LLM","TransE","GPT","GCN","GCN","GNN","GAT","GAT"]

Sometimes all your model names carry a generic prefix such as "vanilla", "simple" or "basic". This package removes these prefixes for you.

from sanitize_ml_labels import sanitize_ml_labels

labels = [
    "vanilla mlp",
    "vanilla cnn",
    "vanilla ffnn",
    "vanilla perceptron"
]

assert sanitize_ml_labels(labels) == ["MLP", "CNN", "FFNN", "Perceptron"]
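To make the behaviour above more concrete, here is a toy sketch of prefix stripping followed by a case-insensitive alias lookup. The names `TOY_ALIASES`, `TOY_PREFIXES`, `toy_sanitize` and `toy_sanitize_all` are hypothetical, and this is not the package's implementation, which covers far more labels and cases.

```python
from typing import List

# Hypothetical, tiny alias table: lowercase label -> canonical form.
TOY_ALIASES = {
    "mlp": "MLP",
    "cnn": "CNN",
    "ffnn": "FFNN",
    "feed-forward neural network": "FFNN",
    "perceptron": "Perceptron",
}

# Hypothetical list of generic prefixes to drop.
TOY_PREFIXES = ("vanilla ", "simple ", "basic ")


def toy_sanitize(label: str) -> str:
    """Strip one generic prefix, then map the rest to its canonical form."""
    key = label.strip().lower()
    for prefix in TOY_PREFIXES:
        if key.startswith(prefix):
            key = key[len(prefix):]
            break
    # Unknown labels fall back to sentence-style capitalization.
    return TOY_ALIASES.get(key, key.capitalize())


def toy_sanitize_all(labels: List[str]) -> List[str]:
    """Apply toy_sanitize to every label, preserving order."""
    return [toy_sanitize(label) for label in labels]
```

For real use, prefer `sanitize_ml_labels` itself; the sketch only illustrates why "vanilla mlp" and "mlp" end up with the same label.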

Corner cases

Sometimes, you might encounter hyphenated terms that need to be correctly identified and normalized. We use a heuristic approach based on an extended list of over 45K hyphenated English words, originally from the Metadata consulting website.

The lookup heuristic, written by Tommaso Fontana, ensures efficient and accurate hyphenated word recognition.

from sanitize_ml_labels import sanitize_ml_labels

# The hyphen in the known word "non-existent" is kept; the other hyphens become spaces.
assert sanitize_ml_labels("non-existent-edges-in-graph") == "Non-existent edges in graph"
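The idea behind such a lookup can be sketched as follows. This is a simplified illustration, not the heuristic shipped with the package: `KNOWN_HYPHENATED` is a hypothetical three-word stand-in for the 45K-word list, `toy_split_hyphens` is a hypothetical name, and only two-part hyphenated words are handled.

```python
from typing import Set

# Hypothetical stand-in for the large list of hyphenated English words.
KNOWN_HYPHENATED: Set[str] = {"non-existent", "feed-forward", "pre-trained"}


def toy_split_hyphens(label: str, known: Set[str] = KNOWN_HYPHENATED) -> str:
    """Turn hyphens into spaces unless they belong to a known hyphenated word."""
    parts = label.split("-")
    words = []
    i = 0
    while i < len(parts):
        # Greedily try to join the current part with the next one.
        if i + 1 < len(parts):
            candidate = f"{parts[i]}-{parts[i + 1]}"
            if candidate.lower() in known:
                words.append(candidate)
                i += 2
                continue
        words.append(parts[i])
        i += 1
    sentence = " ".join(words)
    # Capitalize only the first character and leave the rest untouched.
    return sentence[:1].upper() + sentence[1:]
```

A set gives constant-time membership checks, so the cost grows with the number of hyphens in the label rather than with the size of the word list.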

Extra utilities

In addition to label sanitization, the package provides methods to check metric normalization:

Is normalized metric

Checks whether a metric takes values in the range [0, 1].

from sanitize_ml_labels import is_normalized_metric

assert not is_normalized_metric("MSE")
assert is_normalized_metric("acc")
assert is_normalized_metric("accuracy")
assert is_normalized_metric("AUROC")
assert is_normalized_metric("auprc")

Is absolutely normalized metric

Checks whether a metric takes values in the range [-1, 1].

from sanitize_ml_labels import is_absolutely_normalized_metric

assert not is_absolutely_normalized_metric("auprc")
assert is_absolutely_normalized_metric("MCC")
assert is_absolutely_normalized_metric("Markedness")
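One possible use of these two checks is choosing fixed axis limits when plotting a metric. The helper below is a sketch under that assumption: `axis_limits` is a hypothetical name, and the two predicates are passed in as arguments so the snippet runs on its own.

```python
from typing import Callable, Optional, Tuple


def axis_limits(
    metric: str,
    is_normalized: Callable[[str], bool],
    is_absolutely_normalized: Callable[[str], bool],
) -> Optional[Tuple[float, float]]:
    """Return fixed y-axis limits for bounded metrics, or None to autoscale."""
    if is_normalized(metric):
        return (0.0, 1.0)
    if is_absolutely_normalized(metric):
        return (-1.0, 1.0)
    # Unbounded metric: let the plotting library pick the limits.
    return None
```

In practice you would pass `is_normalized_metric` and `is_absolutely_normalized_metric` from this package as the two callables.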

Should be maximized

Tells whether a metric should be maximized rather than minimized. Unknown metrics raise a NotImplementedError.

from sanitize_ml_labels import should_be_maximized

assert not should_be_maximized("MSE")
assert should_be_maximized("AUROC")
assert should_be_maximized("accuracy")
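A typical use is picking the best run for a given metric without hard-coding its direction. The following is a minimal sketch: `best_run` is a hypothetical helper, and the direction check is passed in as a callable so the snippet is self-contained.

```python
from typing import Callable, Dict, Tuple


def best_run(
    scores: Dict[str, float],
    metric: str,
    should_be_maximized: Callable[[str], bool],
) -> Tuple[str, float]:
    """Return the (run name, score) pair that is best for the given metric."""
    # Any error raised for an unknown metric propagates to the caller.
    pick = max if should_be_maximized(metric) else min
    name = pick(scores, key=scores.get)
    return name, scores[name]
```

Passing this package's `should_be_maximized` as the third argument selects the lowest score for "MSE" and the highest for "AUROC".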

License

This software is licensed under the MIT license. See the LICENSE file.


