Python package to sanitize ML-related labels in a standard way.
Sanitize ML Labels
Sanitize ML Labels is a Python package designed to standardize and sanitize ML-related labels. It currently supports over 100 labels, including metric and model names.
If you work with ML-related labels and find yourself repeatedly renaming them to get consistent wording and proper capitalization, this package ensures they are always sanitized in a standard way.
How do I install this package?
You can install it using pip:
pip install sanitize_ml_labels
Usage examples
Here are some common use cases for normalizing labels:
Example for metrics
from sanitize_ml_labels import sanitize_ml_labels
labels = [
    "acc",
    "loss",
    "auroc",
    "lr"
]
assert sanitize_ml_labels(labels) == [
    "Accuracy",
    "Loss",
    "AUROC",
    "Learning rate"
]
Example for models
from sanitize_ml_labels import sanitize_ml_labels
labels = [
    "mlp",
    "cnn",
    "ffNN",
    "Feed-forward neural network",
    "perceptron",
    "recurrent neural network",
    "LStM"
]
assert sanitize_ml_labels(labels) == [
    "MLP",
    "CNN",
    "FFNN",
    "FFNN",
    "Perceptron",
    "RNN",
    "LSTM"
]
assert sanitize_ml_labels("vanilla mlp") == "MLP"
assert sanitize_ml_labels("vanilla cnn") == "CNN"
assert sanitize_ml_labels([
    "Large Language Model",
    "transe",
    "Generative Pre-trained Transformer",
    "Graph Convolutional Neural Network",
    "Convolutional Graph Neural Network",
    "Graph Neural Network",
    "Graph Attention Network",
    "Graph Attention Neural Network",
]) == ["LLM", "TransE", "GPT", "GCN", "GCN", "GNN", "GAT", "GAT"]
If you have prefixed your model names with "vanilla", "simple", or "basic", the package removes these prefixes for you.
from sanitize_ml_labels import sanitize_ml_labels
labels = [
    "vanilla mlp",
    "vanilla cnn",
    "vanilla ffnn",
    "vanilla perceptron"
]
assert sanitize_ml_labels(labels) == ["MLP", "CNN", "FFNN", "Perceptron"]
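To illustrate the idea, here is a minimal sketch of prefix stripping followed by an alias lookup. The names `PREFIXES`, `ALIASES`, and `strip_prefix_and_normalize` are hypothetical and exist only for this example; the alias table is a tiny subset and the sketch is not the package's actual implementation.

```python
# Hypothetical, simplified illustration; not the package's real code.
PREFIXES = ("vanilla", "simple", "basic")

# Tiny illustrative alias table (the real package supports over 100 labels).
ALIASES = {
    "mlp": "MLP",
    "cnn": "CNN",
    "ffnn": "FFNN",
    "perceptron": "Perceptron",
}


def strip_prefix_and_normalize(label: str) -> str:
    """Drop a leading decorative prefix, then map the label to its standard form."""
    words = label.strip().split()
    # Only strip the prefix when at least one more word follows it.
    if len(words) > 1 and words[0].lower() in PREFIXES:
        words = words[1:]
    remainder = " ".join(words)
    # Unknown labels fall back to plain sentence-style capitalization.
    return ALIASES.get(remainder.lower(), remainder.capitalize())
```

In practice you would call `sanitize_ml_labels` directly; the sketch only shows why "vanilla mlp" and "mlp" end up with the same label.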
Corner cases
Sometimes, you might encounter hyphenated terms that need to be correctly identified and normalized. We use a heuristic approach based on an extended list of over 45K hyphenated English words, originally from the Metadata consulting website.
The lookup heuristic, written by Tommaso Fontana, ensures efficient and accurate hyphenated word recognition.
from sanitize_ml_labels import sanitize_ml_labels
# The known word "non-existent" keeps its hyphen; the other hyphens become spaces
assert sanitize_ml_labels("non-existent-edges-in-graph") == "Non-existent edges in graph"
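The behaviour above can be approximated with a small lookup-based sketch. It assumes a hypothetical set `KNOWN_HYPHENATED` standing in for the 45K-word list, and a hypothetical function `split_hyphens` that only checks two-part words; the package's real heuristic may work differently.

```python
# Hypothetical stand-in for the package's list of hyphenated English words.
KNOWN_HYPHENATED = {"non-existent", "feed-forward", "pre-trained"}


def split_hyphens(label: str) -> str:
    """Replace hyphens with spaces, except inside known two-part hyphenated words."""
    parts = label.split("-")
    words = []
    i = 0
    while i < len(parts):
        # Keep parts[i]-parts[i+1] joined when the pair is a known hyphenated word.
        if i + 1 < len(parts):
            candidate = f"{parts[i]}-{parts[i + 1]}"
            if candidate.lower() in KNOWN_HYPHENATED:
                words.append(candidate)
                i += 2
                continue
        words.append(parts[i])
        i += 1
    text = " ".join(words)
    # Capitalize only the first character, leaving the rest untouched.
    return text[:1].upper() + text[1:]
```

A set lookup keeps each check constant-time on average, so the cost grows with the number of hyphens in the label rather than with the size of the word list.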
Extra utilities
In addition to label sanitization, the package provides helpers to query the value range and optimization direction of a metric:
Is normalized metric
Checks whether a metric's values fall within the range [0, 1].
from sanitize_ml_labels import is_normalized_metric
assert not is_normalized_metric("MSE")
assert is_normalized_metric("acc")
assert is_normalized_metric("accuracy")
assert is_normalized_metric("AUROC")
assert is_normalized_metric("auprc")
Is absolutely normalized metric
Checks whether a metric's values fall within the range [-1, 1].
from sanitize_ml_labels import is_absolutely_normalized_metric
assert not is_absolutely_normalized_metric("auprc")
assert is_absolutely_normalized_metric("MCC")
assert is_absolutely_normalized_metric("Markedness")
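One possible use of these two checks is choosing fixed axis limits when plotting a metric. The sketch below is an assumption about how you might combine them: `axis_limits` is a hypothetical helper, and the two predicates are passed in as arguments so the example stays self-contained.

```python
from typing import Callable, Optional, Tuple


def axis_limits(
    metric: str,
    is_normalized_metric: Callable[[str], bool],
    is_absolutely_normalized_metric: Callable[[str], bool],
) -> Optional[Tuple[float, float]]:
    """Return fixed axis limits for a bounded metric, or None if it is unbounded."""
    if is_normalized_metric(metric):
        return (0.0, 1.0)
    if is_absolutely_normalized_metric(metric):
        return (-1.0, 1.0)
    # Unbounded metrics (for example MSE) are left to the plotting library.
    return None
```

With the package installed, you would pass `is_normalized_metric` and `is_absolutely_normalized_metric` imported from `sanitize_ml_labels` as the two predicate arguments.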
Should be maximized
Returns whether a metric should be maximized (True) or minimized (False). Unknown metrics raise a NotImplementedError.
from sanitize_ml_labels import should_be_maximized
assert not should_be_maximized("MSE")
assert should_be_maximized("AUROC")
assert should_be_maximized("accuracy")
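Because unknown metrics raise an exception, calling code may want to handle that case explicitly. The following sketch picks the best run for a metric; `pick_best_run` is a hypothetical helper, and the direction function is injected as a parameter so the example runs on its own.

```python
from typing import Callable, Dict, Optional


def pick_best_run(
    metric: str,
    scores: Dict[str, float],
    should_be_maximized: Callable[[str], bool],
) -> Optional[str]:
    """Return the name of the best run, or None if the metric direction is unknown."""
    try:
        maximize = should_be_maximized(metric)
    except NotImplementedError:
        # The metric is not known, so no direction can be assumed.
        return None
    chooser = max if maximize else min
    # Compare run names by their score for this metric.
    return chooser(scores, key=scores.get)
```

With the package installed, you would pass `should_be_maximized` imported from `sanitize_ml_labels` as the third argument.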
License
This software is licensed under the MIT license. See the LICENSE file for details.