Skip to main content

Have you every struggled with needing a Spacy TextCategorizer but didn't have the time to train one from scratch? Classy Classification is the way to go!

Project description

Classy Classification

Have you ever struggled with needing a Spacy TextCategorizer but didn't have the time to train one from scratch? Classy Classification is the way to go! For few-shot classification using sentence-transformers or spaCy models, provide a dictionary with labels and examples, or just provide a list of labels for zero shot-classification with Hugginface zero-shot classifiers.

Current Release Version pypi Version PyPi downloads Code style: black

Install

pip install classy-classification

Or, install with faster inference using ONNX.

pip install classy-classification[onnx]

SetFit support

I got a lot of requests for SetFit support, but I decided to create a separate package for this. Feel free to check it out. ❤️

ONNX issues

pickling

ONNX does show some issues when pickling the data.

M1

Some installation issues might occur, which can be fixed by these commands.

brew install cmake
brew install protobuf
pip3 install onnx --no-use-pep517

Quickstart

SpaCy embeddings

import spacy
# or import standalone
# from classy_classification import ClassyClassifier

data = {
    "furniture": ["This text is about chairs.",
               "Couches, benches and televisions.",
               "I really need to get a new sofa."],
    "kitchen": ["There also exist things like fridges.",
                "I hope to be getting a new stove today.",
                "Do you also have some ovens."]
}

nlp = spacy.load("en_core_web_trf")
nlp.add_pipe(
    "text_categorizer",
    config={
        "data": data,
        "model": "spacy"
    }
)

print(nlp("I am looking for kitchen appliances.")._.cats)

# Output:
#
# [{"furniture" : 0.21}, {"kitchen": 0.79}]

Sentence level classification

import spacy

data = {
    "furniture": ["This text is about chairs.",
               "Couches, benches and televisions.",
               "I really need to get a new sofa."],
    "kitchen": ["There also exist things like fridges.",
                "I hope to be getting a new stove today.",
                "Do you also have some ovens."]
}

nlp.add_pipe(
    "text_categorizer",
    config={
        "data": data,
        "model": "spacy",
        "include_sent": True
    }
)

print(nlp("I am looking for kitchen appliances. And I love doing so.").sents[0]._.cats)

# Output:
#
# [[{"furniture" : 0.21}, {"kitchen": 0.79}]

Define random seed and verbosity

nlp.add_pipe(
    "text_categorizer",
    config={
        "data": data,
        "verbose": True,
        "config": {"seed": 42}
    }
)

Multi-label classification

Sometimes multiple labels are necessary to fully describe the contents of a text. In that case, we want to make use of the multi-label implementation, here the sum of label scores is not limited to 1. Just pass the same training data to multiple keys.

import spacy

data = {
    "furniture": ["This text is about chairs.",
               "Couches, benches and televisions.",
               "I really need to get a new sofa.",
               "We have a new dinner table.",
               "There also exist things like fridges.",
                "I hope to be getting a new stove today.",
                "Do you also have some ovens.",
                "We have a new dinner table."],
    "kitchen": ["There also exist things like fridges.",
                "I hope to be getting a new stove today.",
                "Do you also have some ovens.",
                "We have a new dinner table.",
                "There also exist things like fridges.",
                "I hope to be getting a new stove today.",
                "Do you also have some ovens.",
                "We have a new dinner table."]
}

nlp = spacy.load("en_core_web_md")
nlp.add_pipe(
    "text_categorizer",
    config={
        "data": data,
        "model": "spacy",
        "multi_label": True,
    }
)

print(nlp("I am looking for furniture and kitchen equipment.")._.cats)

# Output:
#
# [{"furniture": 0.92}, {"kitchen": 0.91}]

Outlier detection

Sometimes it is worth to be able to do outlier detection or binary classification. This can either be approached using a binary training dataset, however, I have also implemented support for a OneClassSVM for outlier detection using a single label. Not that this method does not return probabilities, but that the data is formatted like label-score value pair to ensure uniformity.

Approach 1:

import spacy

data_binary = {
    "inlier": ["This text is about chairs.",
               "Couches, benches and televisions.",
               "I really need to get a new sofa."],
    "outlier": ["Text about kitchen equipment",
                "This text is about politics",
                "Comments about AI and stuff."]
}

nlp = spacy.load("en_core_web_md")
nlp.add_pipe(
    "text_categorizer",
    config={
        "data": data_binary,
    }
)

print(nlp("This text is a random text")._.cats)

# Output:
#
# [{'inlier': 0.2926672385488411, 'outlier': 0.707332761451159}]

Approach 2:

import spacy

data_singular = {
    "furniture": ["This text is about chairs.",
               "Couches, benches and televisions.",
               "I really need to get a new sofa.",
               "We have a new dinner table."]
}
nlp = spacy.load("en_core_web_md")
nlp.add_pipe(
    "text_categorizer",
    config={
        "data": data_singular,
    }
)

print(nlp("This text is a random text")._.cats)

# Output:
#
# [{'furniture': 0, 'not_furniture': 1}]

Sentence-transfomer embeddings

import spacy

data = {
    "furniture": ["This text is about chairs.",
               "Couches, benches and televisions.",
               "I really need to get a new sofa."],
    "kitchen": ["There also exist things like fridges.",
                "I hope to be getting a new stove today.",
                "Do you also have some ovens."]
}

nlp = spacy.blank("en")
nlp.add_pipe(
    "text_categorizer",
    config={
        "data": data,
        "model": "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2",
        "device": "gpu"
    }
)

print(nlp("I am looking for kitchen appliances.")._.cats)

# Output:
#
# [{"furniture": 0.21}, {"kitchen": 0.79}]

Hugginface zero-shot classifiers

import spacy

data = ["furniture", "kitchen"]

nlp = spacy.blank("en")
nlp.add_pipe(
    "text_categorizer",
    config={
        "data": data,
        "model": "typeform/distilbert-base-uncased-mnli",
        "cat_type": "zero",
        "device": "gpu"
    }
)

print(nlp("I am looking for kitchen appliances.")._.cats)

# Output:
#
# [{"furniture": 0.21}, {"kitchen": 0.79}]

Credits

Inspiration Drawn From

Huggingface does offer some nice models for few/zero-shot classification, but these are not tailored to multi-lingual approaches. Rasa NLU has a nice approach for this, but its too embedded in their codebase for easy usage outside of Rasa/chatbots. Additionally, it made sense to integrate sentence-transformers and Hugginface zero-shot, instead of default word embeddings. Finally, I decided to integrate with Spacy, since training a custom Spacy TextCategorizer seems like a lot of hassle if you want something quick and dirty.

Or buy me a coffee

"Buy Me A Coffee"

Standalone usage without spaCy

from classy_classification import ClassyClassifier

data = {
    "furniture": ["This text is about chairs.",
               "Couches, benches and televisions.",
               "I really need to get a new sofa."],
    "kitchen": ["There also exist things like fridges.",
                "I hope to be getting a new stove today.",
                "Do you also have some ovens."]
}

classifier = ClassyClassifier(data=data)
classifier("I am looking for kitchen appliances.")
classifier.pipe(["I am looking for kitchen appliances."])

# overwrite training data
classifier.set_training_data(data=data)
classifier("I am looking for kitchen appliances.")

# overwrite [embedding model](https://www.sbert.net/docs/pretrained_models.html)
classifier.set_embedding_model(model="paraphrase-MiniLM-L3-v2")
classifier("I am looking for kitchen appliances.")

# overwrite SVC config
classifier.set_classification_model(
    config={
        "C": [1, 2, 5, 10, 20, 100],
        "kernel": ["linear"],
        "max_cross_validation_folds": 5
    }
)
classifier("I am looking for kitchen appliances.")

Save and load models

data = {
    "furniture": ["This text is about chairs.",
               "Couches, benches and televisions.",
               "I really need to get a new sofa."],
    "kitchen": ["There also exist things like fridges.",
                "I hope to be getting a new stove today.",
                "Do you also have some ovens."]
}
classifier = classyClassifier(data=data)

with open("./classifier.pkl", "wb") as f:
    pickle.dump(classifier, f)

f = open("./classifier.pkl", "rb")
classifier = pickle.load(f)
classifier("I am looking for kitchen appliances.")

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

classy-classification-0.6.5.tar.gz (15.1 kB view details)

Uploaded Source

Built Distribution

classy_classification-0.6.5-py3-none-any.whl (15.7 kB view details)

Uploaded Python 3

File details

Details for the file classy-classification-0.6.5.tar.gz.

File metadata

  • Download URL: classy-classification-0.6.5.tar.gz
  • Upload date:
  • Size: 15.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.2.1 CPython/3.10.11 Darwin/22.5.0

File hashes

Hashes for classy-classification-0.6.5.tar.gz
Algorithm Hash digest
SHA256 99b7500eade6d50f00d3f06b3daf9d0d7c081caee96b4f71aac2f4a82b0415e8
MD5 03194fb991e5eaf8ee9d43ef534df881
BLAKE2b-256 2e3a4c6c818c0d2601a00bb403d1aa8f2168e3e37042b8c02eaa076c46c3c09e

See more details on using hashes here.

File details

Details for the file classy_classification-0.6.5-py3-none-any.whl.

File metadata

File hashes

Hashes for classy_classification-0.6.5-py3-none-any.whl
Algorithm Hash digest
SHA256 40b779b63acbade71e66865945a82ba665a47a4604bafe687d2623f81de5953f
MD5 916fdb098d67ee8800985156ab2547dc
BLAKE2b-256 82cfb5de12b5d42201d876ea8f06ba5f666004975d43e60968c3d974cd30aecb

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page