
spacy-setfit

This repository provides an easy and intuitive way to use SetFit in combination with spaCy.

Installation

Before using spaCy with SetFit, make sure you have the necessary packages installed. You can install them using pip:

pip install spacy spacy-setfit

Additionally, you will need to download a spaCy model, for example:

python -m spacy download en_core_web_sm

Getting Started

To use spaCy with SetFit, use the following code:

import spacy
import spacy_setfit

# Create some example data
train_dataset = {
    "inlier": ["This text is about chairs.",
               "Couches, benches and televisions.",
               "I really need to get a new sofa."],
    "outlier": ["Text about kitchen equipment",
                "This text is about politics",
                "Comments about AI and stuff."]
}

# Load the spaCy language model:
nlp = spacy.load("en_core_web_sm")

# Add the "text_categorizer" pipeline component to the spaCy model, and configure it with SetFit parameters:
nlp.add_pipe("text_categorizer", config={
    "pretrained_model_name_or_path": "paraphrase-MiniLM-L3-v2",
    "setfit_trainer_args": {
        "train_dataset": train_dataset
    }
})
doc = nlp("I really need to get a new sofa.")
doc.cats
# {'inlier': 0.902350975129, 'outlier': 0.097649024871}

The code above processes the input text with the spaCy pipeline; the doc.cats attribute contains the predicted categories and their associated probabilities.
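
If you only need the top prediction, you can select the highest-scoring label from doc.cats with plain Python, for example:

# Pick the label with the highest predicted probability
doc = nlp("I really need to get a new sofa.")
best_label = max(doc.cats, key=doc.cats.get)
print(best_label, doc.cats[best_label])
# inlier 0.902350975129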

That's it! You have now successfully integrated spaCy with SetFit for text categorization tasks. You can further customize and train the model using additional data or adjust the SetFit parameters as needed.

Feel free to explore more features and documentation of spaCy and SetFit to enhance your text classification projects.

setfit_trainer_args

The setfit_trainer_args are a simplified version of the official args from the SetFit library.

Arguments

  • train_dataset (Union[dict, Dataset]): The training dataset to be used by the SetFitTrainer. It can be either a dictionary or a Dataset object.

  • eval_dataset (Union[dict, Dataset], optional): The evaluation dataset to be used by the SetFitTrainer. It can be either a dictionary or a Dataset object. Defaults to None.

  • num_iterations (int, optional): The number of iterations to train the model. Defaults to 20.

  • num_epochs (int, optional): The number of epochs to train the model. Defaults to 1.

  • learning_rate (float, optional): The learning rate for the optimizer. Defaults to 2e-5.

  • batch_size (int, optional): The batch size for training. Defaults to 16.

  • seed (int, optional): The random seed for reproducibility. Defaults to 42.

  • column_mapping (dict, optional): A mapping dictionary that specifies how to map input columns to model inputs. Defaults to None.

  • use_amp (bool, optional): Whether to use Automatic Mixed Precision (AMP) for training. Defaults to False.

Note that this is only an overview of the arguments and their purpose. For more detailed information and usage examples, refer to the official SetFit library documentation.

Usage

To use the setfit_trainer_args, you can create a dictionary with the desired values for the arguments. Here's an example:

setfit_trainer_args = {
    "train_dataset": train_data,   # your training data (dict or Dataset)
    "eval_dataset": eval_data,     # optional evaluation data (dict or Dataset)
    "num_iterations": 20,
    "num_epochs": 1,
    "learning_rate": 2e-5,
    "batch_size": 16,
    "seed": 42,
    "column_mapping": column_map,  # e.g. {"sentence": "text", "label": "label"}
    "use_amp": False
}
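
You can then pass this dictionary to the text_categorizer component when adding it to the pipeline, as in the Getting Started example:

nlp.add_pipe("text_categorizer", config={
    "pretrained_model_name_or_path": "paraphrase-MiniLM-L3-v2",
    "setfit_trainer_args": setfit_trainer_args
})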

setfit_from_pretrained_args

The setfit_from_pretrained_args are a simplified version of the official args from the SetFit library and Hugging Face transformers.

Arguments

  • pretrained_model_name_or_path (str or Path): This argument specifies the model to be loaded. It can be either:

    • The model_id (string) of a model hosted on the Hugging Face Model Hub, e.g., bigscience/bloom.
    • A path to a directory containing model weights saved using the save_pretrained method of PreTrainedModel, e.g., ../path/to/my_model_directory/.
  • revision (str, optional): The revision of the model on the Hub. It can be a branch name, a git tag, or any commit id. Defaults to the latest commit on the main branch.

  • force_download (bool, optional): Whether to force (re-)downloading the model weights and configuration files from the Hub, overriding the existing cache. Defaults to False.

  • resume_download (bool, optional): Whether to delete incompletely received files and attempt to resume the download if such a file exists. Defaults to False.

  • proxies (Dict[str, str], optional): A dictionary of proxy servers to use by protocol or endpoint. It is used for requests made during the downloading process. For example: proxies = {'http': 'foo.bar:3128', 'http://hostname': 'foo.bar:4012'}

  • token (str or bool, optional): The token to use as HTTP bearer authorization for remote files. By default, it uses the token cached when running huggingface-cli login.

  • cache_dir (str or Path, optional): The path to the folder where cached files are stored.

  • local_files_only (bool, optional): If True, it avoids downloading the file and returns the path to the local cached file if it exists. Defaults to False.

  • model_kwargs (Dict, optional): Additional keyword arguments to pass to the model during initialization.

Note that this is only an overview of the arguments and their purpose. For more detailed information and usage examples, refer to the official SetFit and Hugging Face transformers documentation.

Usage

To use the setfit_from_pretrained_args, you can create a dictionary with the desired values for the arguments. Here's an example:

setfit_from_pretrained_args = {
    'pretrained_model_name_or_path': '',  # str or Path
    'revision': None,  # str, optional
    'force_download': False,  # bool, optional
    'resume_download': False,  # bool, optional
    'proxies': None,  # Dict[str, str], optional
    'token': None,  # str or bool, optional
    'cache_dir': None,  # str or Path, optional
    'local_files_only': False,  # bool, optional
    'model_kwargs': None  # Dict, optional
}
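
Like the trainer arguments, these can be passed to the text_categorizer component via its config. A minimal sketch, assuming the component accepts a setfit_from_pretrained_args key alongside setfit_trainer_args, with the model path itself set via the top-level pretrained_model_name_or_path key as in Getting Started:

nlp.add_pipe("text_categorizer", config={
    "pretrained_model_name_or_path": "paraphrase-MiniLM-L3-v2",
    # assumed config key, mirroring setfit_trainer_args above
    "setfit_from_pretrained_args": {"cache_dir": None, "local_files_only": False},
    "setfit_trainer_args": {"train_dataset": train_dataset}
})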

Saving and Loading models

Saving and Loading with Pickle

You can use the pickle module in Python to save and load instances of the pre-trained pipeline. pickle allows you to serialize Python objects, including custom classes, into a binary format that can be saved to a file and loaded back into memory later. Here's an example of how to save and load using pickle:

import pickle

nlp = ...  # the trained pipeline from the Getting Started example

# Save nlp pipeline
with open("my_cool_model.pkl", "wb") as file:
    pickle.dump(nlp, file)

# Load nlp pipeline
with open("my_cool_model.pkl", "rb") as file:
    nlp = pickle.load(file)

doc = nlp("I really need to get a new sofa.")
doc.cats
# {'inlier': 0.902350975129, 'outlier': 0.097649024871}

