
Text data synthesis and pseudo labelling using LLMs

🦠 Mutate

A library to synthesize text datasets using Large Language Models (LLMs). Mutate reads through the examples in a dataset and generates similar examples using automatically generated few-shot prompts.
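To make the idea of a few-shot prompt concrete, here is a minimal sketch of how such a prompt could be assembled from labelled examples. The helper `build_few_shot_prompt` and its prompt layout are assumptions made for illustration; Mutate's actual prompt template may differ.

```python
import random


def build_few_shot_prompt(examples, task_desc, target_label,
                          text_alias="Comment", label_alias="sentiment",
                          shot_count=5, seed=0):
    """Assemble a few-shot prompt ending with an open slot for a new example.

    Hypothetical helper for illustration only; not part of Mutate's API.
    """
    rng = random.Random(seed)
    # Use only examples of the class we want the model to imitate.
    pool = [ex for ex in examples if ex["label"] == target_label]
    shots = rng.sample(pool, min(shot_count, len(pool)))

    lines = [task_desc, ""]
    for ex in shots:
        lines.append(f"{label_alias}: {ex['label']}")
        lines.append(f"{text_alias}: {ex['text']}")
        lines.append("")
    # Leave the final text slot empty so a causal LM continues from here.
    lines.append(f"{label_alias}: {target_label}")
    lines.append(f"{text_alias}:")
    return "\n".join(lines)


examples = [
    {"text": "A dull, lifeless film.", "label": "neg"},
    {"text": "Wonderful acting and a clever plot.", "label": "pos"},
    {"text": "I walked out halfway through.", "label": "neg"},
]
prompt = build_few_shot_prompt(
    examples,
    "Each item contains a movie review and its sentiment.",
    target_label="neg",
    shot_count=2,
)
print(prompt)
```

The prompt ends with the label already filled in and the text left open, so whatever the model writes next can be read as a new example of that class.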

1. Installation

pip install mutate-nlp

or

pip install git+https://github.com/infinitylogesh/mutate

2. Usage

2.1 Synthesize text data from local csv files

from mutate import pipeline

pipe = pipeline("text-classification-synthesis",
                model="EleutherAI/gpt-neo-125M",
                device=1)

task_desc = "Each item in the following contains movie reviews and corresponding sentiments. Possible sentiments are neg and pos"


# returns a python generator
text_synth_gen = pipe("csv",
                    data_files=["local/path/sentiment_classfication.csv"],
                    task_desc=task_desc,
                    text_column="text",
                    label_column="label",
                    text_column_alias="Comment",
                    label_column_alias="sentiment",
                    shot_count=5,
                    class_names=["pos","neg"])

# Loop through the generator to synthesize examples by class
for synthesized_examples in text_synth_gen:
    print(synthesized_examples)

Output:
{
    "text": ["The story was very dull and was a waste of my time. This was not a film I would ever watch. The acting was bad. I was bored. There were no surprises. They showed one dinosaur,",
    "I did not like this film. It was a slow and boring film, it didn't seem to have any plot, there was nothing to it. The only good part was the ending, I just felt that the film should have ended more abruptly."],
    "label":["neg","neg"]
}

{
    "text":["The Bell witch is one of the most interesting, yet disturbing films of recent years. It’s an odd and unique look at a very real, but very dark issue. With its mixture of horror, fantasy and fantasy adventure, this film is as much a horror film as a fantasy film. And it‘s worth your time. While the movie has its flaws, it is worth watching and if you are a fan of a good fantasy or horror story, you will not be disappointed."],
    "label":["pos"]
}

# and so on .....
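Each item yielded by the generator is a batch with parallel `text` and `label` lists, as in the output above. A minimal sketch of flattening such batches into rows and writing them out as CSV follows; the helpers `flatten_batches` and `write_rows` are hypothetical names, and the sample batches are stand-ins rather than real model output.

```python
import csv
import io


def flatten_batches(batches):
    """Yield one (text, label) pair per synthesized example."""
    for batch in batches:
        for text, label in zip(batch["text"], batch["label"]):
            yield text, label


def write_rows(rows, fileobj, text_column="text", label_column="label"):
    """Write rows as CSV with a header line; return the number of rows written."""
    writer = csv.writer(fileobj)
    writer.writerow([text_column, label_column])
    count = 0
    for text, label in rows:
        writer.writerow([text, label])
        count += 1
    return count


# Stand-in batches with the same shape as the generator output.
sample_batches = [
    {"text": ["The acting was bad.", "I was bored."], "label": ["neg", "neg"]},
    {"text": ["Worth your time."], "label": ["pos"]},
]

buffer = io.StringIO()  # swap for open("synthesized.csv", "w", newline="") to write a file
written = write_rows(flatten_batches(sample_batches), buffer)
print(written)
print(buffer.getvalue())
```

In practice, `sample_batches` would be replaced by the generator returned from the pipeline, so rows are written as they are produced.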

2.2 Synthesize text data from 🤗 datasets

Under the hood, Mutate uses the wonderful 🤗 datasets library for dataset processing, so it supports 🤗 datasets out of the box.

from mutate import pipeline

pipe = pipeline("text-classification-synthesis",
                model="EleutherAI/gpt-neo-2.7B",
                device=1)

task_desc = "Each item in the following contains customer service queries expressing the mentioned intent"

synthesizerGen = pipe("banking77",
                    task_desc=task_desc,
                    text_column="text",
                    label_column="label",
                    # use an alias if the `text_column` name isn't meaningful
                    text_column_alias="Queries",
                    # use an alias if the `label_column` name isn't meaningful
                    label_column_alias="Intent",
                    shot_count=5,
                    dataset_args=["en"])


for exp in synthesizerGen:
    print(exp)

Output:
{"text":["How can i know if my account has been activated? (This is the one that I am confused about)",
         "Thanks! My card activated"],
"label":["activate_my_card",
         "activate_my_card"]
}

{
"text": ["How do i activate this new one? Is it possible?",
         "what is the activation process for this card?"],
"label":["activate_my_card",
         "activate_my_card"]
}

# and so on .....

2.3 I am feeling lucky: loop through the dataset infinitely to generate examples indefinitely

Caution: Looping through the dataset infinitely increases the chance of generating duplicate examples.

from mutate import pipeline

pipe = pipeline("text-classification-synthesis",
                model="EleutherAI/gpt-neo-2.7B",
                device=1)

task_desc = "Each item in the following contains movie reviews and corresponding sentiments. Possible sentiments are neg and pos"


# returns a python generator
text_synth_gen = pipe("csv",
                    data_files=["local/path/sentiment_classfication.csv"],
                    task_desc=task_desc,
                    text_column="text",
                    label_column="label",
                    text_column_alias="Comment",
                    label_column_alias="sentiment",
                    class_names=["pos","neg"],
                    # Flag to generate indefinite examples
                    infinite_loop=True)

# Infinite loop: interrupt it or break out once you have enough examples
for exp in text_synth_gen:
    print(exp)
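Because an infinite generator never stops by itself and may repeat examples, it can help to cap the run and drop duplicates while consuming it. The sketch below is one way to do that; `unique_examples` is a hypothetical helper, the normalization (lowercasing and collapsing whitespace) is an assumption about what counts as a duplicate, and `fake_infinite` stands in for the real generator.

```python
from itertools import cycle, islice


def unique_examples(batches, max_examples, max_batches=1000):
    """Collect up to `max_examples` distinct (text, label) pairs.

    `max_batches` bounds how many batches are read, so the call returns
    even if the stream only produces duplicates.
    """
    seen = set()

    def _stream():
        for batch in islice(batches, max_batches):
            for text, label in zip(batch["text"], batch["label"]):
                # Treat texts that differ only in case or spacing as duplicates.
                key = " ".join(text.lower().split())
                if key not in seen:
                    seen.add(key)
                    yield text, label

    return list(islice(_stream(), max_examples))


# Stand-in for an infinite generator that keeps repeating itself.
fake_infinite = cycle([
    {"text": ["I was bored.", "Worth your time."], "label": ["neg", "pos"]},
    {"text": ["i was  bored.", "A clever plot."], "label": ["neg", "pos"]},
])

result = unique_examples(fake_infinite, max_examples=5, max_batches=10)
print(result)
```

Exact-match deduplication only catches near-identical strings; paraphrased duplicates would need a fuzzier comparison.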

3. Support

3.1 Currently supports

  • Text classification dataset synthesis: few-shot text data synthesis for text classification datasets using causal LLMs (GPT-like)

3.2 Roadmap:

  • Other types of text dataset synthesis: NER, sentence pairs, etc.
  • Finetuning support for better quality generation
  • Pseudo labelling

4. Credit

5. References

The idea of generating examples from large language models is inspired by the works below,
