Dataset Viber is your chill repo for data collection, annotation and vibe checks.

Dataset Viber

Avoid the hype, check the vibe!

I've cooked up Dataset Viber, a cool set of tools to make your life easier when dealing with data for AI models. Dataset Viber is all about making your data prep journey smooth and fun. It's not for team collaboration or production, nor trying to be all fancy and formal - just a bunch of cool tools to help you collect feedback and do vibe-checks as an AI engineer or lover. Want to see it in action? Just plug it in and start vibing with your data. It's that easy!

  • CollectorInterface: Lazily collect data of model interactions without human annotation.
  • AnnotatorInterface: Walk through your data and annotate it with models in the loop.
  • BulkInterface: Explore your data distribution and annotate in bulk.
  • Embedder: Efficiently embed data with ONNX-optimized speeds.

Need any tweaks or want to hear more about a specific tool? Just open an issue or give me a shout!

[!NOTE]

  • Data is logged to a local CSV or directly to the Hugging Face Hub.
  • All tools also run in .ipynb notebooks.
  • Models in the loop through fn_model.
  • Input data streamers through fn_next_input.
  • It supports various tasks for text, chat and image modalities.
  • Import and export from the Hugging Face Hub or CSV files.

[!TIP] Examples can be found in src/dataset_viber/examples.
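
As a rough sketch of what the two callables mentioned in the note could look like: the snippet below defines a toy `fn_model` that returns a `str` label and a `fn_next_input` that hands out the next text on each call. The names `keyword_sentiment_model` and `make_next_input`, the keyword rule, and the zero-argument call signature are assumptions made for illustration; check src/dataset_viber/examples for the signatures the interfaces actually expect.

```python
from typing import Callable, Iterable, Optional


def keyword_sentiment_model(text: str) -> str:
    """Toy stand-in for a model: returns a label as a plain string."""
    negative_words = {"terrible", "awful", "bad"}
    words = {word.strip(".,!?").lower() for word in text.split()}
    return "negative" if words & negative_words else "positive"


def make_next_input(texts: Iterable[str]) -> Callable[[], Optional[str]]:
    """Build a hypothetical zero-argument streamer over `texts`."""
    iterator = iter(texts)

    def fn_next_input() -> Optional[str]:
        # Return the next text, or None once the stream is exhausted.
        return next(iterator, None)

    return fn_next_input


texts = [
    "Anthony Bourdain was an amazing chef!",
    "Anthony Bourdain was a terrible tv persona!",
]
fn_next_input = make_next_input(texts)
first = fn_next_input()
print(first, "->", keyword_sentiment_model(first))
```

Keeping both callables free of any interface state makes them easy to try in a notebook before plugging them in.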

Installation

You can install the package via pip:

pip install dataset-viber

Or install BulkInterface dependencies:

pip install dataset-viber[bulk]

How are we vibing?

CollectorInterface

Built on top of gr.Interface and gr.ChatInterface to automatically and lazily collect data from model interactions.

https://github.com/user-attachments/assets/4ddac8a1-62ab-4b3b-9254-f924f5898075

Hub dataset

CollectorInterface
import gradio as gr
from dataset_viber import CollectorInterface

def calculator(num1, operation, num2):
    if operation == "add":
        return num1 + num2
    elif operation == "subtract":
        return num1 - num2
    elif operation == "multiply":
        return num1 * num2
    elif operation == "divide":
        return num1 / num2

inputs = ["number", gr.Radio(["add", "subtract", "multiply", "divide"]), "number"]
outputs = "number"

interface = CollectorInterface(
    fn=calculator,
    inputs=inputs,
    outputs=outputs,
    csv_logger=False, # True if you want to log to a CSV
    dataset_name="<my_hf_org>/<my_dataset>"
)
interface.launch()
CollectorInterface.from_interface
interface = gr.Interface(
    fn=calculator,
    inputs=inputs,
    outputs=outputs
)
interface = CollectorInterface.from_interface(
    interface=interface,
    csv_logger=False, # True if you want to log to a CSV
    dataset_name="<my_hf_org>/<my_dataset>"
)
interface.launch()
CollectorInterface.from_pipeline
from transformers import pipeline
from dataset_viber import CollectorInterface

# Use a separate name so the imported `pipeline` factory is not shadowed
pipe = pipeline("text-classification", model="mrm8488/bert-tiny-finetuned-sms-spam-detection")
interface = CollectorInterface.from_pipeline(
    pipeline=pipe,
    csv_logger=False, # True if you want to log to a CSV
    dataset_name="<my_hf_org>/<my_dataset>"
)
interface.launch()

AnnotatorInterface

Built on top of the CollectorInterface to collect and annotate data and log it to the Hub.

Text

https://github.com/user-attachments/assets/d1abda66-9972-4c60-89d2-7626f5654f15

Hub dataset

text-classification/multi-label-text-classification
from dataset_viber import AnnotatorInterFace

texts = [
    "Anthony Bourdain was an amazing chef!",
    "Anthony Bourdain was a terrible tv persona!"
]
labels = ["positive", "negative"]

interface = AnnotatorInterFace.for_text_classification(
    texts=texts,
    labels=labels,
    multi_label=False, # True if you have multi-label data
    fn_model=None, # a callable (e.g. a function or transformers pipeline) that returns `str`
    fn_next_input=None, # a function that feeds gradio components actively with the next input
    csv_logger=False, # True if you want to log to a CSV
    dataset_name=None # "<my_hf_org>/<my_dataset>" if you want to log to the hub
)
interface.launch()
token-classification
from dataset_viber import AnnotatorInterFace

texts = ["Anthony Bourdain was an amazing chef in New York."]
labels = ["NAME", "LOC"]

interface = AnnotatorInterFace.for_token_classification(
    texts=texts,
    labels=labels,
    fn_model=None, # a callable (e.g. a function or transformers pipeline) that returns `str`
    fn_next_input=None, # a function that feeds gradio components actively with the next input
    csv_logger=False, # True if you want to log to a CSV
    dataset_name=None # "<my_hf_org>/<my_dataset>" if you want to log to the hub
)
interface.launch()
extractive-question-answering
from dataset_viber import AnnotatorInterFace

questions = ["Where was Anthony Bourdain located?"]
contexts = ["Anthony Bourdain was an amazing chef in New York."]

interface = AnnotatorInterFace.for_question_answering(
    questions=questions,
    contexts=contexts,
    fn_model=None, # a callable (e.g. a function or transformers pipeline) that returns `str`
    fn_next_input=None, # a function that feeds gradio components actively with the next input
    csv_logger=False, # True if you want to log to a CSV
    dataset_name=None # "<my_hf_org>/<my_dataset>" if you want to log to the hub
)
interface.launch()
text-generation/translation/completion
from dataset_viber import AnnotatorInterFace

prompts = ["Tell me something about Anthony Bourdain."]
completions = ["Anthony Michael Bourdain was an American celebrity chef, author, and travel documentarian."]

interface = AnnotatorInterFace.for_text_generation(
    prompts=prompts, # source
    completions=completions, # optional to show initial completion / target
    fn_model=None, # a callable (e.g. a function or transformers pipeline) that returns `str`
    fn_next_input=None, # a function that feeds gradio components actively with the next input
    csv_logger=False, # True if you want to log to a CSV
    dataset_name=None # "<my_hf_org>/<my_dataset>" if you want to log to the hub
)
interface.launch()
text-generation-preference
from dataset_viber import AnnotatorInterFace

prompts = ["Tell me something about Anthony Bourdain."]
completions_a = ["Anthony Michael Bourdain was an American celebrity chef, author, and travel documentarian."]
completions_b = ["Anthony Michael Bourdain was a cool guy that knew how to cook."]

interface = AnnotatorInterFace.for_text_generation_preference(
    prompts=prompts,
    completions_a=completions_a,
    completions_b=completions_b,
    fn_model=None, # a callable (e.g. a function or transformers pipeline) that returns `str`
    fn_next_input=None, # a function that feeds gradio components actively with the next input
    csv_logger=False, # True if you want to log to a CSV
    dataset_name=None # "<my_hf_org>/<my_dataset>" if you want to log to the hub
)
interface.launch()

Chat and multi-modal chat

https://github.com/user-attachments/assets/fe7f0139-95a3-40e8-bc03-e37667d4f7a9

Hub dataset

[!TIP] I recommend uploading the files to cloud storage and using the remote URLs to avoid any issues. This can be done using Hugging Face Datasets, as shown in utils. Additionally, GradioChatbot shows how to use the chatbot interface for multi-modal data.

chat-classification
from dataset_viber import AnnotatorInterFace

prompts = [
    [
        {
            "role": "user",
            "content": "Tell me something about Anthony Bourdain."
        },
        {
            "role": "assistant",
            "content": "Anthony Michael Bourdain was an American celebrity chef, author, and travel documentarian."
        }
    ]
]

interface = AnnotatorInterFace.for_chat_classification(
    prompts=prompts,
    labels=["toxic", "non-toxic"],
    multi_label=False, # True if you have multi-label data
    fn_model=None, # a callable (e.g. a function or transformers pipeline) that returns `str`
    fn_next_input=None, # a function that feeds gradio components actively with the next input
    csv_logger=False, # True if you want to log to a CSV
    dataset_name=None # "<my_hf_org>/<my_dataset>" if you want to log to the hub
)
interface.launch()
chat-generation
from dataset_viber import AnnotatorInterFace

prompts = [
    [
        {
            "role": "user",
            "content": "Tell me something about Anthony Bourdain."
        }
    ]
]

completions = [
    "Anthony Michael Bourdain was an American celebrity chef, author, and travel documentarian.",
]

interface = AnnotatorInterFace.for_chat_generation(
    prompts=prompts,
    completions=completions,
    fn_model=None, # a callable (e.g. a function or transformers pipeline) that returns `str`
    fn_next_input=None, # a function that feeds gradio components actively with the next input
    csv_logger=False, # True if you want to log to a CSV
    dataset_name=None # "<my_hf_org>/<my_dataset>" if you want to log to the hub
)
interface.launch()
chat-generation-preference
from dataset_viber import AnnotatorInterFace

prompts = [
    [
        {
            "role": "user",
            "content": "Tell me something about Anthony Bourdain."
        }
    ]
]
completions_a = [
    "Anthony Michael Bourdain was an American celebrity chef, author, and travel documentarian.",
]
completions_b = [
    "Anthony Michael Bourdain was a cool guy that knew how to cook."
]

interface = AnnotatorInterFace.for_chat_generation_preference(
    prompts=prompts,
    completions_a=completions_a,
    completions_b=completions_b,
    fn_model=None, # a callable (e.g. a function or transformers pipeline) that returns `str`
    fn_next_input=None, # a function that feeds gradio components actively with the next input
    csv_logger=False, # True if you want to log to a CSV
    dataset_name=None # "<my_hf_org>/<my_dataset>" if you want to log to the hub
)
interface.launch()
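
The chat examples above pass `prompts` as a list of conversations, where each conversation is a list of dictionaries with `role` and `content` keys. Below is a minimal sketch of a sanity check for that shape before launching an interface. The helper name `validate_chat_prompts` and the allowed role set are assumptions for illustration, not part of dataset_viber.

```python
from typing import Any, Dict, List

# Assumed set of roles; the examples above only use "user" and "assistant".
ALLOWED_ROLES = {"system", "user", "assistant"}


def validate_chat_prompts(prompts: List[List[Dict[str, Any]]]) -> None:
    """Raise ValueError if a conversation is empty or a message is malformed."""
    for conv_idx, conversation in enumerate(prompts):
        if not conversation:
            raise ValueError(f"Conversation {conv_idx} is empty")
        for msg_idx, message in enumerate(conversation):
            where = f"conversation {conv_idx}, message {msg_idx}"
            if message.get("role") not in ALLOWED_ROLES:
                raise ValueError(f"Unexpected role in {where}")
            if "content" not in message:
                raise ValueError(f"Missing content in {where}")


prompts = [
    [
        {"role": "user", "content": "Tell me something about Anthony Bourdain."},
    ]
]
validate_chat_prompts(prompts)  # passes silently for well-formed input
```

The check only requires that `content` is present, since multi-modal messages may carry something other than a plain string.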

Image and multi-modal

https://github.com/user-attachments/assets/57d89edf-ae40-4942-a20a-bf8443100b66

Hub dataset

[!TIP] I recommend uploading the files to cloud storage and using the remote URLs to avoid any issues. This can be done using Hugging Face Datasets, as shown in utils.

image-classification/multi-label-image-classification
from dataset_viber import AnnotatorInterFace

images = [
    "https://upload.wikimedia.org/wikipedia/commons/thumb/a/a5/Anthony_Bourdain_Peabody_2014b.jpg/440px-Anthony_Bourdain_Peabody_2014b.jpg",
    "https://upload.wikimedia.org/wikipedia/commons/8/85/David_Chang_David_Shankbone_2010.jpg"
]
labels = ["anthony-bourdain", "not-anthony-bourdain"]

interface = AnnotatorInterFace.for_image_classification(
    images=images,
    labels=labels,
    multi_label=False, # True if you have multi-label data
    fn_model=None, # a callable (e.g. a function or transformers pipeline) that returns `str`
    fn_next_input=None, # a function that feeds gradio components actively with the next input
    csv_logger=False, # True if you want to log to a CSV
    dataset_name=None # "<my_hf_org>/<my_dataset>" if you want to log to the hub
)
interface.launch()
image-generation
from dataset_viber import AnnotatorInterFace

prompts = [
    "Anthony Bourdain laughing",
    "David Chang wearing a suit"
]
images = [
    "https://upload.wikimedia.org/wikipedia/commons/8/85/David_Chang_David_Shankbone_2010.jpg",
    "https://upload.wikimedia.org/wikipedia/commons/thumb/a/a5/Anthony_Bourdain_Peabody_2014b.jpg/440px-Anthony_Bourdain_Peabody_2014b.jpg",
]

interface = AnnotatorInterFace.for_image_generation(
    prompts=prompts,
    completions=images,
    fn_model=None, # a callable (e.g. a function or transformers pipeline) that returns `str`
    fn_next_input=None, # a function that feeds gradio components actively with the next input
    csv_logger=False, # True if you want to log to a CSV
    dataset_name=None # "<my_hf_org>/<my_dataset>" if you want to log to the hub
)

interface.launch()
image-description
from dataset_viber import AnnotatorInterFace

images = [
    "https://upload.wikimedia.org/wikipedia/commons/thumb/a/a5/Anthony_Bourdain_Peabody_2014b.jpg/440px-Anthony_Bourdain_Peabody_2014b.jpg",
    "https://upload.wikimedia.org/wikipedia/commons/8/85/David_Chang_David_Shankbone_2010.jpg"
]
descriptions = ["Anthony Bourdain laughing", "David Chang wearing a suit"]

interface = AnnotatorInterFace.for_image_description(
    images=images,
    descriptions=descriptions, # optional to show initial descriptions
    fn_model=None, # a callable (e.g. a function or transformers pipeline) that returns `str`
    fn_next_input=None, # a function that feeds gradio components actively with the next input
    csv_logger=False, # True if you want to log to a CSV
    dataset_name=None # "<my_hf_org>/<my_dataset>" if you want to log to the hub
)
interface.launch()
image-question-answering/visual-question-answering
from dataset_viber import AnnotatorInterFace

images = [
    "https://upload.wikimedia.org/wikipedia/commons/thumb/a/a5/Anthony_Bourdain_Peabody_2014b.jpg/440px-Anthony_Bourdain_Peabody_2014b.jpg",
    "https://upload.wikimedia.org/wikipedia/commons/8/85/David_Chang_David_Shankbone_2010.jpg"
]
questions = ["Who is this?", "What is he wearing?"]
answers = ["Anthony Bourdain", "a suit"]

interface = AnnotatorInterFace.for_image_question_answering(
    images=images,
    questions=questions, # optional to show initial questions
    answers=answers, # optional to show initial answers
    fn_model=None, # a callable (e.g. a function or transformers pipeline) that returns `str`
    fn_next_input=None, # a function that feeds gradio components actively with the next input
    csv_logger=False, # True if you want to log to a CSV
    dataset_name=None # "<my_hf_org>/<my_dataset>" if you want to log to the hub
)
interface.launch()
image-generation-preference
from dataset_viber import AnnotatorInterFace

prompts = [
    "Anthony Bourdain laughing",
    "David Chang wearing a suit"
]

images_a = [
    "https://upload.wikimedia.org/wikipedia/commons/8/85/David_Chang_David_Shankbone_2010.jpg",
    "https://upload.wikimedia.org/wikipedia/commons/thumb/a/a5/Anthony_Bourdain_Peabody_2014b.jpg/440px-Anthony_Bourdain_Peabody_2014b.jpg",
]

images_b = [
    "https://upload.wikimedia.org/wikipedia/commons/thumb/a/a5/Anthony_Bourdain_Peabody_2014b.jpg/440px-Anthony_Bourdain_Peabody_2014b.jpg",
    "https://upload.wikimedia.org/wikipedia/commons/8/85/David_Chang_David_Shankbone_2010.jpg"
]

interface = AnnotatorInterFace.for_image_generation_preference(
    prompts=prompts,
    completions_a=images_a,
    completions_b=images_b,
    fn_model=None, # a callable (e.g. a function or transformers pipeline) that returns `str`
    fn_next_input=None, # a function that feeds gradio components actively with the next input
    csv_logger=False, # True if you want to log to a CSV
    dataset_name=None # "<my_hf_org>/<my_dataset>" if you want to log to the hub
)
interface.launch()

BulkInterface

Built on top of Dash, plotly-express, umap-learn, and Embedder to embed your data, understand its distribution, and annotate it.

https://github.com/user-attachments/assets/5e96c06d-e37f-45a0-9633-1a8e714d71ed

Hub dataset

text-visualization
from dataset_viber import BulkInterface
from datasets import load_dataset

ds = load_dataset("SetFit/ag_news", split="train[:2000]")

interface: BulkInterface = BulkInterface.for_text_visualization(
    ds.to_pandas()[["text", "label_text"]],
    content_column='text',
    label_column='label_text',
)
interface.launch()
text-classification
from dataset_viber import BulkInterface
from datasets import load_dataset

ds = load_dataset("SetFit/ag_news", split="train[:2000]")
df = ds.to_pandas()[["text", "label_text"]]

interface = BulkInterface.for_text_classification(
    dataframe=df,
    content_column='text',
    label_column='label_text',
    labels=df['label_text'].unique().tolist()
)
interface.launch()
chat-visualization
from dataset_viber.bulk import BulkInterface
from datasets import load_dataset

ds = load_dataset("argilla/distilabel-capybara-dpo-7k-binarized", split="train[:1000]")
df = ds.to_pandas()[["chosen"]]

interface = BulkInterface.for_chat_visualization(
    dataframe=df,
    chat_column='chosen',
)
interface.launch()
chat-classification
from dataset_viber.bulk import BulkInterface
from datasets import load_dataset

ds = load_dataset("argilla/distilabel-capybara-dpo-7k-binarized", split="train[:1000]")
df = ds.to_pandas()[["chosen"]]

interface = BulkInterface.for_chat_classification(
    dataframe=df,
    chat_column='chosen',
    labels=["math", "science", "history", "question seeking"],
)
interface.launch()

Embedder

Built on top of onnx and optimum to efficiently embed data.

Embedder
from dataset_viber.embedder import Embedder

embedder = Embedder(model_id="sentence-transformers/all-MiniLM-L6-v2")
embedder.encode(["Anthony Bourdain was an amazing chef in New York."])

Utils

Shuffle inputs in the same order

When working with multiple inputs, you might want to shuffle them in the same order.

import random

def shuffle_lists(*lists):
    if not lists:
        return []

    # Get the length of the first list
    length = len(lists[0])

    # Check if all lists have the same length
    if not all(len(lst) == length for lst in lists):
        raise ValueError("All input lists must have the same length")

    # Create a list of indices and shuffle it
    indices = list(range(length))
    random.shuffle(indices)

    # Reorder each list based on the shuffled indices
    return [
        [lst[i] for i in indices]
        for lst in lists
    ]
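
A short usage sketch for the helper above, for instance to shuffle `prompts` and `completions` together so that pairs stay aligned. The function is repeated in compact form so the snippet runs on its own. The fixed seed is only there to make the demo repeatable and is an assumption, not something the original helper does.

```python
import random


def shuffle_lists(*lists):
    """Shuffle several equal-length lists with one shared permutation."""
    if not lists:
        return []
    length = len(lists[0])
    if not all(len(lst) == length for lst in lists):
        raise ValueError("All input lists must have the same length")
    indices = list(range(length))
    random.shuffle(indices)
    return [[lst[i] for i in indices] for lst in lists]


random.seed(0)  # optional: repeatable order for this demo
prompts = ["prompt 1", "prompt 2", "prompt 3"]
completions = ["completion 1", "completion 2", "completion 3"]

shuffled_prompts, shuffled_completions = shuffle_lists(prompts, completions)

# Each prompt still sits at the same index as its own completion.
for prompt, completion in zip(shuffled_prompts, shuffled_completions):
    print(prompt, "<->", completion)
```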
Random swap to randomize completions

When working with multiple completions, you might want to randomly swap the completions that share an index across lists, so that the completion at index x can end up in any of the lists. This is useful for preference learning, for example to avoid always showing one source in the same position.

import random

def swap_completions(*lists):
    # Assuming all lists are of the same length
    length = len(lists[0])

    # Check if all lists have the same length
    if not all(len(lst) == length for lst in lists):
        raise ValueError("All input lists must have the same length")

    # Convert the input lists (which are tuples) to a list of lists
    lists = [list(lst) for lst in lists]

    # Iterate over each index
    for i in range(length):
        # Get the elements at index i from all lists
        elements = [lst[i] for lst in lists]

        # Randomly shuffle the elements
        random.shuffle(elements)

        # Assign the shuffled elements back to the lists
        for j, lst in enumerate(lists):
            lst[i] = elements[j]

    return lists
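
One thing the helper above does not return is which list each completion originally came from. If that matters, for example to map a preference back to the model that produced it, a variant can also return the source index per position. This is only a sketch: `swap_completions_with_origin` is a hypothetical name and not part of dataset_viber.

```python
import random


def swap_completions_with_origin(*lists):
    """Shuffle completions per index and record where each one came from.

    Returns (swapped, origins), where origins[j][i] is the index of the
    input list that supplied swapped[j][i].
    """
    if not lists:
        return [], []
    length = len(lists[0])
    if not all(len(lst) == length for lst in lists):
        raise ValueError("All input lists must have the same length")

    swapped = [[None] * length for _ in lists]
    origins = [[None] * length for _ in lists]
    for i in range(length):
        # Pick a random ordering of the source lists for this index.
        order = list(range(len(lists)))
        random.shuffle(order)
        for j, source in enumerate(order):
            swapped[j][i] = lists[source][i]
            origins[j][i] = source
    return swapped, origins


completions_a = ["a0", "a1", "a2"]
completions_b = ["b0", "b1", "b2"]
(shown_a, shown_b), (origin_a, origin_b) = swap_completions_with_origin(
    completions_a, completions_b
)
print(shown_a, origin_a)
print(shown_b, origin_b)
```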
Load remote image URLs from Hugging Face Hub

When working with images, you might want to load remote URLs from the Hugging Face Hub.

from datasets import Image, load_dataset

dataset = load_dataset(
    "my_hf_org/my_image_dataset",
    split="train",  # assumed split name; selecting a split lets dataset[0] index a row
).cast_column("my_image_column", Image(decode=False))
dataset[0]["my_image_column"]
# {'bytes': None, 'path': 'path_to_image.jpg'}
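
With `decode=False`, each image cell is a dictionary like the one in the comment above rather than a decoded image. Below is a minimal sketch of collecting the `path` values into a plain list, for instance to pass as `images`. The helper name `collect_image_paths` is hypothetical, and the rows are hand-written stand-ins so the snippet runs offline.

```python
from typing import Any, Dict, Iterable, List


def collect_image_paths(
    rows: Iterable[Dict[str, Dict[str, Any]]],
    image_column: str = "my_image_column",
) -> List[str]:
    """Return the non-empty `path` entries found in `image_column`."""
    paths = []
    for row in rows:
        path = row[image_column].get("path")
        if path:  # skip rows where no path is available
            paths.append(path)
    return paths


# Stand-in rows shaped like the undecoded output shown above.
rows = [
    {"my_image_column": {"bytes": None, "path": "path_to_image.jpg"}},
    {"my_image_column": {"bytes": None, "path": None}},
]
images = collect_image_paths(rows)
print(images)
```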

Contribute and development setup

First, install PDM.

Then, install the environment. This will automatically create a .venv virtual env and install the dev environment.

pdm install

Lastly, run pre-commit for formatting on commit.

pre-commit install

Follow this guide on making first contributions.

References

Logo

Keyboard icons created by srip - Flaticon

Inspirations
