Dataset Viber is your chill repo for data collection, annotation and vibe checks.
Dataset Viber
Avoid the hype, check the vibe!
I've cooked up Dataset Viber, a cool set of tools to make your life easier when dealing with data for AI models. Dataset Viber is all about making your data prep journey smooth and fun. It's not for team collaboration or production, nor trying to be all fancy and formal - just a bunch of cool tools to help you collect feedback and do vibe-checks as an AI engineer or lover. Want to see it in action? Just plug it in and start vibing with your data. It's that easy!
- CollectorInterface: Lazily collect data of model interactions without human annotation.
- AnnotatorInterface: Walk through your data and annotate it with models in the loop.
- BulkInterface: Explore your data distribution and annotate in bulk.
- Embedder: Efficiently embed data with ONNX-optimized speeds.
Need any tweaks or want to hear more about a specific tool? Just open an issue or give me a shout!
[!NOTE]
- Data is logged to a local CSV or directly to the Hugging Face Hub.
- All tools also run in `.ipynb` notebooks.
- Models in the loop through `fn_model`.
- Input data streamers through `fn_next_input`.
- It supports various tasks for `text`, `chat` and `image` modalities.
- Import and export from the Hugging Face Hub or CSV files.
[!TIP] Examples can be found in src/dataset_viber/examples.
Installation
You can install the package via pip:
pip install dataset-viber
Or install `BulkInterface` dependencies:
pip install dataset-viber[bulk]
How are we vibing?
CollectorInterface
Built on top of `gr.Interface` and `gr.ChatInterface` to lazily collect data for interactions automatically.
https://github.com/user-attachments/assets/4ddac8a1-62ab-4b3b-9254-f924f5898075
CollectorInterface
import gradio as gr
from dataset_viber import CollectorInterface
def calculator(num1, operation, num2):
    if operation == "add":
        return num1 + num2
    elif operation == "subtract":
        return num1 - num2
    elif operation == "multiply":
        return num1 * num2
    elif operation == "divide":
        return num1 / num2

inputs = ["number", gr.Radio(["add", "subtract", "multiply", "divide"]), "number"]
outputs = "number"
interface = CollectorInterface(
    fn=calculator,
    inputs=inputs,
    outputs=outputs,
    csv_logger=False,  # True if you want to log to a CSV
    dataset_name="<my_hf_org>/<my_dataset>"
)
interface.launch()
CollectorInterface.from_interface
interface = gr.Interface(
    fn=calculator,
    inputs=inputs,
    outputs=outputs
)
interface = CollectorInterface.from_interface(
    interface=interface,
    csv_logger=False,  # True if you want to log to a CSV
    dataset_name="<my_hf_org>/<my_dataset>"
)
interface.launch()
CollectorInterface.from_pipeline
from transformers import pipeline
from dataset_viber import CollectorInterface
# Use a separate name so the imported `pipeline` factory is not shadowed
pipe = pipeline("text-classification", model="mrm8488/bert-tiny-finetuned-sms-spam-detection")
interface = CollectorInterface.from_pipeline(
    pipeline=pipe,
    csv_logger=False,  # True if you want to log to a CSV
    dataset_name="<my_hf_org>/<my_dataset>"
)
interface.launch()
AnnotatorInterface
Built on top of the `CollectorInterface` to collect and annotate data and log it to the Hub.
Text
https://github.com/user-attachments/assets/d1abda66-9972-4c60-89d2-7626f5654f15
text-classification/multi-label-text-classification
from dataset_viber import AnnotatorInterFace
texts = [
    "Anthony Bourdain was an amazing chef!",
    "Anthony Bourdain was a terrible tv persona!"
]
labels = ["positive", "negative"]
interface = AnnotatorInterFace.for_text_classification(
    texts=texts,
    labels=labels,
    multi_label=False,  # True if you have multi-label data
    fn_model=None,  # a callable (e.g. a function or transformers pipeline) that returns `str`
    fn_next_input=None,  # a function that feeds gradio components actively with the next input
    csv_logger=False,  # True if you want to log to a CSV
    dataset_name=None  # "<my_hf_org>/<my_dataset>" if you want to log to the hub
)
interface.launch()
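The `fn_model` argument above takes a callable that returns `str`. Here is a minimal sketch of such a callable, assuming it receives the text being annotated and returns one of the labels. The exact call signature is an assumption (check the examples in src/dataset_viber/examples), and the `keyword_model` name and word lists are hypothetical:

```python
# Hypothetical keyword lists, only for illustration.
POSITIVE_WORDS = {"amazing", "great", "good"}
NEGATIVE_WORDS = {"terrible", "awful", "bad"}


def keyword_model(text: str) -> str:
    """Hypothetical stand-in for a real model: returns "positive" or "negative"."""
    # Lowercase each word and strip surrounding punctuation before matching.
    words = {word.strip(".,!?").lower() for word in text.split()}
    positive_hits = len(words & POSITIVE_WORDS)
    negative_hits = len(words & NEGATIVE_WORDS)
    # Ties (including no hits at all) fall back to "positive".
    return "negative" if negative_hits > positive_hits else "positive"


print(keyword_model("Anthony Bourdain was an amazing chef!"))
print(keyword_model("Anthony Bourdain was a terrible tv persona!"))
```

If the signature assumption holds, this could be passed as `fn_model=keyword_model` to pre-fill a suggestion for each text.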
token-classification
from dataset_viber import AnnotatorInterFace
texts = ["Anthony Bourdain was an amazing chef in New York."]
labels = ["NAME", "LOC"]
interface = AnnotatorInterFace.for_token_classification(
    texts=texts,
    labels=labels,
    fn_model=None,  # a callable (e.g. a function or transformers pipeline) that returns `str`
    fn_next_input=None,  # a function that feeds gradio components actively with the next input
    csv_logger=False,  # True if you want to log to a CSV
    dataset_name=None  # "<my_hf_org>/<my_dataset>" if you want to log to the hub
)
interface.launch()
extractive-question-answering
from dataset_viber import AnnotatorInterFace
questions = ["Where was Anthony Bourdain located?"]
contexts = ["Anthony Bourdain was an amazing chef in New York."]
interface = AnnotatorInterFace.for_question_answering(
    questions=questions,
    contexts=contexts,
    fn_model=None,  # a callable (e.g. a function or transformers pipeline) that returns `str`
    fn_next_input=None,  # a function that feeds gradio components actively with the next input
    csv_logger=False,  # True if you want to log to a CSV
    dataset_name=None  # "<my_hf_org>/<my_dataset>" if you want to log to the hub
)
interface.launch()
text-generation/translation/completion
from dataset_viber import AnnotatorInterFace
prompts = ["Tell me something about Anthony Bourdain."]
completions = ["Anthony Michael Bourdain was an American celebrity chef, author, and travel documentarian."]
interface = AnnotatorInterFace.for_text_generation(
    prompts=prompts,  # source
    completions=completions,  # optional to show initial completion / target
    fn_model=None,  # a callable (e.g. a function or transformers pipeline) that returns `str`
    fn_next_input=None,  # a function that feeds gradio components actively with the next input
    csv_logger=False,  # True if you want to log to a CSV
    dataset_name=None  # "<my_hf_org>/<my_dataset>" if you want to log to the hub
)
interface.launch()
text-generation-preference
from dataset_viber import AnnotatorInterFace
prompts = ["Tell me something about Anthony Bourdain."]
completions_a = ["Anthony Michael Bourdain was an American celebrity chef, author, and travel documentarian."]
completions_b = ["Anthony Michael Bourdain was a cool guy that knew how to cook."]
interface = AnnotatorInterFace.for_text_generation_preference(
    prompts=prompts,
    completions_a=completions_a,
    completions_b=completions_b,
    fn_model=None,  # a callable (e.g. a function or transformers pipeline) that returns `str`
    fn_next_input=None,  # a function that feeds gradio components actively with the next input
    csv_logger=False,  # True if you want to log to a CSV
    dataset_name=None  # "<my_hf_org>/<my_dataset>" if you want to log to the hub
)
interface.launch()
Chat and multi-modal chat
https://github.com/user-attachments/assets/fe7f0139-95a3-40e8-bc03-e37667d4f7a9
[!TIP] I recommend uploading the files to cloud storage and using the remote URL to avoid any issues. This can be done using Hugging Face Datasets, as shown in utils. Additionally, GradioChatbot shows how to use the chatbot interface for multi-modal.
chat-classification
from dataset_viber import AnnotatorInterFace
prompts = [
    [
        {
            "role": "user",
            "content": "Tell me something about Anthony Bourdain."
        },
        {
            "role": "assistant",
            "content": "Anthony Michael Bourdain was an American celebrity chef, author, and travel documentarian."
        }
    ]
]
interface = AnnotatorInterFace.for_chat_classification(
    prompts=prompts,
    labels=["toxic", "non-toxic"],
    multi_label=False,  # True if you have multi-label data
    fn_model=None,  # a callable (e.g. a function or transformers pipeline) that returns `str`
    fn_next_input=None,  # a function that feeds gradio components actively with the next input
    csv_logger=False,  # True if you want to log to a CSV
    dataset_name=None  # "<my_hf_org>/<my_dataset>" if you want to log to the hub
)
interface.launch()
chat-generation
from dataset_viber import AnnotatorInterFace
prompts = [
    [
        {
            "role": "user",
            "content": "Tell me something about Anthony Bourdain."
        }
    ]
]
completions = [
    "Anthony Michael Bourdain was an American celebrity chef, author, and travel documentarian.",
]
interface = AnnotatorInterFace.for_chat_generation(
    prompts=prompts,
    completions=completions,
    fn_model=None,  # a callable (e.g. a function or transformers pipeline) that returns `str`
    fn_next_input=None,  # a function that feeds gradio components actively with the next input
    csv_logger=False,  # True if you want to log to a CSV
    dataset_name=None  # "<my_hf_org>/<my_dataset>" if you want to log to the hub
)
interface.launch()
chat-generation-preference
from dataset_viber import AnnotatorInterFace
prompts = [
    [
        {
            "role": "user",
            "content": "Tell me something about Anthony Bourdain."
        }
    ]
]
completions_a = [
    "Anthony Michael Bourdain was an American celebrity chef, author, and travel documentarian.",
]
completions_b = [
    "Anthony Michael Bourdain was a cool guy that knew how to cook."
]
interface = AnnotatorInterFace.for_chat_generation_preference(
    prompts=prompts,
    completions_a=completions_a,
    completions_b=completions_b,
    fn_model=None,  # a callable (e.g. a function or transformers pipeline) that returns `str`
    fn_next_input=None,  # a function that feeds gradio components actively with the next input
    csv_logger=False,  # True if you want to log to a CSV
    dataset_name=None  # "<my_hf_org>/<my_dataset>" if you want to log to the hub
)
interface.launch()
Image and multi-modal
https://github.com/user-attachments/assets/57d89edf-ae40-4942-a20a-bf8443100b66
[!TIP] I recommend uploading the files to cloud storage and using the remote URL to avoid any issues. This can be done using Hugging Face Datasets, as shown in utils.
image-classification/multi-label-image-classification
from dataset_viber import AnnotatorInterFace
images = [
    "https://upload.wikimedia.org/wikipedia/commons/thumb/a/a5/Anthony_Bourdain_Peabody_2014b.jpg/440px-Anthony_Bourdain_Peabody_2014b.jpg",
    "https://upload.wikimedia.org/wikipedia/commons/8/85/David_Chang_David_Shankbone_2010.jpg"
]
labels = ["anthony-bourdain", "not-anthony-bourdain"]
interface = AnnotatorInterFace.for_image_classification(
    images=images,
    labels=labels,
    multi_label=False,  # True if you have multi-label data
    fn_model=None,  # a callable (e.g. a function or transformers pipeline) that returns `str`
    fn_next_input=None,  # a function that feeds gradio components actively with the next input
    csv_logger=False,  # True if you want to log to a CSV
    dataset_name=None  # "<my_hf_org>/<my_dataset>" if you want to log to the hub
)
interface.launch()
image-generation
from dataset_viber import AnnotatorInterFace
prompts = [
    "Anthony Bourdain laughing",
    "David Chang wearing a suit"
]
images = [
    "https://upload.wikimedia.org/wikipedia/commons/8/85/David_Chang_David_Shankbone_2010.jpg",
    "https://upload.wikimedia.org/wikipedia/commons/thumb/a/a5/Anthony_Bourdain_Peabody_2014b.jpg/440px-Anthony_Bourdain_Peabody_2014b.jpg",
]
interface = AnnotatorInterFace.for_image_generation(
    prompts=prompts,
    completions=images,
    fn_model=None,  # a callable (e.g. a function or transformers pipeline) that returns `str`
    fn_next_input=None,  # a function that feeds gradio components actively with the next input
    csv_logger=False,  # True if you want to log to a CSV
    dataset_name=None  # "<my_hf_org>/<my_dataset>" if you want to log to the hub
)
interface.launch()
image-description
from dataset_viber import AnnotatorInterFace
images = [
    "https://upload.wikimedia.org/wikipedia/commons/thumb/a/a5/Anthony_Bourdain_Peabody_2014b.jpg/440px-Anthony_Bourdain_Peabody_2014b.jpg",
    "https://upload.wikimedia.org/wikipedia/commons/8/85/David_Chang_David_Shankbone_2010.jpg"
]
descriptions = ["Anthony Bourdain laughing", "David Chang wearing a suit"]
interface = AnnotatorInterFace.for_image_description(
    images=images,
    descriptions=descriptions,  # optional to show initial descriptions
    fn_model=None,  # a callable (e.g. a function or transformers pipeline) that returns `str`
    fn_next_input=None,  # a function that feeds gradio components actively with the next input
    csv_logger=False,  # True if you want to log to a CSV
    dataset_name=None  # "<my_hf_org>/<my_dataset>" if you want to log to the hub
)
interface.launch()
image-question-answering/visual-question-answering
from dataset_viber import AnnotatorInterFace
images = [
    "https://upload.wikimedia.org/wikipedia/commons/thumb/a/a5/Anthony_Bourdain_Peabody_2014b.jpg/440px-Anthony_Bourdain_Peabody_2014b.jpg",
    "https://upload.wikimedia.org/wikipedia/commons/8/85/David_Chang_David_Shankbone_2010.jpg"
]
questions = ["Who is this?", "What is he wearing?"]
answers = ["Anthony Bourdain", "a suit"]
interface = AnnotatorInterFace.for_image_question_answering(
    images=images,
    questions=questions,  # optional to show initial questions
    answers=answers,  # optional to show initial answers
    fn_model=None,  # a callable (e.g. a function or transformers pipeline) that returns `str`
    fn_next_input=None,  # a function that feeds gradio components actively with the next input
    csv_logger=False,  # True if you want to log to a CSV
    dataset_name=None  # "<my_hf_org>/<my_dataset>" if you want to log to the hub
)
interface.launch()
image-generation-preference
from dataset_viber import AnnotatorInterFace
prompts = [
    "Anthony Bourdain laughing",
    "David Chang wearing a suit"
]
images_a = [
    "https://upload.wikimedia.org/wikipedia/commons/8/85/David_Chang_David_Shankbone_2010.jpg",
    "https://upload.wikimedia.org/wikipedia/commons/thumb/a/a5/Anthony_Bourdain_Peabody_2014b.jpg/440px-Anthony_Bourdain_Peabody_2014b.jpg",
]
images_b = [
    "https://upload.wikimedia.org/wikipedia/commons/thumb/a/a5/Anthony_Bourdain_Peabody_2014b.jpg/440px-Anthony_Bourdain_Peabody_2014b.jpg",
    "https://upload.wikimedia.org/wikipedia/commons/8/85/David_Chang_David_Shankbone_2010.jpg"
]
interface = AnnotatorInterFace.for_image_generation_preference(
    prompts=prompts,
    completions_a=images_a,
    completions_b=images_b,
    fn_model=None,  # a callable (e.g. a function or transformers pipeline) that returns `str`
    fn_next_input=None,  # a function that feeds gradio components actively with the next input
    csv_logger=False,  # True if you want to log to a CSV
    dataset_name=None  # "<my_hf_org>/<my_dataset>" if you want to log to the hub
)
interface.launch()
BulkInterface
Built on top of `Dash`, `plotly-express`, `umap-learn`, and `Embedder` to embed and understand your distribution and annotate your data.
https://github.com/user-attachments/assets/5e96c06d-e37f-45a0-9633-1a8e714d71ed
text-visualization
from dataset_viber import BulkInterface
from datasets import load_dataset
ds = load_dataset("SetFit/ag_news", split="train[:2000]")
interface: BulkInterface = BulkInterface.for_text_visualization(
    ds.to_pandas()[["text", "label_text"]],
    content_column='text',
    label_column='label_text',
)
interface.launch()
text-classification
from dataset_viber import BulkInterface
from datasets import load_dataset
ds = load_dataset("SetFit/ag_news", split="train[:2000]")
df = ds.to_pandas()[["text", "label_text"]]
interface = BulkInterface.for_text_classification(
    dataframe=df,
    content_column='text',
    label_column='label_text',
    labels=df['label_text'].unique().tolist()
)
interface.launch()
chat-visualization
from dataset_viber.bulk import BulkInterface
from datasets import load_dataset
ds = load_dataset("argilla/distilabel-capybara-dpo-7k-binarized", split="train[:1000]")
df = ds.to_pandas()[["chosen"]]
interface = BulkInterface.for_chat_visualization(
    dataframe=df,
    chat_column='chosen',
)
interface.launch()
chat-classification
from dataset_viber.bulk import BulkInterface
from datasets import load_dataset
ds = load_dataset("argilla/distilabel-capybara-dpo-7k-binarized", split="train[:1000]")
df = ds.to_pandas()[["chosen"]]
interface = BulkInterface.for_chat_classification(
    dataframe=df,
    chat_column='chosen',
    labels=["math", "science", "history", "question seeking"],
)
interface.launch()
Embedder
Built on top of `onnx` and `optimum` to efficiently embed data.
Embedder
from dataset_viber.embedder import Embedder
embedder = Embedder(model_id="sentence-transformers/all-MiniLM-L6-v2")
embedder.encode(["Anthony Bourdain was an amazing chef in New York."])
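Embeddings are usually compared with cosine similarity. The sketch below assumes that `embedder.encode` yields one numeric vector per input text. The return type is not documented here, so the vectors are hypothetical placeholders and not real model output, and `cosine_similarity` is an illustrative helper that is not part of the package:

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("Vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    # A zero vector has no direction; treat its similarity as 0.0.
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical placeholder vectors standing in for embeddings.
vec_a = [1.0, 0.0, 1.0]
vec_b = [1.0, 0.0, 1.0]
vec_c = [0.0, 1.0, 0.0]

print(cosine_similarity(vec_a, vec_b))  # identical direction, close to 1
print(cosine_similarity(vec_a, vec_c))  # orthogonal vectors
```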
Utils
Shuffle inputs in the same order
When working with multiple inputs, you might want to shuffle them in the same order.
import random

def shuffle_lists(*lists):
    if not lists:
        return []
    # Get the length of the first list
    length = len(lists[0])
    # Check if all lists have the same length
    if not all(len(lst) == length for lst in lists):
        raise ValueError("All input lists must have the same length")
    # Create a list of indices and shuffle it
    indices = list(range(length))
    random.shuffle(indices)
    # Reorder each list based on the shuffled indices
    return [
        [lst[i] for i in indices]
        for lst in lists
    ]
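As a usage sketch for the shared-permutation idea, the snippet below shuffles prompts and completions together and checks that each prompt still lines up with its own completion. It is a self-contained variant: the `shuffle_together` name and the `seed` parameter are assumptions added here for reproducibility and are not part of the package.

```python
import random


def shuffle_together(*lists, seed=None):
    """Shuffle several equal-length lists with one shared permutation."""
    if not lists:
        return []
    length = len(lists[0])
    if any(len(lst) != length for lst in lists):
        raise ValueError("All input lists must have the same length")
    indices = list(range(length))
    # A dedicated Random instance keeps the global random state untouched.
    random.Random(seed).shuffle(indices)
    return [[lst[i] for i in indices] for lst in lists]


prompts = ["p0", "p1", "p2", "p3"]
completions = ["c0", "c1", "c2", "c3"]
shuffled_prompts, shuffled_completions = shuffle_together(
    prompts, completions, seed=42
)

# Each prompt is still paired with the completion that shares its number.
for prompt, completion in zip(shuffled_prompts, shuffled_completions):
    print(prompt, completion)
```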
Random swap to randomize completions
When working with multiple completions, you might want to randomly swap the completions that share an index, so that the completion at index x in one list may trade places with the completion at index x in another list. This is useful for preference learning.
import random

def swap_completions(*lists):
    # Assuming all lists are of the same length
    length = len(lists[0])
    # Check if all lists have the same length
    if not all(len(lst) == length for lst in lists):
        raise ValueError("All input lists must have the same length")
    # Copy each input into a new list so the caller's lists are not modified
    lists = [list(lst) for lst in lists]
    # Iterate over each index
    for i in range(length):
        # Get the elements at index i from all lists
        elements = [lst[i] for lst in lists]
        # Randomly shuffle the elements
        random.shuffle(elements)
        # Assign the shuffled elements back to the lists
        for j, lst in enumerate(lists):
            lst[i] = elements[j]
    return lists
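After a random swap you usually need to know which side each completion ended up on, so that a recorded preference can be mapped back to its source. Here is a minimal sketch for the two-list case. The `swap_pairs` name, the `seed` parameter and the returned flag list are assumptions for illustration and are not part of the package:

```python
import random


def swap_pairs(completions_a, completions_b, seed=None):
    """Randomly swap A/B per index and report where the original A ended up."""
    if len(completions_a) != len(completions_b):
        raise ValueError("Both lists must have the same length")
    rng = random.Random(seed)
    out_a, out_b, a_kept = [], [], []
    for a, b in zip(completions_a, completions_b):
        keep = rng.random() < 0.5  # True: leave this pair as it is
        out_a.append(a if keep else b)
        out_b.append(b if keep else a)
        a_kept.append(keep)
    return out_a, out_b, a_kept


model_x = ["x0", "x1", "x2"]
model_y = ["y0", "y1", "y2"]
shown_a, shown_b, a_kept = swap_pairs(model_x, model_y, seed=0)

# a_kept[i] is True when shown_a[i] still comes from model_x.
for left, right, kept in zip(shown_a, shown_b, a_kept):
    print(left, right, kept)
```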
Load remote image URLs from Hugging Face Hub
When working with images, you might want to load remote URLs from the Hugging Face Hub.
from datasets import Image, load_dataset
dataset = load_dataset(
    "my_hf_org/my_image_dataset",
    split="train",  # assumes a "train" split; row indexing below needs a single split
).cast_column("my_image_column", Image(decode=False))
dataset[0]["my_image_column"]
# {'bytes': None, 'path': 'path_to_image.jpg'}
Contribute and development setup
First, install PDM.
Then, install the environment. This automatically creates a `.venv` virtual environment and installs the dev dependencies.
pdm install
Lastly, run pre-commit for formatting on commit.
pre-commit install
Follow this guide on making first contributions.
References
Logo
Keyboard icons created by srip - Flaticon
Inspirations
- https://huggingface.co/spaces/davidberenstein1957/llm-human-feedback-collector-chat-interface-dpo
- https://huggingface.co/spaces/davidberenstein1957/llm-human-feedback-collector-chat-interface-kto
- https://medium.com/@oxenai/collecting-data-from-human-feedback-for-generative-ai-ec9e20bf01b9
- https://hamel.dev/notes/llm/finetuning/04_data_cleaning.html