Build datasets with natural language

Project description


title: Synthetic Data Generator
short_description: Build datasets using natural language
emoji: 🧬
colorFrom: yellow
colorTo: pink
sdk: gradio
sdk_version: 4.44.1
app_file: app.py
pinned: true
license: apache-2.0
hf_oauth: true
#header: mini
hf_oauth_scopes:
  - read-repos
  - write-repos
  - manage-repos
  - inference-api

Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

🧬 Synthetic Data Generator

Build datasets using natural language


This repository contains the code for the free Synthetic Data Generator app, which is hosted on the Hugging Face Hub.

How does it work?

Distilabel Synthetic Data Generator is a tool for creating high-quality datasets for training and fine-tuning language models. It uses distilabel and large language models to generate synthetic data tailored to your specific needs.

This tool simplifies the process of creating custom datasets, enabling you to:

  • Define the characteristics of your desired application
  • Generate system prompts and tasks automatically
  • Create sample datasets for quick iteration
  • Produce full-scale datasets with customizable parameters
  • Push your generated datasets directly to the Hugging Face Hub

By using Distilabel Synthetic Data Generator, you can rapidly prototype and create datasets, accelerating your AI development process.

Do you want to run this locally?

Clone the repository, then install the dependencies and start the app from the repository root:

pip install -r requirements.txt
python app.py

Note that you need an HF_TOKEN that can make calls to the free serverless Hugging Face Inference Endpoints. You can create one in your Hugging Face account settings.
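If you want to catch a missing token before the app starts, a small check like the one below can help. This is only a sketch: it assumes the token is supplied through an environment variable named HF_TOKEN, and the helper name require_hf_token is hypothetical and not part of this repository.

```python
import os


def require_hf_token(env=None):
    """Return the HF_TOKEN value, or raise if it is missing or empty.

    Assumption: the token is provided via the HF_TOKEN environment variable.
    `env` defaults to os.environ and can be replaced with a dict for testing.
    """
    env = os.environ if env is None else env
    token = env.get("HF_TOKEN", "").strip()
    if not token:
        raise RuntimeError(
            "HF_TOKEN is not set. Export it before running `python app.py`."
        )
    return token
```

Calling require_hf_token() in your own launch script before starting app.py should surface a missing token as a clear error rather than as a failed inference call later on.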

Do you need more control?

Each pipeline is based on a distilabel component, so you can easily run it locally or with other LLMs.

Check out the distilabel library for more information.


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

synthetic_dataset_generator-0.1.0.tar.gz (23.9 kB view details)

Uploaded Source

Built Distribution

File details

Details for the file synthetic_dataset_generator-0.1.0.tar.gz.

File hashes

Hashes for synthetic_dataset_generator-0.1.0.tar.gz
Algorithm Hash digest
SHA256 fe543e14e0419ef182080fc0cecd49d0e8aae507158e7c823364ca0fdd2e178a
MD5 c6a4e1a42386758b4254a6f2b7c2740d
BLAKE2b-256 70b2890a04899b75ea5bd94dd6987ec6d0d6b7eee828bc417d6fa7ec876f7645

See more details on using hashes here.
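If you download a file manually, you can compare its digest with the SHA256 value listed for it on this page. The following is a minimal sketch using only the Python standard library; the function names and the local file path are illustrative and not part of this package.

```python
import hashlib


def sha256_of_file(path, chunk_size=65536):
    """Compute the SHA256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read fixed-size chunks until read() returns an empty bytes object.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path, expected_hex):
    """Return True if the file's SHA256 digest equals the published value."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

For example, matches_published_hash("synthetic_dataset_generator-0.1.0.tar.gz", "<SHA256 from the list above>") should return True for an intact download, assuming the file sits in the current directory.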

File details

Details for the file synthetic_dataset_generator-0.1.0-py3-none-any.whl.

File hashes

Hashes for synthetic_dataset_generator-0.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 e5eebba9b044058fa631ca74ac4b64fbbeefe2017a15712ddd81f82a8ef48e35
MD5 2b342592a7fac651f541148bf328a921
BLAKE2b-256 a12b1e6bb9a9662efc8059b96cc6bb48c02900acb712cd8d94a89918d8aea46b

See more details on using hashes here.
