Model, Code & Data for the EMNLP'23 paper "Making Large Language Models Better Data Creators"
☆138 · Oct 19, 2023 · Updated 2 years ago
Alternatives and similar repositories for llm-data-creation
Users that are interested in llm-data-creation are comparing it to the libraries listed below.
- ICML 2025 Spotlight, PCEvolve: Private Contrastive Evolution for Synthetic Dataset Generation via Few-Shot Private Data and Generative AP… ☆14 · Jun 27, 2025 · Updated 9 months ago
- Targeted Data Generation with Large Language Models ☆19 · Jun 25, 2024 · Updated last year
- ☆24 · May 19, 2024 · Updated last year
- This repository contains the code for the EMNLP'23 paper "AdaSent: Efficient Domain-Adapted Sentence Embeddings for Few-Shot Classificati… ☆16 · Jun 3, 2024 · Updated last year
- Code Roberta version of RetroMAE: Pre-Training Retrieval-oriented Language Models Via Masked Auto-Encoder ☆10 · Mar 16, 2023 · Updated 3 years ago
- source code of (quasi-)Givens Orthogonal Fine Tuning integrated to peft lib ☆17 · Mar 13, 2025 · Updated last year
- Sales Conversion Optimization MLOps: Boost revenue with AI-powered insights. Features H2O AutoML, ZenML pipelines, Neptune.ai tracking, d… ☆21 · Mar 22, 2025 · Updated last year
- Your Python AI Coder! ☆36 · May 21, 2025 · Updated 10 months ago
- Synthetic Data for LLM Fine-Tuning ☆122 · Dec 5, 2023 · Updated 2 years ago
- [ICLR 2025] Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing. Your efficient and high-quality synthetic data … ☆841 · Mar 17, 2025 · Updated last year
- Collection of scripts to pretrain T5 in unsupervised text, using PyTorch Lightning. CORD-19 pretraining provided as example. ☆32 · Apr 26, 2021 · Updated 4 years ago
- ☆20 · Aug 5, 2024 · Updated last year
- ☆13 · Nov 5, 2024 · Updated last year
- Auto Data is a library designed for quick and effortless creation of datasets tailored for fine-tuning Large Language Models (LLMs). ☆106 · Oct 31, 2024 · Updated last year
- Slim tool definitions. Auto-compressed responses. Context efficiency on both end. Ships with 7 example servers, reduces context consumpti… ☆55 · Jan 24, 2026 · Updated 2 months ago
- ☆30 · Aug 2, 2024 · Updated last year
- ViSD4SA, a Vietnamese Span Detection for Aspect-based sentiment analysis dataset ☆17 · Oct 4, 2022 · Updated 3 years ago
- [AAAI 2024] DenoSent: A Denoising Objective for Self-Supervised Sentence Representation Learning ☆15 · Apr 29, 2024 · Updated last year
- Reference code base for ML Engineering in Action, Manning Publications Author: Ben Wilson ☆21 · Oct 22, 2023 · Updated 2 years ago
- Creates an Azure AI Service and deploys the specified models. ☆18 · Aug 22, 2025 · Updated 7 months ago
- ☆26 · Jun 17, 2024 · Updated last year
- [NeurIPS 2023] This is the code for the paper `Large Language Model as Attributed Training Data Generator: A Tale of Diversity and Bias`. ☆155 · Nov 2, 2023 · Updated 2 years ago
- For converting LLM datasets from one format into another. ☆22 · Nov 12, 2025 · Updated 5 months ago
- [ACL 2024] Progressive LLaMA with Block Expansion. ☆513 · May 20, 2024 · Updated last year
- DataDreamer: Prompt. Generate Synthetic Data. Train & Align Models. 🤖💤 ☆1,108 · Feb 2, 2025 · Updated last year
- [ACL'24] Selective Reflection-Tuning: Student-Selected Data Recycling for LLM Instruction-Tuning ☆367 · Sep 6, 2024 · Updated last year
- [AAAI 2025] Automatically Generating Numerous Context-Driven SFT Data for LLMs across Diverse Granularity ☆30 · Mar 17, 2025 · Updated last year
- DOMAINEVAL is an auto-constructed benchmark for multi-domain code generation that consists of 2k+ subjects (i.e., description, reference … ☆14 · Dec 12, 2024 · Updated last year
- ☆31 · Mar 23, 2024 · Updated 2 years ago
- ☆11 · Dec 15, 2025 · Updated 3 months ago
- Notus is a collection of fine-tuned LLMs using SFT, DPO, SFT+DPO, and/or any other RLHF techniques, while always keeping a data-first app… ☆168 · Jan 15, 2024 · Updated 2 years ago
- awesome synthetic (text) datasets ☆327 · Jan 8, 2026 · Updated 3 months ago
- Distilabel is a framework for synthetic data and AI feedback for engineers who need fast, reliable and scalable pipelines based on verifi… ☆3,158 · Apr 6, 2026 · Updated last week
- Few-shot Learning with Auxiliary Data ☆31 · Dec 8, 2023 · Updated 2 years ago
- InsTag: A Tool for Data Analysis in LLM Supervised Fine-tuning ☆285 · Aug 20, 2023 · Updated 2 years ago
- (NeurIPS 2024) QuanTA: Efficient High-Rank Fine-Tuning of LLMs with Quantum-Informed Tensor Adaptation ☆35 · Nov 18, 2025 · Updated 4 months ago
- ☆567 · Nov 20, 2024 · Updated last year
- We aim to provide the best references to search, select, and synthesize high-quality and large-quantity data for post-training your LLMs. ☆61 · Oct 3, 2024 · Updated last year
- SMASHED is a toolkit designed to apply transformations to samples in datasets, such as fields extraction, tokenization, prompting, batchi… ☆35 · May 24, 2024 · Updated last year