A live reading list for LLM data synthesis (Updated to July, 2025).
★ 464 · Aug 26, 2025 · Updated 7 months ago
Alternatives and similar repositories for LLM-Synthetic-Data
Users that are interested in LLM-Synthetic-Data are comparing it to the libraries listed below.
- A reading list on LLM based Synthetic Data Generation 🔥 (★ 1,531 · Jun 5, 2025 · Updated 9 months ago)
- Our code for ICLR'25 paper "DataMan: Data Manager for Pre-training Large Language Models". (★ 121 · Feb 7, 2026 · Updated last month)
- Our code for ICCV'23 paper "CAME: Contrastive Automated Model Evaluation". (★ 27 · Jan 25, 2024 · Updated 2 years ago)
- Our code for EMNLP'22 Oral paper "Distill the Image to Nowhere: Inversion Knowledge Distillation for Multimodal Machine Translation". (★ 30 · Jan 25, 2024 · Updated 2 years ago)
- Our code for ICLR'24 paper "Energy-based Automated Model Evaluation". (★ 23 · Feb 13, 2025 · Updated last year)
- (★ 648 · Jul 29, 2025 · Updated 7 months ago)
- (★ 17 · Jul 6, 2023 · Updated 2 years ago)
- An Easy-to-use, Scalable and High-performance Agentic RL Framework based on Ray (PPO & DAPO & REINFORCE++ & TIS & vLLM & Ray & Async RL) (★ 9,231 · Updated this week)
- [ICLR 2025] Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing. Your efficient and high-quality synthetic data … (★ 836 · Mar 17, 2025 · Updated last year)
- Awesome-RAG: Collect typical RAG papers and systems. (★ 456 · Aug 4, 2025 · Updated 7 months ago)
- (★ 762 · Dec 23, 2025 · Updated 3 months ago)
- Code implementation of synthetic continued pretraining (★ 157 · Jan 6, 2025 · Updated last year)
- (★ 54 · Nov 14, 2024 · Updated last year)
- Reproducing R1 for Code with Reliable Rewards (★ 300 · May 5, 2025 · Updated 10 months ago)
- Implementation of EMNLP 2023 Findings: Improving Question Generation with Multi-level Content Planning (★ 18 · Nov 30, 2023 · Updated 2 years ago)
- Simple RL training for reasoning (★ 3,842 · Dec 23, 2025 · Updated 3 months ago)
- (★ 52 · Jun 14, 2024 · Updated last year)
- A Survey on Data Selection for Language Models (★ 255 · Apr 29, 2025 · Updated 10 months ago)
- A Comprehensive Benchmark of Imbalanced Graph Learning (Accepted by ICLR 2025 Spotlight) (★ 12 · Apr 17, 2025 · Updated 11 months ago)
- verl: Volcano Engine Reinforcement Learning for LLMs (★ 20,097 · Updated this week)
- A collection of LLM papers, blogs, and projects, with a focus on OpenAI o1 and reasoning techniques. (★ 6,903 · Dec 17, 2025 · Updated 3 months ago)
- This is the official implementation of our ACL 2025 Main paper "Balancing Diversity and Risk in LLM Sampling". (★ 17 · Oct 16, 2025 · Updated 5 months ago)
- Collection of training data management explorations for large language models (★ 337 · Aug 2, 2024 · Updated last year)
- DataDreamer: Prompt. Generate Synthetic Data. Train & Align Models. (★ 1,101 · Feb 2, 2025 · Updated last year)
- The RedStone repository includes code for preparing extensive datasets used in training large language models. (★ 161 · Jan 22, 2026 · Updated 2 months ago)
- [ICLR'25] DataGen: Unified Synthetic Dataset Generation via Large Language Models (★ 66 · Mar 8, 2025 · Updated last year)
- Retrieval and Retrieval-augmented LLMs (★ 11,443 · Mar 10, 2026 · Updated 2 weeks ago)
- [NAACL'24] Self-data filtering of LLM instruction-tuning data using a novel perplexity-based difficulty score, without using any other mo… (★ 416 · Jun 25, 2025 · Updated 9 months ago)
- A Survey of Reinforcement Learning for Large Reasoning Models (★ 2,393 · Nov 9, 2025 · Updated 4 months ago)
- A curated list of retrieval-augmented generation (RAG) in large language models (★ 378 · Dec 1, 2025 · Updated 3 months ago)
- [ACL 2025] Official PyTorch implementation of the paper: Condor: Enhance LLM Alignment with Knowledge-Driven Data Synthesis and Refinement (★ 39 · May 28, 2025 · Updated 9 months ago)
- xVerify: Efficient Answer Verifier for Reasoning Model Evaluations (★ 146 · Nov 13, 2025 · Updated 4 months ago)
- [ICML 2025] Programming Every Example: Lifting Pre-training Data Quality Like Experts at Scale (★ 267 · Jul 8, 2025 · Updated 8 months ago)
- (★ 1,118 · Jan 10, 2026 · Updated 2 months ago)
- Unified Efficient Fine-Tuning of 100+ LLMs & VLMs (ACL 2024) (★ 69,106 · Updated this week)
- Reproduce R1 Zero on Logic Puzzle (★ 2,442 · Mar 20, 2025 · Updated last year)
- Reproducing R1 for Code with Reliable Rewards (★ 12 · Apr 9, 2025 · Updated 11 months ago)
- This is the official implementation of ScaleBiO: Scalable Bilevel Optimization for LLM Data Reweighting (★ 24 · Jul 30, 2024 · Updated last year)
- From Chain-of-Thought prompting to OpenAI o1 and DeepSeek-R1 (★ 3,575 · May 7, 2025 · Updated 10 months ago)