pengr/LLM-Synthetic-Data

A live reading list for LLM data synthesis (Updated to July, 2025).

☆491

Alternatives and similar repositories for LLM-Synthetic-Data

Users who are interested in LLM-Synthetic-Data are comparing it to the repositories listed below.

pengr / DataMan
Our code for ICLR'25 paper "DataMan: Data Manager for Pre-training Large Language Models".
☆130 • Feb 7, 2026 • Updated 5 months ago
wasiahmad / Awesome-LLM-Synthetic-Data
A reading list on LLM based Synthetic Data Generation 🔥
☆1,547 • Jun 5, 2025 • Updated last year
pengr / Contrastive_AutoEval
Our code for ICCV'23 paper "CAME: Contrastive Automated Model Evaluation".
☆28 • Jan 25, 2024 • Updated 2 years ago
pengr / IKD-MMT
Our code for EMNLP'22 Oral paper "Distill the Image to Nowhere: Inversion Knowledge Distillation for Multimodal Machine Translation".
☆31 • Jan 25, 2024 • Updated 2 years ago
Zhen-Tan-dmml / LLM4Annotation
☆647 • Jul 29, 2025 • Updated last year
mghiasvand1 / Awesome-VLM-Synthetic-Data
☆39 • Aug 29, 2025 • Updated 11 months ago
YijianLiu / hgbi
☆15 • Jun 26, 2023 • Updated 3 years ago
pengr / Energy_AutoEval
Our code for ICLR'24 paper "Energy-based Automated Model Evaluation".
☆24 • Feb 13, 2025 • Updated last year
ZitongYang / Synthetic_Continued_Pretraining
Code implementation of synthetic continued pretraining
☆162 • Jan 6, 2025 • Updated last year
magpie-align / magpie
[ICLR 2025] Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing. Your efficient and high-quality synthetic data …
☆875 • Mar 17, 2025 • Updated last year
pengr / learn-claude-code
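Learn Claude Code: a complete set of source-based technical analysis documents, with 15 chapters analyzing core mechanisms in depth, including the Agent Loop, the tool system, and the permission system
☆27 • Mar 31, 2026 • Updated 3 months ago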
OpenRLHF / OpenRLHF
An Easy-to-use, Scalable and High-performance Agentic RL Framework based on Ray (PPO & DAPO & REINFORCE++ & VLM & TIS & vLLM & Ray & Asy…
☆9,855 • Jul 14, 2026 • Updated 2 weeks ago
InternLM / Condor
[ACL 2025] An official PyTorch implementation of the paper: Condor: Enhance LLM Alignment with Knowledge-Driven Data Synthesis and Refinement
☆40 • May 28, 2025 • Updated last year
alon-albalak / data-selection-survey
A Survey on Data Selection for Language Models
☆261 • Apr 29, 2025 • Updated last year
yiming-zh / DPL
☆17 • Jul 6, 2023 • Updated 3 years ago
verl-project / verl
verl/HybridFlow: A Flexible and Efficient RL Post-Training Framework
☆22,699 • Updated this week
OpenDataBox / awesome-data-llm
Official Repository of "LLM × DATA" Survey Paper
☆807 • Jun 15, 2026 • Updated last month
TsinghuaC3I / Awesome-RL-for-LRMs
A Survey of Reinforcement Learning for Large Reasoning Models
☆2,470 • Nov 9, 2025 • Updated 8 months ago
awesome-rag / awesome-rag
Awesome-RAG: Collect typical RAG papers and systems.
☆462 • Aug 4, 2025 • Updated 11 months ago
jaehunjung1 / prismatic-synthesis
☆28 • May 27, 2025 • Updated last year
HowieHwong / DataGen
[ICLR'25] DataGen: Unified Synthetic Dataset Generation via Large Language Models
☆69 • Mar 8, 2025 • Updated last year
gydpku / Data_Synthesis_RL
☆123 • May 26, 2025 • Updated last year
hkust-nlp / simpleRL-reason
Simple RL training for reasoning
☆3,871 • Dec 23, 2025 • Updated 7 months ago
FlagOpen / Infinity-Instruct
☆51 • Jun 14, 2024 • Updated 2 years ago
hijkzzz / Awesome-LLM-Strawberry
A collection of LLM papers, blogs, and projects, with a focus on OpenAI o1 🍓 and reasoning techniques.
☆6,893 • Dec 17, 2025 • Updated 7 months ago
JLZhong23 / awesome-reward-models
☆170 • May 28, 2025 • Updated last year
coree / awesome-rag
A curated list of retrieval-augmented generation (RAG) in large language models
☆429 • Dec 1, 2025 • Updated 7 months ago
huggingface / Math-Verify
☆1,172 • Jan 10, 2026 • Updated 6 months ago
areal-project / AReaL
The RL Bridge for LLM-based Agent Applications. Made Simple & Flexible.
☆5,612 • Updated this week
Qihoo360 / Light-R1
☆765 • Dec 23, 2025 • Updated 7 months ago
cxcscmu / MATES
Official repository for MATES: Model-Aware Data Selection for Efficient Pretraining with Data Influence Models [NeurIPS 2024]
☆80 • Nov 14, 2024 • Updated last year
Unakar / Logic-RL
Reproduce R1 Zero on Logic Puzzle
☆2,450 • Mar 20, 2025 • Updated last year
tianyi-lab / Cherry_LLM
[NAACL'24] Self-data filtering of LLM instruction-tuning data using a novel perplexity-based difficulty score, without using any other mo…
☆417 • Jun 25, 2025 • Updated last year
datadreamer-dev / DataDreamer
DataDreamer: Prompt. Generate Synthetic Data. Train & Align Models. 🤖💤
☆1,115 • Feb 2, 2025 • Updated last year
huggingface / cosmopedia
☆573 • Nov 20, 2024 • Updated last year
yyDing1 / ScaleQuest
[ACL 2025] We introduce ScaleQuest, a scalable, novel and cost-effective data synthesis method to unleash the reasoning capability of LLM…
☆69 • Oct 27, 2024 • Updated last year
leopoldwhite / GraphDancer
GraphDancer: Training LLMs to Explore and Reason over Graphs via Curriculum Reinforcement Learning
☆20 • May 25, 2026 • Updated 2 months ago
hiyouga / LlamaFactory
Unified Efficient Fine-Tuning of 100+ LLMs & VLMs (ACL 2024)
☆73,582 • Updated this week
yikee / ScienceMeter
ScienceMeter: Tracking Scientific Knowledge Updates in Language Models, COLM 2026
☆17 • Jun 28, 2025 • Updated last year