ZitongYang/Synthetic_Continued_Pretraining


Code implementation of synthetic continued pretraining

☆162

Alternatives and similar repositories for Synthetic_Continued_Pretraining

Users who are interested in Synthetic_Continued_Pretraining are comparing it to the repositories listed below.

huggingface / cosmopedia
☆573 · Nov 20, 2024 · Updated last year
liziniu / GEM
Code for Paper (Preserving Diversity in Supervised Fine-tuning of Large Language Models)
☆58 · May 12, 2025 · Updated last year
nlp-anonymous-happy / anonymous-KG-guided-NLP
☆15 · Oct 26, 2021 · Updated 4 years ago
zhang-wei-chao / DC-PDD
This repository presents the original implementation of Pretraining Data Detection for Large Language Models: A Divergence-based Calibrat…
☆23 · May 21, 2025 · Updated last year
pengr / LLM-Synthetic-Data
A live reading list for LLM data synthesis (Updated to July, 2025).
☆491 · Apr 9, 2026 · Updated 3 months ago
princeton-nlp / QuRating
[ICML 2024] Selecting High-Quality Data for Training Language Models
☆204 · Dec 8, 2025 · Updated 7 months ago
yaodongyu / ProjNorm
Predicting Out-of-Distribution Error with the Projection Norm
☆19 · Jul 27, 2022 · Updated 4 years ago
yyDing1 / ScaleQuest
[ACL 2025] We introduce ScaleQuest, a scalable, novel and cost-effective data synthesis method to unleash the reasoning capability of LLM…
☆69 · Oct 27, 2024 · Updated last year
yuanyehome / PALT
This is the source code of our paper PALT in EMNLP 2022.
☆12 · Nov 19, 2022 · Updated 3 years ago
tonghuikang / automatic-prompt-engineer
Generates and optimizes Haiku system and user prompts for classification
☆15 · Oct 27, 2025 · Updated 9 months ago
wasiahmad / Awesome-LLM-Synthetic-Data
A reading list on LLM based Synthetic Data Generation 🔥
☆1,547 · Jun 5, 2025 · Updated last year
yuhui-zh15 / C3
Official implementation of "Connect, Collapse, Corrupt: Learning Cross-Modal Tasks with Uni-Modal Data" (ICLR 2024)
☆36 · Oct 16, 2024 · Updated last year
HowieHwong / DataGen
[ICLR'25] DataGen: Unified Synthetic Dataset Generation via Large Language Models
☆69 · Mar 8, 2025 · Updated last year
hkust-nlp / deita
Deita: Data-Efficient Instruction Tuning for Alignment [ICLR 2024]
☆600 · Dec 9, 2024 · Updated last year
MadryLab / datamodels-data
Data for "Datamodels: Predicting Predictions with Training Data"
☆97 · May 25, 2023 · Updated 3 years ago
yaolu-zjut / DDInterpreter
☆15 · May 28, 2024 · Updated 2 years ago
MadryLab / journey-TRAK
Code for the paper "The Journey, Not the Destination: How Data Guides Diffusion Models"
☆26 · Dec 12, 2023 · Updated 2 years ago
meowpass / FollowComplexInstruction
Official implementation of the paper "From Complex to Simple: Enhancing Multi-Constraint Complex Instruction Following Ability of Large L…
☆55 · Jun 24, 2024 · Updated 2 years ago
mukhal / intrinsic-source-citation
[COLM '24] Source-Aware Training Enables Knowledge Attribution in Language Models
☆19 · Apr 1, 2025 · Updated last year
rtaori / data_feedback
Code for the paper "Data Feedback Loops: Model-driven Amplification of Dataset Biases"
☆18 · Sep 9, 2022 · Updated 3 years ago
LHL3341 / MetaLadder
MetaLadder: Ascending Mathematical Solution Quality via Analogical-Problem Reasoning Transfer (EMNLP 2025)
☆12 · Apr 18, 2025 · Updated last year
Leey21 / A-Data-Centric-Study
☆18 · Mar 2, 2026 · Updated 4 months ago
XiangLi1999 / AutoBencher
☆33 · Jul 11, 2024 · Updated 2 years ago
Victorwz / tod_as_nlg
Official implementation of SIGIR 2022 Paper "Task-Oriented Dialogue System as Natural Language Generation".
☆14 · Apr 6, 2022 · Updated 4 years ago
zhengxxn / UDA-KNN
☆12 · Aug 31, 2021 · Updated 4 years ago
GAIR-NLP / OctoThinker
Revisiting Mid-training in the Era of Reinforcement Learning Scaling
☆189 · Jul 23, 2025 · Updated last year
dual-view-molecule-pretraining / dmp
☆11 · Jun 4, 2021 · Updated 5 years ago
haotiansun14 / BBox-Adapter
Lightweight Adapting for Black-Box Large Language Models
☆26 · Feb 15, 2024 · Updated 2 years ago
OFA-Sys / InsTag
InsTag: A Tool for Data Analysis in LLM Supervised Fine-tuning
☆288 · Aug 20, 2023 · Updated 2 years ago
multimodal-art-projection / MAP-NEO
☆986 · Feb 7, 2025 · Updated last year
hrtan / MoSo
[NeurIPS-2023] The PyTorch Implementation of MoSo. The algorithms are based on our paper: "Data Pruning via Moving-one-Sample-out". MoSo …
☆10 · May 21, 2026 · Updated 2 months ago
orrzohar / LOVM
[NeurIPS 2023] Official PyTorch code for LOVM: Language-Only Vision Model Selection
☆21 · Feb 3, 2024 · Updated 2 years ago
rioyokotalab / swallow-code-math
Ongoing research project for code & math LLMs
☆32 · Jul 4, 2025 · Updated last year
Graph-COM / Neural_Higher-order_Pattern_Prediction
☆15 · Mar 4, 2022 · Updated 4 years ago
feiyang-k / AutoScale
Official Code Repository for [AutoScale📈: Scale-Aware Data Mixing for Pre-Training LLMs] Published as a conference paper at **COLM 2025*…
☆14 · Aug 8, 2025 · Updated 11 months ago
microsoft / rho
Repo for Rho-1: Token-level Data Selection & Selective Pretraining of LLMs.
☆471 · Apr 18, 2024 · Updated 2 years ago
wanglg20 / Pose_Guided_Diffusion
Reproduction of the paper "Consistent View Synthesis with Pose-Guided Diffusion Models"
☆14 · Aug 2, 2023 · Updated 2 years ago
huangJC0429 / Mid-GCN
This is the code for the paper "Robust Mid-Pass Filtering Graph Convolutional Networks" (accepted by WWW 2023).
☆13 · Feb 17, 2023 · Updated 3 years ago
vasishtduddu / GraphLeaks
Code for the paper "Quantifying Privacy Leakage in Graph Embedding" published in MobiQuitous 2020
☆18 · Nov 11, 2021 · Updated 4 years ago