sail-sg/regmix

[ICLR 2025] 🧬 RegMix: Data Mixture as Regression for Language Model Pre-training (Spotlight)

☆184

Alternatives and similar repositories for regmix

Users who are interested in regmix compare it to the repositories listed below.

sail-sg / scaling-with-vocab
[NeurIPS-2024] 📈 Scaling Laws with Vocabulary: Larger Models Deserve Larger Vocabularies https://arxiv.org/abs/2407.13623
☆89 · Sep 26, 2024 · Updated last year
yegcjs / mixinglaws
☆109 · Jul 15, 2025 · Updated 7 months ago
feiyang-k / AutoScale
Official Code Repository for [AutoScale📈: Scale-Aware Data Mixing for Pre-Training LLMs] Published as a conference paper at **COLM 2025*…
☆13 · Aug 8, 2025 · Updated 6 months ago
hkust-nlp / llm-compression-intelligence
Official github repo for the paper "Compression Represents Intelligence Linearly" [COLM 2024]
☆147 · Sep 20, 2024 · Updated last year
cxcscmu / MATES
Official repository for MATES: Model-Aware Data Selection for Efficient Pretraining with Data Influence Models [NeurIPS 2024]
☆79 · Nov 14, 2024 · Updated last year
LLM360 / TxT360
☆23 · Dec 18, 2024 · Updated last year
sail-sg / SkyLadder
The official repository for SkyLadder: Better and Faster Pretraining via Context Window Scheduling
☆42 · Dec 29, 2025 · Updated 2 months ago
sail-sg / dice
Official implementation of Bootstrapping Language Models via DPO Implicit Rewards
☆47 · Apr 15, 2025 · Updated 10 months ago
princeton-nlp / QuRating
[ICML 2024] Selecting High-Quality Data for Training Language Models
☆200 · Dec 8, 2025 · Updated 2 months ago
mlfoundations / dclm
DataComp for Language Models
☆1,421 · Sep 9, 2025 · Updated 5 months ago
sani903 / OpenAgentSafety
A Framework for Evaluating AI Agent Safety in Realistic Environments
☆30 · Oct 2, 2025 · Updated 5 months ago
sail-sg / ActivePRM
☆20 · Apr 16, 2025 · Updated 10 months ago
real-absolute-AI / Unnatural_Language
The official repository of 'Unnatural Language Are Not Bugs but Features for LLMs'
☆24 · May 20, 2025 · Updated 9 months ago
QwenLM / AutoIF
☆325 · Jul 25, 2024 · Updated last year
bigcode-project / astraios
Astraios: Parameter-Efficient Instruction Tuning Code Language Models
☆63 · Apr 10, 2024 · Updated last year
GAIR-NLP / ProX
[ICML 2025] Programming Every Example: Lifting Pre-training Data Quality Like Experts at Scale
☆266 · Jul 8, 2025 · Updated 7 months ago
john-hewitt / implicit-ins
Codebase for Instruction Following without Instruction Tuning
☆36 · Sep 24, 2024 · Updated last year
Yifan-Song793 / GoodBadGreedy
The Good, The Bad, and The Greedy: Evaluation of LLMs Should Not Ignore Non-Determinism
☆30 · Jul 17, 2024 · Updated last year
hkust-nlp / deita
Deita: Data-Efficient Instruction Tuning for Alignment [ICLR2024]
☆589 · Dec 9, 2024 · Updated last year
formll / resolving-scaling-law-discrepancies
☆20 · Nov 4, 2025 · Updated 4 months ago
chenllliang / MMEvalPro
[NAACL 2025] Source code for MMEvalPro, a more trustworthy and efficient benchmark for evaluating LMMs
☆24 · Sep 26, 2024 · Updated last year
sail-sg / Agent-Smith
[ICML 2024] Agent Smith: A Single Image Can Jailbreak One Million Multimodal LLM Agents Exponentially Fast
☆118 · Mar 26, 2024 · Updated last year
Muennighoff / FLAN
Provides a minimal implementation to extract FLAN datasets for further processing
☆11 · Feb 1, 2023 · Updated 3 years ago
sangmichaelxie / doremi
Pytorch implementation of DoReMi, a method for optimizing the data mixture weights in language modeling datasets
☆351 · Dec 26, 2023 · Updated 2 years ago
sail-sg / sdft
[ACL 2024] The official codebase for the paper "Self-Distillation Bridges Distribution Gap in Language Model Fine-tuning".
☆159 · Nov 2, 2024 · Updated last year
Danau5tin / tbench-agentic-data-pipeline
Multi-agent synthetic data generation pipeline capable of generating and validating long horizon terminal/coding tasks for RL training
☆55 · Jul 28, 2025 · Updated 7 months ago
sail-sg / I-FSJ
Improved Few-Shot Jailbreaking Can Circumvent Aligned Language Models and Their Defenses (NeurIPS 2024)
☆65 · Jan 11, 2025 · Updated last year
CodeEditorBench / CodeEditorBench
☆56 · May 28, 2024 · Updated last year
RUC-GSAI / YuLan-Mini
A highly capable 2.4B lightweight LLM using only 1T pre-training data with all details.
☆224 · Jul 25, 2025 · Updated 7 months ago
sail-sg / lm-random-memory-access
☆15 · Mar 12, 2024 · Updated last year
jinzhuoran / RAG-RewardBench
RAG-RewardBench: Benchmarking Reward Models in Retrieval Augmented Generation for Preference Alignment
☆16 · Dec 19, 2024 · Updated last year
xypan0 / G-DIG
☆12 · Jun 30, 2024 · Updated last year
thu-wyz / inference_scaling
☆79 · Nov 19, 2024 · Updated last year
microsoft / rho
Repo for Rho-1: Token-level Data Selection & Selective Pretraining of LLMs.
☆459 · Apr 18, 2024 · Updated last year
FengHZ / BAFFLE
The official implement of paper "Does Federated Learning Really Need Backpropagation?"
☆23 · Feb 9, 2023 · Updated 3 years ago
mlfoundations / scaling
Language models scale reliably with over-training and on downstream tasks
☆100 · Apr 2, 2024 · Updated last year
sail-sg / CPO
[NeurIPS 2024] The official implementation of paper: Chain of Preference Optimization: Improving Chain-of-Thought Reasoning in LLMs.
☆134 · Mar 21, 2025 · Updated 11 months ago
HKUNLP / STRING
[ICLR'25] Data and code for our paper "Why Does the Effective Context Length of LLMs Fall Short?"
☆78 · Nov 25, 2024 · Updated last year
haonan3 / AnchorContext
AnchorAttention: Improved attention for LLMs long-context training
☆214 · Jan 15, 2025 · Updated last year