sail-sg / regmix
[ICLR 2025] 🧬 RegMix: Data Mixture as Regression for Language Model Pre-training (Spotlight)
★185 · Feb 17, 2025 · Updated 11 months ago
Alternatives and similar repositories for regmix
Users interested in regmix are comparing it to the libraries listed below.
- [NeurIPS 2024] Scaling Laws with Vocabulary: Larger Models Deserve Larger Vocabularies https://arxiv.org/abs/2407.13623 · ★89 · Sep 26, 2024 · Updated last year
- ★109 · Jul 15, 2025 · Updated 6 months ago
- Official code repository for [AutoScale: Scale-Aware Data Mixing for Pre-Training LLMs], published as a conference paper at COLM 2025 · ★13 · Aug 8, 2025 · Updated 6 months ago
- Official GitHub repo for the paper "Compression Represents Intelligence Linearly" [COLM 2024] · ★147 · Sep 20, 2024 · Updated last year
- Official repository for MATES: Model-Aware Data Selection for Efficient Pretraining with Data Influence Models [NeurIPS 2024] · ★79 · Nov 14, 2024 · Updated last year
- ★23 · Dec 18, 2024 · Updated last year
- The official repository for SkyLadder: Better and Faster Pretraining via Context Window Scheduling · ★42 · Dec 29, 2025 · Updated last month
- [ICML 2024] Selecting High-Quality Data for Training Language Models · ★201 · Dec 8, 2025 · Updated 2 months ago
- DataComp for Language Models · ★1,416 · Sep 9, 2025 · Updated 5 months ago
- A Framework for Evaluating AI Agent Safety in Realistic Environments · ★30 · Oct 2, 2025 · Updated 4 months ago
- ★20 · Apr 16, 2025 · Updated 9 months ago
- The official repository of "Unnatural Language Are Not Bugs but Features for LLMs" · ★24 · May 20, 2025 · Updated 8 months ago
- ★322 · Jul 25, 2024 · Updated last year
- Astraios: Parameter-Efficient Instruction Tuning Code Language Models · ★63 · Apr 10, 2024 · Updated last year
- [ICML 2025] Programming Every Example: Lifting Pre-training Data Quality Like Experts at Scale · ★266 · Jul 8, 2025 · Updated 7 months ago
- Codebase for Instruction Following without Instruction Tuning · ★36 · Sep 24, 2024 · Updated last year
- The Good, The Bad, and The Greedy: Evaluation of LLMs Should Not Ignore Non-Determinism · ★30 · Jul 17, 2024 · Updated last year
- Deita: Data-Efficient Instruction Tuning for Alignment [ICLR 2024] · ★588 · Dec 9, 2024 · Updated last year
- [ACL 2024] The official codebase for the paper "Self-Distillation Bridges Distribution Gap in Language Model Fine-tuning" · ★146 · Nov 2, 2024 · Updated last year
- [NAACL 2025] Source code for MMEvalPro, a more trustworthy and efficient benchmark for evaluating LMMs · ★24 · Sep 26, 2024 · Updated last year
- Multi-agent synthetic data generation pipeline capable of generating and validating long-horizon terminal/coding tasks for RL training · ★51 · Jul 28, 2025 · Updated 6 months ago
- [ICML 2024] Agent Smith: A Single Image Can Jailbreak One Million Multimodal LLM Agents Exponentially Fast · ★118 · Mar 26, 2024 · Updated last year
- Provides a minimal implementation to extract FLAN datasets for further processing · ★11 · Feb 1, 2023 · Updated 3 years ago
- PyTorch implementation of DoReMi, a method for optimizing the data mixture weights in language modeling datasets · ★350 · Dec 26, 2023 · Updated 2 years ago
- Improved Few-Shot Jailbreaking Can Circumvent Aligned Language Models and Their Defenses (NeurIPS 2024) · ★65 · Jan 11, 2025 · Updated last year
- ★56 · May 28, 2024 · Updated last year
- A highly capable 2.4B lightweight LLM using only 1T pre-training data, with all details · ★223 · Jul 25, 2025 · Updated 6 months ago
- RAG-RewardBench: Benchmarking Reward Models in Retrieval Augmented Generation for Preference Alignment · ★16 · Dec 19, 2024 · Updated last year
- ★12 · Jun 30, 2024 · Updated last year
- ★15 · Mar 12, 2024 · Updated last year
- ★78 · Nov 19, 2024 · Updated last year
- Repo for Rho-1: Token-level Data Selection & Selective Pretraining of LLMs · ★459 · Apr 18, 2024 · Updated last year
- The official implementation of the paper "Does Federated Learning Really Need Backpropagation?" · ★23 · Feb 9, 2023 · Updated 3 years ago
- Language models scale reliably with over-training and on downstream tasks · ★99 · Apr 2, 2024 · Updated last year
- [NeurIPS 2024] The official implementation of the paper "Chain of Preference Optimization: Improving Chain-of-Thought Reasoning in LLMs" · ★134 · Mar 21, 2025 · Updated 10 months ago
- [ICLR'25] Data and code for our paper "Why Does the Effective Context Length of LLMs Fall Short?" · ★78 · Nov 25, 2024 · Updated last year
- A lightweight script for processing HTML pages to Markdown format, with support for code blocks · ★82 · Apr 14, 2024 · Updated last year
- Sailor2: Sailing in South-East Asia with Inclusive Multilingual LLMs · ★71 · Mar 21, 2025 · Updated 10 months ago
- [COLM 2025] An Open Math Pre-training Dataset with 370B Tokens · ★109 · Apr 4, 2025 · Updated 10 months ago