[ICLR 2025] 🧬 RegMix: Data Mixture as Regression for Language Model Pre-training (Spotlight)
☆190 · Feb 17, 2025 · Updated last year
Alternatives and similar repositories for regmix
Users who are interested in regmix are comparing it to the libraries listed below.
- [NeurIPS-2024] 📈 Scaling Laws with Vocabulary: Larger Models Deserve Larger Vocabularies https://arxiv.org/abs/2407.13623 ☆118 · Sep 26, 2024 · Updated last year
- ☆110 · Jul 15, 2025 · Updated 9 months ago
- Official repository for MATES: Model-Aware Data Selection for Efficient Pretraining with Data Influence Models [NeurIPS 2024] ☆79 · Nov 14, 2024 · Updated last year
- Official github repo for the paper "Compression Represents Intelligence Linearly" [COLM 2024] ☆147 · Sep 20, 2024 · Updated last year
- Official Code Repository for [AutoScale: Scale-Aware Data Mixing for Pre-Training LLMs] Published as a conference paper at **COLM 2025*… ☆14 · Aug 8, 2025 · Updated 8 months ago
- The official repository for SkyLadder: Better and Faster Pretraining via Context Window Scheduling ☆42 · Dec 29, 2025 · Updated 4 months ago
- ☆22 · Dec 18, 2024 · Updated last year
- ☆20 · Apr 16, 2025 · Updated last year
- The official repository of 'Unnatural Languages Are Not Bugs but Features for LLMs' ☆24 · May 20, 2025 · Updated 11 months ago
- [ICML 2024] Selecting High-Quality Data for Training Language Models ☆202 · Dec 8, 2025 · Updated 4 months ago
- DataComp for Language Models ☆1,439 · Sep 9, 2025 · Updated 7 months ago
- PyTorch implementation of DoReMi, a method for optimizing the data mixture weights in language modeling datasets ☆352 · Dec 26, 2023 · Updated 2 years ago
- Astraios: Parameter-Efficient Instruction Tuning Code Language Models ☆63 · Apr 10, 2024 · Updated 2 years ago
- [ICML 2025] Programming Every Example: Lifting Pre-training Data Quality Like Experts at Scale ☆269 · Jul 8, 2025 · Updated 9 months ago
- Multi-agent synthetic data generation pipeline capable of generating and validating long-horizon terminal/coding tasks for RL training ☆62 · Jul 28, 2025 · Updated 9 months ago
- Official implementation of Bootstrapping Language Models via DPO Implicit Rewards ☆47 · Apr 15, 2025 · Updated last year
- Organize the Web: Constructing Domains Enhances Pre-Training Data Curation ☆80 · May 2, 2025 · Updated last year
- A lightweight script for converting HTML pages to Markdown format with support for code blocks ☆82 · Apr 14, 2024 · Updated 2 years ago
- ☆15 · Mar 12, 2024 · Updated 2 years ago
- [ICML 2024] Agent Smith: A Single Image Can Jailbreak One Million Multimodal LLM Agents Exponentially Fast ☆119 · Mar 26, 2024 · Updated 2 years ago
- The official implementation of the paper "Does Federated Learning Really Need Backpropagation?" ☆23 · Feb 9, 2023 · Updated 3 years ago
- Improved Few-Shot Jailbreaking Can Circumvent Aligned Language Models and Their Defenses (NeurIPS 2024) ☆65 · Jan 11, 2025 · Updated last year
- ☆330 · Jul 25, 2024 · Updated last year
- A Framework for Evaluating AI Agent Safety in Realistic Environments ☆32 · Oct 2, 2025 · Updated 7 months ago
- 🔱 Sailor2: Sailing in South-East Asia with Inclusive Multilingual LLMs ☆71 · Mar 21, 2025 · Updated last year
- Intriguing Properties of Data Attribution on Diffusion Models (ICLR 2024) ☆39 · Jan 23, 2024 · Updated 2 years ago
- Language models scale reliably with over-training and on downstream tasks ☆101 · Apr 2, 2024 · Updated 2 years ago
- Deita: Data-Efficient Instruction Tuning for Alignment [ICLR2024] ☆593 · Dec 9, 2024 · Updated last year
- Collection of training data management explorations for large language models ☆339 · Aug 2, 2024 · Updated last year
- Provides a minimal implementation to extract FLAN datasets for further processing ☆11 · Feb 1, 2023 · Updated 3 years ago
- Explore what LLMs are really learning over SFT ☆28 · Mar 30, 2024 · Updated 2 years ago
- [ACL 2024] The official codebase for the paper "Self-Distillation Bridges Distribution Gap in Language Model Fine-tuning". ☆166 · Nov 2, 2024 · Updated last year
- [EMNLP-2024] ⛵️ Sailor: Open Language Models for South-East Asia ☆138 · Dec 21, 2024 · Updated last year
- Simple and scalable tools for data-driven pretraining data selection. ☆29 · Jun 9, 2025 · Updated 10 months ago
- Repo for Rho-1: Token-level Data Selection & Selective Pretraining of LLMs. ☆465 · Apr 18, 2024 · Updated 2 years ago
- ☆12 · Jun 30, 2024 · Updated last year
- [NeurIPS 2024] The official implementation of paper: Chain of Preference Optimization: Improving Chain-of-Thought Reasoning in LLMs. ☆136 · Mar 21, 2025 · Updated last year
- 🚢 Data Toolkit for Sailor Language Models ☆96 · Feb 24, 2025 · Updated last year
- Download, parse, and filter data from Phil Papers. Data-ready for The-Pile. ☆20 · Aug 28, 2023 · Updated 2 years ago