[ICLR 2025] RegMix: Data Mixture as Regression for Language Model Pre-training (Spotlight)
☆186 · Feb 17, 2025 · Updated last year
Alternatives and similar repositories for regmix
Users that are interested in regmix are comparing it to the libraries listed below.
- [NeurIPS-2024] Scaling Laws with Vocabulary: Larger Models Deserve Larger Vocabularies https://arxiv.org/abs/2407.13623 ☆89 · Sep 26, 2024 · Updated last year
- ☆109 · Jul 15, 2025 · Updated 8 months ago
- Official repository for MATES: Model-Aware Data Selection for Efficient Pretraining with Data Influence Models [NeurIPS 2024] ☆79 · Nov 14, 2024 · Updated last year
- Official github repo for the paper "Compression Represents Intelligence Linearly" [COLM 2024] ☆147 · Sep 20, 2024 · Updated last year
- Official Code Repository for [AutoScale: Scale-Aware Data Mixing for Pre-Training LLMs] Published as a conference paper at **COLM 2025*… ☆13 · Aug 8, 2025 · Updated 7 months ago
- The official repository for SkyLadder: Better and Faster Pretraining via Context Window Scheduling ☆42 · Dec 29, 2025 · Updated 2 months ago
- ☆23 · Dec 18, 2024 · Updated last year
- ☆20 · Apr 16, 2025 · Updated 11 months ago
- The official repository of 'Unnatural Language Are Not Bugs but Features for LLMs' ☆24 · May 20, 2025 · Updated 10 months ago
- [ICML 2024] Selecting High-Quality Data for Training Language Models ☆201 · Dec 8, 2025 · Updated 3 months ago
- DataComp for Language Models ☆1,427 · Sep 9, 2025 · Updated 6 months ago
- Pytorch implementation of DoReMi, a method for optimizing the data mixture weights in language modeling datasets ☆351 · Dec 26, 2023 · Updated 2 years ago
- Astraios: Parameter-Efficient Instruction Tuning Code Language Models ☆63 · Apr 10, 2024 · Updated last year
- [ICML 2025] Programming Every Example: Lifting Pre-training Data Quality Like Experts at Scale ☆267 · Jul 8, 2025 · Updated 8 months ago
- Multi-agent synthetic data generation pipeline capable of generating and validating long horizon terminal/coding tasks for RL training ☆58 · Jul 28, 2025 · Updated 7 months ago
- Official implementation of Bootstrapping Language Models via DPO Implicit Rewards ☆47 · Apr 15, 2025 · Updated 11 months ago
- Organize the Web: Constructing Domains Enhances Pre-Training Data Curation ☆79 · May 2, 2025 · Updated 10 months ago
- A lightweight script for processing HTML page to markdown format with support for code blocks ☆82 · Apr 14, 2024 · Updated last year
- ☆15 · Mar 12, 2024 · Updated 2 years ago
- [ICML 2024] Agent Smith: A Single Image Can Jailbreak One Million Multimodal LLM Agents Exponentially Fast ☆118 · Mar 26, 2024 · Updated 2 years ago
- The official implementation of the paper "Does Federated Learning Really Need Backpropagation?" ☆23 · Feb 9, 2023 · Updated 3 years ago
- ☆327 · Jul 25, 2024 · Updated last year
- Improved Few-Shot Jailbreaking Can Circumvent Aligned Language Models and Their Defenses (NeurIPS 2024) ☆65 · Jan 11, 2025 · Updated last year
- Intriguing Properties of Data Attribution on Diffusion Models (ICLR 2024) ☆38 · Jan 23, 2024 · Updated 2 years ago
- A Framework for Evaluating AI Agent Safety in Realistic Environments ☆30 · Oct 2, 2025 · Updated 5 months ago
- Sailor2: Sailing in South-East Asia with Inclusive Multilingual LLMs ☆71 · Mar 21, 2025 · Updated last year
- Language models scale reliably with over-training and on downstream tasks ☆101 · Apr 2, 2024 · Updated last year
- Deita: Data-Efficient Instruction Tuning for Alignment [ICLR2024] ☆591 · Dec 9, 2024 · Updated last year
- Collection of training data management explorations for large language models ☆337 · Aug 2, 2024 · Updated last year
- Provides a minimal implementation to extract FLAN datasets for further processing ☆11 · Feb 1, 2023 · Updated 3 years ago
- Explore what LLMs are really learning over SFT ☆28 · Mar 30, 2024 · Updated last year
- [EMNLP-2024] Sailor: Open Language Models for South-East Asia ☆138 · Dec 21, 2024 · Updated last year
- [ACL 2024] The official codebase for the paper "Self-Distillation Bridges Distribution Gap in Language Model Fine-tuning". ☆162 · Nov 2, 2024 · Updated last year
- Simple and scalable tools for data-driven pretraining data selection. ☆29 · Jun 9, 2025 · Updated 9 months ago
- Repo for Rho-1: Token-level Data Selection & Selective Pretraining of LLMs. ☆464 · Apr 18, 2024 · Updated last year
- ☆12 · Jun 30, 2024 · Updated last year
- [NeurIPS 2024] The official implementation of paper: Chain of Preference Optimization: Improving Chain-of-Thought Reasoning in LLMs. ☆135 · Mar 21, 2025 · Updated last year
- Data Toolkit for Sailor Language Models ☆96 · Feb 24, 2025 · Updated last year
- Download, parse, and filter data from Phil Papers. Data-ready for The-Pile. ☆19 · Aug 28, 2023 · Updated 2 years ago