cxcscmu/MATES

Official repository for MATES: Model-Aware Data Selection for Efficient Pretraining with Data Influence Models [NeurIPS 2024]

☆80

Alternatives and similar repositories for MATES

Users who are interested in MATES are comparing it to the repositories listed below.

princeton-nlp / QuRating
[ICML 2024] Selecting High-Quality Data for Training Language Models
☆204 · Dec 8, 2025 · Updated 7 months ago
sail-sg / regmix
[ICLR 2025] 🧬 RegMix: Data Mixture as Regression for Language Model Pre-training (Spotlight)
☆194 · Feb 17, 2025 · Updated last year
CodeCreator / WebOrganizer
Organize the Web: Constructing Domains Enhances Pre-Training Data Curation
☆83 · May 2, 2025 · Updated last year
feiyang-k / AutoScale
Official Code Repository for [AutoScale📈: Scale-Aware Data Mixing for Pre-Training LLMs] Published as a conference paper at **COLM 2025*…
☆14 · Aug 8, 2025 · Updated 11 months ago
alon-albalak / data-selection-survey
A Survey on Data Selection for Language Models
☆260 · Apr 29, 2025 · Updated last year
alon-albalak / FLAD
Few-shot Learning with Auxiliary Data
☆31 · Dec 8, 2023 · Updated 2 years ago
IBM / ColPret
Efficient Scaling laws and collaborative pretraining.
☆23 · Updated this week
yidingjiang / ado
The repository contains code for Adaptive Data Optimization
☆37 · Dec 9, 2024 · Updated last year
huawei-lin / RapidIn
RapidIn: Scalable Influence Estimation for Large Language Models (LLMs). The implementation for paper "Token-wise Influential Training Da…
☆22 · Mar 10, 2026 · Updated 4 months ago
logix-project / logix
AI Logging for Interpretability and Explainability🔬
☆147 · Jun 7, 2024 · Updated 2 years ago
HazyResearch / aioli
Aioli: A unified optimization framework for language model data mixing
☆33 · Jan 17, 2025 · Updated last year
p-lambda / dsir
DSIR large-scale data selection framework for language model training
☆275 · Apr 7, 2024 · Updated 2 years ago
itayle / diverse-demonstrations
Diverse Demonstrations Improve In-context Compositional Generalization
☆12 · Jul 7, 2023 · Updated 3 years ago
microsoft / rho
Repo for Rho-1: Token-level Data Selection & Selective Pretraining of LLMs.
☆470 · Apr 18, 2024 · Updated 2 years ago
daeveraert / gradient-information-optimization
Implementation of Gradient Information Optimization (GIO) for effective and scalable training data selection
☆14 · Jun 22, 2023 · Updated 3 years ago
alon-albalak / online-data-mixing
An implementation of online data mixing for the Pile dataset, based on the GPT-NeoX library.
☆14 · Jan 9, 2024 · Updated 2 years ago
princeton-nlp / LESS
[ICML 2024] LESS: Selecting Influential Data for Targeted Instruction Tuning
☆531 · Oct 20, 2024 · Updated last year
davidbrandfonbrener / color-filter-olmo
☆13 · Dec 12, 2025 · Updated 7 months ago
MadryLab / D3M
Debiasing Through Data Attribution
☆13 · May 23, 2024 · Updated 2 years ago
2003pro / ScaleBiO
This is the official implementation of ScaleBiO: Scalable Bilevel Optimization for LLM Data Reweighting
☆25 · Jul 30, 2024 · Updated last year
HazyResearch / skill-it
Skill-It! A Data-Driven Skills Framework for Understanding and Training Language Models
☆48 · Oct 31, 2023 · Updated 2 years ago
MadryLab / trak
A fast, effective data attribution method for neural networks in PyTorch
☆243 · Nov 18, 2024 · Updated last year
ShiZhengyan / InstructionModelling
[NeurIPS 2024 Main Track] Code for the paper titled "Instruction Tuning With Loss Over Instructions"
☆38 · May 24, 2024 · Updated 2 years ago
sail-sg / SkyLadder
The official repository for SkyLadder: Better and Faster Pretraining via Context Window Scheduling
☆43 · Dec 29, 2025 · Updated 6 months ago
JingXuTHU / Random-Masking-Finds-Winning-Tickets-for-Parameter-Efficient-Fine-tuning
☆14 · May 4, 2024 · Updated 2 years ago
cxcscmu / General-AgentBench
Benchmark Test-Time Scaling of General LLM Agents
☆20 · Apr 14, 2026 · Updated 3 months ago
CryptoAILab / MergeGuard
[CCS-LAMPS'24] LLM IP Protection Against Model Merging
☆16 · Oct 14, 2024 · Updated last year
Trustworthy-ML-Lab / ThinkEdit
[EMNLP 25] An effective and interpretable weight-editing method for mitigating overly short reasoning in LLMs, and a mechanistic study un…
☆19 · Dec 17, 2025 · Updated 7 months ago
Olivia-fsm / DoGE
Codebase for ICML submission "DOGE: Domain Reweighting with Generalization Estimation"
☆21 · Feb 29, 2024 · Updated 2 years ago
ByteDance-Seed / DATAMASK
Joint Selection for Large-Scale Pre-Training Data via Policy Gradient-based Mask Learning
☆21 · Jan 4, 2026 · Updated 6 months ago
mlfoundations / dclm
DataComp for Language Models
☆1,454 · Sep 9, 2025 · Updated 10 months ago
zijian678 / TDD
☆14 · Apr 22, 2024 · Updated 2 years ago
2003pro / TAGCOS
This is the official implementation of TAGCOS: Task-agnostic Gradient Clustered Coreset Selection for Instruction Tuning Data
☆13 · Jul 21, 2024 · Updated 2 years ago
NEUIR / LISRec
[KDD '26] This is the code repo for our KDD '26 paper "LISRec: Modeling User Preferences with Learned Item Shortcuts for Sequential Recom…
☆18 · Jul 2, 2026 · Updated 2 weeks ago
abertsch72 / long-context-icl
Data and code for the preprint "In-Context Learning with Long-Context Models: An In-Depth Exploration"
☆44 · Aug 20, 2024 · Updated last year
googleinterns / localizing-paragraph-memorization
☆15 · Feb 21, 2024 · Updated 2 years ago
ZaydH / influence_analysis_papers
Influence Analysis and Estimation - Survey, Papers, and Taxonomy
☆90 · Feb 27, 2024 · Updated 2 years ago
JJchy / CG_score
Data Valuation without Training of a Model, submitted to ICLR'23
☆22 · Dec 30, 2022 · Updated 3 years ago
oriyor / turning_tables
Implementation of the paper: "Turning Tables: Generating Examples from Semi-structured Tables for Endowing Language Models with Reasoning…
☆22 · Nov 2, 2021 · Updated 4 years ago