Official repository for MATES: Model-Aware Data Selection for Efficient Pretraining with Data Influence Models [NeurIPS 2024]
☆79 · Nov 14, 2024 · Updated last year
Alternatives and similar repositories for MATES
Users who are interested in MATES are comparing it to the libraries listed below.
- [ICML 2024] Selecting High-Quality Data for Training Language Models ☆201 · Dec 8, 2025 · Updated 2 months ago
- Official Code Repository for [AutoScale: Scale-Aware Data Mixing for Pre-Training LLMs] Published as a conference paper at **COLM 2025**… ☆13 · Aug 8, 2025 · Updated 6 months ago
- ☆51 · Jan 24, 2024 · Updated 2 years ago
- The repository contains code for Adaptive Data Optimization ☆32 · Dec 9, 2024 · Updated last year
- Few-shot Learning with Auxiliary Data ☆31 · Dec 8, 2023 · Updated 2 years ago
- A Survey on Data Selection for Language Models ☆253 · Apr 29, 2025 · Updated 10 months ago
- [ICLR 2025] 🧬 RegMix: Data Mixture as Regression for Language Model Pre-training (Spotlight) ☆184 · Feb 17, 2025 · Updated last year
- An implementation of online data mixing for the Pile dataset, based on the GPT-NeoX library. ☆13 · Jan 9, 2024 · Updated 2 years ago
- ☆43 · Oct 13, 2023 · Updated 2 years ago
- Aioli: A unified optimization framework for language model data mixing ☆32 · Jan 17, 2025 · Updated last year
- AI Logging for Interpretability and Explainability 🔬 ☆140 · Jun 7, 2024 · Updated last year
- Implementation of Gradient Information Optimization (GIO) for effective and scalable training data selection ☆14 · Jun 22, 2023 · Updated 2 years ago
- RapidIn: Scalable Influence Estimation for Large Language Models (LLMs). The implementation for paper "Token-wise Influential Training Da… ☆21 · May 4, 2025 · Updated 9 months ago
- Skill-It! A Data-Driven Skills Framework for Understanding and Training Language Models ☆48 · Oct 31, 2023 · Updated 2 years ago
- [ICML 2024] LESS: Selecting Influential Data for Targeted Instruction Tuning ☆512 · Oct 20, 2024 · Updated last year
- DSIR large-scale data selection framework for language model training ☆270 · Apr 7, 2024 · Updated last year
- Repo for Rho-1: Token-level Data Selection & Selective Pretraining of LLMs. ☆459 · Apr 18, 2024 · Updated last year
- This is the official implementation of ScaleBiO: Scalable Bilevel Optimization for LLM Data Reweighting ☆24 · Jul 30, 2024 · Updated last year
- `dattri` is a PyTorch library for developing, benchmarking, and deploying efficient data attribution algorithms. ☆113 · Updated this week
- This is the official implementation for our ACL 2024 paper: "Causal Estimation of Memorisation Profiles". ☆24 · Mar 25, 2025 · Updated 11 months ago
- Debiasing Through Data Attribution ☆12 · May 23, 2024 · Updated last year
- Code for paper: "What Data Benefits My Classifier?" Enhancing Model Performance and Interpretability through Influence-Based Data Selecti… ☆23 · May 17, 2024 · Updated last year
- A fast, effective data attribution method for neural networks in PyTorch ☆232 · Nov 18, 2024 · Updated last year
- ☆12 · Apr 22, 2024 · Updated last year
- [Findings of ACL-2023] This is the official implementation of On the Difference of BERT-style and CLIP-style Text Encoders. ☆14 · Jun 7, 2023 · Updated 2 years ago
- Code for the paper "Pretrained Models for Multilingual Federated Learning" at NAACL 2022 ☆11 · Aug 9, 2022 · Updated 3 years ago
- [NeurIPS 2024 Spotlight] CLIPLoss and Norm-Based Data Selection Methods for Multimodal Contrastive Learning. ☆14 · Dec 12, 2024 · Updated last year
- Provides a minimal implementation to extract FLAN datasets for further processing ☆11 · Feb 1, 2023 · Updated 3 years ago
- Implementation of VQ-VAE with a GPT-style sampler in the JAX and Haiku ecosystem. ☆12 · Nov 23, 2023 · Updated 2 years ago
- This is the official implementation of TAGCOS: Task-agnostic Gradient Clustered Coreset Selection for Instruction Tuning Data ☆13 · Jul 21, 2024 · Updated last year
- [Korean-language project description, unrecoverable after encoding damage; legible fragments: EDA/Preprocessing, AI, NLP - KoBERT, KoBART]… ☆13 · Mar 24, 2022 · Updated 3 years ago
- The LM Contamination Index is a manually created database of contamination evidences for LMs. ☆82 · Apr 11, 2024 · Updated last year
- Data Valuation without Training of a Model, submitted to ICLR'23 ☆22 · Dec 30, 2022 · Updated 3 years ago
- [NeurIPS'24] Official PyTorch implementation for paper "Knowledge Composition using Task Vectors with Learned Anisotropic Scaling" ☆27 · Feb 24, 2025 · Updated last year
- Code for paper "Merging Multi-Task Models via Weight-Ensembling Mixture of Experts" ☆30 · Jun 7, 2024 · Updated last year
- DataInf: Efficiently Estimating Data Influence in LoRA-tuned LLMs and Diffusion Models (ICLR 2024) ☆79 · Oct 3, 2024 · Updated last year
- ☆109 · Jul 15, 2025 · Updated 7 months ago
- Influence Analysis and Estimation - Survey, Papers, and Taxonomy ☆87 · Feb 27, 2024 · Updated 2 years ago
- Code for "Can Retriever-Augmented Language Models Reason? The Blame Game Between the Retriever and the Language Model", EMNLP Findings 20… ☆28 · Nov 2, 2023 · Updated 2 years ago