ZigeW/data_management_LLM

Collection of training data management explorations for large language models

☆343

Alternatives and similar repositories for data_management_LLM

Users that are interested in data_management_LLM are comparing it to the libraries listed below.

tongzhou21 / Oasis
☆23 · Aug 7, 2023 · Updated 2 years ago
dheeraj7596 / Small2Large
☆18 · Feb 20, 2024 · Updated 2 years ago
tianyi-lab / Cherry_LLM
[NAACL'24] Self-data filtering of LLM instruction-tuning data using a novel perplexity-based difficulty score, without using any other mo…
☆416 · Jun 25, 2025 · Updated last year
OFA-Sys / InsTag
InsTag: A Tool for Data Analysis in LLM Supervised Fine-tuning
☆287 · Aug 20, 2023 · Updated 2 years ago
princeton-nlp / LESS
[ICML 2024] LESS: Selecting Influential Data for Targeted Instruction Tuning
☆532 · Oct 20, 2024 · Updated last year
alon-albalak / data-selection-survey
A Survey on Data Selection for Language Models
☆260 · Apr 29, 2025 · Updated last year
tianyi-lab / Superfiltering
[ACL'24] Superfiltering: Weak-to-Strong Data Filtering for Fast Instruction-Tuning
☆189 · Jun 25, 2025 · Updated last year
hkust-nlp / deita
Deita: Data-Efficient Instruction Tuning for Alignment [ICLR2024]
☆600 · Dec 9, 2024 · Updated last year
2003pro / TAGCOS
This is the official implementation of TAGCOS: Task-agnostic Gradient Clustered Coreset Selection for Instruction Tuning Data
☆13 · Jul 21, 2024 · Updated 2 years ago
alycialee / beyond-scale-language-data-diversity
☆13 · Updated this week
p-lambda / dsir
DSIR large-scale data selection framework for language model training
☆275 · Apr 7, 2024 · Updated 2 years ago
sail-sg / regmix
[ICLR 2025] 🧬 RegMix: Data Mixture as Regression for Language Model Pre-training (Spotlight)
☆194 · Feb 17, 2025 · Updated last year
Bolin97 / awesome-instruction-selector
Paper list and datasets for the paper: A Survey on Data Selection for LLM Instruction Tuning
☆48 · Jan 22, 2026 · Updated 6 months ago
datajuicer / data-juicer
Data processing for and with foundation models! 🍎 🍋 🌽 ➡️ ➡️🍸 🍹 🍷
☆6,769 · Updated this week
princeton-nlp / QuRating
[ICML 2024] Selecting High-Quality Data for Training Language Models
☆204 · Dec 8, 2025 · Updated 7 months ago
daeveraert / gradient-information-optimization
Implementation of Gradient Information Optimization (GIO) for effective and scalable training data selection
☆14 · Jun 22, 2023 · Updated 3 years ago
heyblackC / BetterMixture-Top1-Solution
First-place (top-1) solution for the Tianchi algorithm competition "BetterMixture - Large Model Data Mixture Challenge"
☆33 · Jul 7, 2024 · Updated 2 years ago
CASIA-LM / MoDS
☆153 · Apr 16, 2024 · Updated 2 years ago
fanqiwan / Explore-Instruct
EMNLP'2023: Explore-Instruct: Enhancing Domain-Specific Instruction Coverage through Active Exploration
☆36 · Mar 10, 2024 · Updated 2 years ago
ChenghaoMou / text-dedup
All-in-one text de-duplication
☆764 · Mar 9, 2026 · Updated 4 months ago
YJiangcm / FollowBench
[ACL 2024] FollowBench: A Multi-level Fine-grained Constraints Following Benchmark for Large Language Models
☆118 · Jun 12, 2025 · Updated last year
microsoft / rho
Repo for Rho-1: Token-level Data Selection & Selective Pretraining of LLMs.
☆470 · Apr 18, 2024 · Updated 2 years ago
huggingface / datablations
Scaling Data-Constrained Language Models
☆344 · Jun 28, 2025 · Updated last year
RUCAIBox / QuantizedEmpirical
☆15 · Sep 24, 2023 · Updated 2 years ago
lmmlzn / Awesome-LLMs-Datasets
Summarize existing representative LLMs text datasets.
☆1,479 · Mar 11, 2026 · Updated 4 months ago
PhoebusSi / Alpaca-CoT
We unified the interfaces of instruction-tuning data (e.g., CoT data), multiple LLMs and parameter-efficient methods (e.g., lora, p-tunin…
☆2,791 · Dec 12, 2023 · Updated 2 years ago
cxcscmu / MATES
Official repository for MATES: Model-Aware Data Selection for Efficient Pretraining with Data Influence Models [NeurIPS 2024]
☆80 · Nov 14, 2024 · Updated last year
allenai / dolma
Data and tools for generating and inspecting OLMo pre-training data.
☆1,526 · Nov 5, 2025 · Updated 8 months ago
xiatingyu / SFT-DataSelection-at-scale
☆34 · Feb 9, 2025 · Updated last year
QwenLM / AutoIF
☆336 · Jul 25, 2024 · Updated last year
yifanzhang-pro / AutoMathText
[ACL 2025 Findings] Autonomous Data Selection with Zero-shot Generative Classifiers for Mathematical Texts (https://huggingface.co/papers…
☆92 · Nov 23, 2025 · Updated 8 months ago
GAIR-NLP / weak-to-strong-reasoning
☆59 · Sep 2, 2024 · Updated last year
daochenzha / data-centric-AI
A curated, but incomplete, list of data-centric AI resources.
☆1,153 · Jun 26, 2024 · Updated 2 years ago
OpenDataBox / awesome-data-llm
Official Repository of "LLM × DATA" Survey Paper
☆805 · Jun 15, 2026 · Updated last month
ChengpengLi1003 / DotaMath
☆30 · Dec 27, 2024 · Updated last year
OpenRLHF / OpenRLHF
An Easy-to-use, Scalable and High-performance Agentic RL Framework based on Ray (PPO & DAPO & REINFORCE++ & VLM & TIS & vLLM & Ray & Asy…
☆9,841 · Jul 14, 2026 · Updated last week
sangmichaelxie / doremi
Pytorch implementation of DoReMi, a method for optimizing the data mixture weights in language modeling datasets
☆357 · Dec 26, 2023 · Updated 2 years ago
Zjh-819 / LLMDataHub
A quick guide (especially) for trending instruction finetuning datasets
☆3,403 · Nov 28, 2023 · Updated 2 years ago
facebookresearch / SemDeDup
Code for "SemDeDup", a simple method for identifying and removing semantic duplicates from a dataset (data pairs which are semantically s…
☆157 · Oct 1, 2023 · Updated 2 years ago