Bolin97/awesome-instruction-selector

Paper list and datasets for the paper: A Survey on Data Selection for LLM Instruction Tuning

☆47

Alternatives and similar repositories for awesome-instruction-selector

Users who are interested in awesome-instruction-selector are comparing it to the repositories listed below.

JinLi-i / MoDiCF
The source code of [WWW 2025] MoDiCF
☆12 · Jul 12, 2025 · Updated 7 months ago
Lichang-Chen / AlpaGasus
A better Alpaca Model Trained with Less Data (only 9k instructions of the original set)
☆24 · Jul 26, 2024 · Updated last year
pzs19 / LEMMA
☆16 · Sep 4, 2025 · Updated 6 months ago
LHL3341 / MetaLadder
MetaLadder: Ascending Mathematical Solution Quality via Analogical-Problem Reasoning Transfer (EMNLP 2025)
☆11 · Apr 18, 2025 · Updated 10 months ago
CASIA-LM / MoDS
☆148 · Apr 16, 2024 · Updated last year
pldlgb / nuggets
☆87 · Dec 29, 2023 · Updated 2 years ago
rrphys / KernelHerding
Kernel Herding for probability density estimation
☆14 · Feb 23, 2016 · Updated 10 years ago
yuleiqin / fantastic-data-engineering
Fantastic Data Engineering for Large Language Models
☆93 · Dec 29, 2024 · Updated last year
ADaM-BJTU / W2SG
The code of “Improving Weak-to-Strong Generalization with Scalable Oversight and Ensemble Learning”
☆17 · Feb 26, 2024 · Updated 2 years ago
hkust-nlp / deita
Deita: Data-Efficient Instruction Tuning for Alignment [ICLR2024]
☆589 · Dec 9, 2024 · Updated last year
sylinrl / CalibratedMath
Teaching Models to Express Their Uncertainty in Words
☆39 · May 26, 2022 · Updated 3 years ago
NLP2CT / kNN-TL
[ACL 2023] kNN-TL: k-Nearest-Neighbor Transfer Learning for Low-Resource Neural Machine Translation
☆17 · Jul 27, 2023 · Updated 2 years ago
CCIIPLab / CET
This is the source code for: Context-aware Entity Typing in Knowledge Graphs.
☆16 · May 10, 2022 · Updated 3 years ago
quanshr / DMoERM
[ACL2024 Findings] DMoERM: Recipes of Mixture-of-Experts for Effective Reward Modeling
☆18 · Jun 6, 2024 · Updated last year
ncsu-dk-lab / Acc-DD
☆14 · Apr 21, 2023 · Updated 2 years ago
RulinShao / RAG-evaluation-harnesses
An evaluation suite for Retrieval-Augmented Generation (RAG).
☆23 · Apr 26, 2025 · Updated 10 months ago
MIV-XJTU / SPEED
PyTorch implementation of paper "Sparse Parameterization for Epitomic Dataset Distillation" in NeurIPS 2023.
☆20 · Jun 28, 2024 · Updated last year
tianyi-lab / Cherry_LLM
[NAACL'24] Self-data filtering of LLM instruction-tuning data using a novel perplexity-based difficulty score, without using any other mo…
☆416 · Jun 25, 2025 · Updated 8 months ago
alon-albalak / data-selection-survey
A Survey on Data Selection for Language Models
☆254 · Apr 29, 2025 · Updated 10 months ago
rohinmanvi / Capability-Aware-and-Mid-Generation-Self-Evaluations
☆21 · Jul 25, 2025 · Updated 7 months ago
flageval-baai / HalluDial
☆21 · Aug 19, 2024 · Updated last year
CogComp / faithful_summarization
☆18 · May 5, 2021 · Updated 4 years ago
shuyhere / about-super-alignment
Feeling confused about super alignment? Here is a reading list
☆44 · Jan 9, 2024 · Updated 2 years ago
ZigeW / data_management_LLM
Collection of training data management explorations for large language models
☆336 · Aug 2, 2024 · Updated last year
xiami2019 / UAR
[Findings of EMNLP'2024] Unified Active Retrieval for Retrieval Augmented Generation
☆23 · Sep 30, 2024 · Updated last year
ZNLP / Language-Imbalance-Driven-Rewarding
[ICLR 2025] Language Imbalance Driven Rewarding for Multilingual Self-improving
☆24 · Aug 25, 2025 · Updated 6 months ago
princeton-nlp / LESS
[ICML 2024] LESS: Selecting Influential Data for Targeted Instruction Tuning
☆512 · Oct 20, 2024 · Updated last year
toolCHAINZ / jingle
SMT Modeling and Configurable Program Analysis for Ghidra's PCODE
☆32 · Updated this week
Zayne-sprague / To-CoT-or-not-to-CoT
☆25 · Apr 10, 2025 · Updated 10 months ago
zhiming-xu / conad
Contrastive Attributed Network Anomaly Detection with Data Augmentation (PAKDD'22)
☆29 · Aug 12, 2022 · Updated 3 years ago
dair-iitd / tkbi
☆26 · May 29, 2022 · Updated 3 years ago
QizhiPei / MathFusion
MathFusion: Enhancing Mathematical Problem-solving of LLM through Instruction Fusion (ACL 2025)
☆35 · Jul 16, 2025 · Updated 7 months ago
locuslab / T-MARS
Code for T-MARS data filtering
☆35 · Aug 23, 2023 · Updated 2 years ago
Chen-X666 / privacy-preserving-prompt
Privacy-Preserving Prompt Tuning for Large Language Model
☆29 · Mar 19, 2024 · Updated last year
GX-XinGao / GRA
The Code and Script of "David's Slingshot: A Strategic Coordination Framework of Small LLMs Matches Large LLMs in Data Synthesis"
☆34 · Jun 13, 2025 · Updated 8 months ago
kennethorq / SMORE
[WSDM 2025] Source code for "Spectrum-based Modality Representation Fusion Graph Convolutional Network for Multimodal Recommendation".
☆36 · Dec 22, 2024 · Updated last year
HReynaud / EchoNet-Synthetic
MICCAI 2024 code for the paper: EchoNet-Synthetic: Privacy-preserving Video Generation for Safe Medical Data Sharing. EchoNet-Synthetic i…
☆36 · Jun 16, 2025 · Updated 8 months ago
mmmistgun / Tipdm_Data_Analysis_II
Problem A of the 2nd "Teddy Cup" Data Analysis Vocational Skills Competition
☆10 · Sep 15, 2020 · Updated 5 years ago
CHERIoT-Platform / cheriot-sail
Sail code model of the CHERIoT ISA
☆47 · Feb 25, 2026 · Updated last week