dada-qin/Data-Centric_LLM_Studies

A list of papers about data quality in Large Language Models (LLMs)

☆27

Alternatives and similar repositories for Data-Centric_LLM_Studies

Users who are interested in Data-Centric_LLM_Studies are comparing it to the repositories listed below.

OpenDCAI / DataMind
All-in-one intelligent assistant powered by LlamaIndex: RAG, GraphRAG, NL2SQL, Skills & Memory with multimodal support.
☆22 · Jul 8, 2026 · Updated last week
TemporaryLoRA / FreeLM
☆15 · Feb 10, 2026 · Updated 5 months ago
OpenDCAI / DataFlow-WebUI
☆26 · Jul 7, 2026 · Updated 2 weeks ago
limenlp / safer-instruct
This is the official repository for "Safer-Instruct: Aligning Language Models with Automated Preference Data"
☆17 · Feb 22, 2024 · Updated 2 years ago
feiyang-k / AutoScale
Official Code Repository for [AutoScale📈: Scale-Aware Data Mixing for Pre-Training LLMs] Published as a conference paper at **COLM 2025*…
☆14 · Aug 8, 2025 · Updated 11 months ago
hhnqqq / py_hfd
A Python script for downloading Hugging Face datasets and models.
☆20 · Apr 10, 2025 · Updated last year
NathanGodey / qfilters
Repository for the Q-Filters method (https://arxiv.org/pdf/2503.02812)
☆34 · Mar 7, 2025 · Updated last year
meng-li / sampling-study
Just a demonstration of some sampling techniques (rejection sampling, importance sampling, sampling importance resampling, Metropolis sam…
☆11 · Aug 24, 2013 · Updated 12 years ago
cbenge509 / arxiv-ai-analysis
A visualization experience of AI/ML academic papers hosted on arXiv - for project work at the University of California, Berkeley MIDS pro…
☆10 · Feb 10, 2023 · Updated 3 years ago
QwenLM / Confident-Decoding
☆31 · Jun 30, 2026 · Updated 3 weeks ago
Lancelot-Xie / MAG-SQL
MAG-SQL: Multi-Agent Generative Approach with Soft Schema Linking and Iterative Sub-SQL Refinement for Text-to-SQL
☆19 · Jul 10, 2025 · Updated last year
tongzhou21 / Oasis
☆23 · Aug 7, 2023 · Updated 2 years ago
XMZhangAI / MetaMind
[2025 NeurIPS Spotlight] MetaMind: Modeling Human Social Thoughts with Metacognitive Multi-Agent Systems
☆83 · Aug 19, 2025 · Updated 11 months ago
sail-sg / SimLayerKV
The official implementation of the paper "SimLayerKV: A Simple Framework for Layer-Level KV Cache Reduction".
☆54 · Oct 18, 2024 · Updated last year
Frostlinx / Socratic-Zero
Socratic-Zero is a fully autonomous framework that generates high-quality training data for mathematical reasoning
☆37 · Oct 26, 2025 · Updated 8 months ago
zwt233 / GAMLP
☆19 · Mar 21, 2022 · Updated 4 years ago
BrainToSpeech / BTS_Tutorials
Brain to speech
☆13 · Mar 17, 2026 · Updated 4 months ago
JALB-epsilon / Fine-tuning-NOs
☆15 · Mar 19, 2024 · Updated 2 years ago
Aurora-slz / MM-Verify
☆19 · Oct 28, 2025 · Updated 8 months ago
junkangwu / QAE
[ICLR 2026] Quantile Advantage Estimation for Entropy-Safe Reasoning
☆29 · Oct 14, 2025 · Updated 9 months ago
nkmjm / mental_img_recon
Mental image reconstruction from human brain activity
☆17 · Jul 1, 2024 · Updated 2 years ago
gennadylaptev / FM_in_PyTorch
My implementation of Factorization Machine in PyTorch.
☆18 · May 27, 2019 · Updated 7 years ago
OSU-NLP-Group / AutoSDT
[EMNLP'25] AutoSDT is a fully automatic pipeline to collect data-driven scientific coding tasks to train co-scientist models.
☆21 · Aug 11, 2025 · Updated 11 months ago
hanrach / p2d_fast_solver
☆17 · Dec 30, 2023 · Updated 2 years ago
alessiodevoto / l2compress
Code for the EMNLP24 paper "A simple and effective L2 norm based method for KV Cache compression."
☆19 · Dec 13, 2024 · Updated last year
CERC-AAI / bfm
Code for "General-Purpose Brain Foundation Models for Time-Series Neuroimaging Data"
☆15 · Dec 14, 2024 · Updated last year
935963004 / PhysioOmni
☆15 · Oct 19, 2025 · Updated 9 months ago
CECNL / XBrainLab
We introduce XBrainLab, an open-source, user-friendly software for accelerated interpretation of neural patterns from EEG data based on c…
☆14 · Dec 5, 2025 · Updated 7 months ago
RUC-GSAI / Llama-3-SynE
Llama-3-SynE: A Significantly Enhanced Version of Llama-3 with Advanced Scientific Reasoning and Chinese Language Capabilities | Continued pre-training to improve …
☆40 · May 31, 2025 · Updated last year
5a7man / eeg_fConn
Python library to compute functional connectivity measures from EEG
☆12 · Oct 14, 2023 · Updated 2 years ago
11xiaoyi11 / IQA-Survey
A Survey on Image Quality Assessment: Insights, Analysis, and Future Outlook
☆20 · Jun 25, 2025 · Updated last year
InternScience / TrustGeoGen
Official repository for "TrustGeoGen: Formal-Verified Data Engine for Trustworthy Multi-modal Geometric Problem Solving"
☆23 · Sep 1, 2025 · Updated 10 months ago
Jingyu6 / speculative_prefill
☆63 · May 19, 2025 · Updated last year
jparkerholder / PB2
Code for the Population-Based Bandits Algorithm, presented at NeurIPS 2020.
☆20 · Apr 13, 2021 · Updated 5 years ago
xmed-lab / ToMo-UDA
[ICML'24] Unsupervised Domain Adaptation for Anatomical Structure Detection in Ultrasound Images.
☆11 · Jul 12, 2024 · Updated 2 years ago
key1589745 / decouple_predict
☆14 · Nov 29, 2022 · Updated 3 years ago
Hope-Rita / THLM
Code for "Pretraining Language Models with Text-Attributed Heterogeneous Graphs"
☆16 · Oct 13, 2023 · Updated 2 years ago
UCSB-NLP-Chang / ThinkPrune
☆46 · Sep 27, 2025 · Updated 9 months ago
AndreHe02 / rewarding-unlikely-release
☆15 · Jun 10, 2025 · Updated last year