pengr/DataMan

Readme badge preview -

If you own this repo, copy the snippet below and add it to your README.md

[![RelatedRepos](https://img.shields.io/badge/related-repos-yellow)](https://relatedrepos.com/gh/pengr/DataMan)

pengr / DataMan

Our code for ICLR'25 paper "DataMan: Data Manager for Pre-training Large Language Models".

☆119

Alternatives and similar repositories for DataMan

Users that are interested in DataMan are comparing it to the libraries listed below

Sorting:

pengr / LLM-Synthetic-Data
View on GitHub
A live reading list for LLM data synthesis (Updated to July, 2025).
☆455Aug 26, 2025Updated 6 months ago
Lyun0912-wu / LongAttn
View on GitHub
LongAttn ：Selecting Long-context Training Data via Token-level Attention
☆15Jul 16, 2025Updated 7 months ago
ElvishElvis / LCA-on-the-line
View on GitHub
LCA-on-the-line (ICML 2024 Oral)
☆13Feb 13, 2025Updated last year
d3tk / REOrder
View on GitHub
Does patch ordering affect context-limited vision transformers?
☆17Oct 10, 2025Updated 4 months ago
RUC-GSAI / Llama-3-SynE
View on GitHub
Llama-3-SynE: A Significantly Enhanced Version of Llama-3 with Advanced Scientific Reasoning and Chinese Language Capabilities | 继续预训练提升 …
☆37May 31, 2025Updated 9 months ago
princeton-nlp / QuRating
View on GitHub
[ICML 2024] Selecting High-Quality Data for Training Language Models
☆200Dec 8, 2025Updated 2 months ago
yichengchen24 / MIG
View on GitHub
[ACL2025 Findings] Official code for MIG: Automatic Data Selection for Instruction Tuning by Maximizing Information Gain in Semantic Spac…
☆28Aug 30, 2025Updated 6 months ago
THU-KEG / VerIF
View on GitHub
[EMNLP 2025] Verification Engineering for RL in Instruction Following
☆51Jan 5, 2026Updated 2 months ago
S1s-Z / CANOE
View on GitHub
[AAAI'26, Oral 🌟] Code for "Teaching Large Language Models to Maintain Contextual Faithfulness via Synthetic Tasks and Reinforcement Lea…
☆43Jul 16, 2025Updated 7 months ago
quanshr / AugCon
View on GitHub
[AAAI 2025]Automatically Generating Numerous Context-Driven SFT Data for LLMs across Diverse Granularity
☆26Mar 17, 2025Updated 11 months ago
gydpku / Data_Synthesis_RL
View on GitHub
☆118May 26, 2025Updated 9 months ago
samsja / muon_fsdp_2
View on GitHub
Muon fsdp 2
☆55Aug 8, 2025Updated 6 months ago
ZitongYang / Synthetic_Continued_Pretraining
View on GitHub
Code implementation of synthetic continued pretraining
☆152Jan 6, 2025Updated last year
xiatingyu / SFT-DataSelection-at-scale
View on GitHub
☆31Feb 9, 2025Updated last year
yaof20 / DenseMixer
View on GitHub
Official implementation for DenseMixer: Improving MoE Post-Training with Precise Router Gradient
☆66Aug 3, 2025Updated 7 months ago
banksy23 / XCoder
View on GitHub
☆36Jul 7, 2025Updated 8 months ago
deepmancer / vlm-toolbox
View on GitHub
Vision-Language Models Toolbox: Your all-in-one solution for multimodal research and experimentation
☆12Feb 16, 2025Updated last year
Adlik / vllm
View on GitHub
A high-throughput and memory-efficient inference and serving engine for LLMs
☆12Nov 14, 2025Updated 3 months ago
QwenLM / Self-Lengthen
View on GitHub
☆96Nov 6, 2024Updated last year
microsoft / RedStone
View on GitHub
The RedStone repository includes code for preparing extensive datasets used in training large language models.
☆156Jan 22, 2026Updated last month
InternLM / Condor
View on GitHub
[ACL 2025] An official pytorch implement of the paper: Condor: Enhance LLM Alignment with Knowledge-Driven Data Synthesis and Refinement
☆39May 28, 2025Updated 9 months ago
Unakar / Bike-REID
View on GitHub
在监控画质下实现对校园自行车的重识别，包含REID模型识别，向量数据库检索，UI展示
☆10Feb 13, 2024Updated 2 years ago
UCSC-VLAA / m1
View on GitHub
[ML4H'25] m1: Unleash the Potential of Test-Time Scaling for Medical Reasoning in Large Language Models
☆48Dec 21, 2025Updated 2 months ago
SohamD34 / VeriGen
View on GitHub
This is an attempt to fine tune SOTA Large Language Models so as to generate Verilog (VHDL) programmes, detect syntax, logic and human er…
☆17Aug 7, 2025Updated 7 months ago
bit-ml / AVH-Align
View on GitHub
☆23Jun 19, 2025Updated 8 months ago
video-reality-test / video-reality-test
View on GitHub
☆22Dec 23, 2025Updated 2 months ago
qsguo / THU-AI-Intro-Assignments-Tests
View on GitHub
清华大学人工智能导论（龙明盛老师）课程课件，作业以及试题
☆14Jun 26, 2023Updated 2 years ago
inclusionAI / ABench
View on GitHub
ABench is an evolving open-source benchmark suite designed to rigorously evaluate and enhance Large Language Models (LLMs) on complex cro…
☆24Sep 29, 2025Updated 5 months ago
RichardGanaye / Ireland-Rosen
View on GitHub
Solutions to Ireland, Rosen exercises in "A Classical Introduction to Modern Number Theory"
☆13Nov 7, 2024Updated last year
rabeehk / vibert
View on GitHub
Implementation for Variational Information Bottleneck for Effective Low-resource Fine-tuning, ICLR 2021
☆43May 10, 2021Updated 4 years ago
JFBarryLi / ITSegmenter
View on GitHub
Image Text Segmentation using FAST corner detection and DBSCAN clustering with k-d tree data structure
☆14Feb 27, 2019Updated 7 years ago
ZFancy / DivOE
View on GitHub
[NeurIPS 2023] "Diversified Outlier Exposure for Out-of-Distribution Detection via Informative Extrapolation"
☆11Oct 6, 2023Updated 2 years ago
rnreich / ubuntu-tensorflow-gpu-all-versions
View on GitHub
How to really install tensorflow-gpu from source on a clean instance of Ubuntu
☆11Sep 29, 2023Updated 2 years ago
MDI-Benchmark / MDI-Benchmark
View on GitHub
☆14Dec 18, 2024Updated last year
G-JWLee / TAMP
View on GitHub
☆13May 15, 2025Updated 9 months ago
lavinal712 / control-lora-v3
View on GitHub
☆11Dec 15, 2025Updated 2 months ago
raitonoberu / vimego
View on GitHub
Search, download Vimeo videos and retrieve metadata in Go.
☆11Feb 10, 2022Updated 4 years ago
EricLee8 / SPACE
View on GitHub
The official codes for our paper at COLING 2022: Semantic-Preserving Adversarial Code Comprehension
☆12Oct 23, 2022Updated 3 years ago
dunzeng / MORE
View on GitHub
Code for EMNLP'24 paper - On Diversified Preferences of Large Language Model Alignment
☆16Aug 6, 2024Updated last year