Our code for ICLR'25 paper "DataMan: Data Manager for Pre-training Large Language Models".
☆119Feb 7, 2026Updated last month
Alternatives and similar repositories for DataMan
Users who are interested in DataMan are comparing it to the repositories listed below.
- A live reading list for LLM data synthesis (Updated to July, 2025).☆455Aug 26, 2025Updated 6 months ago
- LongAttn: Selecting Long-context Training Data via Token-level Attention☆15Jul 16, 2025Updated 7 months ago
- LCA-on-the-line (ICML 2024 Oral)☆13Feb 13, 2025Updated last year
- Does patch ordering affect context-limited vision transformers?☆17Oct 10, 2025Updated 4 months ago
- Llama-3-SynE: A Significantly Enhanced Version of Llama-3 with Advanced Scientific Reasoning and Chinese Language Capabilities | Continued pre-training to improve …☆37May 31, 2025Updated 9 months ago
- [ICML 2024] Selecting High-Quality Data for Training Language Models☆200Dec 8, 2025Updated 2 months ago
- [ACL2025 Findings] Official code for MIG: Automatic Data Selection for Instruction Tuning by Maximizing Information Gain in Semantic Spac…☆28Aug 30, 2025Updated 6 months ago
- [EMNLP 2025] Verification Engineering for RL in Instruction Following☆51Jan 5, 2026Updated 2 months ago
- [AAAI'26, Oral 🌟] Code for "Teaching Large Language Models to Maintain Contextual Faithfulness via Synthetic Tasks and Reinforcement Lea…☆43Jul 16, 2025Updated 7 months ago
- [AAAI 2025] Automatically Generating Numerous Context-Driven SFT Data for LLMs across Diverse Granularity☆26Mar 17, 2025Updated 11 months ago
- ☆118May 26, 2025Updated 9 months ago
- Muon fsdp 2☆55Aug 8, 2025Updated 6 months ago
- Code implementation of synthetic continued pretraining☆152Jan 6, 2025Updated last year
- ☆31Feb 9, 2025Updated last year
- Official implementation for DenseMixer: Improving MoE Post-Training with Precise Router Gradient☆66Aug 3, 2025Updated 7 months ago
- ☆36Jul 7, 2025Updated 8 months ago
- Vision-Language Models Toolbox: Your all-in-one solution for multimodal research and experimentation☆12Feb 16, 2025Updated last year
- A high-throughput and memory-efficient inference and serving engine for LLMs☆12Nov 14, 2025Updated 3 months ago
- ☆96Nov 6, 2024Updated last year
- The RedStone repository includes code for preparing extensive datasets used in training large language models.☆156Jan 22, 2026Updated last month
- [ACL 2025] Official PyTorch implementation of the paper: Condor: Enhance LLM Alignment with Knowledge-Driven Data Synthesis and Refinement☆39May 28, 2025Updated 9 months ago
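- Re-identification of campus bicycles in surveillance-quality footage, including ReID model recognition, vector database retrieval, and a UI display☆10Feb 13, 2024Updated 2 years ago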
- [ML4H'25] m1: Unleash the Potential of Test-Time Scaling for Medical Reasoning in Large Language Models☆48Dec 21, 2025Updated 2 months ago
- This is an attempt to fine-tune SOTA Large Language Models so as to generate Verilog (VHDL) programmes, detect syntax, logic and human er…☆17Aug 7, 2025Updated 7 months ago
- ☆23Jun 19, 2025Updated 8 months ago
- ☆22Dec 23, 2025Updated 2 months ago
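- Course slides, assignments, and exams for Tsinghua University's Introduction to Artificial Intelligence course (taught by Long Mingsheng)☆14Jun 26, 2023Updated 2 years ago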
- ABench is an evolving open-source benchmark suite designed to rigorously evaluate and enhance Large Language Models (LLMs) on complex cro…☆24Sep 29, 2025Updated 5 months ago
- Solutions to Ireland, Rosen exercises in "A Classical Introduction to Modern Number Theory"☆13Nov 7, 2024Updated last year
- Implementation for Variational Information Bottleneck for Effective Low-resource Fine-tuning, ICLR 2021☆43May 10, 2021Updated 4 years ago
- Image Text Segmentation using FAST corner detection and DBSCAN clustering with k-d tree data structure☆14Feb 27, 2019Updated 7 years ago
- [NeurIPS 2023] "Diversified Outlier Exposure for Out-of-Distribution Detection via Informative Extrapolation"☆11Oct 6, 2023Updated 2 years ago
- How to really install tensorflow-gpu from source on a clean instance of Ubuntu☆11Sep 29, 2023Updated 2 years ago
- ☆14Dec 18, 2024Updated last year
- ☆13May 15, 2025Updated 9 months ago
- ☆11Dec 15, 2025Updated 2 months ago
- Search, download Vimeo videos and retrieve metadata in Go.☆11Feb 10, 2022Updated 4 years ago
- The official codes for our paper at COLING 2022: Semantic-Preserving Adversarial Code Comprehension☆12Oct 23, 2022Updated 3 years ago
- Code for EMNLP'24 paper - On Diversified Preferences of Large Language Model Alignment☆16Aug 6, 2024Updated last year