The RedStone repository includes code for preparing extensive datasets used in training large language models.
☆156 · Jan 22, 2026 · Updated last month
Alternatives and similar repositories for RedStone
Users who are interested in RedStone compare it to the repositories listed below.
- Heuristic filtering framework for RefineCode ☆82 · Mar 13, 2025 · Updated 11 months ago
- PLM: Efficient Peripheral Language Models Hardware-Co-Designed for Ubiquitous Computing ☆21 · Mar 18, 2025 · Updated 11 months ago
- ☆211 · Oct 27, 2025 · Updated 4 months ago
- ☆63 · Jun 12, 2025 · Updated 8 months ago
- ☆109 · Jul 15, 2025 · Updated 7 months ago
- DataComp for Language Models ☆1,419 · Sep 9, 2025 · Updated 5 months ago
- Math24o: High School Olympiad Mathematics Chinese Benchmark ☆11 · Mar 27, 2025 · Updated 11 months ago
- ☆94 · Feb 11, 2026 · Updated 2 weeks ago
- LongAttn: Selecting Long-context Training Data via Token-level Attention ☆15 · Jul 16, 2025 · Updated 7 months ago
- UnitEval is a benchmarking and evaluation tool for AutoDev Coder. ☆13 · Jan 2, 2024 · Updated 2 years ago
- Code for paper: "Executing Arithmetic: Fine-Tuning Large Language Models as Turing Machines" ☆11 · Oct 11, 2024 · Updated last year
- Triton version of GQA flash attention, based on the tutorial ☆12 · Aug 4, 2024 · Updated last year
- Official Repo for Open-Reasoner-Zero ☆2,087 · Jun 2, 2025 · Updated 8 months ago
- ☆129 · Jun 6, 2025 · Updated 8 months ago
- ☆51 · May 11, 2025 · Updated 9 months ago
- LCA-on-the-line (ICML 2024 Oral) ☆13 · Feb 13, 2025 · Updated last year
- ☆18 · Jun 14, 2025 · Updated 8 months ago
- Large-Scale High-quality Chinese Web Text with Multi-dimensional and fine-grained information ☆38 · Dec 2, 2024 · Updated last year
- PEACE: Empowering Geologic Map Holistic Understanding with MLLMs [Official, CVPR 2025] ☆70 · Feb 11, 2026 · Updated 2 weeks ago
- ☆52 · May 19, 2025 · Updated 9 months ago
- ☆46 · Dec 30, 2024 · Updated last year
- ☆565 · Nov 20, 2024 · Updated last year
- My Implementation of Q-Sparse: All Large Language Models can be Fully Sparsely-Activated ☆33 · Aug 14, 2024 · Updated last year
- Muon fsdp 2 ☆54 · Aug 8, 2025 · Updated 6 months ago
- A robust web archive analytics toolkit ☆131 · Oct 15, 2025 · Updated 4 months ago
- ☆63 · May 16, 2025 · Updated 9 months ago
- Implementation of paper Data Engineering for Scaling Language Models to 128K Context ☆486 · Mar 19, 2024 · Updated last year
- ☆167 · May 2, 2024 · Updated last year
- ☆12 · Jan 9, 2024 · Updated 2 years ago
- [COLM 2025] An Open Math Pre-training Dataset with 370B Tokens. ☆109 · Apr 4, 2025 · Updated 10 months ago
- Use the tokenizer in parallel to achieve superior acceleration ☆20 · Mar 21, 2024 · Updated last year
- AvatarGo: Plug and Play self-avatars for VR ☆21 · Nov 22, 2022 · Updated 3 years ago
- Dependency Grammar Induction ☆18 · Feb 11, 2019 · Updated 7 years ago
- ☆13 · Aug 20, 2021 · Updated 4 years ago
- Muon is Scalable for LLM Training ☆1,440 · Aug 3, 2025 · Updated 6 months ago
- Llama-3-SynE: A Significantly Enhanced Version of Llama-3 with Advanced Scientific Reasoning and Chinese Language Capabilities | Continued pre-training to improve … ☆37 · May 31, 2025 · Updated 9 months ago
- ☆17 · Mar 5, 2025 · Updated 11 months ago
- ☆62 · Jun 17, 2024 · Updated last year
- Measuring the Signal to Noise Ratio in Language Model Evaluation ☆28 · Aug 19, 2025 · Updated 6 months ago