The RedStone repository includes code for preparing extensive datasets used in training large language models.
☆161Jan 22, 2026Updated last month
Alternatives and similar repositories for RedStone
Users who are interested in RedStone compare it with the repositories listed below.
- Heuristic filtering framework for RefineCode☆83Mar 13, 2025Updated last year
- ☆213Oct 27, 2025Updated 4 months ago
- ☆167May 2, 2024Updated last year
- DataComp for Language Models☆1,426Sep 9, 2025Updated 6 months ago
- ☆63Jun 12, 2025Updated 9 months ago
- PEACE: Empowering Geologic Map Holistic Understanding with MLLMs [Official, CVPR 2025]☆76Feb 11, 2026Updated last month
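- WanJuan3.0 ("WanJuan Silk Road"): a comprehensive plain-text corpus that collects publicly available web content, literature, patents, and other material from multiple countries and regions, with a total data size exceeding 1.2 TB and more than 300B tokens, at an internationally leading level. The first open-source release consists mainly of five subsets (Thai, Russian, Arabic, Korean, and Vietnamese); each subset's data…☆43Feb 13, 2025Updated last year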
- Code for paper: "Executing Arithmetic: Fine-Tuning Large Language Models as Turing Machines"☆11Oct 11, 2024Updated last year
- ☆109Jul 15, 2025Updated 8 months ago
- My Implementation of Q-Sparse: All Large Language Models can be Fully Sparsely-Activated☆34Aug 14, 2024Updated last year
- ☆567Nov 20, 2024Updated last year
- Llama-3-SynE: A Significantly Enhanced Version of Llama-3 with Advanced Scientific Reasoning and Chinese Language Capabilities | Continued pre-training to improve …☆38May 31, 2025Updated 9 months ago
- PLM: Efficient Peripheral Language Models Hardware-Co-Designed for Ubiquitous Computing☆20Mar 18, 2025Updated last year
- Implementation of paper Data Engineering for Scaling Language Models to 128K Context☆490Mar 19, 2024Updated 2 years ago
- ☆97Feb 11, 2026Updated last month
- Official Repo for Open-Reasoner-Zero☆2,086Jun 2, 2025Updated 9 months ago
- Large-Scale High-quality Chinese Web Text with Multi-dimensional and fine-grained information☆38Dec 2, 2024Updated last year
- LongAttn: Selecting Long-context Training Data via Token-level Attention☆15Jul 16, 2025Updated 8 months ago
- Code and data for paper "Context-faithful Prompting for Large Language Models".☆42Mar 23, 2023Updated 2 years ago
- triton ver of gqa flash attn, based on the tutorial☆12Aug 4, 2024Updated last year
- ☆52May 19, 2025Updated 10 months ago
- EMNLP 2025 | TongSearch-QR☆41Dec 4, 2025Updated 3 months ago
- ☆64Apr 9, 2024Updated last year
- A robust web archive analytics toolkit☆134Oct 15, 2025Updated 5 months ago
- ☆43Nov 1, 2024Updated last year
- Muon is Scalable for LLM Training☆1,446Aug 3, 2025Updated 7 months ago
- Advancing LLM with Diverse Coding Capabilities☆80Jul 25, 2024Updated last year
- DeepSeek-V3.2-Exp DSA Warmup Lightning Indexer training operator based on tilelang☆44Nov 19, 2025Updated 4 months ago
- ☆15May 23, 2022Updated 3 years ago
- Use the tokenizer in parallel to achieve superior acceleration☆20Mar 21, 2024Updated 2 years ago
- ☆47Dec 30, 2024Updated last year
- Math24o: High School Olympiad Mathematics Chinese Benchmark☆11Mar 27, 2025Updated 11 months ago
- [ICLR 2025] 🧬 RegMix: Data Mixture as Regression for Language Model Pre-training (Spotlight)☆186Feb 17, 2025Updated last year
- [COLM 2025] An Open Math Pre-training Dataset with 370B Tokens.☆110Apr 4, 2025Updated 11 months ago
- LCA-on-the-line (ICML 2024 Oral)☆13Feb 13, 2025Updated last year
- Muon fsdp 2☆55Aug 8, 2025Updated 7 months ago
- ☆133Jun 6, 2025Updated 9 months ago
- A platform to display the carbon neutralization information for researchers, decision-makers, and other participants in the community.☆18Aug 16, 2022Updated 3 years ago
- Web Content Extraction Benchmark☆22Dec 16, 2025Updated 3 months ago