The RedStone repository includes code for preparing extensive datasets used in training large language models.
☆160Mar 26, 2026Updated 2 weeks ago
Alternatives and similar repositories for RedStone
Users who are interested in RedStone are comparing it to the libraries listed below.
- Heuristic filtering framework for RefineCode☆83Mar 13, 2025Updated last year
- ☆13Aug 20, 2021Updated 4 years ago
- ☆225Oct 27, 2025Updated 5 months ago
- ☆172May 2, 2024Updated last year
- DataComp for Language Models☆1,430Sep 9, 2025Updated 7 months ago
- ☆63Jun 12, 2025Updated 9 months ago
- PEACE: Empowering Geologic Map Holistic Understanding with MLLMs [Official, CVPR 2025]☆82Updated this week
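- WanJuan3.0 (“万卷·丝路”, WanJuan Silk Road) is a comprehensive plain-text corpus that collects publicly available web information, literature, patents, and other materials from multiple countries and regions. The total data size exceeds 1.2 TB, with more than 300B tokens, placing it at an internationally leading level. The first open-source release consists mainly of five subsets (Thai, Russian, Arabic, Korean, and Vietnamese); the data of each subset…☆44Feb 13, 2025Updated last year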
- Code for paper: "Executing Arithmetic: Fine-Tuning Large Language Models as Turing Machines"☆11Oct 11, 2024Updated last year
- ☆109Jul 15, 2025Updated 8 months ago
- My Implementation of Q-Sparse: All Large Language Models can be Fully Sparsely-Activated☆34Aug 14, 2024Updated last year
- ☆567Nov 20, 2024Updated last year
- Ongoing research project for code&math LLMs☆29Jul 4, 2025Updated 9 months ago
- Llama-3-SynE: A Significantly Enhanced Version of Llama-3 with Advanced Scientific Reasoning and Chinese Language Capabilities | Continual pre-training to improve …☆40May 31, 2025Updated 10 months ago
- PLM: Efficient Peripheral Language Models Hardware-Co-Designed for Ubiquitous Computing☆21Mar 18, 2025Updated last year
- Implementation of paper Data Engineering for Scaling Language Models to 128K Context☆497Mar 19, 2024Updated 2 years ago
- ☆99Feb 11, 2026Updated 2 months ago
- Official Repo for Open-Reasoner-Zero☆2,089Jun 2, 2025Updated 10 months ago
- Large-Scale High-quality Chinese Web Text with Multi-dimensional and fine-grained information☆38Dec 2, 2024Updated last year
- LongAttn: Selecting Long-context Training Data via Token-level Attention☆15Jul 16, 2025Updated 8 months ago
- Code and data for paper "Context-faithful Prompting for Large Language Models".☆42Mar 23, 2023Updated 3 years ago
- triton ver of gqa flash attn, based on the tutorial☆12Aug 4, 2024Updated last year
- EMNLP 2025 | TongSearch-QR☆43Dec 4, 2025Updated 4 months ago
- ☆52May 19, 2025Updated 10 months ago
- ☆64Apr 9, 2024Updated 2 years ago
- ☆51May 11, 2025Updated 11 months ago
- A robust web archive analytics toolkit☆135Apr 2, 2026Updated last week
- ☆43Nov 1, 2024Updated last year
- Muon is Scalable for LLM Training☆1,453Aug 3, 2025Updated 8 months ago
- Advancing LLM with Diverse Coding Capabilities☆80Jul 25, 2024Updated last year
- DeepSeek-V3.2-Exp DSA Warmup Lightning Indexer training operator based on tilelang☆44Nov 19, 2025Updated 4 months ago
- ☆15May 23, 2022Updated 3 years ago
- Our code for ICLR'25 paper "DataMan: Data Manager for Pre-training Large Language Models".☆122Feb 7, 2026Updated 2 months ago
- [NeurIPS'24] Official code for *🎯DART-Math: Difficulty-Aware Rejection Tuning for Mathematical Problem-Solving*☆121Dec 10, 2024Updated last year
- ☆47Dec 30, 2024Updated last year
- Use the tokenizer in parallel to achieve superior acceleration☆20Mar 21, 2024Updated 2 years ago
- Math24o: High School Olympiad Mathematics Chinese Benchmark☆11Mar 27, 2025Updated last year
- [COLM 2025] An Open Math Pre-training Dataset with 370B Tokens.☆110Apr 4, 2025Updated last year
- [ICLR 2025] 🧬 RegMix: Data Mixture as Regression for Language Model Pre-training (Spotlight)☆189Feb 17, 2025Updated last year