The RedStone repository includes code for preparing extensive datasets used in training large language models.
☆161Apr 21, 2026Updated 2 weeks ago
Alternatives and similar repositories for RedStone
Users that are interested in RedStone are comparing it to the libraries listed below.
- Heuristic filtering framework for RefineCode☆84Mar 13, 2025Updated last year
- ☆13Aug 20, 2021Updated 4 years ago
- ☆228Oct 27, 2025Updated 6 months ago
- ☆171May 2, 2024Updated 2 years ago
- DataComp for Language Models☆1,439Sep 9, 2025Updated 7 months ago
- ☆63Jun 12, 2025Updated 10 months ago
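- WanJuan3.0 ("万卷·丝路", WanJuan Silk Road) is a comprehensive plain-text corpus that collects publicly available web information, literature, patents, and other materials from multiple countries and regions. Its total data size exceeds 1.2 TB and its total token count exceeds 300B, placing it at an internationally leading level. The first open-source release consists mainly of five subsets (Thai, Russian, Arabic, Korean, and Vietnamese); the data of each subset…☆46Feb 13, 2025Updated last year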
- Code for paper: "Executing Arithmetic: Fine-Tuning Large Language Models as Turing Machines"☆11Oct 11, 2024Updated last year
- ☆110Jul 15, 2025Updated 9 months ago
- My Implementation of Q-Sparse: All Large Language Models can be Fully Sparsely-Activated☆35Aug 14, 2024Updated last year
- ☆567Nov 20, 2024Updated last year
- Ongoing research project for code&math LLMs☆31Jul 4, 2025Updated 10 months ago
- Llama-3-SynE: A Significantly Enhanced Version of Llama-3 with Advanced Scientific Reasoning and Chinese Language Capabilities | Continued pre-training to improve …☆40May 31, 2025Updated 11 months ago
- PLM: Efficient Peripheral Language Models Hardware-Co-Designed for Ubiquitous Computing☆21Mar 18, 2025Updated last year
- Implementation of paper Data Engineering for Scaling Language Models to 128K Context☆497Mar 19, 2024Updated 2 years ago
- ☆101Feb 11, 2026Updated 2 months ago
- Official Repo for Open-Reasoner-Zero☆2,093Jun 2, 2025Updated 11 months ago
- Large-Scale High-quality Chinese Web Text with Multi-dimensional and fine-grained information☆38Dec 2, 2024Updated last year
- LongAttn: Selecting Long-context Training Data via Token-level Attention☆15Jul 16, 2025Updated 9 months ago
- Code and data for paper "Context-faithful Prompting for Large Language Models".☆42Mar 23, 2023Updated 3 years ago
- Triton version of GQA flash attention, based on the tutorial☆12Aug 4, 2024Updated last year
- ☆52May 19, 2025Updated 11 months ago
- EMNLP 2025 | TongSearch-QR☆44Dec 4, 2025Updated 5 months ago
- ☆64Apr 9, 2024Updated 2 years ago
- ☆52May 11, 2025Updated 11 months ago
- A robust web archive analytics toolkit☆137Apr 28, 2026Updated last week
- ☆43Nov 1, 2024Updated last year
- Advancing LLM with Diverse Coding Capabilities☆79Jul 25, 2024Updated last year
- Muon is Scalable for LLM Training☆1,469Aug 3, 2025Updated 9 months ago
- DeepSeek-V3.2-Exp DSA Warmup Lightning Indexer training operator based on tilelang☆44Nov 19, 2025Updated 5 months ago
- ☆14May 23, 2022Updated 3 years ago
- Our code for ICLR'25 paper "DataMan: Data Manager for Pre-training Large Language Models".☆123Feb 7, 2026Updated 2 months ago
- [NeurIPS'24] Official code for *🎯DART-Math: Difficulty-Aware Rejection Tuning for Mathematical Problem-Solving*☆121Dec 10, 2024Updated last year
- Use the tokenizer in parallel to achieve superior acceleration☆20Mar 21, 2024Updated 2 years ago
- Math24o: High School Olympiad Mathematics Chinese Benchmark☆11Mar 27, 2025Updated last year
- [COLM 2025] An Open Math Pre-training Dataset with 370B Tokens.☆109Apr 4, 2025Updated last year
- [ICLR 2025] 🧬 RegMix: Data Mixture as Regression for Language Model Pre-training (Spotlight)☆190Feb 17, 2025Updated last year
- ☆48Dec 30, 2024Updated last year
- LCA-on-the-line (ICML 2024 Oral)☆14Feb 13, 2025Updated last year