microsoft/rho

Readme badge preview -

If you own this repo, copy the snippet below and add it to your README.md

[![RelatedRepos](https://img.shields.io/badge/related-repos-yellow)](https://relatedrepos.com/gh/microsoft/rho)

microsoft / rho

Repo for Rho-1: Token-level Data Selection & Selective Pretraining of LLMs.

☆470

Alternatives and similar repositories for rho

Users that are interested in rho are comparing it to the libraries listed below. We may earn a commission when you buy through links labeled 'Ad' on this page.

Sorting:

ZubinGou / math-evaluation-harness
View on GitHub
A simple toolkit for benchmarking LLMs on mathematical reasoning tasks. 🧮✨
☆277Apr 26, 2024Updated 2 years ago
yifanzhang-pro / AutoMathText
View on GitHub
[ACL 2025 Findings] Autonomous Data Selection with Zero-shot Generative Classifiers for Mathematical Texts (https://huggingface.co/papers…
☆92Nov 23, 2025Updated 7 months ago
princeton-nlp / QuRating
View on GitHub
[ICML 2024] Selecting High-Quality Data for Training Language Models
☆204Dec 8, 2025Updated 7 months ago
hkust-nlp / llm-compression-intelligence
View on GitHub
Official github repo for the paper "Compression Represents Intelligence Linearly" [COLM 2024]
☆150Sep 20, 2024Updated last year
OpenBMB / Eurus
View on GitHub
☆322Sep 18, 2024Updated last year
Managed Kubernetes at scale on DigitalOcean • Ad
DigitalOcean Kubernetes includes the control plane, bandwidth allowance, container registry, automatic updates, and more for free.
hkust-nlp / deita
View on GitHub
Deita: Data-Efficient Instruction Tuning for Alignment [ICLR2024]
☆599Dec 9, 2024Updated last year
cxcscmu / MATES
View on GitHub
Official repository for MATES: Model-Aware Data Selection for Efficient Pretraining with Data Influence Models [NeurIPS 2024]
☆80Nov 14, 2024Updated last year
ChengpengLi1003 / DotaMath
View on GitHub
☆30Dec 27, 2024Updated last year
huggingface / cosmopedia
View on GitHub
☆572Nov 20, 2024Updated last year
RUCAIBox / JiuZhang3.0
View on GitHub
The code and data for the paper JiuZhang3.0
☆49May 26, 2024Updated 2 years ago
hkust-nlp / dart-math
View on GitHub
[NeurIPS'24] Official code for *🎯DART-Math: Difficulty-Aware Rejection Tuning for Mathematical Problem-Solving*
☆120Dec 10, 2024Updated last year
microsoft / ToRA
View on GitHub
ToRA is a series of Tool-integrated Reasoning LLM Agents designed to solve challenging mathematical reasoning problems by interacting wit…
☆1,118Feb 22, 2024Updated 2 years ago
Open-Reasoner-Zero / Open-Reasoner-Zero
View on GitHub
Official Repo for Open-Reasoner-Zero
☆2,096Jun 2, 2025Updated last year
ezelikman / quiet-star
View on GitHub
Code for Quiet-STaR
☆739Aug 21, 2024Updated last year
Deploy on Railway without the complexity - Free Credits Offer • Ad
Connect your repo and Railway handles the rest with instant previews. Quickly provision container image services, databases, and storage volumes.
yegcjs / mixinglaws
View on GitHub
☆113Jul 15, 2025Updated last year
locuslab / scaling_laws_data_filtering
View on GitHub
☆64Apr 9, 2024Updated 2 years ago
princeton-nlp / LESS
View on GitHub
[ICML 2024] LESS: Selecting Influential Data for Targeted Instruction Tuning
☆531Oct 20, 2024Updated last year
HKUNLP / ChunkLlama
View on GitHub
[ICML'24] Data and code for our paper "Training-Free Long-Context Scaling of Large Language Models"
☆450Oct 16, 2024Updated last year
hkust-nlp / simpleRL-reason
View on GitHub
Simple RL training for reasoning
☆3,868Dec 23, 2025Updated 6 months ago
allenai / easy-to-hard-generalization
View on GitHub
Code for the arXiv preprint "The Unreasonable Effectiveness of Easy Training Data"
☆48Jan 17, 2024Updated 2 years ago
huggingface / datablations
View on GitHub
Scaling Data-Constrained Language Models
☆344Jun 28, 2025Updated last year
allenai / open-instruct
View on GitHub
AllenAI's post-training codebase
☆3,801Updated this week
TencentARC / LLaMA-Pro
View on GitHub
[ACL 2024] Progressive LLaMA with Block Expansion.
☆513May 20, 2024Updated 2 years ago
Managed Kubernetes at scale on DigitalOcean • Ad
DigitalOcean Kubernetes includes the control plane, bandwidth allowance, container registry, automatic updates, and more for free.
TIGER-AI-Lab / MAmmoTH2
View on GitHub
Official code for "MAmmoTH2: Scaling Instructions from the Web" [NeurIPS 2024]
☆146Oct 27, 2024Updated last year
FranxYao / Long-Context-Data-Engineering
View on GitHub
Implementation of paper Data Engineering for Scaling Language Models to 128K Context
☆501Mar 19, 2024Updated 2 years ago
alon-albalak / data-selection-survey
View on GitHub
A Survey on Data Selection for Language Models
☆260Apr 29, 2025Updated last year
JIA-Lab-research / Step-DPO
View on GitHub
Implementation for "Step-DPO: Step-wise Preference Optimization for Long-chain Reasoning of LLMs"
☆398Jan 19, 2025Updated last year
allenai / dolma
View on GitHub
Data and tools for generating and inspecting OLMo pre-training data.
☆1,526Nov 5, 2025Updated 8 months ago
princeton-nlp / LLM-Shearing
View on GitHub
[ICLR 2024] Sheared LLaMA: Accelerating Language Model Pre-training via Structured Pruning
☆640Mar 4, 2024Updated 2 years ago
lm-sys / llm-decontaminator
View on GitHub
Code for the paper "Rethinking Benchmark and Contamination for Language Models with Rephrased Samples"
☆324Dec 20, 2023Updated 2 years ago
alon-albalak / online-data-mixing
View on GitHub
An implementation of online data mixing for the Pile dataset, based on the GPT-NeoX library.
☆14Jan 9, 2024Updated 2 years ago
tianyi-lab / Superfiltering
View on GitHub
[ACL'24] Superfiltering: Weak-to-Strong Data Filtering for Fast Instruction-Tuning
☆189Jun 25, 2025Updated last year
Deploy on Railway without the complexity - Free Credits Offer • Ad
Connect your repo and Railway handles the rest with instant previews. Quickly provision container image services, databases, and storage volumes.
cmnfriend / O-LoRA
View on GitHub
☆209Jul 13, 2024Updated 2 years ago
TIGER-AI-Lab / MAmmoTH
View on GitHub
Code and data for "MAmmoTH: Building Math Generalist Models through Hybrid Instruction Tuning" [ICLR 2024]
☆383Aug 25, 2024Updated last year
tianyi-lab / Cherry_LLM
View on GitHub
[NAACL'24] Self-data filtering of LLM instruction-tuning data using a novel perplexity-based difficulty score, without using any other mo…
☆416Jun 25, 2025Updated last year
jzhang38 / EasyContext
View on GitHub
Memory optimization and training recipes to extrapolate language models' context length to 1 million tokens, with minimal hardware.
☆759Sep 27, 2024Updated last year
huggingface / datatrove
View on GitHub
Freeing data processing from scripting madness by providing a set of platform-agnostic customizable pipeline processing blocks.
☆3,214Updated this week
sramshetty / mixture-of-depths
View on GitHub
An unofficial implementation of "Mixture-of-Depths: Dynamically allocating compute in transformer-based language models"
☆35Jun 7, 2024Updated 2 years ago
fanqiwan / FuseAI
View on GitHub
FuseAI Project
☆600Jan 25, 2025Updated last year