ash-neupane / multi-token-pred
Train toy models using a multi-token prediction objective
☆13 Updated last year
Alternatives and similar repositories for multi-token-pred
Users who are interested in multi-token-pred are comparing it to the repositories listed below
- Official implementation of the ICLR 2025 paper: Rethinking Bradley-Terry Models in Preference-based Reward Modeling: Foundations, Theory, and… ☆66 Updated 7 months ago
- ☆47 Updated 6 months ago
- ☆53 Updated 8 months ago
- ☆64 Updated 4 months ago
- Reinforcing General Reasoning without Verifiers ☆91 Updated 4 months ago
- A Sober Look at Language Model Reasoning ☆87 Updated 3 weeks ago
- [ICLR 2025] When Attention Sink Emerges in Language Models: An Empirical View (Spotlight) ☆132 Updated 3 months ago
- The code for creating the iGSM datasets in papers "Physics of Language Models Part 2.1, Grade-School Math and the Hidden Reasoning Proces… ☆79 Updated 9 months ago
- Code for ICLR 2025 Paper "What is Wrong with Perplexity for Long-context Language Modeling?" ☆103 Updated 3 weeks ago
- ☆103 Updated last year
- Reproduction of "RLCD Reinforcement Learning from Contrast Distillation for Language Model Alignment" ☆69 Updated 2 years ago
- [EVA ICLR'23; LARA ICML'22] Efficient attention mechanisms via control variates, random features, and importance sampling ☆87 Updated 2 years ago
- ☆34 Updated last year
- [ICLR 2025] MiniPLM: Knowledge Distillation for Pre-Training Language Models ☆64 Updated 11 months ago
- M1: Towards Scalable Test-Time Compute with Mamba Reasoning Models ☆44 Updated 3 months ago
- ☆75 Updated 11 months ago
- Official implementation of Bootstrapping Language Models via DPO Implicit Rewards ☆44 Updated 6 months ago
- [NAACL 2025] A Closer Look into Mixture-of-Experts in Large Language Models ☆55 Updated 8 months ago
- [ICLR 2023] "Sparse MoE as the New Dropout: Scaling Dense and Self-Slimmable Transformers" by Tianlong Chen*, Zhenyu Zhang*, Ajay Jaiswal… ☆55 Updated 2 years ago
- ☆95 Updated 8 months ago
- Kinetics: Rethinking Test-Time Scaling Laws ☆81 Updated 3 months ago
- ☆104 Updated last month
- ☆28 Updated 4 months ago
- ☆129 Updated 7 months ago
- Reference implementation for Token-level Direct Preference Optimization (TDPO) ☆148 Updated 8 months ago
- [ICLR2025] Codebase for "ReMoE: Fully Differentiable Mixture-of-Experts with ReLU Routing", built on Megatron-LM. ☆97 Updated 10 months ago
- Exploration of automated dataset selection approaches at large scales. ☆48 Updated 8 months ago
- A curated list of awesome resources dedicated to Scaling Laws for LLMs ☆79 Updated 2 years ago
- Exploring whether LLMs perform case-based or rule-based reasoning ☆29 Updated last year
- The source code of "Merging Experts into One: Improving Computational Efficiency of Mixture of Experts (EMNLP 2023)" ☆40 Updated last year