ash-neupane / multi-token-pred
Train toy models using multi-token prediction objective
☆14 · Updated last year
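The repository's topic, the multi-token prediction objective, trains a model to predict several future tokens from each position instead of only the next one. A minimal sketch of such a loss is below, assuming a simple setup (shared hidden states, one linear head per future offset, and a NumPy-only toy model); the variable names and head structure are illustrative, not taken from the repository.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def multi_token_loss(hidden, heads, tokens):
    """Cross-entropy averaged over all prediction heads.

    hidden: (T, d) hidden states, one per input position
    heads:  list of (d, V) projection matrices; head i at
            position t predicts tokens[t + i + 1]
    tokens: (T,) integer token ids for the full sequence
    """
    T, _ = hidden.shape
    total, count = 0.0, 0
    for i, W in enumerate(heads):
        valid = T - (i + 1)          # positions with a target i+1 steps ahead
        if valid <= 0:
            continue
        probs = softmax(hidden[:valid] @ W)       # (valid, V)
        tgt = tokens[i + 1 : i + 1 + valid]       # (valid,)
        total += -np.log(probs[np.arange(valid), tgt] + 1e-12).sum()
        count += valid
    return total / count

# Toy example: vocab 10, hidden size 8, sequence length 6, 2 future heads.
V, d, T, k = 10, 8, 6, 2
hidden = rng.normal(size=(T, d))
heads = [rng.normal(size=(d, V)) for _ in range(k)]
tokens = rng.integers(0, V, size=T)
loss = multi_token_loss(hidden, heads, tokens)
print(loss)
```

With `k = 1` this reduces to the standard next-token cross-entropy; larger `k` adds auxiliary losses on tokens further ahead, which is the core idea the toy-model repo explores.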
Alternatives and similar repositories for multi-token-pred
Users that are interested in multi-token-pred are comparing it to the libraries listed below
- A Sober Look at Language Model Reasoning ☆92 · Updated 2 months ago
- Official implementation of ICLR'2025 paper: Rethinking Bradley-Terry Models in Preference-based Reward Modeling: Foundations, Theory, and… ☆70 · Updated 10 months ago
- [ICLR 2025] When Attention Sink Emerges in Language Models: An Empirical View (Spotlight) ☆154 · Updated 7 months ago
- ☆48 · Updated 10 months ago
- Representation Surgery for Multi-Task Model Merging. ICML, 2024. ☆47 · Updated last year
- ☆75 · Updated 7 months ago
- [ICLR 2023] "Sparse MoE as the New Dropout: Scaling Dense and Self-Slimmable Transformers" by Tianlong Chen*, Zhenyu Zhang*, Ajay Jaiswal… ☆56 · Updated 2 years ago
- ☆33 · Updated 2 months ago
- [EMNLP 2023, Main Conference] Sparse Low-rank Adaptation of Pre-trained Language Models ☆84 · Updated last year
- PhyX: Does Your Model Have the "Wits" for Physical Reasoning? ☆50 · Updated last month
- ☆78 · Updated last year
- [ICLR 2025] MiniPLM: Knowledge Distillation for Pre-Training Language Models ☆73 · Updated last year
- Code for "Reasoning to Learn from Latent Thoughts" ☆124 · Updated 10 months ago
- The official implementation of "DAPE: Data-Adaptive Positional Encoding for Length Extrapolation" ☆41 · Updated last year
- Code for ICLR 2025 paper "What is Wrong with Perplexity for Long-context Language Modeling?" ☆109 · Updated 4 months ago
- [NeurIPS 2024] A Novel Rank-Based Metric for Evaluating Large Language Models ☆57 · Updated 8 months ago
- ☆108 · Updated last year
- [NAACL 2025] A Closer Look into Mixture-of-Experts in Large Language Models ☆60 · Updated last year
- [ACL'25] We propose a novel fine-tuning method, Separate Memory and Reasoning, which combines prompt tuning with LoRA. ☆84 · Updated 3 months ago
- LISA: Layerwise Importance Sampling for Memory-Efficient Large Language Model Fine-Tuning ☆36 · Updated last year
- Less is More: Task-aware Layer-wise Distillation for Language Model Compression (ICML 2023) ☆40 · Updated 2 years ago
- Optimizing Anytime Reasoning via Budget Relative Policy Optimization ☆51 · Updated 6 months ago
- A family of efficient edge language models in 100M~1B sizes. ☆19 · Updated 11 months ago
- A curated list of resources on Reinforcement Learning with Verifiable Rewards (RLVR) and the reasoning capability boundary of Large Langu… ☆85 · Updated 2 months ago
- [EVA ICLR'23; LARA ICML'22] Efficient attention mechanisms via control variates, random features, and importance sampling ☆87 · Updated 2 years ago
- Test-time training on nearest neighbors for large language models ☆49 · Updated last year
- [NeurIPS 2024] 📈 Scaling Laws with Vocabulary: Larger Models Deserve Larger Vocabularies https://arxiv.org/abs/2407.13623 ☆89 · Updated last year
- Reference implementation for Token-level Direct Preference Optimization (TDPO) ☆151 · Updated 11 months ago
- ☆144 · Updated 11 months ago
- ☆34 · Updated 9 months ago