raymin0223 / mixture_of_recursions
Mixture-of-Recursions: Learning Dynamic Recursive Depths for Adaptive Token-Level Thinking
☆64 · Updated last month
Alternatives and similar repositories for mixture_of_recursions
Users interested in mixture_of_recursions are comparing it to the libraries listed below.
- ☆88 · Updated last month
- ☆75 · Updated last week
- ☆51 · Updated last week
- Remasking Discrete Diffusion Models with Inference-Time Scaling ☆34 · Updated 4 months ago
- [ICML 2025] Fourier Position Embedding: Enhancing Attention’s Periodic Extension for Length Generalization ☆76 · Updated last month
- The official code implementation for the paper "R2R: Efficiently Navigating Divergent Reasoning Paths with Small-Large Model Token Routing" ☆39 · Updated this week
- ☆174 · Updated 3 weeks ago
- ☆91 · Updated 2 months ago
- An efficient implementation of the NSA (Native Sparse Attention) kernel ☆90 · Updated 3 weeks ago
- [NeurIPS 2024] 📈 Scaling Laws with Vocabulary: Larger Models Deserve Larger Vocabularies (https://arxiv.org/abs/2407.13623) ☆86 · Updated 9 months ago
- ☆86 · Updated last month
- [ICLR 2025] Official PyTorch implementation of "Forgetting Transformer: Softmax Attention with a Forget Gate" ☆116 · Updated 2 weeks ago
- Triton implementation of bi-directional (non-causal) linear attention ☆52 · Updated 5 months ago
- [ICLR 2025] When Attention Sink Emerges in Language Models: An Empirical View (Spotlight) ☆99 · Updated last week
- Official Implementation of LaViDa: A Large Diffusion Language Model for Multimodal Understanding ☆115 · Updated last month
- [ICLR 2025] Official PyTorch Implementation of "Mix-LN: Unleashing the Power of Deeper Layers by Combining Pre-LN and Post-LN" by Pengxia… ☆25 · Updated 6 months ago
- LongLLaVA: Scaling Multi-modal LLMs to 1000 Images Efficiently via Hybrid Architecture ☆206 · Updated 6 months ago
- [ICLR 2025] Codebase for "ReMoE: Fully Differentiable Mixture-of-Experts with ReLU Routing", built on Megatron-LM. ☆83 · Updated 7 months ago
- Dimple, the first Discrete Diffusion Multimodal Large Language Model ☆78 · Updated last week
- [NeurIPS 2024] Official Repository of The Mamba in the Llama: Distilling and Accelerating Hybrid Models ☆223 · Updated 2 months ago
- ✈️ [ICCV 2025] Towards Stabilized and Efficient Diffusion Transformers through Long-Skip-Connections with Spectral Constraints ☆71 · Updated last week
- [ICLR 2025] Source code for the paper "A Spark of Vision-Language Intelligence: 2-Dimensional Autoregressive Transformer for Efficient Finegr… ☆76 · Updated 7 months ago
- Official implementation of "Fast-dLLM: Training-free Acceleration of Diffusion LLM by Enabling KV Cache and Parallel Decoding" ☆296 · Updated 2 weeks ago
- A collection of papers on discrete diffusion models ☆152 · Updated 2 weeks ago
- The official code of "VL-Rethinker: Incentivizing Self-Reflection of Vision-Language Models with Reinforcement Learning" ☆128 · Updated last month
- Official PyTorch implementation of the paper "dLLM-Cache: Accelerating Diffusion Large Language Models with Adaptive Caching" (dLLM-Cache… ☆128 · Updated this week
- Locality-aware Parallel Decoding for Efficient Autoregressive Image Generation ☆59 · Updated last week
- Matryoshka Multimodal Models ☆111 · Updated 5 months ago
- Code for the paper "Vamba: Understanding Hour-Long Videos with Hybrid Mamba-Transformers" [ICCV 2025] ☆73 · Updated 3 weeks ago
- Large Language Diffusion with Ordered Unmasking ☆38 · Updated last month