Haiyang-W / TokenFormer
[ICLR2025 Spotlight🔥] Official Implementation of TokenFormer: Rethinking Transformer Scaling with Tokenized Model Parameters
☆587 · Updated last year
Alternatives and similar repositories for TokenFormer
Users who are interested in TokenFormer compare it to the repositories listed below.
- Official JAX implementation of Learning to (Learn at Test Time): RNNs with Expressive Hidden States ☆452 · Updated 3 months ago
- [ICLR 2025 Oral] Block Diffusion: Interpolating Between Autoregressive and Diffusion Language Models ☆950 · Updated 7 months ago
- Implementation of the sparse attention pattern proposed by the Deepseek team in their "Native Sparse Attention" paper ☆797 · Updated 5 months ago
- ☆307 · Updated 9 months ago
- H-Net: Hierarchical Network with Dynamic Chunking ☆810 · Updated 2 months ago
- ☆661 · Updated 9 months ago
- [ICML2024 (Oral)] Official PyTorch implementation of DoRA: Weight-Decomposed Low-Rank Adaptation ☆931 · Updated last year
- Official PyTorch implementation of Learning to (Learn at Test Time): RNNs with Expressive Hidden States ☆1,318 · Updated last year
- [NeurIPS 2025 Spotlight] TPA: Tensor ProducT ATTenTion Transformer (T6) (https://arxiv.org/abs/2501.06425) ☆445 · Updated 2 weeks ago
- Helpful tools and examples for working with flex-attention ☆1,118 · Updated 3 weeks ago
- Code for Adam-mini: Use Fewer Learning Rates To Gain More https://arxiv.org/abs/2406.16793 ☆452 · Updated 8 months ago
- Muon is an optimizer for hidden layers in neural networks ☆2,267 · Updated 3 weeks ago
- When it comes to optimizers, it's always better to be safe than sorry ☆402 · Updated 4 months ago
- Implementation of ST-Moe, the latest incarnation of MoE after years of research at Brain, in Pytorch ☆378 · Updated last year
- Quick implementation of nGPT, learning entirely on the hypersphere, from NvidiaAI ☆293 · Updated 8 months ago
- [ICML 2024 Best Paper] Discrete Diffusion Modeling by Estimating the Ratios of the Data Distribution (https://arxiv.org/abs/2310.16834) ☆700 · Updated last year
- Implementation of Soft MoE, proposed by Brain's Vision team, in Pytorch ☆344 · Updated 10 months ago
- Pretraining and inference code for a large-scale depth-recurrent language model ☆863 · Updated last month
- [ICLR 2025] Official PyTorch Implementation of Gated Delta Networks: Improving Mamba2 with Delta Rule ☆452 · Updated 4 months ago
- [NeurIPS 2024] Simple and Effective Masked Diffusion Language Model ☆619 · Updated 4 months ago
- Implementation of Ring Attention, from Liu et al. at Berkeley AI, in Pytorch ☆549 · Updated 8 months ago
- ☆579 · Updated 4 months ago
- Official PyTorch implementation for ICLR2025 paper "Scaling up Masked Diffusion Models on Text" ☆364 · Updated last year
- [ICLR2025] DiffuGPT and DiffuLLaMA: Scaling Diffusion Language Models via Adaptation from Autoregressive Models ☆362 · Updated 8 months ago
- Official Implementation for the paper "d1: Scaling Reasoning in Diffusion Large Language Models via Reinforcement Learning" ☆402 · Updated 2 weeks ago
- [NeurIPS 2024] Official Repository of The Mamba in the Llama: Distilling and Accelerating Hybrid Models ☆236 · Updated 3 months ago
- Dream 7B, a large diffusion language model ☆1,164 · Updated 2 months ago
- [ICLR 2025] Samba: Simple Hybrid State Space Models for Efficient Unlimited Context Language Modeling ☆944 · Updated 2 months ago
- Annotated version of the Mamba paper ☆496 · Updated last year
- Pytorch implementation of Transfusion, "Predict the Next Token and Diffuse Images with One Multi-Modal Model", from MetaAI ☆1,322 · Updated 2 weeks ago