Haiyang-W / TokenFormer
[ICLR2025 Spotlight🔥] Official Implementation of TokenFormer: Rethinking Transformer Scaling with Tokenized Model Parameters
⭐ 587 · Updated last year
Alternatives and similar repositories for TokenFormer
Users interested in TokenFormer are comparing it to the repositories listed below.
- Official JAX implementation of Learning to (Learn at Test Time): RNNs with Expressive Hidden States ⭐ 453 · Updated 3 months ago
- Implementation of the sparse attention pattern proposed by the DeepSeek team in their "Native Sparse Attention" paper ⭐ 797 · Updated 5 months ago
- [ICLR 2025 Oral] Block Diffusion: Interpolating Between Autoregressive and Diffusion Language Models ⭐ 950 · Updated 7 months ago
- H-Net: Hierarchical Network with Dynamic Chunking ⭐ 812 · Updated 2 months ago
- ⭐ 307 · Updated 9 months ago
- ⭐ 661 · Updated 10 months ago
- Quick implementation of nGPT, learning entirely on the hypersphere, from NvidiaAI ⭐ 293 · Updated 8 months ago
- Implementation of ST-MoE, the latest incarnation of MoE after years of research at Brain, in Pytorch ⭐ 378 · Updated last year
- Official PyTorch implementation of Learning to (Learn at Test Time): RNNs with Expressive Hidden States ⭐ 1,318 · Updated last year
- [ICML2024 (Oral)] Official PyTorch implementation of DoRA: Weight-Decomposed Low-Rank Adaptation ⭐ 931 · Updated last year
- [NeurIPS 2025 Spotlight] TPA: Tensor ProducT ATTenTion Transformer (T6) (https://arxiv.org/abs/2501.06425) ⭐ 445 · Updated 2 weeks ago
- [ICLR 2025] Official PyTorch Implementation of Gated Delta Networks: Improving Mamba2 with Delta Rule ⭐ 452 · Updated 4 months ago
- [NeurIPS 2024] Official Repository of The Mamba in the Llama: Distilling and Accelerating Hybrid Models ⭐ 236 · Updated 3 months ago
- Implementation of Soft MoE, proposed by Brain's Vision team, in Pytorch ⭐ 344 · Updated 10 months ago
- When it comes to optimizers, it's always better to be safe than sorry ⭐ 402 · Updated 4 months ago
- [NeurIPS 2024] Simple and Effective Masked Diffusion Language Model ⭐ 619 · Updated 4 months ago
- [ICML 2024 Best Paper] Discrete Diffusion Modeling by Estimating the Ratios of the Data Distribution (https://arxiv.org/abs/2310.16834) ⭐ 700 · Updated last year
- Code for Adam-mini: Use Fewer Learning Rates To Gain More (https://arxiv.org/abs/2406.16793) ⭐ 452 · Updated 8 months ago
- Muon is an optimizer for hidden layers in neural networks ⭐ 2,267 · Updated 3 weeks ago
- Helpful tools and examples for working with flex-attention ⭐ 1,127 · Updated this week
- Official PyTorch implementation for ICLR2025 paper "Scaling up Masked Diffusion Models on Text" ⭐ 364 · Updated last year
- Pretraining and inference code for a large-scale depth-recurrent language model ⭐ 863 · Updated last month
- Implementation of 💍 Ring Attention, from Liu et al. at Berkeley AI, in Pytorch ⭐ 549 · Updated 8 months ago
- Annotated version of the Mamba paper ⭐ 496 · Updated last year
- ⭐ 208 · Updated last year
- [ICLR 2025] Samba: Simple Hybrid State Space Models for Efficient Unlimited Context Language Modeling ⭐ 944 · Updated 2 months ago
- Normalized Transformer (nGPT) ⭐ 198 · Updated last year
- [ECCV 2024] Official PyTorch implementation of RoPE-ViT "Rotary Position Embedding for Vision Transformer" ⭐ 436 · Updated 3 months ago
- [ICLR2025] DiffuGPT and DiffuLLaMA: Scaling Diffusion Language Models via Adaptation from Autoregressive Models ⭐ 362 · Updated 8 months ago
- ⭐ 579 · Updated 4 months ago