Haiyang-W / TokenFormer
[ICLR 2025 Spotlight 🔥] Official Implementation of TokenFormer: Rethinking Transformer Scaling with Tokenized Model Parameters
☆ 581 · Updated 11 months ago
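
The repository's headline idea, as the title suggests, is to treat model parameters as tokens that input tokens attend to, so capacity can grow by appending parameter tokens rather than resizing weight matrices. Below is a minimal, unofficial PyTorch sketch of that token-parameter attention idea; `TokenParamAttention` and its arguments are illustrative names, and the plain softmax here stands in for the modified normalization the paper describes.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TokenParamAttention(nn.Module):
    """Input tokens attend over a learnable set of key/value parameter
    tokens in place of a fixed linear projection; adding rows to
    key_params/value_params grows capacity without changing the
    input/output dimensions."""

    def __init__(self, dim: int, num_param_tokens: int):
        super().__init__()
        self.key_params = nn.Parameter(torch.randn(num_param_tokens, dim) * dim ** -0.5)
        self.value_params = nn.Parameter(torch.zeros(num_param_tokens, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim); scores: (batch, seq_len, num_param_tokens)
        scores = x @ self.key_params.t() / (x.shape[-1] ** 0.5)
        # Plain softmax as a stand-in; the paper describes a modified normalization.
        weights = F.softmax(scores, dim=-1)
        return weights @ self.value_params  # (batch, seq_len, dim)
```

For example, `TokenParamAttention(dim=512, num_param_tokens=1024)(torch.randn(2, 16, 512))` returns a `(2, 16, 512)` tensor; growing `num_param_tokens` adds capacity while the module's interface stays fixed.
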
Alternatives and similar repositories for TokenFormer
Users interested in TokenFormer are comparing it to the repositories listed below:
- ☆ 304 · Updated 8 months ago
- [ICLR 2025 Oral] Block Diffusion: Interpolating Between Autoregressive and Diffusion Language Models · ☆ 940 · Updated 6 months ago
- Official JAX implementation of Learning to (Learn at Test Time): RNNs with Expressive Hidden States · ☆ 446 · Updated 2 months ago
- Implementation of the sparse attention pattern proposed by the DeepSeek team in their "Native Sparse Attention" paper · ☆ 791 · Updated 5 months ago
- H-Net: Hierarchical Network with Dynamic Chunking · ☆ 801 · Updated last month
- Quick implementation of nGPT, learning entirely on the hypersphere, from NvidiaAI · ☆ 294 · Updated 7 months ago
- [NeurIPS 2025 Spotlight] TPA: Tensor ProducT ATTenTion Transformer (T6) (https://arxiv.org/abs/2501.06425) · ☆ 445 · Updated this week
- Muon is an optimizer for hidden layers in neural networks · ☆ 2,179 · Updated last month
- [ICLR 2025] Official PyTorch Implementation of Gated Delta Networks: Improving Mamba2 with Delta Rule · ☆ 421 · Updated 4 months ago
- ☆ 655 · Updated 9 months ago
- When it comes to optimizers, it's always better to be safe than sorry · ☆ 397 · Updated 3 months ago
- [ICML 2024 (Oral)] Official PyTorch implementation of DoRA: Weight-Decomposed Low-Rank Adaptation · ☆ 920 · Updated last year
- Code for Adam-mini: Use Fewer Learning Rates To Gain More (https://arxiv.org/abs/2406.16793) · ☆ 449 · Updated 8 months ago
- Implementation of 💍 Ring Attention, from Liu et al. at Berkeley AI, in Pytorch · ☆ 549 · Updated 8 months ago
- Official PyTorch implementation of Learning to (Learn at Test Time): RNNs with Expressive Hidden States · ☆ 1,300 · Updated last year
- [NeurIPS 2024] Simple and Effective Masked Diffusion Language Model · ☆ 605 · Updated 3 months ago
- Implementation of Soft MoE, proposed by Brain's Vision team, in Pytorch · ☆ 342 · Updated 9 months ago
- Helpful tools and examples for working with flex-attention · ☆ 1,108 · Updated this week
- Implementation of ST-MoE, the latest incarnation of MoE after years of research at Brain, in Pytorch · ☆ 375 · Updated last year
- Normalized Transformer (nGPT) · ☆ 195 · Updated last year
- ☆ 578 · Updated 3 months ago
- [ICLR 2025] DiffuGPT and DiffuLLaMA: Scaling Diffusion Language Models via Adaptation from Autoregressive Models · ☆ 359 · Updated 7 months ago
- Dream 7B, a large diffusion language model · ☆ 1,139 · Updated last month
- Pretraining and inference code for a large-scale depth-recurrent language model · ☆ 859 · Updated 2 weeks ago
- [ICML 2024 Best Paper] Discrete Diffusion Modeling by Estimating the Ratios of the Data Distribution (https://arxiv.org/abs/2310.16834) · ☆ 689 · Updated last year
- [ECCV 2024] Official PyTorch implementation of RoPE-ViT "Rotary Position Embedding for Vision Transformer" · ☆ 433 · Updated 2 months ago
- Annotated version of the Mamba paper · ☆ 494 · Updated last year
- Mixture-of-Recursions: Learning Dynamic Recursive Depths for Adaptive Token-Level Computation (NeurIPS 2025) · ☆ 530 · Updated 3 months ago
- [NeurIPS 2024] Official Repository of The Mamba in the Llama: Distilling and Accelerating Hybrid Models · ☆ 234 · Updated 3 months ago
- Official PyTorch implementation for ICLR 2025 paper "Scaling up Masked Diffusion Models on Text" · ☆ 356 · Updated last year