Haiyang-W / TokenFormer
[ICLR2025 Spotlight🔥] Official Implementation of TokenFormer: Rethinking Transformer Scaling with Tokenized Model Parameters
⭐ 584 · Updated 11 months ago
Alternatives and similar repositories for TokenFormer
Users interested in TokenFormer are comparing it to the libraries listed below.
- Official JAX implementation of Learning to (Learn at Test Time): RNNs with Expressive Hidden States ⭐ 452 · Updated 3 months ago
- Implementation of the sparse attention pattern proposed by the Deepseek team in their "Native Sparse Attention" paper ⭐ 796 · Updated 5 months ago
- H-Net: Hierarchical Network with Dynamic Chunking ⭐ 810 · Updated 2 months ago
- [ICLR 2025 Oral] Block Diffusion: Interpolating Between Autoregressive and Diffusion Language Models ⭐ 949 · Updated 6 months ago
- ⭐ 307 · Updated 9 months ago
- Quick implementation of nGPT, learning entirely on the hypersphere, from NvidiaAI ⭐ 293 · Updated 8 months ago
- Code for Adam-mini: Use Fewer Learning Rates To Gain More (https://arxiv.org/abs/2406.16793) ⭐ 451 · Updated 8 months ago
- [ICLR 2025] Official PyTorch Implementation of Gated Delta Networks: Improving Mamba2 with Delta Rule ⭐ 446 · Updated 4 months ago
- [NeurIPS 2025 Spotlight] TPA: Tensor ProducT ATTenTion Transformer (T6) (https://arxiv.org/abs/2501.06425) ⭐ 445 · Updated last week
- When it comes to optimizers, it's always better to be safe than sorry ⭐ 402 · Updated 4 months ago
- ⭐ 658 · Updated 9 months ago
- Official PyTorch implementation of Learning to (Learn at Test Time): RNNs with Expressive Hidden States ⭐ 1,315 · Updated last year
- [ICML 2024 Best Paper] Discrete Diffusion Modeling by Estimating the Ratios of the Data Distribution (https://arxiv.org/abs/2310.16834) ⭐ 698 · Updated last year
- Helpful tools and examples for working with flex-attention ⭐ 1,116 · Updated 2 weeks ago
- Implementation of ST-MoE, the latest incarnation of MoE after years of research at Brain, in Pytorch ⭐ 378 · Updated last year
- [NeurIPS 2024] Simple and Effective Masked Diffusion Language Model ⭐ 616 · Updated 4 months ago
- [ICML2024 (Oral)] Official PyTorch implementation of DoRA: Weight-Decomposed Low-Rank Adaptation ⭐ 930 · Updated last year
- Normalized Transformer (nGPT) ⭐ 198 · Updated last year
- Muon is an optimizer for hidden layers in neural networks ⭐ 2,242 · Updated 2 weeks ago
- Implementation of 💍 Ring Attention, from Liu et al. at Berkeley AI, in Pytorch ⭐ 549 · Updated 8 months ago
- [ICLR 2025] Samba: Simple Hybrid State Space Models for Efficient Unlimited Context Language Modeling ⭐ 943 · Updated 2 months ago
- ⭐ 579 · Updated 4 months ago
- Implementation of Soft MoE, proposed by Brain's Vision team, in Pytorch ⭐ 343 · Updated 10 months ago
- Some preliminary explorations of Mamba's context scaling. ⭐ 218 · Updated last year
- Annotated version of the Mamba paper ⭐ 495 · Updated last year
- Official PyTorch implementation for ICLR2025 paper "Scaling up Masked Diffusion Models on Text" ⭐ 364 · Updated last year
- PyTorch implementation of Infini-Transformer from "Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention… ⭐ 294 · Updated last year
- Pretraining and inference code for a large-scale depth-recurrent language model ⭐ 861 · Updated last month
- [NeurIPS 2024] Official Repository of The Mamba in the Llama: Distilling and Accelerating Hybrid Models ⭐ 237 · Updated 3 months ago
- ⭐ 206 · Updated last year