Haiyang-W / TokenFormer
[ICLR2025 Spotlight🔥] Official Implementation of TokenFormer: Rethinking Transformer Scaling with Tokenized Model Parameters
⭐579 · Updated 10 months ago
Alternatives and similar repositories for TokenFormer
Users who are interested in TokenFormer are comparing it to the repositories listed below.
- [ICLR 2025 Oral] Block Diffusion: Interpolating Between Autoregressive and Diffusion Language Models ⭐920 · Updated 5 months ago
- Official JAX implementation of Learning to (Learn at Test Time): RNNs with Expressive Hidden States ⭐436 · Updated last month
- H-Net: Hierarchical Network with Dynamic Chunking ⭐797 · Updated last month
- ⭐303 · Updated 8 months ago
- Implementation of the sparse attention pattern proposed by the DeepSeek team in their "Native Sparse Attention" paper ⭐790 · Updated 4 months ago
- Implementation of ST-MoE, the latest incarnation of MoE after years of research at Brain, in PyTorch ⭐374 · Updated last year
- Quick implementation of nGPT, learning entirely on the hypersphere, from NvidiaAI ⭐294 · Updated 6 months ago
- [NeurIPS 2025 Spotlight] TPA: Tensor ProducT ATTenTion Transformer (T6) (https://arxiv.org/abs/2501.06425) ⭐435 · Updated last week
- [NeurIPS 2024] Official Repository of The Mamba in the Llama: Distilling and Accelerating Hybrid Models ⭐232 · Updated 2 months ago
- [ICML2024 (Oral)] Official PyTorch implementation of DoRA: Weight-Decomposed Low-Rank Adaptation ⭐897 · Updated last year
- [ICLR 2025] Official PyTorch Implementation of Gated Delta Networks: Improving Mamba2 with Delta Rule ⭐404 · Updated 3 months ago
- [NeurIPS 2024] Simple and Effective Masked Diffusion Language Model ⭐590 · Updated 2 months ago
- Code for Adam-mini: Use Fewer Learning Rates To Gain More https://arxiv.org/abs/2406.16793 ⭐445 · Updated 7 months ago
- ⭐647 · Updated 8 months ago
- Implementation of 💍 Ring Attention, from Liu et al. at Berkeley AI, in PyTorch ⭐549 · Updated 7 months ago
- Muon is an optimizer for hidden layers in neural networks ⭐2,116 · Updated last month
- When it comes to optimizers, it's always better to be safe than sorry ⭐397 · Updated 3 months ago
- [ICML 2024 Best Paper] Discrete Diffusion Modeling by Estimating the Ratios of the Data Distribution (https://arxiv.org/abs/2310.16834) ⭐676 · Updated last year
- Implementation of Soft MoE, proposed by Brain's Vision team, in PyTorch ⭐337 · Updated 8 months ago
- [ICLR 2025] Samba: Simple Hybrid State Space Models for Efficient Unlimited Context Language Modeling ⭐933 · Updated last month
- Official PyTorch implementation of Learning to (Learn at Test Time): RNNs with Expressive Hidden States ⭐1,294 · Updated last year
- Helpful tools and examples for working with flex-attention ⭐1,089 · Updated last week
- ⭐565 · Updated 3 months ago
- Pretraining and inference code for a large-scale depth-recurrent language model ⭐856 · Updated 2 months ago
- Official PyTorch implementation for ICLR2025 paper "Scaling up Masked Diffusion Models on Text" ⭐351 · Updated last year
- [ECCV 2024] Official PyTorch implementation of RoPE-ViT "Rotary Position Embedding for Vision Transformer" ⭐427 · Updated last month
- Dream 7B, a large diffusion language model ⭐1,115 · Updated last month
- Official Implementation for the paper "d1: Scaling Reasoning in Diffusion Large Language Models via Reinforcement Learning" ⭐387 · Updated this week
- [ICLR2025] DiffuGPT and DiffuLLaMA: Scaling Diffusion Language Models via Adaptation from Autoregressive Models ⭐352 · Updated 6 months ago
- ⭐205 · Updated last year