[ICLR 2025 Spotlight 🔥] Official Implementation of TokenFormer: Rethinking Transformer Scaling with Tokenized Model Parameters
★588, updated Feb 11, 2025
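TokenFormer's core idea is token-parameter attention (Pattention): each linear projection is replaced by attention from input tokens to a set of learnable key/value "parameter tokens", so the model can later be scaled by appending parameter tokens rather than reshaping weight matrices. Below is a minimal sketch of that idea, not the official implementation; the class name is illustrative, and plain scaled softmax is an assumption for simplicity (the paper uses a modified normalization).

```python
import torch
import torch.nn.functional as F

class Pattention(torch.nn.Module):
    """Sketch of token-parameter attention: inputs attend to learnable
    key/value parameter tokens instead of passing through a fixed linear
    layer. Plain softmax is a simplifying assumption; the paper uses a
    modified normalization."""

    def __init__(self, d_in: int, d_out: int, num_param_tokens: int):
        super().__init__()
        # Each "parameter token" is one key/value pair; scaling the model
        # amounts to appending rows to these two tables.
        self.param_keys = torch.nn.Parameter(torch.randn(num_param_tokens, d_in))
        self.param_values = torch.nn.Parameter(torch.randn(num_param_tokens, d_out))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_in) -> attention scores over parameter tokens
        scores = x @ self.param_keys.T / (x.shape[-1] ** 0.5)
        weights = F.softmax(scores, dim=-1)   # (batch, seq, num_param_tokens)
        return weights @ self.param_values    # (batch, seq, d_out)
```

Because attention runs over the parameter axis, a trained model can in principle be grown by concatenating freshly initialized rows to `param_keys`/`param_values` and continuing training, which is the scaling behavior the paper studies.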
Alternatives and similar repositories for TokenFormer
Users interested in TokenFormer are comparing it to the repositories listed below.
- Minimal implementation of TokenFormer for inference and learning (★13, updated Nov 6, 2024)
- Layer-Condensed KV cache with a 10× larger batch size, fewer parameters, and less computation. Dramatic speed-up with better task performance… (★156, updated Apr 7, 2025)
- Code for the BLT (Byte Latent Transformer) research paper (★2,029, updated Nov 3, 2025)
- Memory layers use a trainable key-value lookup mechanism to add extra parameters to a model without increasing FLOPs. Conceptually, sparse… (★372, updated Dec 12, 2024; see the sketch after this list)
- Code for "Diffusion Forcing: Next-token Prediction Meets Full-Sequence Diffusion" (★1,169, updated Nov 9, 2025)
- Helpful tools and examples for working with flex-attention (★1,140, updated Feb 8, 2026)
- 🚀 Efficient implementations of state-of-the-art linear attention models (★4,428, updated this week)
- Next-Token Prediction is All You Need (★2,355, updated Jan 12, 2026)
- RWKV-X is a linear-complexity hybrid language model based on the RWKV architecture, integrating sparse attention to improve the model's l… (★54, updated Jan 12, 2026)
- [ICLR 2025 & COLM 2025] Official PyTorch implementation of the Forgetting Transformer and Adaptive Computation Pruning (★140, updated Feb 25, 2026)
- Code for "Adam-mini: Use Fewer Learning Rates To Gain More" (https://arxiv.org/abs/2406.16793) (★453, updated May 13, 2025)
- Code for the NeurIPS 2024 Spotlight "Scaling Laws and Compute-Optimal Training Beyond Fixed Training Durations" (★92, updated Oct 30, 2024)
- [ECCV 2024 Oral 🔥] Official Implementation of "GiT: Towards Generalist Vision Transformer through Universal Language Interface" (★360, updated Jan 14, 2025)
- FlexAttention with FlashAttention-3 support (★27, updated Oct 5, 2024)
- A suite of image and video neural tokenizers (★1,711, updated Feb 11, 2025)
- GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection (★1,678, updated Oct 28, 2024)
- The official repo of continuous speculative decoding (★31, updated Mar 28, 2025)
- Code for the paper "Stack Attention: Improving the Ability of Transformers to Model Hierarchical Patterns" (★18, updated Mar 15, 2024)
- Muon is an optimizer for hidden layers in neural networks (★2,329, updated Jan 19, 2026)
- Repo for "LoLCATs: On Low-Rank Linearizing of Large Language Models" (★251, updated Jan 31, 2025)
- Mamba SSM architecture (★17,257, updated Feb 18, 2026)
- Code for exploring Based models from "Simple linear attention language models balance the recall-throughput tradeoff" (★248, updated Jun 6, 2025)
- Checkpointable dataset utilities for foundation model training (★32, updated Jan 29, 2024)
- Repo for "Monarch Mixer: A Simple Sub-Quadratic GEMM-Based Architecture" (★562, updated Dec 28, 2024)
- Block Transformer: Global-to-Local Language Modeling for Fast Inference (NeurIPS 2024) (★163, updated Apr 13, 2025)
- [ICLR'25 Oral] Representation Alignment for Generation: Training Diffusion Transformers Is Easier Than You Think (★1,553, updated Mar 16, 2025)
- Repository for Meta Chameleon, a mixed-modal early-fusion foundation model from FAIR (★2,084, updated Jul 29, 2024)
- Tree Attention: Topology-aware Decoding for Long-Context Attention on GPU clusters (★133, updated Dec 3, 2024)
- Code for the paper "Patch-Level Training for Large Language Models" (★96, updated Nov 10, 2025)
- Annotated version of the Mamba paper (★497, updated Feb 27, 2024)
- Official implementation of the ECCV 2024 paper POA (★24, updated Aug 8, 2024)
- Official PyTorch Implementation of "SiT: Exploring Flow and Diffusion-based Generative Models with Scalable Interpolant Transformers" (★1,096, updated Dec 22, 2025)
- Understand and test language model architectures on synthetic tasks (★257, updated Feb 24, 2026)
- [NeurIPS 2024] Official repository of "The Mamba in the Llama: Distilling and Accelerating Hybrid Models" (★238, updated Oct 14, 2025)
- Lightning Attention-2: A Free Lunch for Handling Unlimited Sequence Lengths in Large Language Models (★341, updated Feb 23, 2025)
- Official PyTorch implementation for "Large Language Diffusion Models" (★3,609, updated Nov 12, 2025)
- Pretraining and inference code for a large-scale depth-recurrent language model (★864, updated Dec 29, 2025)
- DeMo: Decoupled Momentum Optimization (★198, updated Dec 2, 2024)
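The memory-layers entry above describes a trainable key-value lookup that adds parameters without adding FLOPs. Here is a minimal sketch of that idea, assuming a flat key table with top-k selection; the actual Meta implementation uses product-key memories and other efficiency tricks, so the class and parameter names here are illustrative only.

```python
import torch
import torch.nn.functional as F

class MemoryLayer(torch.nn.Module):
    """Sketch of a trainable key-value memory lookup. A flat key table
    with top-k selection is a simplifying assumption; the Meta paper
    uses product keys to make the lookup scale to large tables."""

    def __init__(self, d_model: int, num_slots: int, topk: int = 4):
        super().__init__()
        # Large learnable key/value tables: extra capacity lives here.
        self.keys = torch.nn.Parameter(torch.randn(num_slots, d_model))
        self.values = torch.nn.Parameter(torch.randn(num_slots, d_model))
        self.topk = topk

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model) -> similarity to every memory key
        scores = x @ self.keys.T                    # (batch, seq, num_slots)
        w, idx = scores.topk(self.topk, dim=-1)     # keep only the top-k slots
        w = F.softmax(w, dim=-1)                    # (batch, seq, topk)
        v = self.values[idx]                        # gather: (batch, seq, topk, d_model)
        return (w.unsqueeze(-1) * v).sum(dim=-2)    # sparse weighted sum
```

Because only `topk` of the `num_slots` value rows are gathered per token, the parameter count grows with the table size while per-token compute stays roughly constant, which is the trade-off that entry describes.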