[ICLR2025 Spotlight🔥] Official Implementation of TokenFormer: Rethinking Transformer Scaling with Tokenized Model Parameters
☆589 · Feb 11, 2025 · Updated last year
Alternatives and similar repositories for TokenFormer
Users that are interested in TokenFormer are comparing it to the libraries listed below.
- Minimal implementation of TokenFormer for inference and learning ☆13 · Nov 6, 2024 · Updated last year
- [ECCV2024 Oral🔥] Official Implementation of "GiT: Towards Generalist Vision Transformer through Universal Language Interface" ☆360 · Jan 14, 2025 · Updated last year
- code for "Diffusion Forcing: Next-token Prediction Meets Full-Sequence Diffusion" ☆1,193 · Nov 9, 2025 · Updated 4 months ago
- Memory layers use a trainable key-value lookup mechanism to add extra parameters to a model without increasing FLOPs. Conceptually, spars… ☆374 · Dec 12, 2024 · Updated last year
- Code for BLT research paper ☆2,030 · Nov 3, 2025 · Updated 4 months ago
- Helpful tools and examples for working with flex-attention ☆1,161 · Feb 8, 2026 · Updated last month
- Next-Token Prediction is All You Need ☆2,374 · Jan 12, 2026 · Updated 2 months ago
- 🚀 Efficient implementations of state-of-the-art linear attention models ☆4,630 · Updated this week
- Layer-Condensed KV cache w/ 10 times larger batch size, fewer params and less computation. Dramatic speed up with better task performance… ☆157 · Apr 7, 2025 · Updated 11 months ago
- FlexAttention w/ FlashAttention3 Support ☆27 · Oct 5, 2024 · Updated last year
- [ICLR 2025 & COLM 2025] Official PyTorch implementation of the Forgetting Transformer and Adaptive Computation Pruning ☆144 · Feb 25, 2026 · Updated 3 weeks ago
- RWKV-X is a Linear Complexity Hybrid Language Model based on the RWKV architecture, integrating Sparse Attention to improve the model's l… ☆56 · Updated this week
- A suite of image and video neural tokenizers ☆1,716 · Feb 11, 2025 · Updated last year
- ☆91 · Aug 18, 2024 · Updated last year
- Code for Adam-mini: Use Fewer Learning Rates To Gain More https://arxiv.org/abs/2406.16793 ☆453 · May 13, 2025 · Updated 10 months ago
- Lightning Attention-2: A Free Lunch for Handling Unlimited Sequence Lengths in Large Language Models ☆341 · Feb 23, 2025 · Updated last year
- ☆52 · Jun 24, 2025 · Updated 9 months ago
- Code for NeurIPS 2024 Spotlight: "Scaling Laws and Compute-Optimal Training Beyond Fixed Training Durations" ☆92 · Oct 30, 2024 · Updated last year
- [ICLR'25 Oral] Representation Alignment for Generation: Training Diffusion Transformers Is Easier Than You Think ☆1,585 · Mar 16, 2025 · Updated last year
- Muon is an optimizer for hidden layers in neural networks ☆2,398 · Jan 19, 2026 · Updated 2 months ago
- Mamba SSM architecture ☆17,524 · Updated this week
- PyTorch implementation of MAR+DiffLoss https://arxiv.org/abs/2406.11838 ☆1,879 · Feb 20, 2026 · Updated last month
- Block Transformer: Global-to-Local Language Modeling for Fast Inference (NeurIPS 2024) ☆163 · Apr 13, 2025 · Updated 11 months ago
- Pretraining and inference code for a large-scale depth-recurrent language model ☆865 · Dec 29, 2025 · Updated 2 months ago
- GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection ☆1,681 · Oct 28, 2024 · Updated last year
- Official PyTorch Implementation of "Scalable Diffusion Models with Transformers" ☆8,433 · May 31, 2024 · Updated last year
- Tree Attention: Topology-aware Decoding for Long-Context Attention on GPU clusters ☆133 · Dec 3, 2024 · Updated last year
- Repository for Meta Chameleon, a mixed-modal early-fusion foundation model from FAIR. ☆2,087 · Jul 29, 2024 · Updated last year
- Code for the paper "Stack Attention: Improving the Ability of Transformers to Model Hierarchical Patterns" ☆18 · Mar 15, 2024 · Updated 2 years ago
- Code for exploring Based models from "Simple linear attention language models balance the recall-throughput tradeoff" ☆250 · Jun 6, 2025 · Updated 9 months ago
- Official PyTorch implementation for "Large Language Diffusion Models" ☆3,682 · Nov 12, 2025 · Updated 4 months ago
- Annotated version of the Mamba paper ☆499 · Feb 27, 2024 · Updated 2 years ago
- H-Net Dynamic Hierarchical Architecture ☆81 · Sep 11, 2025 · Updated 6 months ago
- Official implementation of Next Block Prediction: Video Generation via Semi-Autoregressive Modeling ☆42 · Feb 12, 2025 · Updated last year
- Official PyTorch Implementation of "SiT: Exploring Flow and Diffusion-based Generative Models with Scalable Interpolant Transformers" ☆1,120 · Dec 22, 2025 · Updated 3 months ago
- The official repo of continuous speculative decoding ☆32 · Mar 28, 2025 · Updated 11 months ago
- Repo for "Monarch Mixer: A Simple Sub-Quadratic GEMM-Based Architecture" ☆563 · Dec 28, 2024 · Updated last year
- ☆19 · Dec 4, 2025 · Updated 3 months ago
- ☆308 · Apr 23, 2025 · Updated 11 months ago