[ICLR2025 Spotlight🔥] Official Implementation of TokenFormer: Rethinking Transformer Scaling with Tokenized Model Parameters
★590 · Feb 11, 2025 · Updated last year
Alternatives and similar repositories for TokenFormer
Users that are interested in TokenFormer are comparing it to the libraries listed below.
- Minimal implementation of TokenFormer for inference and learning ★13 · Nov 6, 2024 · Updated last year
- [ECCV2024 Oral🔥] Official Implementation of "GiT: Towards Generalist Vision Transformer through Universal Language Interface" ★362 · Jan 14, 2025 · Updated last year
- Code for "Diffusion Forcing: Next-token Prediction Meets Full-Sequence Diffusion" ★1,256 · Nov 9, 2025 · Updated 7 months ago
- Memory layers use a trainable key-value lookup mechanism to add extra parameters to a model without increasing FLOPs. Conceptually, spars… ★377 · Dec 12, 2024 · Updated last year
- Code for BLT research paper ★2,045 · Nov 3, 2025 · Updated 7 months ago
- Helpful tools and examples for working with flex-attention ★1,195 · May 28, 2026 · Updated 2 weeks ago
- Quick implementation of nGPT, learning entirely on the hypersphere, from NvidiaAI ★294 · Jun 3, 2025 · Updated last year
- Next-Token Prediction is All You Need ★2,417 · Jan 12, 2026 · Updated 5 months ago
- 🚀 Efficient implementations for emerging model architectures ★5,206 · Jun 6, 2026 · Updated last week
- Layer-Condensed KV cache w/ 10 times larger batch size, fewer params and less computation. Dramatic speed up with better task performance… ★156 · Apr 7, 2025 · Updated last year
- FlexAttention w/ FlashAttention3 Support ★27 · Oct 5, 2024 · Updated last year
- [ICLR 2025 & COLM 2025] Official PyTorch implementation of the Forgetting Transformer and Adaptive Computation Pruning ★149 · Feb 25, 2026 · Updated 3 months ago
- RWKV-X is a Linear Complexity Hybrid Language Model based on the RWKV architecture, integrating Sparse Attention to improve the model's l… ★57 · Mar 31, 2026 · Updated 2 months ago
- A suite of image and video neural tokenizers ★1,724 · Feb 11, 2025 · Updated last year
- ★93 · Aug 18, 2024 · Updated last year
- Code for Adam-mini: Use Fewer Learning Rates To Gain More https://arxiv.org/abs/2406.16793 ★458 · May 13, 2025 · Updated last year
- [ICLR'25 Oral] Representation Alignment for Generation: Training Diffusion Transformers Is Easier Than You Think ★1,640 · Mar 16, 2025 · Updated last year
- ★52 · Jun 24, 2025 · Updated 11 months ago
- Lightning Attention-2: A Free Lunch for Handling Unlimited Sequence Lengths in Large Language Models ★345 · Feb 23, 2025 · Updated last year
- Code for NeurIPS 2024 Spotlight: "Scaling Laws and Compute-Optimal Training Beyond Fixed Training Durations" ★92 · Oct 30, 2024 · Updated last year
- Mamba SSM architecture ★18,429 · Updated this week
- PyTorch implementation of MAR+DiffLoss https://arxiv.org/abs/2406.11838 ★1,932 · Feb 20, 2026 · Updated 3 months ago
- Muon is an optimizer for hidden layers in neural networks ★2,656 · May 24, 2026 · Updated 2 weeks ago
- Block Transformer: Global-to-Local Language Modeling for Fast Inference (NeurIPS 2024) ★166 · Apr 13, 2025 · Updated last year
- Pretraining and inference code for a large-scale depth-recurrent language model ★894 · Dec 29, 2025 · Updated 5 months ago
- GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection ★1,695 · Oct 28, 2024 · Updated last year
- Official PyTorch Implementation of "Scalable Diffusion Models with Transformers" ★8,617 · May 31, 2024 · Updated 2 years ago
- Tree Attention: Topology-aware Decoding for Long-Context Attention on GPU clusters ★134 · Dec 3, 2024 · Updated last year
- Repository for Meta Chameleon, a mixed-modal early-fusion foundation model from FAIR. ★2,097 · Jul 29, 2024 · Updated last year
- Code for the paper "Stack Attention: Improving the Ability of Transformers to Model Hierarchical Patterns" ★18 · Mar 15, 2024 · Updated 2 years ago
- Reference implementation of "Softmax Attention with Constant Cost per Token" (Heinsen, 2024) ★25 · Jun 6, 2024 · Updated 2 years ago
- Code for exploring Based models from "Simple linear attention language models balance the recall-throughput tradeoff" ★255 · Jun 6, 2025 · Updated last year
- Official PyTorch implementation for "Large Language Diffusion Models" ★3,821 · Nov 12, 2025 · Updated 7 months ago
- H-Net Dynamic Hierarchical Architecture ★81 · Sep 11, 2025 · Updated 9 months ago
- Annotated version of the Mamba paper ★501 · Feb 27, 2024 · Updated 2 years ago
- Official implementation of Next Block Prediction: Video Generation via Semi-Autoregressive Modeling ★42 · Feb 12, 2025 · Updated last year
- Repo for "Monarch Mixer: A Simple Sub-Quadratic GEMM-Based Architecture" ★563 · Dec 28, 2024 · Updated last year
- Muon is Scalable for LLM Training ★1,492 · Aug 3, 2025 · Updated 10 months ago
- Official PyTorch Implementation of "SiT: Exploring Flow and Diffusion-based Generative Models with Scalable Interpolant Transformers" ★1,170 · Dec 22, 2025 · Updated 5 months ago