[ICLR2025 Spotlight🔥] Official Implementation of TokenFormer: Rethinking Transformer Scaling with Tokenized Model Parameters
★592 · Feb 11, 2025 · Updated last year
Alternatives and similar repositories for TokenFormer
Users that are interested in TokenFormer are comparing it to the libraries listed below.
- Minimal implementation of TokenFormer for inference and learning ★13 · Nov 6, 2024 · Updated last year
- [ECCV2024 Oral🔥] Official Implementation of "GiT: Towards Generalist Vision Transformer through Universal Language Interface" ★362 · Jan 14, 2025 · Updated last year
- code for "Diffusion Forcing: Next-token Prediction Meets Full-Sequence Diffusion" ★1,266 · Nov 9, 2025 · Updated 7 months ago
- Memory layers use a trainable key-value lookup mechanism to add extra parameters to a model without increasing FLOPs. Conceptually, spars… ★378 · Dec 12, 2024 · Updated last year
- Code for BLT research paper ★2,046 · Nov 3, 2025 · Updated 8 months ago
- Helpful tools and examples for working with flex-attention ★1,205 · Jun 27, 2026 · Updated last week
- Quick implementation of nGPT, learning entirely on the hypersphere, from NvidiaAI ★300 · Jun 3, 2025 · Updated last year
- Next-Token Prediction is All You Need ★2,423 · Jan 12, 2026 · Updated 5 months ago
- Efficient implementations for emerging model architectures ★5,279 · Updated this week
- Layer-Condensed KV cache w/ 10 times larger batch size, fewer params and less computation. Dramatic speed up with better task performance… ★156 · Apr 7, 2025 · Updated last year
- FlexAttention w/ FlashAttention3 Support ★27 · Oct 5, 2024 · Updated last year
- [ICLR 2025 & COLM 2025] Official PyTorch implementation of the Forgetting Transformer and Adaptive Computation Pruning ★150 · Feb 25, 2026 · Updated 4 months ago
- RWKV-X is a Linear Complexity Hybrid Language Model based on the RWKV architecture, integrating Sparse Attention to improve the model's l… ★58 · Mar 31, 2026 · Updated 3 months ago
- A suite of image and video neural tokenizers ★1,726 · Feb 11, 2025 · Updated last year
- ★93 · Aug 18, 2024 · Updated last year
- Code for Adam-mini: Use Fewer Learning Rates To Gain More https://arxiv.org/abs/2406.16793 ★458 · May 13, 2025 · Updated last year
- [ICLR'25 Oral] Representation Alignment for Generation: Training Diffusion Transformers Is Easier Than You Think ★1,661 · Mar 16, 2025 · Updated last year
- ★52 · Jun 24, 2025 · Updated last year
- Lightning Attention-2: A Free Lunch for Handling Unlimited Sequence Lengths in Large Language Models ★345 · Feb 23, 2025 · Updated last year
- Code for NeurIPS 2024 Spotlight: "Scaling Laws and Compute-Optimal Training Beyond Fixed Training Durations" ★92 · Oct 30, 2024 · Updated last year
- Mamba SSM architecture ★18,534 · Jun 15, 2026 · Updated 2 weeks ago
- PyTorch implementation of MAR+DiffLoss https://arxiv.org/abs/2406.11838 ★1,940 · Feb 20, 2026 · Updated 4 months ago
- Muon is an optimizer for hidden layers in neural networks ★2,683 · May 24, 2026 · Updated last month
- Block Transformer: Global-to-Local Language Modeling for Fast Inference (NeurIPS 2024) ★166 · Apr 13, 2025 · Updated last year
- Pretraining and inference code for a large-scale depth-recurrent language model ★894 · Dec 29, 2025 · Updated 6 months ago
- GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection ★1,698 · Oct 28, 2024 · Updated last year
- Official PyTorch Implementation of "Scalable Diffusion Models with Transformers" ★8,653 · May 31, 2024 · Updated 2 years ago
- Tree Attention: Topology-aware Decoding for Long-Context Attention on GPU clusters ★133 · Dec 3, 2024 · Updated last year
- Repository for Meta Chameleon, a mixed-modal early-fusion foundation model from FAIR. ★2,102 · Jul 29, 2024 · Updated last year
- Code for the paper "Stack Attention: Improving the Ability of Transformers to Model Hierarchical Patterns" ★18 · Mar 15, 2024 · Updated 2 years ago
- Reference implementation of "Softmax Attention with Constant Cost per Token" (Heinsen, 2024) ★25 · Jun 6, 2024 · Updated 2 years ago
- Code for exploring Based models from "Simple linear attention language models balance the recall-throughput tradeoff" ★255 · Jun 6, 2025 · Updated last year
- Official PyTorch implementation for "Large Language Diffusion Models" ★3,842 · Jun 25, 2026 · Updated last week
- H-Net Dynamic Hierarchical Architecture ★81 · Sep 11, 2025 · Updated 9 months ago
- Annotated version of the Mamba paper ★501 · Feb 27, 2024 · Updated 2 years ago
- Official implementation of Next Block Prediction: Video Generation via Semi-Autoregressive Modeling ★42 · Feb 12, 2025 · Updated last year
- Muon is Scalable for LLM Training ★1,496 · Aug 3, 2025 · Updated 11 months ago
- Repo for "Monarch Mixer: A Simple Sub-Quadratic GEMM-Based Architecture" ★562 · Dec 28, 2024 · Updated last year
- Official PyTorch Implementation of "SiT: Exploring Flow and Diffusion-based Generative Models with Scalable Interpolant Transformers" ★1,178 · Dec 22, 2025 · Updated 6 months ago