[ICLR2025 Spotlight🔥] Official Implementation of TokenFormer: Rethinking Transformer Scaling with Tokenized Model Parameters
★590 · Feb 11, 2025 · Updated last year
Alternatives and similar repositories for TokenFormer
Users that are interested in TokenFormer are comparing it to the libraries listed below.
- Minimal implementation of TokenFormer for inference and learning ★13 · Nov 6, 2024 · Updated last year
- [ECCV2024 Oral🔥] Official Implementation of "GiT: Towards Generalist Vision Transformer through Universal Language Interface" ★362 · Jan 14, 2025 · Updated last year
- Code for "Diffusion Forcing: Next-token Prediction Meets Full-Sequence Diffusion" ★1,248 · Nov 9, 2025 · Updated 6 months ago
- Memory layers use a trainable key-value lookup mechanism to add extra parameters to a model without increasing FLOPs. Conceptually, spars… ★376 · Dec 12, 2024 · Updated last year
- Code for BLT research paper ★2,041 · Nov 3, 2025 · Updated 6 months ago
- Helpful tools and examples for working with flex-attention ★1,187 · Apr 13, 2026 · Updated last month
- Quick implementation of nGPT, learning entirely on the hypersphere, from NvidiaAI ★294 · Jun 3, 2025 · Updated 11 months ago
- Next-Token Prediction is All You Need ★2,409 · Jan 12, 2026 · Updated 4 months ago
- Efficient implementations for emerging model architectures ★5,116 · May 17, 2026 · Updated last week
- Layer-Condensed KV cache w/ 10 times larger batch size, fewer params and less computation. Dramatic speed up with better task performance… ★156 · Apr 7, 2025 · Updated last year
- FlexAttention w/ FlashAttention3 Support ★27 · Oct 5, 2024 · Updated last year
- [ICLR 2025 & COLM 2025] Official PyTorch implementation of the Forgetting Transformer and Adaptive Computation Pruning ★150 · Feb 25, 2026 · Updated 2 months ago
- RWKV-X is a Linear Complexity Hybrid Language Model based on the RWKV architecture, integrating Sparse Attention to improve the model's l… ★57 · Mar 31, 2026 · Updated last month
- A suite of image and video neural tokenizers ★1,726 · Feb 11, 2025 · Updated last year
- ★92 · Aug 18, 2024 · Updated last year
- Code for Adam-mini: Use Fewer Learning Rates To Gain More https://arxiv.org/abs/2406.16793 ★458 · May 13, 2025 · Updated last year
- [ICLR'25 Oral] Representation Alignment for Generation: Training Diffusion Transformers Is Easier Than You Think ★1,628 · Mar 16, 2025 · Updated last year
- ★52 · Jun 24, 2025 · Updated 11 months ago
- Lightning Attention-2: A Free Lunch for Handling Unlimited Sequence Lengths in Large Language Models ★344 · Feb 23, 2025 · Updated last year
- Code for NeurIPS 2024 Spotlight: "Scaling Laws and Compute-Optimal Training Beyond Fixed Training Durations" ★92 · Oct 30, 2024 · Updated last year
- Mamba SSM architecture ★18,275 · May 10, 2026 · Updated 2 weeks ago
- Muon is an optimizer for hidden layers in neural networks ★2,595 · Jan 19, 2026 · Updated 4 months ago
- PyTorch implementation of MAR+DiffLoss https://arxiv.org/abs/2406.11838 ★1,913 · Feb 20, 2026 · Updated 3 months ago
- Block Transformer: Global-to-Local Language Modeling for Fast Inference (NeurIPS 2024) ★166 · Apr 13, 2025 · Updated last year
- Pretraining and inference code for a large-scale depth-recurrent language model ★886 · Dec 29, 2025 · Updated 4 months ago
- GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection ★1,693 · Oct 28, 2024 · Updated last year
- Official PyTorch Implementation of "Scalable Diffusion Models with Transformers" ★8,579 · May 31, 2024 · Updated last year
- Tree Attention: Topology-aware Decoding for Long-Context Attention on GPU clusters ★133 · Dec 3, 2024 · Updated last year
- Repository for Meta Chameleon, a mixed-modal early-fusion foundation model from FAIR. ★2,096 · Jul 29, 2024 · Updated last year
- Code for the paper "Stack Attention: Improving the Ability of Transformers to Model Hierarchical Patterns" ★18 · Mar 15, 2024 · Updated 2 years ago
- Reference implementation of "Softmax Attention with Constant Cost per Token" (Heinsen, 2024) ★25 · Jun 6, 2024 · Updated last year
- Code for exploring Based models from "Simple linear attention language models balance the recall-throughput tradeoff" ★254 · Jun 6, 2025 · Updated 11 months ago
- Official PyTorch implementation for "Large Language Diffusion Models" ★3,795 · Nov 12, 2025 · Updated 6 months ago
- H-Net Dynamic Hierarchical Architecture ★81 · Sep 11, 2025 · Updated 8 months ago
- Annotated version of the Mamba paper ★501 · Feb 27, 2024 · Updated 2 years ago
- Official implementation of Next Block Prediction: Video Generation via Semi-Autoregressive Modeling ★42 · Feb 12, 2025 · Updated last year
- Repo for "Monarch Mixer: A Simple Sub-Quadratic GEMM-Based Architecture" ★563 · Dec 28, 2024 · Updated last year
- Muon is Scalable for LLM Training ★1,480 · Aug 3, 2025 · Updated 9 months ago
- Official PyTorch Implementation of "SiT: Exploring Flow and Diffusion-based Generative Models with Scalable Interpolant Transformers" ★1,166 · Dec 22, 2025 · Updated 5 months ago