[ICLR2025 Spotlight🔥] Official Implementation of TokenFormer: Rethinking Transformer Scaling with Tokenized Model Parameters
⭐588 · Feb 11, 2025 · Updated last year
Alternatives and similar repositories for TokenFormer
Users who are interested in TokenFormer are comparing it to the libraries listed below.
- Minimal implementation of TokenFormer for inference and learning ⭐13 · Nov 6, 2024 · Updated last year
- [ECCV2024 Oral🔥] Official Implementation of "GiT: Towards Generalist Vision Transformer through Universal Language Interface" ⭐362 · Jan 14, 2025 · Updated last year
- Code for "Diffusion Forcing: Next-token Prediction Meets Full-Sequence Diffusion" ⭐1,230 · Nov 9, 2025 · Updated 5 months ago
- Memory layers use a trainable key-value lookup mechanism to add extra parameters to a model without increasing FLOPs. Conceptually, spars… (see the first sketch after this list) ⭐375 · Dec 12, 2024 · Updated last year
- Code for BLT research paper ⭐2,036 · Nov 3, 2025 · Updated 6 months ago
- Helpful tools and examples for working with flex-attention ⭐1,182 · Apr 13, 2026 · Updated 3 weeks ago
- Quick implementation of nGPT, learning entirely on the hypersphere, from NvidiaAI (see the second sketch after this list) ⭐293 · Jun 3, 2025 · Updated 11 months ago
- Next-Token Prediction is All You Need ⭐2,402 · Jan 12, 2026 · Updated 3 months ago
- Efficient implementations for emerging model architectures ⭐5,032 · Updated this week
- Layer-Condensed KV cache w/ 10 times larger batch size, fewer params and less computation. Dramatic speed up with better task performance… ⭐156 · Apr 7, 2025 · Updated last year
- FlexAttention w/ FlashAttention3 Support ⭐27 · Oct 5, 2024 · Updated last year
- [ICLR 2025 & COLM 2025] Official PyTorch implementation of the Forgetting Transformer and Adaptive Computation Pruning ⭐150 · Feb 25, 2026 · Updated 2 months ago
- RWKV-X is a Linear Complexity Hybrid Language Model based on the RWKV architecture, integrating Sparse Attention to improve the model's l… ⭐57 · Mar 31, 2026 · Updated last month
- A suite of image and video neural tokenizers ⭐1,723 · Feb 11, 2025 · Updated last year
- ⭐92 · Aug 18, 2024 · Updated last year
- Code for Adam-mini: Use Fewer Learning Rates To Gain More https://arxiv.org/abs/2406.16793 ⭐457 · May 13, 2025 · Updated 11 months ago
- ⭐53 · Jun 24, 2025 · Updated 10 months ago
- Lightning Attention-2: A Free Lunch for Handling Unlimited Sequence Lengths in Large Language Models ⭐344 · Feb 23, 2025 · Updated last year
- Code for NeurIPS 2024 Spotlight: "Scaling Laws and Compute-Optimal Training Beyond Fixed Training Durations" ⭐92 · Oct 30, 2024 · Updated last year
- [ICLR'25 Oral] Representation Alignment for Generation: Training Diffusion Transformers Is Easier Than You Think ⭐1,621 · Mar 16, 2025 · Updated last year
- Mamba SSM architecture ⭐18,118 · Apr 27, 2026 · Updated last week
- PyTorch implementation of MAR+DiffLoss https://arxiv.org/abs/2406.11838 ⭐1,904 · Feb 20, 2026 · Updated 2 months ago
- Muon is an optimizer for hidden layers in neural networks ⭐2,544 · Jan 19, 2026 · Updated 3 months ago
- Block Transformer: Global-to-Local Language Modeling for Fast Inference (NeurIPS 2024) ⭐167 · Apr 13, 2025 · Updated last year
- Pretraining and inference code for a large-scale depth-recurrent language model ⭐879 · Dec 29, 2025 · Updated 4 months ago
- GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection ⭐1,690 · Oct 28, 2024 · Updated last year
- Official PyTorch Implementation of "Scalable Diffusion Models with Transformers" ⭐8,539 · May 31, 2024 · Updated last year
- Tree Attention: Topology-aware Decoding for Long-Context Attention on GPU clusters ⭐133 · Dec 3, 2024 · Updated last year
- Repository for Meta Chameleon, a mixed-modal early-fusion foundation model from FAIR. ⭐2,094 · Jul 29, 2024 · Updated last year
- Code for the paper "Stack Attention: Improving the Ability of Transformers to Model Hierarchical Patterns" ⭐18 · Mar 15, 2024 · Updated 2 years ago
- Code for exploring Based models from "Simple linear attention language models balance the recall-throughput tradeoff" ⭐252 · Jun 6, 2025 · Updated 10 months ago
- Reference implementation of "Softmax Attention with Constant Cost per Token" (Heinsen, 2024) ⭐25 · Jun 6, 2024 · Updated last year
- Official PyTorch implementation for "Large Language Diffusion Models" ⭐3,758 · Nov 12, 2025 · Updated 5 months ago
- H-Net Dynamic Hierarchical Architecture ⭐81 · Sep 11, 2025 · Updated 7 months ago
- Annotated version of the Mamba paper ⭐501 · Feb 27, 2024 · Updated 2 years ago
- Official implementation of Next Block Prediction: Video Generation via Semi-Autoregressive Modeling ⭐41 · Feb 12, 2025 · Updated last year
- Repo for "Monarch Mixer: A Simple Sub-Quadratic GEMM-Based Architecture" ⭐562 · Dec 28, 2024 · Updated last year
- Official PyTorch Implementation of "SiT: Exploring Flow and Diffusion-based Generative Models with Scalable Interpolant Transformers" ⭐1,160 · Dec 22, 2025 · Updated 4 months ago
- ⭐19 · Dec 4, 2025 · Updated 5 months ago
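
The memory-layers entry above describes a trainable key-value lookup that adds parameters without adding FLOPs. Below is a minimal illustrative sketch of that general idea in PyTorch: each token's query selects the top-k keys and returns a softmax-weighted sum of only those k value rows, so parameters grow with `num_keys` while per-token compute stays roughly constant. The `KeyValueMemory` module, its parameter names, and the plain top-k selection are assumptions for illustration, not that repository's actual API (which scales the lookup further, e.g. via product-key schemes).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class KeyValueMemory(nn.Module):
    """Illustrative sparse key-value memory layer (not the repo's API)."""

    def __init__(self, dim: int, num_keys: int = 4096, topk: int = 32):
        super().__init__()
        self.query_proj = nn.Linear(dim, dim)
        # Trainable keys; num_keys can grow without changing per-token FLOPs.
        self.keys = nn.Parameter(torch.randn(num_keys, dim) * dim ** -0.5)
        self.values = nn.Embedding(num_keys, dim)  # large value table, gathered sparsely
        self.topk = topk

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, dim)
        q = self.query_proj(x)
        scores = q @ self.keys.t()                       # (batch, seq, num_keys)
        top_scores, top_idx = scores.topk(self.topk, dim=-1)
        weights = F.softmax(top_scores, dim=-1)          # (batch, seq, topk)
        vals = self.values(top_idx)                      # gather only topk value rows
        return (weights.unsqueeze(-1) * vals).sum(dim=-2)

mem = KeyValueMemory(dim=64)
out = mem(torch.randn(2, 16, 64))   # -> (2, 16, 64)
```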
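
Similarly, the nGPT entry refers to learning "entirely on the hypersphere": hidden states are kept at unit norm, and each residual update interpolates toward the (normalized) block output and re-normalizes, rather than adding raw residuals followed by LayerNorm. A minimal sketch of that update, assuming unit-norm inputs; the `sphere_step` helper and its `lr` parameter are illustrative names, not the repo's API:

```python
import torch
import torch.nn.functional as F

def sphere_step(h: torch.Tensor, block_out: torch.Tensor, lr: float = 0.1) -> torch.Tensor:
    # h: unit-norm hidden state, block_out: attention/MLP block output; both (..., dim).
    target = F.normalize(block_out, dim=-1)
    # Move a fraction lr toward the target, then retract back onto the unit sphere.
    return F.normalize(h + lr * (target - h), dim=-1)

h = F.normalize(torch.randn(2, 16, 64), dim=-1)   # start on the unit sphere
h = sphere_step(h, torch.randn(2, 16, 64))        # update stays on the unit sphere
```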