facebookresearch / Mixture-of-Transformers
Mixture-of-Transformers: A Sparse and Scalable Architecture for Multi-Modal Foundation Models. TMLR 2025.
☆88 Updated 2 months ago
Alternatives and similar repositories for Mixture-of-Transformers
Users who are interested in Mixture-of-Transformers compare it to the repositories listed below.
- Official PyTorch Implementation for Vision-Language Models Create Cross-Modal Task Representations, ICML 2025 ☆29 Updated 3 months ago
- ☆83 Updated 11 months ago
- RWKV is an RNN with transformer-level LLM performance. It can be directly trained like a GPT (parallelizable). So it's combining the best… ☆51 Updated 4 months ago
- Python Library to evaluate VLM models' robustness across diverse benchmarks ☆210 Updated 2 weeks ago
- Tiny re-implementation of MDM in style of LLaDA and nano-gpt speedrun ☆55 Updated 4 months ago
- [ICLR 2025] Official PyTorch implementation of "Forgetting Transformer: Softmax Attention with a Forget Gate" ☆118 Updated last month
- MatFormer repo ☆56 Updated 7 months ago
- [ICML 2025] Roll the dice & look before you leap: Going beyond the creative limits of next-token prediction ☆55 Updated 2 months ago
- Esoteric Language Models ☆89 Updated last week
- LL3M: Large Language and Multi-Modal Model in Jax ☆72 Updated last year
- Resa: Transparent Reasoning Models via SAEs ☆41 Updated last month
- Pytorch implementation of the PEER block from the paper, Mixture of A Million Experts, by Xu Owen He at Deepmind ☆127 Updated 11 months ago
- Official implementation of the paper: "ZClip: Adaptive Spike Mitigation for LLM Pre-Training". ☆131 Updated last month
- Implementation of 🥥 Coconut, Chain of Continuous Thought, in Pytorch ☆178 Updated last month
- [ICLR 2025] Source code for paper "A Spark of Vision-Language Intelligence: 2-Dimensional Autoregressive Transformer for Efficient Finegr… ☆76 Updated 7 months ago
- Geometric-Mean Policy Optimization ☆26 Updated last week
- Easily run PyTorch on multiple GPUs & machines ☆46 Updated last month
- ☆34 Updated 2 months ago
- [ICCV 2025] Auto Interpretation Pipeline and many other functionalities for Multimodal SAE Analysis. ☆146 Updated 3 weeks ago
- DPO, but faster 🚀 ☆43 Updated 8 months ago
- Repository for the Q-Filters method (https://arxiv.org/pdf/2503.02812) ☆34 Updated 4 months ago
- Implementation of a multimodal diffusion transformer in Pytorch ☆102 Updated last year
- Exploration into the proposed "Self Reasoning Tokens" by Felipe Bonetto ☆56 Updated last year
- Official repo of paper LM2 ☆41 Updated 5 months ago
- The official repository for SkyLadder: Better and Faster Pretraining via Context Window Scheduling ☆33 Updated last week
- The official implementation of Regularized Policy Gradient (RPG) (https://arxiv.org/abs/2505.17508) ☆35 Updated last week
- Block Transformer: Global-to-Local Language Modeling for Fast Inference (NeurIPS 2024) ☆160 Updated 3 months ago
- The evaluation framework for training-free sparse attention in LLMs ☆86 Updated last month
- Remasking Discrete Diffusion Models with Inference-Time Scaling ☆36 Updated 4 months ago
- Implementation of TiTok, proposed by Bytedance in "An Image is Worth 32 Tokens for Reconstruction and Generation" ☆176 Updated last year