kyegomez / MHMoE
Community implementation of the paper "Multi-Head Mixture-of-Experts" in PyTorch
☆21 · Updated 2 weeks ago
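As a rough illustration of what this repository covers, below is a minimal, hedged sketch of the multi-head mixture-of-experts idea from the paper: each token is split into several sub-tokens ("heads"), each sub-token is routed independently to an expert feed-forward network via a learned top-1 gate, and the expert outputs are merged back into a single token. The names used here (`MHMoESketch`, `heads`, `num_experts`, `hidden_mult`) are assumptions for illustration only, not the actual API of kyegomez/MHMoE.

```python
# Hedged sketch of multi-head MoE routing; not the kyegomez/MHMoE API.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MHMoESketch(nn.Module):
    """Toy multi-head MoE layer: split tokens into sub-tokens, route each by top-1 gating."""

    def __init__(self, dim: int, num_experts: int = 4, heads: int = 2, hidden_mult: int = 4):
        super().__init__()
        assert dim % heads == 0, "hidden dim must be divisible by the number of heads"
        self.heads = heads
        self.sub_dim = dim // heads
        self.multi_head = nn.Linear(dim, dim)   # channel mix before splitting into sub-tokens
        self.merge = nn.Linear(dim, dim)        # channel mix after re-assembling sub-tokens
        self.gate = nn.Linear(self.sub_dim, num_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(
                nn.Linear(self.sub_dim, self.sub_dim * hidden_mult),
                nn.GELU(),
                nn.Linear(self.sub_dim * hidden_mult, self.sub_dim),
            )
            for _ in range(num_experts)
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, n, d = x.shape
        # split every token into `heads` sub-tokens of size sub_dim
        sub = self.multi_head(x).reshape(b, n * self.heads, self.sub_dim)
        # top-1 routing: each sub-token picks its highest-scoring expert
        scores = F.softmax(self.gate(sub), dim=-1)
        top_score, top_idx = scores.max(dim=-1)
        out = torch.zeros_like(sub)
        for e, expert in enumerate(self.experts):
            mask = top_idx == e
            if mask.any():
                out[mask] = expert(sub[mask]) * top_score[mask].unsqueeze(-1)
        # merge sub-tokens back into full-width tokens
        return self.merge(out.reshape(b, n, d))


layer = MHMoESketch(dim=64, num_experts=4, heads=2)
y = layer(torch.randn(2, 8, 64))  # -> shape (2, 8, 64)
```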
Alternatives and similar repositories for MHMoE:
Users interested in MHMoE are comparing it to the libraries listed below.
- A single repo with all scripts and utils to train / fine-tune the Mamba model with or without FIM ☆50 · Updated 10 months ago
- Implementation of MoE Mamba from the paper: "MoE-Mamba: Efficient Selective State Space Models with Mixture of Experts" in PyTorch and Ze… ☆93 · Updated 2 weeks ago
- Official repository for the paper "Approximating Two-Layer Feedforward Networks for Efficient Transformers" ☆36 · Updated last year
- Implementation of a Light Recurrent Unit in PyTorch ☆48 · Updated 4 months ago
- Official repository for the paper "SwitchHead: Accelerating Transformers with Mixture-of-Experts Attention" ☆96 · Updated 4 months ago
- GoldFinch and other hybrid transformer components ☆43 · Updated 6 months ago
- HGRN2: Gated Linear RNNs with State Expansion ☆52 · Updated 5 months ago
- Some personal experiments around routing tokens to different autoregressive attention, akin to mixture-of-experts ☆114 · Updated 3 months ago
- Explorations into improving ViTArc with Slot Attention ☆37 · Updated 3 months ago
- Block Transformer: Global-to-Local Language Modeling for Fast Inference (NeurIPS 2024) ☆149 · Updated last month
- Official implementation of the paper "DeciMamba: Exploring the Length Extrapolation Potential of Mamba" ☆23 · Updated 6 months ago
- Implementation of Agent Attention in PyTorch ☆89 · Updated 7 months ago
- A byte-level decoder architecture that matches the performance of tokenized Transformers ☆65 · Updated 9 months ago
- Official repository for the paper "NeuZip: Memory-Efficient Training and Inference with Dynamic Compression of Neural Networks". This rep… ☆53 · Updated 3 months ago
- ☆43 · Updated 3 months ago
- ☆71 · Updated 5 months ago
- An open-source replication of the strawberry method that leverages Monte Carlo Search with PPO and/or DPO ☆27 · Updated this week
- Implementation of the paper: "Mixture-of-Depths: Dynamically allocating compute in transformer-based language models" ☆81 · Updated this week
- PyTorch implementation of the paper: "Learning to (Learn at Test Time): RNNs with Expressive Hidden States" ☆24 · Updated this week
- My implementation of Q-Sparse: "All Large Language Models can be Fully Sparsely-Activated" ☆31 · Updated 6 months ago
- Griffin MQA + Hawk Linear RNN Hybrid ☆85 · Updated 9 months ago
- RWKV, in easy-to-read code ☆65 · Updated 2 months ago
- Implementation of BitNet-1.58 instruct tuning ☆19 · Updated 10 months ago
- Collection of autoregressive model implementations ☆81 · Updated this week
- Official implementation of Phi-Mamba, a MOHAWK-distilled model (Transformers to SSMs: Distilling Quadratic Knowledge to Subquadratic Mode… ☆96 · Updated 5 months ago
- Implementation of Google's USM speech model in PyTorch ☆28 · Updated 2 weeks ago
- Implementation of a modular, high-performance, and simplistic Mamba for high-speed applications ☆33 · Updated 3 months ago