kyegomez / FlashMHA
A simple PyTorch implementation of Flash MultiHead Attention
☆21 · Updated last year
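For orientation, here is a minimal sketch of a Flash-style multi-head attention module in PyTorch: it routes standard multi-head attention through `torch.nn.functional.scaled_dot_product_attention`, which dispatches to a FlashAttention kernel on supported GPUs. The class name `FlashMHA`, its constructor signature, and the fused QKV projection below are illustrative assumptions, not this repository's actual API.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FlashMHA(nn.Module):
    """Illustrative multi-head attention that routes through PyTorch's fused
    scaled_dot_product_attention (FlashAttention backend when available).
    Hypothetical interface; not the API of kyegomez/FlashMHA."""

    def __init__(self, embed_dim: int, num_heads: int, dropout: float = 0.0):
        super().__init__()
        assert embed_dim % num_heads == 0, "embed_dim must divide evenly by num_heads"
        self.num_heads = num_heads
        self.head_dim = embed_dim // num_heads
        self.dropout = dropout
        self.qkv_proj = nn.Linear(embed_dim, 3 * embed_dim)  # fused Q/K/V projection
        self.out_proj = nn.Linear(embed_dim, embed_dim)

    def forward(self, x: torch.Tensor, is_causal: bool = False) -> torch.Tensor:
        # x: (batch, seq_len, embed_dim)
        b, s, _ = x.shape
        q, k, v = self.qkv_proj(x).chunk(3, dim=-1)
        # Reshape to (batch, num_heads, seq_len, head_dim), the layout SDPA expects.
        q, k, v = (t.view(b, s, self.num_heads, self.head_dim).transpose(1, 2)
                   for t in (q, k, v))
        # Fused attention: never materializes the full seq x seq score matrix.
        out = F.scaled_dot_product_attention(
            q, k, v,
            dropout_p=self.dropout if self.training else 0.0,
            is_causal=is_causal,
        )
        out = out.transpose(1, 2).reshape(b, s, -1)  # back to (batch, seq, embed_dim)
        return self.out_proj(out)

# Quick shape check.
mha = FlashMHA(embed_dim=512, num_heads=8)
x = torch.randn(2, 128, 512)
print(mha(x, is_causal=True).shape)  # torch.Size([2, 128, 512])
```

Fusing the Q/K/V projections into one linear layer is a common packaging choice in FlashAttention-style modules; the speed and memory savings come from the fused attention kernel itself, which avoids materializing the full attention matrix.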
Alternatives and similar repositories for FlashMHA:
Users interested in FlashMHA are comparing it to the repositories listed below.
- Linear Attention Sequence Parallelism (LASP) ☆79 · Updated 9 months ago
- ☆30 · Updated 10 months ago
- Implementation of the model "Hedgehog" from the paper "The Hedgehog & the Porcupine: Expressive Linear Attentions with Softmax Mimicry" ☆13 · Updated last year
- [ACL 2024] RelayAttention for Efficient Large Language Model Serving with Long System Prompts ☆39 · Updated last year
- A Suite for Parallel Inference of Diffusion Transformers (DiTs) on multi-GPU Clusters ☆44 · Updated 8 months ago
- Pruner-Zero: Evolving Symbolic Pruning Metric from scratch for LLMs ☆80 · Updated 4 months ago
- FBI-LLM: Scaling Up Fully Binarized LLMs from Scratch via Autoregressive Distillation ☆47 · Updated 8 months ago
- Implementation of IceFormer: Accelerated Inference with Long-Sequence Transformers on CPUs (ICLR 2024) ☆22 · Updated 9 months ago
- Odysseus: Playground of LLM Sequence Parallelism ☆68 · Updated 9 months ago
- ☆67 · Updated 8 months ago
- A minimal implementation of vllm ☆37 · Updated 8 months ago
- Beyond KV Caching: Shared Attention for Efficient LLMs ☆16 · Updated 8 months ago
- ☆46 · Updated last year
- Patch convolution to avoid large GPU memory usage of Conv2D ☆84 · Updated 2 months ago
- HGRN2: Gated Linear RNNs with State Expansion ☆53 · Updated 7 months ago
- FocusLLM: Scaling LLM’s Context by Parallel Decoding ☆39 · Updated 3 months ago
- Code for the paper "Executing Arithmetic: Fine-Tuning Large Language Models as Turing Machines" ☆11 · Updated 5 months ago
- Triton implementation of bi-directional (non-causal) linear attention ☆44 · Updated last month
- "Found in the Middle: How Language Models Use Long Contexts Better via Plug-and-Play Positional Encoding" Zhenyu Zhang, Runjin Chen, Shiw…☆28Updated 10 months ago
- ☆52Updated this week
- ☆72Updated last week
- Quantized Attention on GPU☆45Updated 4 months ago
- Here we will test various linear attention designs.☆60Updated 11 months ago
- The implementation for the MLSys 2023 paper "Cuttlefish: Low-rank Model Training without All The Tuning" ☆44 · Updated last year
- 🔥 A minimal training framework for scaling FLA models ☆92 · Updated last week
- A WebUI for Side-by-Side Comparison of Media (Images/Videos) Across Multiple Folders ☆21 · Updated last month
- ☆39 · Updated last month
- DPO, but faster 🚀 ☆40 · Updated 3 months ago
- [ICLR 2024] Jaiswal, A., Gan, Z., Du, X., Zhang, B., Wang, Z., & Yang, Y. Compressing LLMs: The truth is rarely pure and never simple. ☆23 · Updated last year
- Using FlexAttention to compute attention with different masking patterns ☆42 · Updated 6 months ago