kyegomez / FlashMHA
A simple PyTorch implementation of Flash MultiHead Attention
☆21 · Updated last year
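As a rough illustration only (not this repository's actual API), the sketch below shows how flash-style multi-head attention can be expressed with PyTorch's built-in `torch.nn.functional.scaled_dot_product_attention` (PyTorch ≥ 2.0), which dispatches to a FlashAttention kernel on supported GPUs. The class name `SimpleFlashMHA` and its layout are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SimpleFlashMHA(nn.Module):
    """Minimal flash-style multi-head attention sketch (hypothetical, not the repo's API)."""

    def __init__(self, embed_dim: int, num_heads: int, causal: bool = False):
        super().__init__()
        assert embed_dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = embed_dim // num_heads
        self.causal = causal
        self.qkv = nn.Linear(embed_dim, 3 * embed_dim)
        self.out = nn.Linear(embed_dim, embed_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, embed_dim)
        b, s, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # reshape to (batch, heads, seq_len, head_dim)
        q, k, v = (t.view(b, s, self.num_heads, self.head_dim).transpose(1, 2) for t in (q, k, v))
        # fused attention; uses the FlashAttention backend when available
        o = F.scaled_dot_product_attention(q, k, v, is_causal=self.causal)
        o = o.transpose(1, 2).reshape(b, s, -1)
        return self.out(o)


if __name__ == "__main__":
    mha = SimpleFlashMHA(embed_dim=256, num_heads=8, causal=True)
    y = mha(torch.randn(2, 128, 256))
    print(y.shape)  # torch.Size([2, 128, 256])
```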
Alternatives and similar repositories for FlashMHA
Users interested in FlashMHA are comparing it to the libraries listed below.
- Accelerate LLM preference tuning via prefix sharing with a single line of code ☆41 · Updated last month
- Linear Attention Sequence Parallelism (LASP) ☆85 · Updated last year
- "Found in the Middle: How Language Models Use Long Contexts Better via Plug-and-Play Positional Encoding" Zhenyu Zhang, Runjin Chen, Shiw… ☆29 · Updated last year
- [ACL 2024] RelayAttention for Efficient Large Language Model Serving with Long System Prompts ☆40 · Updated last year
- ☆31 · Updated last year
- Patches for Hugging Face Transformers to save memory ☆24 · Updated 3 weeks ago
- An efficient implementation of the NSA (Native Sparse Attention) kernel ☆72 · Updated this week
- [ICML24] Pruner-Zero: Evolving Symbolic Pruning Metric from scratch for LLMs ☆84 · Updated 7 months ago
- ☆47 · Updated 2 weeks ago
- ☆50 · Updated last year
- ☆36 · Updated last week
- Implementation of IceFormer: Accelerated Inference with Long-Sequence Transformers on CPUs (ICLR 2024) ☆25 · Updated last year
- ☆71 · Updated last month
- Fast and memory-efficient exact attention ☆68 · Updated 3 months ago
- Odysseus: Playground of LLM Sequence Parallelism ☆70 · Updated last year
- Exploring Diffusion Transformer Designs via Grafting ☆41 · Updated last week
- Using FlexAttention to compute attention with different masking patterns ☆44 · Updated 9 months ago
- ☆114 · Updated 3 weeks ago
- Quantized Attention on GPU ☆44 · Updated 7 months ago
- A Suite for Parallel Inference of Diffusion Transformers (DiTs) on multi-GPU Clusters ☆47 · Updated 11 months ago
- ☆58 · Updated last week
- ☆21 · Updated 2 months ago
- FBI-LLM: Scaling Up Fully Binarized LLMs from Scratch via Autoregressive Distillation ☆49 · Updated 11 months ago
- Official repository for the ICML 2024 paper "MoRe Fine-Tuning with 10x Fewer Parameters" ☆20 · Updated last month
- Implementation of the model "Hedgehog" from the paper "The Hedgehog & the Porcupine: Expressive Linear Attentions with Softmax Mimicry" ☆14 · Updated last year
- Cascade Speculative Drafting ☆29 · Updated last year
- Triton implementation of bi-directional (non-causal) linear attention ☆50 · Updated 4 months ago
- ☆76 · Updated 4 months ago
- Beyond KV Caching: Shared Attention for Efficient LLMs ☆19 · Updated 11 months ago
- RWKV is an RNN with transformer-level LLM performance. It can be directly trained like a GPT (parallelizable). So it's combining the best… ☆47 · Updated 3 months ago