tridao / flash-attention-wheels
☆ 44 · Updated 11 months ago
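Once a flash-attn wheel from this repository (or PyPI) is installed, attention is typically invoked through `flash_attn_func`. The snippet below is a minimal usage sketch, not code from this repo: the shapes, dtype, and the causal flag are illustrative assumptions.

```python
# Minimal usage sketch: assumes a CUDA GPU and an installed flash-attn wheel
# (e.g. `pip install flash-attn`); shapes and dtype below are illustrative.
import torch
from flash_attn import flash_attn_func

batch, seqlen, nheads, headdim = 2, 1024, 8, 64
q = torch.randn(batch, seqlen, nheads, headdim, device="cuda", dtype=torch.float16)
k = torch.randn(batch, seqlen, nheads, headdim, device="cuda", dtype=torch.float16)
v = torch.randn(batch, seqlen, nheads, headdim, device="cuda", dtype=torch.float16)

# Fused attention with causal masking; the output has the same shape as q.
out = flash_attn_func(q, k, v, causal=True)
```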
Related projects
Alternatives and complementary repositories for flash-attention-wheels
- Repository for CPU Kernel Generation for LLM Inference · ☆ 25 · Updated last year
- Breaking the Throughput-Latency Trade-off for Long Sequences with Speculative Decoding · ☆ 79 · Updated this week
- A Suite for Parallel Inference of Diffusion Transformers (DiTs) on multi-GPU Clusters · ☆ 32 · Updated 3 months ago
- [ACL 2024] RelayAttention for Efficient Large Language Model Serving with Long System Prompts · ☆ 34 · Updated 8 months ago
- APPy (Annotated Parallelism for Python) enables users to annotate loops and tensor expressions in Python with compiler directives akin to… · ☆ 20 · Updated last week
- Repository for Sparse Finetuning of LLMs via a modified version of the MosaicML llmfoundry · ☆ 38 · Updated 10 months ago
- Odysseus: Playground of LLM Sequence Parallelism · ☆ 57 · Updated 5 months ago
- Here we will test various linear attention designs. · ☆ 56 · Updated 6 months ago
- Boosting 4-bit inference kernels with 2:4 Sparsity · ☆ 51 · Updated 2 months ago
- QuIP quantization · ☆ 46 · Updated 8 months ago
- FlexAttention w/ FlashAttention3 Support · ☆ 27 · Updated last month
- Code for Palu: Compressing KV-Cache with Low-Rank Projection · ☆ 57 · Updated this week
- CUDA and Triton implementations of Flash Attention with SoftmaxN · ☆ 66 · Updated 5 months ago
- Pruner-Zero: Evolving Symbolic Pruning Metric from scratch for LLMs · ☆ 74 · Updated 5 months ago
- The source code of our work "Prepacking: A Simple Method for Fast Prefilling and Increased Throughput in Large Language Models" · ☆ 56 · Updated last month
- Simple and fast low-bit matmul kernels in CUDA / Triton · ☆ 145 · Updated this week
- ring-attention experiments · ☆ 97 · Updated last month
- My Implementation of Q-Sparse: All Large Language Models can be Fully Sparsely-Activated · ☆ 30 · Updated 3 months ago
- Using FlexAttention to compute attention with different masking patterns (see the sketch after this list) · ☆ 40 · Updated last month
- KV cache compression for high-throughput LLM inference · ☆ 87 · Updated this week
- A repository for research on medium-sized language models · ☆ 74 · Updated 5 months ago
- Official repository for the paper "SwitchHead: Accelerating Transformers with Mixture-of-Experts Attention" · ☆ 92 · Updated last month
- GEAR: An Efficient KV Cache Compression Recipe for Near-Lossless Generative Inference of LLM · ☆ 147 · Updated 4 months ago
- This repository contains code for the MicroAdam paper · ☆ 12 · Updated 4 months ago
- (WIP) Parallel inference for black-forest-labs' FLUX model · ☆ 11 · Updated this week
- An algorithm for static activation quantization of LLMs · ☆ 77 · Updated last week
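For the FlexAttention masking entry above, the sketch below shows the general pattern of computing attention with a custom mask. It assumes PyTorch 2.5+, which ships `torch.nn.attention.flex_attention`; the causal `mask_mod` and the tensor shapes are illustrative assumptions, not code taken from the listed repository.

```python
# Minimal sketch of FlexAttention with a custom mask (assumes PyTorch >= 2.5;
# shapes and the causal mask are illustrative, not code from the listed repo).
import torch
from torch.nn.attention.flex_attention import flex_attention, create_block_mask

B, H, S, D = 2, 8, 1024, 64  # batch, heads, sequence length, head dim
q, k, v = (torch.randn(B, H, S, D, device="cuda", dtype=torch.float16) for _ in range(3))

def causal_mask(b, h, q_idx, kv_idx):
    # Allow a query to attend only to keys at or before its own position.
    return q_idx >= kv_idx

# Build the block-sparse mask once, then reuse it across calls.
block_mask = create_block_mask(causal_mask, B=None, H=None, Q_LEN=S, KV_LEN=S)
out = flex_attention(q, k, v, block_mask=block_mask)  # same shape as q
```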