tridao / flash-attention-wheels
☆46 · Updated last year
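As its name suggests, flash-attention-wheels hosts tooling for building prebuilt wheels of the flash-attn package so the CUDA extension does not have to be compiled locally. Below is a minimal usage sketch of the flash-attn library itself, not code from this repository; it assumes a CUDA GPU, fp16/bf16 tensors, and an already-installed wheel matching your PyTorch/CUDA versions (e.g. `pip install flash-attn`).

```python
# Minimal sketch of calling FlashAttention once a flash-attn wheel is installed.
# flash-attn only accepts fp16/bf16 tensors on a CUDA device.
import torch
from flash_attn import flash_attn_func

batch, seqlen, nheads, headdim = 2, 1024, 8, 64
q = torch.randn(batch, seqlen, nheads, headdim, device="cuda", dtype=torch.float16)
k = torch.randn_like(q)
v = torch.randn_like(q)

# causal=True applies a lower-triangular (autoregressive) mask.
out = flash_attn_func(q, k, v, causal=True)
print(out.shape)  # (batch, seqlen, nheads, headdim)
```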
Alternatives and similar repositories for flash-attention-wheels:
Users interested in flash-attention-wheels are comparing it to the libraries listed below.
- Repository for CPU Kernel Generation for LLM Inference ☆25 · Updated last year
- DPO, but faster ☆40 · Updated 3 months ago
- [ACL 2024] RelayAttention for Efficient Large Language Model Serving with Long System Prompts ☆39 · Updated last year
- Repository for Sparse Finetuning of LLMs via a modified version of the MosaicML llmfoundry ☆40 · Updated last year
- Load compute kernels from the Hub ☆99 · Updated this week
- A Suite for Parallel Inference of Diffusion Transformers (DiTs) on multi-GPU Clusters ☆44 · Updated 8 months ago
- π₯ A minimal training framework for scaling FLA modelsβ82Updated this week
- Tritonbench is a collection of PyTorch custom operators with example inputs to measure their performance. ☆104 · Updated this week
- Code for paper: [ICLR2025 Oral] FlexPrefill: A Context-Aware Sparse Attention Mechanism for Efficient Long-Sequence Inference ☆62 · Updated this week
- ☆30 · Updated 10 months ago
- ☆51 · Updated 2 weeks ago
- Odysseus: Playground of LLM Sequence Parallelism ☆66 · Updated 9 months ago
- Transformers components but in Triton ☆32 · Updated last week
- [ICLR2025] Breaking Throughput-Latency Trade-off for Long Sequences with Speculative Decoding ☆110 · Updated 3 months ago
- QuIP quantization ☆52 · Updated last year
- Linear Attention Sequence Parallelism (LASP) ☆79 · Updated 9 months ago
- Boosting 4-bit inference kernels with 2:4 Sparsity ☆71 · Updated 6 months ago
- ☆65 · Updated 2 months ago
- APPy (Annotated Parallelism for Python) enables users to annotate loops and tensor expressions in Python with compiler directives akin to… ☆23 · Updated last month
- Here we will test various linear attention designs. ☆60 · Updated 11 months ago
- PyTorch bindings for CUTLASS grouped GEMM. ☆74 · Updated 4 months ago
- GPU operators for sparse tensor operations ☆31 · Updated last year
- Using FlexAttention to compute attention with different masking patterns (see the sketch after this list) ☆42 · Updated 6 months ago
- Fast and memory-efficient exact attention ☆67 · Updated 3 weeks ago
- CUDA and Triton implementations of Flash Attention with SoftmaxN. ☆68 · Updated 10 months ago
- GEAR: An Efficient KV Cache Compression Recipe for Near-Lossless Generative Inference of LLM ☆157 · Updated 8 months ago
- Quantized Attention on GPU ☆45 · Updated 4 months ago
- ☆101 · Updated 7 months ago
- Pruner-Zero: Evolving Symbolic Pruning Metric from scratch for LLMs ☆80 · Updated 4 months ago
- Get down and dirty with FlashAttention 2.0 in PyTorch; plug in and play, no complex CUDA kernels ☆102 · Updated last year
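The FlexAttention entry above refers to PyTorch's `torch.nn.attention.flex_attention` API (available in PyTorch 2.5+), which expresses masking patterns as small Python functions. A minimal sketch of a causal mask, assuming a recent PyTorch build and a CUDA GPU; shapes and sizes here are illustrative, not taken from that repository.

```python
# Minimal FlexAttention sketch: attention with a custom (causal) masking pattern.
# Requires PyTorch >= 2.5; tensors are laid out as (batch, heads, seqlen, head_dim).
import torch
from torch.nn.attention.flex_attention import flex_attention, create_block_mask

def causal_mask(b, h, q_idx, kv_idx):
    # Keep only positions where the query index is at or after the key index.
    return q_idx >= kv_idx

B, H, S, D = 2, 8, 1024, 64
q = torch.randn(B, H, S, D, device="cuda", dtype=torch.float16)
k = torch.randn_like(q)
v = torch.randn_like(q)

# B=None, H=None broadcasts the mask over batch and heads.
block_mask = create_block_mask(causal_mask, B=None, H=None, Q_LEN=S, KV_LEN=S, device="cuda")
out = flex_attention(q, k, v, block_mask=block_mask)
print(out.shape)  # torch.Size([2, 8, 1024, 64])
```

Other masking patterns (sliding window, prefix-LM, document masks, etc.) follow the same pattern: swap in a different `mask_mod` function and rebuild the block mask.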