XunhaoLai / native-sparse-attention-triton
Efficient triton implementation of Native Sparse Attention.
☆116 · Updated this week
Alternatives and similar repositories for native-sparse-attention-triton:
Users interested in native-sparse-attention-triton are comparing it to the libraries listed below.
- 🔥 A minimal training framework for scaling FLA models ☆82 · Updated this week
- Code for paper: [ICLR2025 Oral] FlexPrefill: A Context-Aware Sparse Attention Mechanism for Efficient Long-Sequence Inference ☆62 · Updated this week
- ☆63 · Updated last month
- ☆118 · Updated last month
- XAttention: Block Sparse Attention with Antidiagonal Scoring ☆102 · Updated this week
- ☆70 · Updated 2 weeks ago
- ☆36 · Updated this week
- A sparse attention kernel supporting mixed sparse patterns ☆168 · Updated last month
- The official implementation of paper: SimLayerKV: A Simple Framework for Layer-Level KV Cache Reduction. ☆43 · Updated 5 months ago
- [ICLR 2025] COAT: Compressing Optimizer States and Activation for Memory-Efficient FP8 Training ☆164 · Updated last month
- 16-fold memory access reduction with nearly no loss ☆81 · Updated last week
- [ICLR2025] Codebase for "ReMoE: Fully Differentiable Mixture-of-Experts with ReLU Routing", built on Megatron-LM. ☆65 · Updated 3 months ago
- Towards Economical Inference: Enabling DeepSeek's Multi-Head Latent Attention in Any Transformer-based LLMs ☆145 · Updated this week
- ☆71 · Updated this week
- [ICLR2025] Breaking Throughput-Latency Trade-off for Long Sequences with Speculative Decoding ☆110 · Updated 3 months ago
- Fast and memory-efficient exact attention ☆67 · Updated 3 weeks ago
- Odysseus: Playground of LLM Sequence Parallelism ☆66 · Updated 9 months ago
- Tree Attention: Topology-aware Decoding for Long-Context Attention on GPU clusters ☆125 · Updated 3 months ago
- The simplest implementation of recent Sparse Attention patterns for efficient LLM inference. ☆58 · Updated 2 months ago
- Triton-based implementation of Sparse Mixture of Experts. ☆208 · Updated 3 months ago
- An unofficial implementation of "Mixture-of-Depths: Dynamically allocating compute in transformer-based language models" ☆34 · Updated 9 months ago
- Homepage for ProLong (Princeton long-context language models) and paper "How to Train Long-Context Language Models (Effectively)" ☆166 · Updated 2 weeks ago
- Triton implementation of FlashAttention2 that adds Custom Masks. ☆103 · Updated 7 months ago
- ☆87 · Updated 6 months ago
- Ouroboros: Speculative Decoding with Large Model Enhanced Drafting (EMNLP 2024 main) ☆87 · Updated this week
- Here we will test various linear attention designs. ☆60 · Updated 11 months ago
- GEAR: An Efficient KV Cache Compression Recipe for Near-Lossless Generative Inference of LLM ☆157 · Updated 8 months ago
- ☆232 · Updated 10 months ago
- qwen-nsa ☆42 · Updated last week
- Boosting 4-bit inference kernels with 2:4 Sparsity ☆71 · Updated 6 months ago