z-lab / flash-colreduce
Fast, memory-efficient attention column reduction (e.g., sum, mean, max)
★ 28 · Updated this week
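The description above suggests a flash-attention-style reduction over the columns of the attention matrix, computed without materializing the full N×N score matrix. As a rough illustration only (this is not flash-colreduce's actual kernel or API; the function name, blocking scheme, and column-sum choice are assumptions), a blocked column-sum in plain PyTorch might look like:

```python
import torch

def blocked_attn_colsum(q, k, block_size=128):
    """Column-wise sum of softmax(q @ k.T / sqrt(d)), computed block by block.

    q, k: (N, d) tensors. Returns a length-N vector holding, for each key
    column, the sum of its row-softmaxed attention weights over all queries.
    Peak memory is O(block_size * N) instead of O(N^2).
    """
    n, d = q.shape
    scale = d ** -0.5
    colsum = torch.zeros(k.shape[0], dtype=q.dtype, device=q.device)
    for start in range(0, n, block_size):
        q_blk = q[start:start + block_size]        # (B, d) block of queries
        scores = (q_blk @ k.T) * scale             # (B, N) block of scores
        probs = torch.softmax(scores, dim=-1)      # row-wise softmax
        colsum += probs.sum(dim=0)                 # accumulate column sums
    return colsum

# Sanity check against the naive full-matrix reduction (small sizes only).
q = torch.randn(1024, 64)
k = torch.randn(1024, 64)
ref = torch.softmax((q @ k.T) * 64 ** -0.5, dim=-1).sum(dim=0)
torch.testing.assert_close(blocked_attn_colsum(q, k), ref)
```

A mean or max reduction follows the same blocked pattern by swapping the accumulation step; a fused GPU kernel would additionally tile over keys with an online softmax, which this sketch omits for clarity.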
Alternatives and similar repositories for flash-colreduce
Users interested in flash-colreduce are comparing it to the libraries listed below.
- d3LLM: Ultra-Fast Diffusion LLM ★ 33 · Updated this week
- A sparse attention kernel supporting mixed sparse patterns ★ 406 · Updated 10 months ago
- [ICML 2025] XAttention: Block Sparse Attention with Antidiagonal Scoring ★ 256 · Updated 5 months ago
- [ICML 2025] SparseLoRA: Accelerating LLM Fine-Tuning with Contextual Sparsity ★ 64 · Updated 5 months ago
- Official PyTorch implementation of the paper "dLLM-Cache: Accelerating Diffusion Large Language Models with Adaptive Caching" (dLLM-Cache… ★ 187 · Updated last month
- Code for the paper: [ICLR 2025 Oral] FlexPrefill: A Context-Aware Sparse Attention Mechanism for Efficient Long-Sequence Inference ★ 158 · Updated 2 months ago
- Official implementation of "Fast-dLLM: Training-free Acceleration of Diffusion LLM by Enabling KV Cache and Parallel Decoding" ★ 736 · Updated 3 weeks ago
- ★ 207 · Updated 3 weeks ago
- [NeurIPS 2024] The official implementation of ZipCache: Accurate and Efficient KV Cache Quantization with Salient Token Identification ★ 31 · Updated 8 months ago
- 16-fold memory access reduction with nearly no loss ★ 109 · Updated 8 months ago
- SeerAttention: Learning Intrinsic Sparse Attention in Your LLMs ★ 177 · Updated 2 months ago
- [NeurIPS'25 Spotlight] Adaptive Attention Sparsity with Hierarchical Top-p Pruning ★ 75 · Updated 2 weeks ago
- [ICML 2024] Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference ★ 357 · Updated 5 months ago
- [ASPLOS'26] Taming the Long-Tail: Efficient Reasoning RL Training with Adaptive Drafter ★ 105 · Updated last week
- SLA: Beyond Sparsity in Diffusion Transformers via Fine-Tunable Sparse-Linear Attention ★ 146 · Updated last month
- [NeurIPS 2024 Oral] DuQuant: Distributing Outliers via Dual Transformation Makes Stronger Quantized LLMs. ★ 177 · Updated last year
- ★ 187 · Updated 11 months ago
- A Survey of Efficient Attention Methods: Hardware-efficient, Sparse, Compact, and Linear Attention ★ 252 · Updated 2 weeks ago
- [ICLR'25] ViDiT-Q: Efficient and Accurate Quantization of Diffusion Transformers for Image and Video Generation ★ 140 · Updated 8 months ago
- A reading list on some popular MLsys topics ★ 17 · Updated 8 months ago
- [ICLR 2025] COAT: Compressing Optimizer States and Activation for Memory-Efficient FP8 Training ★ 252 · Updated 4 months ago
- ByteCheckpoint: A Unified Checkpointing Library for LFMs ★ 256 · Updated last week
- [NeurIPS'25] dKV-Cache: The Cache for Diffusion Language Models ★ 123 · Updated 6 months ago
- ★ 63 · Updated last year
- The official implementation of Ada-KV [NeurIPS 2025] ★ 118 · Updated 3 weeks ago
- PoC for "SpecReason: Fast and Accurate Inference-Time Compute via Speculative Reasoning" [NeurIPS '25] ★ 60 · Updated 2 months ago
- Implementation of FP8/INT8 rollout for RL training without performance drop. ★ 280 · Updated last month
- Discrete Diffusion Forcing (D2F): dLLMs Can Do Faster-Than-AR Inference ★ 214 · Updated 2 months ago
- [ICLR 2025] Palu: Compressing KV-Cache with Low-Rank Projection ★ 149 · Updated 9 months ago
- Efficient Triton implementation of Native Sparse Attention. ★ 254 · Updated 6 months ago