mdy666 / mdy_tritonLinks

☆150

Alternatives and similar repositories for mdy_triton

Users that are interested in mdy_triton are comparing it to the libraries listed below

Sorting:

smart-lty / ParallelSpeculativeDecoding
[ICLR 2025] PEARL: Parallel Speculative Decoding with Adaptive Draft Length
☆132Updated last month
madsys-dev / deepseekv2-profile
☆152Updated 9 months ago
mit-han-lab / Quest
[ICML 2024] Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference
☆356Updated 4 months ago
InternLM / Awesome-LLM-Training-System
☆44Updated last year
HarryWu99 / llm_kvcache_sparsity
Implement some method of LLM KV Cache Sparsity
☆42Updated last year
mdy666 / Qwen-Native-Sparse-Attention
qwen-nsa
☆84Updated last month
hemingkx / Spec-Bench
Spec-Bench: A Comprehensive Benchmark and Unified Evaluation Platform for Speculative Decoding (ACL 2024 Findings)
☆338Updated 7 months ago
shadowpa0327 / Palu
[ICLR 2025] Palu: Compressing KV-Cache with Low-Rank Projection
☆148Updated 9 months ago
FFY0 / AdaKV
The Official Implementation of Ada-KV [NeurIPS 2025]
☆116Updated last week
shreyansh26 / FlashAttention-PyTorch
Implementation of FlashAttention in PyTorch
☆174Updated 10 months ago
interestingLSY / swiftLLM
A tiny yet powerful LLM inference system tailored for researching purpose. vLLM-equivalent performance with only 2k lines of code (2% of …
☆292Updated 5 months ago
fanshiqing / grouped_gemm
PyTorch bindings for CUTLASS grouped GEMM.
☆169Updated last month
FMInference / H2O
[NeurIPS'23] H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models.
☆487Updated last year
sgl-project / SpecForge
Train speculative decoding models effortlessly and port them smoothly to SGLang serving.
☆498Updated last week
dilab-zju / self-speculative-decoding
Code associated with the paper **Draft & Verify: Lossless Large Language Model Acceleration via Self-Speculative Decoding**
☆210Updated 9 months ago
ruikangliu / FlatQuant
[ICML 2025] Official PyTorch implementation of "FlatQuant: Flatness Matters for LLM Quantization"
☆195Updated last week
microsoft / SeerAttention
SeerAttention: Learning Intrinsic Sparse Attention in Your LLMs
☆175Updated 2 months ago
flagos-ai / FlagAttention
A collection of memory efficient attention operators implemented in the Triton language.
☆285Updated last year
d-matrix-ai / keyformer-llm
☆58Updated last year
pprp / Awesome-Efficient-MoE
Efficient Mixture of Experts for LLM Paper List
☆145Updated 2 months ago
NVlabs / COAT
[ICLR 2025] COAT: Compressing Optimizer States and Activation for Memory-Efficient FP8 Training
☆250Updated 3 months ago
FasterDecoding / SnapKV
☆290Updated 4 months ago
ByteDance-Seed / FlexPrefill
Code for paper: [ICLR2025 Oral] FlexPrefill: A Context-Aware Sparse Attention Mechanism for Efficient Long-Sequence Inference
☆155Updated last month
LDLINGLINGLING / nano_vllm_note
注释的nano_vllm仓库，并且完成了MiniCPM4的适配以及注册新模型的功能
☆108Updated 3 months ago
October2001 / Awesome-KV-Cache-Compression
📰 Must-read papers on KV Cache Compression (constantly updating 🤗).
☆614Updated 2 months ago
luliyucoordinate / cute-flash-attention
Implement Flash Attention using Cute.
☆97Updated 11 months ago
andy-yang-1 / DoubleSparse
16-fold memory access reduction with nearly no loss
☆108Updated 8 months ago
pprp / Awesome-LLM-Prune
Awesome list for LLM pruning.
☆276Updated last month
ClubieDong / QAQ-KVCacheQuantization
QAQ: Quality Adaptive Quantization for LLM KV Cache
☆55Updated last year
TreeAI-Lab / Awesome-KV-Cache-Management
This repository serves as a comprehensive survey of LLM development, featuring numerous research papers along with their corresponding co…
☆255Updated 4 months ago