linxihui / dkernelLinks

☆20

Alternatives and similar repositories for dkernel

Users that are interested in dkernel are comparing it to the libraries listed below

Sorting:

IST-DASLab / SparseFinetuning
Repository for Sparse Finetuning of LLMs via modified version of the MosaicML llmfoundry
☆42Updated last year
sail-sg / SimLayerKV
The official implementation of paper: SimLayerKV: A Simple Framework for Layer-Level KV Cache Reduction.
☆49Updated last year
thunlp / Ouroboros
Ouroboros: Speculative Decoding with Large Model Enhanced Drafting (EMNLP 2024 main)
☆110Updated 7 months ago
IST-DASLab / Sparse-Marlin
Boosting 4-bit inference kernels with 2:4 Sparsity
☆84Updated last year
sail-sg / VocabularyParallelism
Vocabulary Parallelism
☆23Updated 7 months ago
siyan-zhao / prepacking
The source code of our work "Prepacking: A Simple Method for Fast Prefilling and Increased Throughput in Large Language Models" [AISTATS …
☆60Updated last year
linfeng93 / BiTA
An innovative method expediting LLMs via streamlined semi-autoregressive generation and draft verification.
☆26Updated 6 months ago
IsaacRe / vllm-kvcompress
KV cache compression for high-throughput LLM inference
☆142Updated 8 months ago
PiotrNawrot / sparse-frontier
The evaluation framework for training-free sparse attention in LLMs
☆102Updated 2 weeks ago
Infini-AI-Lab / gsm_infinite
☆55Updated 4 months ago
feifeibear / Odysseus-Transformer
Odysseus: Playground of LLM Sequence Parallelism
☆78Updated last year
FasterDecoding / TEAL
☆145Updated 8 months ago
MayDomine / Burst-Attention
Distributed IO-aware Attention algorithm
☆21Updated last month
Dao-AILab / grouped-latent-attention
☆130Updated 4 months ago
rayleizhu / vllm-ra
[ACL 2024] RelayAttention for Efficient Large Language Model Serving with Long System Prompts
☆40Updated last year
Infini-AI-Lab / MagicDec
[ICLR2025] Breaking Throughput-Latency Trade-off for Long Sequences with Speculative Decoding
☆130Updated 10 months ago
opengear-project / GEAR
GEAR: An Efficient KV Cache Compression Recipefor Near-Lossless Generative Inference of LLM
☆169Updated last year
rlite-project / RLite
A lightweight reinforcement learning framework that integrates seamlessly into your codebase, empowering developers to focus on algorithm…
☆68Updated 2 months ago
VITA-Group / WeLore
From GaLore to WeLore: How Low-Rank Weights Non-uniformly Emerge from Low-Rank Gradients. Ajay Jaiswal, Lu Yin, Zhenyu Zhang, Shiwei Liu,…
☆51Updated 6 months ago
AnswerDotAI / cold-compress
Cold Compress is a hackable, lightweight, and open-source toolkit for creating and benchmarking cache compression methods built on top of…
☆147Updated last year
IST-DASLab / RoSA
Official implementation of the ICML 2024 paper RoSA (Robust Adaptation)
☆44Updated last year
hyx1999 / SAM-Decoding
Official Implementation of SAM-Decoding: Speculative Decoding via Suffix Automaton
☆33Updated 8 months ago
Equationliu / Kangaroo
[NeurIPS 2024] The official implementation of "Kangaroo: Lossless Self-Speculative Decoding for Accelerating LLMs via Double Early Exitin…
☆60Updated last year
CASE-Lab-UMD / Unified-MoE-Compression
The official implementation of the paper "Towards Efficient Mixture of Experts: A Holistic Study of Compression Techniques (TMLR)".
☆79Updated 7 months ago
li-plus / flash-preference
Accelerate LLM preference tuning via prefix sharing with a single line of code
☆46Updated 3 months ago
wdlctc / mini-s
☆52Updated 11 months ago
lfsszd / CS-Drafting
Cascade Speculative Drafting
☆31Updated last year
princeton-pli / PruLong
Code for the preprint "Cache Me If You Can: How Many KVs Do You Need for Effective Long-Context LMs?"
☆47Updated 2 months ago
shawntan / scattermoe
Triton-based implementation of Sparse Mixture of Experts.
☆246Updated 3 weeks ago
frankxwang / dpo-prefix-sharing
DPO, but faster 🚀
☆45Updated 10 months ago