fw-ai / llama-cuda-graph-example

Example of applying CUDA graphs to LLaMA-v2

☆12

Alternatives and similar repositories for llama-cuda-graph-example

Users who are interested in llama-cuda-graph-example are comparing it to the libraries listed below.

meta-pytorch / BackendBench
How to ensure correctness and ship LLM generated kernels in PyTorch
☆117 Updated this week
deepspeedai / DeepSpeed-Kernels
☆71 Updated 7 months ago
meta-pytorch / kraken
Triton-based Symmetric Memory operators and examples
☆62 Updated last month
Infini-AI-Lab / MagicDec
[ICLR2025] Breaking Throughput-Latency Trade-off for Long Sequences with Speculative Decoding
☆131 Updated 11 months ago
IST-DASLab / Sparse-Marlin
Boosting 4-bit inference kernels with 2:4 Sparsity
☆85 Updated last year
IBM / triton-dejavu
Framework to reduce autotune overhead to zero for well known deployments.
☆85 Updated last month
ademeure / QuickRunCUDA
☆13 Updated 2 weeks ago
PipeFusion / PipeFusion
A Suite for Parallel Inference of Diffusion Transformers (DiTs) on multi-GPU Clusters
☆51 Updated last year
gpu-mode / ring-attention
ring-attention experiments
☆155 Updated last year
ScalingIntelligence / hydragen
Hydragen: High-Throughput LLM Inference with Shared Prefixes
☆44 Updated last year
vedantroy / gpu_kernels
☆27 Updated last year
tile-ai / AttentionEngine
☆50 Updated 5 months ago
cchan / tccl
extensible collectives library in triton
☆91 Updated 7 months ago
microsoft / AttentionEngine
☆106 Updated 5 months ago
stanford-futuredata / stk
☆112 Updated last year
dame-cell / Triformer
Transformers components but in Triton
☆34 Updated 6 months ago
feifeibear / DPSKV3MFU
Estimate MFU for DeepSeekV3
☆26 Updated 10 months ago
IST-DASLab / qutlass
QuTLASS: CUTLASS-Powered Quantized BLAS for Deep Learning
☆128 Updated last week
ademeure / DeeperGEMM
DeeperGEMM: crazy optimized version
☆73 Updated 6 months ago
exists-forall / striped_attention
☆41 Updated 2 years ago
open-lm-engine / accelerated-model-architectures
A bunch of kernels that might make stuff slower 😉
☆64 Updated last week
flashinfer-ai / cutlass-viz
☆65 Updated 6 months ago
hao-ai-lab / LookaheadReasoning
[NeurIPS 2025] Scaling Speculative Decoding with Lookahead Reasoning
☆51 Updated 2 weeks ago
li-plus / flash-preference
Accelerate LLM preference tuning via prefix sharing with a single line of code
☆51 Updated 4 months ago
IST-DASLab / SparseFinetuning
Repository for Sparse Finetuning of LLMs via modified version of the MosaicML llmfoundry
☆42 Updated last year
meta-pytorch / KernelAgent
Autonomous GPU Kernel Generation via Deep Agents
☆123 Updated this week
Dao-AILab / grouped-latent-attention
☆130 Updated 5 months ago
tgale96 / grouped_gemm
PyTorch bindings for CUTLASS grouped GEMM.
☆127 Updated 5 months ago
mayank31398 / ladder-residual-inference
☆14 Updated 4 months ago
ademeure / cuda-side-boost
☆49 Updated 6 months ago