mscheong01 / speculative_decoding.cLinks

minimal C implementation of speculative decoding based on llama2.c

☆25

Alternatives and similar repositories for speculative_decoding.c

Users that are interested in speculative_decoding.c are comparing it to the libraries listed below

Sorting:

salykova / sgemm.cu
High-Performance FP32 GEMM on CUDA devices
☆117Updated last year
thomaschlt / mla.c
Implementation from scratch in C of the Multi-head latent attention used in the Deepseek-v3 technical paper.
☆19Updated last year
mit-han-lab / tinychat-tutorial
☆77Updated last year
Bruce-Lee-LY / decoding_attention
Decoding Attention is specially optimized for MHA, MQA, GQA and MLA using CUDA core for the decoding stage of LLM inference.
☆46Updated 7 months ago
IBM / triton-dejavu
Framework to reduce autotune overhead to zero for well known deployments.
☆95Updated 4 months ago
wangsiping97 / FastGEMV
High-speed GEMV kernels, at most 2.7x speedup compared to pytorch baseline.
☆127Updated last year
vedantroy / gpu_kernels
☆27Updated 2 years ago
triton-lang / kernels
☆104Updated last year
efeslab / fiddler
[ICLR'25] Fast Inference of MoE Models with CPU-GPU Orchestration
☆260Updated last year
flashinfer-ai / cutlass-viz
☆65Updated 9 months ago
IST-DASLab / qutlass
QuTLASS: CUTLASS-Powered Quantized BLAS for Deep Learning
☆165Updated 2 months ago
TiledTensor / TiledBench
Benchmark tests supporting the TiledCUDA library.
☆18Updated last year
LeiWang1999 / AutoGPTQ.tvm
GPTQ inference TVM kernel
☆40Updated last year
AutonomicPerfectionist / PipeInfer
PipeInfer: Accelerating LLM Inference using Asynchronous Pipelined Speculation
☆32Updated last year
NTT123 / cute-viz
Cute layout visualization
☆29Updated 3 weeks ago
meta-pytorch / triton-cpu
An experimental CPU backend for Triton (https//github.com/openai/triton)
☆49Updated 5 months ago
moritztng / grayskull-attention
Attention in SRAM on Tenstorrent Grayskull
☆40Updated last year
tile-ai / AttentionEngine
☆52Updated 8 months ago
tspeterkim / paged-attention-minimal
a minimal cache manager for PagedAttention, on top of llama3.
☆135Updated last year
ademeure / cuda-side-boost
☆53Updated 9 months ago
microsoft / TileFusion
TileFusion is an experimental C++ macro kernel template library that elevates the abstraction level in CUDA C for tile processing.
☆106Updated 7 months ago
ScalingIntelligence / good-kernels
Samples of good AI generated CUDA kernels
☆99Updated 8 months ago
gpu-mode / ring-attention
ring-attention experiments
☆165Updated last year
ademeure / DeeperGEMM
DeeperGEMM: crazy optimized version
☆73Updated 9 months ago
microsoft / AttentionEngine
☆118Updated 8 months ago
sgl-project / whl
Kernel Library Wheel for SGLang
☆17Updated last week
gpu-mode / popcorn-cli
☆95Updated this week
ACA-Lab-SJTU / token-ring
☆13Updated last year
ankan-ban / llama_cu_awq
llama INT4 cuda inference with AWQ
☆55Updated last year
IST-DASLab / Sparse-Marlin
Boosting 4-bit inference kernels with 2:4 Sparsity
☆93Updated last year