andrewkchan / yalm
Yet Another Language Model: LLM inference in C++/CUDA, no libraries except for I/O
☆211, updated this week
Alternatives and similar repositories for yalm:
Users interested in yalm are comparing it to the repositories listed below.
- Materials for learning SGLang (☆166, updated last week)
- Fast Matrix Multiplications for Lookup Table-Quantized LLMs (☆219, updated last week)
- Fastest kernels written from scratch (☆118, updated last month)
- Fast low-bit matmul kernels in Triton (☆187, updated last week)
- QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving (☆481, updated 2 months ago)
- BitBLAS is a library to support mixed-precision matrix multiplications, especially for quantized LLM deployment. (☆496, updated this week)
- ☆170, updated this week
- Repository for the QUIK project, enabling the use of 4-bit kernels for generative inference - EMNLP 2024 (☆175, updated 9 months ago)
- Flash Attention in ~100 lines of CUDA (forward pass only) (☆681, updated 2 weeks ago)
- Dynamic Memory Management for Serving LLMs without PagedAttention (☆272, updated last month)
- Efficient LLM Inference over Long Sequences (☆344, updated 2 weeks ago)
- [MLSys'24] Atom: Low-bit Quantization for Efficient and Accurate LLM Serving (☆290, updated 6 months ago)
- FP16xINT4 LLM inference kernel that achieves near-ideal ~4x speedups up to medium batch sizes of 16-32 tokens. (☆680, updated 4 months ago)
- Efficient GPU support for LLM inference with x-bit quantization (e.g. FP6, FP5). (☆230, updated 2 months ago)
- ☆185, updated last month
- Applied AI experiments and examples for PyTorch (☆211, updated this week)
- CUDA/Metal-accelerated language model inference (☆489, updated last month)
- Cataloging released Triton kernels. (☆155, updated last week)
- A low-latency & high-throughput serving engine for LLMs (☆296, updated 4 months ago)
- A throughput-oriented high-performance serving framework for LLMs (☆692, updated 3 months ago)
- A minimal cache manager for PagedAttention, built on top of llama3. (☆59, updated 4 months ago)
- Fast Inference of MoE Models with CPU-GPU Orchestration (☆179, updated 2 months ago)
- A high-throughput and memory-efficient inference and serving engine for LLMs (☆257, updated 3 months ago)
- A scalable and robust tree-based speculative decoding algorithm (☆329, updated 5 months ago)
- Ring-attention experiments (☆116, updated 3 months ago)
- Easy and Efficient Quantization for Transformers (☆191, updated last month)
- ☆178, updated 6 months ago
- Flash attention tutorial written in Python, Triton, CUDA, and CUTLASS (☆244, updated 2 weeks ago)
- Boosting 4-bit inference kernels with 2:4 Sparsity (☆64, updated 4 months ago)
- High-speed GEMV kernels, up to 2.7x speedup over the PyTorch baseline. (☆93, updated 6 months ago)