andrewkchan / yalm
Yet Another Language Model: LLM inference in C++/CUDA, no libraries except for I/O
☆551 · Sep 13, 2025 · Updated 5 months ago
Alternatives and similar repositories for yalm
Users that are interested in yalm are comparing it to the libraries listed below
- CUDA/Metal accelerated language model inference ☆626 · May 29, 2025 · Updated 8 months ago
- CPU inference for the DeepSeek family of large language models in C++ ☆315 · Oct 2, 2025 · Updated 4 months ago
- A llama model inference framework implemented in CUDA C++ ☆64 · Nov 8, 2024 · Updated last year
- FlashInfer: Kernel Library for LLM Serving ☆4,935 · Updated this week
- Material for gpu-mode lectures ☆5,726 · Feb 1, 2026 · Updated last week
- 📚LeetCUDA: Modern CUDA Learn Notes with PyTorch for Beginners🐑, 200+ CUDA Kernels, Tensor Cores, HGEMM, FA-2 MMA.🎉 ☆9,617 · Feb 5, 2026 · Updated last week
- flash attention tutorial written in python, triton, cuda, cutlass ☆486 · Jan 20, 2026 · Updated 3 weeks ago
- TiledLower is a Dataflow Analysis and Codegen Framework written in Rust. ☆14 · Nov 23, 2024 · Updated last year
- Decoding Attention is specially optimized for MHA, MQA, GQA and MLA using CUDA core for the decoding stage of LLM inference. ☆46 · Jun 11, 2025 · Updated 8 months ago
- KernelBench: Can LLMs Write GPU Kernels? - Benchmark + Toolkit with Torch -> CUDA (+ more DSLs) ☆792 · Jan 20, 2026 · Updated 3 weeks ago
- Fast low-bit matmul kernels in Triton ☆429 · Feb 1, 2026 · Updated last week
- Persistent dense gemm for Hopper in `CuTeDSL` ☆15 · Aug 9, 2025 · Updated 6 months ago
- CPM.cu is a lightweight, high-performance CUDA implementation for LLMs, optimized for end-device inference and featuring cutting-edge tec… ☆226 · Jan 14, 2026 · Updated last month
- Examples of CUDA implementations by Cutlass CuTe ☆270 · Jul 1, 2025 · Updated 7 months ago
- Learnings and programs related to CUDA ☆433 · Jun 29, 2025 · Updated 7 months ago
- Flash Attention in ~100 lines of CUDA (forward pass only) ☆1,068 · Dec 30, 2024 · Updated last year
- A throughput-oriented high-performance serving framework for LLMs ☆945 · Oct 29, 2025 · Updated 3 months ago
- ☆41 · Nov 1, 2025 · Updated 3 months ago
- 📚A curated list of Awesome LLM/VLM Inference Papers with Codes: Flash-Attention, Paged-Attention, WINT8/4, Parallelism, etc.🎉 ☆4,969 · Jan 18, 2026 · Updated 3 weeks ago
- how to optimize some algorithm in cuda. ☆2,819 · Jan 31, 2026 · Updated 2 weeks ago
- Matrix multiplication on GPUs for matrices stored on a CPU. Similar to cublasXt, but ported to both NVIDIA and AMD GPUs. ☆32 · Apr 2, 2025 · Updated 10 months ago
- Implementation from scratch in C of the Multi-head latent attention used in the Deepseek-v3 technical paper. ☆19 · Jan 15, 2025 · Updated last year
- Mooncake is the serving platform for Kimi, a leading LLM service provided by Moonshot AI. ☆4,701 · Updated this week
- Efficient Triton Kernels for LLM Training ☆6,123 · Feb 7, 2026 · Updated last week
- 🤖FFPA: Extend FlashAttention-2 with Split-D, ~O(1) SRAM complexity for large headdim, 1.8x~3x↑🎉 vs SDPA EA. ☆250 · Feb 5, 2026 · Updated last week
- High-Performance FP32 GEMM on CUDA devices ☆117 · Jan 21, 2025 · Updated last year
- A lightweight design for computation-communication overlap. ☆219 · Jan 20, 2026 · Updated 3 weeks ago
- Step-by-step optimization of CUDA SGEMM ☆431 · Mar 30, 2022 · Updated 3 years ago
- Mirage Persistent Kernel: Compiling LLMs into a MegaKernel ☆2,120 · Jan 29, 2026 · Updated 2 weeks ago
- ☆288 · Feb 4, 2026 · Updated last week
- SGLang is a high-performance serving framework for large language models and multimodal models. ☆23,439 · Updated this week
- Optimizing SGEMM kernel functions on NVIDIA GPUs to a close-to-cuBLAS performance. ☆407 · Jan 2, 2025 · Updated last year
- Puzzles for learning Triton, play it with minimal environment configuration! ☆624 · Dec 28, 2025 · Updated last month
- Tensor library for machine learning ☆13,923 · Feb 7, 2026 · Updated last week
- ☆54 · May 5, 2025 · Updated 9 months ago
- LightLLM is a Python-based LLM (Large Language Model) inference and serving framework, notable for its lightweight design, easy scalabili… ☆3,888 · Updated this week
- DINOv2 inference engine written in C/C++ using ggml and OpenCV. ☆88 · May 6, 2025 · Updated 9 months ago
- Load and run Llama from safetensors files in C ☆15 · Oct 24, 2024 · Updated last year
- Nano vLLM ☆11,617 · Nov 3, 2025 · Updated 3 months ago