zeux / calmLinks

CUDA/Metal accelerated language model inference

☆599

Alternatives and similar repositories for calm

Users that are interested in calm are comparing it to the libraries listed below

Sorting:

andrewkchan / yalm
Yet Another Language Model: LLM inference in C++/CUDA, no libraries except for I/O
☆396Updated 2 months ago
tspeterkim / flash-attention-minimal
Flash Attention in ~100 lines of CUDA (forward pass only)
☆897Updated 7 months ago
ppl-ai / pplx-kernels
Perplexity GPU Kernels
☆418Updated 3 weeks ago
mobiusml / gemlite
Fast low-bit matmul kernels in Triton
☆339Updated this week
mirage-project / mirage
Mirage: Automatically Generating Fast GPU Kernels without Programming in Triton/CUDA
☆1,647Updated last week
HazyResearch / Megakernels
kernels, of the mega variety
☆466Updated 2 months ago
efeslab / Nanoflow
A throughput-oriented high-performance serving framework for LLMs
☆862Updated last month
Deep-Learning-Profiling-Tools / triton-viz
☆227Updated this week
Dao-AILab / quack
A Quirky Assortment of CuTe Kernels
☆388Updated this week
IST-DASLab / marlin
FP16xINT4 LLM inference kernel that can achieve near-ideal ~4x speedups up to medium batchsizes of 16-32 tokens.
☆870Updated 11 months ago
microsoft / BitBLAS
BitBLAS is a library to support mixed-precision matrix multiplications, especially for quantized LLM deployment.
☆656Updated this week
pytorch-labs / applied-ai
Applied AI experiments and examples for PyTorch
☆289Updated 2 months ago
pranjalssh / fast.cu
Fastest kernels written from scratch
☆310Updated 4 months ago
mit-han-lab / omniserve
[MLSys'25] QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving; [MLSys'25] LServe: Efficient Long-sequence LLM Se…
☆731Updated 5 months ago
ROCm / aiter
AI Tensor Engine for ROCm
☆243Updated this week
salykova / sgemm.cu
High-Performance SGEMM on CUDA devices
☆98Updated 6 months ago
BobMcDear / attorch
A subset of PyTorch's neural network modules, written in Python using OpenAI's Triton.
☆567Updated this week
gpu-mode / reference-kernels
Official Problem Sets / Reference Kernels for the GPU MODE Leaderboard!
☆69Updated 3 weeks ago
ByteDance-Seed / Triton-distributed
Distributed Compiler based on Triton for Parallel Systems
☆941Updated this week
gpu-mode / triton-index
Cataloging released Triton kernels.
☆247Updated 6 months ago
tile-ai / tilelang
Domain-specific language designed to streamline the development of high-performance GPU/CPU/Accelerators kernels
☆1,489Updated this week
NVIDIA / compute-eval
Evaluating Large Language Models for CUDA Code Generation ComputeEval is a framework designed to generate and evaluate CUDA code from Lar…
☆58Updated last month
ScalingIntelligence / KernelBench
KernelBench: Can LLMs Write GPU Kernels? - Benchmark with Torch -> CUDA problems
☆505Updated last week
siboehm / SGEMM_CUDA
Fast CUDA matrix multiplication from scratch
☆786Updated last year
sgl-project / sgl-learning-materials
Materials for learning SGLang
☆522Updated 3 weeks ago
microsoft / vattention
Dynamic Memory Management for Serving LLMs without PagedAttention
☆405Updated 2 months ago
apple / ml-recurrent-drafter
☆215Updated 6 months ago
MekkCyber / CutlassAcademy
A curated collection of resources, tutorials, and best practices for learning and mastering NVIDIA CUTLASS
☆205Updated 3 months ago
kuterd / nv_isa_solver
Nvidia Instruction Set Specification Generator
☆286Updated last year
triton-lang / triton-cpu
An experimental CPU backend for Triton
☆139Updated 2 months ago