zeux / calm
CUDA/Metal accelerated language model inference
☆614 · Updated 3 months ago
Alternatives and similar repositories for calm
Users interested in calm are comparing it to the libraries listed below.
- Yet Another Language Model: LLM inference in C++/CUDA, no libraries except for I/O ☆497 · Updated last week
- Perplexity GPU Kernels ☆469 · Updated last week
- Fast low-bit matmul kernels in Triton ☆371 · Updated last week
- A Quirky Assortment of CuTe Kernels ☆582 · Updated this week
- Flash Attention in ~100 lines of CUDA (forward pass only) ☆937 · Updated 8 months ago
- Official Problem Sets / Reference Kernels for the GPU MODE Leaderboard! ☆96 · Updated last week
- ☆237 · Updated last week
- kernels, of the mega variety ☆502 · Updated this week
- Applied AI experiments and examples for PyTorch ☆296 · Updated last month
- Fastest kernels written from scratch ☆355 · Updated last week
- Cataloging released Triton kernels. ☆260 · Updated 2 weeks ago
- BitBLAS is a library to support mixed-precision matrix multiplications, especially for quantized LLM deployment. ☆676 · Updated last month
- AI Tensor Engine for ROCm ☆279 · Updated this week
- High-Performance SGEMM on CUDA devices ☆101 · Updated 8 months ago
- Tilus is a tile-level kernel programming language with explicit control over shared memory and registers. ☆362 · Updated this week
- [MLSys'25] QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving; [MLSys'25] LServe: Efficient Long-sequence LLM Se… ☆756 · Updated 6 months ago
- NVIDIA NVSHMEM is a parallel programming interface for NVIDIA GPUs based on OpenSHMEM. NVSHMEM can significantly reduce multi-process com… ☆318 · Updated last week
- FP16xINT4 LLM inference kernel that can achieve near-ideal ~4x speedups up to medium batch sizes of 16-32 tokens. ☆900 · Updated last year
- A throughput-oriented high-performance serving framework for LLMs ☆891 · Updated last week
- Efficient GPU support for LLM inference with x-bit quantization (e.g., FP6, FP5). ☆265 · Updated 2 months ago
- A profiler to disclose and quantify hardware features on GPUs. ☆174 · Updated 3 years ago
- A subset of PyTorch's neural network modules, written in Python using OpenAI's Triton. ☆576 · Updated last month
- Nvidia Instruction Set Specification Generator ☆292 · Updated last year
- Scalable and robust tree-based speculative decoding algorithm ☆358 · Updated 7 months ago
- CUDA Matrix Multiplication Optimization (a baseline sketch of this kind of kernel follows the list) ☆222 · Updated last year
- An experimental CPU backend for Triton ☆153 · Updated 3 months ago
- LLM training in simple, raw C/CUDA ☆104 · Updated last year
- ☆199 · Updated 4 months ago
- KernelBench: Can LLMs Write GPU Kernels? - Benchmark with Torch -> CUDA problems ☆573 · Updated last week
- Efficient implementation of DeepSeek Ops (Blockwise FP8 GEMM, MoE, and MLA) for AMD Instinct MI300X ☆69 · Updated last month
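
Several entries above (the SGEMM and matmul-optimization repos in particular) start from the same baseline: a shared-memory-tiled matrix multiply. The sketch below is a minimal illustration of that pattern, written for this page rather than taken from any listed repository; the kernel name, the tile size, and the assumption that N is divisible by the tile width are all choices made here.

```cuda
// Minimal shared-memory-tiled SGEMM: C = A * B, square row-major matrices,
// N assumed divisible by TILE. A hypothetical baseline sketch only.
#include <cstdio>
#include <cuda_runtime.h>

#define TILE 16

__global__ void sgemm_tiled(const float* A, const float* B, float* C, int N) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];
    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;
    for (int t = 0; t < N / TILE; ++t) {
        // Stage one TILE x TILE block of A and B into shared memory.
        As[threadIdx.y][threadIdx.x] = A[row * N + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * N + col];
        __syncthreads();
        // Accumulate the partial dot product for this tile.
        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }
    C[row * N + col] = acc;
}

int main() {
    const int N = 256;
    size_t bytes = N * N * sizeof(float);
    float *A, *B, *C;
    cudaMallocManaged(&A, bytes);
    cudaMallocManaged(&B, bytes);
    cudaMallocManaged(&C, bytes);
    for (int i = 0; i < N * N; ++i) { A[i] = 1.0f; B[i] = 2.0f; }
    dim3 block(TILE, TILE), grid(N / TILE, N / TILE);
    sgemm_tiled<<<grid, block>>>(A, B, C, N);
    cudaDeviceSynchronize();
    printf("C[0] = %.1f (expected %.1f)\n", C[0], 2.0f * N);  // 512.0
    cudaFree(A); cudaFree(B); cudaFree(C);
    return 0;
}
```

The optimized kernels in the repos above typically layer register tiling, vectorized loads, and tensor-core instructions on top of this pattern to approach library-level throughput.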