zhangpiu / llm.cpp
LLM training in simple C++/CUDA (with Eigen3)
☆17 · Updated last year
Alternatives and similar repositories for llm.cpp
Users interested in llm.cpp are comparing it to the libraries listed below.
- High-Performance FP32 GEMM on CUDA devices ☆117 · Updated last year
- A C++ port of karpathy/llm.c featuring a tiny torch library while maintaining overall simplicity. ☆42 · Updated last year
- This repository is a read-only mirror of https://gitlab.arm.com/kleidi/kleidiai ☆113 · Updated last week
- ☆104 · Updated last year
- Yet Another Language Model: LLM inference in C++/CUDA, no libraries except for I/O ☆550 · Updated 4 months ago
- TPU inference for vLLM, with unified JAX and PyTorch support. ☆228 · Updated this week
- Perplexity GPU Kernels ☆554 · Updated 3 months ago
- Learning about CUDA by writing PTX code. ☆152 · Updated last year
- SynapseAI Core is a reference implementation of the SynapseAI API running on Habana Gaudi ☆42 · Updated last year
- A GPU-driven system framework for scalable AI applications ☆124 · Updated last year
- LLM training in simple, raw C/CUDA ☆112 · Updated last year
- ☆286 · Updated this week
- Fast low-bit matmul kernels in Triton ☆427 · Updated last week
- Intel® Extension for DeepSpeed* is an extension to DeepSpeed that brings feature support with SYCL kernels on Intel GPU (XPU) devices. Note… ☆64 · Updated 7 months ago
- [ICLR'25] Fast Inference of MoE Models with CPU-GPU Orchestration ☆260 · Updated last year
- ☆96 · Updated 10 months ago
- Cataloging released Triton kernels. ☆292 · Updated 4 months ago
- ring-attention experiments ☆165 · Updated last year
- Fast and memory-efficient exact attention ☆111 · Updated last week
- ☆27 · Updated 2 years ago
- Inference Llama 2 in one file of pure C & one file with CUDA ☆32 · Updated 2 years ago
- Efficient LLM Inference over Long Sequences ☆394 · Updated 7 months ago
- Tritonbench is a collection of PyTorch custom operators with example inputs to measure their performance. ☆324 · Updated this week
- A Python library that transfers PyTorch tensors between CPU and NVMe ☆125 · Updated last year
- Accelerating MoE with IO and Tile-aware Optimizations ☆569 · Updated 2 weeks ago
- Fastest kernels written from scratch ☆532 · Updated 4 months ago
- An extensible collectives library in Triton ☆95 · Updated 10 months ago
- A minimal cache manager for PagedAttention, on top of llama3. ☆135 · Updated last year
- Efficient GPU support for LLM inference with x-bit quantization (e.g., FP6, FP5) ☆276 · Updated 6 months ago
- PyTorch distributed training acceleration framework ☆55 · Updated 5 months ago