zhangpiu / llm.cpp
LLM training in simple C++/CUDA (with Eigen3)
☆17 · Updated last year
Alternatives and similar repositories for llm.cpp
Users interested in llm.cpp are comparing it to the libraries listed below.
- A C++ port of karpathy/llm.c featuring a tiny torch library while maintaining overall simplicity. ☆42 · Updated last year
- High-Performance FP32 GEMM on CUDA devices ☆117 · Updated last year
- Yet Another Language Model: LLM inference in C++/CUDA, no libraries except for I/O ☆550 · Updated 4 months ago
- Inference Llama 2 in one file of pure C & one file with CUDA ☆32 · Updated 2 years ago
- ☆137 · Updated last week
- A GPU-driven system framework for scalable AI applications ☆124 · Updated last year
- Fast and efficient attention method exploration and implementation. ☆25 · Updated 10 months ago
- LLM training in simple, raw C/CUDA ☆112 · Updated last year
- Fast and memory-efficient exact attention ☆111 · Updated last week
- OpenAI Triton backend for Intel® GPUs ☆226 · Updated last week
- An experimental CPU backend for Triton (https://github.com/openai/triton) ☆49 · Updated 5 months ago
- ☆89 · Updated 2 months ago
- Intel® Extension for DeepSpeed* is an extension to DeepSpeed that brings feature support with SYCL kernels on Intel GPU (XPU) device. Note… ☆64 · Updated 7 months ago
- A minimal cache manager for PagedAttention, on top of llama3. ☆135 · Updated last year
- Inference of Mamba and Mamba2 models in pure C ☆196 · Updated 2 weeks ago
- Materials for learning SGLang ☆738 · Updated last month
- Cataloging released Triton kernels. ☆292 · Updated 5 months ago
- ☆125 · Updated last year
- Learning about CUDA by writing PTX code. ☆152 · Updated last year
- Perplexity GPU Kernels ☆554 · Updated 3 months ago
- torchcomms: a modern PyTorch communications API ☆327 · Updated this week
- A curated list of awesome projects and papers for distributed training or inference ☆265 · Updated last year
- Easy and Efficient Quantization for Transformers ☆204 · Updated last week
- Use safetensors with ONNX 🤗 ☆84 · Updated 3 weeks ago
- NVIDIA NVSHMEM is a parallel programming interface for NVIDIA GPUs based on OpenSHMEM. NVSHMEM can significantly reduce multi-process com… ☆462 · Updated last month
- Dynamic Memory Management for Serving LLMs without PagedAttention ☆457 · Updated 8 months ago
- TPU inference for vLLM, with unified JAX and PyTorch support. ☆228 · Updated this week
- Accelerated General (FP32) Matrix Multiplication from scratch in CUDA ☆182 · Updated last year
- Fast low-bit matmul kernels in Triton ☆427 · Updated last week
- Inference Vision Transformer (ViT) in plain C/C++ with ggml ☆306 · Updated last year