Zepan / llama.cpp
Port of Facebook's LLaMA model in C/C++
☆13 · Updated 2 years ago
Alternatives and similar repositories for llama.cpp
Users interested in llama.cpp are comparing it to the repositories listed below.
- ☆172 · Updated this week
- ☆120 · Updated last year
- My development fork of llama.cpp. Currently working on the RK3588 NPU and Tenstorrent backends ☆114 · Updated 2 weeks ago
- ☆577 · Updated last year
- Repository of model demos using TT-Buda ☆63 · Updated 10 months ago
- BitBLAS is a library to support mixed-precision matrix multiplications, especially for quantized LLM deployment. ☆749 · Updated 6 months ago
- Efficient Inference of Transformer models ☆478 · Updated last year
- Tiny ASIC implementation of the matrix multiplication unit from "The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits" ☆175 · Updated last year
- vLLM: A high-throughput and memory-efficient inference and serving engine for LLMs ☆93 · Updated last week
- OpenAI Triton backend for Intel® GPUs ☆226 · Updated this week
- GPTQ inference Triton kernel ☆321 · Updated 2 years ago
- Development repository for the Triton language and compiler ☆140 · Updated last week
- LLaMA/RWKV ONNX models, quantization and test cases ☆367 · Updated 2 years ago
- RWKV v7 inference in pure C. ☆44 · Updated 3 months ago
- FP16xINT4 LLM inference kernel that achieves near-ideal ~4x speedups up to medium batch sizes of 16-32 tokens. ☆1,005 · Updated last year
- [ICML 2024] BiLLM: Pushing the Limit of Post-Training Quantization for LLMs ☆229 · Updated last year
- Tenstorrent TT-BUDA Repository ☆314 · Updated 10 months ago
- Reverse engineering the RK3588 NPU ☆110 · Updated last year
- Prepare for DeepSeek R1 inference: benchmark CPU, DRAM, SSD, iGPU, GPU, ... with efficient code. ☆74 · Updated last year
- VPTQ, a flexible and extreme low-bit quantization algorithm ☆674 · Updated 9 months ago
- Code for the paper "QMoE: Practical Sub-1-Bit Compression of Trillion-Parameter Models" ☆280 · Updated 2 years ago
- Code for the paper "QuIP: 2-Bit Quantization of Large Language Models With Guarantees" ☆395 · Updated last year
- High-speed GEMV kernels, up to 2.7x speedup over the PyTorch baseline. ☆127 · Updated last year
- An innovative library for efficient LLM inference via low-bit quantization ☆352 · Updated last year
- SoTA Transformers with a C backend for fast inference on your CPU. ☆311 · Updated 2 years ago
- LLM training in simple, raw C/HIP for AMD GPUs ☆58 · Updated last year
- Fork of LLVM to support AMD AIEngine processors ☆187 · Updated this week
- [ICML 2024] SqueezeLLM: Dense-and-Sparse Quantization ☆713 · Updated last year
- An experimental CPU backend for Triton ☆174 · Updated 2 months ago
- Efficient GPU support for LLM inference with x-bit quantization (e.g., FP6, FP5). ☆276 · Updated 6 months ago