andrewkchan / deepseek.cpp
CPU inference for the DeepSeek family of large language models in pure C++
☆282 · Updated last month
Alternatives and similar repositories for deepseek.cpp:
Users interested in deepseek.cpp are comparing it to the libraries listed below.
- Yet Another Language Model: LLM inference in C++/CUDA, no libraries except for I/O ☆279 · Updated 2 months ago
- llama.cpp fork with additional SOTA quants and improved performance ☆231 · Updated this week
- LM inference server implementation based on *.cpp ☆154 · Updated this week
- Efficient LLM Inference over Long Sequences ☆365 · Updated last month
- A fast communication-overlapping library for tensor/expert parallelism on GPUs ☆811 · Updated 2 weeks ago
- Efficient inference of large language models ☆146 · Updated 3 months ago
- Low-bit LLM inference on CPU with lookup table ☆705 · Updated 2 months ago
- Muon is Scalable for LLM Training ☆993 · Updated this week
- VPTQ, a flexible and extreme low-bit quantization algorithm ☆622 · Updated this week
- Advanced quantization algorithm for LLMs/VLMs ☆413 · Updated this week
- Materials for learning SGLang ☆360 · Updated last week
- Free Search is a wrapper on top of publicly available SearXNG instances that provides free internet access as a REST API ☆147 · Updated this week
- A throughput-oriented high-performance serving framework for LLMs ☆785 · Updated 6 months ago
- Run DeepSeek-R1 GGUFs on KTransformers ☆212 · Updated 3 weeks ago
- Self-hosted voice chat with LLMs ☆422 · Updated last month
- A quantization algorithm for LLMs ☆137 · Updated 9 months ago
- Kyutai with an "eye" ☆160 · Updated last week
- llama3.cuda is a pure C/CUDA implementation of the Llama 3 model ☆330 · Updated 9 months ago
- Review/check GGUF files and estimate memory usage and maximum tokens per second ☆139 · Updated 2 weeks ago
- Towards Economical Inference: Enabling DeepSeek's Multi-Head Latent Attention in Any Transformer-based LLMs ☆149 · Updated last week
- OpenDiLoCo: An Open-Source Framework for Globally Distributed Low-Communication Training ☆478 · Updated 2 months ago
- Inference of Mamba models in pure C ☆187 · Updated last year
- prime is a framework for efficient, globally distributed training of AI models over the internet ☆689 · Updated this week
- 📋 NotebookMLX, an open-source version of NotebookLM (ported from NotebookLlama) ☆267 · Updated 3 weeks ago
- MoBA: Mixture of Block Attention for Long-Context LLMs ☆1,696 · Updated 3 weeks ago
- Evaluate and Enhance Your LLM Deployments for Real-World Inference Needs ☆236 · Updated this week