foldl / chatllm.cpp
Pure C++ implementation of several models for real-time chatting on your computer (CPU)
☆376 · Updated this week
Related projects
Alternatives and complementary repositories for chatllm.cpp
- A Flexible Framework for Experiencing Cutting-edge LLM Inference Optimizations ☆732 · Updated this week
- C++ implementation of Qwen-LM ☆551 · Updated 10 months ago
- ggml implementation of BERT ☆464 · Updated 8 months ago
- An innovative library for efficient LLM inference via low-bit quantization ☆348 · Updated 2 months ago
- [NeurIPS'24 Spotlight] To speed up long-context LLMs' inference, approximates the attention with dynamic sparse computation, which reduces in… ☆776 · Updated this week
- A high-throughput and memory-efficient inference and serving engine for LLMs ☆129 · Updated 4 months ago
- AutoAWQ implements the AWQ algorithm for 4-bit quantization with a 2x speedup during inference ☆1,745 · Updated last month
- Open Source Text Embedding Models with OpenAI Compatible API ☆131 · Updated 3 months ago
- llama.cpp fork with additional SOTA quants and improved performance ☆89 · Updated this week
- Official implementation of Half-Quadratic Quantization (HQQ) ☆697 · Updated last week
- Python bindings for ggml ☆132 · Updated 2 months ago
- Manage GPU clusters for running LLMs ☆551 · Updated this week
- Low-bit LLM inference on CPU with lookup table ☆563 · Updated 2 weeks ago
- VPTQ, a flexible and extreme low-bit quantization algorithm ☆498 · Updated this week
- Comparison of Language Model Inference Engines ☆190 · Updated 2 months ago
- Automatically quantize GGUF models ☆137 · Updated this week
- Small language models for Chinese-language scenarios, llama2.c-zh ☆143 · Updated 8 months ago
- Finetune ALL LLMs with ALL Adapters on ALL Platforms! ☆306 · Updated last month
- EfficientQAT: Efficient Quantization-Aware Training for Large Language Models ☆222 · Updated last month
- Production-ready LLM model compression/quantization toolkit with accelerated inference support for both CPU/GPU via HF, vLLM, and SGLang ☆118 · Updated this week
- CLIP inference in plain C/C++ with no extra dependencies ☆457 · Updated 2 months ago
- DashInfer is a native LLM inference engine aiming to deliver industry-leading performance atop various hardware architectures, including … ☆135 · Updated 2 months ago
- Yi-1.5 is an upgraded version of Yi, delivering stronger performance in coding, math, reasoning, and instruction-following capability ☆514 · Updated 4 months ago
- ☆873 · Updated 4 months ago
- A throughput-oriented high-performance serving framework for LLMs ☆629 · Updated last month
- Efficient AI Inference & Serving ☆456 · Updated 10 months ago
- ggml implementation of embedding models including SentenceTransformer and BGE ☆52 · Updated 10 months ago
- Inferflow is an efficient and highly configurable inference engine for large language models (LLMs) ☆236 · Updated 7 months ago
- Inference of Mamba models in pure C ☆177 · Updated 8 months ago