chu-tianxiang / llama-cpp-torch
llama.cpp to PyTorch Converter
☆33 · Updated last year
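For context, a minimal sketch of the kind of conversion such a tool performs: dequantizing Q8_0-style GGUF blocks (32 int8 weights sharing one fp16 scale) into a float PyTorch tensor. The function name and the round-trip demo are illustrative assumptions, not the repository's actual API.

```python
# Hypothetical sketch: dequantize Q8_0-style block-quantized weights in PyTorch.
# Q8_0 stores weights in blocks of 32 int8 values, one float16 scale per block.
import torch

BLOCK = 32  # Q8_0 block size

def dequantize_q8_0(qweights: torch.Tensor, scales: torch.Tensor) -> torch.Tensor:
    """qweights: int8 tensor of shape (n_blocks, BLOCK);
    scales: float16 tensor of shape (n_blocks,).
    Returns a flat float32 tensor of the reconstructed weights."""
    return (qweights.to(torch.float32)
            * scales.to(torch.float32).unsqueeze(-1)).reshape(-1)

# Illustrative round trip: quantize random weights, then recover them.
w = torch.randn(4 * BLOCK)
blocks = w.reshape(-1, BLOCK)
scales = blocks.abs().amax(dim=-1) / 127.0            # one scale per block
q = torch.round(blocks / scales.unsqueeze(-1)).to(torch.int8)
w_hat = dequantize_q8_0(q, scales.to(torch.float16))
print(torch.allclose(w, w_hat, atol=1e-1))            # approximate reconstruction
```

A full converter would additionally parse the GGUF container, map tensor names onto a PyTorch state dict, and handle the other llama.cpp quantization formats (Q4_0, Q5_K, etc.), each with its own block layout.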
Alternatives and similar repositories for llama-cpp-torch
Users interested in llama-cpp-torch are comparing it to the libraries listed below.
- QuIP quantization ☆54 · Updated last year
- Layer-Condensed KV cache w/ 10 times larger batch size, fewer params and less computation. Dramatic speed-up with better task performance… ☆149 · Updated 2 months ago
- RWKV-7: Surpassing GPT ☆92 · Updated 7 months ago
- 1.58-bit LLaMa model ☆81 · Updated last year
- Q-GaLore: Quantized GaLore with INT4 Projection and Layer-Adaptive Low-Rank Gradients. ☆198 · Updated 11 months ago
- SparseGPT + GPTQ Compression of LLMs like LLaMa, OPT, Pythia ☆41 · Updated 2 years ago
- ☆53 · Updated last year
- Lightweight toolkit package to train and fine-tune 1.58-bit language models ☆80 · Updated last month
- Tree Attention: Topology-aware Decoding for Long-Context Attention on GPU clusters ☆126 · Updated 6 months ago
- The simplest implementation of recent Sparse Attention patterns for efficient LLM inference. ☆77 · Updated 2 weeks ago
- vLLM: A high-throughput and memory-efficient inference and serving engine for LLMs ☆87 · Updated this week
- Query-agnostic KV cache eviction: 3–4× reduction in memory and 2× decrease in latency (Qwen3/2.5, Gemma3, LLaMA3) ☆86 · Updated 2 weeks ago
- ☆137 · Updated this week
- Experiments with BitNet inference on CPU ☆54 · Updated last year
- Inference of Mamba models in pure C ☆187 · Updated last year
- ☆68 · Updated this week
- 3x Faster Inference; Unofficial implementation of EAGLE Speculative Decoding ☆66 · Updated last week
- Inference code for LLaMA models ☆42 · Updated 2 years ago
- Code for data-aware compression of DeepSeek models ☆35 · Updated 2 weeks ago
- Low-Rank adapter extraction for fine-tuned transformers models ☆173 · Updated last year
- ☆51 · Updated 7 months ago
- Prepare for DeepSeek R1 inference: benchmark CPU, DRAM, SSD, iGPU, GPU, ... with efficient code.