Lizonghang / TPI-LLM
TPI-LLM: Serving 70b-scale LLMs Efficiently on Low-resource Edge Devices
☆180 · Updated 3 weeks ago
Alternatives and similar repositories for TPI-LLM
Users interested in TPI-LLM are comparing it to the libraries listed below.
- CLI tool to quantize GGUF, GPTQ, AWQ, HQQ and EXL2 models ☆73 · Updated 6 months ago
- Official PyTorch implementation for Hogwild! Inference: Parallel LLM Generation with a Concurrent Attention Cache ☆109 · Updated 2 months ago
- ☆137 · Updated this week
- ☆57 · Updated 4 months ago
- GRadient-INformed MoE ☆263 · Updated 9 months ago
- LLM inference in C/C++ ☆77 · Updated this week
- Efficient LLM Inference over Long Sequences ☆378 · Updated 3 weeks ago
- ☆149 · Updated 2 months ago
- Q-GaLore: Quantized GaLore with INT4 Projection and Layer-Adaptive Low-Rank Gradients. ☆198 · Updated 11 months ago
- [ACL 2025 Main] EfficientQAT: Efficient Quantization-Aware Training for Large Language Models ☆273 · Updated last month
- An efficient implementation of the method proposed in "The Era of 1-bit LLMs" ☆154 · Updated 8 months ago
- DFloat11: Lossless LLM Compression for Efficient GPU Inference ☆426 · Updated last month
- VPTQ, a flexible and extreme low-bit quantization algorithm ☆643 · Updated 2 months ago
- 🤖 A multilingual translation tool that automatically converts Hugging Face's daily AI research papers into 🇯🇵 Japanese, 🇰🇷 Korean, … ☆16 · Updated this week
- Lightweight toolkit package to train and fine-tune 1.58-bit language models ☆80 · Updated last month
- Simple extension on vLLM to help you speed up reasoning models without training. ☆161 · Updated 3 weeks ago
- Automatically quantize GGUF models ☆184 · Updated last week
- Automated Identification of Redundant Layer Blocks for Pruning in Large Language Models ☆238 · Updated last year
- KV cache compression for high-throughput LLM inference ☆131 · Updated 4 months ago
- Training-free Post-training Efficient Sub-quadratic Complexity Attention. Implemented with OpenAI Triton. ☆136 · Updated last week
- Query-agnostic KV cache eviction: 3–4× reduction in memory and 2× decrease in latency (Qwen3/2.5, Gemma3, LLaMA3) ☆85 · Updated 2 weeks ago
- LongRoPE is a novel method that extends the context window of pre-trained LLMs to an impressive 2048k tokens. ☆231 · Updated 10 months ago
- Prepare for DeepSeek R1 inference: Benchmark CPU, DRAM, SSD, iGPU, GPU, ... with efficient code. ☆71 · Updated 4 months ago
- Distributed inference for MLX LLMs ☆93 · Updated 10 months ago
- The homepage of the OneBit model quantization framework. ☆181 · Updated 4 months ago
- Code for the paper "Coding Agents with Multimodal Browsing are Generalist Problem Solvers" ☆50 · Updated this week
- 1.58-bit LLM on Apple Silicon using MLX ☆214 · Updated last year
- [ICLR 2025] DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads ☆466 · Updated 4 months ago
- ☆45 · Updated last year
- Scalable and robust tree-based speculative decoding algorithm ☆348 · Updated 4 months ago