Lizonghang / TPI-LLM
TPI-LLM: Serving 70B-scale LLMs Efficiently on Low-resource Edge Devices
☆185 · Updated last month
Alternatives and similar repositories for TPI-LLM
Users interested in TPI-LLM are comparing it to the libraries listed below.
- CLI tool to quantize GGUF, GPTQ, AWQ, HQQ, and EXL2 models ☆73 · Updated 7 months ago
- GRadient-INformed MoE ☆263 · Updated 9 months ago
- Lightweight toolkit to train and fine-tune 1.58-bit language models ☆81 · Updated last month
- Sparse inference for transformer-based LLMs ☆193 · Updated last week
- LLM inference on consumer devices ☆120 · Updated 4 months ago
- Q-GaLore: Quantized GaLore with INT4 Projection and Layer-Adaptive Low-Rank Gradients ☆198 · Updated last year
- DFloat11: Lossless LLM Compression for Efficient GPU Inference ☆446 · Updated last month
- Distributed inference for MLX LLMs ☆93 · Updated 11 months ago
- ☆267 · Updated this week
- ☆139 · Updated 3 weeks ago
- Easy-to-use, high-performance knowledge distillation for LLMs ☆88 · Updated 2 months ago
- A pipeline-parallel training script for LLMs ☆153 · Updated 2 months ago
- Query-agnostic KV cache eviction: 3–4× reduction in memory and 2× decrease in latency (Qwen3/2.5, Gemma3, LLaMA3) ☆91 · Updated last week
- Training-free, post-training, efficient sub-quadratic-complexity attention, implemented with OpenAI Triton ☆139 · Updated this week
- Code to train and evaluate Neural Attention Memory Models to obtain universally applicable memory systems for transformers ☆316 · Updated 8 months ago
- 1.58-bit LLM on Apple Silicon using MLX ☆214 · Updated last year
- An efficient implementation of the method proposed in "The Era of 1-bit LLMs" ☆153 · Updated 9 months ago
- PyTorch implementation of models from the Zamba2 series ☆184 · Updated 5 months ago
- Official PyTorch implementation for Hogwild! Inference: Parallel LLM Generation with a Concurrent Attention Cache ☆112 · Updated last week
- ☆101 · Updated 10 months ago
- ☆156 · Updated 3 months ago
- AnyModal is a flexible multimodal language model framework for PyTorch ☆100 · Updated 6 months ago
- ☆134 · Updated 10 months ago
- VPTQ: a flexible and extreme low-bit quantization algorithm ☆647 · Updated 2 months ago
- [ACL 2025 Main] EfficientQAT: Efficient Quantization-Aware Training for Large Language Models ☆277 · Updated last month
- Prepare for DeepSeek R1 inference: benchmark CPU, DRAM, SSD, iGPU, GPU, ... with efficient code ☆72 · Updated 5 months ago
- LLM inference in C/C++ ☆21 · Updated 3 months ago
- Transplants vocabulary between language models, enabling the creation of draft models for speculative decoding WITHOUT retraining ☆31 · Updated 3 months ago
- Testing LLM reasoning abilities with family relationship quizzes ☆62 · Updated 5 months ago
- ☆57 · Updated 5 months ago