Lizonghang / TPI-LLM
TPI-LLM: Serving 70b-scale LLMs Efficiently on Low-resource Edge Devices
☆168 · Updated 2 months ago
Alternatives and similar repositories for TPI-LLM:
Users interested in TPI-LLM are comparing it to the repositories listed below.
- CLI tool to quantize GGUF, GPTQ, AWQ, HQQ, and EXL2 models ☆67 · Updated last month
- GRadient-INformed MoE ☆261 · Updated 4 months ago
- EfficientQAT: Efficient Quantization-Aware Training for Large Language Models ☆241 · Updated 3 months ago
- 1.58-bit LLM on Apple Silicon using MLX ☆180 · Updated 8 months ago
- Automated Identification of Redundant Layer Blocks for Pruning in Large Language Models ☆212 · Updated 9 months ago
- Q-GaLore: Quantized GaLore with INT4 Projection and Layer-Adaptive Low-Rank Gradients ☆190 · Updated 6 months ago
- Bamboo-7B Large Language Model ☆89 · Updated 10 months ago
- Automatically quantize GGUF models ☆151 · Updated this week
- A pipeline-parallel training script for LLMs ☆121 · Updated this week
- LLM as Interpreter for Natural Language Programming, Pseudo-code Programming, and Flow Programming of AI Agents ☆34 · Updated 6 months ago
- PyTorch implementation of models from the Zamba2 series ☆173 · Updated last week
- Distributed inference for MLX LLMs ☆79 · Updated 5 months ago
- Easy-to-use, high-performance knowledge distillation for LLMs ☆40 · Updated 3 weeks ago
- Scalable and robust tree-based speculative decoding algorithm ☆331 · Updated this week
- ☆98 · Updated 5 months ago
- OpenDiLoCo: An Open-Source Framework for Globally Distributed Low-Communication Training ☆426 · Updated 2 weeks ago
- Efficient LLM Inference over Long Sequences ☆349 · Updated last month
- An efficient implementation of the method proposed in "The Era of 1-bit LLMs" ☆154 · Updated 3 months ago
- KV cache compression for high-throughput LLM inference ☆109 · Updated last week
- ☆109 · Updated last month
- Testing LLM reasoning abilities with family-relationship quizzes ☆57 · Updated this week
- 1.58-bit LLaMa model ☆80 · Updated 9 months ago
- VPTQ, a flexible and extreme low-bit quantization algorithm ☆572 · Updated last week
- Micro Llama is a small Llama-based model with 300M parameters, trained from scratch on a $500 budget ☆140 · Updated 10 months ago
- llama.cpp fork with additional SOTA quants and improved performance ☆133 · Updated this week
- Code to train and evaluate Neural Attention Memory Models to obtain universally applicable memory systems for transformers ☆282 · Updated 3 months ago
- llama3.cuda: a pure C/CUDA implementation of the Llama 3 model ☆319 · Updated 7 months ago
- ☆192 · Updated last week
- A toolkit for fine-tuning, inferencing, and evaluating GreenBitAI's LLMs ☆80 · Updated last week
- Fast Real-time Object Detection with High-Res Output https://x.com/_akhaliq/status/1840213012818329826 ☆61 · Updated 4 months ago