Lizonghang / TPI-LLM
TPI-LLM: Serving 70B-scale LLMs Efficiently on Low-resource Edge Devices
☆164 · Updated last month
Related projects
Alternatives and complementary repositories for TPI-LLM
- CLI tool to quantize GGUF, GPTQ, AWQ, HQQ and EXL2 models ☆62 · Updated last month
- Automatically quantize GGUF models ☆137 · Updated this week
- Q-GaLore: Quantized GaLore with INT4 Projection and Layer-Adaptive Low-Rank Gradients ☆172 · Updated 3 months ago
- VPTQ, a flexible and extreme low-bit quantization algorithm ☆498 · Updated last week
- EfficientQAT: Efficient Quantization-Aware Training for Large Language Models ☆222 · Updated last month
- 1.58-bit LLaMa model ☆79 · Updated 7 months ago
- ☆117 · Updated this week
- An efficient implementation of the method proposed in "The Era of 1-bit LLMs" ☆154 · Updated 3 weeks ago
- Distributed inference for MLX LLMs ☆69 · Updated 3 months ago
- 1.58-bit LLM on Apple Silicon using MLX ☆136 · Updated 6 months ago
- Fast approximate inference on a single GPU with sparsity-aware offloading ☆38 · Updated 10 months ago
- A pipeline-parallel training script for LLMs ☆83 · Updated this week
- Automated identification of redundant layer blocks for pruning in large language models ☆194 · Updated 6 months ago
- Mixture-of-Ollamas ☆26 · Updated 3 months ago
- Fast parallel LLM inference for MLX ☆145 · Updated 4 months ago
- A Python package for serving LLMs on OpenAI-compatible API endpoints with prompt caching using MLX ☆53 · Updated this week
- PyTorch implementation of models from the Zamba2 series ☆158 · Updated this week
- llama3.cuda is a pure C/CUDA implementation of the Llama 3 model ☆307 · Updated 5 months ago
- prime (previously called ZeroBand) is a framework for efficient, globally distributed training of AI models over the internet ☆207 · Updated this week
- HTTP proxy for on-demand model loading with llama.cpp (or other OpenAI-compatible backends) ☆38 · Updated this week
- SmolLM with Entropix sampler in PyTorch ☆137 · Updated last week
- GPT-4-level conversational QA trained in a few hours ☆54 · Updated 2 months ago
- ☆53 · Updated 5 months ago
- Large Model Proxy is designed to make it easy to run multiple resource-heavy Large Models (LM) on the same machine with a limited amount of… ☆46 · Updated last month
- The homepage of the OneBit model quantization framework ☆156 · Updated 4 months ago
- Train your own small BitNet model ☆55 · Updated 3 weeks ago
- Low-rank adapter extraction for fine-tuned transformer models ☆162 · Updated 6 months ago
- An OpenAI-API-compatible API for chat with image input and questions about the images, aka multimodal ☆200 · Updated last month
- A fast batching API for serving LLM models ☆172 · Updated 6 months ago