Lizonghang / TPI-LLM
TPI-LLM: Serving 70B-scale LLMs Efficiently on Low-resource Edge Devices
☆171 · Updated 4 months ago
Alternatives and similar repositories for TPI-LLM:
Users interested in TPI-LLM are comparing it to the repositories listed below.
- CLI tool to quantize GGUF, GPTQ, AWQ, HQQ and EXL2 models ☆69 · Updated 3 months ago
- GRadient-INformed MoE ☆261 · Updated 6 months ago
- Q-GaLore: Quantized GaLore with INT4 Projection and Layer-Adaptive Low-Rank Gradients. ☆196 · Updated 8 months ago
- An efficient implementation of the method proposed in "The Era of 1-bit LLMs" ☆154 · Updated 5 months ago
- VPTQ, a flexible and extreme low-bit quantization algorithm ☆617 · Updated last week
- smolLM with Entropix sampler on PyTorch ☆150 · Updated 4 months ago
- LLM inference on consumer devices ☆95 · Updated last week
- PyTorch implementation of models from the Zamba2 series. ☆178 · Updated 2 months ago
- [ICML 2024] CLLMs: Consistency Large Language Models ☆386 · Updated 4 months ago
- Simple extension on vLLM to help you speed up reasoning models without training. ☆137 · Updated 2 weeks ago
- ☆126 · Updated 7 months ago
- Automatically quantize GGUF models ☆163 · Updated this week
- Training-free post-training efficient sub-quadratic complexity attention, implemented with OpenAI Triton. ☆127 · Updated this week
- 1.58-bit LLM on Apple Silicon using MLX ☆194 · Updated 10 months ago
- EfficientQAT: Efficient Quantization-Aware Training for Large Language Models ☆259 · Updated 5 months ago
- Fast parallel LLM inference for MLX ☆174 · Updated 8 months ago
- An easy-to-understand framework for LLM samplers that rewind and revise generated tokens ☆138 · Updated last month
- Train your own SOTA deductive reasoning model ☆81 · Updated 2 weeks ago
- A pipeline parallel training script for LLMs. ☆132 · Updated this week
- Distributed inference for MLX LLMs ☆87 · Updated 7 months ago
- Memory layers use a trainable key-value lookup mechanism to add extra parameters to a model without increasing FLOPs. Conceptually, spars… ☆310 · Updated 3 months ago
- A pure-Rust LLM inference engine (supporting any LLM-based MLLM such as Spark-TTS), powered by the Candle framework. ☆74 · Updated this week
- RWKV-7: Surpassing GPT ☆82 · Updated 4 months ago
- Scripts to create your own MoE models using MLX ☆89 · Updated last year
- 🤖 A multilingual translation tool that automatically converts Hugging Face's daily AI research papers into 🇯🇵 Japanese, 🇰🇷 Korean, … ☆15 · Updated this week
- Low-Rank adapter extraction for fine-tuned transformers models ☆171 · Updated 10 months ago
- KV cache compression for high-throughput LLM inference ☆117 · Updated last month
- The homepage of the OneBit model quantization framework. ☆174 · Updated last month
- Automated identification of redundant layer blocks for pruning in Large Language Models ☆226 · Updated 11 months ago
- Hallucinations (confabulations) document-based benchmark for RAG; includes human-verified questions and answers. ☆116 · Updated this week