Lizonghang / TPI-LLM
TPI-LLM: Serving 70b-scale LLMs Efficiently on Low-resource Edge Devices
☆186 · Updated 2 months ago
Alternatives and similar repositories for TPI-LLM
Users interested in TPI-LLM are comparing it to the libraries listed below.
- CLI tool to quantize GGUF, GPTQ, AWQ, HQQ, and EXL2 models ☆74 · Updated 7 months ago
- GRadient-INformed MoE ☆264 · Updated 10 months ago
- Lightweight toolkit to train and fine-tune 1.58-bit language models ☆82 · Updated 2 months ago
- Query-agnostic KV cache eviction: 3–4× reduction in memory and 2× decrease in latency (Qwen3/2.5, Gemma3, LLaMA3) ☆95 · Updated last week
- LLM inference on consumer devices ☆123 · Updated 4 months ago
- Train, tune, and infer the Bamba model ☆130 · Updated 2 months ago
- Q-GaLore: Quantized GaLore with INT4 Projection and Layer-Adaptive Low-Rank Gradients ☆198 · Updated last year
- Code to train and evaluate Neural Attention Memory Models to obtain universally applicable memory systems for transformers ☆318 · Updated 9 months ago
- Transplants vocabulary between language models, enabling the creation of draft models for speculative decoding WITHOUT retraining ☆39 · Updated 3 weeks ago
- ☆290 · Updated this week
- PyTorch implementation of models from the Zamba2 series ☆184 · Updated 6 months ago
- ☆102 · Updated 11 months ago
- Distributed inference for MLX LLMs ☆94 · Updated last year
- Testing LLM reasoning abilities with family-relationship quizzes ☆63 · Updated 6 months ago
- ☆51 · Updated last year
- Automatically quantize GGUF models ☆190 · Updated this week
- Training-free, post-training, efficient sub-quadratic-complexity attention, implemented with OpenAI Triton ☆142 · Updated this week
- Sparse inference for transformer-based LLMs ☆196 · Updated last week
- An efficient implementation of the method proposed in "The Era of 1-bit LLMs" ☆154 · Updated 9 months ago
- RWKV-7: Surpassing GPT ☆94 · Updated 8 months ago
- ☆134 · Updated 11 months ago
- An easy-to-understand framework for LLM samplers that rewind and revise generated tokens ☆140 · Updated 5 months ago
- 1.58-bit LLM on Apple Silicon using MLX ☆217 · Updated last year
- A pipeline-parallel training script for LLMs ☆153 · Updated 3 months ago
- ☆155 · Updated 3 months ago
- Prepare for DeepSeek R1 inference: benchmark CPU, DRAM, SSD, iGPU, GPU, ... with efficient code ☆72 · Updated 6 months ago
- Clue-inspired puzzles for testing LLM deduction abilities ☆40 · Updated 4 months ago
- 1.58-bit LLaMa model ☆81 · Updated last year
- OpenCoconut implements a latent reasoning paradigm where we generate thoughts before decoding ☆173 · Updated 6 months ago
- The homepage of the OneBit model quantization framework ☆185 · Updated 6 months ago
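Several entries above (the 1.58-bit training toolkits, the MLX port, and the "Era of 1-bit LLMs" implementation) build on the ternary weight scheme from that paper, which rounds each weight to {-1, 0, +1} after scaling by the tensor's mean absolute value. A minimal NumPy sketch of that absmean quantization (illustrative only — function names are ours, and real implementations quantize per-layer inside the model's linear ops):

```python
import numpy as np

def absmean_ternary_quantize(w: np.ndarray, eps: float = 1e-8):
    """Quantize a weight tensor to {-1, 0, +1} using the absmean
    scheme described in "The Era of 1-bit LLMs" (BitNet b1.58)."""
    gamma = float(np.mean(np.abs(w)))  # per-tensor scale
    # Scale, round to nearest integer, then clip into the ternary set.
    w_q = np.clip(np.round(w / (gamma + eps)), -1, 1).astype(np.int8)
    return w_q, gamma

def dequantize(w_q: np.ndarray, gamma: float) -> np.ndarray:
    """Recover an approximate float tensor from ternary weights."""
    return w_q.astype(np.float32) * gamma

# Example: a tiny 2x3 weight matrix.
w = np.array([[0.9, -0.05, -1.2],
              [0.3,  0.0,  -0.4]], dtype=np.float32)
w_q, gamma = absmean_ternary_quantize(w)
# Every quantized value is -1, 0, or +1, so each weight needs
# only log2(3) ≈ 1.58 bits of information.
```

Storing only the ternary matrix plus one float scale is what gives these repos their memory savings; matrix multiplies against {-1, 0, +1} weights also reduce to additions and subtractions.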