Lizonghang / TPI-LLM
TPI-LLM: Serving 70b-scale LLMs Efficiently on Low-resource Edge Devices
☆180 · Updated 3 weeks ago
Alternatives and similar repositories for TPI-LLM
Users interested in TPI-LLM are comparing it to the libraries listed below.
- CLI tool to quantize GGUF, GPTQ, AWQ, HQQ and EXL2 models ☆73 · Updated 6 months ago
- Official PyTorch implementation for Hogwild! Inference: Parallel LLM Generation with a Concurrent Attention Cache ☆109 · Updated 2 months ago
- ☆137 · Updated this week
- ☆57 · Updated 4 months ago
- GRadient-INformed MoE ☆263 · Updated 9 months ago
- LLM inference in C/C++ ☆77 · Updated this week
- Efficient LLM Inference over Long Sequences ☆378 · Updated 3 weeks ago
- ☆149 · Updated 2 months ago
- Q-GaLore: Quantized GaLore with INT4 Projection and Layer-Adaptive Low-Rank Gradients. ☆198 · Updated 11 months ago
- [ACL 2025 Main] EfficientQAT: Efficient Quantization-Aware Training for Large Language Models ☆273 · Updated last month
- An efficient implementation of the method proposed in "The Era of 1-bit LLMs" ☆154 · Updated 8 months ago
- DFloat11: Lossless LLM Compression for Efficient GPU Inference ☆426 · Updated last month
- VPTQ, a flexible and extreme low-bit quantization algorithm ☆643 · Updated 2 months ago
- 🤖 A multilingual translation tool that automatically converts Hugging Face's daily AI research papers into 🇯🇵 Japanese, 🇰🇷 Korean, … ☆16 · Updated this week
- Lightweight toolkit package to train and fine-tune 1.58-bit language models ☆80 · Updated last month
- Simple extension on vLLM to help you speed up reasoning models without training. ☆161 · Updated 3 weeks ago
- Automatically quantize GGUF models ☆184 · Updated last week
- Automated Identification of Redundant Layer Blocks for Pruning in Large Language Models ☆238 · Updated last year
- KV cache compression for high-throughput LLM inference ☆131 · Updated 4 months ago
- Training-free Post-training Efficient Sub-quadratic Complexity Attention. Implemented with OpenAI Triton. ☆136 · Updated last week
- Query-agnostic KV cache eviction: 3–4× reduction in memory and 2× decrease in latency (Qwen3/2.5, Gemma3, LLaMA3) ☆85 · Updated 2 weeks ago
- LongRoPE is a novel method that extends the context window of pre-trained LLMs to an impressive 2048k tokens. ☆231 · Updated 10 months ago
- Prepare for DeepSeek R1 inference: Benchmark CPU, DRAM, SSD, iGPU, GPU, ... with efficient code. ☆71 · Updated 4 months ago
- Distributed inference for MLX LLMs ☆93 · Updated 10 months ago
- The homepage of the OneBit model quantization framework. ☆181 · Updated 4 months ago
- Code for the paper "Coding Agents with Multimodal Browsing are Generalist Problem Solvers" ☆50 · Updated this week
- 1.58-bit LLM on Apple Silicon using MLX ☆214 · Updated last year
- [ICLR 2025] DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads ☆466 · Updated 4 months ago
- ☆45 · Updated last year
- Scalable and robust tree-based speculative decoding algorithm ☆348 · Updated 4 months ago