mit-han-lab/TinyChatEngine

Readme badge preview -

If you own this repo, copy the snippet below and add it to your README.md

[![RelatedRepos](https://img.shields.io/badge/related-repos-yellow)](https://relatedrepos.com/gh/mit-han-lab/TinyChatEngine)

mit-han-lab / TinyChatEngine

TinyChatEngine: On-Device LLM Inference Library

☆958

Alternatives and similar repositories for TinyChatEngine

Users that are interested in TinyChatEngine are comparing it to the libraries listed below. We may earn a commission when you buy through links labeled 'Ad' on this page.

Sorting:

mit-han-lab / llm-awq
View on GitHub
[MLSys 2024 Best Paper Award] AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration
☆3,592Jul 17, 2025Updated last year
mit-han-lab / tinychat-tutorial
View on GitHub
☆79Nov 5, 2024Updated last year
mit-han-lab / tinyengine
View on GitHub
[NeurIPS 2020] MCUNet: Tiny Deep Learning on IoT Devices; [NeurIPS 2021] MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep L…
☆951Nov 27, 2024Updated last year
mit-han-lab / smoothquant
View on GitHub
[ICML 2023] SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models
☆1,670Jul 12, 2024Updated 2 years ago
mit-han-lab / parallel-computing-tutorial
View on GitHub
☆178Aug 9, 2023Updated 2 years ago
Deploy to Railway using AI coding agents - Free Credits Offer • Ad
Use Claude Code, Codex, OpenCode, and more. Autonomous software development now has the infrastructure to match with Railway.
mit-han-lab / omniserve
View on GitHub
[MLSys'25] QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving; [MLSys'25] LServe: Efficient Long-sequence LLM Se…
☆850Mar 6, 2025Updated last year
mit-han-lab / mcunet
View on GitHub
[NeurIPS 2020] MCUNet: Tiny Deep Learning on IoT Devices; [NeurIPS 2021] MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep L…
☆704Mar 29, 2024Updated 2 years ago
efeslab / Atom
View on GitHub
[MLSys'24] Atom: Low-bit Quantization for Efficient and Accurate LLM Serving
☆343Jul 2, 2024Updated 2 years ago
SqueezeAILab / SqueezeLLM
View on GitHub
[ICML 2024] SqueezeLLM: Dense-and-Sparse Quantization
☆722Aug 13, 2024Updated last year
mit-han-lab / duo-attention
View on GitHub
[ICLR 2025] DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads
☆539Feb 10, 2025Updated last year
HuangOwen / Awesome-LLM-Compression
View on GitHub
Awesome LLM compression research papers and tools.
☆1,853Jun 30, 2026Updated 2 weeks ago
IST-DASLab / gptq
View on GitHub
Code for the ICLR 2023 paper "GPTQ: Accurate Post-training Quantization of Generative Pretrained Transformers".
☆2,334Mar 27, 2024Updated 2 years ago
mit-han-lab / Quest
View on GitHub
[ICML 2024] Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference
☆400Jul 10, 2025Updated last year
UbiquitousLearning / mllm
View on GitHub
Fast Multimodal LLM on Mobile Devices
☆1,572Jun 26, 2026Updated 3 weeks ago
Open source password manager - Proton Pass • Ad
Securely store, share, and autofill your credentials with Proton Pass, the end-to-end encrypted password manager trusted by millions.
flashinfer-ai / flashinfer
View on GitHub
FlashInfer: Kernel Library for LLM Serving
☆5,988Updated this week
ModelTC / LightLLM
View on GitHub
LightLLM is a Python-based LLM (Large Language Model) inference and serving framework, notable for its lightweight design, easy scalabili…
☆4,183Updated this week
mlc-ai / llm-perf-bench
View on GitHub
☆121Apr 22, 2024Updated 2 years ago
microsoft / BitBLAS
View on GitHub
BitBLAS is a library to support mixed-precision matrix multiplications, especially for quantized LLM deployment.
☆769Aug 6, 2025Updated 11 months ago
nunchaku-ai / deepcompressor
View on GitHub
Model Compression Toolbox for Large Language Models and Diffusion Models
☆794Aug 14, 2025Updated 11 months ago
OpenGVLab / OmniQuant
View on GitHub
[ICLR2024 spotlight] OmniQuant is a simple and powerful quantization technique for LLMs.
☆901Nov 26, 2025Updated 7 months ago
LeiWang1999 / AutoGPTQ.tvm
View on GitHub
GPTQ inference TVM kernel
☆41Apr 25, 2024Updated 2 years ago
yifanlu0227 / MIT-6.5940
View on GitHub
All Homeworks for TinyML and Efficient Deep Learning Computing 6.5940 • Fall • 2023 • https://efficientml.ai
☆200Dec 2, 2023Updated 2 years ago
tlc-pack / cutlass_fpA_intB_gemm
View on GitHub
A standalone GEMM kernel for fp16 activation and quantized weight, extracted from FasterTransformer
☆96Jun 21, 2026Updated 3 weeks ago
Managed hosting for WordPress and PHP on Cloudways • Ad
Managed hosting for WordPress, Magento, Laravel, or PHP apps, on multiple cloud providers. Deploy in minutes on Cloudways by DigitalOcean.
NVIDIA / TensorRT-LLM
View on GitHub
TensorRT LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and supports state-of-the-art optimizat…
☆14,162Updated this week
Tiiny-AI / PowerInfer
View on GitHub
High-speed Large Language Model Serving for Local Deployment
☆9,665May 11, 2026Updated 2 months ago
mlc-ai / mlc-llm
View on GitHub
Universal LLM Deployment Engine with ML Compilation
☆22,975Jul 13, 2026Updated last week
horseee / Awesome-Efficient-LLM
View on GitHub
A curated list for Efficient Large Language Models
☆2,023Jun 17, 2025Updated last year
xxyux / SpInfer
View on GitHub
SpInfer: Leveraging Low-Level Sparsity for Efficient Large Language Model Inference on GPUs
☆68Mar 25, 2025Updated last year
xlite-dev / Awesome-LLM-Inference
View on GitHub
📚A curated list of Awesome LLM/VLM Inference Papers with Codes: Flash-Attention, Paged-Attention, WINT8/4, Parallelism, etc.🎉
☆5,404Jun 23, 2026Updated 3 weeks ago
AutoGPTQ / AutoGPTQ
View on GitHub
An easy-to-use LLMs quantization package with user-friendly apis, based on GPTQ algorithm.
☆5,073Apr 11, 2025Updated last year
mit-han-lab / tiny-training
View on GitHub
On-Device Training Under 256KB Memory [NeurIPS'22]
☆523Mar 29, 2024Updated 2 years ago
NVlabs / VILA
View on GitHub
VILA is a family of state-of-the-art vision language models (VLMs) for diverse multimodal AI tasks across the edge, data center, and clou…
☆3,842Mar 12, 2026Updated 4 months ago
Wordpress hosting with auto-scaling - Free Trial Offer • Ad
Fully Managed hosting for WordPress and WooCommerce businesses that need reliable, auto-scalable performance. Cloudways SafeUpdates now available.
mlc-ai / relax
View on GitHub
☆174Jul 10, 2026Updated last week
FasterDecoding / Medusa
View on GitHub
Medusa: Simple Framework for Accelerating LLM Generation with Multiple Decoding Heads
☆2,757Jun 25, 2024Updated 2 years ago
SafeAILab / EAGLE
View on GitHub
Official Implementation of EAGLE-1 (ICML'24), EAGLE-2 (EMNLP'24), and EAGLE-3 (NeurIPS'25).
☆2,471Feb 20, 2026Updated 5 months ago
AIoT-MLSys-Lab / Efficient-LLMs-Survey
View on GitHub
[TMLR 2024] Efficient Large Language Models: A Survey
☆1,258Jun 23, 2025Updated last year
KnowingNothing / MatmulTutorial
View on GitHub
A Easy-to-understand TensorOp Matmul Tutorial
☆445Mar 5, 2026Updated 4 months ago
AlibabaResearch / flash-llm
View on GitHub
Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative Model Inference with Unstructured Sparsity
☆246Sep 24, 2023Updated 2 years ago
FMInference / DejaVu
View on GitHub
☆359Apr 2, 2024Updated 2 years ago