UbiquitousLearning / mllm
Fast Multimodal LLM on Mobile Devices
☆737 · Updated last week
Alternatives and similar repositories for mllm:
Users interested in mllm are comparing it to the libraries listed below.
- Low-bit LLM inference on CPU with lookup table☆695 · Updated 2 months ago
- [NeurIPS'24 Spotlight, ICLR'25] To speed up long-context LLMs' inference, approximates attention with dynamic sparse computation, which r…☆934 · Updated 2 weeks ago
- [MLSys'25] QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving; [MLSys'25] LServe: Efficient Long-sequence LLM Se…☆594 · Updated last week
- Official Implementation of EAGLE-1 (ICML'24) and EAGLE-2 (EMNLP'24)☆1,021 · Updated 3 weeks ago
- [EMNLP 2024 Industry Track] This is the official PyTorch implementation of "LLMC: Benchmarking Large Language Model Quantization with a V…☆427 · Updated this week
- LightLLM is a Python-based LLM (Large Language Model) inference and serving framework, notable for its lightweight design, easy scalabili…☆3,004 · Updated this week
- FlashInfer: Kernel Library for LLM Serving☆2,355 · Updated this week
- Survey Paper List - Efficient LLM and Foundation Models☆240 · Updated 5 months ago
- Analyze the inference of Large Language Models (LLMs). Analyze aspects like computation, storage, transmission, and hardware roofline mod…☆409 · Updated 6 months ago
- Fast inference from large language models via speculative decoding☆678 · Updated 6 months ago
- Transformers-compatible library for applying various compression algorithms to LLMs for optimized deployment with vLLM☆1,074 · Updated this week
- Demonstration of running a native LLM on an Android device.☆119 · Updated last week
- A throughput-oriented high-performance serving framework for LLMs☆755 · Updated 5 months ago
- Awesome Mobile LLMs☆149 · Updated last month
- ☆310 · Updated 11 months ago
- Strong and Open Vision Language Assistant for Mobile Devices☆1,168 · Updated 10 months ago
- C++ implementation of Qwen-LM☆581 · Updated 3 months ago
- FP16xINT4 LLM inference kernel that can achieve near-ideal ~4x speedups up to medium batch sizes of 16-32 tokens.☆755 · Updated 6 months ago
- 10x Faster Long-Context LLM By Smart KV Cache Optimizations☆584 · Updated this week
- 📰 Must-read papers and blogs on Speculative Decoding ⚡️☆637 · Updated this week
- Large Language Model (LLM) Systems Paper List☆810 · Updated this week
- For releasing code related to compression methods for transformers, accompanying our publications☆413 · Updated last month
- Disaggregated serving system for Large Language Models (LLMs).☆491 · Updated 6 months ago
- 📖A curated list of Awesome LLM/VLM Inference Papers with codes: WINT8/4, Flash-Attention, Paged-Attention, Parallelism, etc. 🎉🎉☆3,616 · Updated last week
- Inferflow is an efficient and highly configurable inference engine for large language models (LLMs).☆237 · Updated 11 months ago
- AutoAWQ implements the AWQ algorithm for 4-bit quantization with a 2x speedup during inference. Documentation:☆2,005 · Updated last week
- TinyChatEngine: On-Device LLM Inference Library☆818 · Updated 8 months ago
- A lightweight LLM inference framework☆717 · Updated 11 months ago
- ☆53 · Updated 3 months ago
- RTP-LLM: Alibaba's high-performance LLM inference engine for diverse applications.☆656 · Updated last month
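Several entries above (the speculative-decoding repo, EAGLE, and the Speculative Decoding paper list) share the same draft-then-verify idea: a cheap draft model proposes several tokens, and the expensive target model verifies them in one pass. A minimal sketch of the greedy-acceptance variant, using hypothetical toy stand-in models rather than any of these projects' actual implementations:

```python
import random

def draft_model(prefix, k):
    # Hypothetical cheap draft model: proposes k candidate tokens.
    # (Stand-in logic; a real draft model is a small LM.)
    random.seed(sum(prefix) + len(prefix))
    return [random.randint(0, 9) for _ in range(k)]

def target_model(prefix):
    # Hypothetical expensive target model: the token it would emit
    # next after `prefix` (deterministic stand-in logic).
    return (sum(prefix) * 7 + 3) % 10

def speculative_decode(prefix, n_tokens, k=4):
    """Greedy draft-then-verify loop (simplified speculative decoding):
    keep the longest draft prefix the target agrees with, plus one
    corrected (or bonus) token from the target per round."""
    out = list(prefix)
    while len(out) - len(prefix) < n_tokens:
        proposal = draft_model(out, k)
        accepted = []
        for tok in proposal:
            expected = target_model(out + accepted)
            if tok == expected:
                accepted.append(tok)        # draft matched target: keep it
            else:
                accepted.append(expected)   # mismatch: take target's token, stop
                break
        else:
            # All k draft tokens accepted; target yields one bonus token.
            accepted.append(target_model(out + accepted))
        out.extend(accepted)
    return out[:len(prefix) + n_tokens]
```

Because every accepted token is exactly what the target model would have produced greedily, the output matches plain one-token-at-a-time decoding; the speedup in real systems comes from verifying all k draft tokens with a single target forward pass. (Sampling-based variants use rejection sampling instead of exact matching.)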
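Many of the listed projects (AutoAWQ, QServe's W4A8KV4, LLMC, the FP16xINT4 kernel) revolve around 4-bit weight quantization. As an illustration of what "W4" storage means, here is a plain round-to-nearest (RTN) group quantizer; this is deliberately not AWQ's activation-aware algorithm, which additionally rescales salient channels using activation statistics before rounding:

```python
def quantize_group(weights, n_bits=4):
    """Asymmetric round-to-nearest quantization of one weight group:
    map floats in [min, max] onto integers in [0, 2^n_bits - 1]."""
    qmax = (1 << n_bits) - 1              # 15 for 4-bit
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / qmax or 1.0       # avoid div-by-zero for constant groups
    zero = round(-lo / scale)             # integer zero-point
    q = [max(0, min(qmax, round(w / scale) + zero)) for w in weights]
    return q, scale, zero

def dequantize_group(q, scale, zero):
    # Recover approximate float weights from 4-bit codes.
    return [(v - zero) * scale for v in q]
```

Real W4 schemes quantize in small groups (e.g. 128 weights sharing one scale and zero-point), which bounds the per-weight error to half a quantization step within each group while still cutting weight memory roughly 4x versus FP16.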