stevelaskaridis / awesome-mobile-llm
Awesome Mobile LLMs
☆56 · Updated this week
Related projects:
- Efficient LLM Inference Acceleration using Prompting ☆38 · Updated this week
- Unofficial implementation for the paper "Mixture-of-Depths: Dynamically allocating compute in transformer-based language models" ☆123 · Updated 3 months ago
- The official repo for "LLoCo: Learning Long Contexts Offline" ☆104 · Updated 3 months ago
- Fast Inference of MoE Models with CPU-GPU Orchestration ☆163 · Updated 4 months ago
- EE-LLM is a framework for large-scale training and inference of early-exit (EE) large language models (LLMs). ☆44 · Updated 3 months ago
- TinyAgent: Function Calling at the Edge! ☆124 · Updated 2 weeks ago
- PyTorch implementation of paper "Response Length Perception and Sequence Scheduling: An LLM-Empowered LLM Inference Pipeline". ☆70 · Updated last year
- ☆38 · Updated 2 weeks ago
- The official implementation of the paper "Demystifying the Compression of Mixture-of-Experts Through a Unified Framework". ☆34 · Updated 2 weeks ago
- GEAR: An Efficient KV Cache Compression Recipe for Near-Lossless Generative Inference of LLM ☆134 · Updated 2 months ago
- Unofficial implementations of block/layer-wise pruning methods for LLMs. ☆45 · Updated 4 months ago
- KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache ☆213 · Updated 3 weeks ago
- A repository dedicated to evaluating the performance of quantized LLaMA3 using various quantization methods. ☆155 · Updated last month
- ☆29 · Updated 3 weeks ago
- PB-LLM: Partially Binarized Large Language Models ☆143 · Updated 10 months ago
- For releasing code related to compression methods for transformers, accompanying our publications ☆356 · Updated 2 weeks ago
- ☆117 · Updated 8 months ago
- Explorations into some recent techniques surrounding speculative decoding ☆190 · Updated 11 months ago
- Implementation of Speculative Sampling as described in "Accelerating Large Language Model Decoding with Speculative Sampling" by DeepMind ☆69 · Updated 6 months ago
- KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization ☆282 · Updated last month
- Advanced Quantization Algorithm for LLMs. This is the official implementation of "Optimize Weight Rounding via Signed Gradient Descent for t…" ☆205 · Updated this week
- Breaking Throughput-Latency Trade-off for Long Sequences with Speculative Decoding ☆55 · Updated this week
- Fast and Robust Early-Exiting Framework for Autoregressive Language Models with Synchronized Parallel Decoding (EMNLP 2023 Long) ☆51 · Updated 3 months ago
- [ICLR 2024] Skeleton-of-Thought: Large Language Models Can Do Parallel Decoding ☆138 · Updated 6 months ago
- Survey Paper List - Efficient LLM and Foundation Models ☆190 · Updated 6 months ago
- A toolkit for fine-tuning, inferencing, and evaluating GreenBitAI's LLMs. ☆68 · Updated 2 months ago
- ☆174 · Updated 4 months ago
- Code for Palu: Compressing KV-Cache with Low-Rank Projection ☆39 · Updated this week
- A minimal implementation of vllm. ☆29 · Updated last month
- Official Implementation of SLEB: Streamlining LLMs through Redundancy Verification and Elimination of Transformer Blocks ☆28 · Updated 2 months ago