stevelaskaridis / awesome-mobile-llm
Awesome Mobile LLMs
☆122 · Updated 2 weeks ago
Alternatives and similar repositories for awesome-mobile-llm:
Users interested in awesome-mobile-llm are comparing it to the libraries listed below.
- ☆80 · Updated 3 months ago
- Advanced Quantization Algorithm for LLMs/VLMs. ☆364 · Updated this week
- [EMNLP Findings 2024] MobileQuant: Mobile-friendly Quantization for On-device Language Models ☆54 · Updated 4 months ago
- codebase for "MELTing Point: Mobile Evaluation of Language Transformers" ☆16 · Updated 6 months ago
- Awesome list for LLM quantization ☆160 · Updated last month
- [EMNLP 2024 Demo] TinyAgent: Function Calling at the Edge! ☆354 · Updated 4 months ago
- A general 2–8 bit quantization toolbox with GPTQ/AWQ/HQQ/VPTQ, with easy export to ONNX/ONNX Runtime. ☆155 · Updated last week
- Code for "LayerSkip: Enabling Early Exit Inference and Self-Speculative Decoding", ACL 2024 ☆262 · Updated 3 weeks ago
- ☆36 · Updated 2 months ago
- EfficientQAT: Efficient Quantization-Aware Training for Large Language Models ☆241 · Updated 3 months ago
- A repository dedicated to evaluating the performance of quantized LLaMA3 using various quantization methods. ☆175 · Updated 2 weeks ago
- [ICLR'25] Fast Inference of MoE Models with CPU-GPU Orchestration ☆180 · Updated 2 months ago
- LLM Serving Performance Evaluation Harness ☆66 · Updated 5 months ago
- Evaluate and Enhance Your LLM Deployments for Real-World Inference Needs ☆185 · Updated last month
- [NeurIPS 24 Spotlight] MaskLLM: Learnable Semi-structured Sparsity for Large Language Models ☆150 · Updated 3 weeks ago
- Notes on quantization in neural networks ☆66 · Updated last year
- Code repo for the paper "SpinQuant: LLM quantization with learned rotations" ☆207 · Updated 2 months ago
- [NeurIPS 2024] KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization ☆327 · Updated 5 months ago
- A toolkit for fine-tuning, inference, and evaluation of GreenBitAI's LLMs. ☆80 · Updated last week
- A safetensors extension to efficiently store sparse quantized tensors on disk ☆66 · Updated this week
- PB-LLM: Partially Binarized Large Language Models ☆150 · Updated last year
- Efficient LLM Inference Acceleration using Prompting ☆45 · Updated 3 months ago
- KV cache compression for high-throughput LLM inference ☆109 · Updated last week
- A minimal implementation of vLLM. ☆33 · Updated 6 months ago
- The official implementation of the paper "Demystifying the Compression of Mixture-of-Experts Through a Unified Framework". ☆54 · Updated 3 months ago
- A collection of available inference solutions for LLMs ☆76 · Updated 4 months ago
- ☆41 · Updated 3 months ago
- ☆116 · Updated 9 months ago
- Unofficial implementation of the paper "Mixture-of-Depths: Dynamically allocating compute in transformer-based language models" ☆145 · Updated 7 months ago
- VPTQ: a flexible and extreme low-bit quantization algorithm ☆572 · Updated last week