Infini-AI-Lab / UMbreLLa
LLM Inference on consumer devices
☆96 · Updated last week
Alternatives and similar repositories for UMbreLLa:
Users who are interested in UMbreLLa are comparing it to the libraries listed below.
- ☆112 · Updated 2 months ago
- [ICLR'25] Fast Inference of MoE Models with CPU-GPU Orchestration ☆200 · Updated 4 months ago
- ☆51 · Updated 4 months ago
- KV cache compression for high-throughput LLM inference ☆117 · Updated last month
- EfficientQAT: Efficient Quantization-Aware Training for Large Language Models ☆255 · Updated 5 months ago
- A safetensors extension to efficiently store sparse quantized tensors on disk ☆88 · Updated this week
- Code for the paper "QuIP: 2-Bit Quantization of Large Language Models With Guarantees", adapted for Llama models ☆36 · Updated last year
- Training-free Post-training Efficient Sub-quadratic Complexity Attention, implemented with OpenAI Triton ☆126 · Updated this week
- Scalable and robust tree-based speculative decoding algorithm ☆337 · Updated last month
- ☆193 · Updated 3 months ago
- [ICLR 2025] Breaking the Throughput-Latency Trade-off for Long Sequences with Speculative Decoding ☆109 · Updated 3 months ago
- Fast Matrix Multiplications for Lookup Table-Quantized LLMs ☆232 · Updated 3 weeks ago
- [ICML 2024] BiLLM: Pushing the Limit of Post-Training Quantization for LLMs ☆210 · Updated 2 months ago
- An efficient implementation of the method proposed in "The Era of 1-bit LLMs" ☆154 · Updated 5 months ago
- Q-GaLore: Quantized GaLore with INT4 Projection and Layer-Adaptive Low-Rank Gradients ☆196 · Updated 8 months ago
- QuIP quantization ☆52 · Updated last year
- [ICML 2024] KIVI: A Tuning-Free Asymmetric 2-bit Quantization for KV Cache ☆279 · Updated 2 months ago
- PipeInfer: Accelerating LLM Inference using Asynchronous Pipelined Speculation ☆26 · Updated 4 months ago
- Docker image for NVIDIA GH200 machines, optimized for vLLM serving and HF Trainer finetuning ☆37 · Updated 3 weeks ago
- Tree Attention: Topology-aware Decoding for Long-Context Attention on GPU clusters ☆123 · Updated 3 months ago
- [ICLR 2025] DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads ☆432 · Updated last month
- Fast low-bit matmul kernels in Triton ☆263 · Updated this week
- Boosting 4-bit inference kernels with 2:4 sparsity ☆68 · Updated 6 months ago
- A toolkit for fine-tuning, inferencing, and evaluating GreenBitAI's LLMs ☆79 · Updated last week
- VPTQ: a flexible and extreme low-bit quantization algorithm ☆612 · Updated this week
- Bamboo-7B Large Language Model ☆92 · Updated 11 months ago
- [ICLR 2025] Palu: Compressing KV-Cache with Low-Rank Projection ☆86 · Updated 3 weeks ago
- 1.58-bit LLaMa model ☆82 · Updated 11 months ago
- Code for the paper "QMoE: Practical Sub-1-Bit Compression of Trillion-Parameter Models" ☆272 · Updated last year