Infini-AI-Lab / UMbreLLa
LLM inference on consumer devices
☆125 · Updated 8 months ago
Alternatives and similar repositories for UMbreLLa
Users interested in UMbreLLa are comparing it to the libraries listed below.
- ☆158 · Updated 5 months ago
- ☆64 · Updated 5 months ago
- Sparse inference for transformer-based LLMs ☆215 · Updated 4 months ago
- Official PyTorch implementation for Hogwild! Inference: Parallel LLM Generation with a Concurrent Attention Cache ☆135 · Updated 3 months ago
- KV cache compression for high-throughput LLM inference ☆146 · Updated 10 months ago
- Samples of good AI-generated CUDA kernels ☆92 · Updated 6 months ago
- ☆111 · Updated 3 weeks ago
- ☆63 · Updated 6 months ago
- A scalable and robust tree-based speculative decoding algorithm ☆363 · Updated 10 months ago
- ☆458 · Updated 2 weeks ago
- REAP: Router-weighted Expert Activation Pruning for SMoE compression ☆136 · Updated last week
- Efficient non-uniform quantization with GPTQ for GGUF ☆56 · Updated 2 months ago
- ☆219 · Updated 10 months ago
- 1.58-bit LLaMA model ☆83 · Updated last year
- [ICLR'25] Fast Inference of MoE Models with CPU-GPU Orchestration ☆243 · Updated last year
- Easy, Fast, and Scalable Multimodal AI ☆78 · Updated last week
- ArcticInference: vLLM plugin for high-throughput, low-latency inference ☆345 · Updated this week
- Repo for "LoLCATs: On Low-Rank Linearizing of Large Language Models" ☆249 · Updated 10 months ago
- Transplants vocabulary between language models, enabling the creation of draft models for speculative decoding without retraining ☆47 · Updated last month
- An early-research-stage expert-parallel load balancer for MoE models based on linear programming ☆433 · Updated 3 weeks ago
- [ACL 2025 Main] EfficientQAT: Efficient Quantization-Aware Training for Large Language Models ☆317 · Updated 2 weeks ago
- DFloat11: Lossless LLM Compression for Efficient GPU Inference ☆571 · Updated 2 weeks ago
- Code for data-aware compression of DeepSeek models ☆65 · Updated last month
- Training-free, post-training, sub-quadratic-complexity attention, implemented with OpenAI Triton ☆148 · Updated last month
- vLLM: A high-throughput and memory-efficient inference and serving engine for LLMs ☆93 · Updated this week
- Fast Matrix Multiplications for Lookup Table-Quantized LLMs ☆378 · Updated 7 months ago
- Reverse Engineering Gemma 3n: Google's New Edge-Optimized Language Model ☆254 · Updated 6 months ago
- A safetensors extension to efficiently store sparse quantized tensors on disk ☆214 · Updated last week
- Official implementation for Training LLMs with MXFP4 ☆111 · Updated 7 months ago
- Lightweight toolkit to train and fine-tune 1.58-bit language models ☆100 · Updated 6 months ago