Infini-AI-Lab / UMbreLLa
LLM Inference on consumer devices
☆114 · Updated last month
Alternatives and similar repositories for UMbreLLa
Users interested in UMbreLLa are comparing it to the libraries listed below.
- ☆131 · Updated last month
- DFloat11: Lossless LLM Compression for Efficient GPU Inference ☆334 · Updated last week
- Official PyTorch implementation for Hogwild! Inference: Parallel LLM Generation with a Concurrent Attention Cache ☆99 · Updated 3 weeks ago
- A safetensors extension to efficiently store sparse quantized tensors on disk ☆109 · Updated this week
- [ICLR'25] Fast Inference of MoE Models with CPU-GPU Orchestration ☆209 · Updated 5 months ago
- KV cache compression for high-throughput LLM inference ☆126 · Updated 3 months ago
- EfficientQAT: Efficient Quantization-Aware Training for Large Language Models ☆266 · Updated 7 months ago
- ☆55 · Updated 6 months ago
- Automated Identification of Redundant Layer Blocks for Pruning in Large Language Models ☆238 · Updated last year
- Work in progress. ☆62 · Updated last month
- QuIP quantization ☆52 · Updated last year
- PyTorch implementation of models from the Zamba2 series. ☆181 · Updated 3 months ago
- A toolkit for fine-tuning, inferencing, and evaluating GreenBitAI's LLMs. ☆84 · Updated 2 months ago
- Prepare for DeepSeek R1 inference: Benchmark CPU, DRAM, SSD, iGPU, GPU, ... with efficient code. ☆72 · Updated 3 months ago
- 1.58-bit LLaMa model ☆81 · Updated last year
- Q-GaLore: Quantized GaLore with INT4 Projection and Layer-Adaptive Low-Rank Gradients. ☆199 · Updated 9 months ago
- An efficient implementation of the method proposed in "The Era of 1-bit LLMs" ☆154 · Updated 7 months ago
- Repo for "LoLCATs: On Low-Rank Linearizing of Large Language Models" ☆231 · Updated 3 months ago
- ☆117 · Updated this week
- Code for data-aware compression of DeepSeek models ☆24 · Updated last month
- Tree Attention: Topology-aware Decoding for Long-Context Attention on GPU clusters ☆126 · Updated 5 months ago
- ☆45 · Updated 2 weeks ago
- InferX is an Inference Function as a Service platform ☆77 · Updated this week
- ☆209 · Updated 3 months ago
- prime-rl is a codebase for decentralized RL training at scale ☆211 · Updated this week
- Code for the paper "QuIP: 2-Bit Quantization of Large Language Models With Guarantees", adapted for Llama models ☆35 · Updated last year
- Scalable and robust tree-based speculative decoding algorithm ☆345 · Updated 3 months ago
- Bamboo-7B Large Language Model ☆93 · Updated last year
- ☆198 · Updated 5 months ago
- A simple extension on vLLM to help you speed up reasoning models without training. ☆150 · Updated last week