mani-kantap / llm-inference-solutions
A collection of all available inference solutions for LLMs
☆81 · Updated 2 weeks ago
Alternatives and similar repositories for llm-inference-solutions:
Users interested in llm-inference-solutions are comparing it to the libraries listed below.
- Evaluate and Enhance Your LLM Deployments for Real-World Inference Needs ☆212 · Updated last week
- A high-throughput and memory-efficient inference and serving engine for LLMs ☆262 · Updated 5 months ago
- Automated Identification of Redundant Layer Blocks for Pruning in Large Language Models ☆225 · Updated 10 months ago
- Easy and Efficient Quantization for Transformers ☆192 · Updated last month
- Training-free Post-training Efficient Sub-quadratic Complexity Attention. Implemented with OpenAI Triton. ☆126 · Updated this week
- vLLM: A high-throughput and memory-efficient inference and serving engine for LLMs ☆87 · Updated this week
- ☆237 · Updated last week
- ☆53 · Updated 9 months ago
- [ICLR'25] Fast Inference of MoE Models with CPU-GPU Orchestration ☆200 · Updated 4 months ago
- Comparison of Language Model Inference Engines ☆208 · Updated 3 months ago
- ☆54 · Updated 6 months ago
- 🏋️ A unified multi-backend utility for benchmarking Transformers, Timm, PEFT, Diffusers and Sentence-Transformers with full support of O… ☆289 · Updated last month
- Q-GaLore: Quantized GaLore with INT4 Projection and Layer-Adaptive Low-Rank Gradients. ☆196 · Updated 8 months ago
- [ICLR2025] Breaking Throughput-Latency Trade-off for Long Sequences with Speculative Decoding ☆109 · Updated 3 months ago
- QUICK: Quantization-aware Interleaving and Conflict-free Kernel for efficient LLM inference ☆116 · Updated last year
- [NeurIPS 2024] KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization ☆335 · Updated 7 months ago
- Benchmark suite for LLMs from Fireworks.ai ☆69 · Updated last month
- A general 2-8 bit quantization toolbox with GPTQ/AWQ/HQQ/VPTQ and easy export to ONNX/ONNX Runtime ☆162 · Updated last week
- A toolkit for fine-tuning, inference, and evaluation of GreenBitAI's LLMs ☆79 · Updated last week
- The official repo for "LLoCo: Learning Long Contexts Offline" ☆115 · Updated 9 months ago
- Pretrain, fine-tune, and serve LLMs on Intel platforms with Ray ☆121 · Updated 3 weeks ago
- Experiments with inference on Llama ☆104 · Updated 9 months ago
- ☆193 · Updated 3 months ago
- FP16xINT4 LLM inference kernel that achieves near-ideal ~4x speedups up to medium batch sizes of 16-32 tokens ☆763 · Updated 6 months ago
- [MLSys'24] Atom: Low-bit Quantization for Efficient and Accurate LLM Serving ☆298 · Updated 8 months ago
- Advanced Quantization Algorithm for LLMs/VLMs. ☆391 · Updated this week
- Google TPU optimizations for transformers models ☆103 · Updated last month
- GPTQLoRA: Efficient Finetuning of Quantized LLMs with GPTQ ☆99 · Updated last year
- Unofficial implementation of https://arxiv.org/pdf/2407.14679 ☆44 · Updated 6 months ago
- ☆117 · Updated 10 months ago