vllm-project / vllm-omni
A framework for efficient model inference with omni-modality models
☆1,335 · Updated this week
Alternatives and similar repositories for vllm-omni
Users interested in vllm-omni are comparing it to the libraries listed below.
- Checkpoint-engine is a simple middleware to update model weights in LLM inference engines ☆864 · Updated last week
- Train speculative decoding models effortlessly and port them smoothly to SGLang serving. ☆557 · Updated this week
- ☆1,087 · Updated this week
- Efficient LLM Inference over Long Sequences ☆393 · Updated 5 months ago
- ☆440 · Updated 4 months ago
- Materials for learning SGLang ☆693 · Updated last week
- [NeurIPS'24 Spotlight, ICLR'25, ICML'25] To speed up long-context LLM inference, approximate and dynamic sparse attention computation… ☆1,166 · Updated 2 months ago
- Muon is Scalable for LLM Training ☆1,387 · Updated 4 months ago
- A throughput-oriented high-performance serving framework for LLMs ☆924 · Updated last month
- VeOmni: Scaling Any Modality Model Training with Model-Centric Distributed Recipe Zoo ☆1,432 · Updated last week
- LLM model quantization (compression) toolkit with hw acceleration support for Nvidia CUDA, AMD ROCm, Intel XPU and Intel/AMD/Apple CPU vi… ☆933 · Updated this week
- ArcticInference: vLLM plugin for high-throughput, low-latency inference ☆349 · Updated last week
- A fast communication-overlapping library for tensor/expert parallelism on GPUs. ☆1,208 · Updated 3 months ago
- Scalable toolkit for efficient model reinforcement ☆1,141 · Updated this week
- Common recipes to run vLLM ☆283 · Updated last week
- Transformers-compatible library for applying various compression algorithms to LLMs for optimized deployment with vLLM ☆2,447 · Updated this week
- Tile-Based Runtime for Ultra-Low-Latency LLM Inference ☆451 · Updated 2 weeks ago
- Virtualized Elastic KV Cache for Dynamic GPU Sharing and Beyond ☆713 · Updated 3 weeks ago
- Advanced quantization toolkit for LLMs and VLMs. Support for WOQ, MXFP4, NVFP4, GGUF, Adaptive Schemes and seamless integration with Tra… ☆775 · Updated this week
- DFloat11 [NeurIPS '25]: Lossless Compression of LLMs and DiTs for Efficient GPU Inference ☆572 · Updated last month
- Evaluate and Enhance Your LLM Deployments for Real-World Inference Needs ☆755 · Updated last week
- FlagScale is a large model toolkit based on open-source projects. ☆426 · Updated this week
- Speed Always Wins: A Survey on Efficient Architectures for Large Language Models ☆373 · Updated last month
- slime is an LLM post-training framework for RL Scaling. ☆2,911 · Updated this week
- Model compression toolkit engineered for enhanced usability, comprehensiveness, and efficiency. ☆228 · Updated this week
- LLM KV cache compression made easy ☆717 · Updated last week
- ☆1,367 · Updated last month
- [ICLR 2025] DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads ☆515 · Updated 10 months ago
- MoBA: Mixture of Block Attention for Long-Context LLMs ☆2,022 · Updated 8 months ago
- Towards Economical Inference: Enabling DeepSeek's Multi-Head Latent Attention in Any Transformer-based LLMs ☆198 · Updated 2 weeks ago