cornserve-ai / cornserve
Easy, Fast, and Scalable Multimodal AI
☆97 · Updated last week
Alternatives and similar repositories for cornserve
Users interested in cornserve are comparing it to the libraries listed below.
- [ICLR2025] Breaking Throughput-Latency Trade-off for Long Sequences with Speculative Decoding ☆137 · Updated last year
- Block Diffusion for Ultra-Fast Speculative Decoding ☆432 · Updated this week
- Efficient Long-context Language Model Training by Core Attention Disaggregation ☆80 · Updated last month
- KV cache compression for high-throughput LLM inference ☆150 · Updated 11 months ago
- ☆48 · Updated last year
- 🔥 LLM-powered GPU kernel synthesis: Train models to convert PyTorch ops into optimized Triton kernels via SFT+RL. Multi-turn compilation… ☆115 · Updated 2 months ago
- [NeurIPS 2025] Scaling Speculative Decoding with Lookahead Reasoning ☆62 · Updated 2 months ago
- ☆117 · Updated 8 months ago
- High-performance distributed data shuffling (all-to-all) library for MoE training and inference ☆109 · Updated 3 weeks ago
- An early research stage expert-parallel load balancer for MoE models based on linear programming. ☆491 · Updated 2 months ago
- dInfer: An Efficient Inference Framework for Diffusion Language Models ☆403 · Updated 3 weeks ago
- Official implementation for Training LLMs with MXFP4 ☆118 · Updated 9 months ago
- A unified library for building, evaluating, and storing speculative decoding algorithms for LLM inference in vLLM ☆205 · Updated this week
- [ICLR'25] Fast Inference of MoE Models with CPU-GPU Orchestration ☆258 · Updated last year
- Vortex: A Flexible and Efficient Sparse Attention Framework ☆45 · Updated last week
- Accelerating MoE with IO and Tile-aware Optimizations ☆563 · Updated last week
- Memory optimized Mixture of Experts ☆72 · Updated 6 months ago
- ☆117 · Updated 3 weeks ago
- Distributed MoE in a Single Kernel [NeurIPS '25] ☆188 · Updated this week
- LLM Serving Performance Evaluation Harness ☆83 · Updated 11 months ago
- Boosting 4-bit inference kernels with 2:4 Sparsity ☆93 · Updated last year
- GEAR: An Efficient KV Cache Compression Recipe for Near-Lossless Generative Inference of LLM ☆176 · Updated last year
- [NeurIPS'25 Oral] Query-agnostic KV cache eviction: 3–4× reduction in memory and 2× decrease in latency (Qwen3/2.5, Gemma3, LLaMA3) ☆193 · Updated last week
- ☆64 · Updated 8 months ago
- Training-free Post-training Efficient Sub-quadratic Complexity Attention. Implemented with OpenAI Triton. ☆148 · Updated 2 months ago
- Code for data-aware compression of DeepSeek models ☆69 · Updated last month
- AI-Driven Research Systems (ADRS) ☆117 · Updated last month
- Efficient LLM Inference over Long Sequences ☆394 · Updated 7 months ago
- Code for paper: [ICLR2025 Oral] FlexPrefill: A Context-Aware Sparse Attention Mechanism for Efficient Long-Sequence Inference ☆161 · Updated 3 months ago
- 16-fold memory access reduction with nearly no loss ☆109 · Updated 10 months ago