cornserve-ai / cornserve
Easy, Fast, and Scalable Multimodal AI
☆109 · Updated this week
Alternatives and similar repositories for cornserve
Users interested in cornserve are comparing it to the libraries listed below.
- [ICLR2025] Breaking Throughput-Latency Trade-off for Long Sequences with Speculative Decoding ☆141 · Updated last year
- KV cache compression for high-throughput LLM inference ☆153 · Updated last year
- Efficient Long-context Language Model Training by Core Attention Disaggregation ☆87 · Updated last week
- ☆47 · Updated last year
- Block Diffusion for Ultra-Fast Speculative Decoding ☆459 · Updated last week
- 🔥 LLM-powered GPU kernel synthesis: Train models to convert PyTorch ops into optimized Triton kernels via SFT+RL. Multi-turn compilation… ☆116 · Updated 3 months ago
- [NeurIPS 2025] Scaling Speculative Decoding with Lookahead Reasoning ☆63 · Updated 3 months ago
- GEAR: An Efficient KV Cache Compression Recipe for Near-Lossless Generative Inference of LLM ☆176 · Updated last year
- LLM Serving Performance Evaluation Harness ☆83 · Updated 11 months ago
- Code for data-aware compression of DeepSeek models ☆70 · Updated last month
- Official implementation for Training LLMs with MXFP4 ☆118 · Updated 9 months ago
- High-performance distributed data shuffling (all-to-all) library for MoE training and inference ☆112 · Updated last month
- A unified library for building, evaluating, and storing speculative decoding algorithms for LLM inference in vLLM ☆228 · Updated this week
- An early research-stage expert-parallel load balancer for MoE models based on linear programming ☆495 · Updated 2 months ago
- [NeurIPS'25 Oral] Query-agnostic KV cache eviction: 3–4× reduction in memory and 2× decrease in latency (Qwen3/2.5, Gemma3, LLaMA3) ☆196 · Updated 3 weeks ago
- ☆64 · Updated 8 months ago
- Distributed MoE in a Single Kernel [NeurIPS '25] ☆191 · Updated this week
- Accelerating MoE with IO and Tile-aware Optimizations ☆569 · Updated 3 weeks ago
- Training-free Post-training Efficient Sub-quadratic Complexity Attention, implemented with OpenAI Triton ☆148 · Updated 3 months ago
- ArcticInference: vLLM plugin for high-throughput, low-latency inference ☆391 · Updated this week
- Code for paper: [ICLR2025 Oral] FlexPrefill: A Context-Aware Sparse Attention Mechanism for Efficient Long-Sequence Inference ☆160 · Updated 3 months ago
- ☆119 · Updated last month
- Vortex: A Flexible and Efficient Sparse Attention Framework ☆45 · Updated 2 weeks ago
- ☆118 · Updated 8 months ago
- ☆47 · Updated 9 months ago
- The evaluation framework for training-free sparse attention in LLMs ☆117 · Updated 2 weeks ago
- torchcomms: a modern PyTorch communications API ☆327 · Updated this week
- [ICLR2025 Spotlight] MagicPIG: LSH Sampling for Efficient LLM Generation ☆248 · Updated last year
- Memory-optimized Mixture of Experts ☆73 · Updated 6 months ago
- AI-Driven Research Systems (ADRS) ☆119 · Updated last month