GeeeekExplorer / nano-vllm
Nano vLLM
☆1,659 · Updated this week
Alternatives and similar repositories for nano-vllm
Users interested in nano-vllm are comparing it to the libraries listed below.
- My learning notes/codes for ML SYS. ☆2,498 · Updated this week
- A throughput-oriented high-performance serving framework for LLMs ☆822 · Updated 2 weeks ago
- FlashInfer: Kernel Library for LLM Serving ☆3,211 · Updated this week
- Materials for learning SGLang ☆435 · Updated 2 weeks ago
- Fast, Flexible and Portable Structured Generation ☆1,008 · Updated last week
- MoBA: Mixture of Block Attention for Long-Context LLMs ☆1,798 · Updated 2 months ago
- Redis for LLMs ☆1,560 · Updated this week
- A fast communication-overlapping library for tensor/expert parallelism on GPUs. ☆978 · Updated 3 weeks ago
- Distributed RL System for LLM Reasoning ☆1,774 · Updated this week
- Mooncake is the serving platform for Kimi, a leading LLM service provided by Moonshot AI. ☆3,436 · Updated this week
- Domain-specific language designed to streamline the development of high-performance GPU/CPU/accelerator kernels ☆1,292 · Updated this week
- Official Implementation of EAGLE-1 (ICML'24), EAGLE-2 (EMNLP'24), and EAGLE-3. ☆1,325 · Updated last week
- Muon is Scalable for LLM Training ☆1,077 · Updated 2 months ago
- A self-learning tutorial for CUDA high-performance programming. ☆641 · Updated 2 months ago
- 📰 Must-read papers and blogs on Speculative Decoding ⚡️ ☆792 · Updated this week
- An Efficient and User-Friendly Scaling Library for Reinforcement Learning with Large Language Models ☆936 · Updated this week
- Official Repo for Open-Reasoner-Zero ☆1,967 · Updated 2 weeks ago
- Disaggregated serving system for Large Language Models (LLMs). ☆614 · Updated 2 months ago
- 🐳 Efficient Triton implementations for "Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention" ☆700 · Updated 3 months ago
- Transformers-compatible library for applying various compression algorithms to LLMs for optimized deployment with vLLM ☆1,500 · Updated this week
- LLM notes, including model inference, transformer model structure, and LLM framework code analysis notes. ☆780 · Updated this week
- The official repo of Pai-Megatron-Patch for LLM & VLM large-scale training, developed by Alibaba Cloud. ☆1,151 · Updated last week
- [NeurIPS'24 Spotlight, ICLR'25, ICML'25] Speeds up long-context LLM inference via approximate, dynamic sparse calculation of attention… ☆1,055 · Updated this week
- RTP-LLM: Alibaba's high-performance LLM inference engine for diverse applications. ☆791 · Updated 2 weeks ago
- Distributed Compiler Based on Triton for Parallel Systems ☆829 · Updated this week
- vLLM’s reference system for K8s-native cluster-wide deployment with community-driven performance optimization ☆1,350 · Updated this week
- Community-maintained hardware plugin for vLLM on Ascend ☆773 · Updated this week
- Ring attention implementation with flash attention ☆782 · Updated last week
- High-performance inference framework for large language models, focusing on efficiency, flexibility, and availability. ☆1,144 · Updated this week
- Puzzles for learning Triton, playable with minimal environment configuration! ☆362 · Updated 6 months ago