openmlsys / openmlsys-en
《Machine Learning Systems: Design and Implementation》 (English Version)
☆37 · Updated last year
Alternatives and similar repositories for openmlsys-en
Users interested in openmlsys-en are comparing it to the libraries listed below.
- Systems for GenAI ☆157 · Updated last week
- A review of automated kernel generation in the era of LLMs ☆91 · Updated 2 weeks ago
- fmchisel: Efficient Compression and Training Algorithms for Foundation Models ☆83 · Updated 3 months ago
- A minimal cache manager for PagedAttention, built on top of llama3. ☆135 · Updated last year
- [ICLR 2025 Spotlight] MagicPIG: LSH Sampling for Efficient LLM Generation ☆248 · Updated last year
- PyTorch library for cost-effective, fast, and easy serving of MoE models. ☆279 · Updated this week
- Since the emergence of ChatGPT in 2022, accelerating large language models has become increasingly important. Here is a list of pap… ☆283 · Updated 11 months ago
- Materials for learning SGLang ☆738 · Updated last month
- A tiny yet powerful LLM inference system tailored for research purposes. vLLM-equivalent performance with only 2k lines of code (2% of … ☆313 · Updated 7 months ago
- Papers and their code for AI systems ☆347 · Updated last month
- [OSDI'24] Serving LLM-based Applications Efficiently with Semantic Variable ☆209 · Updated last year
- Cataloging released Triton kernels. ☆292 · Updated 4 months ago
- A large-scale simulation framework for LLM inference ☆530 · Updated 6 months ago
- A curated list of awesome projects and papers for distributed training or inference ☆265 · Updated last year
- [ICLR'25] Fast Inference of MoE Models with CPU-GPU Orchestration ☆260 · Updated last year
- Distributed MoE in a Single Kernel [NeurIPS '25] ☆190 · Updated last week
- torchcomms: a modern PyTorch communications API ☆327 · Updated this week
- ☆96 · Updated 10 months ago
- JAX backend for SGL ☆234 · Updated this week
- ☆47 · Updated last year
- ☆166 · Updated 2 months ago
- Allow torch tensor memory to be released and resumed later ☆216 · Updated 3 weeks ago
- Modular and structured prompt caching for low-latency LLM inference ☆110 · Updated last year
- A low-latency & high-throughput serving engine for LLMs ☆470 · Updated last month
- ☆628 · Updated 3 weeks ago
- LLMem: GPU Memory Estimation for Fine-Tuning Pre-Trained LLMs ☆28 · Updated 8 months ago
- 🤖FFPA: Extend FlashAttention-2 with Split-D, ~O(1) SRAM complexity for large headdim, 1.8x~3x↑🎉 vs SDPA EA. ☆250 · Updated this week
- Efficient Long-context Language Model Training by Core Attention Disaggregation ☆87 · Updated last week
- 🔥 LLM-powered GPU kernel synthesis: Train models to convert PyTorch ops into optimized Triton kernels via SFT+RL. Multi-turn compilation… ☆116 · Updated 2 months ago
- ☆222 · Updated last year