MLSysOps / InfraGym
Empowering LLM Agents for Real-World Computer System Optimization
☆16 Updated 4 months ago
Alternatives and similar repositories for InfraGym
Users who are interested in InfraGym are comparing it to the libraries listed below.
- [Archived] For the latest updates and community contribution, please visit: https://github.com/Ascend/TransferQueue or https://gitcode.co… ☆13 Updated 3 weeks ago
- A simple calculation for LLM MFU. ☆66 Updated 4 months ago
- ☆85 Updated 9 months ago
- LLM Serving Performance Evaluation Harness ☆83 Updated 11 months ago
- Stateful LLM Serving ☆95 Updated 10 months ago
- ☆73 Updated 4 months ago
- Efficient Long-context Language Model Training by Core Attention Disaggregation ☆87 Updated last week
- ☆96 Updated 10 months ago
- Triton-based Symmetric Memory operators and examples ☆81 Updated 3 weeks ago
- ☆51 Updated 9 months ago
- ByteCheckpoint: A Unified Checkpointing Library for LFMs ☆268 Updated last month
- Summary of system papers/frameworks/codes/tools on training or serving large model ☆57 Updated 2 years ago
- [NeurIPS 2024] Efficient LLM Scheduling by Learning to Rank ☆69 Updated last year
- High-performance distributed data shuffling (all-to-all) library for MoE training and inference ☆111 Updated last month
- Accepted to MLSys 2026 ☆70 Updated last week
- ☆89 Updated 3 years ago
- [OSDI'24] Serving LLM-based Applications Efficiently with Semantic Variable ☆209 Updated last year
- Fast and memory-efficient exact attention ☆18 Updated 2 weeks ago
- Genai-bench is a powerful benchmark tool designed for comprehensive token-level performance evaluation of large language model (LLM) serv… ☆263 Updated this week
- ☆47 Updated last year
- 🔥 LLM-powered GPU kernel synthesis: Train models to convert PyTorch ops into optimized Triton kernels via SFT+RL. Multi-turn compilation… ☆116 Updated 2 months ago
- Toolchain built around the Megatron-LM for Distributed Training ☆84 Updated 2 months ago
- PyTorch implementation of paper "Response Length Perception and Sequence Scheduling: An LLM-Empowered LLM Inference Pipeline". ☆93 Updated 2 years ago
- ☆52 Updated 8 months ago
- Allow torch tensor memory to be released and resumed later ☆216 Updated 3 weeks ago
- A high-performance distributed deep learning system targeting large-scale and automated distributed training. If you have any interests, … ☆124 Updated 2 years ago
- DLSlime: Flexible & Efficient Heterogeneous Transfer Toolkit