cisco-open / pymultiworld
A framework for PyTorch to enable fault management for collective communication libraries (CCLs) such as NCCL
☆19 · Updated last week
Alternatives and similar repositories for pymultiworld:
Users interested in pymultiworld are comparing it to the libraries listed below:
- Boosting 4-bit inference kernels with 2:4 Sparsity ☆73 · Updated 8 months ago
- Extensible collectives library in Triton ☆86 · Updated last month
- Tree Attention: Topology-aware Decoding for Long-Context Attention on GPU clusters ☆126 · Updated 5 months ago
- Cold Compress is a hackable, lightweight, and open-source toolkit for creating and benchmarking cache compression methods built on top of… ☆131 · Updated 9 months ago
- A resilient distributed training framework ☆95 · Updated last year
- [ICLR2025] Breaking Throughput-Latency Trade-off for Long Sequences with Speculative Decoding ☆116 · Updated 5 months ago
- [OSDI'24] Serving LLM-based Applications Efficiently with Semantic Variable ☆155 · Updated 7 months ago
- ☆104 · Updated 8 months ago
- Load compute kernels from the Hub ☆116 · Updated this week
- ByteCheckpoint: A Unified Checkpointing Library for LFMs ☆207 · Updated last month
- Triton-based implementation of Sparse Mixture of Experts. ☆212 · Updated 5 months ago
- A bunch of kernels that might make stuff slower 😉 ☆40 · Updated this week
- A minimal implementation of vLLM. ☆40 · Updated 9 months ago
- ring-attention experiments ☆140 · Updated 6 months ago
- Hydragen: High-Throughput LLM Inference with Shared Prefixes ☆36 · Updated last year
- Official repository for DistFlashAttn: Distributed Memory-efficient Attention for Long-context LLMs Training ☆209 · Updated 8 months ago
- LLM Serving Performance Evaluation Harness ☆77 · Updated 2 months ago
- Microsoft Collective Communication Library ☆65 · Updated 5 months ago
- prime-rl is a codebase for decentralized RL training at scale ☆89 · Updated this week
- [NeurIPS 2024] Efficient LLM Scheduling by Learning to Rank ☆48 · Updated 6 months ago
- Simple and efficient PyTorch-native transformer training and inference (batched) ☆75 · Updated last year
- Applied AI experiments and examples for PyTorch ☆264 · Updated last week
- Tritonbench is a collection of PyTorch custom operators with example inputs to measure their performance. ☆122 · Updated this week
- 🏙 Interactive performance profiling and debugging tool for PyTorch neural networks. ☆61 · Updated 3 months ago
- ☆79 · Updated 6 months ago
- Ouroboros: Speculative Decoding with Large Model Enhanced Drafting (EMNLP 2024 main) ☆103 · Updated last month
- ☆27 · Updated last year
- A lightweight design for computation-communication overlap. ☆67 · Updated last week
- ☆93 · Updated 2 years ago
- The source code of our work "Prepacking: A Simple Method for Fast Prefilling and Increased Throughput in Large Language Models" [AISTATS … ☆59 · Updated 7 months ago