UofT-EcoSystem / Tempo
Memory footprint reduction for transformer models
☆11 · Updated 2 years ago
Alternatives and similar repositories for Tempo
Users interested in Tempo are comparing it to the libraries listed below.
- ☆75 · Updated 4 years ago
- ☆42 · Updated 2 years ago
- PyTorch bindings for CUTLASS grouped GEMM. ☆101 · Updated last month
- ☆149 · Updated 11 months ago
- DeeperGEMM: crazy optimized version ☆69 · Updated 2 months ago
- 16-fold memory access reduction with nearly no loss ☆100 · Updated 3 months ago
- ☆106 · Updated 10 months ago
- ☆9 · Updated last year
- pytorch-profiler ☆51 · Updated 2 years ago
- (NeurIPS 2022) Automatically finding good model-parallel strategies, especially for complex models and clusters. ☆40 · Updated 2 years ago
- ☆60 · Updated 2 months ago
- nnScaler: Compiling DNN models for Parallel Training ☆113 · Updated last week
- ☆39 · Updated last year
- Chimera: bidirectional pipeline parallelism for efficiently training large-scale models. ☆67 · Updated 3 months ago
- Complete GPU residency for ML. ☆31 · Updated last week
- Quantized Attention on GPU ☆44 · Updated 7 months ago
- Supplemental materials for The ASPLOS 2025 / EuroSys 2025 Contest on Intra-Operator Parallelism for Distributed Deep Learning ☆23 · Updated 2 months ago
- Odysseus: Playground of LLM Sequence Parallelism ☆70 · Updated last year
- TritonBench: Benchmarking Large Language Model Capabilities for Generating Triton Operators ☆62 · Updated last month
- Debug print operator for cudagraph debugging ☆12 · Updated 11 months ago
- ☆52 · Updated 11 months ago
- Automated Parallelization System and Infrastructure for Multiple Ecosystems ☆79 · Updated 7 months ago
- ☆49 · Updated last month
- ☆77 · Updated 5 months ago
- SpInfer: Leveraging Low-Level Sparsity for Efficient Large Language Model Inference on GPUs ☆50 · Updated 3 months ago
- An Attention Superoptimizer ☆22 · Updated 5 months ago
- ☆83 · Updated 8 months ago
- Official repository for DistFlashAttn: Distributed Memory-efficient Attention for Long-context LLMs Training ☆212 · Updated 10 months ago
- A Python-embedded DSL that makes it easy to write fast, scalable ML kernels with minimal boilerplate. ☆187 · Updated this week
- ☆86 · Updated 3 years ago