dlsyscourse / hw1
☆8 · Updated 9 months ago
Alternatives and similar repositories for hw1
Users interested in hw1 are comparing it to the libraries listed below.
- A simple calculation for LLM MFU (see the sketch after this list). ☆38 · Updated 3 months ago
- ☆36 · Updated last year
- ☆20 · Updated 9 months ago
- GPTQ inference TVM kernel ☆40 · Updated last year
- ☆74 · Updated 4 years ago
- ☆207 · Updated 7 months ago
- ☆87 · Updated 3 months ago
- A minimal implementation of vLLM. ☆44 · Updated 11 months ago
- Simple PyTorch graph capturing. ☆19 · Updated 2 years ago
- Summary of system papers/frameworks/codes/tools on training or serving large models ☆57 · Updated last year
- Machine Learning Compiler Road Map ☆43 · Updated last year
- Tutorials for writing high-performance GPU operators in AI frameworks. ☆130 · Updated last year
- FlexFlow Serve: Low-Latency, High-Performance LLM Serving ☆42 · Updated last month
- Triton compiler-related materials. ☆30 · Updated 5 months ago
- ☆77 · Updated 2 months ago
- ☆84 · Updated 3 years ago
- Examples and exercises from the book Programming Massively Parallel Processors: A Hands-on Approach by David B. Kirk and Wen-mei W. Hwu (T… ☆69 · Updated 4 years ago
- Triton implementation of Flash Attention 2.0 ☆35 · Updated last year
- (NeurIPS 2022) Automatically finding good model-parallel strategies, especially for complex models and clusters. ☆39 · Updated 2 years ago
- ☆69 · Updated 7 months ago
- A TVM-like CUDA/C code generator. ☆9 · Updated 3 years ago
- ☆105 · Updated 10 months ago
- Summary of some awesome work for optimizing LLM inference ☆77 · Updated 3 weeks ago
- ☆97 · Updated 9 months ago
- SOTA Learning-augmented Systems ☆36 · Updated 3 years ago
- ☆170 · Updated last year
- PET: Optimizing Tensor Programs with Partially Equivalent Transformations and Automated Corrections ☆121 · Updated 3 years ago
- DeeperGEMM: crazy optimized version ☆69 · Updated last month
- PyTorch bindings for CUTLASS grouped GEMM. ☆100 · Updated 3 weeks ago
- Decoding Attention is specially optimized for MHA, MQA, GQA, and MLA, using CUDA cores for the decoding stage of LLM inference. ☆38 · Updated 2 weeks ago
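
For context on the first entry, here is a minimal sketch of what an LLM MFU (Model FLOPs Utilization) calculation looks like. It assumes the common ~6·N FLOPs-per-training-token approximation; the function name and all hardware numbers are illustrative assumptions, not taken from the linked repo.

```python
# Minimal MFU (Model FLOPs Utilization) sketch; assumes the standard
# ~6 * params FLOPs-per-token estimate for a training step (forward + backward).
# Names and numbers are illustrative, not from the linked repo.

def mfu(params: float, tokens_per_sec: float, num_gpus: int, peak_flops: float) -> float:
    """Fraction of aggregate hardware peak FLOPs actually achieved."""
    achieved = 6 * params * tokens_per_sec       # model FLOPs per second
    return achieved / (num_gpus * peak_flops)    # utilization ratio in [0, 1]

# Example: 7B parameters, 100k tokens/s across 8 GPUs at 989 TFLOP/s BF16 peak each.
print(f"MFU: {mfu(7e9, 1e5, 8, 989e12):.1%}")   # ~53.1%
```

The 6·N factor counts forward plus backward matmul FLOPs per token; inference-only estimates typically use ~2·N (forward pass only) instead.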