dlsyscourse / hw1
☆7 Updated 6 months ago
Alternatives and similar repositories for hw1:
Users who are interested in hw1 are comparing it to the repositories listed below
- ☆19 Updated 6 months ago
- ☆32 Updated 10 months ago
- Machine Learning Compiler Road Map ☆43 Updated last year
- GPTQ inference TVM kernel ☆38 Updated 11 months ago
- A practical way of learning Swizzle ☆16 Updated last month
- DeeperGEMM: crazy optimized version ☆63 Updated 2 weeks ago
- ☆61 Updated 4 months ago
- Simple PyTorch graph capturing. ☆17 Updated last year
- Implement Flash Attention using Cute. ☆74 Updated 3 months ago
- Surrogate-based Hyperparameter Tuning System ☆28 Updated last year
- SOTA Learning-augmented Systems ☆35 Updated 2 years ago
- ☆32 Updated 7 months ago
- TileFusion is an experimental C++ macro kernel template library that elevates the abstraction level in CUDA C for tile processing. ☆74 Updated this week
- ☆6 Updated 5 months ago
- ThrillerFlow is a Dataflow Analysis and Codegen Framework written in Rust. ☆14 Updated 4 months ago
- Decoding Attention is specially optimized for MHA, MQA, GQA and MLA using CUDA core for the decoding stage of LLM inference. ☆35 Updated 3 weeks ago
- ☆73 Updated 4 months ago
- High performance NCCL plugin for Bagua. ☆15 Updated 3 years ago
- ☆160 Updated last year
- ☆39 Updated 4 years ago
- Tutorials for writing high-performance GPU operators in AI frameworks. ☆130 Updated last year
- An implementation of parallelism techniques such as AMP, DDP, PP, and TP, for learning purposes ☆12 Updated last year
- Examples and exercises from the book Programming Massively Parallel Processors - A Hands-on Approach. David B. Kirk and Wen-mei W. Hwu (T… ☆66 Updated 4 years ago
- My Paper Reading Lists and Notes. ☆20 Updated 2 months ago
- We invite you to visit and follow our new repository at https://github.com/microsoft/TileFusion. TiledCUDA is a highly efficient kernel … ☆180 Updated 2 months ago
- A standalone GEMM kernel for fp16 activation and quantized weight, extracted from FasterTransformer ☆90 Updated last month
- A high-performance distributed deep learning system targeting large-scale and automated distributed training. If you have any interests, … ☆109 Updated last year
- ☆204 Updated 4 months ago
- Hands-on model tuning with TVM, profiled on a Mac M1, x86 CPU, and GTX-1080 GPU. ☆45 Updated last year
- Tacker: Tensor-CUDA Core Kernel Fusion for Improving the GPU Utilization while Ensuring QoS ☆23 Updated last month