leo811121 / UIUC-CS-483-Parallel-Programming
☆18Updated 4 years ago
Related projects:
- ☆124Updated last week
- Cataloging released Triton kernels.☆111Updated 3 weeks ago
- ring-attention experiments☆89Updated 5 months ago
- ☆124Updated 7 months ago
- Collection of kernels written in Triton language☆48Updated 2 weeks ago
- Learning about CUDA by writing PTX code.☆28Updated 6 months ago
- Small scale distributed training of sequential deep learning models, built on Numpy and MPI.☆88Updated 11 months ago
- A minimal implementation of vllm.☆29Updated last month
- Applied AI experiments and examples for PyTorch☆123Updated last month
- ☆83Updated 3 weeks ago
- A tiny yet powerful LLM inference system tailored for research purposes. vLLM-equivalent performance with only 2k lines of code (2% of …☆84Updated 2 months ago
- Imperative deep learning framework with customized GPU and CPU backend☆28Updated last year
- Learn CUDA with PyTorch☆11Updated last month
- This repository contains the experimental PyTorch native float8 training UX☆210Updated last month
- ☆24Updated last year
- Solve puzzles. Learn CUDA.☆53Updated 9 months ago
- 《Machine Learning Systems: Design and Implementation》- English Version☆16Updated 8 months ago
- Memory Optimizations for Deep Learning (ICML 2023)☆58Updated 6 months ago
- CUDA and Triton implementations of Flash Attention with SoftmaxN.☆66Updated 3 months ago
- An implementation of the Llama architecture, to instruct and delight☆21Updated last month
- ML/DL Math and Method notes☆56Updated 9 months ago
- GEAR: An Efficient KV Cache Compression Recipe for Near-Lossless Generative Inference of LLM☆134Updated 2 months ago
- ☆21Updated 4 months ago
- Megatron's multi-modal data loader☆42Updated this week
- ☆75Updated this week
- Just some miscellaneous utility functions / decorators / modules related to Pytorch and Accelerate to help speed up implementation of new…☆115Updated last month
- ☆68Updated 2 months ago
- [MLSys'24] Atom: Low-bit Quantization for Efficient and Accurate LLM Serving☆258Updated 2 months ago
- An implementation of the transformer architecture onto an Nvidia CUDA kernel☆152Updated 11 months ago
- Fast Matrix Multiplications for Lookup Table-Quantized LLMs☆156Updated this week