leo811121 / UIUC-CS-483-Parallel-Programming
☆18 Updated 5 years ago
Alternatives and similar repositories for UIUC-CS-483-Parallel-Programming:
Users who are interested in UIUC-CS-483-Parallel-Programming are comparing it to the repositories listed below.
- A minimal implementation of vLLM. ☆37 Updated 8 months ago
- Cataloging released Triton kernels. ☆212 Updated 2 months ago
- DeeperGEMM: crazy optimized version ☆63 Updated 2 weeks ago
- Tritonbench is a collection of PyTorch custom operators with example inputs to measure their performance. ☆104 Updated this week
- Collection of kernels written in Triton language ☆114 Updated last month
- [ICLR 2025] Breaking Throughput-Latency Trade-off for Long Sequences with Speculative Decoding ☆111 Updated 3 months ago
- Mixed precision training from scratch with Tensors and CUDA ☆21 Updated 10 months ago
- ring-attention experiments ☆128 Updated 5 months ago
- [ICLR 2025] Palu: Compressing KV-Cache with Low-Rank Projection ☆91 Updated last month
- DPO, but faster 🚀 ☆40 Updated 3 months ago
- Load compute kernels from the Hub ☆107 Updated this week
- [WIP] Better (FP8) attention for Hopper ☆26 Updated last month
- Extensible collectives library in Triton ☆84 Updated 6 months ago
- A Suite for Parallel Inference of Diffusion Transformers (DiTs) on multi-GPU Clusters ☆44 Updated 8 months ago
- A curated collection of resources, tutorials, and best practices for learning and mastering NVIDIA CUTLASS ☆154 Updated last week
- Fast low-bit matmul kernels in Triton ☆272 Updated this week
- TileFusion is an experimental C++ macro kernel template library that elevates the abstraction level in CUDA C for tile processing. ☆74 Updated this week
- My Implementation of Q-Sparse: All Large Language Models can be Fully Sparsely-Activated ☆31 Updated 7 months ago
- Write a fast kernel and run it on Discord. See how you compare against the best! ☆35 Updated this week
- Odysseus: Playground of LLM Sequence Parallelism ☆68 Updated 9 months ago
- PTX-Tutorial Written Purely By AIs (Deep Research of OpenAI and Claude 3.7) ☆60 Updated last week
- FlexAttention w/ FlashAttention3 Support ☆26 Updated 5 months ago
- Simple and efficient PyTorch-native transformer training and inference (batched) ☆71 Updated 11 months ago