SiriusNEO / Triton-Puzzles-Lite
Puzzles for learning Triton, playable with minimal environment configuration!
☆61 Updated this week
Related projects
Alternatives and complementary repositories for Triton-Puzzles-Lite
- [USENIX ATC '24] Accelerating the Training of Large Language Models using Efficient Activation Rematerialization and Optimal Hybrid Paral… ☆46 Updated 3 months ago
- A sparse attention kernel supporting mix sparse patterns ☆53 Updated 3 weeks ago
- PyTorch library for cost-effective, fast and easy serving of MoE models. ☆101 Updated 2 months ago
- High performance Transformer implementation in C++. ☆80 Updated last month
- MagicPIG: LSH Sampling for Efficient LLM Generation ☆45 Updated 2 weeks ago
- Since the emergence of ChatGPT in 2022, the acceleration of Large Language Models has become increasingly important. Here is a list of pap… ☆168 Updated last week
- ☆42 Updated 7 months ago
- A tiny yet powerful LLM inference system tailored for research purposes. vLLM-equivalent performance with only 2k lines of code (2% of … ☆98 Updated 4 months ago
- PyTorch bindings for CUTLASS grouped GEMM. ☆53 Updated last week
- Summary of some awesome work for optimizing LLM inference ☆35 Updated this week
- ☆63 Updated 3 months ago
- [ICML 2024] Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference ☆196 Updated last week
- Quantized Attention on GPU ☆29 Updated last week
- ☆34 Updated 2 months ago
- ☆70 Updated 2 years ago
- Dynamic Memory Management for Serving LLMs without PagedAttention ☆228 Updated last week
- Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative Model Inference with Unstructured Sparsity ☆178 Updated last year
- TiledCUDA is a highly efficient kernel template library designed to elevate CUDA C’s level of abstraction for processing tiles. ☆151 Updated this week
- nnScaler: Compiling DNN models for Parallel Training ☆64 Updated 2 weeks ago
- PyTorch bindings for CUTLASS grouped GEMM. ☆67 Updated 3 months ago
- ☆43 Updated last month
- ☆79 Updated 2 months ago
- Automated Parallelization System and Infrastructure for Multiple Ecosystems ☆75 Updated last week
- Materials for learning SGLang ☆86 Updated this week
- 16-fold memory access reduction with nearly no loss ☆57 Updated this week
- ☆65 Updated 3 years ago
- My learning notes/codes for ML SYS. ☆34 Updated this week
- flash attention tutorial written in python, triton, cuda, cutlass ☆195 Updated 4 months ago
- 📰 Must-read papers on KV Cache Compression (constantly updating 🤗). ☆123 Updated this week
- SKVQ: Sliding-window Key and Value Cache Quantization for Large Language Models ☆12 Updated last month