sunkx109 / My-Torch-Extension
A minimalist and extensible PyTorch extension for implementing custom backend operators in PyTorch.
☆31 · Updated 9 months ago
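For readers unfamiliar with the pattern, the sketch below shows roughly what a custom C++ operator compiled as a PyTorch extension can look like. It uses only the stock `torch.utils.cpp_extension.load_inline` API, not anything specific to My-Torch-Extension; the `scaled_add` operator and the `my_inline_ext` name are made up for illustration, and a working C++ toolchain is assumed.

```python
import torch
from torch.utils.cpp_extension import load_inline

# Hypothetical sketch (not this repo's actual API): a tiny C++ operator
# JIT-compiled into a Python-importable extension module.
cpp_source = r"""
#include <torch/extension.h>

// A trivial "backend" operator: scaled elementwise addition done in C++.
torch::Tensor scaled_add(torch::Tensor a, torch::Tensor b, double alpha) {
    return a + alpha * b;
}
"""

ext = load_inline(
    name="my_inline_ext",       # hypothetical extension name
    cpp_sources=cpp_source,
    functions=["scaled_add"],   # auto-generate Python bindings for this function
    verbose=False,
)

a, b = torch.randn(4), torch.randn(4)
print(ext.scaled_add(a, b, 0.5))   # matches a + 0.5 * b, computed in the C++ extension
```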
Alternatives and similar repositories for My-Torch-Extension:
Users who are interested in My-Torch-Extension are comparing it to the repositories listed below.
- learning how CUDA works ☆190 · Updated 5 months ago
- LLM theoretical performance analysis tool supporting parameter, FLOPs, memory, and latency analysis. ☆76 · Updated 3 weeks ago
- ☆106 · Updated 10 months ago
- Flash attention tutorial written in Python, Triton, CUDA, and Cutlass ☆255 · Updated 3 weeks ago
- A CUDA tutorial to help people learn CUDA programming from scratch ☆203 · Updated 6 months ago
- Puzzles for learning Triton; play with minimal environment configuration! ☆205 · Updated last month
- Examples of CUDA implementations with Cutlass CuTe ☆132 · Updated 2 months ago
- A lightweight llama-like LLM inference framework based on Triton kernels. ☆78 · Updated 3 weeks ago
- Triton documentation in Simplified Chinese / Triton 中文文档 ☆52 · Updated 2 weeks ago
- Tutorials for writing high-performance GPU operators in AI frameworks. ☆127 · Updated last year
- Implement Flash Attention using CuTe. ☆67 · Updated last month
- ☆42 · Updated this week
- A collection of memory-efficient attention operators implemented in the Triton language. ☆233 · Updated 7 months ago
- A tutorial for CUDA & PyTorch ☆126 · Updated last week
- A pared-down flash-attention implementation written with cutlass, intended as a teaching example ☆35 · Updated 5 months ago
- ☆95 · Updated last month
- 📚[WIP] FFPA: Yet another Faster Flash Prefill Attention with O(1)⚡️GPU SRAM complexity for headdim > 256, 1.8x~3x↑🎉 faster vs SDPA EA. ☆73 · Updated this week
- Performance of the C++ interface of flash attention and flash attention v2 in large language model (LLM) inference scenarios. ☆34 · Updated 4 months ago
- ☆73 · Updated 6 months ago
- Summary of some awesome work on optimizing LLM inference ☆51 · Updated last month
- An easy-to-understand TensorOp Matmul Tutorial ☆307 · Updated 4 months ago
- ☆57 · Updated 2 months ago
- A solid project for campus hiring, autumn/spring recruiting, and internships: build from scratch an LLM inference framework that supports LLama2/3 and Qwen2.5. ☆269 · Updated 2 weeks ago
- FP8 flash attention implemented with the cutlass repository on the Ada architecture ☆53 · Updated 5 months ago
- Hand-written CUDA operators and an interview guide ☆103 · Updated 2 weeks ago
- QQQ is an innovative and hardware-optimized W4A8 quantization solution for LLMs. ☆95 · Updated last month
- This repository collects a batch of noteworthy MLSys bloggers (algorithms/systems) ☆153 · Updated 3 weeks ago
- Official PyTorch implementation of FlatQuant: Flatness Matters for LLM Quantization ☆95 · Updated last week
- Optimize softmax in Triton in many cases (see the minimal kernel sketch after this list) ☆17 · Updated 4 months ago
- Since the emergence of ChatGPT in 2022, accelerating Large Language Models has become increasingly important. Here is a list of papers… ☆213 · Updated last month
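Several of the entries above (the Triton puzzles, the softmax-in-Triton repository, and the attention-operator collections) revolve around writing row-wise reduction kernels in Triton. As a point of reference, here is a minimal numerically-stable softmax kernel in the style of the official Triton tutorial; the names (`softmax_kernel`, `BLOCK_SIZE`) are illustrative and not taken from any of the repositories listed, and each row is assumed to fit in a single block.

```python
import torch
import triton
import triton.language as tl

@triton.jit
def softmax_kernel(out_ptr, in_ptr, n_cols, in_row_stride, out_row_stride,
                   BLOCK_SIZE: tl.constexpr):
    # One program instance handles one row of the input matrix.
    row = tl.program_id(0)
    col_offsets = tl.arange(0, BLOCK_SIZE)
    mask = col_offsets < n_cols
    x = tl.load(in_ptr + row * in_row_stride + col_offsets,
                mask=mask, other=-float("inf"))
    x = x - tl.max(x, axis=0)            # subtract the row max for numerical stability
    num = tl.exp(x)
    denom = tl.sum(num, axis=0)
    tl.store(out_ptr + row * out_row_stride + col_offsets,
             num / denom, mask=mask)

def softmax(x: torch.Tensor) -> torch.Tensor:
    # Illustrative wrapper: launch one program per row on a CUDA tensor.
    n_rows, n_cols = x.shape
    out = torch.empty_like(x)
    BLOCK_SIZE = triton.next_power_of_2(n_cols)
    softmax_kernel[(n_rows,)](out, x, n_cols,
                              x.stride(0), out.stride(0),
                              BLOCK_SIZE=BLOCK_SIZE)
    return out
```

The repositories above go further with autotuning, multi-block rows, and fused attention, but this is the basic shape most of their kernels start from.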