Kedreamix / pytorch-cppcuda-tutorial
Tutorial for writing custom PyTorch C++/CUDA kernels, applied to volume rendering (NeRF)
☆28 · Updated last year
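As a rough illustration of the kind of code such a tutorial covers (not taken from the repository itself), a custom C++ op can be exposed to PyTorch through a pybind11 extension module; the op name scale_forward and its trivial scaling logic below are placeholders standing in for a real hand-written kernel.

    // Minimal sketch of a custom PyTorch C++ extension op (illustrative only;
    // the op name and logic are placeholders, not code from this repository).
    #include <torch/extension.h>

    // A trivial forward op that scales a tensor; a real tutorial would
    // dispatch to a hand-written CUDA kernel here instead.
    torch::Tensor scale_forward(torch::Tensor input, double factor) {
      TORCH_CHECK(input.is_contiguous(), "input must be contiguous");
      return input * factor;
    }

    // Register the op so Python can call it after building the extension,
    // e.g. via torch.utils.cpp_extension.load(name="scale_ext", sources=[...]).
    PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) {
      m.def("scale_forward", &scale_forward, "Scale a tensor (placeholder op)");
    }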
Alternatives and similar repositories for pytorch-cppcuda-tutorial
Users interested in pytorch-cppcuda-tutorial are comparing it to the repositories listed below.
- Implement custom operators in PyTorch with CUDA/C++ ☆63 · Updated 2 years ago
- Decoding Attention is specially optimized for MHA, MQA, GQA, and MLA using CUDA cores for the decoding stage of LLM inference. ☆38 · Updated 2 weeks ago
- Code release for the book "Efficient Training in PyTorch" ☆69 · Updated 2 months ago
- Implement Flash Attention using CuTe. ☆87 · Updated 6 months ago
- Tutorials for writing high-performance GPU operators in AI frameworks. ☆130 · Updated last year
- Analyze AI problems with math and code. ☆17 · Updated 2 weeks ago
- ⚡️ Write HGEMM from scratch using Tensor Cores with the WMMA, MMA, and CuTe APIs, achieving peak performance. ⚡️ ☆82 · Updated last month
- ☆69 · Updated 7 months ago
- Code for Draft Attention ☆77 · Updated last month
- ☆135 · Updated last year
- CPU Memory Compiler and Parallel Programming ☆26 · Updated 7 months ago
- FastCache: Fast Caching for Diffusion Transformer Through Learnable Linear Approximation [Efficient ML Model] ☆28 · Updated 3 weeks ago
- An auxiliary project analyzing the characteristics of KV in DiT attention. ☆31 · Updated 7 months ago
- LLM inference with a deep learning accelerator. ☆44 · Updated 5 months ago
- Quantized attention on GPU ☆44 · Updated 7 months ago
- SGEMM optimization with CUDA, step by step ☆19 · Updated last year
- Triton documentation in Simplified Chinese / Triton 中文文档 ☆71 · Updated 2 months ago
- A lightweight llama-like LLM inference framework based on Triton kernels. ☆128 · Updated last week
- A llama model inference framework implemented in CUDA C++ ☆57 · Updated 7 months ago
- A minimalist and extensible PyTorch extension for implementing custom backend operators in PyTorch. ☆33 · Updated last year
- A GPU-optimized system for efficient long-context LLM decoding with a low-bit KV cache. ☆49 · Updated 2 weeks ago
- ☆170 · Updated last year
- ⚡️ FFPA: extends FlashAttention-2 with Split-D, achieving ~O(1) SRAM complexity for large head dims, 1.8x–3x speedup vs SDPA. 🎉 ☆187 · Updated last month
- LLM theoretical performance analysis tool supporting parameter, FLOPs, memory, and latency analysis. ☆96 · Updated 2 weeks ago
- Flash attention tutorial written in Python, Triton, CUDA, and CUTLASS ☆377 · Updated last month
- Convolution operator optimization on GPUs, including GEMM-based (implicit GEMM) convolution. ☆33 · Updated 6 months ago
- SpInfer: Leveraging Low-Level Sparsity for Efficient Large Language Model Inference on GPUs ☆48 · Updated 3 months ago
- PyTorch implementation of DeepSeek Native Sparse Attention ☆72 · Updated 3 months ago
- Code and examples for "CUDA - From Correctness to Performance" ☆100 · Updated 8 months ago
- Course materials for MIT 6.5940: TinyML and Efficient Deep Learning Computing ☆47 · Updated 5 months ago