Kedreamix / pytorch-cppcuda-tutorial
Tutorial for writing custom PyTorch C++/CUDA kernels, applied to volume rendering (NeRF)
☆27 Updated last year
Alternatives and similar repositories for pytorch-cppcuda-tutorial
Users who are interested in pytorch-cppcuda-tutorial are comparing it to the repositories listed below
- Implement custom operators in PyTorch with CUDA/C++ ☆60 Updated 2 years ago
- ☆125 Updated 2 weeks ago
- ☆168 Updated last year
- Implement Flash Attention using CuTe. ☆82 Updated 4 months ago
- Decoding Attention is specially optimized for MHA, MQA, GQA and MLA using CUDA cores for the decoding stage of LLM inference. ☆36 Updated last month
- ☆68 Updated 3 weeks ago
- ⚡️Write HGEMM from scratch using Tensor Cores with WMMA, MMA and CuTe API, Achieve Peak⚡️ Performance. ☆76 Updated this week
- Analyse AI problems with math and code ☆13 Updated last week
- Quantized Attention on GPU ☆45 Updated 5 months ago
- Theoretical performance analysis tools for LLMs, supporting parameter, FLOPs, memory and latency analysis. ☆88 Updated 4 months ago
- CPU Memory Compiler and Parallel programming ☆26 Updated 5 months ago
- Triton documentation in Simplified Chinese / Triton 中文文档 ☆69 Updated last month
- Tutorials for writing high-performance GPU operators in AI frameworks. ☆130 Updated last year
- ☆65 Updated 6 months ago
- Step-by-step SGEMM optimization with CUDA ☆18 Updated last year
- ☆123 Updated last year
- Code & examples for "CUDA - From Correctness to Performance" ☆98 Updated 6 months ago
- An auxiliary project analyzing the characteristics of KV in DiT attention. ☆29 Updated 5 months ago
- ☆33 Updated last year
- Examples of CUDA implementations using CUTLASS CuTe ☆177 Updated 3 months ago
- Personal Notes for Learning HPC & Parallel Computation [Active Adding New Content] ☆66 Updated 2 years ago
- 📚FFPA(Split-D): Extend FlashAttention with Split-D for large headdim, O(1) GPU SRAM complexity, 1.8x~3x↑🎉 faster than SDPA EA. ☆174 Updated this week
- A GPU-optimized system for efficient long-context LLM decoding with low-bit KV cache. ☆34 Updated 3 weeks ago
- A PyTorch-like deep learning framework. Just for fun. ☆153 Updated last year
- A minimalist and extensible PyTorch extension for implementing custom backend operators in PyTorch. ☆33 Updated last year
- Flash Attention tutorial written in Python, Triton, CUDA, and CUTLASS ☆349 Updated this week
- Optimized softmax in Triton for many cases ☆20 Updated 8 months ago
- A llama model inference framework implemented in CUDA C++ ☆55 Updated 6 months ago
- Code release for the book "Efficient Training in PyTorch" ☆64 Updated last month
- [EuroSys'24] Minuet: Accelerating 3D Sparse Convolutions on GPUs ☆75 Updated 11 months ago