Kedreamix / pytorch-cppcuda-tutorial
Tutorial for writing custom PyTorch C++/CUDA kernels, applied to volume rendering (NeRF)
☆28 · Updated last year
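As a rough illustration of the kind of code such a tutorial covers (not taken from the repository itself), a custom C++ op can be exposed to PyTorch through a pybind11 extension module; the op name scale_forward and its trivial scaling logic below are placeholders standing in for a real hand-written kernel.

    // Minimal sketch of a custom PyTorch C++ extension op (illustrative only;
    // the op name and logic are placeholders, not code from this repository).
    #include <torch/extension.h>

    // A trivial forward op that scales a tensor; a real tutorial would
    // dispatch to a hand-written CUDA kernel here instead.
    torch::Tensor scale_forward(torch::Tensor input, double factor) {
      TORCH_CHECK(input.is_contiguous(), "input must be contiguous");
      return input * factor;
    }

    // Register the op so Python can call it after building the extension,
    // e.g. via torch.utils.cpp_extension.load(name="scale_ext", sources=[...]).
    PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) {
      m.def("scale_forward", &scale_forward, "Scale a tensor (placeholder op)");
    }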
Alternatives and similar repositories for pytorch-cppcuda-tutorial
Users interested in pytorch-cppcuda-tutorial are comparing it to the repositories listed below.
- Implement custom operators in PyTorch with CUDA/C++ ☆63 · Updated 2 years ago
- Decoding Attention is specially optimized for MHA, MQA, GQA, and MLA using CUDA cores for the decoding stage of LLM inference. ☆38 · Updated 2 weeks ago
- Code release for the book "Efficient Training in PyTorch" ☆69 · Updated 2 months ago
- Implement Flash Attention using CuTe. ☆87 · Updated 6 months ago
- Tutorials for writing high-performance GPU operators in AI frameworks. ☆130 · Updated last year
- Analyze AI problems with math and code. ☆17 · Updated 2 weeks ago
- ⚡️ Write HGEMM from scratch using Tensor Cores with the WMMA, MMA, and CuTe APIs, achieving peak performance. ⚡️ ☆82 · Updated last month
- ☆69 · Updated 7 months ago
- Code for Draft Attention ☆77 · Updated last month
- ☆135 · Updated last year
- CPU Memory Compiler and Parallel Programming ☆26 · Updated 7 months ago
- FastCache: Fast Caching for Diffusion Transformer Through Learnable Linear Approximation [Efficient ML Model] ☆28 · Updated 3 weeks ago
- An auxiliary project analyzing the characteristics of KV in DiT attention. ☆31 · Updated 7 months ago
- LLM inference with a deep learning accelerator. ☆44 · Updated 5 months ago
- Quantized attention on GPU ☆44 · Updated 7 months ago
- SGEMM optimization with CUDA, step by step ☆19 · Updated last year
- Triton documentation in Simplified Chinese / Triton 中文文档 ☆71 · Updated 2 months ago
- A lightweight llama-like LLM inference framework based on Triton kernels. ☆128 · Updated last week
- A llama model inference framework implemented in CUDA C++ ☆57 · Updated 7 months ago
- A minimalist and extensible PyTorch extension for implementing custom backend operators in PyTorch. ☆33 · Updated last year
- A GPU-optimized system for efficient long-context LLM decoding with a low-bit KV cache. ☆49 · Updated 2 weeks ago
- ☆170 · Updated last year
- ⚡️ FFPA: extends FlashAttention-2 with Split-D, achieving ~O(1) SRAM complexity for large head dims, 1.8x–3x speedup vs SDPA. 🎉 ☆187 · Updated last month
- LLM theoretical performance analysis tool supporting parameter, FLOPs, memory, and latency analysis. ☆96 · Updated 2 weeks ago
- Flash attention tutorial written in Python, Triton, CUDA, and CUTLASS ☆377 · Updated last month
- Convolution operator optimization on GPUs, including GEMM-based (implicit GEMM) convolution. ☆33 · Updated 6 months ago
- SpInfer: Leveraging Low-Level Sparsity for Efficient Large Language Model Inference on GPUs ☆48 · Updated 3 months ago
- PyTorch implementation of DeepSeek Native Sparse Attention ☆72 · Updated 3 months ago
- Code and examples for "CUDA - From Correctness to Performance" ☆100 · Updated 8 months ago
- Course materials for MIT 6.5940: TinyML and Efficient Deep Learning Computing ☆47 · Updated 5 months ago