aryagxr / cudaLinks
coding CUDA everyday!
☆56 · Updated 4 months ago
Alternatives and similar repositories for cuda
Users that are interested in cuda are comparing it to the libraries listed below
- A curated list of resources for learning and exploring Triton, OpenAI's programming language for writing efficient GPU code. ☆391 · Updated 5 months ago
- A curated collection of resources, tutorials, and best practices for learning and mastering NVIDIA CUTLASS ☆211 · Updated 3 months ago
- PTX-Tutorial Written Purely By AIs (Deep Research of Openai and Claude 3.7) ☆66 · Updated 5 months ago
- A repository to unravel the language of GPUs, making their kernel conversations easy to understand ☆190 · Updated 2 months ago
- Learning about CUDA by writing PTX code. ☆134 · Updated last year
- making the official triton tutorials actually comprehensible ☆53 · Updated last month
- ☆192 · Updated 7 months ago
- FlexAttention based, minimal vllm-style inference engine for fast Gemma 2 inference. ☆250 · Updated 2 weeks ago
- GPU Kernels ☆193 · Updated 3 months ago
- ☆49 · Updated 7 months ago
- Learnings and programs related to CUDA ☆415 · Updated last month
- ☆163 · Updated last year
- Official Problem Sets / Reference Kernels for the GPU MODE Leaderboard! ☆74 · Updated this week
- My submission for the GPUMODE/AMD fp8 mm challenge ☆27 · Updated 2 months ago
- An implementation of the transformer architecture onto an Nvidia CUDA kernel ☆189 · Updated last year
- Cataloging released Triton kernels. ☆252 · Updated 7 months ago
- Efficient implementation of DeepSeek Ops (Blockwise FP8 GEMM, MoE, and MLA) for AMD Instinct MI300X ☆64 · Updated 3 weeks ago
- ☆39 · Updated 5 months ago
- in this repository, i'm going to implement increasingly complex llm inference optimizations ☆64 · Updated 3 months ago
- Efficient LLM Inference over Long Sequences ☆389 · Updated 2 months ago
- ☆362 · Updated 4 months ago
- Fast low-bit matmul kernels in Triton ☆353 · Updated last week
- Learn CUDA with PyTorch ☆35 · Updated last week
- ring-attention experiments ☆149 · Updated 10 months ago
- An extension of the nanoGPT repository for training small MOE models. ☆178 · Updated 5 months ago
- ☆237 · Updated 2 months ago
- learning & making kernels in cuda / triton ☆21 · Updated 2 months ago
- Flash-Muon: An Efficient Implementation of Muon Optimizer ☆160 · Updated 2 months ago
- ☆211 · Updated 6 months ago
- Load compute kernels from the Hub ☆244 · Updated this week