VikParuchuri / triton_tutorial
Tutorials for Triton, a language for writing GPU kernels
☆34 · Updated 2 years ago
Alternatives and similar repositories for triton_tutorial
Users interested in triton_tutorial are comparing it to the repositories listed below.
- A repository to unravel the language of GPUs, making their kernel conversations easy to understand ☆190 · Updated 2 months ago
- Load compute kernels from the Hub ☆244 · Updated this week
- ☆162 · Updated last year
- Learn CUDA with PyTorch ☆35 · Updated last month
- ☆118 · Updated last year
- Experiment of using Tangent to autodiff triton ☆80 · Updated last year
- Code for studying the super weight in LLM ☆115 · Updated 8 months ago
- ☆87 · Updated last year
- The evaluation framework for training-free sparse attention in LLMs ☆90 · Updated 2 months ago
- The simplest implementation of recent Sparse Attention patterns for efficient LLM inference. ☆84 · Updated last month
- A curated collection of resources, tutorials, and best practices for learning and mastering NVIDIA CUTLASS ☆211 · Updated 3 months ago
- An extension of the nanoGPT repository for training small MOE models. ☆178 · Updated 5 months ago
- Implementations of attention with the softpick function, naive and FlashAttention-2 ☆82 · Updated 3 months ago
- ring-attention experiments ☆149 · Updated 10 months ago
- Flash-Muon: An Efficient Implementation of Muon Optimizer ☆160 · Updated 2 months ago
- 🚀 Efficiently (pre)training foundation models with native PyTorch features, including FSDP for training and SDPA implementation of Flash… ☆260 · Updated 3 weeks ago
- ☆237 · Updated 2 months ago
- ☆192 · Updated 7 months ago
- Code for "LayerSkip: Enabling Early Exit Inference and Self-Speculative Decoding", ACL 2024 ☆327 · Updated 3 months ago
- PTX-Tutorial Written Purely By AIs (Deep Research of Openai and Claude 3.7) ☆66 · Updated 5 months ago
- A bunch of kernels that might make stuff slower 😉 ☆58 · Updated this week
- Repo for "LoLCATs: On Low-Rank Linearizing of Large Language Models" ☆244 · Updated 6 months ago
- Accelerated First Order Parallel Associative Scan ☆187 · Updated last year
- Mixed precision training from scratch with Tensors and CUDA ☆24 · Updated last year
- ☆81 · Updated last year
- FlexAttention based, minimal vllm-style inference engine for fast Gemma 2 inference. ☆250 · Updated 2 weeks ago
- Small scale distributed training of sequential deep learning models, built on Numpy and MPI. ☆137 · Updated last year
- This repository contains the experimental PyTorch native float8 training UX ☆224 · Updated last year
- Understand and test language model architectures on synthetic tasks. ☆221 · Updated last month
- Explorations into the recently proposed Taylor Series Linear Attention ☆100 · Updated last year