tspeterkim / mixed-precision-from-scratch
Mixed precision training from scratch with Tensors and CUDA
☆28 · updated last year
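For context on what this repo covers, here is a minimal illustrative sketch of the standard mixed-precision recipe: fp32 master weights, an fp16 forward/backward pass, and loss scaling to keep small gradients from underflowing. This is a plain-PyTorch sketch with assumed shapes and hyperparameters, not the repo's actual code (which builds these steps from raw Tensors and CUDA).

```python
import torch

# Assumes a CUDA device, as in the repo; falls back to CPU if unavailable.
device = "cuda" if torch.cuda.is_available() else "cpu"

# fp32 "master" weights hold the authoritative parameter values.
w_master = torch.randn(512, 512, device=device)
x = torch.randn(64, 512, device=device)
target = torch.randn(64, 512, device=device)

lr, loss_scale = 1e-3, 1024.0  # hypothetical static loss scale

for step in range(10):
    # 1. Cast weights to fp16 for the compute-heavy forward/backward pass.
    w_half = w_master.half().requires_grad_(True)
    out = (x.half() @ w_half).float()
    loss = torch.nn.functional.mse_loss(out, target)

    # 2. Scale the loss so small fp16 gradients don't underflow to zero.
    (loss * loss_scale).backward()

    # 3. Unscale the gradient in fp32 and update the fp32 master weights.
    grad = w_half.grad.float() / loss_scale
    with torch.no_grad():
        w_master -= lr * grad
```

Production implementations (e.g. `torch.amp.autocast` with `GradScaler`) add dynamic loss scaling and per-op precision policies on top of this same core loop.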
Alternatives and similar repositories for mixed-precision-from-scratch
Users interested in mixed-precision-from-scratch are comparing it to the repositories listed below.
- ring-attention experiments · ☆155 · updated last year
- Learn CUDA with PyTorch · ☆95 · updated last month
- The simplest implementation of recent Sparse Attention patterns for efficient LLM inference · ☆92 · updated 3 months ago
- Load compute kernels from the Hub · ☆308 · updated last week
- Cataloging released Triton kernels · ☆264 · updated last month
- ☆174 · updated last year
- QuTLASS: CUTLASS-Powered Quantized BLAS for Deep Learning · ☆120 · updated last week
- This repository contains the experimental PyTorch native float8 training UX · ☆223 · updated last year
- ☆246 · updated this week
- The evaluation framework for training-free sparse attention in LLMs · ☆102 · updated 3 weeks ago
- ☆130 · updated 5 months ago
- A minimal cache manager for PagedAttention, on top of llama3 · ☆125 · updated last year
- Collection of kernels written in the Triton language · ☆159 · updated 6 months ago
- Code for studying the super weight in LLMs · ☆119 · updated 11 months ago
- A bunch of kernels that might make stuff slower 😉 · ☆63 · updated last week
- QUICK: Quantization-aware Interleaving and Conflict-free Kernel for efficient LLM inference · ☆118 · updated last year
- Cold Compress is a hackable, lightweight, and open-source toolkit for creating and benchmarking cache compression methods built on top of… · ☆147 · updated last year
- Triton-based Symmetric Memory operators and examples · ☆58 · updated 2 weeks ago
- ☆121 · updated last year
- Applied AI experiments and examples for PyTorch · ☆301 · updated 2 months ago
- Fast low-bit matmul kernels in Triton · ☆388 · updated last week
- ☆27 · updated last year
- Automatic differentiation for Triton kernels · ☆13 · updated 2 months ago
- How to ensure correctness and ship LLM-generated kernels in PyTorch · ☆111 · updated this week
- A curated collection of resources, tutorials, and best practices for learning and mastering NVIDIA CUTLASS · ☆233 · updated 5 months ago
- Boosting 4-bit inference kernels with 2:4 sparsity · ☆85 · updated last year
- A minimal implementation of vllm · ☆60 · updated last year
- Write a fast kernel and run it on Discord. See how you compare against the best! · ☆58 · updated 3 weeks ago
- Flash-Muon: An Efficient Implementation of Muon Optimizer · ☆202 · updated 4 months ago
- CUDA and Triton implementations of Flash Attention with SoftmaxN · ☆73 · updated last year