tspeterkim / mixed-precision-from-scratch
Mixed precision training from scratch with Tensors and CUDA
☆28 · Updated last year
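As background for the repository topic: the core trick in mixed precision training is loss scaling. Gradients that are representable in fp32 can underflow to zero when cast to fp16, so the loss (and hence every gradient) is multiplied by a scale factor before the backward pass, and gradients are divided by the same factor in fp32 before the master-weight update. A minimal NumPy sketch of why this works (illustrative values, not code from the repository):

```python
import numpy as np

# A gradient magnitude fp32 can hold but fp16 cannot:
# the smallest fp16 subnormal is 2**-24 ~= 6e-8.
grad_fp32 = np.float32(1e-8)

# Without loss scaling, the cast to half precision underflows to zero.
assert np.float16(grad_fp32) == 0.0

# With loss scaling: multiply by a large power of two before the cast,
# then unscale in fp32 before the optimizer step.
scale = np.float32(2.0 ** 14)
scaled_grad_fp16 = np.float16(grad_fp32 * scale)
recovered = np.float32(scaled_grad_fp16) / scale

# The gradient survives the fp16 round trip to within fp16 rounding error.
assert abs(recovered - grad_fp32) / grad_fp32 < 0.01
```

Powers of two are used for the scale so that scaling and unscaling change only the exponent, leaving the mantissa (and thus the value's precision) untouched.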
Alternatives and similar repositories for mixed-precision-from-scratch
Users interested in mixed-precision-from-scratch are comparing it to the libraries listed below.
- ring-attention experiments ☆165 · Updated last year
- The simplest implementation of recent Sparse Attention patterns for efficient LLM inference. ☆92 · Updated 6 months ago
- ☆177 · Updated 2 years ago
- ☆131 · Updated 8 months ago
- Load compute kernels from the Hub ☆397 · Updated this week
- This repository contains the experimental PyTorch native float8 training UX ☆226 · Updated last year
- Cataloging released Triton kernels. ☆292 · Updated 5 months ago
- The evaluation framework for training-free sparse attention in LLMs ☆117 · Updated 2 weeks ago
- a minimal cache manager for PagedAttention, on top of llama3. ☆135 · Updated last year
- Boosting 4-bit inference kernels with 2:4 Sparsity ☆93 · Updated last year
- Cold Compress is a hackable, lightweight, and open-source toolkit for creating and benchmarking cache compression methods built on top of… ☆148 · Updated last year
- ☆119 · Updated last month
- [ICLR2025] Breaking Throughput-Latency Trade-off for Long Sequences with Speculative Decoding ☆142 · Updated last year
- Code for studying the super weight in LLM ☆121 · Updated last year
- ☆288 · Updated this week
- ☆124 · Updated last year
- Applied AI experiments and examples for PyTorch ☆315 · Updated 5 months ago
- ☆61 · Updated 2 years ago
- Learn CUDA with PyTorch ☆200 · Updated this week
- Small scale distributed training of sequential deep learning models, built on Numpy and MPI. ☆155 · Updated 2 years ago
- Tree Attention: Topology-aware Decoding for Long-Context Attention on GPU clusters ☆131 · Updated last year
- GEAR: An Efficient KV Cache Compression Recipe for Near-Lossless Generative Inference of LLM ☆176 · Updated last year
- A bunch of kernels that might make stuff slower 😉 ☆75 · Updated last week
- Collection of kernels written in Triton language ☆178 · Updated 2 weeks ago
- Fast low-bit matmul kernels in Triton ☆429 · Updated last week
- QUICK: Quantization-aware Interleaving and Conflict-free Kernel for efficient LLM inference ☆120 · Updated last year
- A repository to unravel the language of GPUs, making their kernel conversations easy to understand ☆198 · Updated 8 months ago
- ☆158 · Updated 11 months ago
- PyTorch bindings for CUTLASS grouped GEMM. ☆143 · Updated 8 months ago
- Framework to reduce autotune overhead to zero for well known deployments. ☆96 · Updated 4 months ago