tspeterkim / mixed-precision-from-scratch

Mixed precision training from scratch with Tensors and CUDA

☆27

Alternatives and similar repositories for mixed-precision-from-scratch

Users who are interested in mixed-precision-from-scratch are comparing it to the libraries listed below.

PiotrNawrot / nano-sparse-attention
The simplest implementation of recent Sparse Attention patterns for efficient LLM inference.
☆89 Updated 2 months ago
gpu-mode / profiling-cuda-in-torch
☆173 Updated last year
gau-nernst / learn-cuda
Learn CUDA with PyTorch
☆85 Updated 2 weeks ago
PiotrNawrot / sparse-frontier
The evaluation framework for training-free sparse attention in LLMs
☆101 Updated 3 months ago
meta-pytorch / float8_experimental
This repository contains the experimental PyTorch native float8 training UX
☆223 Updated last year
gpu-mode / ring-attention
ring-attention experiments
☆153 Updated 11 months ago
huggingface / kernels
Load compute kernels from the Hub
☆299 Updated this week
Dao-AILab / grouped-latent-attention
☆129 Updated 4 months ago
mgmalek / efficient_cross_entropy
☆120 Updated last year
gpu-mode / triton-index
Cataloging released Triton kernels.
☆261 Updated last month
shawntan / scattermoe
Triton-based implementation of Sparse Mixture of Experts.
☆244 Updated last week
zinccat / Awesome-Triton-Kernels
Collection of kernels written in Triton language
☆157 Updated 6 months ago
mengxiayu / LLMSuperWeight
Code for studying the super weight in LLM
☆120 Updated 10 months ago
IST-DASLab / qutlass
QuTLASS: CUTLASS-Powered Quantized BLAS for Deep Learning
☆114 Updated last week
AnswerDotAI / cold-compress
Cold Compress is a hackable, lightweight, and open-source toolkit for creating and benchmarking cache compression methods built on top of…
☆147 Updated last year
srush / triton-autodiff
Experiment of using Tangent to autodiff triton
☆80 Updated last year
opengear-project / GEAR
GEAR: An Efficient KV Cache Compression Recipe for Near-Lossless Generative Inference of LLM
☆168 Updated last year
FasterDecoding / TEAL
☆143 Updated 7 months ago
mobiusml / gemlite
Fast low-bit matmul kernels in Triton
☆379 Updated 2 weeks ago
tspeterkim / paged-attention-minimal
a minimal cache manager for PagedAttention, on top of llama3.
☆123 Updated last year
siboehm / ShallowSpeed
Small scale distributed training of sequential deep learning models, built on Numpy and MPI.
☆144 Updated last year
open-lm-engine / flash-model-architectures
A bunch of kernels that might make stuff slower 😉
☆61 Updated this week
Deep-Learning-Profiling-Tools / triton-viz
☆242 Updated this week
meta-pytorch / applied-ai
Applied AI experiments and examples for PyTorch
☆296 Updated last month
nil0x9 / flash-muon
Flash-Muon: An Efficient Implementation of Muon Optimizer
☆193 Updated 3 months ago
kyegomez / FlashAttention20
Get down and dirty with FlashAttention2.0 in pytorch, plug in and play no complex CUDA kernels
☆108 Updated 2 years ago
stanford-futuredata / stk
☆112 Updated last year
vdesai2014 / inference-optimization-blog-post
☆89 Updated last year
IST-DASLab / Quartet
☆100 Updated last month
fw-ai / llama-cuda-graph-example
Example of applying CUDA graphs to LLaMA-v2
☆12 Updated 2 years ago