meta-pytorch / float8_experimental

This repository contains the experimental PyTorch native float8 training UX

☆225

Alternatives and similar repositories for float8_experimental

Users who are interested in float8_experimental are comparing it to the libraries listed below.

meta-pytorch / applied-ai
Applied AI experiments and examples for PyTorch
☆305 Updated 2 months ago
foundation-model-stack / foundation-model-stack
🚀 Collection of components for development, training, tuning, and inference of foundation models leveraging PyTorch native components.
☆216 Updated last week
dropbox / gemlite
Fast low-bit matmul kernels in Triton
☆395 Updated 3 weeks ago
foundation-model-stack / fms-fsdp
🚀 Efficiently (pre)training foundation models with native PyTorch features, including FSDP for training and SDPA implementation of Flash…
☆272 Updated 2 weeks ago
gpu-mode / ring-attention
ring-attention experiments
☆155 Updated last year
jundaf2 / INT8-Flash-Attention-FMHA-Quantization
☆158 Updated 2 years ago
cchan / tccl
extensible collectives library in triton
☆91 Updated 7 months ago
Deep-Learning-Profiling-Tools / triton-viz
☆247 Updated last week
shawntan / scattermoe
Triton-based implementation of Sparse Mixture of Experts.
☆248 Updated last month
meta-pytorch / tritonbench
Tritonbench is a collection of PyTorch custom operators with example inputs to measure their performance.
☆286 Updated this week
zinccat / Awesome-Triton-Kernels
Collection of kernels written in Triton language
☆167 Updated 7 months ago
gpu-mode / triton-index
Cataloging released Triton kernels.
☆267 Updated 2 months ago
google / aqt
☆337 Updated 2 weeks ago
stanford-futuredata / stk
☆113 Updated last year
graphcore-research / unit-scaling
A library for unit scaling in PyTorch
☆132 Updated 4 months ago
tgale96 / grouped_gemm
PyTorch bindings for CUTLASS grouped GEMM.
☆130 Updated 5 months ago
mgmalek / efficient_cross_entropy
☆121 Updated last year
open-lm-engine / accelerated-model-architectures
A bunch of kernels that might make stuff slower 😉
☆64 Updated this week
BobMcDear / attorch
A subset of PyTorch's neural network modules, written in Python using OpenAI's Triton.
☆582 Updated 3 months ago
huggingface / kernels
Load compute kernels from the Hub
☆327 Updated last week
lucidrains / triton-transformer
Implementation of a Transformer, but completely in Triton
☆276 Updated 3 years ago
meta-pytorch / kraken
Triton-based Symmetric Memory operators and examples
☆63 Updated last month
usyd-fsalab / fp6_llm
Efficient GPU support for LLM inference with x-bit quantization (e.g., FP6, FP5).
☆272 Updated 4 months ago
Dao-AILab / fast-hadamard-transform
Fast Hadamard transform in CUDA, with a PyTorch interface
☆257 Updated last month
Dao-AILab / quack
A Quirky Assortment of CuTe Kernels
☆660 Updated 3 weeks ago
SqueezeBits / QUICK
QUICK: Quantization-aware Interleaving and Conflict-free Kernel for efficient LLM inference
☆118 Updated last year
epfml / dynamic-sparse-flash-attention
☆150 Updated 2 years ago
IST-DASLab / Sparse-Marlin
Boosting 4-bit inference kernels with 2:4 Sparsity
☆85 Updated last year
IST-DASLab / qutlass
QuTLASS: CUTLASS-Powered Quantized BLAS for Deep Learning
☆134 Updated last week
RulinShao / LightSeq
Official repository for DistFlashAttn: Distributed Memory-efficient Attention for Long-context LLMs Training
☆217 Updated last year