meta-pytorch / float8_experimental
This repository contains the experimental PyTorch native float8 training UX
☆223 · Updated last year
Alternatives and similar repositories for float8_experimental
Users interested in float8_experimental are comparing it to the libraries listed below.
- Applied AI experiments and examples for PyTorch ☆301 · Updated 2 months ago
- Fast low-bit matmul kernels in Triton ☆388 · Updated this week
- ☆158 · Updated 2 years ago
- 🚀 Collection of components for development, training, tuning, and inference of foundation models leveraging PyTorch native components. ☆215 · Updated last week
- Triton-based implementation of Sparse Mixture of Experts. ☆246 · Updated 3 weeks ago
- 🚀 Efficiently (pre)training foundation models with native PyTorch features, including FSDP for training and SDPA implementation of Flash… ☆270 · Updated 3 months ago
- Tritonbench is a collection of PyTorch custom operators with example inputs to measure their performance. ☆272 · Updated this week
- ☆246 · Updated this week
- Ring-attention experiments ☆155 · Updated last year
- ☆112 · Updated last year
- Collection of kernels written in the Triton language ☆159 · Updated 6 months ago
- Extensible collectives library in Triton ☆90 · Updated 7 months ago
- A subset of PyTorch's neural network modules, written in Python using OpenAI's Triton. ☆580 · Updated 2 months ago
- Cataloging released Triton kernels. ☆263 · Updated last month
- A library for unit scaling in PyTorch ☆132 · Updated 3 months ago
- ☆121 · Updated last year
- PyTorch bindings for CUTLASS grouped GEMM. ☆125 · Updated 5 months ago
- A bunch of kernels that might make stuff slower 😉 ☆63 · Updated this week
- Fast Hadamard transform in CUDA, with a PyTorch interface ☆253 · Updated last week
- Load compute kernels from the Hub ☆308 · Updated this week
- A Quirky Assortment of CuTe Kernels ☆645 · Updated this week
- ☆149 · Updated 2 years ago
- ☆335 · Updated last month
- Efficient GPU support for LLM inference with x-bit quantization (e.g. FP6, FP5). ☆266 · Updated 3 months ago
- QuTLASS: CUTLASS-powered quantized BLAS for deep learning ☆120 · Updated last week
- Triton-based Symmetric Memory operators and examples ☆58 · Updated 2 weeks ago
- QUICK: Quantization-aware Interleaving and Conflict-free Kernel for efficient LLM inference ☆118 · Updated last year
- A Python-embedded DSL that makes it easy to write fast, scalable ML kernels with minimal boilerplate. ☆543 · Updated this week
- Boosting 4-bit inference kernels with 2:4 sparsity ☆84 · Updated last year
- Implementation of a Transformer, but completely in Triton ☆276 · Updated 3 years ago