meta-pytorch / float8_experimental
This repository contains the experimental PyTorch native float8 training UX.
★224 · Updated last year
Alternatives and similar repositories for float8_experimental
Users interested in float8_experimental are comparing it to the libraries listed below:
- Applied AI experiments and examples for PyTorch · ★294 · Updated 3 weeks ago
- Collection of components for development, training, tuning, and inference of foundation models leveraging PyTorch native components. · ★212 · Updated this week
- Efficiently (pre)training foundation models with native PyTorch features, including FSDP for training and SDPA implementation of Flash… · ★265 · Updated last month
- ★159 · Updated 2 years ago
- Fast low-bit matmul kernels in Triton · ★371 · Updated last week
- Triton-based implementation of Sparse Mixture of Experts. · ★239 · Updated 3 weeks ago
- ★237 · Updated this week
- Extensible collectives library in Triton · ★87 · Updated 5 months ago
- Collection of kernels written in the Triton language · ★154 · Updated 5 months ago
- ★330 · Updated last week
- ring-attention experiments · ★152 · Updated 11 months ago
- Tritonbench is a collection of PyTorch custom operators with example inputs to measure their performance. · ★223 · Updated this week
- Cataloging released Triton kernels. · ★257 · Updated last week
- ★111 · Updated last year
- A Python-embedded DSL that makes it easy to write fast, scalable ML kernels with minimal boilerplate. · ★307 · Updated this week
- A library for unit scaling in PyTorch · ★130 · Updated 2 months ago
- PyTorch bindings for CUTLASS grouped GEMM. · ★119 · Updated 3 months ago
- A subset of PyTorch's neural network modules, written in Python using OpenAI's Triton. · ★576 · Updated last month
- A bunch of kernels that might make stuff slower · ★59 · Updated this week
- Fast Hadamard transform in CUDA, with a PyTorch interface · ★233 · Updated 2 weeks ago
- ★118 · Updated last year
- A Quirky Assortment of CuTe Kernels · ★557 · Updated this week
- Load compute kernels from the Hub · ★283 · Updated this week
- QUICK: Quantization-aware Interleaving and Conflict-free Kernel for efficient LLM inference · ★118 · Updated last year
- An efficient GPU support for LLM inference with x-bit quantization (e.g. FP6, FP5). · ★265 · Updated 2 months ago
- Official repository for DistFlashAttn: Distributed Memory-efficient Attention for Long-context LLMs Training · ★216 · Updated last year
- ★149 · Updated 2 years ago
- A safetensors extension to efficiently store sparse quantized tensors on disk · ★161 · Updated this week
- Kernels, of the mega variety · ★496 · Updated 3 months ago
- Implementation of Ring Attention, from Liu et al. at Berkeley AI, in PyTorch · ★538 · Updated 4 months ago