BobMcDear / attorch

A subset of PyTorch's neural network modules, written in Python using OpenAI's Triton.

☆580

Alternatives and similar repositories for attorch

Users who are interested in attorch are comparing it to the libraries listed below.

gpu-mode / triton-index
Cataloging released Triton kernels.
☆263 Updated last month
Deep-Learning-Profiling-Tools / triton-viz
☆242 Updated this week
meta-pytorch / float8_experimental
This repository contains the experimental PyTorch native float8 training UX
☆223 Updated last year
meta-pytorch / applied-ai
Applied AI experiments and examples for PyTorch
☆301 Updated 2 months ago
Dao-AILab / quack
A Quirky Assortment of CuTe Kernels
☆637 Updated 2 weeks ago
dropbox / gemlite
Fast low-bit matmul kernels in Triton
☆385 Updated last week
gpu-mode / profiling-cuda-in-torch
☆174 Updated last year
pytorch / helion
A Python-embedded DSL that makes it easy to write fast, scalable ML kernels with minimal boilerplate.
☆491 Updated this week
pytorch / PiPPy
Pipeline Parallelism for PyTorch
☆780 Updated last year
meta-pytorch / attention-gym
Helpful tools and examples for working with flex-attention
☆1,029 Updated last week
lucidrains / ring-attention-pytorch
Implementation of 💍 Ring Attention, from Liu et al. at Berkeley AI, in Pytorch
☆542 Updated 5 months ago
google / aqt
☆335 Updated last month
meta-pytorch / tritonbench
Tritonbench is a collection of PyTorch custom operators with example inputs to measure their performance.
☆264 Updated last week
MekkCyber / CutlassAcademy
A curated collection of resources, tutorials, and best practices for learning and mastering NVIDIA CUTLASS
☆233 Updated 5 months ago
meta-pytorch / torchft
Fault tolerance for PyTorch (HSDP, LocalSGD, DiLoCo, Streaming DiLoCo)
☆436 Updated last week
foundation-model-stack / fms-fsdp
🚀 Efficiently (pre)training foundation models with native PyTorch features, including FSDP for training and SDPA implementation of Flash…
☆270 Updated 3 months ago
tspeterkim / flash-attention-minimal
Flash Attention in ~100 lines of CUDA (forward pass only)
☆953 Updated 10 months ago
siboehm / ShallowSpeed
Small scale distributed training of sequential deep learning models, built on Numpy and MPI.
☆146 Updated 2 years ago
foundation-model-stack / foundation-model-stack
🚀 Collection of components for development, training, tuning, and inference of foundation models leveraging PyTorch native components.
☆215 Updated last week
HazyResearch / flash-fft-conv
FlashFFTConv: Efficient Convolutions for Long Sequences with Tensor Cores
☆329 Updated 10 months ago
zinccat / Awesome-Triton-Kernels
Collection of kernels written in Triton language
☆159 Updated 6 months ago
HazyResearch / Megakernels
kernels, of the mega variety
☆587 Updated last month
pranjalssh / fast.cu
Fastest kernels written from scratch
☆377 Updated last month
ScalingIntelligence / KernelBench
KernelBench: Can LLMs Write GPU Kernels? - Benchmark with Torch -> CUDA problems
☆632 Updated last week
huggingface / kernels
Load compute kernels from the Hub
☆308 Updated this week
NVIDIA / kvpress
LLM KV cache compression made easy
☆669 Updated last week
srush / annotated-mamba
Annotated version of the Mamba paper
☆490 Updated last year
lucidrains / triton-transformer
Implementation of a Transformer, but completely in Triton
☆276 Updated 3 years ago
hidet-org / hidet
An open-source efficient deep learning framework/compiler, written in python.
☆732 Updated last month
haoliuhl / ringattention
Large Context Attention
☆746 Updated 2 weeks ago