HazyResearch / flash-fft-conv
FlashFFTConv: Efficient Convolutions for Long Sequences with Tensor Cores
☆296 · Updated last month
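FlashFFTConv accelerates FFT-based long convolutions with fused tensor-core kernels. As a rough framework-level sketch of the underlying O(L log L) identity (not the repository's API; the function name and shapes here are illustrative assumptions):

```python
import torch

def fft_conv(u: torch.Tensor, k: torch.Tensor) -> torch.Tensor:
    """Long 1-D convolution via the FFT in O(L log L) time.

    u: (batch, channels, L) input sequence
    k: (channels, L) per-channel convolution kernel

    Illustrative reference only; FlashFFTConv fuses these steps into
    tensor-core kernels rather than calling torch.fft directly.
    """
    L = u.shape[-1]
    n = 2 * L  # zero-pad so circular convolution matches linear convolution
    u_f = torch.fft.rfft(u, n=n)
    k_f = torch.fft.rfft(k, n=n)
    return torch.fft.irfft(u_f * k_f, n=n)[..., :L]
```

The repository's contribution is fusing the FFT, pointwise multiply, and inverse FFT into tensor-core kernels rather than dispatching them as separate ops.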
Alternatives and similar repositories for flash-fft-conv:
Users interested in flash-fft-conv are comparing it to the libraries listed below.
- This repository contains the experimental PyTorch native float8 training UX ☆221 · Updated 6 months ago
- Accelerated First Order Parallel Associative Scan ☆171 · Updated 6 months ago
- Helpful tools and examples for working with flex-attention ☆647 · Updated this week
- Code for exploring Based models from "Simple linear attention language models balance the recall-throughput tradeoff" ☆221 · Updated this week
- Fast Hadamard transform in CUDA, with a PyTorch interface (see the sketch after this list) ☆143 · Updated 8 months ago
- Implementation of a memory-efficient multi-head attention as proposed in the paper "Self-attention Does Not Need O(n²) Memory" (see the sketch after this list) ☆370 · Updated last year
- A subset of PyTorch's neural network modules, written in Python using OpenAI's Triton. ☆514 · Updated this week
- Muon optimizer: +~30% sample efficiency with <3% wallclock overhead ☆254 · Updated last week
- Repo for "LoLCATs: On Low-Rank Linearizing of Large Language Models" ☆215 · Updated 3 weeks ago
- Implementation of 💍 Ring Attention, from Liu et al. at Berkeley AI, in PyTorch ☆502 · Updated 3 months ago
- A repository for log-time feedforward networks ☆219 · Updated 10 months ago
- Annotated version of the Mamba paper ☆473 · Updated 11 months ago
- A library for unit scaling in PyTorch ☆122 · Updated 2 months ago
- The AdEMAMix Optimizer: Better, Faster, Older. ☆178 · Updated 5 months ago
- Understand and test language model architectures on synthetic tasks. ☆181 · Updated last month
- Cataloging released Triton kernels. ☆168 · Updated last month
- Collection of kernels written in the Triton language ☆105 · Updated this week
- When it comes to optimizers, it's always better to be safe than sorry ☆179 · Updated 3 weeks ago
- Implementation of fused cosine similarity attention in the same style as Flash Attention ☆210 · Updated 2 years ago
- CIFAR-10 speedruns: 94% in 2.6 seconds and 96% in 27 seconds ☆205 · Updated this week
- FlashRNN - Fast RNN Kernels with I/O Awareness ☆75 · Updated 2 months ago
- Fast Matrix Multiplications for Lookup Table-Quantized LLMs ☆229 · Updated this week
- Quick implementation of nGPT, learning entirely on the hypersphere, from NvidiaAI ☆273 · Updated 3 months ago
- Explorations into the recently proposed Taylor Series Linear Attention ☆93 · Updated 6 months ago
- Some preliminary explorations of Mamba's context scaling. ☆213 · Updated last year
- Fast low-bit matmul kernels in Triton ☆238 · Updated this week
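For the fast Hadamard transform entry above, the butterfly recurrence being accelerated fits in a few lines of pure PyTorch. This is a minimal O(n log n) reference, not the repository's CUDA kernel, and the function name is an assumption:

```python
import torch

def hadamard_transform(x: torch.Tensor) -> torch.Tensor:
    """Unnormalized Walsh-Hadamard transform over the last dimension.

    Pure-PyTorch reference for the O(n log n) butterfly; the linked
    repository implements the same recurrence as a fused CUDA kernel.
    """
    n = x.shape[-1]
    assert n & (n - 1) == 0, "length must be a power of two"
    shape = x.shape
    y = x.reshape(-1, n)
    h = 1
    while h < n:
        # Pair up blocks of size h and apply the (a + b, a - b) butterfly.
        y = y.view(-1, n // (2 * h), 2, h)
        a, b = y[..., 0, :], y[..., 1, :]
        y = torch.stack((a + b, a - b), dim=-2).reshape(-1, n)
        h *= 2
    return y.view(shape)
```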
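And for the memory-efficient attention entry, the paper's core idea is an online softmax accumulated over key/value chunks, so the full n×n score matrix is never materialized. A minimal single-head sketch under an assumed (n, d) layout, with hypothetical names:

```python
import torch

def chunked_attention(q, k, v, chunk_size=1024):
    """Single-head attention with O(n * chunk_size) score memory.

    Keeps a running row max `m` and denominator `den` per query so the
    softmax can be accumulated one key/value chunk at a time.
    q, k, v: (n, d) tensors.
    """
    scale = q.shape[-1] ** -0.5
    m = q.new_full((q.shape[0], 1), float("-inf"))  # running row max
    den = q.new_zeros(q.shape[0], 1)                # running softmax denominator
    out = torch.zeros_like(q)                       # unnormalized output
    for i in range(0, k.shape[0], chunk_size):
        s = (q @ k[i:i + chunk_size].T) * scale     # scores for this chunk only
        m_new = torch.maximum(m, s.amax(dim=-1, keepdim=True))
        p = torch.exp(s - m_new)
        correction = torch.exp(m - m_new)           # rescale old accumulators
        den = den * correction + p.sum(dim=-1, keepdim=True)
        out = out * correction + p @ v[i:i + chunk_size]
        m = m_new
    return out / den
```

For small inputs the result should match `torch.softmax(q @ k.T * q.shape[-1] ** -0.5, dim=-1) @ v` up to floating-point error.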