proger / accelerated-scan
Accelerated First Order Parallel Associative Scan
☆169 · Updated 4 months ago
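
A minimal sketch of the operation this repository accelerates: the first-order linear recurrence h[t] = a[t] * h[t-1] + x[t], expressed as a parallel associative scan. The sketch uses `jax.lax.associative_scan` for brevity; it illustrates the math only, not this repository's own Triton/CUDA kernels or API, and the names and shapes are illustrative.

```python
# Minimal sketch of a first-order parallel associative scan (illustrative;
# not the accelerated-scan repository's kernels or API).
import jax
import jax.numpy as jnp

def combine(left, right):
    # Each element (a, x) encodes the affine map h -> a * h + x.
    # Composing left-then-right yields (a_r * a_l, a_r * x_l + x_r),
    # which is associative, so the scan parallelizes to O(log n) depth.
    a_l, x_l = left
    a_r, x_r = right
    return a_r * a_l, a_r * x_l + x_r

def linear_scan(a, x):
    # a, x: arrays with the sequence on the leading axis.
    # Returns all prefix states h[t] of the recurrence.
    _, h = jax.lax.associative_scan(combine, (a, x))
    return h

# Quick check against the sequential definition.
a = jnp.array([0.9, 0.8, 0.7, 0.6])
x = jnp.array([1.0, 2.0, 3.0, 4.0])
state, h_ref = 0.0, []
for a_t, x_t in zip(a.tolist(), x.tolist()):
    state = a_t * state + x_t
    h_ref.append(state)
assert jnp.allclose(linear_scan(a, x), jnp.array(h_ref))
```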
Alternatives and similar repositories for accelerated-scan:
Users interested in accelerated-scan are comparing it to the libraries listed below.
- A library for unit scaling in PyTorch · ☆118 · Updated last month
- Experiment of using Tangent to autodiff triton · ☆74 · Updated 11 months ago
- A MAD laboratory to improve AI architecture designs 🧪 · ☆102 · Updated last month
- ☆146 · Updated last month
- LoRA for arbitrary JAX models and functions · ☆135 · Updated 10 months ago
- Understand and test language model architectures on synthetic tasks. · ☆175 · Updated this week
- JAX bindings for Flash Attention v2 · ☆83 · Updated 6 months ago
- Efficient optimizers · ☆144 · Updated this week
- Code for exploring Based models from "Simple linear attention language models balance the recall-throughput tradeoff" · ☆219 · Updated last month
- A simple library for scaling up JAX programs · ☆129 · Updated 2 months ago
- This repository contains the experimental PyTorch native float8 training UX · ☆219 · Updated 5 months ago
- ☆135 · Updated last year
- PyTorch FSDP support for optimizers · ☆75 · Updated last month
- ☆83 · Updated 7 months ago
- Triton-based implementation of Sparse Mixture of Experts. · ☆192 · Updated last month
- ☆201 · Updated 6 months ago
- FlashRNN - Fast RNN Kernels with I/O Awareness · ☆69 · Updated last month
- FlashFFTConv: Efficient Convolutions for Long Sequences with Tensor Cores · ☆292 · Updated 2 weeks ago
- 🧱 Modula software package · ☆132 · Updated this week
- seqax = sequence modeling + JAX · ☆136 · Updated 6 months ago
- ☆75 · Updated 6 months ago
- Implementation of the GateLoop Transformer in PyTorch and JAX · ☆87 · Updated 6 months ago
- ☆50 · Updated 3 months ago
- Implementation of Flash Attention in JAX · ☆204 · Updated 10 months ago
- Normalized Transformer (nGPT) · ☆145 · Updated last month
- A simple but fast implementation of matrix multiplication in CUDA · ☆34 · Updated 5 months ago
- ☆51 · Updated 7 months ago
- ☆275 · Updated this week
- Muon optimizer for neural networks: >30% extra sample efficiency, <3% wallclock overhead · ☆210 · Updated last week
- nanoGPT-like codebase for LLM training · ☆83 · Updated this week