nebius / kvax
A FlashAttention implementation for JAX with support for efficient document mask computation and context parallelism.
☆153 Updated last month
Alternatives and similar repositories for kvax
Users who are interested in kvax are comparing it to the libraries listed below.
- Minimal yet performant LLM examples in pure JAX ☆219 Updated 3 weeks ago
- torchax is a PyTorch frontend for JAX. It gives JAX the ability to author JAX programs using familiar PyTorch syntax. It also provides JA… ☆158 Updated last week
- seqax = sequence modeling + JAX ☆169 Updated 5 months ago
- JAX-Toolbox ☆369 Updated this week
- MoE training for Me and You and maybe other people ☆298 Updated last week
- A simple library for scaling up JAX programs ☆144 Updated last month
- 🧱 Modula software package ☆316 Updated 4 months ago
- ☆286 Updated last year
- Dion optimizer algorithm ☆409 Updated this week
- jax-triton contains integrations between JAX and OpenAI Triton ☆436 Updated 2 weeks ago
- Tokamax: A GPU and TPU kernel library. ☆142 Updated last week
- Implementation of Diffusion Transformer (DiT) in JAX ☆299 Updated last year
- Write a fast kernel and run it on Discord. See how you compare against the best! ☆64 Updated last week
- ☆92 Updated last year
- NanoGPT-speedrunning for the poor T4 enjoyers ☆73 Updated 8 months ago
- a Jax quantization library ☆79 Updated last week
- Accelerated First Order Parallel Associative Scan ☆193 Updated this week
- A zero-to-one guide on scaling modern transformers with n-dimensional parallelism. ☆105 Updated 3 months ago
- Custom triton kernels for training Karpathy's nanoGPT. ☆19 Updated last year
- Efficient optimizers ☆279 Updated last week
- FlashRNN - Fast RNN Kernels with I/O Awareness ☆173 Updated 2 months ago
- Minimal but scalable implementation of large language models in JAX ☆35 Updated last month
- ☆69 Updated last year
- supporting pytorch FSDP for optimizers ☆84 Updated last year
- Experiment of using Tangent to autodiff triton ☆81 Updated last year
- Fault tolerance for PyTorch (HSDP, LocalSGD, DiLoCo, Streaming DiLoCo) ☆462 Updated this week
- JAX implementation of the Mistral 7b v0.2 model ☆35 Updated last year
- An implementation of PSGD Kron second-order optimizer for PyTorch ☆97 Updated 5 months ago
- Load compute kernels from the Hub ☆352 Updated last week
- Attention Kernels for Symmetric Power Transformers ☆128 Updated 3 months ago