ZixuanJiang / pre-rmsnorm-transformer
☆ 21 · updated last year
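
For context, the RMSNorm in the repo's name normalizes activations by their root-mean-square instead of subtracting a mean and adding a bias as LayerNorm does. A minimal PyTorch sketch (an illustrative module, not code from this repository):

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Root-mean-square layer normalization (Zhang & Sennrich, 2019).

    Unlike LayerNorm, there is no mean subtraction and no bias:
    x is scaled by 1 / sqrt(mean(x^2) + eps) and a learned gain.
    """

    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Inverse RMS over the feature dimension, then rescale.
        rms_inv = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        return x * rms_inv * self.weight
```

Skipping the mean statistics makes RMSNorm slightly cheaper to compute than LayerNorm.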
Related projects:
- Demo of the unit_scaling library, showing how a model can be easily adapted to train in FP8. (☆ 34 · updated 2 months ago)
- Simple and fast low-bit matmul kernels in CUDA. (☆ 48 · updated this week)
- Fast training of unitary deep network layers from low-rank updates. (☆ 28 · updated last year)
- Experiment in using Tangent to autodiff Triton. (☆ 66 · updated 7 months ago)
- Personal solutions to the Triton Puzzles. (☆ 11 · updated 2 months ago)
- Official repository for Sparse ISO-FLOP Transformations for Maximizing Training Efficiency. (☆ 23 · updated last month)
- ML/DL math and method notes. (☆ 56 · updated 9 months ago)
- Sparsity support for PyTorch. (☆ 31 · updated 11 months ago)
- Butterfly matrix multiplication in PyTorch. (☆ 160 · updated 11 months ago)
- Code for the papers "Linear Algebra with Transformers" (TMLR) and "What is my Math Transformer Doing?" (AI for Maths Workshop, NeurIPS 2022). (☆ 61 · updated last month)
- This repository contains code for the MicroAdam paper. (☆ 9 · updated 2 months ago)
- A MAD laboratory to improve AI architecture designs 🧪 (☆ 84 · updated 4 months ago)
- Transformer with Mu-Parameterization, implemented in JAX/Flax. Supports FSDP on TPU pods. (☆ 29 · updated 3 weeks ago)
- Fast Hadamard transform in CUDA, with a PyTorch interface; see the sketch after this list. (☆ 87 · updated 3 months ago)
- Explorations into the recently proposed Taylor Series Linear Attention. (☆ 85 · updated last month)
- Memory Optimizations for Deep Learning (ICML 2023). (☆ 58 · updated 6 months ago)
- The simplest, fastest repository for training/finetuning medium-sized GPTs. Now, with kittens! (☆ 45 · updated last month)
- Some personal experiments with routing tokens to different autoregressive attention branches, akin to mixture-of-experts. (☆ 101 · updated last year)
- Tree Attention: Topology-aware Decoding for Long-Context Attention on GPU clusters. (☆ 94 · updated 2 weeks ago)
- A simple library for scaling up JAX programs. (☆ 116 · updated last month)
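
As a point of reference for the Hadamard-transform entry above, here is a pure-PyTorch fast Walsh-Hadamard transform. This is an illustrative O(n log n) sketch, not the CUDA kernel from that repository, and the function name `fwht` is ours:

```python
import torch

def fwht(x: torch.Tensor) -> torch.Tensor:
    """Fast Walsh-Hadamard transform along the last dimension.

    Pure-PyTorch reference in O(n log n); assumes the last dimension
    is a power of two. Unnormalized: fwht(fwht(x)) == n * x.
    """
    n = x.shape[-1]
    assert n & (n - 1) == 0, "last dimension must be a power of two"
    y = x
    h = 1
    while h < n:
        # Split each block of 2h entries into halves (a, b) and
        # replace them with (a + b, a - b): one butterfly stage.
        y = y.reshape(*x.shape[:-1], n // (2 * h), 2, h)
        a, b = y[..., 0, :], y[..., 1, :]
        y = torch.stack((a + b, a - b), dim=-2)
        h *= 2
    return y.reshape(x.shape)
```

Because the Hadamard matrix satisfies H·H = n·I, applying `fwht` twice and dividing by n recovers the input, which makes the sketch easy to sanity-check.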