CLAIRE-Labo / flash_attention
A basic pure-PyTorch implementation of FlashAttention
☆ 16 · Updated 3 months ago
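For context, the FlashAttention idea is to compute softmax attention block by block, carrying running softmax statistics so the full attention matrix is never materialized. Below is a minimal pure-PyTorch sketch of that online-softmax tiling; the function name, block size, and tensor shapes are illustrative assumptions, not this repository's API.

```python
# Minimal sketch of FlashAttention-style tiling in pure PyTorch (non-causal).
# Names, shapes, and the block size are illustrative assumptions, not the repo's API.
import torch


def tiled_attention(q, k, v, block_size=64):
    """q, k, v: (batch, heads, seq_len, head_dim).

    Iterates over key/value blocks while maintaining a running row-max and
    row-sum, so the full (seq_len x seq_len) attention matrix is never built.
    """
    seq_len, head_dim = q.shape[-2], q.shape[-1]
    scale = head_dim ** -0.5

    out = torch.zeros_like(q)
    row_max = torch.full(q.shape[:-1] + (1,), float("-inf"),
                         dtype=q.dtype, device=q.device)
    row_sum = torch.zeros_like(row_max)

    for start in range(0, seq_len, block_size):
        k_blk = k[..., start:start + block_size, :]
        v_blk = v[..., start:start + block_size, :]

        scores = q @ k_blk.transpose(-2, -1) * scale        # (B, H, L, block)
        new_max = torch.maximum(row_max, scores.amax(dim=-1, keepdim=True))

        # Rescale the previously accumulated output/normalizer to the new maximum.
        correction = torch.exp(row_max - new_max)
        probs = torch.exp(scores - new_max)

        out = out * correction + probs @ v_blk
        row_sum = row_sum * correction + probs.sum(dim=-1, keepdim=True)
        row_max = new_max

    return out / row_sum
```

For small inputs this should agree with the reference `torch.softmax(q @ k.transpose(-2, -1) * scale, dim=-1) @ v` up to floating-point error.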
Alternatives and similar repositories for flash_attention:
Users interested in flash_attention are comparing it to the libraries listed below.
- ☆ 75 · Updated 6 months ago
- Engineering the state of RNN language models (Mamba, RWKV, etc.) ☆ 32 · Updated 8 months ago
- ☆ 53 · Updated last year
- ☆ 24 · Updated 3 weeks ago
- Minimal but scalable implementation of large language models in JAX ☆ 28 · Updated 2 months ago
- Exploration into the proposed "Self Reasoning Tokens" by Felipe Bonetto ☆ 54 · Updated 8 months ago
- LL3M: Large Language and Multi-Modal Model in Jax ☆ 68 · Updated 9 months ago
- ☆ 37 · Updated 9 months ago
- ☆ 78 · Updated 9 months ago
- ☆ 30 · Updated 2 months ago
- Transformer with Mu-Parameterization, implemented in Jax/Flax. Supports FSDP on TPU pods. ☆ 30 · Updated last month
- σ-GPT: A New Approach to Autoregressive Models ☆ 61 · Updated 5 months ago
- DPO, but faster 🚀 ☆ 29 · Updated last month
- Official repository of the paper "RNNs Are Not Transformers (Yet): The Key Bottleneck on In-context Retrieval" ☆ 25 · Updated 9 months ago
- Triton implementation of the HyperAttention algorithm ☆ 46 · Updated last year
- Official repository for the paper "Approximating Two-Layer Feedforward Networks for Efficient Transformers" ☆ 36 · Updated last year
- A general framework for inference-time scaling and steering of diffusion models with arbitrary rewards ☆ 71 · Updated 2 weeks ago
- An implementation of the PSGD Kron second-order optimizer for PyTorch ☆ 29 · Updated 3 weeks ago
- Explorations into the recently proposed Taylor Series Linear Attention ☆ 92 · Updated 5 months ago
- Focused on fast experimentation and simplicity ☆ 65 · Updated last month
- ☆ 20 · Updated last year
- Implementation of Gradient Agreement Filtering, from Chaubard et al. of Stanford, but for single-machine microbatches, in PyTorch ☆ 22 · Updated last week
- Minimal (400 LOC) implementation of maximum (multi-node, FSDP) GPT training ☆ 121 · Updated 9 months ago
- Official code for the paper "Preference Alignment with Flow Matching" (NeurIPS 2024) ☆ 20 · Updated 2 months ago
- A repository for research on medium-sized language models ☆ 76 · Updated 8 months ago
- Collection of autoregressive model implementations ☆ 77 · Updated 3 weeks ago
- Latent Diffusion Language Models ☆ 68 · Updated last year
- ☆ 70 · Updated 5 months ago
- ☆ 25 · Updated 9 months ago
- Using FlexAttention to compute attention with different masking patterns ☆ 40 · Updated 4 months ago