fla-org / flame
🔥 A minimal training framework for scaling FLA models
⭐ 233 · Updated last week
Alternatives and similar repositories for flame
Users interested in flame are comparing it to the libraries listed below.
- Efficient triton implementation of Native Sparse Attention. ⭐ 208 · Updated 3 months ago
- Flash-Muon: An Efficient Implementation of Muon Optimizer ⭐ 160 · Updated 2 months ago
- Triton implementation of FlashAttention2 that adds Custom Masks. ⭐ 132 · Updated last year
- ⭐ 123 · Updated 2 months ago
- [ICLR 2025] COAT: Compressing Optimizer States and Activation for Memory-Efficient FP8 Training ⭐ 231 · Updated 2 weeks ago
- [ICLR 2025] Codebase for "ReMoE: Fully Differentiable Mixture-of-Experts with ReLU Routing", built on Megatron-LM. ⭐ 87 · Updated 8 months ago
- ⭐ 80 · Updated 6 months ago
- [ICML 2025] XAttention: Block Sparse Attention with Antidiagonal Scoring ⭐ 219 · Updated last month
- Triton-based implementation of Sparse Mixture of Experts. ⭐ 233 · Updated 8 months ago
- Physics of Language Models, Part 4 ⭐ 232 · Updated 3 weeks ago
- ⭐ 237 · Updated 2 months ago
- ⭐ 139 · Updated 6 months ago
- [ICLR 2025] Official PyTorch Implementation of Gated Delta Networks: Improving Mamba2 with Delta Rule ⭐ 199 · Updated 5 months ago
- ⭐ 98 · Updated 4 months ago
- Some preliminary explorations of Mamba's context scaling. ⭐ 217 · Updated last year
- ⭐ 117 · Updated 2 months ago
- Official implementation of "Fast-dLLM: Training-free Acceleration of Diffusion LLM by Enabling KV Cache and Parallel Decoding" ⭐ 374 · Updated last week
- An efficient implementation of the NSA (Native Sparse Attention) kernel ⭐ 113 · Updated 2 months ago
- SeerAttention: Learning Intrinsic Sparse Attention in Your LLMs ⭐ 147 · Updated 3 weeks ago
- Code for paper: [ICLR 2025 Oral] FlexPrefill: A Context-Aware Sparse Attention Mechanism for Efficient Long-Sequence Inference ⭐ 133 · Updated 3 months ago
- Lightning Attention-2: A Free Lunch for Handling Unlimited Sequence Lengths in Large Language Models ⭐ 324 · Updated 6 months ago
- A sparse attention kernel supporting mixed sparse patterns ⭐ 280 · Updated 6 months ago
- Low-bit optimizers for PyTorch ⭐ 130 · Updated last year
- Official repository for DistFlashAttn: Distributed Memory-efficient Attention for Long-context LLMs Training ⭐ 214 · Updated last year
- Fast and memory-efficient exact attention ⭐ 70 · Updated 5 months ago
- Efficient 2:4 sparse training algorithms and implementations ⭐ 55 · Updated 8 months ago
- Implementation of FP8/INT8 rollout for RL training without a performance drop. ⭐ 155 · Updated this week
- 🚀 Efficiently (pre)training foundation models with native PyTorch features, including FSDP for training and SDPA implementation of Flash… (see the SDPA sketch after this list) ⭐ 261 · Updated last month
- ⭐ 148 · Updated 2 years ago
- Implementation of 💍 Ring Attention, from Liu et al. at Berkeley AI, in Pytorch ⭐ 536 · Updated 3 months ago
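Many of the attention projects above (the FlashAttention-2 forks, the sparse-attention kernels, and the FSDP/SDPA pretraining repo) build on or mirror the interface of PyTorch's built-in `scaled_dot_product_attention`. As a point of reference only — not the API of any listed repository — here is a minimal sketch of causal attention through that PyTorch entry point, which can dispatch to a fused flash-attention backend on supported GPUs; the shapes and the CUDA/fp16 setup are illustrative assumptions.

```python
# Minimal reference sketch (assumes PyTorch >= 2.0 and a CUDA GPU; shapes are arbitrary).
# Not the API of any repository listed above.
import torch
import torch.nn.functional as F

batch, heads, seq_len, head_dim = 2, 8, 1024, 64
q = torch.randn(batch, heads, seq_len, head_dim, device="cuda", dtype=torch.float16)
k = torch.randn_like(q)
v = torch.randn_like(q)

# is_causal=True applies a lower-triangular mask without materializing it;
# on supported hardware PyTorch may route this call to a fused flash-attention kernel.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)  # torch.Size([2, 8, 1024, 64])
```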