sustcsonglin / linear-attention-and-beyond-slides
⭐ 74 · Updated 3 months ago
Alternatives and similar repositories for linear-attention-and-beyond-slides
Users interested in linear-attention-and-beyond-slides are comparing it to the libraries listed below.
- 🔥 A minimal training framework for scaling FLA models — ⭐146 · Updated 3 weeks ago
- Stick-breaking attention — ⭐56 · Updated 2 months ago
- [ICLR 2025] Official PyTorch implementation of "Forgetting Transformer: Softmax Attention with a Forget Gate" — ⭐104 · Updated 2 weeks ago
- ⭐47 · Updated 2 months ago
- Code for ICLR 2025 paper "What is Wrong with Perplexity for Long-context Language Modeling?" — ⭐78 · Updated 2 weeks ago
- Official implementation of "The Sparse Frontier: Sparse Attention Trade-offs in Transformer LLMs" — ⭐32 · Updated last month
- ⭐53 · Updated this week
- ⭐83 · Updated last month
- [ICML 2025] Fourier Position Embedding: Enhancing Attention's Periodic Extension for Length Generalization — ⭐69 · Updated 4 months ago
- A collection of papers on discrete diffusion models — ⭐121 · Updated last week
- [NeurIPS 2024] Scaling Laws with Vocabulary: Larger Models Deserve Larger Vocabularies (https://arxiv.org/abs/2407.13623) — ⭐84 · Updated 8 months ago
- Efficient Triton implementation of Native Sparse Attention — ⭐155 · Updated last week
- XAttention: Block Sparse Attention with Antidiagonal Scoring — ⭐158 · Updated 3 weeks ago
- [ICLR 2025] Official PyTorch implementation of Gated Delta Networks: Improving Mamba2 with Delta Rule — ⭐167 · Updated 2 months ago
- Flash-Muon: An Efficient Implementation of Muon Optimizer — ⭐115 · Updated this week
- ⭐93 · Updated 2 weeks ago
- [ICLR 2024 Spotlight] Code for the paper "Merge, Then Compress: Demystify Efficient SMoE with Hints from Its Routing Policy" — ⭐84 · Updated 11 months ago
- ⭐79 · Updated 9 months ago
- ⭐92 · Updated 8 months ago
- The simplest implementation of recent sparse attention patterns for efficient LLM inference — ⭐62 · Updated 4 months ago
- ⭐129 · Updated 3 months ago
- Code for NeurIPS 2024 Spotlight "Scaling Laws and Compute-Optimal Training Beyond Fixed Training Durations" — ⭐73 · Updated 7 months ago
- Official implementation of Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink-Free — ⭐38 · Updated 3 weeks ago
- Triton implementation of FlashAttention2 that adds custom masks — ⭐117 · Updated 9 months ago
- ⭐78 · Updated this week
- [ICLR 2025] Codebase for "ReMoE: Fully Differentiable Mixture-of-Experts with ReLU Routing", built on Megatron-LM — ⭐74 · Updated 5 months ago
- A fusion of a linear layer and a cross-entropy loss, written for PyTorch in Triton — ⭐67 · Updated 10 months ago
- ⭐52 · Updated last year
- "Found in the Middle: How Language Models Use Long Contexts Better via Plug-and-Play Positional Encoding" by Zhenyu Zhang, Runjin Chen, Shiw… — ⭐29 · Updated last year
- LongSpec: Long-Context Speculative Decoding with Efficient Drafting and Verification — ⭐53 · Updated 3 months ago