jzhang38 / LongMamba
Some preliminary explorations of Mamba's context scaling.
☆216 · Updated last year
Alternatives and similar repositories for LongMamba
Users interested in LongMamba are comparing it to the libraries listed below.
- Code for exploring Based models from "Simple linear attention language models balance the recall-throughput tradeoff" · ☆241 · Updated 4 months ago
- Implementation of 🥥 Coconut, Chain of Continuous Thought, in Pytorch · ☆179 · Updated 4 months ago
- [NeurIPS 2024] Official Repository of The Mamba in the Llama: Distilling and Accelerating Hybrid Models · ☆231 · Updated last week
- Understand and test language model architectures on synthetic tasks. · ☆233 · Updated 3 weeks ago
- Normalized Transformer (nGPT) · ☆192 · Updated 11 months ago
- Official implementation of Phi-Mamba. A MOHAWK-distilled model (Transformers to SSMs: Distilling Quadratic Knowledge to Subquadratic Mode… · ☆116 · Updated last year
- Pytorch implementation of the PEER block from the paper, Mixture of A Million Experts, by Xu Owen He at Deepmind · ☆129 · Updated last year
- ☆149 · Updated 2 years ago
- 🔥 A minimal training framework for scaling FLA models · ☆266 · Updated last month
- ☆93 · Updated 7 months ago
- Block Transformer: Global-to-Local Language Modeling for Fast Inference (NeurIPS 2024) · ☆162 · Updated 6 months ago
- [NeurIPS 2024] Low-rank memory-efficient optimizer without SVD · ☆30 · Updated 3 months ago
- Flash-Muon: An Efficient Implementation of Muon Optimizer · ☆195 · Updated 4 months ago
- ☆200 · Updated last month
- Stick-breaking attention · ☆61 · Updated 3 months ago
- A MAD laboratory to improve AI architecture designs 🧪 · ☆131 · Updated 10 months ago
- ☆107 · Updated last year
- Repo for "LoLCATs: On Low-Rank Linearizing of Large Language Models" · ☆248 · Updated 8 months ago
- Token Omission Via Attention · ☆127 · Updated last year
- [ICLR 2025] Codebase for "ReMoE: Fully Differentiable Mixture-of-Experts with ReLU Routing", built on Megatron-LM. · ☆97 · Updated 10 months ago
- Tree Attention: Topology-aware Decoding for Long-Context Attention on GPU clusters · ☆130 · Updated 10 months ago
- Physics of Language Models, Part 4 · ☆250 · Updated 2 months ago
- [ICLR 2025] Official PyTorch Implementation of Gated Delta Networks: Improving Mamba2 with Delta Rule · ☆331 · Updated last month
- [ICLR 2025 & COLM 2025] Official PyTorch implementation of the Forgetting Transformer and Adaptive Computation Pruning · ☆131 · Updated 3 weeks ago
- ☆86 · Updated last year
- ☆251 · Updated 4 months ago
- The simplest, fastest repository for training/finetuning medium-sized GPTs. · ☆166 · Updated 3 months ago
- ☆202 · Updated 10 months ago
- ☆127 · Updated last year
- AnchorAttention: Improved attention for LLMs long-context training · ☆213 · Updated 9 months ago