jlamprou / Infini-Attention
Efficient Infinite Context Transformers with Infini-attention Pytorch Implementation + QwenMoE Implementation + Training Script + 1M context keypass retrieval
☆74Updated 8 months ago
Alternatives and similar repositories for Infini-Attention:
Users that are interested in Infini-Attention are comparing it to the libraries listed below
- ☆124Updated 11 months ago
- Parameter-Efficient Sparsity Crafting From Dense to Mixture-of-Experts for Instruction Tuning on General Tasks☆139Updated 4 months ago
- Block Transformer: Global-to-Local Language Modeling for Fast Inference (NeurIPS 2024)☆148Updated last month
- The official repo for "LLoCo: Learning Long Contexts Offline"☆114Updated 7 months ago
- Unofficial implementation for the paper "Mixture-of-Depths: Dynamically allocating compute in transformer-based language models"☆145Updated 6 months ago
- ☆191Updated last month
- Official Implementation of SLEB: Streamlining LLMs through Redundancy Verification and Elimination of Transformer Blocks☆33Updated 6 months ago
- Pytorch implementation for "Compressed Context Memory For Online Language Model Interaction" (ICLR'24)☆52Updated 9 months ago
- PB-LLM: Partially Binarized Large Language Models☆150Updated last year
- Official repository for the paper "SwitchHead: Accelerating Transformers with Mixture-of-Experts Attention"☆96Updated 3 months ago
- ☆69Updated this week
- Implementation of the paper: "Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention" from Google in pyTO…☆53Updated this week
- ☆251Updated last year
- [ICLR 2024 Spotlight] Code for the paper "Merge, Then Compress: Demystify Efficient SMoE with Hints from Its Routing Policy"☆70Updated 7 months ago
- [ICML'24] The official implementation of “Rethinking Optimization and Architecture for Tiny Language Models”☆119Updated this week
- An unofficial pytorch implementation of 'Efficient Infinite Context Transformers with Infini-attention'☆46Updated 5 months ago
- Positional Skip-wise Training for Efficient Context Window Extension of LLMs to Extremely Length (ICLR 2024)☆204Updated 7 months ago
- ☆107Updated 3 months ago
- This is the official repository for Inheritune.☆109Updated 3 months ago
- Official implementation of "DoRA: Weight-Decomposed Low-Rank Adaptation"☆123Updated 8 months ago
- ☆212Updated 7 months ago
- Official PyTorch implementation of DistiLLM: Towards Streamlined Distillation for Large Language Models (ICML 2024)☆171Updated 3 months ago
- Code for exploring Based models from "Simple linear attention language models balance the recall-throughput tradeoff"☆219Updated last month
- Unofficial PyTorch/🤗Transformers(Gemma/Llama3) implementation of Leave No Context Behind: Efficient Infinite Context Transformers with I…☆349Updated 8 months ago
- Unofficial implementations of block/layer-wise pruning methods for LLMs.☆58Updated 8 months ago
- Layer-Condensed KV cache w/ 10 times larger batch size, fewer params and less computation. Dramatic speed up with better task performance…☆147Updated last month
- A repository for research on medium sized language models.☆76Updated 7 months ago
- Q-GaLore: Quantized GaLore with INT4 Projection and Layer-Adaptive Low-Rank Gradients.☆188Updated 6 months ago
- Spherical Merge Pytorch/HF format Language Models with minimal feature loss.☆115Updated last year
- [NeurIPS 2024] Official Repository of The Mamba in the Llama: Distilling and Accelerating Hybrid Models☆189Updated 2 weeks ago