jlamprou / Infini-Attention
Efficient Infinite Context Transformers with Infini-attention: PyTorch implementation + QwenMoE implementation + training script + 1M-context passkey retrieval
☆66 · Updated 6 months ago
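For orientation, here is a minimal single-head sketch of the Infini-attention mechanism this repo implements: per-segment causal softmax attention, a linear compressive memory carried across segments, and a learned gate blending the two (following the paper's linear memory-update variant). All names below are illustrative assumptions, not the repository's actual API.

```python
import torch
import torch.nn.functional as F

def elu_plus_one(x):
    # Non-negative kernel sigma(x) = ELU(x) + 1 used in the paper
    return F.elu(x) + 1.0

def infini_attention_segment(q, k, v, memory, z, beta):
    """One segment of Infini-attention (single head, simplified sketch).

    q, k, v : (seg_len, d) query/key/value projections for this segment
    memory  : (d, d) compressive memory carried over from earlier segments
    z       : (d, 1) normalization term carried over from earlier segments
    beta    : learned scalar gate mixing memory and local attention
    """
    d = q.shape[-1]

    # Local causal softmax attention within the current segment
    scores = (q @ k.T) / d**0.5
    causal = torch.triu(torch.ones_like(scores, dtype=torch.bool), diagonal=1)
    a_local = torch.softmax(scores.masked_fill(causal, float("-inf")), -1) @ v

    # Retrieve from compressive memory: A_mem = sigma(Q) M / (sigma(Q) z)
    sig_q = elu_plus_one(q)
    a_mem = (sig_q @ memory) / (sig_q @ z + 1e-6)

    # Linear memory update: M <- M + sigma(K)^T V, z <- z + sum_t sigma(k_t)
    sig_k = elu_plus_one(k)
    memory = memory + sig_k.T @ v
    z = z + sig_k.sum(dim=0, keepdim=True).T

    # Learned sigmoid gate blends long-term (memory) and local context
    gate = torch.sigmoid(beta)
    return gate * a_mem + (1.0 - gate) * a_local, memory, z

# Toy usage: stream two segments of length 4 through one head of dim 8
d = 8
memory, z = torch.zeros(d, d), torch.zeros(d, 1)
beta = torch.tensor(0.0)  # learned in practice; fixed here for illustration
for _ in range(2):
    q, k, v = torch.randn(4, d), torch.randn(4, d), torch.randn(4, d)
    out, memory, z = infini_attention_segment(q, k, v, memory, z, beta)
```

Because `memory` and `z` stay fixed-size no matter how many segments have streamed through, compute and memory per segment remain constant while the attended context grows without bound, which is what makes experiments like 1M-context passkey retrieval feasible.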
Related projects
Alternatives and complementary repositories for Infini-Attention
- ☆122 · Updated 10 months ago
- Parameter-Efficient Sparsity Crafting From Dense to Mixture-of-Experts for Instruction Tuning on General Tasks · ☆130 · Updated 2 months ago
- Unofficial implementation of the paper "Mixture-of-Depths: Dynamically allocating compute in transformer-based language models" · ☆135 · Updated 5 months ago
- [ICLR 2024 Spotlight] Code for the paper "Merge, Then Compress: Demystify Efficient SMoE with Hints from Its Routing Policy" · ☆64 · Updated 5 months ago
- Implementation of the paper "Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention" from Google, in PyTorch · ☆52 · Updated last week
- ☆64 · Updated last month
- Block Transformer: Global-to-Local Language Modeling for Fast Inference (official code) · ☆135 · Updated last month
- [NeurIPS 2024] Official repository of "The Mamba in the Llama: Distilling and Accelerating Hybrid Models" · ☆175 · Updated this week
- PyTorch implementation of "Compressed Context Memory for Online Language Model Interaction" (ICLR 2024) · ☆50 · Updated 7 months ago
- Official PyTorch implementation of "DistiLLM: Towards Streamlined Distillation for Large Language Models" (ICML 2024) · ☆139 · Updated 2 months ago
- Unofficial implementations of block/layer-wise pruning methods for LLMs · ☆51 · Updated 6 months ago
- ☆64 · Updated 4 months ago
- Open-source code for the paper "Retrieval Head Mechanistically Explains Long-Context Factuality" · ☆160 · Updated 3 months ago
- ☆184 · Updated last month
- Training code for Baby-Llama, our submission to the strict-small track of the BabyLM challenge · ☆68 · Updated last year
- ☆199 · Updated 5 months ago
- Layer-Condensed KV cache with 10× larger batch size, fewer params, and less computation; dramatic speed-up with better task performance… · ☆139 · Updated this week
- The official repo for "LLoCO: Learning Long Contexts Offline" · ☆113 · Updated 5 months ago
- Implementation of the paper "Mixture-of-Depths: Dynamically allocating compute in transformer-based language models" · ☆74 · Updated this week
- The official implementation of the paper "Demystifying the Compression of Mixture-of-Experts Through a Unified Framework" · ☆48 · Updated last month
- The official repository for Inheritune · ☆105 · Updated last month
- [NeurIPS 2024] 📈 Scaling Laws with Vocabulary: Larger Models Deserve Larger Vocabularies (https://arxiv.org/abs/2407.13623) · ☆71 · Updated last month
- Repo for Rho-1: Token-level Data Selection & Selective Pretraining of LLMs · ☆307 · Updated 7 months ago
- [ACL 2024] Not All Experts are Equal: Efficient Expert Pruning and Skipping for Mixture-of-Experts Large Language Models · ☆68 · Updated 6 months ago
- Official repository for the paper "SwitchHead: Accelerating Transformers with Mixture-of-Experts Attention" · ☆93 · Updated last month
- Official implementation of "Extending LLMs’ Context Window with 100 Samples" · ☆74 · Updated 10 months ago
- ☆96 · Updated 2 months ago
- Official implementation of "DoRA: Weight-Decomposed Low-Rank Adaptation" · ☆122 · Updated 6 months ago
- Homepage for ProLong (Princeton long-context language models) and the paper "How to Train Long-Context Language Models (Effectively)" · ☆119 · Updated this week
- Official implementation of "SLEB: Streamlining LLMs through Redundancy Verification and Elimination of Transformer Blocks" · ☆31 · Updated 5 months ago