OpenNLPLab / lightning-attention
Lightning Attention-2: A Free Lunch for Handling Unlimited Sequence Lengths in Large Language Models
☆182 · Updated 4 months ago

Related projects:
- Low-bit optimizers for PyTorch ☆109 · Updated 11 months ago
- Official implementation of TransNormerLLM: A Faster and Better LLM ☆223 · Updated 7 months ago
- Skywork-MoE: A Deep Dive into Training Techniques for Mixture-of-Experts Language Models ☆121 · Updated 3 months ago
- [ICML'24] The official implementation of “Rethinking Optimization and Architecture for Tiny Language Models” ☆114 · Updated 2 months ago
- [ICML'24 Oral] The official code of "DiJiang: Efficient Large Language Models through Compact Kernelization", a novel DCT-based linear at… ☆95 · Updated 3 months ago
- Get down and dirty with FlashAttention-2.0 in PyTorch; plug and play, no complex CUDA kernels ☆90 · Updated last year
- The official implementation of the EMNLP 2023 paper LLM-FP4 ☆156 · Updated 9 months ago
- (Unofficial) PyTorch implementation of grouped-query attention (GQA) from "GQA: Training Generalized Multi-Query Transformer Models from … ☆116 · Updated 4 months ago
- [ATC'23] SmartMoE, an MoE implementation for PyTorch ☆56 · Updated last year
- Implementation of Kangaroo: Lossless Self-Speculative Decoding via Double Early Exiting ☆39 · Updated 2 months ago
- GEAR: An Efficient KV Cache Compression Recipe for Near-Lossless Generative Inference of LLM ☆134 · Updated 2 months ago
- A collection of memory-efficient attention operators implemented in the Triton language ☆205 · Updated 3 months ago
- [EMNLP 2022] Official implementation of TransNormer from the paper "The Devil in Linear Transformer" ☆53 · Updated last year
- A Tight-fisted Optimizer ☆46 · Updated last year
- Implementation of FlashAttention in PyTorch ☆95 · Updated last year
- USP: Unified (a.k.a. Hybrid, 2D) Sequence Parallel Attention for Long-Context Transformer Model Training and Inference ☆309 · Updated this week
- Official implementation of "DoRA: Weight-Decomposed Low-Rank Adaptation" ☆120 · Updated 4 months ago
- Unofficial implementation of the paper "Mixture-of-Depths: Dynamically allocating compute in transformer-based language models" ☆123 · Updated 2 months ago
- Implementation of "Attention Is Off By One" by Evan Miller ☆177 · Updated last year
- [ICML 2024 Oral] The official implementation of our Accurate LoRA-Finetuning Quantization of LLMs via Information Retenti… ☆55 · Updated 5 months ago
- A generalized framework for subspace-tuning methods in parameter-efficient fine-tuning ☆70 · Updated this week
- Patch convolution to avoid large GPU memory usage of Conv2D ☆73 · Updated 3 months ago