OpenNLPLab / lightning-attention
Lightning Attention-2: A Free Lunch for Handling Unlimited Sequence Lengths in Large Language Models
☆184 · Updated 6 months ago
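The repository's actual kernels are written in Triton and are considerably more involved, but the core idea behind lightning attention is the linear-attention identity: causal attention without softmax can be computed either as a masked quadratic form or as an O(n) recurrence over a running key-value state. The NumPy sketch below illustrates that equivalence under simplifying assumptions (no feature map, no normalization); the function names are illustrative, not from the repo.

```python
import numpy as np

def causal_linear_attention(q, k, v):
    """O(n * d^2) recurrence form: a running state S accumulates
    outer products k_t v_t^T, so cost is linear in sequence length n."""
    n, d = q.shape
    out = np.zeros((n, v.shape[1]))
    S = np.zeros((d, v.shape[1]))        # running state, shape (d, d_v)
    for t in range(n):
        S += np.outer(k[t], v[t])        # fold step t into the state
        out[t] = q[t] @ S                # attend to all positions <= t
    return out

def causal_quadratic_form(q, k, v):
    """Equivalent O(n^2) form: causally masked (Q K^T) V."""
    n = q.shape[0]
    mask = np.tril(np.ones((n, n)))      # lower-triangular causal mask
    return ((q @ k.T) * mask) @ v
```

Lightning Attention-2 exploits this equivalence block-wise: within a tile it uses the (left) quadratic form, and across tiles it carries the recurrent state, which is what lets it handle very long sequences at linear cost.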
Related projects
Alternatives and complementary repositories for lightning-attention
- Official implementation of TransNormerLLM: A Faster and Better LLM ☆229 · Updated 9 months ago
- Low-bit optimizers for PyTorch ☆119 · Updated last year
- [ICML'24] The official implementation of “Rethinking Optimization and Architecture for Tiny Language Models” ☆118 · Updated 4 months ago
- [ICML'24 Oral] The official code of "DiJiang: Efficient Large Language Models through Compact Kernelization", a novel DCT-based linear at… ☆98 · Updated 5 months ago
- Skywork-MoE: A Deep Dive into Training Techniques for Mixture-of-Experts Language Models ☆126 · Updated 5 months ago
- The official implementation of the EMNLP 2023 paper LLM-FP4 ☆167 · Updated 11 months ago
- A collection of memory-efficient attention operators implemented in the Triton language. ☆219 · Updated 5 months ago
- GEAR: An Efficient KV Cache Compression Recipe for Near-Lossless Generative Inference of LLM ☆147 · Updated 4 months ago
- Implementation of Kangaroo: Lossless Self-Speculative Decoding via Double Early Exiting ☆44 · Updated 4 months ago
- A MoE impl for PyTorch, [ATC'23] SmartMoE ☆57 · Updated last year
- Official implementation of TransNormer from our EMNLP 2022 paper "The Devil in Linear Transformer" ☆55 · Updated last year
- (Unofficial) PyTorch implementation of grouped-query attention (GQA) from "GQA: Training Generalized Multi-Query Transformer Models from … ☆133 · Updated 6 months ago
- The official code for the paper "Parallel Speculative Decoding with Adaptive Draft Length" ☆24 · Updated 2 months ago
- Triton-based implementation of Sparse Mixture of Experts. ☆185 · Updated last month
- Block Transformer: Global-to-Local Language Modeling for Fast Inference (Official Code) ☆135 · Updated last month
- A Tight-fisted Optimizer ☆47 · Updated last year
- Get down and dirty with FlashAttention 2.0 in PyTorch: plug and play, no complex CUDA kernels ☆98 · Updated last year
- An algorithm for static activation quantization of LLMs ☆77 · Updated last week
- Code repo for the paper "LLM-QAT: Data-Free Quantization Aware Training for Large Language Models" ☆254 · Updated 2 months ago
- PB-LLM: Partially Binarized Large Language Models ☆148 · Updated last year
- An easy-to-use package for implementing SmoothQuant for LLMs ☆83 · Updated 6 months ago
- Pruner-Zero: Evolving Symbolic Pruning Metric from Scratch for LLMs ☆74 · Updated 5 months ago