zhuhanqing / APOLLO
APOLLO: SGD-like Memory, AdamW-level Performance; MLSys'25 Outstanding Paper Honorable Mention
☆270 · Updated 2 months ago
Alternatives and similar repositories for APOLLO
Users interested in APOLLO are comparing it to the libraries listed below
- [COLM 2024] TriForce: Lossless Acceleration of Long Sequence Generation with Hierarchical Speculative Decoding ☆276 · Updated last year
- [ICLR 2025🔥] SVD-LLM & [NAACL 2025🔥] SVD-LLM V2 ☆280 · Updated 5 months ago
- ☆158 · Updated 11 months ago
- Low-bit optimizers for PyTorch ☆138 · Updated 2 years ago
- Flash-Muon: An Efficient Implementation of Muon Optimizer ☆233 · Updated 7 months ago
- ☆131 · Updated 8 months ago
- ☆163 · Updated 7 months ago
- [NeurIPS 2024] BAdam: A Memory Efficient Full Parameter Optimization Method for Large Language Models ☆286 · Updated 10 months ago
- [ICML 2024] Official Implementation of SLEB: Streamlining LLMs through Redundancy Verification and Elimination of Transformer Blocks ☆39 · Updated last year
- [ICML 2025 Spotlight] ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference ☆283 · Updated 9 months ago
- The evaluation framework for training-free sparse attention in LLMs ☆117 · Updated 2 weeks ago
- 🔥 A minimal training framework for scaling FLA models ☆344 · Updated 2 months ago
- Efficient LLM Inference over Long Sequences ☆394 · Updated 7 months ago
- [ICLR 2025] COAT: Compressing Optimizer States and Activation for Memory-Efficient FP8 Training ☆258 · Updated 6 months ago
- Official implementation for Training LLMs with MXFP4 ☆118 · Updated 9 months ago
- [ICML 2025] XAttention: Block Sparse Attention with Antidiagonal Scoring ☆269 · Updated 7 months ago
- Efficient triton implementation of Native Sparse Attention. ☆262 · Updated 8 months ago
- Fast and memory-efficient exact attention ☆75 · Updated 11 months ago
- Code for paper: [ICLR2025 Oral] FlexPrefill: A Context-Aware Sparse Attention Mechanism for Efficient Long-Sequence Inference ☆160 · Updated 3 months ago
- [ICLR 2025] Palu: Compressing KV-Cache with Low-Rank Projection ☆155 · Updated 11 months ago
- Implementations and experimentation on mHC by DeepSeek - https://arxiv.org/abs/2512.24880 ☆290 · Updated this week
- Block Diffusion for Ultra-Fast Speculative Decoding ☆533 · Updated this week
- Repo for "LoLCATs: On Low-Rank Linearizing of Large Language Models" ☆251 · Updated last year
- [ACL 2025 Main] EfficientQAT: Efficient Quantization-Aware Training for Large Language Models ☆327 · Updated 2 months ago
- ☆235 · Updated last year
- Code for studying the super weight in LLM ☆121 · Updated last year
- [CoLM'25] The official implementation of the paper <MoA: Mixture of Sparse Attention for Automatic Large Language Model Compression> ☆155 · Updated 3 weeks ago
- Towards Economical Inference: Enabling DeepSeek's Multi-Head Latent Attention in Any Transformer-based LLMs ☆204 · Updated 2 months ago
- GEAR: An Efficient KV Cache Compression Recipe for Near-Lossless Generative Inference of LLM ☆176 · Updated last year
- Unofficial implementation for the paper "Mixture-of-Depths: Dynamically allocating compute in transformer-based language models" ☆177 · Updated last year