pilancilab / caldera
Compressing Large Language Models using Low Precision and Low Rank Decomposition
☆99 · Updated 9 months ago
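The repository's description points at the general recipe: approximate each weight matrix as a low-precision backbone plus a low-rank correction. The sketch below illustrates only that generic idea with placeholder names, a naive uniform quantizer, and a truncated SVD; it is not CALDERA's actual algorithm or API.

```python
import numpy as np

def low_rank_plus_low_precision(W, rank=16, n_bits=4):
    """Approximate W as Q + L @ R: Q is a coarsely quantized copy of W,
    and L @ R is a low-rank correction fitted to the quantization residual.
    Generic illustration only, not the CALDERA implementation."""
    # Naive uniform quantization of W to 2**n_bits levels (placeholder scheme).
    lo, hi = W.min(), W.max()
    scale = (hi - lo) / (2 ** n_bits - 1)
    Q = np.round((W - lo) / scale) * scale + lo

    # Low-rank fit of the residual W - Q via truncated SVD.
    U, S, Vt = np.linalg.svd(W - Q, full_matrices=False)
    L = U[:, :rank] * S[:rank]
    R = Vt[:rank, :]
    return Q, L, R

# Toy usage: the reconstruction error should drop once the
# low-rank correction is added back to the quantized backbone.
W = np.random.randn(256, 256).astype(np.float32)
Q, L, R = low_rank_plus_low_precision(W, rank=32, n_bits=3)
print(np.linalg.norm(W - Q), np.linalg.norm(W - (Q + L @ R)))
```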
Alternatives and similar repositories for caldera
Users who are interested in caldera are comparing it to the libraries listed below.
- Tree Attention: Topology-aware Decoding for Long-Context Attention on GPU clusters ☆129 · Updated 9 months ago
- The evaluation framework for training-free sparse attention in LLMs ☆96 · Updated 3 months ago
- Work in progress. ☆73 · Updated 2 months ago
- PB-LLM: Partially Binarized Large Language Models ☆154 · Updated last year
- Fast and memory-efficient exact attention ☆69 · Updated 6 months ago
- QuIP quantization ☆60 · Updated last year
- ☆152 · Updated 3 months ago
- ☆202 · Updated 9 months ago
- [NAACL 2025] Official Implementation of "HMT: Hierarchical Memory Transformer for Long Context Language Processing" ☆75 · Updated 3 months ago
- The simplest implementation of recent Sparse Attention patterns for efficient LLM inference. ☆89 · Updated 2 months ago
- ☆69 · Updated last year
- ☆57 · Updated 4 months ago
- ☆53 · Updated 10 months ago
- Token Omission Via Attention ☆128 · Updated 11 months ago
- [ACL 2025 Main] EfficientQAT: Efficient Quantization-Aware Training for Large Language Models ☆302 · Updated 4 months ago
- Code for studying the super weight in LLM ☆119 · Updated 9 months ago
- This repository contains code for the MicroAdam paper. ☆19 · Updated 9 months ago
- [ICML24] Pruner-Zero: Evolving Symbolic Pruning Metric from scratch for LLMs ☆92 · Updated 10 months ago
- Flash-Muon: An Efficient Implementation of Muon Optimizer ☆186 · Updated 3 months ago
- From GaLore to WeLore: How Low-Rank Weights Non-uniformly Emerge from Low-Rank Gradients. Ajay Jaiswal, Lu Yin, Zhenyu Zhang, Shiwei Liu, … ☆48 · Updated 5 months ago
- Repo for "LoLCATs: On Low-Rank Linearizing of Large Language Models" ☆246 · Updated 7 months ago
- ☆81 · Updated 10 months ago
- Activation-aware Singular Value Decomposition for Compressing Large Language Models ☆78 · Updated 11 months ago
- Linear Attention Sequence Parallelism (LASP) ☆86 · Updated last year
- GEAR: An Efficient KV Cache Compression Recipe for Near-Lossless Generative Inference of LLM ☆168 · Updated last year
- ☆196 · Updated 9 months ago
- ☆35 · Updated last month
- ☆142 · Updated 7 months ago
- ☆126 · Updated 3 months ago
- An efficient implementation of the NSA (Native Sparse Attention) kernel ☆115 · Updated 3 months ago