NVlabs / EoRA
EoRA: Fine-tuning-free Compensation for Compressed LLM with Eigenspace Low-Rank Approximation
☆27 · Updated 4 months ago
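EoRA's tagline summarizes the method: rather than fine-tuning a compressed model, the error introduced by compression is compensated with a low-rank term computed in an eigenspace derived from calibration activations. The snippet below is a minimal sketch of that idea, not the official NVlabs/EoRA code; the function name `eora_sketch` and the exact eigenvalue weighting are illustrative assumptions.

```python
import torch

def eora_sketch(W, W_compressed, X, rank):
    """Illustrative eigenspace low-rank compensation (hypothetical helper,
    not the official EoRA implementation).

    W            -- original weight matrix, shape (out, in)
    W_compressed -- quantized/pruned weight matrix, shape (out, in)
    X            -- calibration activations, shape (in, n_samples)
    rank         -- target rank of the compensation term
    """
    err = W - W_compressed                        # compression error ΔW
    cov = X @ X.T                                 # activation covariance (in, in)
    eigvals, Q = torch.linalg.eigh(cov)           # eigenspace of the activations
    scale = eigvals.clamp_min(0).sqrt()           # emphasize high-energy directions
    err_proj = (err @ Q) * scale                  # project ΔW into scaled eigenspace
    U, S, Vh = torch.linalg.svd(err_proj, full_matrices=False)
    U_r = U[:, :rank] * S[:rank]                  # truncated SVD in the eigenspace
    V_r = (Vh[:rank] / scale.clamp_min(1e-8)) @ Q.T  # map back to input space
    # W_compressed + U_r @ V_r approximates W under the activation-weighted metric
    return U_r, V_r
```

The design point the sketch tries to capture: a plain truncated SVD of ΔW minimizes error uniformly over weight entries, while projecting into the activation eigenspace first makes the low-rank budget concentrate on the directions the layer's inputs actually excite.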
Alternatives and similar repositories for EoRA
Users interested in EoRA are comparing it to the repositories listed below.
- [COLM 2025] Official PyTorch implementation of "Quantization Hurts Reasoning? An Empirical Study on Quantized Reasoning Models" ☆61 · Updated 4 months ago
- HALO: Hadamard-Assisted Low-Precision Optimization and Training method for finetuning LLMs. The official implementation of https://ar… ☆29 · Updated 9 months ago
- Activation-aware Singular Value Decomposition for Compressing Large Language Models ☆80 · Updated last year
- PyTorch implementation of our paper accepted by ICML 2024 -- CaM: Cache Merging for Memory-efficient LLMs Inference ☆47 · Updated last year
- [ACL 2025] Squeezed Attention: Accelerating Long Prompt LLM Inference ☆54 · Updated last year
- ☆24 · Updated last year
- LLM Inference with Microscaling Format ☆33 · Updated last year
- [ICML 2025] SparseLoRA: Accelerating LLM Fine-Tuning with Contextual Sparsity ☆64 · Updated 5 months ago
- ☆31 · Updated last year
- ☆49 · Updated last year
- ☆154 · Updated 9 months ago
- ☆19 · Updated 11 months ago
- This repo contains the source code for: Model Tells You What to Discard: Adaptive KV Cache Compression for LLMs ☆42 · Updated last year
- [ICML 2024 Oral] This project is the official implementation of our Accurate LoRA-Finetuning Quantization of LLMs via Information Retenti… ☆67 · Updated last year
- ☆37 · Updated last year
- ☆83 · Updated 10 months ago
- ☆30 · Updated last year
- ☆132 · Updated 6 months ago
- ☆60 · Updated last year
- Vortex: A Flexible and Efficient Sparse Attention Framework ☆41 · Updated last week
- Code for the paper [ICLR 2025 Oral] FlexPrefill: A Context-Aware Sparse Attention Mechanism for Efficient Long-Sequence Inference ☆155 · Updated last month
- The evaluation framework for training-free sparse attention in LLMs ☆106 · Updated last month
- Quantized Attention on GPU ☆44 · Updated last year
- A collection of research papers on low-precision training methods ☆51 · Updated 6 months ago
- An algorithm for weight-activation quantization (W4A4, W4A8) of LLMs, supporting both static and dynamic quantization ☆168 · Updated last week
- AdaSplash: Adaptive Sparse Flash Attention (aka Flash Entmax Attention) ☆30 · Updated 2 months ago
- ☆79 · Updated 3 weeks ago
- ☆27 · Updated 8 months ago
- ☆62 · Updated 2 years ago
- Official repository for the paper Local Linear Attention: An Optimal Interpolation of Linear and Softmax Attention For Test-Time Regressi… ☆23 · Updated 2 months ago