mobiusml / low-rank-llama2
Low-Rank Llama Custom Training
☆23 · Updated last year
Alternatives and similar repositories for low-rank-llama2
Users interested in low-rank-llama2 are comparing it to the libraries listed below.
- [ICML24] Pruner-Zero: Evolving Symbolic Pruning Metric from scratch for LLMs ☆92 · Updated 9 months ago
- ☆29 · Updated last year
- Repository for Sparse Finetuning of LLMs via a modified version of the MosaicML llmfoundry ☆42 · Updated last year
- Odysseus: Playground of LLM Sequence Parallelism ☆77 · Updated last year
- This repository contains code for the MicroAdam paper. ☆19 · Updated 9 months ago
- ☆23 · Updated 5 months ago
- ☆142 · Updated 7 months ago
- An efficient implementation of the NSA (Native Sparse Attention) kernel ☆115 · Updated 2 months ago
- ☆29 · Updated 10 months ago
- ACL 2023 ☆39 · Updated 2 years ago
- Kinetics: Rethinking Test-Time Scaling Laws ☆80 · Updated 2 months ago
- Fast and memory-efficient exact attention ☆69 · Updated 6 months ago
- ☆126 · Updated 3 months ago
- [ACL 2024] RelayAttention for Efficient Large Language Model Serving with Long System Prompts ☆40 · Updated last year
- Muon fsdp 2 ☆43 · Updated last month
- Boosting 4-bit inference kernels with 2:4 sparsity ☆82 · Updated last year
- Official implementation of SLEB: Streamlining LLMs through Redundancy Verification and Elimination of Transformer Blocks ☆37 · Updated 7 months ago
- [ICML 2024 Oral] This project is the official implementation of our Accurate LoRA-Finetuning Quantization of LLMs via Information Retenti… ☆67 · Updated last year
- Activation-aware Singular Value Decomposition for Compressing Large Language Models ☆78 · Updated 10 months ago
- SQUEEZED ATTENTION: Accelerating Long Prompt LLM Inference ☆52 · Updated 10 months ago
- GEAR: An Efficient KV Cache Compression Recipe for Near-Lossless Generative Inference of LLM ☆168 · Updated last year
- ☆54 · Updated 3 months ago
- The official implementation of the paper: SimLayerKV: A Simple Framework for Layer-Level KV Cache Reduction ☆48 · Updated 11 months ago
- 32 times longer context window than vanilla Transformers and up to 4 times longer than memory-efficient Transformers ☆49 · Updated 2 years ago
- Code for "RSQ: Learning from Important Tokens Leads to Better Quantized LLMs" ☆19 · Updated 3 months ago
- Transformers components but in Triton ☆34 · Updated 4 months ago
- Low-bit optimizers for PyTorch ☆131 · Updated last year
- Official code for Dual Grained Quantization: Efficient Fine-Grained Quantization for LLM ☆14 · Updated last year
- [CoLM'25] The official implementation of the paper "MoA: Mixture of Sparse Attention for Automatic Large Language Model Compression" ☆146 · Updated 2 months ago
- Flash-Muon: An Efficient Implementation of the Muon Optimizer ☆185 · Updated 3 months ago