krafton-ai / lexico
KV cache compression via sparse coding
☆14 · Updated last week
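lexico's one-line description points at a general recipe: approximate each cached key/value vector as a sparse combination of atoms from a shared dictionary, and store only the few nonzero coefficients instead of the dense vector. The sketch below is a minimal illustration of that idea, not lexico's actual code; the dictionary `D`, the sparsity level `k`, and the greedy matching-pursuit encoder are all illustrative assumptions.

```python
# Minimal sketch (not lexico's method) of KV cache compression via sparse
# coding: each cached key/value vector is approximated by k atoms from a
# shared dictionary, so only k (index, coefficient) pairs are stored.
import numpy as np

def omp_encode(x, D, k):
    """Greedy orthogonal matching pursuit: pick up to k atoms for x."""
    residual = x.copy()
    support = []
    for _ in range(k):
        # Atom most correlated with the current residual.
        idx = int(np.argmax(np.abs(D.T @ residual)))
        if idx not in support:
            support.append(idx)
        # Re-fit coefficients on the chosen atoms via least squares.
        coeffs, *_ = np.linalg.lstsq(D[:, support], x, rcond=None)
        residual = x - D[:, support] @ coeffs
    return support, coeffs

def compress_kv(kv, D, k):
    """Encode each row of the KV cache as k (atom index, coefficient) pairs."""
    return [omp_encode(row, D, k) for row in kv]

def decompress_kv(codes, D, dim):
    """Reconstruct dense KV vectors from their sparse codes on demand."""
    out = np.zeros((len(codes), dim))
    for i, (support, coeffs) in enumerate(codes):
        out[i] = D[:, support] @ coeffs
    return out

rng = np.random.default_rng(0)
dim, n_atoms, n_tokens, k = 64, 256, 8, 4
D = rng.standard_normal((dim, n_atoms))
D /= np.linalg.norm(D, axis=0)            # unit-norm dictionary atoms
kv = rng.standard_normal((n_tokens, dim))  # stand-in for cached K or V rows
codes = compress_kv(kv, D, k)
kv_hat = decompress_kv(codes, D, dim)
print("relative error:", np.linalg.norm(kv - kv_hat) / np.linalg.norm(kv))
```

The saving comes from the storage format: instead of `dim` floats per token per head, the cache holds only `k` index/coefficient pairs plus one shared dictionary, at the cost of a reconstruction step at read time.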
Alternatives and similar repositories for lexico
Users interested in lexico are comparing it to the repositories listed below.
- ☆17 · Updated last year
- The implementation of FREE-Merging: Fourier Transform for Model Merging with Lightweight Experts (ICCV 2025) ☆10 · Updated 4 months ago
- ☆15 · Updated 11 months ago
- Official implementation of FastKV: KV Cache Compression for Fast Long-Context Processing with Token-Selective Propagation ☆25 · Updated 5 months ago
- Official implementation of "Pruning Large Language Models with Semi-Structural Adaptive Sparse Training" (AAAI 2025) ☆15 · Updated 4 months ago
- Official code for the paper "HEXA-MoE: Efficient and Heterogeneous-Aware MoE Acceleration with Zero Computation Redundancy" ☆13 · Updated 8 months ago
- 2D-TPE: Two-Dimensional Positional Encoding Enhances Table Understanding for Large Language Models (WWW 2025) ☆10 · Updated 6 months ago
- ☆61 · Updated 4 months ago
- ☆18 · Updated 8 months ago
- ☆10 · Updated last year
- [NeurIPS 2025] dKV-Cache: The Cache for Diffusion Language Models ☆114 · Updated 5 months ago
- Activation-aware Singular Value Decomposition for Compressing Large Language Models ☆80 · Updated last year
- PoC for "SpecReason: Fast and Accurate Inference-Time Compute via Speculative Reasoning" [NeurIPS 2025] ☆57 · Updated last month
- An unofficial implementation of "Mixture-of-Depths: Dynamically Allocating Compute in Transformer-Based Language Models" ☆36 · Updated last year
- ☆29 · Updated 5 months ago
- ☆81 · Updated 4 months ago
- An efficient implementation of the NSA (Native Sparse Attention) kernel ☆124 · Updated 4 months ago
- [NeurIPS 2025] Official implementation of "Reasoning Path Compression: Compressing Generation Trajectories for Efficient LLM Reasoning" ☆24 · Updated 2 weeks ago
- ☆26 · Updated 3 months ago
- [ACL 2024] Not All Experts Are Equal: Efficient Expert Pruning and Skipping for Mixture-of-Experts Large Language Models ☆108 · Updated last year
- [ICML 2024] Official implementation of SLEB: Streamlining LLMs through Redundancy Verification and Elimination of Transformer Blocks ☆37 · Updated 9 months ago
- ☆100 · Updated last month
- [ICLR 2025 Oral] Code for the paper "FlexPrefill: A Context-Aware Sparse Attention Mechanism for Efficient Long-Sequence Inference" ☆151 · Updated 3 weeks ago
- ☆12 · Updated 3 months ago
- Work in progress. ☆74 · Updated 4 months ago
- [NeurIPS 2025 Oral] The official implementation of "Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink…" ☆101 · Updated last month
- HALO: Hadamard-Assisted Low-Precision Optimization and Training method for fine-tuning LLMs. 🚀 The official implementation of https://arx… ☆28 · Updated 8 months ago
- [ICML 2024] Pruner-Zero: Evolving Symbolic Pruning Metric from Scratch for LLMs ☆94 · Updated 11 months ago
- The open-source materials for the paper "Sparsing Law: Towards Large Language Models with Greater Activation Sparsity" ☆27 · Updated 11 months ago
- [ICML 2024] When Linear Attention Meets Autoregressive Decoding: Towards More Effective and Efficient Linearized Large Language Models ☆36 · Updated last year