66RING / CritiPrefill
Code repo for "CritiPrefill: A Segment-wise Criticality-based Approach for Prefilling Acceleration in LLMs".
☆13 · Updated 4 months ago
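As the title suggests, the approach estimates, segment by segment, which key-value blocks are critical for each block of queries and skips attention computation for the rest during prefill. Below is a minimal, illustrative sketch of that idea in plain PyTorch; the function names, mean-pooling choice, and segment/block sizes are assumptions for illustration, not the repo's actual implementation.

```python
import torch

def estimate_criticality(q, k, segment_size=64, block_size=64):
    """Pool queries per segment and keys per block, then score each
    (segment, block) pair by pooled dot-product similarity.
    Hypothetical helper; CritiPrefill's real estimator may differ."""
    n_seg = q.shape[0] // segment_size
    n_blk = k.shape[0] // block_size
    q_seg = q[: n_seg * segment_size].view(n_seg, segment_size, -1).mean(dim=1)
    k_blk = k[: n_blk * block_size].view(n_blk, block_size, -1).mean(dim=1)
    return q_seg @ k_blk.T  # [n_seg, n_blk] criticality scores

def sparse_prefill_mask(q, k, top_k=4, segment_size=64, block_size=64):
    """Keep only the top_k most critical key blocks per query segment
    (causal masking omitted for brevity)."""
    scores = estimate_criticality(q, k, segment_size, block_size)
    keep = torch.topk(scores, k=min(top_k, scores.shape[1]), dim=1).indices
    mask = torch.zeros_like(scores, dtype=torch.bool)
    mask.scatter_(1, keep, True)
    return mask  # segment i attends only to key blocks where mask[i] is True

q = torch.randn(512, 128)  # query states for a 512-token prefill
k = torch.randn(512, 128)  # key states
print(sparse_prefill_mask(q, k).sum(dim=1))  # top_k blocks kept per segment
```

The mask would then gate the segment-by-block attention computation, so prefill cost scales with the number of critical blocks rather than the full sequence length.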
Alternatives and similar repositories for CritiPrefill:
Users interested in CritiPrefill are comparing it to the repositories listed below.
- ☆37 · Updated 3 months ago
- [ACL 2024] Not All Experts are Equal: Efficient Expert Pruning and Skipping for Mixture-of-Experts Large Language Models ☆75 · Updated 7 months ago
- The official implementation of the paper "MoA: Mixture of Sparse Attention for Automatic Large Language Model Compression" ☆110 · Updated last month
- Breaking Throughput-Latency Trade-off for Long Sequences with Speculative Decoding ☆107 · Updated last month
- The official implementation of the paper "Demystifying the Compression of Mixture-of-Experts Through a Unified Framework". ☆52 · Updated 2 months ago
- Pruner-Zero: Evolving Symbolic Pruning Metric from scratch for LLMs ☆76 · Updated last month
- From GaLore to WeLore: How Low-Rank Weights Non-uniformly Emerge from Low-Rank Gradients. Ajay Jaiswal, Lu Yin, Zhenyu Zhang, Shiwei Liu, … ☆42 · Updated 6 months ago
- Repo hosting code and materials related to speeding up LLM inference using token merging. ☆34 · Updated 8 months ago
- ☆56 · Updated 3 months ago
- TidalDecode: Fast and Accurate LLM Decoding with Position Persistent Sparse Attention ☆26 · Updated last month
- ☆38 · Updated 11 months ago
- Official Implementation of SLEB: Streamlining LLMs through Redundancy Verification and Elimination of Transformer Blocks ☆33 · Updated 6 months ago
- The official implementation of the paper "SimLayerKV: A Simple Framework for Layer-Level KV Cache Reduction". ☆42 · Updated 3 months ago
- Code for the paper "HyperRouter: Towards Efficient Training and Inference of Sparse Mixture of Experts via HyperNetwork" ☆31 · Updated last year
- Ouroboros: Speculative Decoding with Large Model Enhanced Drafting (EMNLP 2024 main) ☆85 · Updated 3 months ago
- [ICLR 2024 Spotlight] Code for the paper "Merge, Then Compress: Demystify Efficient SMoE with Hints from Its Routing Policy" ☆70 · Updated 7 months ago
- GEAR: An Efficient KV Cache Compression Recipe for Near-Lossless Generative Inference of LLM ☆152 · Updated 6 months ago
- The official repo for "LLoCo: Learning Long Contexts Offline" ☆114 · Updated 7 months ago
- EE-LLM is a framework for large-scale training and inference of early-exit (EE) large language models (LLMs). ☆52 · Updated 7 months ago
- ☆124 · Updated 11 months ago
- Cascade Speculative Drafting ☆28 · Updated 9 months ago
- [ACL 2024] RelayAttention for Efficient Large Language Model Serving with Long System Prompts ☆38 · Updated 10 months ago
- This repository contains the code for the paper "SirLLM: Streaming Infinite Retentive LLM". ☆56 · Updated 7 months ago
- KV cache compression for high-throughput LLM inference ☆104 · Updated last month
- Transformers components but in Triton ☆29 · Updated 2 months ago
- ☆107 · Updated 3 months ago
- OPT-Tree: Speculative Decoding with Adaptive Draft Tree Structure ☆19 · Updated 5 months ago
- This repo contains the source code for "Model Tells You What to Discard: Adaptive KV Cache Compression for LLMs" ☆32 · Updated 5 months ago
- The official implementation of "Ada-LEval: Evaluating long-context LLMs with length-adaptable benchmarks" ☆52 · Updated 8 months ago
- ☆69 · Updated this week