Efficient LLM Inference Acceleration using Prompting
☆51 · Updated Oct 22, 2024
Alternatives and similar repositories for parallel-prompt-decoding
Users interested in parallel-prompt-decoding are comparing it to the libraries listed below.
- FPGA-based hardware acceleration for dropout-based Bayesian Neural Networks. ☆27 · Updated Aug 15, 2023
- ☆35 · Updated Feb 10, 2025
- An innovative method expediting LLMs via streamlined semi-autoregressive generation and draft verification. ☆26 · Updated Apr 15, 2025
- ☆21 · Updated Apr 17, 2025
- A list of Flower resources ☆12 · Updated Feb 4, 2022
- ☆15 · Updated Jan 12, 2026
- [ACL 2024 Findings] Light-PEFT: Lightening Parameter-Efficient Fine-Tuning via Early Pruning ☆13 · Updated Sep 2, 2024
- [ICCAD 2025] Squant ☆15 · Updated Jul 3, 2025
- [ICML 2024] When Linear Attention Meets Autoregressive Decoding: Towards More Effective and Efficient Linearized Large Language Models ☆35 · Updated Jun 12, 2024
- ☆16 · Updated Dec 9, 2023
- This repository contains papers for a comprehensive survey on accelerated generation techniques in Large Language Models (LLMs). ☆11 · Updated May 24, 2024
- ☆15 · Updated Apr 11, 2024
- BESA is a differentiable weight pruning technique for large language models. ☆17 · Updated Mar 4, 2024
- ☆19 · Updated Mar 25, 2025
- ☆11 · Updated Feb 5, 2026
- Official repo of the paper "Eliminating Position Bias of Language Models: A Mechanistic Approach" ☆19 · Updated Jun 13, 2025
- A fork of the Flame repo for training new work in development ☆19 · Updated Feb 20, 2026
- Instruct-tuning LLaMA on consumer hardware with machine-translated data ☆19 · Updated Apr 17, 2023
- Federated Learning for Artificial Intelligence and Machine Learning ☆17 · Updated Sep 11, 2025
- AdaSkip: Adaptive Sublayer Skipping for Accelerating Long-Context LLM Inference ☆20 · Updated Jan 24, 2025
- LLM Serving Performance Evaluation Harness ☆83 · Updated Feb 25, 2025
- A survey on deep research ☆47 · Updated Sep 9, 2025
- ☆91 · Updated Aug 18, 2024
- The CaMLSys project template used for researching Federated Learning. ☆23 · Updated this week
- AdaSplash: Adaptive Sparse Flash Attention (aka Flash Entmax Attention) ☆33 · Updated Sep 30, 2025
- The open-source Mixture of Depths code and the official implementation of the paper "Router-Tuning: A Simple and Effective Approach for E…" ☆28 · Updated this week
- Enhancing contextual understanding in large language models through contrastive decoding ☆20 · Updated May 3, 2024
- A Hackable Quantization Library for PyTorch ☆22 · Updated Mar 29, 2021
- [ICCV 2025] Official code for "AIM: Adaptive Inference of Multi-Modal LLMs via Token Merging and Pruning" ☆55 · Updated Oct 9, 2025
- Official Implementation of FastKV: Decoupling of Context Reduction and KV Cache Compression for Prefill-Decoding Acceleration ☆29 · Updated Nov 22, 2025
- Official code and data repository of "MathChat: Benchmarking Mathematical Reasoning and Instruction Following in Multi-Turn Inte…" ☆22 · Updated Jun 3, 2024
- ☆26 · Updated May 30, 2023
- [ICLR 2025] Official Pytorch Implementation of "Mix-LN: Unleashing the Power of Deeper Layers by Combining Pre-LN and Post-LN" by Pengxia… ☆29 · Updated Jul 24, 2025
- Suri: Multi-constraint instruction following for long-form text generation (EMNLP'24) ☆27 · Updated Oct 3, 2025
- Beyond KV Caching: Shared Attention for Efficient LLMs ☆20 · Updated Jul 19, 2024
- [NeurIPS'23] FedL2P: Federated Learning to Personalize ☆24 · Updated Nov 8, 2025
- The official implementation of TinyTrain [ICML '24] ☆24 · Updated Jul 19, 2024
- FireQ: Fast INT4-FP8 Kernel and RoPE-aware Quantization for LLM Inference Acceleration ☆20 · Updated Jun 27, 2025
- ☆105 · Updated Sep 12, 2024