kyegomez / Speculative-Decoding
My own implementation of "Fast Inference from Transformers via Speculative Decoding"
☆11 · Updated 11 months ago
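For orientation, the core loop of the paper is: draft k tokens with a small model, score the whole draft with the large model in a single pass, accept each draft token x with probability min(1, p(x)/q(x)), and resample from the residual distribution on the first rejection. Below is a minimal PyTorch sketch of that loop; `draft_model`, `target_model`, and the tensor shapes are illustrative assumptions, not this repository's actual API.

```python
import torch

def speculative_decode_step(target_model, draft_model, prefix, k=4):
    """One draft-and-verify step of speculative decoding (sketch).

    Assumes `target_model(ids)` and `draft_model(ids)` each return a
    [len(ids), vocab] tensor of next-token probabilities; both names
    are hypothetical, not this repository's API.
    """
    # 1. Draft k tokens autoregressively with the cheap model.
    ids, drafted, q_dists = prefix, [], []
    for _ in range(k):
        q = draft_model(ids)[-1]                 # q(. | current context)
        x = torch.multinomial(q, 1)              # sample one draft token
        drafted.append(x.item())
        q_dists.append(q)
        ids = torch.cat([ids, x])

    # 2. Score prefix + draft with the expensive model in a single pass.
    p_all = target_model(ids)                    # [len(ids), vocab]

    # 3. Accept each draft token x with probability min(1, p(x) / q(x)).
    accepted = []
    for i, x in enumerate(drafted):
        p, q = p_all[len(prefix) - 1 + i], q_dists[i]
        if torch.rand(1).item() < min(1.0, (p[x] / q[x]).item()):
            accepted.append(x)                   # draft token kept
        else:
            # First rejection: resample from the residual max(p - q, 0),
            # renormalized, and discard the rest of the draft.
            residual = torch.clamp(p - q, min=0)
            residual /= residual.sum()
            accepted.append(torch.multinomial(residual, 1).item())
            break
    else:
        # Every draft token accepted: take one bonus token from the
        # target distribution that follows the full draft.
        accepted.append(torch.multinomial(p_all[-1], 1).item())
    return accepted                              # 1 to k+1 new token ids
```

This accept/resample rule is what lets the output distribution match the target model exactly while most tokens are produced by the cheaper draft model.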
Related projects
Alternatives and complementary repositories for Speculative-Decoding
- SWIFT: On-the-Fly Self-Speculative Decoding for LLM Inference Acceleration ☆27 · Updated last month
- Fast and Robust Early-Exiting Framework for Autoregressive Language Models with Synchronized Parallel Decoding (EMNLP 2023 Long) ☆53 · Updated last month
- [NeurIPS 2024] Fast Best-of-N Decoding via Speculative Rejection ☆29 · Updated 3 weeks ago
- Ouroboros: Speculative Decoding with Large Model Enhanced Drafting (EMNLP 2024 main) ☆77 · Updated last month
- Official repository for the paper "Weak-to-Strong Extrapolation Expedites Alignment" ☆68 · Updated 5 months ago
- OPT-Tree: Speculative Decoding with Adaptive Draft Tree Structure ☆18 · Updated 3 months ago
- [EMNLP Findings 2024 & ACL 2024 NLRSE Oral] Enhancing Mathematical Reasoning in Language Models with Fine-grained Rewards ☆44 · Updated 6 months ago
- The official repository of the Omni-MATH benchmark. ☆52 · Updated 3 weeks ago
- [ICLR 2024 Spotlight] Code for the paper "Merge, Then Compress: Demystify Efficient SMoE with Hints from Its Routing Policy" ☆64 · Updated 5 months ago
- Homepage for ProLong (Princeton long-context language models) and the paper "How to Train Long-Context Language Models (Effectively)" ☆119 · Updated this week
- [NeurIPS'24] Official code for *🎯DART-Math: Difficulty-Aware Rejection Tuning for Mathematical Problem-Solving* ☆79 · Updated last month
- PyTorch implementation for "Compressed Context Memory For Online Language Model Interaction" (ICLR'24) ☆50 · Updated 7 months ago
- The official implementation of the paper "SimLayerKV: A Simple Framework for Layer-Level KV Cache Reduction" ☆39 · Updated last month
- Code associated with the paper **Draft & Verify: Lossless Large Language Model Acceleration via Self-Speculative Decoding** ☆140 · Updated 6 months ago
- Implementation of Speculative Sampling as described in "Accelerating Large Language Model Decoding with Speculative Sampling" by DeepMind ☆82 · Updated 8 months ago
- [NeurIPS 2024 Spotlight] Code and data for the paper "Finding Transformer Circuits with Edge Pruning". ☆25 · Updated 3 weeks ago
- This repo contains the source code for "Model Tells You What to Discard: Adaptive KV Cache Compression for LLMs" ☆32 · Updated 3 months ago
- Official Implementation of SLEB: Streamlining LLMs through Redundancy Verification and Elimination of Transformer Blocks ☆31 · Updated 4 months ago
- Simple and efficient PyTorch-native transformer training and inference (batched) ☆61 · Updated 7 months ago
- Sirius, an efficient correction mechanism that significantly boosts Contextual Sparsity models on reasoning tasks while maintaining its… ☆19 · Updated 2 months ago
- Layer-Condensed KV cache with 10× larger batch size, fewer params, and less computation. Dramatic speed-up with better task performance… ☆139 · Updated this week
- This is the official repo of "QuickLLaMA: Query-aware Inference Acceleration for Large Language Models" ☆39 · Updated 4 months ago
- Long Context Extension and Generalization in LLMs ☆39 · Updated 2 months ago
- [EMNLP 2023] Context Compression for Auto-regressive Transformers with Sentinel Tokens ☆21 · Updated last year
- Data and code for our paper "Why Does the Effective Context Length of LLMs Fall Short?" ☆64 · Updated last week
- Multi-Candidate Speculative Decoding ☆28 · Updated 7 months ago
- Official PyTorch implementation of DistiLLM: Towards Streamlined Distillation for Large Language Models (ICML 2024) ☆139 · Updated 2 months ago
- Breaking Throughput-Latency Trade-off for Long Sequences with Speculative Decoding ☆79 · Updated last week