feifeibear / LLMSpeculativeSampling
Fast inference from large language models via speculative decoding
☆762 Updated 10 months ago
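For context, the repository's core technique is the draft-then-verify acceptance rule described in the speculative sampling papers (Leviathan et al., 2023; Chen et al., 2023). The sketch below is a minimal NumPy illustration of that rule only; the function name and array layout are invented for this example and are not the repository's actual API.

```python
import numpy as np

def speculative_sampling_step(p_target, q_draft, draft_tokens, rng=None):
    """One verify/accept pass of speculative sampling (illustrative sketch).

    p_target: (k+1, vocab) target-model probabilities at each draft position
              (row k is the distribution after the last draft token).
    q_draft:  (k, vocab)   draft-model probabilities used to propose draft_tokens.
    draft_tokens: the k proposed token ids.
    Returns the tokens kept this step: accepted drafts plus one corrected or bonus token.
    """
    rng = rng or np.random.default_rng()
    kept = []
    for i, x in enumerate(draft_tokens):
        # Accept draft token x with probability min(1, p(x) / q(x)).
        if rng.random() < min(1.0, p_target[i, x] / q_draft[i, x]):
            kept.append(x)
            continue
        # Rejected: resample from the residual distribution max(0, p - q),
        # renormalized, and discard all later draft tokens.
        residual = np.maximum(p_target[i] - q_draft[i], 0.0)
        kept.append(rng.choice(residual.size, p=residual / residual.sum()))
        return kept
    # All k drafts accepted: take one bonus token from the target's next distribution.
    kept.append(rng.choice(p_target.shape[1], p=p_target[-1]))
    return kept
```

Because rejected positions are resampled from the residual distribution, the accepted tokens are distributed exactly as if they had been sampled from the target model alone; the draft model only changes how many target forward passes are needed per token.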
Alternatives and similar repositories for LLMSpeculativeSampling
Users interested in LLMSpeculativeSampling are comparing it to the libraries listed below.
- 📰 Must-read papers and blogs on Speculative Decoding ⚡️ ☆800 Updated last week
- Spec-Bench: A Comprehensive Benchmark and Unified Evaluation Platform for Speculative Decoding (ACL 2024 Findings) ☆282 Updated 2 months ago
- Official Implementation of EAGLE-1 (ICML'24), EAGLE-2 (EMNLP'24), and EAGLE-3. ☆1,337 Updated 2 weeks ago
- [NeurIPS'23] H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models. ☆454 Updated 10 months ago
- [NeurIPS 2023] LLM-Pruner: On the Structural Pruning of Large Language Models. Support Llama-3/3.1, Llama-2, LLaMA, BLOOM, Vicuna, Baich… ☆1,026 Updated 8 months ago
- LongBench v2 and LongBench (ACL '25 & '24) ☆908 Updated 5 months ago
- [ICLR 2024] Sheared LLaMA: Accelerating Language Model Pre-training via Structured Pruning ☆617 Updated last year
- USP: Unified (a.k.a. Hybrid, 2D) Sequence Parallel Attention for Long Context Transformers Model Training and Inference ☆519 Updated 3 weeks ago
- [ICML 2024] Break the Sequential Dependency of LLM Inference Using Lookahead Decoding ☆1,258 Updated 3 months ago
- Ring attention implementation with flash attention ☆789 Updated last week
- ☆328 Updated last year
- Analyze the inference of Large Language Models (LLMs). Analyze aspects like computation, storage, transmission, and hardware roofline mod… ☆487 Updated 9 months ago
- Explorations into some recent techniques surrounding speculative decoding ☆269 Updated 6 months ago
- A repository sharing the literature on long-context large language models, including the methodologies and the evaluation benchmarks ☆263 Updated 10 months ago
- 📰 Must-read papers on KV Cache Compression (constantly updating 🤗). ☆459 Updated this week
- [MLSys'25] QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving; [MLSys'25] LServe: Efficient Long-sequence LLM Se… ☆705 Updated 3 months ago
- Code associated with the paper **Draft & Verify: Lossless Large Language Model Acceleration via Self-Speculative Decoding** ☆190 Updated 4 months ago
- Best practice for training LLaMA models in Megatron-LM ☆656 Updated last year
- FP16xINT4 LLM inference kernel that can achieve near-ideal ~4x speedups up to medium batch sizes of 16-32 tokens. ☆846 Updated 9 months ago
- ⛷️ LLaMA-MoE: Building Mixture-of-Experts from LLaMA with Continual Pre-training (EMNLP 2024) ☆967 Updated 6 months ago
- REST: Retrieval-Based Speculative Decoding, NAACL 2024 ☆204 Updated 6 months ago
- ☆256 Updated last year
- Code for the paper "∞Bench: Extending Long Context Evaluation Beyond 100K Tokens": https://arxiv.org/abs/2402.13718 ☆336 Updated 9 months ago
- [NeurIPS'24 Spotlight, ICLR'25, ICML'25] To speed up long-context LLMs' inference, approximate and dynamic sparse calculation of the attention… ☆1,057 Updated this week
- [EMNLP 2024 Industry Track] This is the official PyTorch implementation of "LLMC: Benchmarking Large Language Model Quantization with a V… ☆492 Updated this week
- Implementation of Speculative Sampling as described in "Accelerating Large Language Model Decoding with Speculative Sampling" by DeepMind ☆98 Updated last year
- Microsoft Automatic Mixed Precision Library ☆610 Updated 8 months ago
- Code repo for the paper "LLM-QAT: Data-Free Quantization Aware Training for Large Language Models" ☆300 Updated 3 months ago
- Awesome list for LLM pruning. ☆232 Updated 6 months ago
- A simple and effective LLM pruning approach. ☆763 Updated 10 months ago