feifeibear / LLMSpeculativeSampling
Fast inference from large language models via speculative decoding
⭐815 · Updated last year
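For context on the technique this repository implements, here is a toy sketch of the standard speculative-sampling acceptance rule (from the Leviathan et al. and Chen et al. 2023 papers); this is an illustration under the usual formulation, not code taken from LLMSpeculativeSampling, and the function name is hypothetical:

```python
import random

def accept_draft_token(p: float, q: float, rng=random.random) -> bool:
    """Accept a token proposed by the draft model with probability min(1, p/q).

    p: target-model probability of the drafted token
    q: draft-model probability of the same token
    """
    # If the target model likes the token at least as much as the draft
    # model (p >= q), it is always accepted; otherwise it survives with
    # probability p/q. On rejection, the algorithm resamples from the
    # normalized residual distribution max(0, p - q) to keep the output
    # distribution identical to sampling from the target model alone.
    return rng() < min(1.0, p / q)
```

Passing `rng` explicitly makes the rule easy to test deterministically; in practice `p` and `q` come from the target and draft models' softmax outputs for the drafted token.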
Alternatives and similar repositories for LLMSpeculativeSampling
Users interested in LLMSpeculativeSampling also compare it to the libraries listed below.
- 📰 Must-read papers and blogs on Speculative Decoding ⚡️ · ⭐917 · Updated last week
- Spec-Bench: A Comprehensive Benchmark and Unified Evaluation Platform for Speculative Decoding (ACL 2024 Findings) · ⭐310 · Updated 4 months ago
- [NeurIPS'23] H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models · ⭐473 · Updated last year
- LongBench v2 and LongBench (ACL '25 & '24) · ⭐963 · Updated 8 months ago
- Ring attention implementation with flash attention · ⭐864 · Updated last month
- USP: Unified (a.k.a. Hybrid, 2D) Sequence Parallel Attention for Long Context Transformers Model Training and Inference · ⭐557 · Updated last month
- Official Implementation of EAGLE-1 (ICML'24), EAGLE-2 (EMNLP'24), and EAGLE-3 · ⭐1,635 · Updated this week
- Analyze the inference of Large Language Models (LLMs): aspects like computation, storage, transmission, and hardware roofline mod… · ⭐547 · Updated last year
- [TMLR 2024] Efficient Large Language Models: A Survey · ⭐1,212 · Updated 2 months ago
- 📰 Must-read papers on KV Cache Compression (constantly updating) · ⭐531 · Updated last month
- Best practice for training LLaMA models in Megatron-LM · ⭐661 · Updated last year
- ⛷️ LLaMA-MoE: Building Mixture-of-Experts from LLaMA with Continual Pre-training (EMNLP 2024) · ⭐986 · Updated 9 months ago
- [NeurIPS 2023] LLM-Pruner: On the Structural Pruning of Large Language Models. Supports Llama-3/3.1, Llama-2, LLaMA, BLOOM, Vicuna, Baich… · ⭐1,058 · Updated 11 months ago
- [ICML 2024] Break the Sequential Dependency of LLM Inference Using Lookahead Decoding · ⭐1,276 · Updated 6 months ago
- The official repo of Pai-Megatron-Patch for LLM & VLM large-scale training, developed by Alibaba Cloud · ⭐1,341 · Updated this week
- Code associated with the paper "Draft & Verify: Lossless Large Language Model Acceleration via Self-Speculative Decoding" · ⭐201 · Updated 7 months ago
- [ICLR 2024] Sheared LLaMA: Accelerating Language Model Pre-training via Structured Pruning · ⭐630 · Updated last year
- Awesome list for LLM pruning · ⭐257 · Updated this week
- Train speculative decoding models effortlessly and port them smoothly to SGLang serving · ⭐374 · Updated this week
- Disaggregated serving system for Large Language Models (LLMs) · ⭐685 · Updated 5 months ago
- Official Implementation of "Learning Harmonized Representations for Speculative Sampling" (HASS) · ⭐45 · Updated 6 months ago
- Code repo for the paper "LLM-QAT: Data-Free Quantization Aware Training for Large Language Models" · ⭐312 · Updated 6 months ago
- [MLSys'25] QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving; [MLSys'25] LServe: Efficient Long-sequence LLM Se… · ⭐753 · Updated 6 months ago
- [NeurIPS'24 Spotlight, ICLR'25, ICML'25] Speeds up long-context LLM inference by approximately and dynamically sparse-computing the attention… · ⭐1,124 · Updated last month
- Awesome-LLM-KV-Cache: A curated list of Awesome LLM KV Cache Papers with Codes · ⭐358 · Updated 6 months ago
- A powerful toolkit for compressing large models, including LLMs, VLMs, and video generation models · ⭐559 · Updated 3 weeks ago
- Explorations into some recent techniques surrounding speculative decoding · ⭐285 · Updated 8 months ago
- A flexible and efficient training framework for large-scale alignment tasks · ⭐422 · Updated this week