feifeibear / LLMSpeculativeSampling
Fast inference from large language models via speculative decoding
★841 · Updated last year
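The repository implements the draft-then-verify loop of speculative decoding: a small draft model proposes several tokens, then the target model verifies them, accepting each with probability min(1, p/q) and resampling from the residual distribution on the first rejection, so the output matches target-only sampling exactly. A minimal sketch of that acceptance rule over toy distributions (the `speculative_step` helper and its argument layout are illustrative, not taken from the repo):

```python
import random

def speculative_step(target_probs, draft_probs, draft_tokens, rng=random.random):
    """One verification pass of speculative sampling.

    target_probs[i][t] / draft_probs[i][t]: probability the target / draft
    model assigns to token t at draft position i.
    draft_tokens: the tokens proposed by the draft model.
    Returns the accepted prefix; on the first rejection, appends one token
    resampled from the residual max(0, p - q), renormalized.
    """
    out = []
    for i, tok in enumerate(draft_tokens):
        p, q = target_probs[i][tok], draft_probs[i][tok]
        if rng() < min(1.0, p / q):        # accept with probability min(1, p/q)
            out.append(tok)
            continue
        # Rejected: resample from the residual distribution so the overall
        # sample is distributed exactly as under the target model.
        residual = {t: max(0.0, target_probs[i][t] - draft_probs[i][t])
                    for t in target_probs[i]}
        z = sum(residual.values())
        r, acc = rng() * z, 0.0
        for t, w in residual.items():
            acc += w
            if r <= acc:
                out.append(t)
                break
        break  # stop verifying at the first rejection
    return out
```

The rejection resampling step is what makes the method lossless: accepted-or-resampled tokens are provably distributed according to the target model, so the draft model only affects speed, never output quality.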
Alternatives and similar repositories for LLMSpeculativeSampling
Users interested in LLMSpeculativeSampling are comparing it to the libraries listed below.
- 📰 Must-read papers and blogs on Speculative Decoding ⚡️ ★988 · Updated this week
- Spec-Bench: A Comprehensive Benchmark and Unified Evaluation Platform for Speculative Decoding (ACL 2024 Findings) ★320 · Updated 6 months ago
- [NeurIPS'23] H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models. ★482 · Updated last year
- Analyze the inference of Large Language Models (LLMs). Analyze aspects like computation, storage, transmission, and hardware roofline mod… ★567 · Updated last year
- Ring attention implementation with flash attention ★903 · Updated last month
- LongBench v2 and LongBench (ACL 25'&24') ★997 · Updated 9 months ago
- 📰 Must-read papers on KV Cache Compression (constantly updating 🤗). ★566 · Updated 3 weeks ago
- USP: Unified (a.k.a. Hybrid, 2D) Sequence Parallel Attention for Long Context Transformers Model Training and Inference ★582 · Updated last week
- ★343 · Updated last year
- ⛷️ LLaMA-MoE: Building Mixture-of-Experts from LLaMA with Continual Pre-training (EMNLP 2024) ★994 · Updated 10 months ago
- [TMLR 2024] Efficient Large Language Models: A Survey ★1,221 · Updated 4 months ago
- [NeurIPS 2023] LLM-Pruner: On the Structural Pruning of Large Language Models. Support Llama-3/3.1, Llama-2, LLaMA, BLOOM, Vicuna, Baich… ★1,072 · Updated last year
- [ICML 2024] Break the Sequential Dependency of LLM Inference Using Lookahead Decoding ★1,288 · Updated 7 months ago
- Disaggregated serving system for Large Language Models (LLMs). ★709 · Updated 6 months ago
- Train speculative decoding models effortlessly and port them smoothly to SGLang serving. ★439 · Updated this week
- Best practice for training LLaMA models in Megatron-LM ★659 · Updated last year
- [NeurIPS'24 Spotlight, ICLR'25, ICML'25] To speed up Long-context LLMs' inference, approximate and dynamic sparse calculate the attention… ★1,141 · Updated 3 weeks ago
- Awesome list for LLM pruning. ★267 · Updated 2 weeks ago
- [MLSys'25] QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving; [MLSys'25] LServe: Efficient Long-sequence LLM Se… ★770 · Updated 7 months ago
- A flexible and efficient training framework for large-scale alignment tasks ★433 · Updated this week
- Awesome-LLM-KV-Cache: A curated list of Awesome LLM KV Cache Papers with Codes. ★376 · Updated 7 months ago
- Code associated with the paper **Draft & Verify: Lossless Large Language Model Acceleration via Self-Speculative Decoding** ★205 · Updated 8 months ago
- The official repo of Pai-Megatron-Patch for LLM & VLM large scale training developed by Alibaba Cloud. ★1,395 · Updated this week
- [ICLR 2024] Sheared LLaMA: Accelerating Language Model Pre-training via Structured Pruning ★631 · Updated last year
- A powerful toolkit for compressing large models including LLM, VLM, and video generation models. ★593 · Updated 2 months ago
- Code repo for the paper "LLM-QAT: Data-Free Quantization Aware Training for Large Language Models" ★317 · Updated 7 months ago
- Official Implementation of EAGLE-1 (ICML'24), EAGLE-2 (EMNLP'24), and EAGLE-3 (NeurIPS'25). ★1,884 · Updated last week
- ★284 · Updated 3 months ago
- FlagScale is a large model toolkit based on open-sourced projects. ★362 · Updated last week
- Official Implementation of "Learning Harmonized Representations for Speculative Sampling" (HASS) ★47 · Updated 7 months ago