feifeibear / LLMSpeculativeSampling
Fast inference from large language models via speculative decoding
☆779 · Updated 10 months ago
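The repo above implements speculative decoding: a cheap draft model proposes tokens, and the large target model verifies them with an accept/reject rule that preserves the target distribution exactly. This is not code from the repository itself, just a minimal single-token sketch of the standard acceptance rule (accept a draft token x with probability min(1, p(x)/q(x)); on rejection, resample from the renormalized residual max(0, p − q)), over a toy categorical vocabulary:

```python
import random

def speculative_accept(p, q, rng):
    """One verification step of speculative sampling over a toy vocabulary.

    p: target-model probabilities, q: draft-model probabilities
    (both lists summing to 1, q strictly positive). Returns a token
    index distributed exactly according to p, while usually only
    needing the cheap draft sample.
    """
    vocab = list(range(len(p)))
    # Draft model proposes a token.
    x = rng.choices(vocab, weights=q)[0]
    # Accept with probability min(1, p[x] / q[x]).
    if rng.random() < min(1.0, p[x] / q[x]):
        return x
    # On rejection, resample from the residual max(0, p - q), renormalized.
    residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    z = sum(residual)
    return rng.choices(vocab, weights=[r / z for r in residual])[0]

# Empirical check: outputs follow the target distribution p, not the draft q.
p = [0.6, 0.3, 0.1]
q = [0.3, 0.4, 0.3]
counts = [0, 0, 0]
rng = random.Random(42)
for _ in range(10_000):
    counts[speculative_accept(p, q, rng)] += 1
```

In the full algorithm the draft model proposes several tokens at once and the target model verifies the whole block in a single forward pass, which is where the speedup comes from; the acceptance rule per token is the one sketched here.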
Alternatives and similar repositories for LLMSpeculativeSampling
Users interested in LLMSpeculativeSampling are comparing it to the libraries listed below.
- Must-read papers and blogs on Speculative Decoding ☆822 · Updated 3 weeks ago
- [NeurIPS'23] H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models ☆459 · Updated 11 months ago
- Spec-Bench: A Comprehensive Benchmark and Unified Evaluation Platform for Speculative Decoding (ACL 2024 Findings) ☆285 · Updated 2 months ago
- Official Implementation of EAGLE-1 (ICML'24), EAGLE-2 (EMNLP'24), and EAGLE-3 ☆1,384 · Updated this week
- ☆330 · Updated last year
- LongBench v2 and LongBench (ACL '25 & '24) ☆926 · Updated 6 months ago
- Analyze the inference of Large Language Models (LLMs). Analyze aspects like computation, storage, transmission, and hardware roofline mod… ☆506 · Updated 10 months ago
- [NeurIPS 2023] LLM-Pruner: On the Structural Pruning of Large Language Models. Supports Llama-3/3.1, Llama-2, LLaMA, BLOOM, Vicuna, Baich… ☆1,031 · Updated 9 months ago
- LLaMA-MoE: Building Mixture-of-Experts from LLaMA with Continual Pre-training (EMNLP 2024) ☆973 · Updated 7 months ago
- Must-read papers on KV Cache Compression (constantly updating) ☆481 · Updated 2 weeks ago
- USP: Unified (a.k.a. Hybrid, 2D) Sequence Parallel Attention for Long-Context Transformer Model Training and Inference ☆528 · Updated last month
- slime is an LLM post-training framework aiming at RL scaling ☆596 · Updated this week
- [ICLR 2024] Sheared LLaMA: Accelerating Language Model Pre-training via Structured Pruning ☆621 · Updated last year
- Best practices for training LLaMA models in Megatron-LM ☆657 · Updated last year
- [TMLR 2024] Efficient Large Language Models: A Survey ☆1,181 · Updated 3 weeks ago
- Ring attention implementation with flash attention ☆800 · Updated last week
- [ICML 2024] Break the Sequential Dependency of LLM Inference Using Lookahead Decoding ☆1,259 · Updated 4 months ago
- Code associated with the paper "Draft & Verify: Lossless Large Language Model Acceleration via Self-Speculative Decoding" ☆194 · Updated 5 months ago
- An Efficient "Factory" to Build Multiple LoRA Adapters ☆322 · Updated 5 months ago
- Awesome list for LLM pruning ☆239 · Updated 7 months ago
- Disaggregated serving system for Large Language Models (LLMs) ☆639 · Updated 3 months ago
- Awesome-LLM-KV-Cache: A curated list of Awesome LLM KV Cache Papers with Codes ☆325 · Updated 4 months ago
- [NeurIPS'24 Spotlight, ICLR'25, ICML'25] To speed up long-context LLMs' inference, approximate and dynamic sparse calculate the attention… ☆1,067 · Updated 3 weeks ago
- [MLSys'25] QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving; [MLSys'25] LServe: Efficient Long-sequence LLM Se… ☆717 · Updated 4 months ago
- A flexible and efficient training framework for large-scale alignment tasks ☆388 · Updated this week
- Official Implementation of "Learning Harmonized Representations for Speculative Sampling" (HASS) ☆42 · Updated 4 months ago
- The official repo of Pai-Megatron-Patch for LLM & VLM large-scale training, developed by Alibaba Cloud ☆1,200 · Updated last week
- ☆261 · Updated last year
- Explorations into some recent techniques surrounding speculative decoding ☆272 · Updated 6 months ago
- Code repo for the paper "LLM-QAT: Data-Free Quantization Aware Training for Large Language Models" ☆302 · Updated 4 months ago