feifeibear / LLMSpeculativeSampling
Fast inference from large language models via speculative decoding
☆859 · Updated last year
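As the description says, this repo implements speculative decoding (Leviathan et al., 2023; Chen et al., 2023): a small draft model proposes several tokens, and the large target model verifies them in a single forward pass. Below is a minimal sketch of the core accept/reject step under those papers' formulation; the function name and tensor shapes are illustrative assumptions, not this repo's actual API.

```python
import torch

def speculative_accept(p_probs, q_probs, draft_tokens):
    """One accept/reject round of speculative sampling (illustrative sketch).

    q_probs:      (k, vocab)   draft-model distributions at each draft step
    p_probs:      (k+1, vocab) target-model distributions (one extra position)
    draft_tokens: (k,)         tokens sampled from the draft model
    Returns the accepted prefix plus one corrected or bonus token.
    """
    out = []
    for i, tok in enumerate(draft_tokens.tolist()):
        p_i, q_i = p_probs[i], q_probs[i]
        # Accept the draft token with probability min(1, p(x) / q(x)).
        if torch.rand(()) <= torch.clamp(p_i[tok] / q_i[tok], max=1.0):
            out.append(tok)
        else:
            # On rejection, resample from the renormalized residual
            # max(0, p - q); all later draft tokens are discarded.
            residual = torch.clamp(p_i - q_i, min=0.0)
            out.append(torch.multinomial(residual / residual.sum(), 1).item())
            return out
    # All k drafts accepted: draw one bonus token from the target model.
    out.append(torch.multinomial(p_probs[-1], 1).item())
    return out
```

This accept/reject rule makes every emitted token distributed exactly as if sampled from the target model alone, which is why speculative decoding is lossless; the speedup comes from verifying k draft tokens with one target forward pass instead of k.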
Alternatives and similar repositories for LLMSpeculativeSampling
Users interested in LLMSpeculativeSampling are comparing it to the libraries listed below.
- 📰 Must-read papers and blogs on Speculative Decoding ⚡️ ☆1,040 · Updated this week
- Spec-Bench: A Comprehensive Benchmark and Unified Evaluation Platform for Speculative Decoding (ACL 2024 Findings) ☆338 · Updated 7 months ago
- [NeurIPS'23] H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models. ☆487 · Updated last year
- LongBench v2 and LongBench (ACL '25 & '24) ☆1,032 · Updated 10 months ago
- Analyze the inference of Large Language Models (LLMs). Analyze aspects like computation, storage, transmission, and hardware roofline mod… ☆589 · Updated last year
- Ring attention implementation with flash attention ☆923 · Updated 2 months ago
- ☆348 · Updated last year
- USP: Unified (a.k.a. Hybrid, 2D) Sequence Parallel Attention for Long Context Transformers Model Training and Inference ☆605 · Updated last month
- 📰 Must-read papers on KV Cache Compression (constantly updating 🤗). ☆614 · Updated 2 months ago
- ⛷️ LLaMA-MoE: Building Mixture-of-Experts from LLaMA with Continual Pre-training (EMNLP 2024) ☆996 · Updated last year
- Train speculative decoding models effortlessly and port them smoothly to SGLang serving. ☆523 · Updated this week
- Best practices for training LLaMA models in Megatron-LM ☆661 · Updated last year
- Disaggregated serving system for Large Language Models (LLMs). ☆737 · Updated 7 months ago
- Awesome-LLM-KV-Cache: A curated list of Awesome LLM KV Cache Papers with Codes. ☆393 · Updated 9 months ago
- Official Implementation of "Learning Harmonized Representations for Speculative Sampling" (HASS) ☆51 · Updated 8 months ago
- [NeurIPS'24 Spotlight, ICLR'25, ICML'25] To speed up long-context LLMs' inference, approximate and dynamic sparse calculation of the attention… ☆1,163 · Updated 2 months ago
- A flexible and efficient training framework for large-scale alignment tasks ☆440 · Updated last month
- [ICML 2024] Break the Sequential Dependency of LLM Inference Using Lookahead Decoding ☆1,307 · Updated 9 months ago
- The official repo of Pai-Megatron-Patch for LLM & VLM large-scale training, developed by Alibaba Cloud. ☆1,457 · Updated 3 weeks ago
- [EMNLP 2024 & AAAI 2026] A powerful toolkit for compressing large models, including LLMs, VLMs, and video generation models. ☆632 · Updated 2 weeks ago
- Code associated with the paper "Draft & Verify: Lossless Large Language Model Acceleration via Self-Speculative Decoding" ☆212 · Updated 9 months ago
- [NeurIPS 2023] LLM-Pruner: On the Structural Pruning of Large Language Models. Supports Llama-3/3.1, Llama-2, LLaMA, BLOOM, Vicuna, Baich… ☆1,083 · Updated last year
- [TMLR 2024] Efficient Large Language Models: A Survey ☆1,236 · Updated 5 months ago
- Awesome list for LLM pruning. ☆276 · Updated last month
- Code repo for the paper "LLM-QAT: Data-Free Quantization Aware Training for Large Language Models" ☆323 · Updated 9 months ago
- [ICLR 2024] Sheared LLaMA: Accelerating Language Model Pre-training via Structured Pruning ☆632 · Updated last year
- An Efficient "Factory" to Build Multiple LoRA Adapters ☆357 · Updated 9 months ago
- Latency and Memory Analysis of Transformer Models for Training and Inference ☆466 · Updated 7 months ago
- [MLSys'25] QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving; [MLSys'25] LServe: Efficient Long-sequence LLM Se… ☆786 · Updated 9 months ago
- FlagScale is a large model toolkit based on open-source projects. ☆416 · Updated last week