[ICML 2024] CLLMs: Consistency Large Language Models
☆412 · Updated Nov 16, 2024
Alternatives and similar repositories for Consistency_LLM
Users interested in Consistency_LLM are comparing it to the libraries listed below. Two minimal sketches of the core decoding ideas behind these projects (Jacobi decoding and draft-then-verify speculation) follow the list.
- [ICML 2024] Break the Sequential Dependency of LLM Inference Using Lookahead Decoding ☆1,317 · Updated Mar 6, 2025
- Medusa: Simple Framework for Accelerating LLM Generation with Multiple Decoding Heads ☆2,710 · Updated Jun 25, 2024
- [ICML'24 Spotlight] LLM Maybe LongLM: Self-Extend LLM Context Window Without Tuning ☆665 · Updated Jun 1, 2024
- Official Implementation of EAGLE-1 (ICML'24), EAGLE-2 (EMNLP'24), and EAGLE-3 (NeurIPS'25). ☆2,201 · Updated Feb 20, 2026
- Layer-Condensed KV cache w/ 10× larger batch size, fewer params, and less computation; dramatic speed-up with better task performance… ☆156 · Updated Apr 7, 2025
- Serving multiple LoRA-finetuned LLMs as one ☆1,144 · Updated May 8, 2024
- ModuleFormer is a MoE-based architecture that includes two different types of experts: stick-breaking attention heads and feedforward experts ☆226 · Updated Sep 18, 2025
- Memory optimization and training recipes to extrapolate language models' context length to 1 million tokens, with minimal hardware. ☆754 · Updated Sep 27, 2024
- Scalable and robust tree-based speculative decoding algorithm ☆370 · Updated Jan 28, 2025
- S-LoRA: Serving Thousands of Concurrent LoRA Adapters ☆1,899 · Updated Jan 21, 2024
- Code for "LayerSkip: Enabling Early Exit Inference and Self-Speculative Decoding", ACL 2024 ☆360 · Updated Feb 5, 2026
- [NeurIPS'24 Spotlight, ICLR'25, ICML'25] Speeds up long-context LLM inference by computing attention with approximate, dynamic sparsity ☆1,190 · Updated Sep 30, 2025
- A family of open-sourced Mixture-of-Experts (MoE) Large Language Models ☆1,663 · Updated Mar 8, 2024
- Code for the paper "QMoE: Practical Sub-1-Bit Compression of Trillion-Parameter Models". ☆280 · Updated Nov 3, 2023
- [ICLR 2024] Efficient Streaming Language Models with Attention Sinks ☆7,196 · Updated Jul 11, 2024
- [MLSys'25] QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving; [MLSys'25] LServe: Efficient Long-sequence LLM Serving ☆817 · Updated Mar 6, 2025
- Automatically Discovering Fast Parallelization Strategies for Distributed Deep Neural Network Training ☆1,863 · Updated this week
- A throughput-oriented high-performance serving framework for LLMs ☆947 · Updated Oct 29, 2025
- [ICLR 2025] DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads ☆527 · Updated Feb 10, 2025
- [NeurIPS 2024] Efficient LLM Scheduling by Learning to Rank ☆72 · Updated Nov 4, 2024
- Official repository for DistFlashAttn: Distributed Memory-efficient Attention for Long-context LLMs Training ☆222 · Updated Aug 19, 2024
- The official repo for "LLoCo: Learning Long Contexts Offline" ☆118 · Updated Jun 15, 2024
- 📰 Must-read papers and blogs on Speculative Decoding ⚡️ ☆1,131 · Updated Jan 24, 2026
- LightLLM is a Python-based LLM (Large Language Model) inference and serving framework, notable for its lightweight design, easy scalability, and high-speed performance ☆3,919 · Updated this week
- Ring attention implementation with flash attention ☆987 · Updated Sep 10, 2025
- Official implementation of the paper "Linear Transformers with Learnable Kernel Functions are Better In-Context Models" ☆169 · Updated Jan 16, 2025
- Implementation of the paper "Data Engineering for Scaling Language Models to 128K Context" ☆487 · Updated Mar 19, 2024
- Train your own SOTA deductive reasoning model ☆107 · Updated Mar 6, 2025
- Positional Skip-wise Training for Efficient Context Window Extension of LLMs to Extremely Long Contexts (ICLR 2024) ☆209 · Updated May 20, 2024
- [ICML 2024] SqueezeLLM: Dense-and-Sparse Quantization ☆712 · Updated Aug 13, 2024
- [ICML'24] Data and code for our paper "Training-Free Long-Context Scaling of Large Language Models" ☆448 · Updated Oct 16, 2024
- GEAR: An Efficient KV Cache Compression Recipe for Near-Lossless Generative Inference of LLM ☆179 · Updated Jul 12, 2024
- 🚀 Efficient implementations of state-of-the-art linear attention models ☆4,474 · Updated this week
- [ICLR 2024 Spotlight] OmniQuant is a simple and powerful quantization technique for LLMs. ☆890 · Updated Nov 26, 2025
- FlashInfer: Kernel Library for LLM Serving ☆5,057 · Updated this week
- YaRN: Efficient Context Window Extension of Large Language Models ☆1,676 · Updated Apr 17, 2024
- Efficient Triton Kernels for LLM Training ☆6,189 · Updated this week
- MII makes low-latency and high-throughput inference possible, powered by DeepSpeed. ☆2,097 · Updated Jun 30, 2025
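
Most of the projects above attack the same bottleneck: the strictly sequential, token-by-token nature of autoregressive decoding. CLLMs itself builds on Jacobi (parallel fixed-point) decoding, which guesses an entire block of tokens and refines all positions in parallel until the block stops changing. Below is a minimal greedy sketch of that idea; the HF-style `model(...).logits` interface, the block size, and all names are assumptions for illustration, not the CLLMs implementation.

```python
import torch

def jacobi_decode(model, prompt_ids, n_tokens=16, max_iters=32):
    """Greedy Jacobi-decoding sketch (assumed HF-style causal LM).

    Refines an n-token guess in parallel; at a fixed point the block
    equals what greedy autoregressive decoding would have emitted.
    """
    # Arbitrary initial guess: repeat the last prompt token n times.
    guess = prompt_ids[:, -1:].repeat(1, n_tokens)
    for _ in range(max_iters):
        seq = torch.cat([prompt_ids, guess], dim=1)
        logits = model(seq).logits  # assumed shape [1, seq_len, vocab]
        # Positions prompt_len-1 .. prompt_len+n-2 predict the guessed block.
        new_guess = logits[:, prompt_ids.shape[1] - 1 : -1, :].argmax(dim=-1)
        if torch.equal(new_guess, guess):  # fixed point reached
            break
        guess = new_guess
    return torch.cat([prompt_ids, guess], dim=1)
```

Each refinement costs one forward pass over the whole block, so this only wins when the iteration converges in far fewer steps than there are tokens; CLLMs' contribution is fine-tuning the model with a consistency objective over Jacobi trajectories so that it does.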
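
Several other entries (Lookahead Decoding, Medusa, EAGLE, Sequoia, LayerSkip, and the speculative-decoding reading list) are variants of draft-then-verify speculative decoding: a cheap drafter proposes several tokens, and the target model verifies them all in a single forward pass. A minimal greedy sketch, again with assumed HF-style models and illustrative names:

```python
import torch

def speculative_step(draft_model, target_model, input_ids, k=4):
    """One draft-then-verify step of greedy speculative decoding."""
    # 1. Draft: the small model proposes k tokens autoregressively.
    draft_ids = input_ids
    for _ in range(k):
        nxt = draft_model(draft_ids).logits[:, -1, :].argmax(-1, keepdim=True)
        draft_ids = torch.cat([draft_ids, nxt], dim=1)

    # 2. Verify: a single target-model pass scores all k proposals at once.
    target_pred = target_model(draft_ids).logits.argmax(-1)
    proposals = draft_ids[:, input_ids.shape[1]:]
    expected = target_pred[:, input_ids.shape[1] - 1 : -1]

    # 3. Accept the longest prefix on which draft and target agree, plus the
    #    target's own token at the first disagreement (when one exists), so
    #    the output matches greedy decoding with the target model alone.
    match = (proposals == expected)[0].long()
    n_accept = int(match.cumprod(dim=0).sum())
    accepted = expected[:, : n_accept + 1]
    return torch.cat([input_ids, accepted], dim=1)
```

Production implementations sample rather than take the argmax and use a rejection rule that provably preserves the target distribution; the repos above differ mainly in where the draft comes from (a separate small model, extra decoding heads as in Medusa, feature-level prediction as in EAGLE, early-exit layers as in LayerSkip, or token trees as in Sequoia).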