NVlabs / Fast-dLLMLinks

Official implementation of "Fast-dLLM: Training-free Acceleration of Diffusion LLM by Enabling KV Cache and Parallel Decoding"

☆320

Alternatives and similar repositories for Fast-dLLM

Users that are interested in Fast-dLLM are comparing it to the libraries listed below

Sorting:

maomaocun / dLLM-cache
Official PyTorch implementation of the paper "dLLM-Cache: Accelerating Diffusion Large Language Models with Adaptive Caching" (dLLM-Cache…
☆132Updated this week
mit-han-lab / x-attention
[ICML 2025] XAttention: Block Sparse Attention with Antidiagonal Scoring
☆213Updated 3 weeks ago
ByteDance-Seed / VeOmni
VeOmni: Scaling any Modality Model Training to any Accelerators with PyTorch native Training Framework
☆399Updated this week
NVlabs / COAT
[ICLR 2025] COAT: Compressing Optimizer States and Activation for Memory-Efficient FP8 Training
☆221Updated last month
mit-han-lab / Block-Sparse-Attention
A sparse attention kernel supporting mix sparse patterns
☆262Updated 5 months ago
fla-org / flame
🔥 A minimal training framework for scaling FLA models
☆220Updated last month
stepfun-ai / Step3
☆326Updated this week
microsoft / SeerAttention
SeerAttention: Learning Intrinsic Sparse Attention in Your LLMs
☆141Updated this week
XunhaoLai / native-sparse-attention-triton
Efficient triton implementation of Native Sparse Attention.
☆186Updated 2 months ago
mit-han-lab / duo-attention
[ICLR 2025] DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads
☆480Updated 5 months ago
thu-ml / ReMoE
[ICLR2025] Codebase for "ReMoE: Fully Differentiable Mixture-of-Experts with ReLU Routing", built on Megatron-LM.
☆85Updated 7 months ago
JT-Ushio / MHA2MLA
Towards Economical Inference: Enabling DeepSeek's Multi-Head Latent Attention in Any Transformer-based LLMs
☆183Updated last month
thu-nics / MoA
[CoLM'25] The official implementation of the paper <MoA: Mixture of Sparse Attention for Automatic Large Language Model Compression>
☆141Updated 3 weeks ago
sii-research / siiRL
siiRL: Shanghai Innovation Institute RL Framework for Advanced LLMs and Multi-Agent Systems
☆152Updated this week
horseee / dKV-Cache
☆88Updated 2 months ago
fxmeng / TransMLA
TransMLA: Multi-Head Latent Attention Is All You Need
☆335Updated 3 weeks ago
ByteDance-Seed / FlexPrefill
Code for paper: [ICLR2025 Oral] FlexPrefill: A Context-Aware Sparse Attention Mechanism for Efficient Long-Sequence Inference
☆124Updated 2 months ago
mit-han-lab / Quest
[ICML 2024] Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference
☆311Updated 3 weeks ago
pprp / Awesome-Efficient-MoE
Efficient Mixture of Experts for LLM Paper List
☆87Updated 7 months ago
OpenSparseLLMs / Linear-MoE
☆113Updated 2 months ago
z-lab / sparselora
[ICML 2025] SparseLoRA: Accelerating LLM Fine-Tuning with Contextual Sparsity
☆48Updated last month
LINs-lab / DynMoE
[ICLR 2025] Dynamic Mixture of Experts: An Auto-Tuning Approach for Efficient Transformer Models
☆121Updated 3 weeks ago
yczhou001 / Awesome-Diffusion-LLM
paper list, tutorial, and nano code snippet for Diffusion Large Language Models.
☆92Updated last month
openpsi-project / ReaLHF
Super-Efficient RLHF Training of LLMs with Parameter Reallocation
☆307Updated 3 months ago
FasterDecoding / SnapKV
☆268Updated 3 weeks ago
ruipeterpan / specreason
PoC for "SpecReason: Fast and Accurate Inference-Time Compute via Speculative Reasoning" [arXiv '25]
☆43Updated 3 weeks ago
UNITES-Lab / MC-SMoE
[ICLR 2024 Spotlight] Code for the paper "Merge, Then Compress: Demystify Efficient SMoE with Hints from Its Routing Policy"
☆89Updated last month
GAIR-NLP / MAYE
Rethinking RL Scaling for Vision Language Models: A Transparent, From-Scratch Framework and Comprehensive Evaluation Scheme
☆138Updated 3 months ago
alexzhang13 / flashattention2-custom-mask
Triton implementation of FlashAttention2 that adds Custom Masks.
☆128Updated 11 months ago
NVlabs / MaskLLM
[NeurIPS 24 Spotlight] MaskLLM: Learnable Semi-structured Sparsity for Large Language Models
☆174Updated 7 months ago