inclusionAI / dInferLinks

dInfer: An Efficient Inference Framework for Diffusion Language Models

☆315

Alternatives and similar repositories for dInfer

Users that are interested in dInfer are comparing it to the libraries listed below

Sorting:

NVlabs / QeRL
QeRL enables RL for 32B LLMs on a single H100 GPU.
☆441Updated last month
thu-nics / MoA
[CoLM'25] The official implementation of the paper <MoA: Mixture of Sparse Attention for Automatic Large Language Model Compression>
☆150Updated 4 months ago
NVlabs / Fast-dLLM
Official implementation of "Fast-dLLM: Training-free Acceleration of Diffusion LLM by Enabling KV Cache and Parallel Decoding"
☆676Updated last month
JinjieNi / MegaDLMs
GPU-optimized framework for training diffusion language models at any scale. The backend of Quokka, Super Data Learners, and OpenMoE 2 tr…
☆272Updated last week
maomaocun / dLLM-cache
Official PyTorch implementation of the paper "dLLM-Cache: Accelerating Diffusion Large Language Models with Adaptive Caching" (dLLM-Cache…
☆185Updated this week
hao-ai-lab / Dynasor
[NeurIPS 2025] Simple extension on vLLM to help you speed up reasoning model without training.
☆206Updated 5 months ago
JT-Ushio / MHA2MLA
Towards Economical Inference: Enabling DeepSeek's Multi-Head Latent Attention in Any Transformer-based LLMs
☆194Updated last month
Infini-AI-Lab / Multiverse
☆103Updated 2 months ago
svg-project / flash-kmeans
Fast and memory-efficient exact kmeans
☆122Updated last week
mit-han-lab / x-attention
[ICML 2025] XAttention: Block Sparse Attention with Antidiagonal Scoring
☆254Updated 4 months ago
ByteDance-Seed / FlexPrefill
Code for paper: [ICLR2025 Oral] FlexPrefill: A Context-Aware Sparse Attention Mechanism for Efficient Long-Sequence Inference
☆154Updated last month
RLsys-Foundation / TritonForge
🔥 LLM-powered GPU kernel synthesis: Train models to convert PyTorch ops into optimized Triton kernels via SFT+RL. Multi-turn compilation…
☆99Updated last week
XunhaoLai / native-sparse-attention-triton
Efficient triton implementation of Native Sparse Attention.
☆248Updated 6 months ago
hao-ai-lab / Awesome-Video-Attention
A curated list of recent papers on efficient video attention for video diffusion models, including sparsification, quantization, and cach…
☆48Updated 3 weeks ago
mit-han-lab / flash-moba
☆143Updated this week
hao-ai-lab / LookaheadReasoning
[NeurIPS 2025] Scaling Speculative Decoding with Lookahead Reasoning
☆52Updated 3 weeks ago
stepfun-ai / Step3
☆435Updated 3 months ago
li-plus / flash-preference
Accelerate LLM preference tuning via prefix sharing with a single line of code
☆51Updated 4 months ago
deepseek-ai / LPLB
An early research stage MoE load balancer based on inear programming.
☆228Updated this week
microsoft / SeerAttention
SeerAttention: Learning Intrinsic Sparse Attention in Your LLMs
☆171Updated last month
tilde-research / nsa-impl
An efficient implementation of the NSA (Native Sparse Attention) kernel
☆124Updated 4 months ago
microsoft / AttentionEngine
☆109Updated 6 months ago
weigao266 / Awesome-Efficient-Arch
Speed Always Wins: A Survey on Efficient Architectures for Large Language Models
☆359Updated last week
Infini-AI-Lab / MagicDec
[ICLR2025] Breaking Throughput-Latency Trade-off for Long Sequences with Speculative Decoding
☆131Updated 11 months ago
xichen-fy / Fira
Fira: Can We Achieve Full-rank Training of LLMs Under Low-rank Constraint?
☆115Updated last year
Gen-Verse / dLLM-RL
TraceRL & TraDo-8B: Revolutionizing Reinforcement Learning Framework for Diffusion Large Language Models
☆317Updated last week
yaof20 / Flash-RL
Implementation for FP8/INT8 Rollout for RL training without performence drop.
☆275Updated 2 weeks ago
sii-research / siiRL
siiRL: Shanghai Innovation Institute RL Framework for Advanced LLMs and Multi-Agent Systems
☆226Updated this week
thu-ml / SLA
SLA: Beyond Sparsity in Diffusion Transformers via Fine-Tunable Sparse–Linear Attention
☆140Updated last week
Dao-AILab / grouped-latent-attention
☆130Updated 5 months ago