inclusionAI / dInfer
dInfer: An Efficient Inference Framework for Diffusion Language Models
☆262 · Updated last week
Alternatives and similar repositories for dInfer
Users that are interested in dInfer are comparing it to the libraries listed below
- ☆102 · Updated 5 months ago
- CUDA-L1: Improving CUDA Optimization via Contrastive Reinforcement Learning · ☆195 · Updated this week
- ☆651 · Updated this week
- [NeurIPS 2025] Scaling Speculative Decoding with Lookahead Reasoning · ☆51 · Updated this week
- Accelerate LLM preference tuning via prefix sharing with a single line of code · ☆52 · Updated 3 months ago
- [ACL 2024] RelayAttention for Efficient Large Language Model Serving with Long System Prompts · ☆40 · Updated last year
- [WIP] Better (FP8) attention for Hopper · ☆33 · Updated 8 months ago
- Fast and memory-efficient exact kmeans · ☆113 · Updated last week
- An efficient implementation of the NSA (Native Sparse Attention) kernel · ☆121 · Updated 4 months ago
- QeRL enables RL for 32B LLMs on a single H100 GPU. · ☆384 · Updated 2 weeks ago
- Training-free Post-training Efficient Sub-quadratic Complexity Attention. Implemented with OpenAI Triton. · ☆147 · Updated 2 weeks ago
- ☆58 · Updated 5 months ago
- KV cache compression for high-throughput LLM inference · ☆143 · Updated 8 months ago
- Samples of good AI generated CUDA kernels · ☆91 · Updated 5 months ago
- ☆37 · Updated 5 months ago
- ☆103 · Updated this week
- A curated list of recent papers on efficient video attention for video diffusion models, including sparsification, quantization, and cach… · ☆41 · Updated last month
- [CoLM'25] The official implementation of the paper <MoA: Mixture of Sparse Attention for Automatic Large Language Model Compression> · ☆147 · Updated 3 months ago
- 🔥 LLM-powered GPU kernel synthesis: Train models to convert PyTorch ops into optimized Triton kernels via SFT+RL. Multi-turn compilation… · ☆89 · Updated 3 weeks ago
- DeepGEMM: clean and efficient FP8 GEMM kernels with fine-grained scaling · ☆21 · Updated last week
- Train, tune, and infer Bamba model · ☆135 · Updated 4 months ago
- Flash-Muon: An Efficient Implementation of Muon Optimizer · ☆197 · Updated 4 months ago
- [ICLR 2025] Breaking Throughput-Latency Trade-off for Long Sequences with Speculative Decoding · ☆130 · Updated 10 months ago
- Checkpoint-engine is a simple middleware to update model weights in LLM inference engines · ☆793 · Updated this week
- Multi-Turn RL Training System with AgentTrainer for Language Model Game Reinforcement Learning · ☆48 · Updated 2 weeks ago
- A collection of tricks and tools to speed up transformer models · ☆182 · Updated this week
- Work in progress. · ☆74 · Updated 4 months ago
- A safetensors extension to efficiently store sparse quantized tensors on disk · ☆183 · Updated this week
- 👷 Build compute kernels · ☆163 · Updated this week
- ☆50 · Updated 5 months ago