flashinfer-ai / flashinfer-bench
Building the Virtuous Cycle for AI-driven LLM Systems
☆140 · Updated this week
Alternatives and similar repositories for flashinfer-bench
Users interested in flashinfer-bench are comparing it to the libraries listed below.
- Autonomous GPU Kernel Generation via Deep Agents ☆223 · Updated this week
- ☆65 · Updated 9 months ago
- [NeurIPS 2025] ClusterFusion: Expanding Operator Fusion Scope for LLM Inference via Cluster-Level Collective Primitive ☆66 · Updated last month
- DeepSeek-V3.2-Exp DSA Warmup Lightning Indexer training operator based on tilelang ☆43 · Updated 2 months ago
- TritonBench: Benchmarking Large Language Model Capabilities for Generating Triton Operators ☆114 · Updated 7 months ago
- Distributed MoE in a Single Kernel [NeurIPS '25] ☆190 · Updated this week
- Sequence-level 1F1B schedule for LLMs. ☆38 · Updated 5 months ago
- Accelerating Large-Scale Reasoning Model Inference with Sparse Self-Speculative Decoding ☆82 · Updated 2 months ago
- NVSHMEM-Tutorial: Build a DeepEP-like GPU Buffer ☆158 · Updated 4 months ago
- [HPCA 2026] A GPU-optimized system for efficient long-context LLM decoding with a low-bit KV cache ☆79 · Updated last month
- FlashInfer Bench @ MLSys 2026: Building AI agents to write high-performance GPU kernels ☆60 · Updated last week
- DeeperGEMM: crazy optimized version ☆73 · Updated 8 months ago
- Nex Venus Communication Library ☆72 · Updated 2 months ago
- [NeurIPS'25 Spotlight] Adaptive Attention Sparsity with Hierarchical Top-p Pruning ☆87 · Updated 2 months ago
- High-performance distributed data shuffling (all-to-all) library for MoE training and inference ☆109 · Updated last month
- Tile-based language built for AI computation across all scales ☆119 · Updated this week
- Utility scripts for PyTorch (e.g. Make Perfetto show some disappearing kernels, Memory profiler that understands more low-level allocatio… ☆82 · Updated 4 months ago
- ☆35 · Updated 10 months ago
- Learning TileLang with 10 puzzles! ☆56 · Updated this week
- nnScaler: Compiling DNN models for Parallel Training ☆124 · Updated 4 months ago
- 16-fold memory access reduction with nearly no loss ☆109 · Updated 10 months ago
- ☆81 · Updated this week
- A lightweight design for computation-communication overlap. ☆213 · Updated 2 weeks ago
- Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative Model Inference with Unstructured Sparsity ☆233 · Updated 2 years ago
- NVIDIA cuTile learn ☆154 · Updated last month
- PyTorch bindings for CUTLASS grouped GEMM. ☆141 · Updated 8 months ago
- [COLM 2024] SKVQ: Sliding-window Key and Value Cache Quantization for Large Language Models ☆25 · Updated last year
- ☆38 · Updated 5 months ago
- ☆83 · Updated 3 months ago
- ☆102 · Updated last year