batch_invariant_ops (☆968, updated Nov 4, 2025)
Alternatives and similar repositories for batch_invariant_ops
Users interested in batch_invariant_ops are comparing it to the libraries listed below.
- Benchmark tests supporting the TiledCUDA library. (☆18, updated Nov 19, 2024)
- SkyRL: A Modular Full-stack RL Library for LLMs (☆1,628, updated this week)
- Implementation from scratch in C of the multi-head latent attention used in the Deepseek-v3 technical paper. (☆18, updated Jan 15, 2025)
- (☆39, updated Dec 14, 2025)
- Accelerate LLM preference tuning via prefix sharing with a single line of code (☆51, updated Jul 4, 2025)
- Efficient Long-context Language Model Training by Core Attention Disaggregation (☆91, updated Feb 23, 2026)
- DeeperGEMM: crazy optimized version (☆74, updated May 5, 2025)
- Distributed Compiler based on Triton for Parallel Systems (☆1,371, updated Feb 13, 2026)
- (☆52, updated May 19, 2025)
- Supporting code for the blog post on modular manifolds. (☆117, updated Sep 26, 2025)
- Tile primitives for speedy kernels (☆3,202, updated Feb 24, 2026)
- slime is an LLM post-training framework for RL Scaling. (☆4,536, updated this week)
- 🚀 Efficient implementations of state-of-the-art linear attention models (☆4,428, updated this week)
- Muon is Scalable for LLM Training (☆1,440, updated Aug 3, 2025)
- kernels, of the mega variety (☆684, updated this week)
- A Quirky Assortment of CuTe Kernels (☆838, updated this week)
- Lightning-Fast RL for LLM Reasoning and Agents. Made Simple & Flexible. (☆3,586, updated this week)
- Triton-based implementation of Sparse Mixture of Experts. (☆266, updated Oct 3, 2025)
- Domain-specific language designed to streamline the development of high-performance GPU/CPU/accelerator kernels (☆5,284, updated this week)
- Build compute kernels and load them from the Hub. (☆452, updated this week)
- Analyze computation-communication overlap in V3/R1. (☆1,143, updated Mar 21, 2025)
- Understanding R1-Zero-Like Training: A Critical Perspective (☆1,219, updated Aug 27, 2025)
- FlashInfer: Kernel Library for LLM Serving (☆5,057, updated this week)
- A bibliography and survey of the papers surrounding o1 (☆1,213, updated Nov 16, 2024)
- Checkpoint-engine is a simple middleware to update model weights in LLM inference engines (☆912, updated this week)
- KernelBench: Can LLMs Write GPU Kernels? Benchmark + toolkit with Torch -> CUDA (+ more DSLs) (☆820, updated this week)
- Benchmarking Optimizers for LLM Pretraining (☆52, updated Dec 30, 2025)
- ⚡️Write HGEMM from scratch using Tensor Cores with WMMA, MMA and CuTe API, Achieve Peak⚡️ Performance. (☆147, updated May 10, 2025)
- An efficient implementation of the NSA (Native Sparse Attention) kernel (☆129, updated Jun 24, 2025)
- Genai-bench is a powerful benchmark tool designed for comprehensive token-level performance evaluation of large language model (LLM) serv… (☆271, updated Feb 20, 2026)
- [ASPLOS'26] Taming the Long-Tail: Efficient Reasoning RL Training with Adaptive Drafter (☆138, updated Dec 5, 2025)
- (☆32, updated Jul 2, 2025)
- A sparse attention kernel supporting mixed sparse patterns (☆467, updated Jan 18, 2026)
- Transformers components but in Triton (☆34, updated May 9, 2025)
- Helpful tools and examples for working with flex-attention (☆1,140, updated Feb 8, 2026)
- Ring attention implementation with flash attention (☆986, updated Sep 10, 2025)
- PyTorch routines for (Ker)nel (Mac)hines (☆10, updated Oct 10, 2025)
- A simple API to use CUPTI (☆11, updated Aug 19, 2025)
- Scalable toolkit for efficient model reinforcement (☆1,372, updated this week)