MSLK (Meta Superintelligence Labs Kernels) is a collection of PyTorch GPU operator libraries designed and optimized for GenAI training and inference, covering operators such as FP8 row-wise quantization and collective communications.
☆55, updated Mar 1, 2026
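As context for the FP8 row-wise quantization mentioned above, here is a minimal NumPy sketch of the general technique: each matrix row gets its own scale so that the row's maximum magnitude maps onto the FP8 e4m3 dynamic range. The function names and the float32 simulation are illustrative assumptions, not MSLK's actual API.

```python
import numpy as np

# Illustrative sketch of FP8 (e4m3) row-wise quantization. The max normal
# value representable in e4m3 is 448; a per-row scale maps each row's
# largest magnitude onto that range. FP8 rounding itself is not simulated
# here, so values stay in float32. Names are hypothetical, not MSLK's API.

FP8_E4M3_MAX = 448.0

def quantize_rowwise(x: np.ndarray):
    """Return (row-wise quantized values in fp32, per-row scales)."""
    # Per-row absolute maximum, floored at a tiny epsilon to avoid
    # division by zero for all-zero rows.
    row_max = np.maximum(np.abs(x).max(axis=1, keepdims=True), 1e-12)
    scale = row_max / FP8_E4M3_MAX           # one scale per row
    q = np.clip(x / scale, -FP8_E4M3_MAX, FP8_E4M3_MAX)
    return q, scale

def dequantize_rowwise(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    # Multiply each row back by its scale to recover the original range.
    return q * scale

x = np.random.randn(4, 8).astype(np.float32) * 10.0
q, s = quantize_rowwise(x)
x_hat = dequantize_rowwise(q, s)
```

Per-row (rather than per-tensor) scales matter for GEMM inputs because one outlier row would otherwise crush the resolution of every other row; a real kernel would additionally cast `q` to an FP8 storage type such as `torch.float8_e4m3fn`.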
Alternatives and similar repositories for MSLK
Users interested in MSLK are comparing it to the libraries listed below.
- Persistent dense GEMM for Hopper in `CuTeDSL` (☆15, updated Aug 9, 2025)
- A developer toolkit that simplifies transforming `nn.Module` instances; it now corresponds to PyTorch's `torch.fx` (☆13, updated Apr 7, 2023)
- Tutorial exercises and code for the GPU Communications Tutorial at Hot Interconnects 2025 (☆31, updated Oct 22, 2025)
- A High-Throughput Multi-GPU System for Graph-Based Approximate Nearest Neighbor Search (☆21, updated Jul 22, 2025)
- A Triton JIT runtime and FFI provider in C++ (☆31, updated Feb 24, 2026)
- High-performance FP8 GEMM kernels for SM89 and later GPUs (☆20, updated Jan 24, 2025)
- Ship correct and fast LLM kernels to PyTorch (☆142, updated Jan 14, 2026)
- (☆24, updated this week)
- DeepSeek-V3.2-Exp DSA Warmup Lightning Indexer training operator based on tilelang (☆44, updated Nov 19, 2025)
- Triton-based Symmetric Memory operators and examples (☆85, updated Jan 15, 2026)
- GPTQ inference TVM kernel (☆40, updated Apr 25, 2024)
- A dynamic binary instrumentation tool for tracing and analyzing CUDA kernel instructions (☆35, updated this week)
- IntLLaMA: a fast and light quantization solution for LLaMA (☆18, updated Jul 21, 2023)
- Framework that reduces autotune overhead to zero for well-known deployments (☆97, updated Sep 19, 2025)
- A practical way of learning Swizzle (☆37, updated Feb 3, 2025)
- TORCH_TRACE parser for PT2 (☆78, updated this week)
- A standalone GEMM kernel for FP16 activation and quantized weight, extracted from FasterTransformer (☆96, updated Feb 20, 2026)
- (☆111, updated this week)
- NVIDIA SASS disassembler/inline patcher (☆43, updated this week)
- Quantize transformers to any learned arbitrary 4-bit numeric format (☆51, updated Jan 25, 2026)
- (☆65, updated Apr 26, 2025)
- Perplexity GPU Kernels (☆567, updated Nov 7, 2025)
- InfiniStore: an elastic serverless cloud storage system (VLDB '23) (☆24, updated May 5, 2023)
- Artifacts of EVT, ASPLOS '24 (☆29, updated Mar 6, 2024)
- DeeperGEMM: crazy optimized version (☆74, updated May 5, 2025)
- Triton compiler-related materials (☆42, updated Jan 4, 2025)
- Transformers components, but in Triton (☆34, updated May 9, 2025)
- (☆53, updated Feb 24, 2026)
- From Minimal GEMM to Everything (☆163, updated Feb 10, 2026)
- Luthier, a GPU binary instrumentation tool for AMD GPUs (☆27, updated Feb 21, 2026)
- Extensible collectives library in Triton (☆95, updated Mar 31, 2025)
- NVIDIA Inference Xfer Library (NIXL) (☆898, updated this week)
- (☆261, updated Jul 11, 2024)
- Code for the paper "RECE: Reduced Cross-Entropy Loss for Large-Catalogue Sequential Recommenders," CIKM '24 (☆11, updated Oct 21, 2024)
- (☆40, updated Feb 28, 2020)
- Artifact for "Register Optimizations for Stencils on GPUs" (☆10, updated Sep 18, 2018)
- A simple general-purpose programming language (☆100, updated Feb 2, 2026)
- QuTLASS: CUTLASS-Powered Quantized BLAS for Deep Learning (☆168, updated Nov 11, 2025)
- (☆169, updated Mar 9, 2023)