MSLK (Meta Superintelligence Labs Kernels) is a collection of PyTorch GPU operator libraries designed and optimized for GenAI training and inference, covering operators such as FP8 row-wise quantization and collective communications.
☆87 · Mar 21, 2026 · Updated this week
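To give a sense of what an FP8 row-wise quantization operator computes, here is a minimal pure-PyTorch sketch (not MSLK's actual API; the helper name `fp8_rowwise_quantize` and the choice of `torch.float8_e4m3fn` are assumptions for illustration, and a real kernel would fuse this work into the surrounding GEMM):

```python
import torch

def fp8_rowwise_quantize(x: torch.Tensor):
    """Illustrative sketch: quantize each row of a 2-D float tensor to FP8
    (e4m3) with one scale per row, as a row-wise quantization op might do."""
    FP8_MAX = 448.0  # largest magnitude representable in float8_e4m3fn
    # Per-row scale so that each row's absolute maximum maps to FP8_MAX.
    row_max = x.abs().amax(dim=1, keepdim=True).clamp(min=1e-12)
    scale = FP8_MAX / row_max
    # Scale, clamp, and cast to FP8 (requires a PyTorch build with float8 dtypes).
    x_fp8 = (x * scale).clamp(-FP8_MAX, FP8_MAX).to(torch.float8_e4m3fn)
    # Return the reciprocal scale needed to dequantize (e.g. in a GEMM epilogue).
    return x_fp8, scale.reciprocal()

# Usage: quantize activations, then dequantize to check the round-trip error.
a = torch.randn(4, 128)
a_fp8, a_inv_scale = fp8_rowwise_quantize(a)
a_approx = a_fp8.to(torch.float32) * a_inv_scale
```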
Alternatives and similar repositories for MSLK
Users interested in MSLK are comparing it to the libraries listed below.
- Triton-based Symmetric Memory operators and examples ☆94 · Jan 15, 2026 · Updated 2 months ago
- Persistent dense GEMM for Hopper in `CuTeDSL` ☆15 · Aug 9, 2025 · Updated 7 months ago
- Clustered Compositional Embeddings ☆11 · Oct 25, 2023 · Updated 2 years ago
- Ship correct and fast LLM kernels to PyTorch ☆145 · Jan 14, 2026 · Updated 2 months ago
- Tutorial Exercises and Code for GPU Communications Tutorial at HOT Interconnects 2025 ☆31 · Oct 22, 2025 · Updated 5 months ago
- Extensible collectives library in Triton ☆97 · Mar 31, 2025 · Updated 11 months ago
- A High-Throughput Multi-GPU System for Graph-Based Approximate Nearest Neighbor Search ☆21 · Jul 22, 2025 · Updated 8 months ago
- NVSHMEM-Tutorial: Build a DeepEP-like GPU Buffer ☆170 · Feb 11, 2026 · Updated last month
- ☆38 · Aug 7, 2025 · Updated 7 months ago
- A toolkit for developers to simplify the transformation of nn.Module instances; it now corresponds to torch.fx in PyTorch. ☆13 · Apr 7, 2023 · Updated 2 years ago
- High Performance FP8 GEMM Kernels for SM89 and later GPUs. ☆20 · Jan 24, 2025 · Updated last year
- TORCH_TRACE parser for PT2 ☆78 · Updated this week
- GPTQ inference TVM kernel ☆40 · Apr 25, 2024 · Updated last year
- Enable everyone to develop, optimize and deploy AI models natively on everyone's devices. ☆10 · Sep 24, 2023 · Updated 2 years ago
- CVFusion is an open-source deep learning compiler to fuse the OpenCV operators. ☆33 · Aug 31, 2022 · Updated 3 years ago
- A collection of specialized agent skills for AI infrastructure development, enabling Claude Code to write, optimize, and debug high-perfo… ☆94 · Feb 2, 2026 · Updated last month
- DeepSeek-V3.2-Exp DSA Warmup Lightning Indexer training operator based on tilelang ☆44 · Nov 19, 2025 · Updated 4 months ago
- The AI-First Software Development (AIFSD) Manifesto ☆23 · Feb 17, 2026 · Updated last month
- Thunder Research Group's Collective Communication Library ☆50 · Jul 8, 2025 · Updated 8 months ago
- ☆119 · May 19, 2025 · Updated 10 months ago
- A standalone GEMM kernel for fp16 activation and quantized weight, extracted from FasterTransformer ☆95 · Feb 20, 2026 · Updated last month
- Pure Triton kernels for Qwen3.5-27B inference on NVIDIA B200 ☆81 · Feb 28, 2026 · Updated 3 weeks ago
- IntLLaMA: A fast and light quantization solution for LLaMA ☆18 · Jul 21, 2023 · Updated 2 years ago
- A Python-embedded DSL that makes it easy to write fast, scalable ML kernels with minimal boilerplate. ☆809 · Updated this week
- A Triton JIT runtime and FFI provider in C++ ☆32 · Mar 16, 2026 · Updated last week
- Distributed Compiler based on Triton for Parallel Systems ☆1,394 · Mar 11, 2026 · Updated last week
- Official repository for the paper Local Linear Attention: An Optimal Interpolation of Linear and Softmax Attention For Test-Time Regressi… ☆23 · Oct 1, 2025 · Updated 5 months ago
- ☆36 · Mar 7, 2025 · Updated last year
- CUDAAdvisor: a GPU profiling tool ☆53 · Aug 24, 2018 · Updated 7 years ago
- ☆11 · Dec 1, 2022 · Updated 3 years ago
- ☆65 · Apr 26, 2025 · Updated 10 months ago
- An LLM-based AI agent that can write correct and efficient GPU kernels automatically. ☆78 · Updated this week
- Hex encode & decode a string, right from your terminal. ☆10 · Jan 5, 2023 · Updated 3 years ago
- DL Dataloader Benchmarks ☆20 · Jan 27, 2025 · Updated last year
- Triton Compiler related materials. ☆42 · Mar 16, 2026 · Updated last week
- Row-wise block scaling for FP8 quantized matrix multiplication; a solution to the GPU MODE AMD challenge. ☆18 · Feb 9, 2026 · Updated last month
- ☆24 · Updated this week
- Official repository for DistFlashAttn: Distributed Memory-efficient Attention for Long-context LLMs Training ☆222 · Aug 19, 2024 · Updated last year
- ☆121 · Mar 14, 2026 · Updated last week