MetaX-MACA / FlashMLA
Fast and efficient attention method exploration and implementation.
☆21Updated last month
Alternatives and similar repositories for FlashMLA:
Users who are interested in FlashMLA are comparing it to the libraries listed below:
- ☆49Updated this week
- An unofficial cuda assembler, for all generations of SASS, hopefully :)☆82Updated 2 years ago
- We invite you to visit and follow our new repository at https://github.com/microsoft/TileFusion. TiledCUDA is a highly efficient kernel …☆181Updated 2 months ago
- CVFusion is an open-source deep learning compiler to fuse the OpenCV operators.☆29Updated 2 years ago
- ⚡️Write HGEMM from scratch using Tensor Cores with WMMA, MMA and CuTe API, Achieve Peak⚡️ Performance.☆73Updated 3 weeks ago
- GVProf: A Value Profiler for GPU-based Clusters☆49Updated last year
- A Vectorized N:M Format for Unleashing the Power of Sparse Tensor Cores☆51Updated last year
- Assembler and Decompiler for NVIDIA (Maxwell Pascal Volta Turing Ampere) GPUs.☆78Updated 2 years ago
- A benchmark suited especially for deep learning operators☆42Updated 2 years ago
- A Python script to convert the output of NVIDIA Nsight Systems (in SQLite format) to JSON in Google Chrome Trace Event Format.☆33Updated 3 months ago
- ☆11Updated this week
- ☆148Updated 3 months ago
- Examples of CUDA implementations by Cutlass CuTe☆159Updated 2 months ago
- A demo of how to write a high-performance convolution that runs on Apple silicon☆54Updated 3 years ago
- A standalone GEMM kernel for fp16 activation and quantized weight, extracted from FasterTransformer☆91Updated 3 weeks ago
- A layered, decoupled deep learning inference engine☆72Updated 2 months ago
- ☆90Updated 3 weeks ago
- Tacker: Tensor-CUDA Core Kernel Fusion for Improving the GPU Utilization while Ensuring QoS☆25Updated 2 months ago
- AI Accelerator Benchmark focuses on evaluating AI Accelerators from a practical production perspective, including the ease of use and ver…☆236Updated last week
- Compare different hardware platforms via the Roofline Model for LLM inference tasks.☆97Updated last year
- Efficient operation implementation based on the Cambricon Machine Learning Unit (MLU).☆115Updated 2 weeks ago
- A tutorial for CUDA&PyTorch☆137Updated 3 months ago
- play gemm with tvm☆90Updated last year
- ☆96Updated 3 years ago
- TileFusion is an experimental C++ macro kernel template library that elevates the abstraction level in CUDA C for tile processing.☆82Updated last week
- DeepSeek-V3/R1 inference performance simulator☆113Updated last month
- ☆109Updated last year
- Paella: Low-latency Model Serving with Virtualized GPU Scheduling☆58Updated 11 months ago
- An extension library of WMMA API (Tensor Core API)☆96Updated 9 months ago
- flexible-gemm conv of deepcore☆17Updated 5 years ago