☆22 · May 5, 2025 · Updated 10 months ago
Alternatives and similar repositories for gemm-cublas
Users that are interested in gemm-cublas are comparing it to the libraries listed below
- An experimental communicating attention kernel based on DeepEP. ☆35 · Jul 29, 2025 · Updated 7 months ago
- ☆52 · May 19, 2025 · Updated 9 months ago
- ☆136 · May 29, 2025 · Updated 9 months ago
- DeeperGEMM: crazy optimized version ☆74 · May 5, 2025 · Updated 10 months ago
- ☆13 · Jan 7, 2025 · Updated last year
- ☆15 · Jul 13, 2025 · Updated 7 months ago
- Demo for Qwen2.5-VL-3B-Instruct on Axera device. ☆17 · Sep 3, 2025 · Updated 6 months ago
- Benchmark tests supporting the TiledCUDA library. ☆18 · Nov 19, 2024 · Updated last year
- PyTorch implementation of the Flash Spectral Transform Unit. ☆22 · Sep 19, 2024 · Updated last year
- Custom triton kernels for training Karpathy's nanoGPT. ☆19 · Oct 21, 2024 · Updated last year
- ☆39 · Dec 14, 2025 · Updated 2 months ago
- Quantized Attention on GPU ☆44 · Nov 22, 2024 · Updated last year
- Official implementation of Adaptive Feature Transfer (AFT) ☆23 · Jun 12, 2024 · Updated last year
- ☆20 · Dec 24, 2024 · Updated last year
- A collection of Ethereum Virtual Machine benchmarks ☆22 · Jun 7, 2024 · Updated last year
- ☆45 · Feb 27, 2026 · Updated last week
- DeepXTrace is a lightweight tool for precisely diagnosing slow ranks in DeepEP-based environments. ☆93 · Jan 16, 2026 · Updated last month
- ☆227 · Nov 19, 2025 · Updated 3 months ago
- Tritonbench is a collection of PyTorch custom operators with example inputs to measure their performance. ☆329 · Updated this week
- FlashTile is a CUDA Tile IR compiler that is compatible with NVIDIA's tileiras, targeting SM70 through SM121 NVIDIA GPUs. ☆56 · Feb 6, 2026 · Updated last month
- ☆61 · Nov 27, 2023 · Updated 2 years ago
- The solution and code for NTO AI Olympics 2022. ☆19 · Sep 20, 2022 · Updated 3 years ago
- Fast and memory efficient PyTorch implementation of the Perceiver with FlashAttention. ☆31 · Nov 4, 2024 · Updated last year
- ⚡️Write HGEMM from scratch using Tensor Cores with WMMA, MMA and CuTe API, Achieve Peak⚡️ Performance. ☆150 · May 10, 2025 · Updated 9 months ago
- FastCache: Fast Caching for Diffusion Transformer Through Learnable Linear Approximation [Efficient ML Model] ☆46 · Feb 17, 2026 · Updated 2 weeks ago
- ☆79 · Dec 27, 2024 · Updated last year
- A curated collection of resources, tutorials, and best practices for learning and mastering NVIDIA CUTLASS ☆253 · May 6, 2025 · Updated 10 months ago
- An efficient implementation of the NSA (Native Sparse Attention) kernel ☆129 · Jun 24, 2025 · Updated 8 months ago
- word2vec source code with detailed bilingual annotations (well-annotated word2vec) ☆10 · Oct 3, 2021 · Updated 4 years ago
- tool and many hash functions: crc16, crc32, md5, sha1, sha256, sha512, sha3_224, sha3_256, sha3_384, sha3_512 ☆11 · Apr 21, 2022 · Updated 3 years ago
- ☆30 · Dec 3, 2025 · Updated 3 months ago
- This is the official training code of OmniSVG ☆30 · Jan 19, 2026 · Updated last month
- extensible collectives library in triton ☆96 · Mar 31, 2025 · Updated 11 months ago
- LM engine is a library for pretraining/finetuning LLMs ☆126 · Updated this week
- ring-attention experiments ☆166 · Oct 17, 2024 · Updated last year
- ☆272 · Jun 6, 2025 · Updated 9 months ago
- Implementation of the paper "Predicting gamma passing rates for portal dosimetry based IMRT QA using machine learning" ☆12 · Oct 8, 2021 · Updated 4 years ago
- MFRC522 RFID reader/writer I2C driver in Python 3 ☆10 · Oct 7, 2024 · Updated last year
- Customize recoil helper for PUBG using C++ ☆12 · Apr 4, 2018 · Updated 7 years ago