Dao-AILab / gemm-cublas
☆22 · May 5, 2025 · Updated 9 months ago
Alternatives and similar repositories for gemm-cublas
Users interested in gemm-cublas are comparing it to the libraries listed below.
- An experimental communicating attention kernel based on DeepEP. ☆35 · Jul 29, 2025 · Updated 6 months ago
- ☆52 · May 19, 2025 · Updated 8 months ago
- ☆131 · May 29, 2025 · Updated 8 months ago
- DeeperGEMM: crazy optimized version ☆74 · May 5, 2025 · Updated 9 months ago
- ☆13 · Jan 7, 2025 · Updated last year
- Demo for Qwen2.5-VL-3B-Instruct on Axera device. ☆17 · Sep 3, 2025 · Updated 5 months ago
- ☆42 · Jan 24, 2026 · Updated 3 weeks ago
- PyTorch implementation of the Flash Spectral Transform Unit. ☆21 · Sep 19, 2024 · Updated last year
- Custom Triton kernels for training Karpathy's nanoGPT. ☆19 · Oct 21, 2024 · Updated last year
- ☆39 · Dec 14, 2025 · Updated 2 months ago
- Quantized Attention on GPU ☆44 · Nov 22, 2024 · Updated last year
- ☆20 · Dec 24, 2024 · Updated last year
- study of cutlass ☆22 · Nov 10, 2024 · Updated last year
- A collection of Ethereum Virtual Machine benchmarks ☆22 · Jun 7, 2024 · Updated last year
- FlashTile is a CUDA Tile IR compiler that is compatible with NVIDIA's tileiras, targeting SM70 through SM121 NVIDIA GPUs. ☆37 · Feb 6, 2026 · Updated last week
- DeepXTrace is a lightweight tool for precisely diagnosing slow ranks in DeepEP-based environments. ☆93 · Jan 16, 2026 · Updated last month
- ☆221 · Nov 19, 2025 · Updated 2 months ago
- Tritonbench is a collection of PyTorch custom operators with example inputs to measure their performance. ☆326 · Updated this week
- ☆61 · Nov 27, 2023 · Updated 2 years ago
- Fast and memory-efficient PyTorch implementation of the Perceiver with FlashAttention. ☆31 · Nov 4, 2024 · Updated last year
- The solution and code for the NTO AI Olympics 2022. ☆19 · Sep 20, 2022 · Updated 3 years ago
- ⚡️Write HGEMM from scratch using Tensor Cores with the WMMA, MMA, and CuTe APIs to achieve peak performance.⚡️ ☆148 · May 10, 2025 · Updated 9 months ago
- Official PyTorch implementation of "GuidedQuant: Large Language Model Quantization via Exploiting End Loss Guidance" (ICML 2025) ☆50 · Jul 6, 2025 · Updated 7 months ago
- FastCache: Fast Caching for Diffusion Transformer Through Learnable Linear Approximation [Efficient ML Model] ☆46 · Jan 27, 2026 · Updated 2 weeks ago
- Optimize GEMM with tensorcore step by step ☆36 · Dec 17, 2023 · Updated 2 years ago
- ☆26 · Dec 3, 2025 · Updated 2 months ago
- Minimal but scalable implementation of large language models in JAX ☆35 · Nov 28, 2025 · Updated 2 months ago
- ☆79 · Dec 27, 2024 · Updated last year
- A curated collection of resources, tutorials, and best practices for learning and mastering NVIDIA CUTLASS ☆252 · May 6, 2025 · Updated 9 months ago
- Patches for Hugging Face Transformers to save memory ☆34 · Jun 2, 2025 · Updated 8 months ago
- An efficient implementation of the NSA (Native Sparse Attention) kernel ☆129 · Jun 24, 2025 · Updated 7 months ago
- The official engine source code for Project ORKA ☆10 · Nov 25, 2024 · Updated last year
- Well-annotated word2vec source code with detailed bilingual (Chinese/English) comments ☆10 · Oct 3, 2021 · Updated 4 years ago
- A tool and many hash functions: crc16, crc32, md5, sha1, sha256, sha512, sha3_224, sha3_256, sha3_384, sha3_512 ☆11 · Apr 21, 2022 · Updated 3 years ago
- Extensible collectives library in Triton ☆95 · Mar 31, 2025 · Updated 10 months ago
- ring-attention experiments ☆165 · Oct 17, 2024 · Updated last year
- ☆270 · Jun 6, 2025 · Updated 8 months ago
- Intel(R) Distribution for GDB* ☆15 · Jan 26, 2026 · Updated 3 weeks ago
- OSWorld-Human: Benchmarking the Efficiency of Computer-Use Agents ☆21 · Jan 6, 2026 · Updated last month