CUDA SGEMM optimization note
☆15Oct 31, 2023Updated 2 years ago
Alternatives and similar repositories for cuda-sgemm-optimization
Users that are interested in cuda-sgemm-optimization are comparing it to the libraries listed below
- a simple API to use CUPTI☆11Aug 19, 2025Updated 6 months ago
- Implementation from scratch in C of the Multi-head latent attention used in the Deepseek-v3 technical paper.☆18Jan 15, 2025Updated last year
- TileGraph is an experimental DNN compiler that utilizes static code generation and kernel fusion techniques.☆12Sep 18, 2024Updated last year
- ☆14May 30, 2019Updated 6 years ago
- A concurrent LRU cache.☆23Feb 14, 2021Updated 5 years ago
- [WIP] A tiny RISC-V hypervisor software written in Rust☆27Dec 8, 2020Updated 5 years ago
- ☆23Jun 14, 2023Updated 2 years ago
- OSDI 2023 Welder, deeplearning compiler☆32Nov 24, 2023Updated 2 years ago
- CUDA project for uni subject☆26Oct 26, 2020Updated 5 years ago
- Matrix multiplication on GPUs for matrices stored on a CPU. Similar to cublasXt, but ported to both NVIDIA and AMD GPUs.☆32Apr 2, 2025Updated 11 months ago
- Linux io_uring based c++ 20 coroutine library☆28Jun 21, 2022Updated 3 years ago
- Sequence-level 1F1B schedule for LLMs.☆38Aug 26, 2025Updated 6 months ago
- Asynchronous pipeline parallel optimization☆19Feb 2, 2026Updated last month
- Transformers components but in Triton☆34May 9, 2025Updated 9 months ago
- An efficient implementation of the NSA (Native Sparse Attention) kernel☆129Jun 24, 2025Updated 8 months ago
- Symphony — A decentralized multi-agent framework that enables intelligent agents to collaborate seamlessly across heterogeneous edge devi…☆30Oct 30, 2025Updated 4 months ago
- Low-level RDMA API☆39Oct 22, 2023Updated 2 years ago
- Prefix-Aware Attention for LLM Decoding☆29Jan 23, 2026Updated last month
- Cute tiny operating system for RISC-V. ฅ•ω•ฅ☆38Jun 9, 2022Updated 3 years ago
- NVSHMEM‑Tutorial: Build a DeepEP‑like GPU Buffer☆163Feb 11, 2026Updated 3 weeks ago
- A simple spinlock crate based on the abstractions provided by the `lock_api` crate.☆44Feb 20, 2026Updated last week
- lab solutions of ICS course☆10Jan 20, 2013Updated 13 years ago
- This is the code for an agentic RAG method with a dynamic workflow.☆12Jan 22, 2026Updated last month
- ☆12Feb 7, 2018Updated 8 years ago
- A simple MIPS CPU for BUAA CO course (and now NSCSCC).☆10May 15, 2021Updated 4 years ago
- [ICDCS 2023] Evaluation and Optimization of Gradient Compression for Distributed Deep Learning☆10Apr 28, 2023Updated 2 years ago
- Wait for async tasks☆13Dec 22, 2022Updated 3 years ago
- Simple Java Virtual Machine written in pure Rust☆36Sep 3, 2025Updated 6 months ago
- QuTLASS: CUTLASS-Powered Quantized BLAS for Deep Learning☆168Nov 11, 2025Updated 3 months ago
- Async version of smoltcp☆42Feb 1, 2026Updated last month
- ASPLOS'24: Optimal Kernel Orchestration for Tensor Programs with Korch☆39Mar 27, 2025Updated 11 months ago
- Decoding Attention is specially optimized for MHA, MQA, GQA and MLA using CUDA core for the decoding stage of LLM inference.☆46Jun 11, 2025Updated 8 months ago
- A standalone GEMM kernel for fp16 activation and quantized weight, extracted from FasterTransformer☆96Feb 20, 2026Updated last week
- The Chinese translation of https://www.usenix.org/legacy/event/atc10/tech/full_papers/Hunt.pdf☆37May 13, 2023Updated 2 years ago
- The Project TinyMIPS is dedicated to enabling undergraduates to build a complete computer system from scratch.☆36Feb 28, 2020Updated 6 years ago
- Texture Block Compression (BCn) written in Rust☆11Apr 12, 2021Updated 4 years ago
- LLCL-MIPS is a superscalar MIPS processor that supports MIPS Release 1 instructions and is capable of booting the Linux kernel. (Grand-prize entry in the 5th Loongson Cup…☆37Jan 26, 2022Updated 4 years ago
- Hardware implementation of Huffman coding (written in Verilog)☆14Jul 30, 2017Updated 8 years ago
- Custom coroutines and a scheduler implemented with boost context, used to build an RPC framework.☆10May 9, 2025Updated 9 months ago