☆15 Updated this week
Alternatives and similar repositories for QuickRunCUDA
Users that are interested in QuickRunCUDA are comparing it to the libraries listed below
- ☆32 Jul 2, 2025 Updated 7 months ago
- ☆11 Dec 22, 2024 Updated last year
- ☆23 Jul 11, 2025 Updated 7 months ago
- Persistent dense gemm for Hopper in `CuTeDSL` ☆15 Aug 9, 2025 Updated 6 months ago
- Benchmark tests supporting the TiledCUDA library. ☆18 Nov 19, 2024 Updated last year
- ☆53 Updated this week
- A Top-Down Profiler for GPU Applications ☆22 Feb 29, 2024 Updated 2 years ago
- Simple python library for generating your own perfetto traces for your application. Can be used for both app instrumentation and custom … ☆25 Jun 22, 2025 Updated 8 months ago
- ☆44 Updated this week
- ☆16 Jul 8, 2024 Updated last year
- My submission for the GPUMODE/AMD fp8 mm challenge ☆29 Jun 4, 2025 Updated 8 months ago
- An implementation of the Llama architecture, to instruct and delight ☆21 May 31, 2025 Updated 9 months ago
- ☆52 May 19, 2025 Updated 9 months ago
- Sample Codes using NVSHMEM on Multi-GPU ☆30 Jan 22, 2023 Updated 3 years ago
- ☆65 Apr 26, 2025 Updated 10 months ago
- An experimental communicating attention kernel based on DeepEP. ☆35 Jul 29, 2025 Updated 7 months ago
- Estimate MFU for DeepSeekV3 ☆26 Jan 5, 2025 Updated last year
- A lightweight design for computation-communication overlap. ☆221 Jan 20, 2026 Updated last month
- Consistent Autoregressive Video Generation with Long Context ☆67 Feb 6, 2026 Updated 3 weeks ago
- (WIP) Parallel inference for black-forest-labs' FLUX model. ☆18 Nov 18, 2024 Updated last year
- High Performance Grouped GEMM in PyTorch ☆31 May 10, 2022 Updated 3 years ago
- JAX implementation of the Mistral 7b v0.2 model ☆35 Jul 3, 2024 Updated last year
- An efficient implementation of the NSA (Native Sparse Attention) kernel ☆129 Jun 24, 2025 Updated 8 months ago
- ☆27 Dec 3, 2025 Updated 2 months ago
- Well-annotated word2vec source code with detailed bilingual comments ☆10 Oct 3, 2021 Updated 4 years ago
- Valentine's Day Anonymous matching ☆10 Jul 25, 2014 Updated 11 years ago
- Building the Virtuous Cycle for AI-driven LLM Systems ☆186 Feb 19, 2026 Updated last week
- Repository for go shared libraries (for now). ☆11 Dec 1, 2025 Updated 3 months ago
- A TUI-based utility for real-time monitoring of InfiniBand traffic and performance metrics on the local node ☆62 Dec 19, 2025 Updated 2 months ago
- torchcomms: a modern PyTorch communications API ☆338 Updated this week
- extensible collectives library in triton ☆95 Mar 31, 2025 Updated 11 months ago
- Using FlexAttention to compute attention with different masking patterns ☆47 Sep 22, 2024 Updated last year
- WaferLLM: Large Language Model Inference at Wafer Scale ☆90 Jan 7, 2026 Updated last month
- ☆79 Feb 10, 2026 Updated 2 weeks ago
- Triton-based Symmetric Memory operators and examples ☆85 Jan 15, 2026 Updated last month
- Efficient Long-context Language Model Training by Core Attention Disaggregation ☆91 Updated this week
- Speeding Up Your Python Codes 1000x ☆12 Apr 2, 2025 Updated 10 months ago
- ☆20 Oct 4, 2024 Updated last year
- BigBang-Proton is an LLM pretrained on cross-scale, cross-structure, cross-discipline real-world scientific tasks to construct a scienti… ☆22 Nov 8, 2025 Updated 3 months ago