☆53Mar 3, 2026Updated 2 weeks ago
Alternatives and similar repositories for agent-gpu-skills
Users that are interested in agent-gpu-skills are comparing it to the libraries listed below
- Use tensor core to calculate back-to-back HGEMM (half-precision general matrix multiplication) with MMA PTX instruction.☆13Nov 3, 2023Updated 2 years ago
- ☆47May 20, 2025Updated 10 months ago
- CASS: Nvidia to AMD Transpilation with Data, Models, and Benchmark☆34Jun 24, 2025Updated 8 months ago
- ARIA is a collection of foundational computer graphics infrastructure for research.☆18May 19, 2025Updated 10 months ago
- Source code of the IPDPS '21 paper: "TileSpMV: A Tiled Algorithm for Sparse Matrix-Vector Multiplication on GPUs" by Yuyao Niu, Zhengyang…☆12Aug 12, 2022Updated 3 years ago
- An intelligent matrix format designer for SpMV☆10Oct 10, 2023Updated 2 years ago
- Mirror of http://gitlab.hpcrl.cse.ohio-state.edu/chong/ppopp19_ae, refactoring for understanding☆16Oct 20, 2021Updated 4 years ago
- Code and plots for "Active-Dormant Attention Heads: Mechanistically Demystifying Extreme-Token Phenomena in LLMs"☆10Dec 30, 2024Updated last year
- Parallel SpMV using CSR representation, built in CUDA☆14Jun 27, 2020Updated 5 years ago
- ☆54Dec 10, 2025Updated 3 months ago
- Code and dataset for the EMNLP 2024 paper: GoldCoin: Grounding Large Language Models in Privacy Laws via Contextual Integrity Theory☆48Sep 26, 2024Updated last year
- ☆32Apr 2, 2025Updated 11 months ago
- ☆53Feb 24, 2026Updated 3 weeks ago
- Why Low-Precision Transformer Training Fails: An Analysis on Flash Attention☆51Oct 16, 2025Updated 5 months ago
- Source code of the SC '23 paper: "DASP: Specific Dense Matrix Multiply-Accumulate Units Accelerated General Sparse Matrix-Vector Multipli…☆29Jun 18, 2024Updated last year
- A simple API for using CUPTI☆10Aug 19, 2025Updated 7 months ago
- A benchmarking tool for comparing different LLM API providers' DeepSeek model deployments.☆30Mar 28, 2025Updated 11 months ago
- ☆25Jan 9, 2023Updated 3 years ago
- DeepGEMM: clean and efficient FP8 GEMM kernels with fine-grained scaling☆21Updated this week
- A dedicated effort to make an optimized, bleeding edge vLLM image using Docker to support DGX comprehensively☆58Feb 22, 2026Updated 3 weeks ago
- triton for dsa☆60Updated this week
- FlashSparse significantly reduces the computation redundancy for unstructured sparsity (for SpMM and SDDMM) on Tensor Cores through a Swa…☆39Oct 5, 2025Updated 5 months ago
- An HTTP S3 client for OpenResty☆11Oct 1, 2016Updated 9 years ago
- ☆24May 9, 2025Updated 10 months ago
- A compiler for Decaf (an object-oriented language)☆12Sep 26, 2017Updated 8 years ago
- Reproduction of the libsmctrl paper, with an added Python interface that can be called flexibly from Python to allocate compute resources☆12May 21, 2024Updated last year
- Inverse Rendering Toolkit☆14Feb 24, 2025Updated last year
- Optimize GEMM with tensorcore step by step☆37Dec 17, 2023Updated 2 years ago
- WeChat Mini Program SDK for Go: API calls, data decryption☆11Jun 7, 2018Updated 7 years ago
- Implementation from scratch in C of the Multi-head latent attention used in the Deepseek-v3 technical paper.☆18Jan 15, 2025Updated last year
- ☆10Nov 9, 2023Updated 2 years ago
- Ongoing research training transformer models at scale☆18Updated this week
- LoRAFusion: Efficient LoRA Fine-Tuning for LLMs☆25Sep 23, 2025Updated 5 months ago
- ☆149Mar 18, 2024Updated 2 years ago
- A Row Decomposition-based Approach for Sparse Matrix Multiplication on GPUs☆28Nov 29, 2023Updated 2 years ago
- ☆16Feb 24, 2026Updated 3 weeks ago
- Official implementation for Training LLMs with MXFP4☆121Apr 25, 2025Updated 10 months ago
- Unreal compute shader boid simulation.☆49Sep 9, 2021Updated 4 years ago
- Code repository for ICLR 2025 paper "LeanQuant: Accurate and Scalable Large Language Model Quantization with Loss-error-aware Grid"☆25Mar 2, 2025Updated last year