Anatomy of High-Performance GEMM with Online Fault Tolerance on GPUs
☆14Apr 3, 2025Updated last year
Alternatives and similar repositories for Fault-Tolerant-SGEMM-on-NVIDIA-GPUs
Users that are interested in Fault-Tolerant-SGEMM-on-NVIDIA-GPUs are comparing it to the libraries listed below.
- Performance of the C++ interface of flash attention and flash attention v2 in large language model (LLM) inference scenarios.☆16Aug 31, 2023Updated 2 years ago
- This is the core functions needed by the `tsmp` package. The low level and carefully checked mathematical functions are here. These are i…☆12Dec 16, 2025Updated 5 months ago
- BigBang-Proton is a LLM pretrained on cross-scale, cross-structure, cross-discipline real-world scientific tasks to construct a scienti…☆21Nov 8, 2025Updated 6 months ago
- GEMV implementation with CUTLASS☆21Aug 21, 2025Updated 9 months ago
- [NeurIPS 2025] Multipole Attention for Efficient Long Context Reasoning☆23Dec 5, 2025Updated 5 months ago
- A service-aware RoCE network monitoring system based on end-to-end probing.☆28Mar 1, 2026Updated 2 months ago
- The Mixing method: coordinate descent for low-rank semidefinite programming☆15Apr 30, 2021Updated 5 years ago
- A simple and efficient memory pool implemented in C++11.☆10Jun 2, 2022Updated 3 years ago
- [ICLR 2024] This is the official PyTorch implementation of "QLLM: Accurate and Efficient Low-Bitwidth Quantization for Large Language Mod…☆31Mar 12, 2024Updated 2 years ago
- SGEMM and DGEMM subroutines using AVX512F instructions.☆15May 22, 2022Updated 4 years ago
- transformer tokenizers (e.g. BERT tokenizer) in C++ (WIP)☆18Apr 7, 2022Updated 4 years ago
- ☆25Mar 15, 2023Updated 3 years ago
- Matlab mex wrappers to cuSPARSE (NVIDIA)☆11Dec 10, 2025Updated 5 months ago
- Generation of Debian rootfs for multiple architectures☆15Nov 13, 2021Updated 4 years ago
- Directed Acyclic Graph Execution Engine (DAGEE) is a C++ library that enables programmers to express computation and data movement, as ta…☆49Oct 12, 2021Updated 4 years ago
- Fast and low-memory attention layer written in CUDA☆20Jul 14, 2023Updated 2 years ago
- [ICML 2025] Parameter-Efficient Fine-Tuning of State Space Models☆25Jun 9, 2025Updated 11 months ago
- ☆26Dec 5, 2022Updated 3 years ago
- CUDA C simple application for Nvidia's GPU☆11Jun 7, 2022Updated 3 years ago
- Parallel optimization algorithms for sparse matrix-vector multiplication (OpenMP, AVX)☆11Jul 7, 2021Updated 4 years ago
- Stepwise optimizations of DGEMM on CPU, reaching performance faster than Intel MKL eventually, even under multithreading.☆163Feb 3, 2022Updated 4 years ago
- An implementation of SGEMV with performance comparable to cuBLAS.☆12May 21, 2021Updated 5 years ago
- ☆16Apr 18, 2025Updated last year
- ☆10Jul 4, 2022Updated 3 years ago
- Source code of the IPDPS '21 paper: "TileSpMV: A Tiled Algorithm for Sparse Matrix-Vector Multiplication on GPUs" by Yuyao Niu, Zhengyang…☆13Aug 12, 2022Updated 3 years ago
- An Architecture-level Fault Injection Tool for GPU Application Resilience Evaluations☆21Apr 14, 2020Updated 6 years ago
- ☆20Jul 10, 2025Updated 10 months ago
- learning notes when learning the source code of pytorch☆24Apr 3, 2019Updated 7 years ago
- A tutorial/example of the Python C-API and integration with CUDA kernels.☆14Jul 7, 2019Updated 6 years ago
- This repository is the official implementation of "Jakiro: Boosting Speculative Decoding with Decoupled Multi-Head via MoE" [ACL 2026 Mai…☆37Oct 5, 2025Updated 7 months ago
- OpenCL multiGPU sample monitoring system health☆22Feb 25, 2016Updated 10 years ago
- GPU Performance Advisor☆66Jul 25, 2022Updated 3 years ago
- Tutorials for Timemory☆21Aug 1, 2024Updated last year
- Analyze who cites you, where, and how—one-click impact report for grants, tenure, and academic green cards☆35Nov 30, 2025Updated 6 months ago
- A tiny learning framework built by cudnn and cublas.☆21Nov 12, 2021Updated 4 years ago
- An HPL-AI implementation for Fugaku☆23Jun 29, 2021Updated 4 years ago
- ☆33Oct 4, 2024Updated last year
- Simulating safety and non-safety messages in IEEE 1609.4. Tech Stack : Linux 12.04, Omnet++ 4.6, SUMO 0.22.0, Veins 4 alpha 2, Inet 2.5☆12Jul 19, 2017Updated 8 years ago
- Mirror of http://gitlab.hpcrl.cse.ohio-state.edu/chong/ppopp19_ae, refactoring for understanding☆17Oct 20, 2021Updated 4 years ago