☆47Jul 16, 2025Updated 7 months ago
Alternatives and similar repositories for cuda-rt-hook
Users that are interested in cuda-rt-hook are comparing it to the libraries listed below
- Hyperparameter: The High-Performance Configuration Library for AI Systems☆22Dec 14, 2025Updated 2 months ago
- TiledLower is a Dataflow Analysis and Codegen Framework written in Rust.☆14Nov 23, 2024Updated last year
- Analyze model performance at the module level☆13Jun 21, 2025Updated 8 months ago
- auto deploy neovim like chxuan/vimplus☆12Apr 22, 2025Updated 10 months ago
- Handwritten GEMM using Intel AMX (Advanced Matrix Extension)☆17Jan 11, 2025Updated last year
- ☆38Aug 7, 2025Updated 6 months ago
- Study materials collected while studying☆51Apr 16, 2022Updated 3 years ago
- ☆11Apr 3, 2023Updated 2 years ago
- FPGA-based HyperLogLog Accelerator☆12Jul 13, 2020Updated 5 years ago
- A survey of manufacturer-provided DRAM operating parameters and timings as specified by DRAM chip datasheets from between 1970 and 2021. …☆11May 4, 2022Updated 3 years ago
- ☆47Dec 13, 2024Updated last year
- Artifact evaluation repo for EuroSys'24.☆29Nov 7, 2023Updated 2 years ago
- Paper list of federated learning: About system design☆13Apr 13, 2022Updated 3 years ago
- Hooked CUDA-related dynamic libraries by using automated code generation tools.☆171Dec 12, 2023Updated 2 years ago
- Benchmark tests supporting the TiledCUDA library.☆18Nov 19, 2024Updated last year
- ☆12May 13, 2025Updated 9 months ago
- ☆13Sep 11, 2020Updated 5 years ago
- SmartNIC☆14Dec 13, 2018Updated 7 years ago
- Artifacts for ATC '22 paper "Faster Software Packet Processing on FPGA NICs with eBPF Program Warping"☆17May 20, 2022Updated 3 years ago
- Open Source SSD Controller. NVMe and Lightstor variants☆18May 21, 2014Updated 11 years ago
- An external memory allocator example for PyTorch.☆16Aug 10, 2025Updated 6 months ago
- Debug print operator for cudagraph debugging☆14Aug 2, 2024Updated last year
- Official GitHub repository for the paper "Towards timeout-less transport in commodity datacenter networks".☆16Oct 12, 2021Updated 4 years ago
- ☆34Nov 7, 2022Updated 3 years ago
- DeeperGEMM: crazy optimized version☆74May 5, 2025Updated 10 months ago
- NUMA-Aware Reader-Writer Locks☆19Jun 12, 2014Updated 11 years ago
- Accelerate LLM preference tuning via prefix sharing with a single line of code☆51Jul 4, 2025Updated 8 months ago
- Johnson-Lindenstrauss transform (JLT), random projections (RP), fast Johnson-Lindenstrauss transform (FJLT), and randomized Hadamard tran…☆23Jul 11, 2023Updated 2 years ago
- Open deep learning compiler stack for cpu, gpu and specialized accelerators☆19Feb 24, 2026Updated last week
- GPTQ inference TVM kernel☆40Apr 25, 2024Updated last year
- Automatic virtualization of (general) accelerators.☆47Nov 28, 2022Updated 3 years ago
- ArkVale: Efficient Generative LLM Inference with Recallable Key-Value Eviction (NIPS'24)☆53Dec 17, 2024Updated last year
- Official repository for "IPDPS '24 QSync: Quantization-Minimized Synchronous Distributed Training Across Hybrid Devices".☆20Feb 23, 2024Updated 2 years ago
- Efficient GPU communication over multiple NICs.☆26Nov 20, 2025Updated 3 months ago
- ☆39Dec 14, 2025Updated 2 months ago
- Manages vllm-nccl dependency☆17Jun 3, 2024Updated last year
- An In-kernel Transparent Monitoring System for Microservice Systems with eBPF☆22Sep 11, 2022Updated 3 years ago
- PyTorch compilation tutorial covering TorchScript, torch.fx, and Slapo☆17Mar 13, 2023Updated 2 years ago
- Simple CuDNN wrapper☆20Nov 29, 2015Updated 10 years ago