leimao / Nsight-Systems-Docker-Image
Nsight Systems In Docker
☆20 · Updated last year
Alternatives and similar repositories for Nsight-Systems-Docker-Image
Users interested in Nsight-Systems-Docker-Image are comparing it to the libraries listed below.
- Open Source Projects from Pallas Lab ☆20 · Updated 3 years ago
- ☆28 · Updated 3 months ago
- Standalone Flash Attention v2 kernel without libtorch dependency ☆108 · Updated 8 months ago
- This repository is a read-only mirror of https://gitlab.arm.com/kleidi/kleidiai ☆37 · Updated this week
- llama INT4 CUDA inference with AWQ ☆54 · Updated 3 months ago
- Penn CIS 5650 (GPU Programming and Architecture) Final Project ☆30 · Updated last year
- ☆146 · Updated 2 years ago
- 📚 A curated list of awesome matrix-matrix multiplication (A * B = C) frameworks, libraries and software ☆33 · Updated 2 months ago
- A standalone GEMM kernel for fp16 activation and quantized weight, extracted from FasterTransformer ☆92 · Updated last week
- CUDA 8-bit Tensor Core Matrix Multiplication based on m16n16k16 WMMA API ☆30 · Updated last year
- Model compression for ONNX ☆92 · Updated 5 months ago
- GPTQ inference TVM kernel ☆38 · Updated last year
- ☆65 · Updated 6 months ago
- Benchmark code for the "Online normalizer calculation for softmax" paper ☆91 · Updated 6 years ago
- ☆11 · Updated 2 months ago
- Decoding Attention is specially optimized for MHA, MQA, GQA and MLA using CUDA cores for the decoding stage of LLM inference ☆36 · Updated last month
- This library empowers users to seamlessly port pretrained models and checkpoints on the HuggingFace (HF) hub (developed using HF transfor… ☆65 · Updated last week
- SparseTIR: Sparse Tensor Compiler for Deep Learning ☆137 · Updated 2 years ago
- PyTorch extension for emulating FP8 data formats on standard FP32 Xeon/GPU hardware ☆110 · Updated 5 months ago
- Study of Ampere's sparse matmul ☆18 · Updated 4 years ago
- CUDA Matrix Multiplication Optimization ☆186 · Updated 9 months ago
- Several optimization methods of half-precision general matrix-vector multiplication (HGEMV) using CUDA cores ☆61 · Updated 8 months ago
- ⚡️Write HGEMM from scratch using Tensor Cores with WMMA, MMA and CuTe API, achieving peak performance⚡️ ☆76 · Updated this week
- Tacker: Tensor-CUDA Core Kernel Fusion for Improving the GPU Utilization while Ensuring QoS ☆25 · Updated 3 months ago
- SpInfer: Leveraging Low-Level Sparsity for Efficient Large Language Model Inference on GPUs ☆44 · Updated last month
- ☆26 · Updated last year
- A Winograd Minimal Filter Implementation in CUDA ☆24 · Updated 3 years ago
- Open deep learning compiler stack for CPU, GPU and specialized accelerators ☆18 · Updated last week
- Sandbox for TVM and playing around! ☆22 · Updated 2 years ago
- Samples demonstrating how to use the Compute Sanitizer Tools and Public API ☆81 · Updated last year