leimao/Nsight-Systems-Docker-Image
Nsight Systems in Docker (☆17, updated 9 months ago)
Related projects:
- llama INT4 CUDA inference with AWQ (☆46, updated 2 months ago)
- Standalone Flash Attention v2 kernel without libtorch dependency (☆93, updated last week)
- Open Source Projects from Pallas Lab (☆17, updated 2 years ago)
- A standalone GEMM kernel for fp16 activation and quantized weight, extracted from FasterTransformer (☆82, updated 6 months ago)
- GEMM and Winograd-based convolutions using CUTLASS (☆24, updated 4 years ago)
- A Winograd Minimal Filter Implementation in CUDA (☆20, updated 3 years ago)
- GPTQ inference TVM kernel (☆35, updated 4 months ago)
- This library empowers users to seamlessly port pretrained models and checkpoints on the HuggingFace (HF) hub (developed using HF transfor… (☆37, updated this week)
- Penn CIS 5650 (GPU Programming and Architecture) Final Project (☆21, updated 9 months ago)
- Benchmark code for the "Online normalizer calculation for softmax" paper (☆52, updated 6 years ago); a minimal sketch of the online-softmax recurrence appears after this list
- TensorRT LLM Benchmark Configuration (☆10, updated last month)
- A sandbox for experimenting with TVM (☆22, updated last year)
- Converting a deep neural network to integer-only inference in native C via uniform quantization and a fixed-point representation (☆20, updated 2 years ago)
- Benchmark PyTorch Custom Operators (☆13, updated last year)
- Benchmark scripts for TVM (☆73, updated 2 years ago)
- Several optimization methods for half-precision general matrix-vector multiplication (HGEMV) using CUDA cores (☆40, updated 2 weeks ago)
- An external memory allocator example for PyTorch (☆13, updated 2 years ago)
- A Python script to convert the output of NVIDIA Nsight Systems (in SQLite format) to JSON in the Google Chrome Trace Event Format (☆19, updated 2 weeks ago); a sketch of such a conversion appears after this list
- CUDA 8-bit Tensor Core Matrix Multiplication based on the m16n16k16 WMMA API (☆22, updated last year)
- Performance of the C++ interface of Flash Attention and Flash Attention v2 in large language model (LLM) inference scenarios (☆20, updated 2 weeks ago)
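
The "Online normalizer calculation for softmax" entry above refers to the single-pass normalizer from Milakov and Gimelshein's paper: the maximum and the exponential sum are updated together, so the two separate passes of the conventional safe softmax are fused into one. A minimal sketch of that recurrence in plain Python (not the benchmark repository's CUDA code):

```python
import math

def online_softmax(xs):
    """Single-pass softmax: keep a running maximum m and a running sum d of
    exp(x - m); rescale d whenever the maximum changes."""
    m = float("-inf")  # running maximum
    d = 0.0            # running sum of exp(x - m)
    for x in xs:
        m_new = max(m, x)
        # Rescale the existing sum to the new maximum, then add the new term.
        d = d * math.exp(m - m_new) + math.exp(x - m_new)
        m = m_new
    return [math.exp(x - m) / d for x in xs]

print(online_softmax([1.0, 2.0, 3.0]))  # ≈ [0.0900, 0.2447, 0.6652]
```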
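
Similarly, the SQLite-to-Chrome-trace entry boils down to reading timed rows out of an `nsys export --type sqlite` database and re-emitting them as Chrome Trace Event Format JSON. The sketch below is not that repository's script, only an illustration of the idea; the table and column names (CUPTI_ACTIVITY_KIND_KERNEL, shortName, StringIds) are assumptions about the nsys export schema and can differ between Nsight Systems versions:

```python
import json
import sqlite3

def kernels_to_chrome_trace(sqlite_path: str, json_path: str) -> None:
    """Convert CUDA kernel rows from an nsys SQLite export into a Chrome
    Trace Event Format JSON file (one "complete" event per kernel).

    NOTE: the table/column names below are assumptions about the nsys
    export schema and may need adjusting for a given Nsight Systems version.
    """
    conn = sqlite3.connect(sqlite_path)
    rows = conn.execute(
        """
        SELECT k.start, k."end", s.value, k.deviceId, k.streamId
        FROM CUPTI_ACTIVITY_KIND_KERNEL AS k
        JOIN StringIds AS s ON s.id = k.shortName
        """
    ).fetchall()
    conn.close()

    events = []
    for start_ns, end_ns, name, device_id, stream_id in rows:
        events.append({
            "name": name,
            "ph": "X",                        # complete event: start + duration
            "ts": start_ns / 1e3,             # Chrome traces use microseconds
            "dur": (end_ns - start_ns) / 1e3,
            "pid": device_id,                 # group rows by GPU ...
            "tid": stream_id,                 # ... and by CUDA stream
        })

    with open(json_path, "w") as f:
        json.dump({"traceEvents": events}, f)
```

Loading the resulting JSON in chrome://tracing or Perfetto then shows the kernels as a per-GPU, per-stream timeline.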