leimao / Nsight-Systems-Docker-Image
Nsight Systems In Docker
☆20 Updated last year
Alternatives and similar repositories for Nsight-Systems-Docker-Image:
Users who are interested in Nsight-Systems-Docker-Image are comparing it to the repositories listed below:
- Standalone Flash Attention v2 kernel without libtorch dependency ☆108 Updated 7 months ago
- This repository is a read-only mirror of https://gitlab.arm.com/kleidi/kleidiai ☆33 Updated this week
- Decoding Attention is specially optimized for MHA, MQA, GQA and MLA using CUDA core for the decoding stage of LLM inference. ☆36 Updated 3 weeks ago
- ☆28 Updated 2 months ago
- llama INT4 cuda inference with AWQ ☆54 Updated 3 months ago
- A standalone GEMM kernel for fp16 activation and quantized weight, extracted from FasterTransformer ☆91 Updated 3 weeks ago
- End to End steps for adding custom ops in PyTorch. ☆21 Updated 4 years ago
- Study of Ampere's Sparse Matmul ☆18 Updated 4 years ago
- A faster implementation of OpenCV-CUDA that uses OpenCV objects, and more! ☆49 Updated last week
- Benchmark code for the "Online normalizer calculation for softmax" paper ☆91 Updated 6 years ago
- Open Source Projects from Pallas Lab ☆20 Updated 3 years ago
- ☆69 Updated 2 years ago
- CUDA 8-bit Tensor Core Matrix Multiplication based on m16n16k16 WMMA API ☆30 Updated last year
- 📚 A curated list of awesome matrix-matrix multiplication (A * B = C) frameworks, libraries and software ☆31 Updated 2 months ago
- ⚡️Write HGEMM from scratch using Tensor Cores with WMMA, MMA and CuTe API, Achieve Peak⚡️ Performance. ☆73 Updated 3 weeks ago
- Penn CIS 5650 (GPU Programming and Architecture) Final Project ☆29 Updated last year
- ☆68 Updated 3 months ago
- SparseTIR: Sparse Tensor Compiler for Deep Learning ☆135 Updated 2 years ago
- [EMNLP Findings 2024] MobileQuant: Mobile-friendly Quantization for On-device Language Models ☆56 Updated 7 months ago
- Several optimization methods of half-precision general matrix vector multiplication (HGEMV) using CUDA core. ☆61 Updated 7 months ago
- Framework to reduce autotune overhead to zero for well known deployments. ☆63 Updated last week
- ☆63 Updated 5 months ago
- Study of CUTLASS ☆21 Updated 5 months ago
- A Python script to convert the output of NVIDIA Nsight Systems (in SQLite format) to JSON in Google Chrome Trace Event Format. ☆33 Updated 3 months ago
- GPTQ inference TVM kernel ☆38 Updated last year
- ☆143 Updated 2 years ago
- Model compression for ONNX ☆91 Updated 5 months ago
- GEMM and Winograd based convolutions using CUTLASS ☆26 Updated 4 years ago
- Benchmark scripts for TVM ☆74 Updated 3 years ago
- Optimize GEMM with tensorcore step by step ☆25 Updated last year