leimao / Nsight-Systems-Docker-Image
Nsight Systems In Docker
☆20 · Updated last year
Alternatives and similar repositories for Nsight-Systems-Docker-Image
Users interested in Nsight-Systems-Docker-Image are comparing it to the libraries listed below.
- Open Source Projects from Pallas Lab ☆21 · Updated 4 years ago
- Llama INT4 CUDA inference with AWQ ☆55 · Updated 9 months ago
- High-performance FP8 GEMM kernels for SM89 and later GPUs ☆20 · Updated 9 months ago
- Standalone Flash Attention v2 kernel without a libtorch dependency ☆112 · Updated last year
- ☆76 · Updated last year
- Study of Ampere's sparse matmul ☆18 · Updated 4 years ago
- ☆165 · Updated 2 years ago
- QUICK: Quantization-aware Interleaving and Conflict-free Kernel for efficient LLM inference ☆118 · Updated last year
- Model compression for ONNX ☆98 · Updated last year
- ☆39 · Updated last year
- This library empowers users to seamlessly port pretrained models and checkpoints on the HuggingFace (HF) hub (developed using HF transfor… ☆83 · Updated this week
- ShiftAddLLM: Accelerating Pretrained LLMs via Post-Training Multiplication-Less Reparameterization ☆110 · Updated last year
- High-speed GEMV kernels, with up to a 2.7x speedup over the PyTorch baseline ☆121 · Updated last year
- Penn CIS 5650 (GPU Programming and Architecture) Final Project ☆45 · Updated last year
- GPTQ inference TVM kernel ☆39 · Updated last year
- PyTorch extension for emulating FP8 data formats on standard FP32 Xeon/GPU hardware ☆111 · Updated 11 months ago
- An experimental CPU backend for Triton (https://github.com/openai/triton) ☆47 · Updated 3 months ago
- [EMNLP Findings 2024] MobileQuant: Mobile-friendly Quantization for On-device Language Models ☆68 · Updated last year
- ☆207 · Updated 4 years ago
- A curated list of efficient large language models ☆11 · Updated last year
- Flexible simulator for mixed-precision and format simulation of LLMs and vision transformers ☆51 · Updated 2 years ago
- ☆37 · Updated last year
- A standalone GEMM kernel for fp16 activations and quantized weights, extracted from FasterTransformer ☆96 · Updated 2 months ago
- CUDA 8-bit Tensor Core matrix multiplication based on the m16n16k16 WMMA API ☆33 · Updated 2 years ago
- LLM inference with the Microscaling format ☆32 · Updated last year
- ☆83 · Updated 9 months ago
- SparseTIR: Sparse Tensor Compiler for Deep Learning ☆141 · Updated 2 years ago
- Easily benchmark PyTorch model FLOPs, latency, throughput, allocated GPU memory, and energy consumption ☆109 · Updated 2 years ago
- A collection of research papers on efficient training of DNNs ☆70 · Updated 3 years ago
- QONNX: Arbitrary-Precision Quantized Neural Networks in ONNX ☆164 · Updated this week