morousg / cvGPUSpeedup
A faster implementation of OpenCV-CUDA that uses OpenCV objects, and more!
☆51 Updated this week
Alternatives and similar repositories for cvGPUSpeedup
Users who are interested in cvGPUSpeedup are comparing it to the repositories listed below.
- A tool to convert a TensorRT engine/plan to a fake ONNX model ☆39 Updated 2 years ago
- Model compression for ONNX ☆96 Updated 7 months ago
- Awesome code, projects, books, etc. related to CUDA ☆17 Updated last week
- Zero-copy multimodal vector DB with CUDA and CLIP/SigLIP ☆59 Updated last month
- Nsight Systems In Docker ☆20 Updated last year
- Simple tool for partial optimization of ONNX. Further optimize some models that cannot be optimized with onnx-optimizer and onnxsim by se… ☆19 Updated last year
- Stable Diffusion in TensorRT 8.5+ ☆14 Updated 2 years ago
- A CUDA kernel for NHWC GroupNorm for PyTorch ☆19 Updated 7 months ago
- Python scripts performing optical flow estimation using the NeuFlowV2 model in ONNX. ☆47 Updated 9 months ago
- An easy way to run, test, benchmark and tune OpenCL kernel files ☆23 Updated last year
- ☆32 Updated last week
- HunyuanDiT with TensorRT and libtorch ☆17 Updated last year
- FlexAttention w/ FlashAttention3 Support ☆26 Updated 8 months ago
- Enable everyone to develop, optimize and deploy AI models natively on everyone's devices. ☆10 Updated last year
- Open deep learning compiler stack for cpu, gpu and specialized accelerators ☆19 Updated last week
- [WIP] Better (FP8) attention for Hopper ☆30 Updated 4 months ago
- Decoding Attention is specially optimized for MHA, MQA, GQA and MLA using CUDA core for the decoding stage of LLM inference. ☆38 Updated 2 weeks ago
- ☆16 Updated last year
- Standalone Flash Attention v2 kernel without libtorch dependency ☆110 Updated 9 months ago
- Docker scripts for building ONNX Runtime with TensorRT and OpenVINO in manylinux environment ☆22 Updated 2 years ago
- ☆29 Updated 4 months ago
- ☆35 Updated 2 years ago
- A toolkit for developers to simplify the transformation of nn.Module instances. It now corresponds to torch.fx. ☆13 Updated 2 years ago
- ☆18 Updated 2 years ago
- ☆11 Updated last year
- study of cutlass ☆21 Updated 7 months ago
- A very simple tool for situations where optimization with onnx-simplifier would exceed the Protocol Buffers upper file size limit of 2GB,… ☆17 Updated last year
- OneFlow Serving ☆20 Updated 2 months ago
- IntLLaMA: A fast and light quantization solution for LLaMA ☆18 Updated last year
- Memory-Efficient CUDA kernels for training ConvNets with PyTorch. ☆41 Updated 4 months ago