NVIDIA/TensorRT-Edge-LLM

High-performance, light-weight C++ LLM and VLM Inference Software for Physical AI

☆410

Alternatives and similar repositories for TensorRT-Edge-LLM

Users who are interested in TensorRT-Edge-LLM are comparing it to the libraries listed below.

vllm-project / vllm-xpu-kernels
The vLLM XPU kernels for Intel GPU
☆44 · Updated this week
NVIDIA / Deep-Learning-Accelerator-SW
NVIDIA DLA-SW, the recipes and tools for running deep learning workloads on NVIDIA DLA cores for inference applications.
☆234 · Jun 10, 2024 · Updated last year
IEIAuto / AutoDRRT
☆57 · Jan 5, 2026 · Updated 4 months ago
latentCall145 / channels-last-groupnorm
A CUDA kernel for NHWC GroupNorm for PyTorch
☆23 · Nov 15, 2024 · Updated last year
weishengying / cutlass_flash_atten_fp8
An FP8 flash attention implementation for the Ada architecture, built with the cutlass repository
☆81 · Aug 12, 2024 · Updated last year
ros-acceleration / adaptive_component
A composable container for Adaptive ROS 2 Node computations. Select between FPGA, CPU or GPU at run-time.
☆12 · Apr 14, 2022 · Updated 4 years ago
xlite-dev / HGEMM
⚡️ Write HGEMM from scratch using Tensor Cores with WMMA, MMA and CuTe API, Achieve Peak ⚡️ Performance.
☆155 · May 10, 2025 · Updated last year
Qingrenn / mmdeploy-summer-camp
🐱 Evaluation of ncnn int8 model quantization
☆14 · Oct 10, 2022 · Updated 3 years ago
ThomasVonWu / Awesome-VLMs-Strawberry
A collection of VLMs papers, blogs, and projects, with a focus on VLMs in Autonomous Driving and related reasoning techniques.
☆11 · Nov 16, 2024 · Updated last year
thanhlnbka / yolov7-triton-deepstream
☆25 · Oct 10, 2022 · Updated 3 years ago
NVIDIA-AI-IOT / cuDLA-samples
YOLOv5 on Orin DLA
☆225 · Feb 18, 2024 · Updated 2 years ago
tile-ai / AttentionEngine
☆52 · May 19, 2025 · Updated last year
PINTO0309 / jetson-tensorflow-pytorch-build
Provides an environment for compiling TensorFlow or PyTorch with CUDA for aarch64 on an x86 machine. This is for Jetson. If you build usi…
☆14 · Feb 27, 2021 · Updated 5 years ago
LeiWang1999 / TVM.CMakeExtend
Tutorials on extending and importing TVM with a CMake include dependency.
☆16 · Oct 11, 2024 · Updated last year
NVIDIA / TensorRT-RTX
NVIDIA TensorRT-RTX is an SDK for high-performance AI inference on NVIDIA RTX GPUs. This repository contains Open-Source Software compone…
☆99 · Mar 18, 2026 · Updated 2 months ago
mlc-ai / mlc-python
☆38 · Jul 19, 2025 · Updated 10 months ago
NVIDIA / Model-Optimizer
A unified library of SOTA model optimization techniques like quantization, pruning, distillation, speculative decoding, etc. It compresse…
☆2,750 · Updated this week
zhaohb / MeloTTS-OV
Using OpenVINO to speed up MeloTTS inference
☆15 · Nov 1, 2024 · Updated last year
Kazuhito00 / LDC-ONNX-Sample
A Python ONNX inference sample for LDC: Lightweight Dense CNN for Edge Detection
☆15 · May 6, 2023 · Updated 3 years ago
ArthurinRUC / cutlass-notes
From Minimal GEMM to Everything
☆207 · Updated this week
Edwardwaw / ttfnet
☆10 · Dec 21, 2020 · Updated 5 years ago
richjjj / duscratch
A collection of saved code snippets of interest
☆13 · Jun 6, 2023 · Updated 2 years ago
ZhangZhiPku / cutile-examples
cutile kernel examples
☆49 · Apr 3, 2026 · Updated last month
Tartisan / apollo_ros
☆13 · Mar 26, 2022 · Updated 4 years ago
binabik-ai / mcp-rosbags
MCP Server to interface with and analyze rosbags offline
☆22 · Sep 21, 2025 · Updated 8 months ago
AXERA-TECH / ONNX-YOLO-World-Open-Vocabulary-Object-Detection
Python scripts for Open Vocabulary Object Detection using the YOLO-World model in ONNX, and for exporting the ONNX model for AXera's NPU
☆12 · Aug 11, 2025 · Updated 9 months ago
HydraQYH / hp_rms_norm
High-performance RMSNorm implemented using SM core storage (registers and shared memory)
☆30 · Jan 22, 2026 · Updated 4 months ago
Mandylove1993 / CUDA-FastBEV
TensorRT deployment and PTQ/QAT tool development for FastBEV; total time is only 6.9 ms!
☆308 · Dec 8, 2023 · Updated 2 years ago
TRT2022 / trtllm-llama
☢️ TensorRT 2023 second round: accelerating and optimizing Llama model inference with TensorRT-LLM
☆52 · Oct 20, 2023 · Updated 2 years ago
AyakaGEMM / Hands-on-GEMM
☆153 · Mar 18, 2024 · Updated 2 years ago
liuyanyi / AD-Toolbox
Aerial Detection Toolbox
☆11 · Jan 18, 2023 · Updated 3 years ago
abdelfattah-lab / nitro
Lightweight Python Wrapper for OpenVINO, enabling LLM inference on NPUs
☆29 · Dec 17, 2024 · Updated last year
sgl-project / DeepGEMM
DeepGEMM: clean and efficient FP8 GEMM kernels with fine-grained scaling
☆24 · May 14, 2026 · Updated last week
weishengying / tiny-flash-attention
A minimal flash-attention implementation using cutlass, intended for teaching
☆59 · Aug 12, 2024 · Updated last year
LeiWang1999 / Stream-k.tvm
☆20 · Sep 28, 2024 · Updated last year
cavedweller509 / LMDeploy-Jetson
Deploying LLMs offline on the NVIDIA Jetson platform marks the dawn of a new era in embodied intelligence, where devices can function ind…
☆107 · Mar 23, 2024 · Updated 2 years ago
feifeibear / ChituAttention
Quantized Attention on GPU
☆44 · Nov 22, 2024 · Updated last year
FeiGeChuanShu / trt2023
NVIDIA TensorRT Hackathon 2023 second-round topic: building and optimizing the Tongyi Qianwen Qwen-7B model with TensorRT-LLM
☆43 · Oct 20, 2023 · Updated 2 years ago
triton-inference-server / tensorrt_backend
The Triton backend for TensorRT.
☆88 · May 8, 2026 · Updated 2 weeks ago