yinuotxie / Efficient-LLM-Inferencing-on-GPUs
Penn CIS 5650 (GPU Programming and Architecture) Final Project
☆44Dec 11, 2023Updated 2 years ago
Alternatives and similar repositories for Efficient-LLM-Inferencing-on-GPUs
Users who are interested in Efficient-LLM-Inferencing-on-GPUs are comparing it to the libraries listed below
- ☆11Sep 21, 2022Updated 3 years ago
- 🎓Automatically updates circult-eda-mlsys-tinyml papers daily using GitHub Actions (updates every 8 hours)☆10Updated this week
- ☆10Oct 8, 2021Updated 4 years ago
- Matrix multiplication on GPUs for matrices stored on a CPU. Similar to cublasXt, but ported to both NVIDIA and AMD GPUs.☆32Apr 2, 2025Updated 10 months ago
- ☆14Nov 3, 2025Updated 3 months ago
- ☆13Jun 23, 2022Updated 3 years ago
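- This repository deploys the Nanodet detection algorithm under the OpenVINO inference framework, with rewritten pre-processing and post-processing for very high performance, so detection speed takes off on Intel CPU platforms! The model is also quantized (PTQ) to int8 precision with the NNCF and PPQ tools for even faster inference!☆16Jun 14, 2023Updated 2 years ago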
- A lightweight inference framework for large models☆21May 26, 2025Updated 8 months ago
- FractalTensor is a programming framework that introduces a novel approach to organizing data in deep neural networks (DNNs) as a list of …☆32Dec 21, 2024Updated last year
- RISCV C and Triton AI-Benchmark☆23Jan 28, 2026Updated 2 weeks ago
- A practical way of learning Swizzle☆36Feb 3, 2025Updated last year
- SGEMM optimization with CUDA, step by step☆21Mar 23, 2024Updated last year
- Benchmark code for the "Online normalizer calculation for softmax" paper☆105Jul 27, 2018Updated 7 years ago
- ☆22Mar 5, 2024Updated last year
- Optimize softmax in triton in many cases☆22Sep 6, 2024Updated last year
- A llama model inference framework implemented in CUDA C++☆64Nov 8, 2024Updated last year
- ☆29Oct 20, 2019Updated 6 years ago
- Notes on understanding the tensorRT_Pro open-source project☆22Feb 23, 2023Updated 2 years ago
- The official code for Dropping Backward Propagation (DropBP)☆31Oct 29, 2024Updated last year
- ☆30Nov 16, 2024Updated last year
- Compare different hardware platforms via the Roofline Model for LLM inference tasks.☆120Mar 13, 2024Updated last year
- mHC kernels implemented in CUDA☆252Jan 14, 2026Updated last month
- This is a repository to practice multi-thread programming in C++☆28Feb 21, 2024Updated last year
- Matrix Multiplication on GPU using Shared Memory considering Coalescing and Bank Conflicts☆25Aug 29, 2022Updated 3 years ago
- CUDA 8-bit Tensor Core Matrix Multiplication based on m16n16k16 WMMA API☆35Sep 15, 2023Updated 2 years ago
- EESAST 2020 summer training☆28Jan 24, 2023Updated 3 years ago
- [OSDI'24] Serving LLM-based Applications Efficiently with Semantic Variable☆209Sep 21, 2024Updated last year
- ☆69Mar 19, 2023Updated 2 years ago
- ☆33Jul 23, 2024Updated last year
- ☆32Jul 17, 2024Updated last year
- fp8 flash attention implemented on the Ada architecture using the cutlass repository☆79Aug 12, 2024Updated last year
- This project is intended to build and deploy an SNPE model on Qualcomm Devices, which are having unsupported layers which are not part of…☆10Oct 4, 2021Updated 4 years ago
- ☆11May 30, 2025Updated 8 months ago
- Framework for studying cryptographic hash functions using SAT.☆10Dec 21, 2021Updated 4 years ago
- HackerRank, LeetCode, Cracking the Coding Interview Solutions in Python/C++☆11Jan 20, 2024Updated 2 years ago
- All Resources from Stanford CS106B 2021☆23Jul 11, 2025Updated 7 months ago
- Python library for the simulation of probabilistic circuits.☆11Feb 1, 2026Updated 2 weeks ago
- Automated Continuous Data Quality Measurement☆12Nov 15, 2023Updated 2 years ago
- Examples of CUDA implementations by Cutlass CuTe☆270Jul 1, 2025Updated 7 months ago