Penn CIS 5650 (GPU Programming and Architecture) Final Project
☆43Dec 11, 2023Updated 2 years ago
Alternatives and similar repositories for Efficient-LLM-Inferencing-on-GPUs
Users that are interested in Efficient-LLM-Inferencing-on-GPUs are comparing it to the libraries listed below
- ☆11Sep 21, 2022Updated 3 years ago
- 🎓Automatically update circult-eda-mlsys-tinyml papers daily using GitHub Actions (updates every 8 hours)☆10Updated this week
- Matrix multiplication on GPUs for matrices stored on a CPU. Similar to cublasXt, but ported to both NVIDIA and AMD GPUs.☆31Apr 2, 2025Updated 11 months ago
- ☆14Nov 3, 2025Updated 4 months ago
- ☆13Jun 23, 2022Updated 3 years ago
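- A lightweight large-model inference framework☆21May 26, 2025Updated 9 months ago
- This repository deploys the Nanodet detection algorithm under the OpenVINO inference framework, with rewritten pre-processing and post-processing for very high performance and much faster detection on Intel CPU platforms. The model is also quantized (PTQ) to int8 precision with the NNCF and PPQ tools for even faster inference.☆15Jun 14, 2023Updated 2 years ago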
- RISCV C and Triton AI-Benchmark☆23Jan 28, 2026Updated last month
- SGEMM optimization with cuda step by step☆21Mar 23, 2024Updated last year
- A practical way of learning Swizzle☆37Feb 3, 2025Updated last year
- Benchmark code for the "Online normalizer calculation for softmax" paper☆108Jul 27, 2018Updated 7 years ago
- ☆22Mar 5, 2024Updated 2 years ago
- ☆29Oct 20, 2019Updated 6 years ago
- Notes on understanding the tensorRT_Pro open-source project☆22Feb 23, 2023Updated 3 years ago
- A llama model inference framework implemented in CUDA C++☆64Nov 8, 2024Updated last year
- Compare different hardware platforms via the Roofline Model for LLM inference tasks.☆119Mar 13, 2024Updated last year
- The official code for Dropping Backward Propagation (DropBP)☆32Oct 29, 2024Updated last year
- Matrix Multiplication on GPU using Shared Memory considering Coalescing and Bank Conflicts☆24Aug 29, 2022Updated 3 years ago
- CUDA 8-bit Tensor Core Matrix Multiplication based on m16n16k16 WMMA API☆34Sep 15, 2023Updated 2 years ago
- EESAST 2020 summer training☆28Jan 24, 2023Updated 3 years ago
- ☆33Jul 23, 2024Updated last year
- FP8 flash attention implemented on the Ada architecture using the cutlass repository☆79Aug 12, 2024Updated last year
- All Resources from Stanford CS106B 2021☆24Jul 11, 2025Updated 7 months ago
- ☆11May 30, 2025Updated 9 months ago
- This project is intended to build and deploy an SNPE model on Qualcomm Devices, which are having unsupported layers which are not part of…☆10Oct 4, 2021Updated 4 years ago
- HackerRank, LeetCode, Cracking the Coding Interview Solutions in Python/C++☆11Jan 20, 2024Updated 2 years ago
- Examples of CUDA implementations by Cutlass CuTe☆272Jul 1, 2025Updated 8 months ago
- This project is about convolution operator optimization on GPU, including GEMM-based (implicit GEMM) convolution.☆43Sep 29, 2025Updated 5 months ago
- Solutions to Programming Massively Parallel Processors, 2nd edition☆35Jun 4, 2022Updated 3 years ago
- TileFusion is an experimental C++ macro kernel template library that elevates the abstraction level in CUDA C for tile processing.☆107Jun 28, 2025Updated 8 months ago
- Network on chip based neural network accelerator☆10Mar 25, 2021Updated 4 years ago
- FlappyBird (Angry Birds) game implemented in C++; learning code☆10Nov 16, 2018Updated 7 years ago
- Extending BookSim2.0 and HotSpot6.0 for Power, Performance and Thermal evaluation of 3D NoC Architectures☆13Aug 9, 2019Updated 6 years ago
- RSV Scenario Modeling Hub☆16Mar 2, 2026Updated last week
- LLM-DSE: Searching Accelerator Parameters with LLM Agents☆13May 22, 2025Updated 9 months ago
- ☆115May 16, 2025Updated 9 months ago
- simple rebar detection competition https://www.datafountain.cn/competitions/332/details☆38Jan 17, 2019Updated 7 years ago
- ☆55Mar 15, 2025Updated 11 months ago
- ☆104Sep 9, 2024Updated last year