Penn CIS 5650 (GPU Programming and Architecture) Final Project
☆44Dec 11, 2023Updated 2 years ago
Alternatives and similar repositories for Efficient-LLM-Inferencing-on-GPUs
Users that are interested in Efficient-LLM-Inferencing-on-GPUs are comparing it to the libraries listed below.
- ☆10Oct 8, 2021Updated 4 years ago
- ☆11Sep 21, 2022Updated 3 years ago
- 🎓 Automatically updates circult-eda-mlsys-tinyml papers daily using GitHub Actions (updates every 8 hours)☆10Updated this week
- Matrix multiplication on GPUs for matrices stored on a CPU. Similar to cublasXt, but ported to both NVIDIA and AMD GPUs.☆32Apr 2, 2025Updated last year
- ☆14Nov 3, 2025Updated 5 months ago
- Manage my starred projects on GitHub☆11Jul 23, 2020Updated 5 years ago
- Benchmark code for the "Online normalizer calculation for softmax" paper☆110Jul 27, 2018Updated 7 years ago
- ☆13Jun 23, 2022Updated 3 years ago
- ☆29Oct 20, 2019Updated 6 years ago
- Make triton easier☆50Jun 12, 2024Updated last year
- FractalTensor is a programming framework that introduces a novel approach to organizing data in deep neural networks (DNNs) as a list of …☆31Dec 21, 2024Updated last year
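- This repository deploys the NanoDet detection algorithm under the OpenVINO inference framework, with rewritten pre-processing and post-processing for very high performance. Makes detection speed take off on Intel CPU platforms! The model is also quantized (PTQ) to int8 precision with the NNCF and PPQ tools for even faster inference.☆16Jun 14, 2023Updated 2 years ago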
- Compare different hardware platforms via the Roofline Model for LLM inference tasks.☆118Mar 13, 2024Updated 2 years ago
- A lightweight large-model inference framework☆22May 26, 2025Updated 10 months ago
- RISCV C and Triton AI-Benchmark☆24Jan 28, 2026Updated 2 months ago
- A practical way of learning Swizzle☆37Feb 3, 2025Updated last year
- The official code for Dropping Backward Propagation (DropBP)☆32Oct 29, 2024Updated last year
- 1st Place Solution to iWildcam 2021: Count the number of animals of each species present in a sequence of images☆12Jun 24, 2021Updated 4 years ago
- [OSDI'24] Serving LLM-based Applications Efficiently with Semantic Variable☆213Sep 21, 2024Updated last year
- Optimized softmax in Triton for many cases☆23Sep 6, 2024Updated last year
- TileFusion is an experimental C++ macro kernel template library that elevates the abstraction level in CUDA C for tile processing.☆106Jun 28, 2025Updated 9 months ago
- SGEMM optimization with cuda step by step☆22Mar 23, 2024Updated 2 years ago
- EESAST 2020 summer training☆28Jan 24, 2023Updated 3 years ago
- Implementation of AdaCQR(COLING 2025)☆15Dec 30, 2024Updated last year
- ☆32Jul 17, 2024Updated last year
- An application to simulate Tomasulo's algorithm☆11Jan 16, 2014Updated 12 years ago
- FP8 flash attention implemented on the Ada architecture using the cutlass repository☆82Aug 12, 2024Updated last year
- mHC kernels implemented in CUDA☆260Mar 9, 2026Updated last month
- Notes on understanding the tensorRT_Pro open-source project☆22Feb 23, 2023Updated 3 years ago
- InstAttention: In-Storage Attention Offloading for Cost-Effective Long-Context LLM Inference☆16Mar 30, 2025Updated last year
- ☆48Dec 11, 2020Updated 5 years ago
- ☆33Jul 23, 2024Updated last year
- ☆105Sep 9, 2024Updated last year
- Online documentation can be found at https://minres.github.io/SCViewer/☆21Apr 10, 2026Updated last week
- A llama model inference framework implemented in CUDA C++☆65Nov 8, 2024Updated last year
- ☆41Sep 13, 2025Updated 7 months ago
- [ICLR 2025] TidalDecode: A Fast and Accurate LLM Decoding with Position Persistent Sparse Attention☆53Aug 6, 2025Updated 8 months ago
- ☆14Jul 13, 2025Updated 9 months ago
- Examples of CUDA implementations by Cutlass CuTe☆272Jul 1, 2025Updated 9 months ago