keith2018 / TinyGPT
Tiny C++ LLM inference implementation from scratch
☆95 · Updated last week
Alternatives and similar repositories for TinyGPT
Users interested in TinyGPT are comparing it to the libraries listed below.
- A deep learning inference engine with a layered, decoupled design ☆76 · Updated 9 months ago
- Efficient inference of large language models. ☆151 · Updated 2 months ago
- A llama model inference framework implemented in CUDA C++ ☆62 · Updated last year
- Tutorials for writing high-performance GPU operators in AI frameworks. ☆133 · Updated 2 years ago
- A simple general-purpose programming language ☆98 · Updated 3 months ago
- ☆125 · Updated last year
- 🤖FFPA: Extend FlashAttention-2 with Split-D, ~O(1) SRAM complexity for large headdim, 1.8x~3x↑🎉 vs SDPA EA. ☆233 · Updated 2 weeks ago
- Triton Documentation in Simplified Chinese / Triton 中文文档 ☆94 · Updated 2 weeks ago
- Free resource for the book AI Compiler Development Guide ☆47 · Updated 2 years ago
- A tiny deep learning training framework implemented from scratch in C++ that follows PyTorch's API. ☆130 · Updated last week
- CPM.cu is a lightweight, high-performance CUDA implementation for LLMs, optimized for end-device inference and featuring cutting-edge tec… ☆206 · Updated last month
- Decoding Attention is specially optimized for MHA, MQA, GQA and MLA using CUDA cores for the decoding stage of LLM inference. ☆45 · Updated 5 months ago
- ☆21 · Updated 4 years ago
- SGEMM optimization with CUDA, step by step ☆21 · Updated last year
- ☆97 · Updated 8 months ago
- Standalone Flash Attention v2 kernel without a libtorch dependency ☆112 · Updated last year
- ⚡️Write HGEMM from scratch using Tensor Cores with the WMMA, MMA and CuTe APIs, achieving peak⚡️ performance. ☆134 · Updated 6 months ago
- A simple and efficient memory pool implemented in C++11. ☆10 · Updated 3 years ago
- FlagTree is a unified compiler for multiple AI chips, forked from triton-lang/triton. ☆137 · Updated last week
- ☆70 · Updated 2 years ago
- Penn CIS 5650 (GPU Programming and Architecture) final project ☆44 · Updated last year
- A simple Transformer model implemented in C++. Attention Is All You Need. ☆52 · Updated 4 years ago
- ☆64 · Updated last week
- ☆27 · Updated last year
- GPTQ inference TVM kernel ☆40 · Updated last year
- ☆76 · Updated last year
- An annotated nano_vllm repository, with MiniCPM4 support added and the ability to register new models ☆108 · Updated 3 months ago
- We invite you to visit and follow our new repository at https://github.com/microsoft/TileFusion. TiledCUDA is a highly efficient kernel … ☆190 · Updated 10 months ago
- Llama 2 inference ☆43 · Updated 2 years ago
- A demo of how to write a high-performance convolution kernel for Apple Silicon ☆57 · Updated 3 years ago