keith2018 / TinyGPT
Tiny C++11 GPT-2 inference implementation from scratch
☆58 · Updated 2 weeks ago
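For context on what a from-scratch GPT-2 inference engine has to implement, below is a minimal, illustrative C++11 sketch of causal single-head scaled dot-product attention. The function name, shapes, and row-major layout are assumptions made for this example only; it is not TinyGPT's actual code or API.

```cpp
// Illustrative sketch only: causal single-head scaled dot-product attention,
// the core op of GPT-2 style decoding. Not taken from TinyGPT.
#include <cmath>
#include <vector>

// q, k, v: [seqLen x headDim] row-major matrices; returns [seqLen x headDim].
std::vector<float> attention(const std::vector<float>& q,
                             const std::vector<float>& k,
                             const std::vector<float>& v,
                             int seqLen, int headDim) {
    std::vector<float> out(seqLen * headDim, 0.0f);
    const float scale = 1.0f / std::sqrt(static_cast<float>(headDim));
    std::vector<float> scores(seqLen);
    for (int i = 0; i < seqLen; ++i) {
        // Causal mask: token i attends to positions 0..i only.
        float maxScore = -1e30f;
        for (int j = 0; j <= i; ++j) {
            float dot = 0.0f;
            for (int d = 0; d < headDim; ++d)
                dot += q[i * headDim + d] * k[j * headDim + d];
            scores[j] = dot * scale;
            if (scores[j] > maxScore) maxScore = scores[j];
        }
        // Numerically stable softmax over the unmasked scores.
        float sum = 0.0f;
        for (int j = 0; j <= i; ++j) {
            scores[j] = std::exp(scores[j] - maxScore);
            sum += scores[j];
        }
        // Weighted sum of value vectors.
        for (int j = 0; j <= i; ++j) {
            const float w = scores[j] / sum;
            for (int d = 0; d < headDim; ++d)
                out[i * headDim + d] += w * v[j * headDim + d];
        }
    }
    return out;
}

int main() {
    const int seqLen = 4, headDim = 8;
    std::vector<float> q(seqLen * headDim, 0.1f), k = q, v = q;
    std::vector<float> out = attention(q, k, v, seqLen, headDim);
    return out.empty() ? 1 : 0;
}
```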
Alternatives and similar repositories for TinyGPT:
Users interested in TinyGPT are comparing it to the libraries listed below.
- ☆124 · Updated last year
- A llama model inference framework implemented in CUDA C++ ☆50 · Updated 5 months ago
- Open deep learning compiler stack for cpu, gpu and specialized accelerators ☆18 · Updated last week
- Fast and memory-efficient exact attention ☆62 · Updated last week
- Standalone Flash Attention v2 kernel without libtorch dependency ☆108 · Updated 7 months ago
- qwen2 and llama3 C++ implementation ☆44 · Updated 10 months ago
- Efficient inference of large language models. ☆146 · Updated 4 months ago
- Decoding Attention is specially optimized for MHA, MQA, GQA, and MLA using CUDA cores for the decoding stage of LLM inference. ☆36 · Updated 3 weeks ago
- ☆82 · Updated last month
- A deep learning inference engine with a layered, decoupled architecture ☆72 · Updated 2 months ago
- A tiny deep learning training framework implemented from scratch in C++ that follows PyTorch's API. ☆48 · Updated 3 weeks ago
- Efficient, Flexible, and Highly Fault-Tolerant Model Service Management Based on SGLang ☆49 · Updated 5 months ago
- ☆11 · Updated last month
- ☆28 · Updated 2 months ago
- Triton documentation in Simplified Chinese / Triton 中文文档 ☆66 · Updated last week
- A practical way of learning Swizzle ☆18 · Updated 2 months ago
- A simple forward-inference framework extracted from MNN (for study!) ☆22 · Updated 4 years ago
- ☢️ TensorRT 2023 contest (second round): Llama model inference acceleration and optimization based on TensorRT-LLM ☆46 · Updated last year
- GPT2 implementation in C++ using Ort ☆26 · Updated 4 years ago
- ☆71 · Updated 5 months ago
- Tutorials for writing high-performance GPU operators in AI frameworks. ☆130 · Updated last year
- A from-scratch C implementation of the multi-head latent attention (MLA) used in the DeepSeek-V3 technical paper. ☆17 · Updated 3 months ago
- Performance of the C++ interfaces of Flash Attention and Flash Attention v2 in large language model (LLM) inference scenarios. ☆35 · Updated last month
- TensorRT LLM Benchmark Configuration ☆13 · Updated 9 months ago
- ☆63 · Updated 5 months ago
- ☆16 · Updated last year
- ☆20 · Updated 4 years ago
- GPTQ inference TVM kernel ☆38 · Updated last year
- 📚 FFPA (Split-D): Yet another faster Flash Attention with O(1) GPU SRAM complexity for large headdim, 1.8x~3x↑🎉 faster than SDPA EA. ☆169 · Updated 2 weeks ago
- Inference Llama 2 in one file of pure C++ ☆83 · Updated last year