dianhsu / transformer-cpp-cpu
A simple Transformer model implemented in C++, following *Attention Is All You Need*.
☆37 · Updated 3 years ago
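The core operation such a repository implements is scaled dot-product attention from *Attention Is All You Need*: Attention(Q, K, V) = softmax(QKᵀ / √d_k) V. Below is a minimal single-head C++ sketch of that formula for illustration only; the `Matrix` alias and function names are hypothetical and not taken from this repository.

```cpp
// Minimal sketch of single-head scaled dot-product attention:
//   Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
// Illustrative only; names here (Matrix, attention, softmax_rows)
// are assumptions, not code from transformer-cpp-cpu.
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

using Matrix = std::vector<std::vector<float>>; // row-major [rows][cols]

// Numerically stable softmax applied to each row in place.
void softmax_rows(Matrix& m) {
    for (auto& row : m) {
        float max_val = row[0];
        for (float x : row) max_val = std::max(max_val, x);
        float sum = 0.0f;
        for (float& x : row) { x = std::exp(x - max_val); sum += x; }
        for (float& x : row) x /= sum;
    }
}

// Q is [n, d_k], K is [m, d_k], V is [m, d_v]; returns [n, d_v].
Matrix attention(const Matrix& Q, const Matrix& K, const Matrix& V) {
    const std::size_t n = Q.size(), m = K.size();
    const std::size_t d_k = Q[0].size(), d_v = V[0].size();
    const float scale = 1.0f / std::sqrt(static_cast<float>(d_k));

    // scores = Q K^T / sqrt(d_k), shape [n, m]
    Matrix scores(n, std::vector<float>(m, 0.0f));
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < m; ++j) {
            float dot = 0.0f;
            for (std::size_t k = 0; k < d_k; ++k) dot += Q[i][k] * K[j][k];
            scores[i][j] = dot * scale;
        }
    softmax_rows(scores);

    // output = softmax(scores) V, shape [n, d_v]
    Matrix out(n, std::vector<float>(d_v, 0.0f));
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < m; ++j)
            for (std::size_t k = 0; k < d_v; ++k)
                out[i][k] += scores[i][j] * V[j][k];
    return out;
}
```

A full Transformer block would wrap this with learned linear projections for Q, K, and V, split the result across multiple heads, and follow with a position-wise feed-forward network.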
Related projects:
- Swin Transformer C++ Implementation ☆53 · Updated 3 years ago
- Play GEMM with TVM ☆81 · Updated last year
- A layered, decoupled deep learning inference engine ☆58 · Updated 3 weeks ago
- A simplified flash-attention implementation built with CUTLASS, intended as a teaching example ☆29 · Updated last month
- Several optimization methods for half-precision general matrix-vector multiplication (HGEMV) using CUDA cores ☆40 · Updated last week
- EasyNN is a neural network inference framework built for teaching, designed so that anyone can write an inference framework from scratch, even with no prior background ☆22 · Updated 3 weeks ago
- Performance of the C++ interface of flash attention and flash attention v2 in large language model (LLM) inference scenarios ☆20 · Updated last week
- Standalone Flash Attention v2 kernel without libtorch dependency ☆93 · Updated last week
- A standalone GEMM kernel for fp16 activation and quantized weight, extracted from FasterTransformer ☆82 · Updated 6 months ago
- Simplify ONNX models larger than 2 GB ☆41 · Updated 6 months ago
- FP8 flash attention for the Ada architecture, implemented with the CUTLASS library ☆46 · Updated last month
- Llama 2 inference ☆35 · Updated 10 months ago
- A tutorial for CUDA & PyTorch ☆110 · Updated last week
- Some common CUDA kernel implementations (not the fastest) ☆11 · Updated last month
- Optimize GEMM with Tensor Cores, step by step ☆11 · Updated 9 months ago
- CPU Memory Compiler and Parallel Programming ☆24 · Updated 2 months ago
- CUDA 8-bit Tensor Core matrix multiplication based on the m16n16k16 WMMA API ☆22 · Updated last year