lenLRX / llm_simpleLinks

☆12

Alternatives and similar repositories for llm_simple

Users that are interested in llm_simple are comparing it to the libraries listed below

Sorting:

Ascend / AscendSpeed
☆79Updated last year
TRT2022 / trtllm-llama
☢️ TensorRT 2023复赛——基于TensorRT-LLM的Llama模型推断加速优化
☆48Updated last year
Rayrtfr / FasterTransformer
Transformer related optimization, including BERT, GPT
☆17Updated last year
weishengying / cutlass_flash_atten_fp8
使用 cutlass 仓库在 ada 架构上实现 fp8 的 flash attention
☆69Updated 9 months ago
dhcode-cpp / online-softmax
simplest online-softmax notebook for explain Flash Attention
☆10Updated 5 months ago
li199603 / sgemm_with_cuda
SGEMM optimization with cuda step by step
☆19Updated last year
Bruce-Lee-LY / flash_attention_inference
Performance of the C++ interface of flash attention and flash attention v2 in large language model (LLM) inference scenarios.
☆37Updated 3 months ago
YellowOldOdd / SDBI
Simple Dynamic Batching Inference
☆145Updated 3 years ago
FeiGeChuanShu / trt2023
NVIDIA TensorRT Hackathon 2023复赛选题：通义千问Qwen-7B用TensorRT-LLM模型搭建及优化
☆42Updated last year
tlc-pack / libflash_attn
Standalone Flash Attention v2 kernel without libtorch dependency
☆110Updated 8 months ago
vllm-project / flash-attention
Fast and memory-efficient exact attention
☆72Updated last month
OpenPPL / ppl.nn.llm
☆139Updated last year
OpenPPL / ppl.llm.serving
☆127Updated 5 months ago
luliyucoordinate / cute-flash-attention
Implement Flash Attention using Cute.
☆85Updated 5 months ago
Archermmt / yolov3_dcn_nv_hackthon2021
☆10Updated 4 years ago
AlibabaPAI / FLASHNN
☆96Updated 8 months ago
openmlsys / openmlsys-cuda
Tutorials for writing high-performance GPU operators in AI frameworks.
☆130Updated last year
Oneflow-Inc / oneflow-xrt
☆23Updated 2 years ago
LeiWang1999 / AutoGPTQ.tvm
GPTQ inference TVM kernel
☆40Updated last year
DeepLink-org / CVFusion
CVFusion is an open-source deep learning compiler to fuse the OpenCV operators.
☆29Updated 2 years ago
ZihaoZhao / CUDA_study
☆45Updated 5 years ago
CalebDu / Awesome-Cute
☆73Updated 3 weeks ago
Oneflow-Inc / serving
OneFlow Serving
☆20Updated last month
pigirons / conv3x3_m1
This is a demo how to write a high performance convolution run on apple silicon
☆54Updated 3 years ago
MARD1NO / CUDA-PPT
☆93Updated 2 months ago
xlite-dev / HGEMM
⚡️Write HGEMM from scratch using Tensor Cores with WMMA, MMA and CuTe API, Achieve Peak⚡️ Performance.
☆79Updated 3 weeks ago
microsoft / chunk-attention
☆76Updated last month
void-main / FasterTransformer
Transformer related optimization, including BERT, GPT
☆59Updated last year
feifeibear / Odysseus-Transformer
Odysseus: Playground of LLM Sequence Parallelism
☆70Updated 11 months ago
OpenPPL / ppl.pmx
☆58Updated 6 months ago