lenLRX / llm_simple
☆12 · Updated this week
Alternatives and similar repositories for llm_simple
Users interested in llm_simple are comparing it to the libraries listed below.
- ☆79 · Updated last year
- ☢️ TensorRT Hackathon 2023 final round: inference acceleration for the Llama model based on TensorRT-LLM ☆48 · Updated last year
- Transformer-related optimization, including BERT and GPT ☆17 · Updated last year
- FP8 Flash Attention implemented on the Ada architecture using the CUTLASS library ☆69 · Updated 9 months ago
- The simplest online-softmax notebook for explaining Flash Attention (a minimal sketch of this update appears after this list) ☆10 · Updated 5 months ago
- Step-by-step SGEMM optimization with CUDA ☆19 · Updated last year
- Performance of the C++ interface of Flash Attention and Flash Attention v2 in large language model (LLM) inference scenarios ☆37 · Updated 3 months ago
- Simple Dynamic Batching Inference ☆145 · Updated 3 years ago
- NVIDIA TensorRT Hackathon 2023 final-round topic: building and optimizing a TensorRT-LLM model for Qwen-7B (Tongyi Qianwen) ☆42 · Updated last year
- Standalone Flash Attention v2 kernel without libtorch dependency ☆110 · Updated 8 months ago
- Fast and memory-efficient exact attention ☆72 · Updated last month
- ☆139 · Updated last year
- ☆127 · Updated 5 months ago
- Flash Attention implemented using CuTe ☆85 · Updated 5 months ago
- ☆10 · Updated 4 years ago
- ☆96 · Updated 8 months ago
- Tutorials for writing high-performance GPU operators in AI frameworks ☆130 · Updated last year
- ☆23 · Updated 2 years ago
- GPTQ inference TVM kernel ☆40 · Updated last year
- CVFusion is an open-source deep-learning compiler that fuses OpenCV operators ☆29 · Updated 2 years ago
- ☆45 · Updated 5 years ago
- ☆73 · Updated 3 weeks ago
- OneFlow Serving ☆20 · Updated last month
- A demo of how to write a high-performance convolution that runs on Apple silicon ☆54 · Updated 3 years ago
- ☆93 · Updated 2 months ago
- ⚡️ Write HGEMM from scratch using Tensor Cores with the WMMA, MMA, and CuTe APIs, achieving peak performance ☆79 · Updated 3 weeks ago
- ☆76 · Updated last month
- Transformer-related optimization, including BERT and GPT ☆59 · Updated last year
- Odysseus: Playground of LLM Sequence Parallelism ☆70 · Updated 11 months ago
- ☆58 · Updated 6 months ago
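
Several entries above (the online-softmax notebook and the various Flash Attention kernels) build on the same streaming-softmax trick: maintain a running maximum and a running normalizer so the softmax can be computed in a single pass over tiles of scores. Below is a minimal NumPy sketch of that update rule; the function name and structure are illustrative and not taken from any of the listed repositories.

```python
import numpy as np

def online_softmax(scores):
    # Streaming softmax: keep a running max `m` and a running
    # normalizer `d = sum(exp(x - m))`, rescaling `d` whenever a
    # new maximum appears. Flash Attention applies the same update
    # tile by tile, so the full score row is never materialized.
    m = -np.inf
    d = 0.0
    for x in scores:
        m_new = max(m, x)
        # rescale the old normalizer to the new max, then add the new term
        d = d * np.exp(m - m_new) + np.exp(x - m_new)
        m = m_new
    return np.exp(scores - m) / d  # final lightweight pass emits probabilities

x = np.array([2.0, 1.0, 3.0, 0.5])
ref = np.exp(x - x.max()) / np.exp(x - x.max()).sum()
assert np.allclose(online_softmax(x), ref)
```

The rescaling step keeps each partial normalizer valid as the maximum grows, which is what lets Flash Attention fuse the softmax into the attention matmul without a separate full-row pass.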