lenLRX / llm_simple
☆11 · Updated last week
Alternatives and similar repositories for llm_simple
Users interested in llm_simple are comparing it to the libraries listed below.
- ☢️ TensorRT 2023 Hackathon, second round: Llama model inference acceleration based on TensorRT-LLM ☆47 · Updated last year
- Open deep learning compiler stack for CPU, GPU, and specialized accelerators ☆18 · Updated last week
- GPTQ inference TVM kernel ☆38 · Updated last year
- Triton documentation in Simplified Chinese / Triton 中文文档 ☆71 · Updated last month
- Standalone Flash Attention v2 kernel without libtorch dependency ☆108 · Updated 8 months ago
- Decoding Attention: attention specially optimized for MHA, MQA, GQA, and MLA using CUDA cores for the decoding stage of LLM inference. ☆36 · Updated last month
- Flash Attention implemented with CuTe. ☆82 · Updated 5 months ago
- FP8 Flash Attention implemented on the Ada architecture using the CUTLASS library ☆65 · Updated 9 months ago
- Performance of the C++ interfaces of Flash Attention and Flash Attention v2 in large language model (LLM) inference scenarios. ☆36 · Updated 2 months ago
- Tutorials for writing high-performance GPU operators in AI frameworks. ☆130 · Updated last year
- Odysseus: Playground of LLM Sequence Parallelism ☆69 · Updated 11 months ago
- Simplify ONNX models larger than 2 GB ☆56 · Updated 5 months ago
- A summary of systems papers, frameworks, code, and tools for training or serving large models ☆56 · Updated last year
- Efficient, Flexible, and Highly Fault-Tolerant Model Service Management Based on SGLang ☆51 · Updated 6 months ago
- An easy-to-use package for implementing SmoothQuant for LLMs (a smoothing sketch appears after this list) ☆98 · Updated last month
- OneFlow Serving ☆20 · Updated last month
- Quantized Attention on GPU ☆45 · Updated 5 months ago
- An easily extensible framework for understanding and optimizing CUDA operators, intended for learning purposes only ☆15 · Updated 11 months ago
- A simple calculation for LLM MFU (see the MFU sketch after this list). ☆38 · Updated 2 months ago
- A minimal online-softmax notebook explaining Flash Attention (see the online-softmax sketch after this list) ☆10 · Updated 4 months ago
- A Llama model inference framework implemented in CUDA C++ ☆56 · Updated 6 months ago
- Compare different hardware platforms via the roofline model for LLM inference tasks (see the roofline sketch after this list). ☆100 · Updated last year
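
For the SmoothQuant entry above, a minimal sketch of the smoothing step from the SmoothQuant paper, not the listed package's API; the function name `smooth`, the alpha default, and the test tensors are illustrative assumptions.

```python
import numpy as np

def smooth(x: np.ndarray, w: np.ndarray, alpha: float = 0.5):
    """Migrate activation outliers into the weights with a per-channel scale
    s_j = max|X_j|^alpha / max|W_j|^(1-alpha), so (X/s) @ (diag(s) @ W) == X @ W
    while both factors become easier to quantize."""
    s = np.abs(x).max(axis=0) ** alpha / np.abs(w).max(axis=1) ** (1 - alpha)
    return x / s, w * s[:, None]

x = np.random.randn(8, 4) * np.array([1.0, 50.0, 1.0, 1.0])  # channel 1 has outliers
w = np.random.randn(4, 3)
xs, ws = smooth(x, w)
assert np.allclose(xs @ ws, x @ w)  # the matmul result is unchanged
```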
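For the MFU entry above, a minimal sketch of the usual calculation, assuming the common approximation of ~6N FLOPs per generated token for a decoder-only model with N parameters; the throughput and peak numbers below are illustrative placeholders, not measurements.

```python
def mfu(n_params: float, tokens_per_sec: float, peak_flops: float) -> float:
    """MFU = achieved model FLOPs/s divided by the hardware peak FLOPs/s."""
    achieved = 6.0 * n_params * tokens_per_sec  # ~6N FLOPs per token
    return achieved / peak_flops

# Example: a 7B-parameter model at 2,000 tokens/s on a GPU with a
# 312 TFLOPS dense BF16 peak (A100-class hardware).
print(f"MFU = {mfu(7e9, 2000, 312e12):.2%}")  # -> MFU = 26.92%
```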
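For the online-softmax entry above, a minimal NumPy sketch of the one-pass recurrence that Flash Attention builds on: keep a running maximum and rescale the running denominator whenever the maximum changes.

```python
import numpy as np

def online_softmax(x: np.ndarray) -> np.ndarray:
    """Single-pass softmax over a 1-D array using the online recurrence."""
    m = -np.inf  # running maximum
    d = 0.0      # running sum of exp(x_i - m)
    for xi in x:
        m_new = max(m, xi)
        d = d * np.exp(m - m_new) + np.exp(xi - m_new)  # rescale old sum
        m = m_new
    return np.exp(x - m) / d

x = np.random.randn(16)
ref = np.exp(x - x.max()) / np.exp(x - x.max()).sum()
assert np.allclose(online_softmax(x), ref)
```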
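For the roofline entry above, a minimal sketch of the standard roofline formula, attainable FLOP/s = min(peak compute, memory bandwidth × arithmetic intensity); the platform names and hardware numbers are illustrative placeholders.

```python
def roofline(peak_flops: float, bandwidth: float, intensity: float) -> float:
    """Attainable FLOP/s at a given arithmetic intensity (FLOPs per byte)."""
    return min(peak_flops, bandwidth * intensity)

# LLM decoding is memory-bound: with fp16 weights each 2-byte weight
# contributes roughly 2 FLOPs, i.e. about 1 FLOP/byte of intensity.
platforms = {"gpu_a": (312e12, 2.0e12), "gpu_b": (60e12, 0.9e12)}  # (FLOP/s, B/s)
for name, (flops, bw) in platforms.items():
    attainable = roofline(flops, bw, intensity=1.0)
    print(f"{name}: {attainable / 1e12:.1f} TFLOP/s attainable at decode")
```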