L1aoXingyu / llm-infer-bench
☆11 · Updated last year
Alternatives and similar repositories for llm-infer-bench:
Users interested in llm-infer-bench are comparing it to the libraries listed below.
- Odysseus: Playground of LLM Sequence Parallelism ☆64 · Updated 7 months ago
- ☆52 · Updated last week
- ☆13 · Updated 9 months ago
- IntLLaMA: A fast and light quantization solution for LLaMA ☆18 · Updated last year
- Summary of system papers/frameworks/codes/tools on training or serving large models ☆56 · Updated last year
- [ACL 2024] RelayAttention for Efficient Large Language Model Serving with Long System Prompts ☆38 · Updated 10 months ago
- GPTQ inference TVM kernel ☆38 · Updated 8 months ago
- ☆31 · Updated 7 months ago
- [ICLR 2024] This is the official PyTorch implementation of "QLLM: Accurate and Efficient Low-Bitwidth Quantization for Large Language Mod… ☆35 · Updated 10 months ago
- ☆59 · Updated last month
- PyTorch bindings for CUTLASS grouped GEMM ☆58 · Updated 2 months ago
- Implementation of IceFormer: Accelerated Inference with Long-Sequence Transformers on CPUs (ICLR 2024) ☆22 · Updated 7 months ago
- Decoding Attention is specially optimized for multi-head attention (MHA) using CUDA cores for the decoding stage of LLM inference ☆27 · Updated 2 months ago
- TensorRT LLM Benchmark Configuration ☆12 · Updated 5 months ago
- An object detection codebase based on MegEngine ☆28 · Updated 2 years ago
- Tritonbench is a collection of PyTorch custom operators with example inputs to measure their performance ☆75 · Updated this week
- Triton implementation of FlashAttention-2 ☆26 · Updated last year
- Quantized Attention on GPU ☆34 · Updated last month
- A toolkit for developers to simplify the transformation of nn.Module instances; it now corresponds to PyTorch's torch.fx ☆13 · Updated last year
- OneFlow Serving ☆20 · Updated 3 weeks ago
- TVMScript kernel for deformable attention ☆24 · Updated 3 years ago
- An external memory allocator example for PyTorch ☆14 · Updated 3 years ago
- SQUEEZED ATTENTION: Accelerating Long Prompt LLM Inference ☆36 · Updated last month
- ☆72 · Updated 5 months ago
- Official implementation of the EMNLP 2023 paper: Outlier Suppression+: Accurate quantization of large language models by equivalent and opti… ☆45 · Updated last year
- ☆27 · Updated last month
- AFPQ code implementation ☆19 · Updated last year