L1aoXingyu / llm-infer-bench
☆11 · Updated last year
Alternatives and similar repositories for llm-infer-bench:
Users interested in llm-infer-bench are comparing it to the libraries listed below.
- Odysseus: Playground of LLM Sequence Parallelism ☆66 · Updated 9 months ago
- IntLLaMA: A fast and light quantization solution for LLaMA ☆18 · Updated last year
- Quantized Attention on GPU ☆45 · Updated 3 months ago
- Summary of system papers/frameworks/codes/tools on training or serving large models ☆56 · Updated last year
- ☆64 · Updated 3 months ago
- TVMScript kernel for deformable attention ☆25 · Updated 3 years ago
- [ICLR 2024] Official PyTorch implementation of "QLLM: Accurate and Efficient Low-Bitwidth Quantization for Large Language Mod… ☆26 · Updated last year
- GPTQ inference TVM kernel ☆39 · Updated 10 months ago
- Implementation of IceFormer: Accelerated Inference with Long-Sequence Transformers on CPUs (ICLR 2024) ☆22 · Updated 9 months ago
- Decoding Attention is specifically optimized for MHA, MQA, GQA, and MLA using CUDA cores for the decoding stage of LLM inference ☆35 · Updated last week
- Efficient Mixture of Experts for LLM Paper List ☆45 · Updated 3 months ago
- An object detection codebase based on MegEngine ☆28 · Updated 2 years ago
- An external memory allocator example for PyTorch ☆14 · Updated 3 years ago
- OneFlow Serving ☆20 · Updated 2 months ago
- ☆30 · Updated 9 months ago
- ☆15 · Updated 11 months ago
- AFPQ code implementation ☆20 · Updated last year
- PyTorch bindings for CUTLASS grouped GEMM ☆73 · Updated 4 months ago
- A standalone GEMM kernel for fp16 activation and quantized weight, extracted from FasterTransformer ☆89 · Updated 3 weeks ago
- Official implementation of the ICML 2024 paper "ExCP: Extreme LLM Checkpoint Compression via Weight-Momentum Joint Shrinking" ☆46 · Updated 8 months ago
- Official implementation of the EMNLP23 paper: Outlier Suppression+: Accurate quantization of large language models by equivalent and opti… ☆47 · Updated last year
- Multiple GEMM operators constructed with CUTLASS to support LLM inference ☆17 · Updated 5 months ago
- ☆64 · Updated last month
- Benchmark tests supporting the TiledCUDA library ☆15 · Updated 4 months ago
- Depicts the GPU memory footprint during PyTorch DNN training ☆11 · Updated 2 years ago
- ☆27 · Updated 11 months ago
- A high-throughput and memory-efficient inference and serving engine for LLMs ☆16 · Updated 9 months ago