L1aoXingyu / llm-infer-bench
☆11 · Updated last year
Alternatives and similar repositories for llm-infer-bench:
Users interested in llm-infer-bench are comparing it to the libraries listed below.
- Odysseus: Playground of LLM Sequence Parallelism ☆64 · Updated 7 months ago
- ☆65 · Updated last week
- IntLLaMA: A fast and light quantization solution for LLaMA ☆18 · Updated last year
- [ICLR 2024] This is the official PyTorch implementation of "QLLM: Accurate and Efficient Low-Bitwidth Quantization for Large Language Models" ☆35 · Updated 11 months ago
- [ACL 2024] RelayAttention for Efficient Large Language Model Serving with Long System Prompts ☆38 · Updated 11 months ago
- Summary of system papers/frameworks/codes/tools on training or serving large models ☆56 · Updated last year
- PyTorch bindings for CUTLASS grouped GEMM. ☆64 · Updated 3 months ago
- GPTQ inference TVM kernel ☆38 · Updated 9 months ago
- [AAAI 2024] Fluctuation-based Adaptive Structured Pruning for Large Language Models ☆43 · Updated last year
- SQUEEZED ATTENTION: Accelerating Long Prompt LLM Inference ☆40 · Updated 2 months ago
- ☆14 · Updated 10 months ago
- ☆30 · Updated 8 months ago
- Quantized Attention on GPU ☆34 · Updated 2 months ago
- Tritonbench is a collection of PyTorch custom operators with example inputs to measure their performance. ☆86 · Updated this week
- Implementation of IceFormer: Accelerated Inference with Long-Sequence Transformers on CPUs (ICLR 2024). ☆22 · Updated 8 months ago
- ☆62 · Updated 2 months ago
- Official implementation of ICML 2024 paper "ExCP: Extreme LLM Checkpoint Compression via Weight-Momentum Joint Shrinking". ☆46 · Updated 7 months ago
- Decoding Attention is specially optimized for multi-head attention (MHA) using CUDA cores for the decoding stage of LLM inference (a minimal reference sketch of decode-stage attention follows this list). ☆29 · Updated 3 months ago
- An object detection codebase based on MegEngine. ☆28 · Updated 2 years ago
- AFPQ code implementation ☆20 · Updated last year
- TVMScript kernel for deformable attention ☆24 · Updated 3 years ago
- ☆60 · Updated 3 weeks ago
- LLM Inference with Microscaling Format ☆17 · Updated 3 months ago
- A high-throughput and memory-efficient inference and serving engine for LLMs ☆16 · Updated 8 months ago
- [NeurIPS 2024] Search for Efficient LLMs ☆12 · Updated 3 weeks ago
- An external memory allocator example for PyTorch. ☆14 · Updated 3 years ago
- ☆81 · Updated 5 months ago
- [ICML 2023] SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models (see the smoothing sketch after this list) ☆20 · Updated 11 months ago
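
For reference alongside the Decoding Attention entry above, here is a minimal PyTorch sketch of what decode-stage attention computes: a single new-token query attended against the cached keys and values. The function name and shapes are illustrative assumptions, not the repository's API.

```python
import torch

def decode_step_attention(q, k_cache, v_cache):
    """Attention for one generated token against the KV cache.

    q:        [batch, heads, 1, head_dim]   query for the new token
    k_cache:  [batch, heads, seq, head_dim] cached keys
    v_cache:  [batch, heads, seq, head_dim] cached values
    """
    scale = q.shape[-1] ** -0.5
    scores = (q @ k_cache.transpose(-1, -2)) * scale  # [batch, heads, 1, seq]
    probs = scores.softmax(dim=-1)
    return probs @ v_cache                            # [batch, heads, 1, head_dim]
```

Because the query length is 1 at decode time, the workload is memory-bound on the KV cache, which is why dedicated decode kernels differ from prefill attention kernels.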
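For the SmoothQuant entry, the core idea of the ICML 2023 paper is to migrate quantization difficulty from activations (which have outlier channels) to weights, using per-input-channel smoothing factors s_j = max|X_j|^α / max|W_j|^(1-α). Below is a minimal sketch assuming a `torch.nn.Linear` layer and a batch of calibration activations; the helper name `smooth_linear` is hypothetical.

```python
import torch

@torch.no_grad()
def smooth_linear(linear: torch.nn.Linear, calib_acts: torch.Tensor, alpha: float = 0.5):
    """Fold SmoothQuant-style smoothing into a Linear layer.

    calib_acts: [num_tokens, in_features] calibration activations.
    Returns per-channel factors s; at runtime the layer input must be
    divided by s, which preserves the output: (x / s) @ (W * s).T == x @ W.T
    """
    act_max = calib_acts.abs().amax(dim=0).clamp(min=1e-5)      # per input channel
    w_max = linear.weight.abs().amax(dim=0).clamp(min=1e-5)     # per input channel
    s = act_max.pow(alpha) / w_max.pow(1.0 - alpha)
    linear.weight.mul_(s)  # weights absorb the quantization difficulty
    return s
```

In the paper, the division by s is folded into the preceding LayerNorm or linear layer, so smoothing adds no runtime cost; both sides then quantize more easily to INT8.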