L1aoXingyu / llm-infer-bench
☆11 · Updated last year
Alternatives and similar repositories for llm-infer-bench:
Users who are interested in llm-infer-bench are comparing it to the repositories listed below.
- Odysseus: Playground of LLM Sequence Parallelism ☆66 · Updated 8 months ago
- IntLLaMA: A fast and light quantization solution for LLaMA ☆18 · Updated last year
- GPTQ inference TVM kernel ☆39 · Updated 10 months ago
- Quantized Attention on GPU ☆45 · Updated 3 months ago
- [ICLR 2024] This is the official PyTorch implementation of "QLLM: Accurate and Efficient Low-Bitwidth Quantization for Large Language Mod… ☆26 · Updated last year
- ☆63 · Updated 3 months ago
- Implementation of IceFormer: Accelerated Inference with Long-Sequence Transformers on CPUs (ICLR 2024). ☆22 · Updated 8 months ago
- An object detection codebase based on MegEngine. ☆28 · Updated 2 years ago
- [ACL 2024] RelayAttention for Efficient Large Language Model Serving with Long System Prompts ☆39 · Updated last year
- Summary of system papers/frameworks/codes/tools on training or serving large model ☆56 · Updated last year
- ☆27 · Updated 11 months ago
- [ICLR 2024] This is the official PyTorch implementation of "QLLM: Accurate and Efficient Low-Bitwidth Quantization for Large Language Mod… ☆35 · Updated last year
- TVMScript kernel for deformable attention ☆25 · Updated 3 years ago
- An external memory allocator example for PyTorch. ☆14 · Updated 3 years ago
- Efficient Mixture of Experts for LLM Paper List ☆41 · Updated 3 months ago
- ☆15 · Updated 11 months ago
- [AAAI 2024] Fluctuation-based Adaptive Structured Pruning for Large Language Models ☆44 · Updated last year
- A MoE impl for PyTorch, [ATC'23] SmartMoE ☆61 · Updated last year
- ☆30 · Updated 9 months ago
- Distributed DataLoader For Pytorch Based On Ray ☆24 · Updated 3 years ago
- Code for ICML 2021 submission ☆35 · Updated 3 years ago
- [NeurIPS 2024] Search for Efficient LLMs ☆12 · Updated last month
- ☆20 · Updated 2 years ago
- Decoding Attention is specially optimized for MHA, MQA, GQA and MLA using CUDA core for the decoding stage of LLM inference. ☆34 · Updated this week
- Official implementation of the EMNLP23 paper: Outlier Suppression+: Accurate quantization of large language models by equivalent and opti… ☆47 · Updated last year
- study of cutlass ☆21 · Updated 4 months ago
- ☆87 · Updated 6 months ago