L1aoXingyu / llm-infer-bench
☆11Updated last year
Related projects:
- Odysseus: Playground of LLM Sequence Parallelism☆50Updated 3 months ago
- ☆13Updated 5 months ago
- ☆16Updated this week
- Summary of system papers/frameworks/codes/tools on training or serving large model☆56Updated 9 months ago
- IntLLaMA: A fast and light quantization solution for LLaMA☆18Updated last year
- [ICLR 2024] This is the official PyTorch implementation of "QLLM: Accurate and Efficient Low-Bitwidth Quantization for Large Language Mod…☆19Updated 6 months ago
- GPTQ inference TVM kernel☆35Updated 4 months ago
- [ICLR 2024] This is the official PyTorch implementation of "QLLM: Accurate and Efficient Low-Bitwidth Quantization for Large Language Mod…☆33Updated 6 months ago
- ☆52Updated this week
- ☆23Updated this week
- ☆29Updated 4 months ago
- ☆28Updated 3 months ago
- Official implementation of ICML 2024 paper "ExCP: Extreme LLM Checkpoint Compression via Weight-Momentum Joint Shrinking".☆37Updated 2 months ago
- [AAAI 2024] Fluctuation-based Adaptive Structured Pruning for Large Language Models☆34Updated 8 months ago
- OneFlow Serving☆20Updated 7 months ago
- [ICML 2023] SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models☆18Updated 6 months ago
- Decoding Attention is specially optimized for multi head attention (MHA) using CUDA core for the decoding stage of LLM inference.☆14Updated this week
- A toolkit for developers to simplify the transformation of nn.Module instances. It's now corresponding to Pytorch.fx.☆13Updated last year
- TVMScript kernel for deformable attention☆24Updated 2 years ago
- Implementation of IceFormer: Accelerated Inference with Long-Sequence Transformers on CPUs (ICLR 2024).☆18Updated 3 months ago
- ☆13Updated last year
- ☆67Updated last week
- FP8 flash attention implemented on the Ada architecture using the cutlass repository☆46Updated last month
- ☆23Updated 9 months ago
- [ACL 2024] A novel QAT with Self-Distillation framework to enhance ultra low-bit LLMs.☆69Updated 4 months ago
- study of cutlass☆18Updated last year
- [CVPR-2023] Towards Any Structural Pruning☆17Updated last year
- Distributed DataLoader For Pytorch Based On Ray☆24Updated 2 years ago
- An object detection codebase based on MegEngine.☆28Updated last year
- An external memory allocator example for PyTorch.☆13Updated 2 years ago