xfhelen / MMBench
An end-to-end benchmark suite of multi-modal DNN applications for system-architecture co-design
☆22 · Updated last month
Alternatives and similar repositories for MMBench:
Users interested in MMBench are comparing it to the libraries listed below.
- ☆11 · Updated 7 months ago
- ☆19 · Updated last month
- The official PyTorch implementation of the NeurIPS 2022 (spotlight) paper, Outlier Suppression: Pushing the Limit of Low-bit Transformer Language Models ☆47 · Updated 2 years ago
- [NeurIPS 2023] ShiftAddViT: Mixture of Multiplication Primitives Towards Efficient Vision Transformer ☆32 · Updated last year
- This repo contains the source code for: Model Tells You What to Discard: Adaptive KV Cache Compression for LLMs ☆32 · Updated 5 months ago
- A curated list of early-exiting papers (LLM, CV, NLP, etc.) ☆38 · Updated 5 months ago
- Official Repo for SparseLLM: Global Pruning of LLMs (NeurIPS 2024) ☆52 · Updated last month
- Code repository of "Evaluating Quantized Large Language Models" ☆114 · Updated 4 months ago
- [AAAI 2024] Fluctuation-based Adaptive Structured Pruning for Large Language Models ☆42 · Updated last year
- LLM Inference with Microscaling Format ☆17 · Updated 2 months ago
- [DAC 2024] EDGE-LLM: Enabling Efficient Large Language Model Adaptation on Edge Devices via Layerwise Unified Compression and Adaptive Layer Tuning and Voting ☆40 · Updated 7 months ago
- PyTorch implementation of our paper accepted by ICML 2024 -- CaM: Cache Merging for Memory-efficient LLMs Inference ☆31 · Updated 7 months ago
- ☆48 · Updated 9 months ago
- [ICLR 2024] This is the official PyTorch implementation of "QLLM: Accurate and Efficient Low-Bitwidth Quantization for Large Language Models" ☆35 · Updated 10 months ago
- The code repository of "MBQ: Modality-Balanced Quantization for Large Vision-Language Models"☆28Updated last month
- Code for ICML 2021 submission☆35Updated 3 years ago
- ☆18Updated 2 months ago
- Curated collection of papers on MoE model inference ☆41 · Updated last week
- Code for the NeurIPS 2022 paper "Optimal Brain Compression: A Framework for Accurate Post-Training Quantization and Pruning" ☆109 · Updated last year
- The official implementation of the DAC 2024 paper GQA-LUT ☆11 · Updated last month
- ☆41 · Updated 2 years ago
- ☆49 · Updated last year
- Quantized Side Tuning: Fast and Memory-Efficient Tuning of Quantized Large Language Models ☆41 · Updated 2 months ago
- ☆90 · Updated last year
- Codebase for ICML'24 paper: Learning from Students: Applying t-Distributions to Explore Accurate and Efficient Formats for LLMs ☆24 · Updated 7 months ago
- [ICML 2023] This project is the official implementation of our accepted ICML 2023 paper BiBench: Benchmarking and Analyzing Network Binarization ☆54 · Updated 10 months ago
- SeerAttention: Learning Intrinsic Sparse Attention in Your LLMs ☆23 · Updated 2 months ago
- Official implementation of the EMNLP 2023 paper: Outlier Suppression+: Accurate quantization of large language models by equivalent and optimal shifting and scaling ☆45 · Updated last year
- Efficient Expert Pruning for Sparse Mixture-of-Experts Language Models: Enhancing Performance and Reducing Inference Costs ☆15 · Updated last month
- [ICLR 2025] TidalDecode: Fast and Accurate LLM Decoding with Position Persistent Sparse Attention ☆26 · Updated last month