aiha-lab / TernGEMM
TernGEMM: General Matrix Multiply Library with Ternary Weights for Fast DNN Inference
☆14 · Updated 3 years ago
Alternatives and similar repositories for TernGEMM
Users interested in TernGEMM are comparing it to the libraries listed below.
- ☆83 · Updated last year
- Tender: Accelerating Large Language Models via Tensor Decomposition and Runtime Requantization (ISCA'24) ☆25 · Updated last year
- ☆35 · Updated last month
- This repository contains the training code of ParetoQ, introduced in our work "ParetoQ: Scaling Laws in Extremely Low-bit LLM Quantization" ☆118 · Updated 3 months ago
- [ICML'21 Oral] I-BERT: Integer-only BERT Quantization ☆265 · Updated 3 years ago
- PyTorch emulation library for Microscaling (MX)-compatible data formats ☆340 · Updated 7 months ago
- Official implementation of the EMNLP'23 paper "Revisiting Block-based Quantisation: What is Important for Sub-8-bit LLM Inference?" ☆24 · Updated 2 years ago
- Code for the AAAI 2024 Oral paper "OWQ: Outlier-Aware Weight Quantization for Efficient Fine-Tuning and Inference of Large Language Model…" ☆68 · Updated last year
- ☆44 · Updated 2 years ago
- Fast Hadamard transform in CUDA, with a PyTorch interface ☆281 · Updated 3 months ago
- ☆208 · Updated 4 years ago
- LLM Inference with Microscaling Format ☆34 · Updated last year
- ☆61 · Updated last year
- ☆113 · Updated 2 years ago
- Torch2Chip (MLSys, 2024) ☆55 · Updated 10 months ago
- The official PyTorch implementation of the NeurIPS 2022 (spotlight) paper "Outlier Suppression: Pushing the Limit of Low-bit Transformer L…" ☆49 · Updated 3 years ago
- ☆112 · Updated 3 weeks ago
- ☆21 · Updated 2 years ago
- [HPCA'21] SpAtten: Efficient Sparse Attention Architecture with Cascade Token and Head Pruning ☆122 · Updated last year
- Official implementation of the ICML'24 paper "LQER: Low-Rank Quantization Error Reconstruction for LLMs" ☆19 · Updated last year
- ☆15 · Updated 3 years ago
- [ICLR2025] OSTQuant: Refining Large Language Model Quantization with Orthogonal and Scaling Transformations for Better Distribution Fitt… ☆88 · Updated 10 months ago
- ☆169 · Updated 2 years ago
- This repository contains integer operators on GPUs for PyTorch. ☆237 · Updated 2 years ago
- [NeurIPS'23] Speculative Decoding with Big Little Decoder ☆96 · Updated 2 years ago
- Code repo for the paper "SpinQuant: LLM quantization with learned rotations" ☆372 · Updated 11 months ago
- ☆85 · Updated last year
- QUICK: Quantization-aware Interleaving and Conflict-free Kernel for efficient LLM inference ☆120 · Updated last year
- The official implementation of the EMNLP 2023 paper LLM-FP4 ☆220 · Updated 2 years ago
- Code for the NeurIPS'24 paper QuaRot, an end-to-end 4-bit inference method for large language models. ☆482 · Updated last year