aiha-lab / TernGEMMLinks

TernGEMM: General Matrix Multiply Library with Ternary Weights for Fast DNN Inference

☆13

Alternatives and similar repositories for TernGEMM

Users that are interested in TernGEMM are comparing it to the libraries listed below

Sorting:

kssteven418 / Q-ASR
[ICASSP'22] Integer-only Zero-shot Quantization for Efficient Speech Recognition
☆33Updated 3 years ago
vineeths96 / Compressed-Transformers
In this repository, we explore model compression for transformer architectures via quantization. We specifically explore quantization awa…
☆24Updated 4 years ago
lutnn / blink-mm
☆15Updated last year
kaiqi123 / SQAKD
☆11Updated last year
shawnricecake / squant
[ICCAD 2025] Squant
☆14Updated 2 weeks ago
mlcommons / tiny_results_v0.7
This repository contains the results and code for the MLPerf™ Tiny Inference v0.7 benchmark.
☆18Updated 2 years ago
IntelLabs / FP8-Emulation-Toolkit
PyTorch extension for emulating FP8 data formats on standard FP32 Xeon/GPU hardware.
☆110Updated 7 months ago
kssteven418 / I-BERT
[ICML'21 Oral] I-BERT: Integer-only BERT Quantization
☆253Updated 2 years ago
jundaf2 / INT8-Flash-Attention-FMHA-Quantization
☆157Updated last year
Qualcomm-AI-research / transformer-quantization
☆206Updated 3 years ago
ECoLab-POSTECH / NIPQ
☆16Updated 2 years ago
maltanar / gemmbitserial
Fast matrix multiplication for few-bit integer matrices on CPUs.
☆28Updated 6 years ago
marsupialtail / gpu-sparsert
☆18Updated 4 years ago
minkjung / blankcollapse
☆10Updated 2 years ago
ictnlp / MonoAttn-Transducer
Code for ICML25 Paper "Overcoming Non-monotonicity in Transducer-based Streaming Generation"
☆11Updated last month
aiha-lab / Attention-Head-Pruning
Layer-wise Pruning of Transformer Heads for Efficient Language Modeling
☆21Updated 3 years ago
naver-aics / lut-gemm
☆61Updated last year
INT-FlashAttention2024 / INT-FlashAttention
☆77Updated 5 months ago
ag1988 / mel-asr
The accompanying code for "Exploring the limits of decoder-only models trained on public speech recognition corpora" (Ankit Gupta, George…
☆19Updated 9 months ago
jaewoosong / pocketnn
The official, proof-of-concept C++ implementation of PocketNN.
☆34Updated last year
fmfi-compbio / admm-pruning
☆28Updated 11 months ago
wimh966 / outlier_suppression
The official PyTorch implementation of the NeurIPS2022 (spotlight) paper, Outlier Suppression: Pushing the Limit of Low-bit Transformer L…
☆47Updated 2 years ago
sIncerass / QBERT
☆15Updated 2 years ago
kssteven418 / BigLittleDecoder
[NeurIPS'23] Speculative Decoding with Big Little Decoder
☆93Updated last year
xuchennlp / S2T
The project for speech translation
☆11Updated last year
AppleHolic / FastSpeech2
Refactored version of https://github.com/ming024/FastSpeech2
☆14Updated 3 years ago
NVIDIA / online-softmax
Benchmark code for the "Online normalizer calculation for softmax" paper
☆95Updated 6 years ago
htqin / BiFSMNv2
Pytorch implementation of BiFSMNv2, TNNLS 2023
☆31Updated 2 years ago
IST-DASLab / Sparse-Marlin
Boosting 4-bit inference kernels with 2:4 Sparsity
☆80Updated 10 months ago
kssteven418 / SqueezeLLM-gradients
☆20Updated last year