megvii-research / IntLLaMA

IntLLaMA: A fast and light quantization solution for LLaMA

☆18

Alternatives and similar repositories for IntLLaMA

Users who are interested in IntLLaMA are comparing it to the libraries listed below.

megvii-research / basedet
An object detection codebase based on MegEngine.
☆28 Updated 2 years ago
octoml / deformable-attention-kernel
TVMScript kernel for deformable attention
☆25 Updated 3 years ago
BBuf / flash-rwkv
☆32 Updated last year
feifeibear / ChituAttention
Quantized Attention on GPU
☆44 Updated 11 months ago
RangiLyu / llama.mmengine
Training LLaMA language model with MMEngine! It supports LoRA fine-tuning!
☆41 Updated 2 years ago
mit-han-lab / patch_conv
Patch convolution to avoid large GPU memory usage of Conv2D
☆92 Updated 9 months ago
NVlabs / SMCP
☆22 Updated 3 years ago
L1aoXingyu / llm-infer-bench
☆12 Updated 2 years ago
yuzhenmao / IceFormer
Implementation of IceFormer: Accelerated Inference with Long-Sequence Transformers on CPUs (ICLR 2024).
☆25 Updated 3 months ago
HubHop / vit-attention-benchmark
Benchmarking Attention Mechanism in Vision Transformers.
☆18 Updated 3 years ago
facebookresearch / DepthShrinker
[ICML 2022] "DepthShrinker: A New Compression Paradigm Towards Boosting Real-Hardware Efficiency of Compact Neural Networks", by Yonggan …
☆72 Updated 3 years ago
ThisisBillhe / torch_quantizer
torch_quantizer is an out-of-the-box quantization tool for PyTorch models on the CUDA backend, specially optimized for diffusion models.
☆22 Updated last year
HuangOwen / QAT-ACS
[TMLR] Official PyTorch implementation of paper "Efficient Quantization-aware Training with Adaptive Coreset Selection"
☆34 Updated last year
feifeibear / Odysseus-Transformer
Odysseus: Playground of LLM Sequence Parallelism
☆78 Updated last year
ModelTC / QLLM
[ICLR 2024] This is the official PyTorch implementation of "QLLM: Accurate and Efficient Low-Bitwidth Quantization for Large Language Mod…
☆39 Updated last year
Oneflow-Inc / OneFlow-Pruning
[CVPR-2023] Towards Any Structural Pruning
☆16 Updated 2 years ago
ofsoundof / random_channel_pruning
☆17 Updated 3 years ago
ziplab / QLLM
[ICLR 2024] This is the official PyTorch implementation of "QLLM: Accurate and Efficient Low-Bitwidth Quantization for Large Language Mod…
☆30 Updated last year
NVlabs / EfficientDL
☆34 Updated 4 months ago
thuml / learn_torch.compile
torch.compile artifacts for common deep learning models, usable as a learning resource for torch.compile
☆18 Updated last year
ilur98 / DGQ
Official Code For Dual Grained Quantization: Efficient Fine-Grained Quantization for LLM
☆14 Updated last year
facebookresearch / Ternary_Binary_Transformer
ACL 2023
☆39 Updated 2 years ago
xdit-project / DiTCacheAnalysis
An auxiliary project analyzing the characteristics of KV in DiT Attention.
☆32 Updated 10 months ago
sgl-project / DeepGEMM
DeepGEMM: clean and efficient FP8 GEMM kernels with fine-grained scaling
☆21 Updated this week
NoakLiu / FastCache-xDiT
FastCache: Fast Caching for Diffusion Transformer Through Learnable Linear Approximation [Efficient ML Model]
☆43 Updated last month
GATECH-EIC / SuperTickets
[ECCV 2022] SuperTickets: Drawing Task-Agnostic Lottery Tickets from Supernets via Jointly Architecture Searching and Parameter Pruning
☆20 Updated 3 years ago
IST-DASLab / gemm-fp8
High Performance FP8 GEMM Kernels for SM89 and later GPUs.
☆20 Updated 9 months ago
microsoft / AttentionEngine
☆101 Updated 5 months ago
hikvision-research / Unified-Normalization
Unified Normalization (ACM MM'22) By Qiming Yang, Kai Zhang, Chaoxiang Lan, Zhi Yang, Zheyang Li, Wenming Tan, Jun Xiao, and Shiliang P…
☆34 Updated 2 years ago
Doraemonzzz / xmixers
Xmixers: A collection of SOTA efficient token/channel mixers
☆29 Updated last month