qwopqwop200 / AutoQuant
☆9 · Updated last year
Alternatives and similar repositories for AutoQuant:
Users interested in AutoQuant are comparing it to the libraries listed below.
- TensorRT LLM Benchmark Configuration ☆12 · Updated 5 months ago
- Implementation of IceFormer: Accelerated Inference with Long-Sequence Transformers on CPUs (ICLR 2024) ☆22 · Updated 7 months ago
- Various test models in WNNX format; they can be viewed with `pip install wnetron && wnetron` ☆12 · Updated 2 years ago
- ☆45 · Updated last year
- Make Triton easier ☆42 · Updated 7 months ago
- IntLLaMA: A fast and light quantization solution for LLaMA ☆18 · Updated last year
- Code for the paper "Accessing higher dimensions for unsupervised word translation" ☆21 · Updated last year
- Repository for CPU Kernel Generation for LLM Inference ☆25 · Updated last year
- [ACL 2024] RelayAttention for Efficient Large Language Model Serving with Long System Prompts ☆38 · Updated 10 months ago
- ONNX Command-Line Toolbox ☆35 · Updated 3 months ago
- ☆21 · Updated last week
- PyTorch implementation of the paper "MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training" ☆23 · Updated this week
- NASRec: Weight Sharing Neural Architecture Search for Recommender Systems ☆29 · Updated last year
- ☆57 · Updated 7 months ago
- 📚 [WIP] FFPA: Yet another Faster Flash Prefill Attention with O(1) ⚡️ GPU SRAM complexity for headdim > 256, 1.8x~3x↑ 🎉 faster vs SDPA EA ☆49 · Updated this week
- FlexAttention w/ FlashAttention3 support ☆27 · Updated 3 months ago
- Open deep learning compiler stack for CPU, GPU, and specialized accelerators ☆17 · Updated 2 weeks ago
- ☆62 · Updated last month
- ☆25 · Updated last year
- Odysseus: Playground of LLM Sequence Parallelism ☆64 · Updated 7 months ago
- ACL 2023 ☆38 · Updated last year
- GPTQ inference TVM kernel ☆38 · Updated 8 months ago
- Open-source projects from Pallas Lab ☆20 · Updated 3 years ago
- APPy (Annotated Parallelism for Python) enables users to annotate loops and tensor expressions in Python with compiler directives akin to… ☆21 · Updated 2 weeks ago
- 32 times longer context window than vanilla Transformers and up to 4 times longer than memory-efficient Transformers ☆44 · Updated last year
- Repository for sparse fine-tuning of LLMs via a modified version of the MosaicML llmfoundry ☆40 · Updated last year
- [ICLR 2024] Official PyTorch implementation of "QLLM: Accurate and Efficient Low-Bitwidth Quantization for Large Language Mod… ☆22 · Updated 10 months ago
- Summary of system papers/frameworks/codes/tools on training or serving large models ☆56 · Updated last year
- An object detection codebase based on MegEngine ☆28 · Updated 2 years ago