tpoisonooo / llama.onnx
LLaMa/RWKV ONNX models, quantization, and test cases
☆345 · Updated last year
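The repo packages quantized LLaMa/RWKV ONNX models; as background for the quantization-focused projects listed below, here is a minimal numpy sketch of symmetric per-tensor INT8 weight quantization (the simplest of the schemes these toolkits implement; the helper names are illustrative, not from any of the repos):

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    # Symmetric per-tensor scheme: map [-max|w|, +max|w|] onto [-127, 127].
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    # Recover approximate float weights from the int8 codes.
    return q.astype(np.float32) * scale

w = np.array([0.1, -0.5, 0.25, 1.0], dtype=np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
print(q)      # int8 codes
print(w_hat)  # approximation of w, error bounded by scale / 2
```

Real toolkits such as GPTQ/AWQ refine this with per-channel or per-group scales and error-compensating weight updates, but the quantize/dequantize round trip above is the core idea.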
Related projects:
- Export llama to ONNX. ☆91 · Updated 3 months ago
- llm-export can export LLM models to ONNX. ☆193 · Updated this week
- GPTQ inference Triton kernel. ☆273 · Updated last year
- QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving. ☆399 · Updated 2 weeks ago
- FP16xINT4 LLM inference kernel that achieves near-ideal ~4x speedups up to medium batch sizes of 16-32 tokens. ☆562 · Updated 2 weeks ago
- A general 2-8 bit quantization toolbox with GPTQ/AWQ/HQQ, with easy export to ONNX/ONNX Runtime. ☆141 · Updated 3 weeks ago
- Ascend PyTorch adapter (torch_npu). Mirror of https://gitee.com/ascend/pytorch. ☆226 · Updated this week
- Code repo for the paper "LLM-QAT: Data-Free Quantization Aware Training for Large Language Models". ☆240 · Updated 2 weeks ago
- The official implementation of the EMNLP 2023 paper LLM-FP4. ☆156 · Updated 9 months ago
- A quantization algorithm for LLMs. ☆98 · Updated 2 months ago
- FlashInfer: Kernel Library for LLM Serving. ☆1,143 · Updated this week
- Optimized BERT transformer inference on NVIDIA GPUs. https://arxiv.org/abs/2210.03052. ☆451 · Updated 6 months ago
- The official PyTorch implementation of "LLMC: Benchmarking Large Language Model Quantization with a Versatile Compression Toolkit…". ☆227 · Updated this week
- Universal cross-platform tokenizer bindings for HF and SentencePiece. ☆246 · Updated last month
- The CUDA version of the RWKV language model (https://github.com/BlinkDL/RWKV-LM). ☆208 · Updated 4 months ago
- BitBLAS is a library for mixed-precision matrix multiplications, especially for quantized LLM deployment. ☆342 · Updated this week
- TensorRT Model Optimizer is a unified library of state-of-the-art model optimization techniques such as quantization, sparsity, distillat… ☆439 · Updated this week
- A model compression and acceleration toolbox based on PyTorch. ☆325 · Updated 8 months ago
- Reorder-based post-training quantization for large language models. ☆178 · Updated last year
- A throughput-oriented, high-performance serving framework for LLMs. ☆470 · Updated this week
- [ICLR 2024 spotlight] OmniQuant is a simple and powerful quantization technique for LLMs. ☆672 · Updated last month
- FlagGems is an operator library for large language models implemented in the Triton language. ☆246 · Updated this week
- Simplify >2 GB ONNX models. ☆41 · Updated 6 months ago
- Analyze the inference of large language models (LLMs): computation, storage, transmission, and hardware roofline mod… ☆276 · Updated last week
- The Triton TensorRT-LLM Backend. ☆654 · Updated this week
- Transformer-related optimization, including BERT and GPT. ☆58 · Updated 11 months ago
- Official implementation of EAGLE-1 and EAGLE-2. ☆749 · Updated 3 weeks ago