OscarSavolainen / Quantization-Tutorials

A collection of coding tutorials for my YouTube videos on neural network quantization.

☆16

Alternatives and similar repositories for Quantization-Tutorials

Users who are interested in Quantization-Tutorials are comparing it to the repositories listed below.

wangzyon / trt_learn
TensorRT encapsulation: learn, rewrite, practice.
☆28 Updated 2 years ago
caiwanxianhust / FasterLLaMA
A llama model inference framework implemented in CUDA C++.
☆58 Updated 7 months ago
zjhellofss / KuiperCourse
A course published on Bilibili.
☆76 Updated last year
HuPengsheet / EasyNN
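EasyNN is a neural network inference framework developed for teaching, designed so that anyone can write an inference framework on their own, even with no prior background.
☆29 Updated 9 months ago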
JackonYang / hands-on-tvm
Hands-on model tuning with TVM, profiled on a Mac M1, an x86 CPU, and a GTX-1080 GPU.
☆48 Updated 2 years ago
Tlntin / trt2023
☆26 Updated last year
KarhouTam / cuda-kernels
Some common CUDA kernel implementations (not the fastest).
☆18 Updated 2 months ago
sesmfs / onnx_quant_tool
An ONNX-based quantization tool.
☆71 Updated last year
weishengying / tiny-flash-attention
A simplified flash-attention implemented with cutlass, intended for teaching purposes.
☆41 Updated 10 months ago
zjhellofss / triton_course
☆27 Updated last month
shouxieai / tensorRT_quantization
This code accompanies the following Bilibili video: https://www.bilibili.com/video/BV18L41197Uz/?spm_id_from=333.788&vd_source=eefa4b6e337f16d87d87c2c357db8ca7
☆69 Updated last year
JieRen98 / SGEMM-SASS-Annotation
☆21 Updated 4 years ago
Ranking666 / Base-quantization
Basic quantization methods, including QAT, PTQ, per_channel, per_tensor, dorefa, lsq, adaround, omse, Histogram, bias_correction, etc.
☆46 Updated 2 years ago
raymond1123 / hgemm
☆30 Updated 7 months ago
luchangli03 / onnxsim_large_model
Simplify large (>2GB) ONNX models.
☆58 Updated 6 months ago
wangzyon / pyInfer
Async inference for machine learning models.
☆26 Updated 2 years ago
InfiniTensor / RefactorGraph
A layered, decoupled deep learning inference engine.
☆73 Updated 4 months ago
tsingmicro-toolchain / OnnxSlim
A toolkit to help optimize large ONNX models.
☆157 Updated last year
weishengying / cutlass_flash_atten_fp8
An fp8 flash attention implementation for the Ada architecture, built with the cutlass repository.
☆70 Updated 10 months ago
harleyszhang / lite_llama
A lightweight llama-like LLM inference framework based on Triton kernels.
☆128 Updated last week
AyakaGEMM / Hands-on-GEMM
☆135 Updated last year
YuxueYang1204 / CudaDemo
Implement custom operators in PyTorch with CUDA/C++.
☆63 Updated 2 years ago
OpenPPL / ppl.kernel.cuda
☆36 Updated 8 months ago
leimao / TensorRT-Custom-Plugin-Example
Quick and Self-Contained TensorRT Custom Plugin Implementation and Integration
☆61 Updated 3 weeks ago
weishengying / cute_gemm
☆14 Updated 10 months ago
Bruce-Lee-LY / flash_attention_inference
Performance of the C++ interface of flash attention and flash attention v2 in large language model (LLM) inference scenarios.
☆38 Updated 3 months ago
YangLinzhuo / cuda-sgemm-optimization
CUDA SGEMM optimization notes.
☆13 Updated last year
FeiGeChuanShu / trt2023
NVIDIA TensorRT Hackathon 2023 second-round topic: building and optimizing the Tongyi Qianwen Qwen-7B model with TensorRT-LLM.
☆42 Updated last year
harleyszhang / llm_counts
LLM theoretical performance analysis tools, supporting parameter, FLOPs, memory, and latency analysis.
☆96 Updated last week
Ther-nullptr / circult-eda-mlsys-tinyml-arxiv-daily
🎓 Automatically updates circult-eda-mlsys-tinyml papers daily using GitHub Actions (updates every 8 hours).
☆10 Updated this week