yqhu / profiler-workshop

Example code for profiler workshop

☆33

Alternatives and similar repositories for profiler-workshop

Users who are interested in profiler-workshop are comparing it to the repositories listed below.

RulinShao / LightSeq
Official repository for DistFlashAttn: Distributed Memory-efficient Attention for Long-context LLMs Training
☆210 Updated 9 months ago
efeslab / Atom
[MLSys'24] Atom: Low-bit Quantization for Efficient and Accurate LLM Serving
☆307 Updated 11 months ago
tgale96 / grouped_gemm
PyTorch bindings for CUTLASS grouped GEMM.
☆93 Updated last week
Guangxuan-Xiao / torch-int
This repository contains integer operators on GPUs for PyTorch.
☆205 Updated last year
opengear-project / GEAR
GEAR: An Efficient KV Cache Compression Recipe for Near-Lossless Generative Inference of LLM
☆163 Updated 10 months ago
IST-DASLab / Sparse-Marlin
Boosting 4-bit inference kernels with 2:4 Sparsity
☆76 Updated 9 months ago
pytorch-labs / applied-ai
Applied AI experiments and examples for PyTorch
☆274 Updated last week
stanford-futuredata / stk
☆105 Updated 9 months ago
wangsiping97 / FastGEMV
High-speed GEMV kernels, up to 2.7x speedup compared to the PyTorch baseline.
☆109 Updated 10 months ago
microsoft / SparTA
☆146 Updated 10 months ago
SqueezeAILab / KVQuant
[NeurIPS 2024] KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization
☆357 Updated 9 months ago
mobiusml / gemlite
Fast low-bit matmul kernels in Triton
☆311 Updated this week
fanshiqing / grouped_gemm
PyTorch bindings for CUTLASS grouped GEMM.
☆125 Updated 5 months ago
FasterDecoding / SnapKV
☆252 Updated last year
facebookresearch / LLM-QAT
Code repo for the paper "LLM-QAT: Data-Free Quantization Aware Training for Large Language Models"
☆292 Updated 3 months ago
gpu-mode / triton-index
Cataloging released Triton kernels.
☆229 Updated 4 months ago
mit-han-lab / Quest
[ICML 2024] Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference
☆292 Updated 6 months ago
AlibabaResearch / flash-llm
Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative Model Inference with Unstructured Sparsity
☆211 Updated last year
jundaf2 / INT8-Flash-Attention-FMHA-Quantization
☆157 Updated last year
cli99 / llm-analysis
Latency and Memory Analysis of Transformer Models for Training and Inference
☆424 Updated last month
shawntan / scattermoe
Triton-based implementation of Sparse Mixture of Experts.
☆217 Updated 6 months ago
shadowpa0327 / Palu
[ICLR 2025] Palu: Compressing KV-Cache with Low-Rank Projection
☆116 Updated 3 months ago
pytorch-labs / float8_experimental
This repository contains the experimental PyTorch native float8 training UX
☆223 Updated 10 months ago
ColfaxResearch / cutlass-kernels
☆208 Updated 10 months ago
SqueezeBits / QUICK
QUICK: Quantization-aware Interleaving and Conflict-free Kernel for efficient LLM inference
☆118 Updated last year
FlagOpen / FlagAttention
A collection of memory efficient attention operators implemented in the Triton language.
☆271 Updated last year
jy-yuan / KIVI
[ICML 2024] KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache
☆303 Updated 4 months ago
feifeibear / Odysseus-Transformer
Odysseus: Playground of LLM Sequence Parallelism
☆70 Updated 11 months ago
kssteven418 / BigLittleDecoder
[NeurIPS'23] Speculative Decoding with Big Little Decoder
☆92 Updated last year
pytorch-labs / tritonbench
Tritonbench is a collection of PyTorch custom operators with example inputs to measure their performance.
☆127 Updated this week