tridao / cutlass_quant

☆16

Related projects: ⓘ

Aleph-Alpha / NeurIPS-WANT-submission-efficient-parallelization-layouts
☆23Updated 9 months ago
feifeibear / Odysseus-Transformer
Odysseus: Playground of LLM Sequence Parallelism
☆50Updated 3 months ago
LeiWang1999 / AutoGPTQ.tvm
GPTQ inference TVM kernel
☆35Updated 4 months ago
megvii-research / IntLLaMA
IntLLaMA: A fast and light quantization solution for LLaMA
☆18Updated last year
BBuf / flash-rwkv
☆28Updated 3 months ago
sgl-project / tensorrt-demo
TensorRT LLM Benchmark Configuration
☆10Updated last month
habanero-lab / APPy
APPy (Annotated Parallelism for Python) enables users to annotate loops and tensor expressions in Python with compiler directives akin to…
☆20Updated 5 months ago
tgale96 / grouped_gemm
PyTorch bindings for CUTLASS grouped GEMM.
☆41Updated 3 weeks ago
microsoft / chunk-attention
☆29Updated 4 months ago
LeiWang1999 / Stream-k.tvm
☆14Updated last week
fmfi-compbio / admm-pruning
☆15Updated last month
L1aoXingyu / llm-infer-bench
☆11Updated last year
stanford-futuredata / stk
☆83Updated 3 weeks ago
Raincleared-Song / sparse_gpu_operator
GPU operators for sparse tensor operations
☆27Updated 6 months ago
Harry-Chen / InfMoE
Inference framework for MoE layers based on TensorRT with Python binding
☆41Updated 3 years ago
mit-han-lab / patch_conv
Patch convolution to avoid large GPU memory usage of Conv2D
☆73Updated 3 months ago
Gaffey / ExCP
Official implementation of ICML 2024 paper "ExCP: Extreme LLM Checkpoint Compression via Weight-Momentum Joint Shrinking".
☆37Updated 2 months ago
microsoft / DeepSpeed-Kernels
☆50Updated 3 months ago
DachengLi1 / AMP
(NeurIPS 2022) Automatically finding good model-parallel strategies, especially for complex models and clusters.
☆33Updated last year
xdit-project / DistVAE
A parallelism VAE avoids OOM for high resolution image generation
☆34Updated 2 months ago
triton-lang / kernels
☆27Updated 3 weeks ago
PipeFusion / PipeFusion
A Suite for Parallel Inference of Diffusion Transformers (DiTs) on multi-GPU Clusters
☆25Updated last month
VITA-Group / llm-kick
[ICLR 2024] Jaiswal, A., Gan, Z., Du, X., Zhang, B., Wang, Z., & Yang, Y. Compressing llms: The truth is rarely pure and never simple.
☆15Updated 6 months ago
AlibabaPAI / FLASHNN
☆67Updated last week
zhuzilin / pytorch-malloc
An external memory allocator example for PyTorch.
☆13Updated 2 years ago
NVIDIA / HMM_sample_code
CUDA 12.2 HMM demos
☆16Updated last month
rayleizhu / vllm-ra
[ACL 2024] RelayAttention for Efficient Large Language Model Serving with Long System Prompts
☆29Updated 6 months ago
zyxxmu / cam
Pytorch implementation of our paper accepted by ICML 2024 -- CaM: Cache Merging for Memory-efficient LLMs Inference
☆21Updated 3 months ago
tlc-pack / libflash_attn
Standalone Flash Attention v2 kernel without libtorch dependency
☆93Updated last week
tridao / flash-attention-wheels
☆38Updated 9 months ago