IST-DASLab / torch_cgx
PyTorch distributed backend extension with compression support
☆16 · Updated last month
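For context, gradient compression in PyTorch is commonly wired in through DDP communication hooks. The sketch below uses the stock fp16 hook that ships with PyTorch; it is not torch_cgx's own API (torch_cgx registers itself as a full distributed backend), just a minimal illustration of the same idea.

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.distributed.algorithms.ddp_comm_hooks import default_hooks

# Assumes torchrun (or similar) has set the usual env vars and that
# dist.init_process_group("nccl") has already been called.
model = DDP(torch.nn.Linear(1024, 1024).cuda())

# Stock PyTorch hook: casts each gradient bucket to fp16 before the
# all-reduce and back to fp32 afterwards, halving communication volume.
model.register_comm_hook(state=None, hook=default_hooks.fp16_compress_hook)
```

A backend-level extension like torch_cgx applies the same trade-off inside the communication backend itself rather than at the hook layer.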
Alternatives and similar repositories for torch_cgx:
Users interested in torch_cgx are comparing it to the libraries listed below.
- Fast Hadamard transform in CUDA, with a PyTorch interface (see the Walsh-Hadamard sketch after this list) ☆174 · Updated 11 months ago
- extensible collectives library in triton ☆85 · Updated 3 weeks ago
- Collection of kernels written in Triton language ☆120 · Updated 3 weeks ago
- PyTorch bindings for CUTLASS grouped GEMM. ☆84 · Updated 5 months ago
- ☆40 · Updated 9 months ago
- ☆78 · Updated 5 months ago
- (NeurIPS 2022) Automatically finding good model-parallel strategies, especially for complex models and clusters. ☆38 · Updated 2 years ago
- ☆68 · Updated 3 months ago
- Boosting 4-bit inference kernels with 2:4 Sparsity ☆72 · Updated 7 months ago
- Explore training for quantized models ☆18 · Updated 3 months ago
- Framework to reduce autotune overhead to zero for well known deployments. ☆65 · Updated last week
- ☆103 · Updated 8 months ago
- QJL: 1-Bit Quantized JL transform for KV Cache Quantization with Zero Overhead ☆23 · Updated 3 months ago
- DeeperGEMM: crazy optimized version ☆67 · Updated 3 weeks ago
- ☆59 · Updated 10 months ago
- A bunch of kernels that might make stuff slower 😉 ☆34 · Updated this week
- Fast low-bit matmul kernels in Triton ☆294 · Updated this week
- TileFusion is an experimental C++ macro kernel template library that elevates the abstraction level in CUDA C for tile processing. ☆82 · Updated last week
- 16-fold memory access reduction with nearly no loss ☆91 · Updated last month
- Hydragen: High-Throughput LLM Inference with Shared Prefixes ☆36 · Updated 11 months ago
- GEAR: An Efficient KV Cache Compression Recipe for Near-Lossless Generative Inference of LLM ☆160 · Updated 9 months ago
- [IJCAI2023] An automated parallel training system that combines the advantages from both data and model parallelism. ☆51 · Updated last year
- ☆122 · Updated 2 months ago
- Efficient GPU support for LLM inference with x-bit quantization (e.g., FP6, FP5). ☆248 · Updated 6 months ago
- QUICK: Quantization-aware Interleaving and Conflict-free Kernel for efficient LLM inference ☆117 · Updated last year
- Official PyTorch Implementation of "Outlier Weighed Layerwise Sparsity (OWL): A Missing Secret Sauce for Pruning LLMs to High Sparsity" ☆64 · Updated 10 months ago
- End to End steps for adding custom ops in PyTorch. ☆21 · Updated 4 years ago
- ☆68 · Updated 4 months ago
- ☆141 · Updated 9 months ago
- Sparsity support for PyTorch ☆34 · Updated last month
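For the Hadamard-transform entry above: the transform itself is simple enough to sketch in pure PyTorch. The snippet below is a reference O(n log n) Walsh-Hadamard transform, not the repository's fused CUDA kernel; the function name `hadamard_transform` is illustrative.

```python
import torch

def hadamard_transform(x: torch.Tensor) -> torch.Tensor:
    """Orthonormal fast Walsh-Hadamard transform along the last dimension.

    Reference implementation in O(n log n); the length of the last
    dimension must be a power of two.
    """
    n = x.shape[-1]
    assert n > 0 and n & (n - 1) == 0, "last dim must be a power of two"
    y = x.clone()
    h = 1
    while h < n:
        # Pair up adjacent blocks of length h and apply the 2x2 butterfly
        # (a, b) -> (a + b, a - b) to each pair of blocks.
        y = y.reshape(*x.shape[:-1], n // (2 * h), 2, h)
        a, b = y[..., 0, :], y[..., 1, :]
        y = torch.stack((a + b, a - b), dim=-2)
        h *= 2
    return y.reshape(x.shape) / n**0.5  # scale so the transform is orthogonal
```

With the orthonormal scaling, the transform is its own inverse: `hadamard_transform(hadamard_transform(v))` recovers `v`, which is what makes it useful as a cheap, invertible rotation in quantization and compression pipelines.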