IST-DASLab / torch_cgx
PyTorch distributed backend extension with compression support
☆16 · Updated last month
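For context, gradient compression in PyTorch is commonly wired in through DDP communication hooks. The sketch below uses the stock fp16 hook that ships with PyTorch; it is not torch_cgx's own API (torch_cgx registers itself as a full distributed backend), just a minimal illustration of the same idea.

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.distributed.algorithms.ddp_comm_hooks import default_hooks

# Assumes torchrun (or similar) has set the usual env vars and that
# dist.init_process_group("nccl") has already been called.
model = DDP(torch.nn.Linear(1024, 1024).cuda())

# Stock PyTorch hook: casts each gradient bucket to fp16 before the
# all-reduce and back to fp32 afterwards, halving communication volume.
model.register_comm_hook(state=None, hook=default_hooks.fp16_compress_hook)
```

A backend-level extension like torch_cgx applies the same trade-off inside the communication backend itself rather than at the hook layer.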
Alternatives and similar repositories for torch_cgx:
Users interested in torch_cgx are comparing it to the libraries listed below.
- Fast Hadamard transform in CUDA, with a PyTorch interface (see the Walsh-Hadamard sketch after this list) ☆174 · Updated 11 months ago
- extensible collectives library in triton ☆85 · Updated 3 weeks ago
- Collection of kernels written in Triton language ☆120 · Updated 3 weeks ago
- PyTorch bindings for CUTLASS grouped GEMM. ☆84 · Updated 5 months ago
- ☆40 · Updated 9 months ago
- ☆78 · Updated 5 months ago
- (NeurIPS 2022) Automatically finding good model-parallel strategies, especially for complex models and clusters. ☆38 · Updated 2 years ago
- ☆68 · Updated 3 months ago
- Boosting 4-bit inference kernels with 2:4 Sparsity ☆72 · Updated 7 months ago
- Explore training for quantized models ☆18 · Updated 3 months ago
- Framework to reduce autotune overhead to zero for well known deployments. ☆65 · Updated last week
- ☆103 · Updated 8 months ago
- QJL: 1-Bit Quantized JL transform for KV Cache Quantization with Zero Overhead ☆23 · Updated 3 months ago
- DeeperGEMM: crazy optimized version ☆67 · Updated 3 weeks ago
- ☆59 · Updated 10 months ago
- A bunch of kernels that might make stuff slower 😉 ☆34 · Updated this week
- Fast low-bit matmul kernels in Triton ☆294 · Updated this week
- TileFusion is an experimental C++ macro kernel template library that elevates the abstraction level in CUDA C for tile processing. ☆82 · Updated last week
- 16-fold memory access reduction with nearly no loss ☆91 · Updated last month
- Hydragen: High-Throughput LLM Inference with Shared Prefixes ☆36 · Updated 11 months ago
- GEAR: An Efficient KV Cache Compression Recipe for Near-Lossless Generative Inference of LLM ☆160 · Updated 9 months ago
- [IJCAI2023] An automated parallel training system that combines the advantages from both data and model parallelism. ☆51 · Updated last year
- ☆122 · Updated 2 months ago
- Efficient GPU support for LLM inference with x-bit quantization (e.g., FP6, FP5). ☆248 · Updated 6 months ago
- QUICK: Quantization-aware Interleaving and Conflict-free Kernel for efficient LLM inference ☆117 · Updated last year
- Official PyTorch Implementation of "Outlier Weighed Layerwise Sparsity (OWL): A Missing Secret Sauce for Pruning LLMs to High Sparsity" ☆64 · Updated 10 months ago
- End to End steps for adding custom ops in PyTorch. ☆21 · Updated 4 years ago
- ☆68 · Updated 4 months ago
- ☆141 · Updated 9 months ago
- Sparsity support for PyTorch ☆34 · Updated last month
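For the Hadamard-transform entry above: the transform itself is simple enough to sketch in pure PyTorch. The snippet below is a reference O(n log n) Walsh-Hadamard transform, not the repository's fused CUDA kernel; the function name `hadamard_transform` is illustrative.

```python
import torch

def hadamard_transform(x: torch.Tensor) -> torch.Tensor:
    """Orthonormal fast Walsh-Hadamard transform along the last dimension.

    Reference implementation in O(n log n); the length of the last
    dimension must be a power of two.
    """
    n = x.shape[-1]
    assert n > 0 and n & (n - 1) == 0, "last dim must be a power of two"
    y = x.clone()
    h = 1
    while h < n:
        # Pair up adjacent blocks of length h and apply the 2x2 butterfly
        # (a, b) -> (a + b, a - b) to each pair of blocks.
        y = y.reshape(*x.shape[:-1], n // (2 * h), 2, h)
        a, b = y[..., 0, :], y[..., 1, :]
        y = torch.stack((a + b, a - b), dim=-2)
        h *= 2
    return y.reshape(x.shape) / n**0.5  # scale so the transform is orthogonal
```

With the orthonormal scaling, the transform is its own inverse: `hadamard_transform(hadamard_transform(v))` recovers `v`, which is what makes it useful as a cheap, invertible rotation in quantization and compression pipelines.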