IST-DASLab / torch_cgx
PyTorch distributed backend extension with compression support
☆16 Updated 4 months ago
Alternatives and similar repositories for torch_cgx:
Users who are interested in torch_cgx are comparing it to the libraries listed below.
- [IJCAI2023] An automated parallel training system that combines the advantages from both data and model parallelism. If you have any inte… ☆51 Updated last year
- ☆67 Updated 3 months ago
- Extensible collectives library in Triton ☆83 Updated 4 months ago
- Fast Hadamard transform in CUDA, with a PyTorch interface ☆143 Updated 8 months ago
- Fast low-bit matmul kernels in Triton ☆238 Updated this week
- Boosting 4-bit inference kernels with 2:4 Sparsity ☆64 Updated 5 months ago
- Collection of kernels written in Triton language ☆105 Updated this week
- Fast Matrix Multiplications for Lookup Table-Quantized LLMs ☆229 Updated this week
- ☆26 Updated last year
- ☆44 Updated last month
- A minimal implementation of vLLM. ☆33 Updated 6 months ago
- ☆35 Updated 7 months ago
- Flexible simulator for mixed precision and format simulation of LLMs and vision transformers. ☆47 Updated last year
- Applied AI experiments and examples for PyTorch ☆225 Updated this week
- ShiftAddLLM: Accelerating Pretrained LLMs via Post-Training Multiplication-Less Reparameterization ☆102 Updated 4 months ago
- Cataloging released Triton kernels. ☆168 Updated last month
- ☆23 Updated 3 months ago
- Research and development for optimizing transformers ☆125 Updated 4 years ago
- A safetensors extension to efficiently store sparse quantized tensors on disk ☆77 Updated this week
- Cold Compress is a hackable, lightweight, and open-source toolkit for creating and benchmarking cache compression methods built on top of… ☆117 Updated 6 months ago
- ☆107 Updated last month
- PyTorch bindings for CUTLASS grouped GEMM. ☆64 Updated 3 months ago
- This repository contains the experimental PyTorch native float8 training UX ☆221 Updated 6 months ago
- [ICML 2024] KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache ☆272 Updated last month
- PyTorch extension for emulating FP8 data formats on standard FP32 Xeon/GPU hardware. ☆105 Updated 2 months ago
- Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative Model Inference with Unstructured Sparsity ☆199 Updated last year
- TileFusion is a highly efficient kernel template library designed to elevate the level of abstraction in CUDA C for processing tiles. ☆56 Updated this week
- ☆180 Updated this week
- [MLSys'24] Atom: Low-bit Quantization for Efficient and Accurate LLM Serving ☆295 Updated 7 months ago
- Memory Optimizations for Deep Learning (ICML 2023) ☆62 Updated 11 months ago