ROCm / bitsandbytes
8-bit CUDA functions for PyTorch
☆42 Updated last week
Alternatives and similar repositories for bitsandbytes:
Users interested in bitsandbytes are comparing it to the libraries listed below.
- Fast and memory-efficient exact attention ☆159 Updated this week
- Development repository for the Triton language and compiler ☆108 Updated this week
- AMD-related optimizations for transformer models ☆68 Updated 4 months ago
- A simple Flash Attention v2 implementation with ROCm (RDNA3 GPU, roc wmma), mainly used for stable diffusion (ComfyUI) in Windows ZLUDA en… ☆37 Updated 6 months ago
- ☆109 Updated 2 months ago
- A high-throughput and memory-efficient inference and serving engine for LLMs ☆67 Updated this week
- Efficient GPU support for LLM inference with x-bit quantization (e.g. FP6, FP5) ☆237 Updated 4 months ago
- QuIP quantization ☆51 Updated 11 months ago
- A safetensors extension to efficiently store sparse quantized tensors on disk ☆79 Updated this week
- 8-bit CUDA functions for PyTorch, ported to HIP for use on AMD GPUs ☆48 Updated last year
- GPTQ inference Triton kernel ☆297 Updated last year
- This repository contains the experimental PyTorch-native float8 training UX ☆221 Updated 7 months ago
- Code for the paper "QuIP: 2-Bit Quantization of Large Language Models With Guarantees", adapted for Llama models ☆36 Updated last year
- 8-bit CUDA functions for PyTorch, ROCm compatible ☆39 Updated 11 months ago
- Easy and Efficient Quantization for Transformers ☆192 Updated 3 weeks ago
- Ahead-of-Time (AOT) Triton Math Library ☆54 Updated this week
- Collection of kernels written in the Triton language ☆107 Updated last week
- AITemplate is a Python framework which renders neural networks into high-performance CUDA/HIP C++ code. Specialized for FP16 TensorCore (N… ☆11 Updated 8 months ago
- Boosting 4-bit inference kernels with 2:4 sparsity ☆65 Updated 6 months ago
- Explore training for quantized models ☆16 Updated last month
- High-speed GEMV kernels, with up to 2.7x speedup over the PyTorch baseline ☆100 Updated 7 months ago
- Fast Hadamard transform in CUDA, with a PyTorch interface ☆150 Updated 9 months ago
- Efficient 3-bit/4-bit quantization of LLaMA models ☆19 Updated last year
- Applied AI experiments and examples for PyTorch ☆232 Updated this week
- A high-throughput and memory-efficient inference and serving engine for LLMs ☆260 Updated 4 months ago
- vLLM: a high-throughput and memory-efficient inference and serving engine for LLMs ☆88 Updated this week
- Code for the NeurIPS 2024 paper: QuaRot, end-to-end 4-bit inference of large language models ☆351 Updated 3 months ago
- ☆34 Updated this week
- PB-LLM: Partially Binarized Large Language Models ☆151 Updated last year
- Fast Matrix Multiplications for Lookup Table-Quantized LLMs ☆231 Updated last week
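The common thread among these projects is low-bit quantization of model weights and activations. As a rough illustration of the core idea behind 8-bit functions like those in bitsandbytes, here is a minimal blockwise absmax int8 quantize/dequantize round-trip in plain NumPy. This is a sketch of the general technique, not the library's actual API; the function names and block size are illustrative assumptions.

```python
import numpy as np

def quantize_absmax_int8(x, block_size=64):
    """Blockwise absmax quantization (illustrative, not the bitsandbytes API):
    each block of `block_size` values is scaled so its largest magnitude
    maps to 127, then rounded to int8."""
    blocks = x.reshape(-1, block_size)
    scales = np.abs(blocks).max(axis=1, keepdims=True) / 127.0
    scales[scales == 0] = 1.0  # avoid division by zero for all-zero blocks
    q = np.clip(np.round(blocks / scales), -127, 127).astype(np.int8)
    return q, scales

def dequantize_absmax_int8(q, scales, shape):
    """Recover an approximate float tensor from int8 codes and per-block scales."""
    return (q.astype(np.float32) * scales).reshape(shape)

x = np.random.randn(256).astype(np.float32)
q, s = quantize_absmax_int8(x)
x_hat = dequantize_absmax_int8(q, s, x.shape)
err = np.abs(x - x_hat).max()  # bounded by half a quantization step per block
```

Storing `q` (1 byte per value) plus one scale per block is what yields the roughly 4x memory saving over float32 that these 8-bit libraries exploit; the per-block scales keep the rounding error small even when value magnitudes vary across the tensor.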