ROCm / bitsandbytes
8-bit CUDA functions for PyTorch
☆46 Updated last month
Alternatives and similar repositories for bitsandbytes:
Users interested in bitsandbytes compare it to the libraries listed below.
- Fast and memory-efficient exact attention ☆163 Updated this week
- 8-bit CUDA functions for PyTorch, ROCm-compatible ☆39 Updated last year
- A simple Flash Attention v2 implementation with ROCm (RDNA3 GPU, rocWMMA), mainly used for Stable Diffusion (ComfyUI) in Windows ZLUDA en… ☆37 Updated 7 months ago
- Development repository for the Triton language and compiler ☆114 Updated this week
- AMD related optimizations for transformer models ☆72 Updated 4 months ago
- Efficient GPU support for LLM inference with x-bit quantization (e.g. FP6, FP5). ☆243 Updated 5 months ago
- A safetensors extension to efficiently store sparse quantized tensors on disk ☆92 Updated this week
- AITemplate is a Python framework which renders neural network into high performance CUDA/HIP C++ code. Specialized for FP16 TensorCore (N… ☆11 Updated 9 months ago
- 8-bit CUDA functions for PyTorch, ported to HIP for use in AMD GPUs ☆49 Updated last year
- llama.cpp fork with additional SOTA quants and improved performance ☆231 Updated this week
- A high-throughput and memory-efficient inference and serving engine for LLMs ☆70 Updated this week
- ☆26 Updated this week
- ☆113 Updated last week
- Ahead of Time (AOT) Triton Math Library ☆56 Updated 2 weeks ago
- Fast low-bit matmul kernels in Triton ☆275 Updated this week
- DEPRECATED! ☆52 Updated 9 months ago
- python package of rocm-smi-lib ☆20 Updated 6 months ago
- Fast Matrix Multiplications for Lookup Table-Quantized LLMs ☆236 Updated last month
- Efficient 3bit/4bit quantization of LLaMA models ☆19 Updated last year
- High-speed GEMV kernels, up to 2.7x speedup compared to the PyTorch baseline. ☆103 Updated 8 months ago
- GPTQLoRA: Efficient Finetuning of Quantized LLMs with GPTQ ☆100 Updated last year
- hipBLASLt is a library that provides general matrix-matrix operations with a flexible API and extends functionalities beyond a traditiona… ☆84 Updated this week
- oneCCL Bindings for Pytorch* ☆91 Updated this week
- GPTQ inference Triton kernel ☆300 Updated last year
- Linux based GDDR6/GDDR6X VRAM temperature reader for NVIDIA RTX 3000/4000 series GPUs. ☆97 Updated 7 months ago
- This repository contains the experimental PyTorch native float8 training UX ☆222 Updated 8 months ago
- AI Tensor Engine for ROCm ☆142 Updated this week
- 🚀 Efficiently (pre)training foundation models with native PyTorch features, including FSDP for training and SDPA implementation of Flash… ☆234 Updated this week
- ☆20 Updated last week
- [ICLR'25] Fast Inference of MoE Models with CPU-GPU Orchestration ☆202 Updated 4 months ago