ROCm / bitsandbytes
8-bit CUDA functions for PyTorch
☆38 · Updated last week
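For context, bitsandbytes is typically used as a drop-in replacement for standard PyTorch optimizers and layers. Below is a minimal sketch of the upstream 8-bit optimizer usage (assuming a working GPU build of `bitsandbytes`; the exact API surface of this ROCm fork may differ):

```python
# Minimal sketch: swapping torch.optim.Adam for the 8-bit Adam from
# bitsandbytes to cut optimizer-state memory. Model and data are toy
# placeholders; assumes a CUDA/ROCm-capable bitsandbytes build.
import torch
import bitsandbytes as bnb

model = torch.nn.Linear(1024, 1024).cuda()
optimizer = bnb.optim.Adam8bit(model.parameters(), lr=1e-3)  # instead of torch.optim.Adam

x = torch.randn(16, 1024, device="cuda")
loss = model(x).pow(2).mean()
loss.backward()
optimizer.step()
optimizer.zero_grad()
```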
Related projects
Alternatives and complementary repositories for bitsandbytes
- Fast and memory-efficient exact attention ☆139 · Updated this week
- Development repository for the Triton language and compiler (see the Triton kernel sketch after this list) ☆93 · Updated this week
- vLLM: A high-throughput and memory-efficient inference and serving engine for LLMs (see the offline-inference sketch after this list) ☆89 · Updated this week
- AMD-related optimizations for transformer models ☆57 · Updated 2 weeks ago
- A high-throughput and memory-efficient inference and serving engine for LLMs ☆45 · Updated this week
- AITemplate is a Python framework which renders neural networks into high-performance CUDA/HIP C++ code. Specialized for FP16 TensorCore (N… ☆11 · Updated 4 months ago
- Q-GaLore: Quantized GaLore with INT4 Projection and Layer-Adaptive Low-Rank Gradients ☆173 · Updated 4 months ago
- Advanced Quantization Algorithm for LLMs. This is the official implementation of "Optimize Weight Rounding via Signed Gradient Descent for t… ☆248 · Updated this week
- A high-throughput and memory-efficient inference and serving engine for LLMs ☆253 · Updated last month
- GPTQ inference Triton kernel ☆284 · Updated last year
- PB-LLM: Partially Binarized Large Language Models ☆148 · Updated last year
- An efficient implementation of the method proposed in "The Era of 1-bit LLMs" ☆154 · Updated last month
- Fast Inference of MoE Models with CPU-GPU Orchestration ☆172 · Updated this week
- This repository contains the experimental PyTorch-native float8 training UX ☆211 · Updated 3 months ago
- Fast Matrix Multiplications for Lookup Table-Quantized LLMs ☆187 · Updated this week
- An innovative library for efficient LLM inference via low-bit quantization ☆348 · Updated 2 months ago
- Code for the paper "QMoE: Practical Sub-1-Bit Compression of Trillion-Parameter Models" ☆262 · Updated last year
- Hackable and optimized Transformers building blocks, supporting a composable construction ☆20 · Updated last week
- Easy and lightning-fast training of 🤗 Transformers on Habana Gaudi processors (HPU) ☆153 · Updated this week
- Code for the paper "QuIP: 2-Bit Quantization of Large Language Models With Guarantees", adapted for Llama models ☆36 · Updated last year
- hipBLASLt is a library that provides general matrix-matrix operations with a flexible API and extends functionalities beyond a traditiona… ☆63 · Updated this week
- Get down and dirty with FlashAttention 2.0 in PyTorch; plug and play, no complex CUDA kernels ☆98 · Updated last year
- Ring-attention experiments ☆97 · Updated last month
- Efficient 3-bit/4-bit quantization of LLaMA models ☆19 · Updated last year
- Production-ready LLM model compression/quantization toolkit with accelerated inference support for both CPU/GPU via HF, vLLM, and SGLang ☆125 · Updated this week
- Code for the paper "QuIP: 2-Bit Quantization of Large Language Models With Guarantees" ☆350 · Updated 8 months ago
- Efficient GPU support for LLM inference with x-bit quantization (e.g. FP6, FP5) ☆209 · Updated 3 weeks ago
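
Since the Triton repository appears in the list above, here is the canonical vector-add kernel from Triton's introductory tutorial, as a sketch of what the language looks like (assumes `triton` and a GPU-enabled PyTorch install; the block size of 1024 is an arbitrary choice):

```python
# Sketch of a Triton kernel: each program instance handles one
# BLOCK_SIZE-wide slice of the input, with masking at the tail.
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements  # guard against out-of-bounds lanes
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

x = torch.randn(4096, device="cuda")
y = torch.randn(4096, device="cuda")
out = torch.empty_like(x)
grid = (triton.cdiv(x.numel(), 1024),)  # one program per block of 1024
add_kernel[grid](x, y, out, x.numel(), BLOCK_SIZE=1024)
```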
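Likewise, since vLLM (and what appear to be forks of it) shows up several times above, a minimal offline-inference sketch against the upstream vLLM API (the model name is only an example; ROCm forks may lag the upstream interface):

```python
# Sketch: offline batched generation with vLLM's Python API.
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")  # example model, swap for your own
params = SamplingParams(temperature=0.8, max_tokens=64)
outputs = llm.generate(["8-bit quantization reduces memory because"], params)
print(outputs[0].outputs[0].text)
```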