ROCm / xformers
Hackable and optimized Transformers building blocks, supporting a composable construction.
☆20 · Updated this week
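A minimal sketch of how the xformers building blocks are typically called, assuming the upstream `xformers.ops.memory_efficient_attention` API is exposed by this ROCm build (exact op coverage on ROCm is an assumption here, not something this listing confirms):

```python
# Minimal sketch, assuming the upstream xformers API; ROCm op coverage may differ.
import torch
import xformers.ops as xops

# Upstream layout convention: (batch, seq_len, num_heads, head_dim).
q = torch.randn(2, 1024, 8, 64, device="cuda", dtype=torch.float16)
k = torch.randn(2, 1024, 8, 64, device="cuda", dtype=torch.float16)
v = torch.randn(2, 1024, 8, 64, device="cuda", dtype=torch.float16)

# Memory-efficient attention; attn_bias is optional (here a causal mask).
out = xops.memory_efficient_attention(
    q, k, v,
    attn_bias=xops.LowerTriangularMask(),
)
print(out.shape)  # torch.Size([2, 1024, 8, 64])
```

On a ROCm install of PyTorch the HIP device still appears as `"cuda"`, so the same snippet is intended to run unchanged on AMD GPUs.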
Related projects
Alternatives and complementary repositories for xformers
- 8-bit CUDA functions for PyTorch (☆39 · Updated 2 weeks ago)
- Fast and memory-efficient exact attention (☆140 · Updated this week; see the attention usage sketch after this list)
- Development repository for the Triton language and compiler (☆96 · Updated this week)
- AMD-related optimizations for transformer models (☆57 · Updated 3 weeks ago)
- hipBLASLt is a library that provides general matrix-matrix operations with a flexible API and extends functionalities beyond a traditiona… (☆63 · Updated this week)
- A simple Flash Attention v2 implementation with ROCm (RDNA3 GPU, roc wmma), mainly used for Stable Diffusion (ComfyUI) in Windows ZLUDA en… (☆26 · Updated 3 months ago)
- Ahead of Time (AOT) Triton Math Library (☆41 · Updated this week)
- Simple and fast low-bit matmul kernels in CUDA / Triton (☆147 · Updated this week)
- [WIP] Context parallel attention that works with torch.compile (☆52 · Updated this week)
- A parallel VAE that avoids OOM for high-resolution image generation (☆40 · Updated 2 months ago)
- PyTorch half-precision GEMM lib with fused optional bias + optional ReLU/GELU (☆39 · Updated 2 months ago)
- A high-throughput and memory-efficient inference and serving engine for LLMs (☆45 · Updated this week)
- Efficient GPU support for LLM inference with x-bit quantization (e.g. FP6, FP5) (☆212 · Updated 3 weeks ago)
- (WIP) Parallel inference for black-forest-labs' FLUX model (☆11 · Updated last week)
- AITemplate is a Python framework which renders neural networks into high-performance CUDA/HIP C++ code. Specialized for FP16 TensorCore (N… (☆11 · Updated 5 months ago)
- Odysseus: Playground of LLM Sequence Parallelism (☆57 · Updated 5 months ago)
- KV cache compression for high-throughput LLM inference (☆89 · Updated last week)
- Fast and memory-efficient exact attention (☆30 · Updated last month)
- LLaMA INT4 CUDA inference with AWQ (☆48 · Updated 4 months ago)
- Flash Attention implemented using CuTe (☆39 · Updated this week)
- Breaking Throughput-Latency Trade-off for Long Sequences with Speculative Decoding (☆82 · Updated last week)
- Faster PyTorch bitsandbytes 4-bit FP4 nn.Linear ops (☆23 · Updated 8 months ago)
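Several of the attention projects above (the flash-attention forks and the CuTe/Triton reimplementations) are most easily exercised through PyTorch's built-in scaled dot-product attention front end, which routes to a fused flash or memory-efficient kernel when one is available for the installed backend. A minimal sketch, assuming PyTorch 2.x; which fused backend actually gets picked on a given CUDA or ROCm build is an assumption, not a guarantee:

```python
# Minimal sketch: PyTorch 2.x SDPA, which dispatches to a flash-attention or
# memory-efficient backend when the build provides one (backend availability
# on a particular CUDA/ROCm install is an assumption here).
import torch
import torch.nn.functional as F

# SDPA expects (batch, num_heads, seq_len, head_dim).
q = torch.randn(2, 8, 1024, 64, device="cuda", dtype=torch.float16)
k = torch.randn(2, 8, 1024, 64, device="cuda", dtype=torch.float16)
v = torch.randn(2, 8, 1024, 64, device="cuda", dtype=torch.float16)

# is_causal=True applies a lower-triangular mask without materializing it.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)  # torch.Size([2, 8, 1024, 64])
```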