tjyuyao / cutex

PyCUDA based PyTorch Extension Made Easy

☆25

Alternatives and similar repositories for cutex

Users who are interested in cutex are comparing it to the libraries listed below.

DeMoriarty / custom_matmul_kernels
Customized matrix multiplication kernels
☆56 Updated 3 years ago
graphcore-research / out-of-the-box-fp8-training
Demo of the unit_scaling library, showing how a model can be easily adapted to train in FP8.
☆45 Updated 11 months ago
jundaf2 / INT8-Flash-Attention-FMHA-Quantization
☆157 Updated last year
cli99 / flops-profiler
pytorch-profiler
☆51 Updated 2 years ago
ptillet / triton-llvm-releases
☆21 Updated last year
pytorch-labs / superblock
A block oriented training approach for inference time optimization.
☆33 Updated 10 months ago
GindaChen / FlexFlashAttention3
FlexAttention w/ FlashAttention3 Support
☆26 Updated 8 months ago
lucidrains / autoregressive-linear-attention-cuda
CUDA implementation of autoregressive linear attention, with all the latest research findings
☆44 Updated 2 years ago
lernapparat / torchhacks
Hacks for PyTorch
☆19 Updated 2 years ago
aredden / torch-bnb-fp4
Faster PyTorch bitsandbytes 4-bit FP4 nn.Linear ops
☆30 Updated last year
mit-han-lab / patch_conv
Patch convolution to avoid large GPU memory usage of Conv2D
☆88 Updated 5 months ago
srush / triton-autodiff
Experiment of using Tangent to autodiff triton
☆79 Updated last year
habanero-lab / APPy
APPy (Annotated Parallelism for Python) enables users to annotate loops and tensor expressions in Python with compiler directives akin to…
☆23 Updated last week
zhenhuaw-me / onnxcli
ONNX Command-Line Toolbox
☆35 Updated 8 months ago
ahennequ / pytorch-custom-mma
☆29 Updated 2 years ago
ROCm / aotriton
Ahead of Time (AOT) Triton Math Library
☆67 Updated last week
facebookresearch / MODel_opt
Memory Optimizations for Deep Learning (ICML 2023)
☆64 Updated last year
Ryu1845 / hyena-jax
Implementation of Hyena Hierarchy in JAX
☆10 Updated 2 years ago
NVIDIA / free-threaded-python
No-GIL Python environment featuring NVIDIA Deep Learning libraries.
☆61 Updated 2 months ago
WaveSpeedAI / QuantumAttention
[WIP] Better (FP8) attention for Hopper
☆30 Updated 4 months ago
facebookexperimental / protoquant
Prototype routines for GPU quantization written using PyTorch.
☆21 Updated 4 months ago
graphcore-research / unit-scaling
A library for unit scaling in PyTorch
☆125 Updated 7 months ago
glassroom / heinsen_attention
Reference implementation of "Softmax Attention with Constant Cost per Token" (Heinsen, 2024)
☆24 Updated last year
proger / nanokitchen
Parallel Associative Scan for Language Models
☆18 Updated last year
andylolu2 / simpleGEMM
The simplest but fast implementation of matrix multiplication in CUDA.
☆36 Updated 11 months ago
microsoft / TileFusion
TileFusion is an experimental C++ macro kernel template library that elevates the abstraction level in CUDA C for tile processing.
☆90 Updated 3 weeks ago
rlin27 / DeBut
Codes of the paper Deformable Butterfly: A Highly Structured and Sparse Linear Transform.
☆12 Updated 3 years ago
acosharma / elita-transformer
Official Repository for Efficient Linear-Time Attention Transformers.
☆18 Updated last year
yuzhenmao / IceFormer
Implementation of IceFormer: Accelerated Inference with Long-Sequence Transformers on CPUs (ICLR 2024).
☆25 Updated last year
adityaiitb / PyProf
A GPU performance profiling tool for PyTorch models
☆22 Updated 2 years ago