tjyuyao / cutex
PyCUDA based PyTorch Extension Made Easy
☆25 Updated last year
Alternatives and similar repositories for cutex
Users who are interested in cutex are comparing it to the libraries listed below
- Customized matrix multiplication kernels ☆56 Updated 3 years ago
- Demo of the unit_scaling library, showing how a model can be easily adapted to train in FP8. ☆45 Updated 11 months ago
- ☆157 Updated last year
- pytorch-profiler ☆51 Updated 2 years ago
- ☆21 Updated last year
- A block oriented training approach for inference time optimization. ☆33 Updated 10 months ago
- FlexAttention w/ FlashAttention3 Support ☆26 Updated 8 months ago
- CUDA implementation of autoregressive linear attention, with all the latest research findings ☆44 Updated 2 years ago
- Hacks for PyTorch ☆19 Updated 2 years ago
- Faster Pytorch bitsandbytes 4bit fp4 nn.Linear ops ☆30 Updated last year
- Patch convolution to avoid large GPU memory usage of Conv2D ☆88 Updated 5 months ago
- Experiment of using Tangent to autodiff triton ☆79 Updated last year
- APPy (Annotated Parallelism for Python) enables users to annotate loops and tensor expressions in Python with compiler directives akin to… ☆23 Updated last week
- ONNX Command-Line Toolbox ☆35 Updated 8 months ago
- ☆29 Updated 2 years ago
- Ahead of Time (AOT) Triton Math Library ☆67 Updated last week
- Memory Optimizations for Deep Learning (ICML 2023) ☆64 Updated last year
- Implementation of Hyena Hierarchy in JAX ☆10 Updated 2 years ago
- No-GIL Python environment featuring NVIDIA Deep Learning libraries. ☆61 Updated 2 months ago
- [WIP] Better (FP8) attention for Hopper ☆30 Updated 4 months ago
- Prototype routines for GPU quantization written using PyTorch. ☆21 Updated 4 months ago
- A library for unit scaling in PyTorch ☆125 Updated 7 months ago
- Reference implementation of "Softmax Attention with Constant Cost per Token" (Heinsen, 2024) ☆24 Updated last year
- Parallel Associative Scan for Language Models ☆18 Updated last year
- The simplest but fast implementation of matrix multiplication in CUDA. ☆36 Updated 11 months ago
- TileFusion is an experimental C++ macro kernel template library that elevates the abstraction level in CUDA C for tile processing. ☆90 Updated 3 weeks ago
- Codes of the paper Deformable Butterfly: A Highly Structured and Sparse Linear Transform. ☆12 Updated 3 years ago
- Official Repository for Efficient Linear-Time Attention Transformers. ☆18 Updated last year
- Implementation of IceFormer: Accelerated Inference with Long-Sequence Transformers on CPUs (ICLR 2024). ☆25 Updated last year
- A GPU performance profiling tool for PyTorch models ☆22 Updated 2 years ago