ahennequ / cuda-tensorcores-register-mapping
☆19Updated 3 years ago
Alternatives and similar repositories for cuda-tensorcores-register-mapping
Users who are interested in cuda-tensorcores-register-mapping are comparing it to the libraries listed below.
- Udacity CS344 Introduction to Parallel Programming (https://classroom.udacity.com/courses/cs344), with assignments/materials updated to …☆46Updated 4 years ago
- Customized matrix multiplication kernels☆57Updated 3 years ago
- Hacks for PyTorch☆19Updated 2 years ago
- ONNX Command-Line Toolbox☆35Updated last year
- Demo of the unit_scaling library, showing how a model can be easily adapted to train in FP8.☆46Updated last year
- ☆159Updated 2 years ago
- ☆34Updated 6 months ago
- ☆29Updated 3 years ago
- PyTorch interface for the IPU☆181Updated 2 years ago
- Texture mapping with variational auto-encoders☆40Updated 4 years ago
- An open source implementation of CLIP.☆33Updated 3 years ago
- CUDA implementation of autoregressive linear attention, with all the latest research findings☆46Updated 2 years ago
- Implementation of fused cosine similarity attention in the same style as Flash Attention☆219Updated 2 years ago
- Fairring (FAIR + Herring) is a plug-in for PyTorch that provides a process group for distributed training that outperforms NCCL at large …☆65Updated 3 years ago
- IntLLaMA: A fast and light quantization solution for LLaMA☆18Updated 2 years ago
- Torch Distributed Experimental☆117Updated last year
- torch::deploy (multipy for non-torch uses) is a system that lets you get around the GIL problem by running multiple Python interpreters i…☆182Updated 3 months ago
- Loop Nest - Linear algebra compiler and code generator.☆21Updated 3 years ago
- A PyTorch Dataset that caches samples in shared memory, accessible globally to all processes☆22Updated 3 years ago
- Implementation for ACProp (Momentum centering and asynchronous update for adaptive gradient methods, NeurIPS 2021)☆16Updated 4 years ago
- FlexAttention w/ FlashAttention3 Support☆27Updated last year
- Experimental scripts for researching data adaptive learning rate scheduling.☆22Updated 2 years ago
- A short article showing how to load PyTorch models with linear memory consumption☆34Updated 3 years ago
- Guide on how to convert custom PyTorch layers when using ONNX.☆22Updated 7 years ago
- Implementation of Kronecker Attention in PyTorch☆19Updated 5 years ago
- TVMScript kernel for deformable attention☆25Updated 4 years ago
- A place to store reusable transformer components of my own creation or found on the interwebs☆63Updated last week
- Prototype routines for GPU quantization written using PyTorch.☆21Updated 4 months ago
- Some CUDA design patterns and a bit of template magic for CUDA☆157Updated 2 years ago
- Memory Optimizations for Deep Learning (ICML 2023)☆113Updated last year