ahennequ / cuda-tensorcores-register-mapping
☆19Updated 3 years ago
Alternatives and similar repositories for cuda-tensorcores-register-mapping
Users who are interested in cuda-tensorcores-register-mapping are comparing it to the libraries listed below
- Udacity CS344 Introduction to Parallel Programming (https://classroom.udacity.com/courses/cs344), with assignments/materials updated to …☆46Updated 4 years ago
- Customized matrix multiplication kernels☆57Updated 3 years ago
- ONNX Command-Line Toolbox☆35Updated last year
- Fairring (FAIR + Herring) is a plug-in for PyTorch that provides a process group for distributed training that outperforms NCCL at large …☆65Updated 3 years ago
- Hacks for PyTorch☆19Updated 2 years ago
- ☆34Updated 6 months ago
- Guide on how to convert custom PyTorch layers when using ONNX.☆22Updated 7 years ago
- torch::deploy (multipy for non-torch uses) is a system that lets you get around the GIL problem by running multiple Python interpreters i…☆182Updated 3 weeks ago
- Torch Distributed Experimental☆117Updated last year
- PyTorch interface for the IPU☆181Updated 2 years ago
- Memory Optimizations for Deep Learning (ICML 2023)☆114Updated last year
- CUDA implementation of autoregressive linear attention, with all the latest research findings☆46Updated 2 years ago
- ☆160Updated 2 years ago
- IntLLaMA: A fast and light quantization solution for LLaMA☆18Updated 2 years ago
- Some CUDA design patterns and a bit of template magic for CUDA☆157Updated 2 years ago
- A faster implementation of OpenCV-CUDA that uses OpenCV objects, and more!☆54Updated last month
- Texture mapping with variational auto-encoders☆40Updated 4 years ago
- TVMScript kernel for deformable attention☆25Updated 4 years ago
- Little article showing how to load pytorch's models with linear memory consumption☆34Updated 3 years ago
- Simple notebooks to learn diffusion models on toy datasets☆17Updated 2 years ago
- A block oriented training approach for inference time optimization.☆34Updated last year
- An open source implementation of CLIP.☆33Updated 3 years ago
- Implementation of fused cosine similarity attention in the same style as Flash Attention☆219Updated 2 years ago
- pytest plugin for a better developer experience when working with the PyTorch test suite☆44Updated 4 years ago
- PyTorch implementation of L2L execution algorithm☆109Updated 2 years ago
- ☆50Updated last year
- Authors' implementation of LieTransformer: Equivariant Self-Attention for Lie Groups☆36Updated 4 years ago
- Automatically insert nvtx ranges to PyTorch models☆22Updated 4 years ago
- A PyTorch Dataset that caches samples in shared memory, accessible globally to all processes☆23Updated 3 years ago
- Demo of the unit_scaling library, showing how a model can be easily adapted to train in FP8.☆46Updated last year