drisspg / driss_torch
CUDA extensions for PyTorch
☆11Updated 5 months ago
Alternatives and similar repositories for driss_torch
Users that are interested in driss_torch are comparing it to the libraries listed below
- ☆21Updated 7 months ago
- Experiment of using Tangent to autodiff triton☆80Updated last year
- Write a fast kernel and run it on Discord. See how you compare against the best!☆58Updated 2 weeks ago
- extensible collectives library in triton☆89Updated 6 months ago
- PTX-Tutorial Written Purely By AIs (Deep Research of OpenAI and Claude 3.7)☆66Updated 6 months ago
- Triton-based Symmetric Memory operators and examples☆32Updated this week
- Explore training for quantized models☆25Updated 3 months ago
- Memory Optimizations for Deep Learning (ICML 2023)☆108Updated last year
- Small-scale distributed training of sequential deep learning models, built on NumPy and MPI.☆144Updated last year
- ☆89Updated last year
- train with kittens!☆62Updated 11 months ago
- An experimental CPU backend for Triton (https://github.com/openai/triton)☆45Updated last month
- High-Performance SGEMM on CUDA devices☆107Updated 8 months ago
- Load compute kernels from the Hub☆293Updated last week
- Learn CUDA with PyTorch☆85Updated 2 weeks ago
- This repository contains the experimental PyTorch native float8 training UX☆223Updated last year
- FlashRNN - Fast RNN Kernels with I/O Awareness☆98Updated 4 months ago
- ☆173Updated last year
- ☆333Updated last month
- Parallel Computing starter project to build GPU & CPU kernels in CUDA & C++ and call them from Python without a single line of CMake usin…☆29Updated 7 months ago
- Parallel framework for training and fine-tuning deep neural networks☆65Updated 6 months ago
- A stand-alone implementation of several NumPy dtype extensions used in machine learning.☆301Updated this week
- Tree Attention: Topology-aware Decoding for Long-Context Attention on GPU clusters☆130Updated 10 months ago
- ☆19Updated 4 months ago
- python package of rocm-smi-lib☆24Updated 2 months ago
- Mixed precision training from scratch with Tensors and CUDA☆27Updated last year
- ☆18Updated last year
- Python bindings for ggml☆146Updated last year
- ☆49Updated last year
- Simple high-throughput inference library☆142Updated 4 months ago