☆93Nov 11, 2025Updated 4 months ago
Alternatives and similar repositories for GPU_Programming
Users who are interested in GPU_Programming are comparing it to the libraries listed below
- Step by step implementation of a fast softmax kernel in CUDA☆62Jan 6, 2025Updated last year
- torch.compile artifacts for common deep learning models, can be used as a learning resource for torch.compile☆19Dec 22, 2023Updated 2 years ago
- BFloat16 Fused Adam Operator for PyTorch☆17Nov 16, 2024Updated last year
- General Matrix Multiplication using NVIDIA Tensor Cores☆28Jan 25, 2025Updated last year
- ☆91Feb 29, 2024Updated 2 years ago
- Official repository for the paper Local Linear Attention: An Optimal Interpolation of Linear and Softmax Attention For Test-Time Regressi…☆23Oct 1, 2025Updated 5 months ago
- RAPIDS Deployment Documentation☆15Mar 11, 2026Updated last week
- A series of high-performance GEMM (General Matrix Multiply) implementations, iteratively optimised for H100 GPUs in pure CUDA.☆73Feb 18, 2026Updated last month
- [ACL'24 Oral] Analysing The Impact of Sequence Composition on Language Model Pre-Training☆23Aug 18, 2024Updated last year
- Code for the EMNLP24 paper "A simple and effective L2 norm based method for KV Cache compression."☆18Dec 13, 2024Updated last year
- My study notes and hands-on projects for CUDA-based GPU programming☆10Dec 11, 2025Updated 3 months ago
- An example of how to use the multiprocessing package along with PyTorch.☆21Jan 15, 2021Updated 5 years ago
- A regex engine implemented in the traditional way that can generate graphs of finite automata.☆10May 3, 2018Updated 7 years ago
- A cookiecutter template for creating a new LLM plugin that adds tools to LLM☆29May 27, 2025Updated 9 months ago
- Comparing deep learning inference of PyTorch models running on CPU, CUDA and TensorRT☆16Feb 20, 2022Updated 4 years ago
- NVIDIA tools guide☆164Jan 7, 2025Updated last year
- Repository to host ROCm Developer Hub Notebook Tutorials☆58Updated this week
- A curriculum for learning about GPU performance engineering, from scratch to what the frontier AI labs do☆472Mar 2, 2026Updated 2 weeks ago
- Read custom dataset☆12Mar 31, 2023Updated 2 years ago
- ☆14Apr 10, 2023Updated 2 years ago
- Flash Attention in raw CUDA C beating PyTorch☆38May 14, 2024Updated last year
- Row-wise block scaling for fp8 quantization matrix multiplication. Solution to GPU mode AMD challenge.☆18Feb 9, 2026Updated last month
- Persistent dense gemm for Hopper in `CuTeDSL`☆15Aug 9, 2025Updated 7 months ago
- Small-scale distributed training of sequential deep learning models, built on NumPy and MPI.☆163Oct 19, 2023Updated 2 years ago
- Finetuning BLOOM on a single GPU using gradient-accumulation☆31Mar 29, 2023Updated 2 years ago
- [UNMAINTAINED] md6 FTW☆10Mar 17, 2016Updated 10 years ago
- Variational Autoencoder with non-euclidean (hyperbolic) latent space☆12Nov 25, 2022Updated 3 years ago
- Applying GPUs in ML and DL☆62Updated this week
- torchcomms: a modern PyTorch communications API☆349Updated this week
- ☆25Mar 9, 2026Updated last week
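- My understanding of the code in each chapter of the book 《x86汇编语言 从实模式到保护模式》 (x86 Assembly Language: From Real Mode to Protected Mode), with some of the code annotated☆10Nov 26, 2019Updated 6 years ago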
- A crawler for high-definition images from Google☆13Jan 7, 2020Updated 6 years ago
- ☆14May 18, 2025Updated 10 months ago
- ☆23Feb 16, 2022Updated 4 years ago
- Fast parallel RNN-Transducer.☆10Nov 1, 2019Updated 6 years ago
- A curated list of resources for learning and exploring Triton, OpenAI's programming language for writing efficient GPU code.☆466Mar 10, 2025Updated last year
- A notebook testing CPU speed vs GPU speed with PyTorch and CUDA☆17Dec 25, 2021Updated 4 years ago
- A student training project for HLS and transformers☆11Oct 19, 2022Updated 3 years ago
- Implementation from scratch in CUDA C++ of image processing algorithms.☆22Oct 26, 2020Updated 5 years ago