mengwanguc / gpemu
GPEmu: a GPU emulator for faster and cheaper prototyping and evaluation of deep learning systems research
☆34 · Updated last year
Alternatives and similar repositories for gpemu
Users interested in gpemu are comparing it to the repositories listed below.
- Tensor library & inference framework for machine learning ☆114 · Updated 2 months ago
- CUDA-L2: Surpassing cuBLAS Performance for Matrix Multiplication through Reinforcement Learning ☆142 · Updated this week
- Repository for the QUIK project, enabling the use of 4-bit kernels for generative inference (EMNLP 2024) ☆184 · Updated last year
- Tilus is a tile-level kernel programming language with explicit control over shared memory and registers ☆408 · Updated this week
- Samples of good AI-generated CUDA kernels ☆92 · Updated 6 months ago
- ☆75 · Updated 3 weeks ago
- Hashed Lookup Table based Matrix Multiplication (halutmatmul), the Stella Nera accelerator ☆214 · Updated last year
- PyTorch script hot swap: change code without unloading your LLM from VRAM ☆125 · Updated 7 months ago
- Standalone command-line tool for compiling Triton kernels ☆20 · Updated last year
- Write a fast kernel and run it on Discord. See how you compare against the best! ☆61 · Updated last week
- Hand-rolled GPU communications library ☆72 · Updated last week
- High-performance SGEMM on CUDA devices ☆113 · Updated 10 months ago
- PCCL (Prime Collective Communications Library) implements fault-tolerant collective communications over IP ☆140 · Updated 2 months ago
- Tiny code to access the Tenstorrent Blackhole ☆61 · Updated 6 months ago
- ☆456 · Updated last week
- Helpful kernel tutorials and examples for tile-based GPU programming ☆202 · Updated this week
- Compression for foundation models ☆34 · Updated 4 months ago
- ☆85 · Updated 3 weeks ago
- 🏙 Interactive performance profiling and debugging tool for PyTorch neural networks ☆64 · Updated 10 months ago
- ☆17 · Updated last month
- Make Triton easier ☆49 · Updated last year
- Official problem sets / reference kernels for the GPU MODE leaderboard! ☆164 · Updated last week
- Simple high-throughput inference library ☆150 · Updated 6 months ago
- Fast and Furious AMD kernels ☆309 · Updated last week
- Lightweight Llama 3 8B inference engine in CUDA C ☆53 · Updated 8 months ago
- ☆28 · Updated 10 months ago
- Prepare for DeepSeek R1 inference: benchmark CPU, DRAM, SSD, iGPU, GPU, ... with efficient code ☆73 · Updated 10 months ago
- This project aims to enable language model inference on FPGAs, supporting AI applications in edge devices and environments with limited r… ☆169 · Updated last year
- ☆219 · Updated 10 months ago
- Ship correct and fast LLM kernels to PyTorch ☆125 · Updated 3 weeks ago