mag- / gpu_benchmark
GPU benchmark
☆63 Updated 5 months ago
Alternatives and similar repositories for gpu_benchmark
Users who are interested in gpu_benchmark are comparing it to the repositories listed below
- High-Performance SGEMM on CUDA devices ☆97 Updated 5 months ago
- Samples of good AI generated CUDA kernels ☆84 Updated last month
- Load compute kernels from the Hub ☆207 Updated this week
- RWKV-7: Surpassing GPT ☆92 Updated 8 months ago
- ☆71 Updated 2 weeks ago
- ☆71 Updated 6 months ago
- Mixed precision training from scratch with Tensors and CUDA ☆24 Updated last year
- QuIP quantization ☆54 Updated last year
- Make triton easier ☆47 Updated last year
- Tree Attention: Topology-aware Decoding for Long-Context Attention on GPU clusters ☆128 Updated 7 months ago
- ring-attention experiments ☆143 Updated 9 months ago
- A collection of tricks and tools to speed up transformer models ☆170 Updated last month
- 👷 Build compute kernels ☆77 Updated this week
- A safetensors extension to efficiently store sparse quantized tensors on disk ☆135 Updated this week
- Experimental GPU language with meta-programming ☆23 Updated 10 months ago
- ☆21 Updated 4 months ago
- Experiment of using Tangent to autodiff triton ☆79 Updated last year
- LLM training in simple, raw C/CUDA ☆99 Updated last year
- The simplest implementation of recent Sparse Attention patterns for efficient LLM inference. ☆78 Updated last month
- ☆80 Updated last year
- ☆88 Updated last year
- Prepare for DeepSeek R1 inference: Benchmark CPU, DRAM, SSD, iGPU, GPU, ... with efficient code. ☆72 Updated 5 months ago
- Docker image for NVIDIA GH200 machines - optimized for vllm serving and hf trainer finetuning ☆46 Updated 4 months ago
- https://x.com/BlinkDL_AI/status/1884768989743882276 ☆28 Updated 2 months ago
- Demo of the unit_scaling library, showing how a model can be easily adapted to train in FP8. ☆46 Updated last year
- ☆49 Updated last year
- Token Omission Via Attention ☆128 Updated 9 months ago
- [WIP] Better (FP8) attention for Hopper ☆31 Updated 4 months ago
- Inference of Mamba models in pure C ☆188 Updated last year
- Advanced Ultra-Low Bitrate Compression Techniques for the LLaMA Family of LLMs ☆110 Updated last year