huggingface / gpu-fryer
Where GPUs get cooked 👩‍🍳🔥
★229 Updated 2 months ago
Alternatives and similar repositories for gpu-fryer
Users who are interested in gpu-fryer are comparing it to the libraries listed below.
- A repository to unravel the language of GPUs, making their kernel conversations easy to understand ★180 Updated this week
- PyTorch per step fault tolerance (actively under development) ★300 Updated this week
- Efficiently (pre)training foundation models with native PyTorch features, including FSDP for training and SDPA implementation of Flash… ★245 Updated this week
- ★186 Updated 3 months ago
- Scalable and Performant Data Loading ★258 Updated this week
- ★155 Updated last year
- ★88 Updated last year
- Inference server benchmarking tool ★59 Updated 3 weeks ago
- Load compute kernels from the Hub ★119 Updated last week
- KernelBench: Can LLMs Write GPU Kernels? - Benchmark with Torch -> CUDA problems ★324 Updated last week
- Write a fast kernel and run it on Discord. See how you compare against the best! ★44 Updated this week
- A tool to configure, launch and manage your machine learning experiments. ★146 Updated this week
- TorchFix - a linter for PyTorch-using code with autofix support ★141 Updated 3 months ago
- prime-rl is a codebase for decentralized RL training at scale ★211 Updated this week
- A subset of PyTorch's neural network modules, written in Python using OpenAI's Triton. ★536 Updated this week
- PTX-Tutorial Written Purely By AIs (Deep Research of OpenAI and Claude 3.7) ★66 Updated last month
- A curated collection of resources, tutorials, and best practices for learning and mastering NVIDIA CUTLASS ★173 Updated last week
- DeMo: Decoupled Momentum Optimization ★186 Updated 5 months ago
- ★207 Updated last week
- Learning about CUDA by writing PTX code. ★129 Updated last year
- Slides, notes, and materials for the workshop ★325 Updated 11 months ago
- An extension of the nanoGPT repository for training small MOE models. ★142 Updated 2 months ago
- Docker image NVIDIA GH200 machines - optimized for vllm serving and hf trainer finetuning ★40 Updated 2 months ago
- NanoGPT-speedrunning for the poor T4 enjoyers ★65 Updated 3 weeks ago
- High-Performance SGEMM on CUDA devices ★91 Updated 3 months ago
- This repository contains the experimental PyTorch native float8 training UX ★224 Updated 9 months ago
- Small scale distributed training of sequential deep learning models, built on Numpy and MPI. ★132 Updated last year
- Code for "LayerSkip: Enabling Early Exit Inference and Self-Speculative Decoding", ACL 2024 ★294 Updated 2 weeks ago
- ★163 Updated 4 months ago
- LLM KV cache compression made easy ★481 Updated last week