huggingface / gpu-fryer
Where GPUs get cooked 👩‍🍳🔥
★230 · Updated 3 months ago
Alternatives and similar repositories for gpu-fryer
Users interested in gpu-fryer are comparing it to the repositories listed below.
- A repository to unravel the language of GPUs, making their kernel conversations easy to understand ★184 · Updated last week
- PyTorch per-step fault tolerance (actively under development) ★302 · Updated this week
- Load compute kernels from the Hub ★139 · Updated this week
- Efficiently (pre)training foundation models with native PyTorch features, including FSDP for training and SDPA implementation of Flash… ★249 · Updated this week
- ★190 · Updated 3 months ago
- Scalable and Performant Data Loading ★269 · Updated this week
- ★210 · Updated 4 months ago
- ★88 · Updated last year
- Write a fast kernel and run it on Discord. See how you compare against the best! ★44 · Updated this week
- A high-throughput and memory-efficient inference and serving engine for LLMs ★263 · Updated 7 months ago
- Code for "LayerSkip: Enabling Early Exit Inference and Self-Speculative Decoding", ACL 2024 ★301 · Updated last month
- ★215 · Updated this week
- ★179 · Updated last week
- An extension of the nanoGPT repository for training small MoE models. ★147 · Updated 2 months ago
- Home for "How To Scale Your Model", a short blog-style textbook about scaling LLMs on TPUs ★380 · Updated last month
- ArcticTraining is a framework designed to simplify and accelerate the post-training process for large language models (LLMs) ★105 · Updated this week
- Google TPU optimizations for transformers models ★112 · Updated 4 months ago
- LLM KV cache compression made easy ★493 · Updated 3 weeks ago
- A subset of PyTorch's neural network modules, written in Python using OpenAI's Triton. ★546 · Updated this week
- Efficient LLM Inference over Long Sequences ★376 · Updated this week
- Inference server benchmarking tool ★67 · Updated last month
- An efficient implementation of the method proposed in "The Era of 1-bit LLMs" ★153 · Updated 7 months ago
- KernelBench: Can LLMs Write GPU Kernels? A benchmark with Torch -> CUDA problems ★374 · Updated this week
- Fast low-bit matmul kernels in Triton ★311 · Updated this week
- A curated collection of resources, tutorials, and best practices for learning and mastering NVIDIA CUTLASS ★181 · Updated 3 weeks ago
- ★157 · Updated last year
- kernels, of the mega variety ★329 · Updated this week
- This repository contains the experimental PyTorch native float8 training UX ★223 · Updated 10 months ago
- Best practices & guides on how to write distributed PyTorch training code ★433 · Updated 3 months ago
- Manage scalable open LLM inference endpoints in Slurm clusters ★258 · Updated 10 months ago