huggingface / picotron
Minimalistic 4D-parallelism distributed training framework for education purposes
☆987 · Updated last month
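As a rough illustration of the "4D parallelism" named in the description above, the sketch below shows how such a layout is commonly expressed with PyTorch's generic DeviceMesh API: the GPUs are factored into one mesh axis per parallelism dimension (data, tensor, pipeline, context). This is an assumption about the usual convention, not picotron's own code, and the degrees used are hypothetical.

```python
# Minimal 4D device-mesh sketch (generic PyTorch DeviceMesh, NOT picotron's API).
# Launch with e.g.: torchrun --nproc_per_node=16 mesh_sketch.py
import os

import torch
import torch.distributed as dist
from torch.distributed.device_mesh import init_device_mesh

torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
dist.init_process_group(backend="nccl")

# Hypothetical parallelism degrees; their product must equal the world size (2*2*2*2 = 16).
dp, tp, pp, cp = 2, 2, 2, 2
mesh = init_device_mesh(
    "cuda", (dp, tp, pp, cp), mesh_dim_names=("dp", "tp", "pp", "cp")
)

# Each named sub-mesh yields the process group used for that dimension's
# collectives (e.g. data-parallel gradient all-reduce, tensor-parallel
# all-gather, pipeline-parallel point-to-point sends).
dp_group = mesh["dp"].get_group()
tp_group = mesh["tp"].get_group()

dist.destroy_process_group()
```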
Alternatives and similar repositories for picotron:
Users interested in picotron are comparing it to the libraries listed below.
- Minimalistic large language model 3D-parallelism training ☆1,793 · Updated this week
- Recipes to scale inference-time compute of open models ☆1,055 · Updated last month
- Best practices & guides on how to write distributed pytorch training code ☆391 · Updated last month
- A bibliography and survey of the papers surrounding o1 ☆1,185 · Updated 5 months ago
- Puzzles for learning Triton ☆1,577 · Updated 5 months ago
- Lighteval is your all-in-one toolkit for evaluating LLMs across multiple backends ☆1,414 · Updated this week
- FlashInfer: Kernel Library for LLM Serving ☆2,659 · Updated last week
- What would you do with 1000 H100s... ☆1,035 · Updated last year
- Building blocks for foundation models. ☆483 · Updated last year
- Training Large Language Model to Reason in a Continuous Latent Space ☆1,062 · Updated 2 months ago
- Deep learning for dummies. All the practical details and useful utilities that go into working with real models. ☆784 · Updated last month
- An ML Systems Onboarding list ☆751 · Updated 2 months ago
- A throughput-oriented high-performance serving framework for LLMs ☆794 · Updated 7 months ago
- Helpful tools and examples for working with flex-attention ☆720 · Updated last week
- Muon is Scalable for LLM Training ☆1,022 · Updated 3 weeks ago
- LLM KV cache compression made easy ☆458 · Updated this week
- A subset of PyTorch's neural network modules, written in Python using OpenAI's Triton. ☆530 · Updated this week
- Flash Attention in ~100 lines of CUDA (forward pass only) ☆779 · Updated 3 months ago
- Scalable toolkit for efficient model alignment ☆765 · Updated last week
- Implementation of 💍 Ring Attention, from Liu et al. at Berkeley AI, in Pytorch ☆510 · Updated 5 months ago
- GPU programming related news and material links ☆1,454 · Updated 3 months ago
- Tile primitives for speedy kernels ☆2,259 · Updated this week
- Understanding R1-Zero-Like Training: A Critical Perspective ☆863 · Updated this week
- Muon optimizer: >30% sample efficiency with <3% wallclock overhead ☆575 · Updated 3 weeks ago
- 🐳 Efficient Triton implementations for "Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention" ☆631 · Updated last month
- UNet diffusion model in pure CUDA ☆602 · Updated 9 months ago
- Transformers-compatible library for applying various compression algorithms to LLMs for optimized deployment with vLLM ☆1,237 · Updated this week
- Ring attention implementation with flash attention ☆737 · Updated last week
- 🚀 Efficient implementations of state-of-the-art linear attention models in Torch and Triton ☆2,287 · Updated this week
- ☆423 · Updated 9 months ago