rwitten / HighPerfLLMs2024Links
☆547 · Updated last year
Alternatives and similar repositories for HighPerfLLMs2024
Users interested in HighPerfLLMs2024 are comparing it to the repositories listed below.
- Home for "How To Scale Your Model", a short blog-style textbook about scaling LLMs on TPUs ☆781 · Updated this week
- Legible, Scalable, Reproducible Foundation Models with Named Tensors and Jax ☆687 · Updated last month
- Building blocks for foundation models. ☆585 · Updated last year
- ☆286 · Updated last year
- Pax is a Jax-based machine learning framework for training large scale models. Pax allows for advanced and fully configurable experimenta… ☆542 · Updated this week
- What would you do with 1000 H100s... ☆1,133 · Updated last year
- Deep learning for dummies. All the practical details and useful utilities that go into working with real models. ☆829 · Updated 4 months ago
- Small scale distributed training of sequential deep learning models, built on Numpy and MPI. ☆153 · Updated 2 years ago
- Puzzles for exploring transformers ☆380 · Updated 2 years ago
- ☆225 · Updated last month
- Best practices & guides on how to write distributed pytorch training code ☆552 · Updated 2 months ago
- A subset of PyTorch's neural network modules, written in Python using OpenAI's Triton. ☆587 · Updated 4 months ago
- Open-source framework for the research and development of foundation models. ☆673 · Updated this week
- Fault tolerance for PyTorch (HSDP, LocalSGD, DiLoCo, Streaming DiLoCo) ☆459 · Updated 2 weeks ago
- seqax = sequence modeling + JAX ☆169 · Updated 5 months ago
- Minimal yet performant LLM examples in pure JAX ☆219 · Updated 3 weeks ago
- ☆460 · Updated last year
- ☆340 · Updated 2 weeks ago
- A repository to unravel the language of GPUs, making their kernel conversations easy to understand ☆195 · Updated 6 months ago
- Accelerate and optimize performance with streamlined training and serving options in JAX. ☆326 · Updated this week
- FlexAttention based, minimal vLLM-style inference engine for fast Gemma 2 inference. ☆327 · Updated last month
- KernelBench: Can LLMs Write GPU Kernels? - Benchmark + Toolkit with Torch -> CUDA (+ more DSLs) ☆718 · Updated this week
- 🚀 Efficiently (pre)training foundation models with native PyTorch features, including FSDP for training and SDPA implementation of Flash… ☆275 · Updated last month
- 🧱 Modula software package ☆316 · Updated 4 months ago
- JetStream is a throughput and memory optimized engine for LLM inference on XLA devices, starting with TPUs (and GPUs in future -- PRs wel… ☆396 · Updated 6 months ago
- ☆533 · Updated 4 months ago
- Annotated version of the Mamba paper ☆492 · Updated last year
- MLCommons Algorithmic Efficiency is a benchmark and competition measuring neural network training speedups due to algorithmic improvement… ☆404 · Updated this week
- Simple MPI implementation for prototyping or learning ☆293 · Updated 4 months ago
- For optimization algorithm research and development. ☆553 · Updated this week