samsja / muon_fsdp_2
Muon fsdp 2
☆27 Updated last week
Alternatives and similar repositories for muon_fsdp_2
Users who are interested in muon_fsdp_2 are comparing it to the libraries listed below.
- ☆113 Updated last year
- A fusion of a linear layer and a cross entropy loss, written for pytorch in triton. ☆70 Updated 11 months ago
- ☆81 Updated last year
- ☆136 Updated 5 months ago
- Flash-Muon: An Efficient Implementation of Muon Optimizer ☆144 Updated last month
- Triton-based implementation of Sparse Mixture of Experts. ☆225 Updated 7 months ago
- The evaluation framework for training-free sparse attention in LLMs ☆85 Updated last month
- Code for studying the super weight in LLM ☆113 Updated 7 months ago
- ☆53 Updated last year
- [NeurIPS'23] Speculative Decoding with Big Little Decoder ☆93 Updated last year
- supporting pytorch FSDP for optimizers ☆83 Updated 7 months ago
- ☆45 Updated last year
- The source code of our work "Prepacking: A Simple Method for Fast Prefilling and Increased Throughput in Large Language Models" [AISTATS … ☆61 Updated 9 months ago
- 🔥 A minimal training framework for scaling FLA models ☆194 Updated last month
- A library for unit scaling in PyTorch ☆128 Updated last week
- ☆122 Updated last month
- ☆147 Updated 2 years ago
- Cold Compress is a hackable, lightweight, and open-source toolkit for creating and benchmarking cache compression methods built on top of… ☆138 Updated 11 months ago
- ring-attention experiments ☆145 Updated 9 months ago
- Code for exploring Based models from "Simple linear attention language models balance the recall-throughput tradeoff" ☆237 Updated last month
- ☆97 Updated 9 months ago
- The simplest, fastest repository for training/finetuning medium-sized GPTs. ☆149 Updated 3 weeks ago
- ☆82 Updated 11 months ago
- Some preliminary explorations of Mamba's context scaling. ☆216 Updated last year
- Language models scale reliably with over-training and on downstream tasks ☆97 Updated last year
- This repository contains the experimental PyTorch native float8 training UX ☆224 Updated 11 months ago
- ☆127 Updated last year
- Repository of the paper "Accelerating Transformer Inference for Translation via Parallel Decoding" ☆119 Updated last year
- Understand and test language model architectures on synthetic tasks. ☆220 Updated last week
- ☆107 Updated 10 months ago