wolfecameron / nanoMoE
An extension of the nanoGPT repository for training small MoE models.
☆ 147 · Updated 2 months ago
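For context, here is a minimal sketch of the kind of sparse mixture-of-experts (MoE) feed-forward block that a nanoMoE-style repository trains on top of nanoGPT. The top-k router, layer sizes, and names below are illustrative assumptions, not nanoMoE's actual code.

```python
# Illustrative sparse MoE feed-forward block (assumed top-k routing);
# a sketch of the idea, not nanoMoE's implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoEFeedForward(nn.Module):
    def __init__(self, d_model=128, d_hidden=512, n_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts)  # token -> expert logits
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Linear(d_model, d_hidden), nn.GELU(),
                nn.Linear(d_hidden, d_model),
            )
            for _ in range(n_experts)
        )

    def forward(self, x):  # x: (batch, seq, d_model)
        logits = self.router(x)                        # (B, T, n_experts)
        weights, idx = logits.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)           # renormalize over top-k
        out = torch.zeros_like(x)
        # Loop over experts for clarity; efficient kernels instead gather
        # the tokens routed to each expert and run them as one batch.
        for e, expert in enumerate(self.experts):
            gate = (weights * (idx == e)).sum(-1, keepdim=True)  # (B, T, 1)
            out = out + gate * expert(x)
        return out
```

In practice a load-balancing auxiliary loss over the router logits (as in Switch Transformer) is added so tokens spread across experts; it is omitted here for brevity.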
Alternatives and similar repositories for nanoMoE
Users interested in nanoMoE are comparing it to the libraries listed below.
- ☆ 188 · Updated 3 months ago
- The simplest, fastest repository for training/finetuning medium-sized GPTs. ☆ 126 · Updated 3 weeks ago
- PyTorch building blocks for the OLMo ecosystem ☆ 222 · Updated this week
- A repository to unravel the language of GPUs, making their kernel conversations easy to understand ☆ 184 · Updated last week
- minimal GRPO implementation from scratch ☆ 90 · Updated 2 months ago
- 🚀 Efficiently (pre)training foundation models with native PyTorch features, including FSDP for training and SDPA implementation of Flash… ☆ 249 · Updated this week
- The official implementation of the paper "What Matters in Transformers? Not All Attention is Needed". ☆ 173 · Updated 2 months ago
- Normalized Transformer (nGPT) ☆ 181 · Updated 6 months ago
- Memory layers use a trainable key-value lookup mechanism to add extra parameters to a model without increasing FLOPs. Conceptually, spars… (a toy sketch of the key-value lookup appears after this list) ☆ 333 · Updated 5 months ago
- ☆ 168 · Updated 5 months ago
- Implementation of 🥥 Coconut, Chain of Continuous Thought, in Pytorch ☆ 170 · Updated 5 months ago
- nanoGRPO is a lightweight implementation of Group Relative Policy Optimization (GRPO) ☆ 105 · Updated 3 weeks ago
- ☆ 78 · Updated 10 months ago
- Tina: Tiny Reasoning Models via LoRA ☆ 245 · Updated this week
- Repo for "LoLCATs: On Low-Rank Linearizing of Large Language Models" ☆ 237 · Updated 4 months ago
- Understand and test language model architectures on synthetic tasks. ☆ 197 · Updated 2 months ago
- Tree Attention: Topology-aware Decoding for Long-Context Attention on GPU clusters ☆ 126 · Updated 6 months ago
- LoRA and DoRA from Scratch Implementations ☆ 203 · Updated last year
- Prune transformer layers ☆ 69 · Updated last year
- code for training & evaluating Contextual Document Embedding models ☆ 191 · Updated 3 weeks ago
- Code for studying the super weight in LLM ☆ 104 · Updated 6 months ago
- Q-GaLore: Quantized GaLore with INT4 Projection and Layer-Adaptive Low-Rank Gradients. ☆ 199 · Updated 10 months ago
- ☆ 190 · Updated this week
- Collection of autoregressive model implementations ☆ 85 · Updated last month
- A MAD laboratory to improve AI architecture designs 🧪 ☆ 116 · Updated 5 months ago
- The simplest implementation of recent Sparse Attention patterns for efficient LLM inference. ☆ 62 · Updated 4 months ago
- Some preliminary explorations of Mamba's context scaling. ☆ 212 · Updated last year
- ☆ 47 · Updated 9 months ago
- Experiments on Multi-Head Latent Attention ☆ 88 · Updated 9 months ago
- Block Transformer: Global-to-Local Language Modeling for Fast Inference (NeurIPS 2024) ☆ 155 · Updated last month
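The memory-layer entry above describes the core mechanism clearly enough to sketch. Below is a toy, assumption-laden version of a trainable key-value lookup in PyTorch: each token attends to its top-k learned keys and returns a gated sum of the matching learned values, so parameter count grows with the number of keys while only k values are mixed per token. The real Meta implementation uses product-key memory to make the key search sub-linear; the exhaustive scoring here is for illustration only.

```python
# Toy trainable key-value memory layer (illustrative, not Meta's code).
# Capacity scales with n_keys; only top_k values are mixed per token.
# Note: scoring all keys as done here costs O(n_keys) per token; the
# product-key trick in real memory layers avoids that exhaustive scan.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyMemoryLayer(nn.Module):
    def __init__(self, d_model=128, n_keys=4096, top_k=32):
        super().__init__()
        self.top_k = top_k
        self.query = nn.Linear(d_model, d_model)
        self.keys = nn.Parameter(torch.randn(n_keys, d_model) * 0.02)
        self.values = nn.Parameter(torch.randn(n_keys, d_model) * 0.02)

    def forward(self, x):  # x: (batch, seq, d_model)
        q = self.query(x)                            # (B, T, D)
        scores = q @ self.keys.t()                   # (B, T, n_keys)
        top_scores, top_idx = scores.topk(self.top_k, dim=-1)
        gates = F.softmax(top_scores, dim=-1)        # (B, T, top_k)
        picked = self.values[top_idx]                # (B, T, top_k, D)
        return (gates.unsqueeze(-1) * picked).sum(dim=-2)
```

Dropped in after an attention block, a layer like this augments or replaces a dense feed-forward layer; the "extra parameters without extra FLOPs" claim in the entry refers to the value mixing, which touches only top_k rows of the value table no matter how large n_keys grows.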