idiap / sigma-gpt
σ-GPT: A New Approach to Autoregressive Models
☆61 · Updated 5 months ago
Alternatives and similar repositories for sigma-gpt:
Users interested in sigma-gpt are comparing it to the libraries listed below.
- The simplest, fastest repository for training/finetuning medium-sized GPTs. ☆91 · Updated 2 months ago
- Focused on fast experimentation and simplicity ☆65 · Updated last month
- Supporting PyTorch FSDP for optimizers ☆75 · Updated last month
- ☆53 · Updated last year
- Minimal (400 LOC) implementation, maximum (multi-node, FSDP) GPT training ☆121 · Updated 9 months ago
- ☆75 · Updated 6 months ago
- Efficient optimizers ☆154 · Updated this week
- Normalized Transformer (nGPT) ☆146 · Updated 2 months ago
- Explorations into the proposal from the paper "Grokfast: Accelerated Grokking by Amplifying Slow Gradients" ☆95 · Updated last month
- ☆45 · Updated 10 months ago
- Collection of autoregressive model implementations ☆77 · Updated 3 weeks ago
- ☆49 · Updated 10 months ago
- WIP ☆93 · Updated 5 months ago
- Understand and test language model architectures on synthetic tasks. ☆177 · Updated 2 weeks ago
- A MAD laboratory to improve AI architecture designs 🧪 ☆102 · Updated last month
- 🧱 Modula software package ☆134 · Updated this week
- ☆78 · Updated 9 months ago
- ☆149 · Updated last month
- $100K or 100 Days: Trade-offs when Pre-Training with Academic Resources ☆118 · Updated 2 weeks ago
- Code to reproduce "Transformers Can Do Arithmetic with the Right Embeddings", McLeish et al. (NeurIPS 2024) ☆183 · Updated 8 months ago
- A general framework for inference-time scaling and steering of diffusion models with arbitrary rewards. ☆71 · Updated 2 weeks ago
- DeMo: Decoupled Momentum Optimization ☆171 · Updated last month
- ☆69 · Updated last week
- ☆52 · Updated 2 months ago
- Implementation of the Llama architecture with RLHF + Q-learning ☆157 · Updated last year
- The official repository for HyperZ⋅Z⋅W Operator Connects Slow-Fast Networks for Full Context Interaction. ☆31 · Updated 2 weeks ago
- A single repo with all scripts and utils to train / fine-tune the Mamba model with or without FIM ☆50 · Updated 9 months ago
- Tree Attention: Topology-aware Decoding for Long-Context Attention on GPU clusters ☆113 · Updated last month
- Muon optimizer for neural networks: >30% extra sample efficiency, <3% wallclock overhead ☆220 · Updated 3 weeks ago