rachtsy / KPCA_code
Implementation for robust ViT and scaled attention
☆19 · Updated 2 months ago
Alternatives and similar repositories for KPCA_code
Users that are interested in KPCA_code are comparing it to the libraries listed below
- Fork of Flame repo for training of some new stuff in development ☆13 · Updated this week
- Code for the paper "Function-Space Learning Rates" ☆20 · Updated last month
- ☆33 · Updated 8 months ago
- Why Do We Need Weight Decay in Modern Deep Learning? [NeurIPS 2024] ☆66 · Updated 8 months ago
- Train a SmolLM-style LLM on fineweb-edu in JAX/Flax with an assortment of optimizers. ☆17 · Updated 2 months ago
- ☆18 · Updated last year
- Source code for the paper "Positional Attention: Expressivity and Learnability of Algorithmic Computation" ☆14 · Updated last week
- ☆33 · Updated 5 months ago
- ☆23 · Updated 2 weeks ago
- This repo is based on https://github.com/jiaweizzhao/GaLore ☆28 · Updated 8 months ago
- JAX Scalify: end-to-end scaled arithmetics ☆16 · Updated 7 months ago
- ☆12 · Updated 3 months ago
- ☆32 · Updated 4 months ago
- gzip Predicts Data-dependent Scaling Laws ☆35 · Updated last year
- Experimental scripts for researching data adaptive learning rate scheduling. ☆23 · Updated last year
- PyTorch implementation for "Long Horizon Temperature Scaling", ICML 2023 ☆20 · Updated 2 years ago
- [Oral; NeurIPS OPT2024] μLO: Compute-Efficient Meta-Generalization of Learned Optimizers ☆12 · Updated 2 months ago
- Tiny re-implementation of MDM in style of LLaDA and nano-gpt speedrun ☆52 · Updated 2 months ago
- Official implementation of "BERTs are Generative In-Context Learners" ☆28 · Updated 2 months ago
- ☆19 · Updated 2 weeks ago
- Universal Neurons in GPT2 Language Models ☆29 · Updated last year
- ☆23 · Updated 5 months ago
- Self-contained PyTorch implementation of a Sinkhorn-based router, for mixture of experts or otherwise ☆35 · Updated 9 months ago
- Remasking Discrete Diffusion Models with Inference-Time Scaling ☆21 · Updated 2 months ago
- Efficient Scaling laws and collaborative pretraining. ☆16 · Updated 4 months ago
- Transformer with Mu-Parameterization, implemented in JAX/Flax. Supports FSDP on TPU pods. ☆30 · Updated last week
- Official Code for Paper: Beyond Matryoshka: Revisiting Sparse Coding for Adaptive Representation ☆67 · Updated last week
- Implementation of Gradient Agreement Filtering, from Chaubard et al. of Stanford, but for single machine microbatches, in PyTorch ☆25 · Updated 4 months ago
- The official implementation of Regularized Policy Gradient (RPG) (https://arxiv.org/abs/2505.17508) ☆27 · Updated last week
- NanoGPT (124M) quality in 2.67B tokens ☆28 · Updated last month