Zcchill / Value-Residual-Learning
☆14 Updated 10 months ago
Alternatives and similar repositories for Value-Residual-Learning
Users that are interested in Value-Residual-Learning are comparing it to the libraries listed below
- [ICML 2025] Code for "R2-T2: Re-Routing in Test-Time for Multimodal Mixture-of-Experts" ☆17 Updated 10 months ago
- ☆40 Updated 4 months ago
- Official implementation of the paper "You Do Not Fully Utilize Transformer's Representation Capacity" ☆31 Updated 8 months ago
- Resa: Transparent Reasoning Models via SAEs ☆47 Updated 4 months ago
- CS194-196 Course Project ☆14 Updated 11 months ago
- This is an implementation of the paper "Are We Done with Object-Centric Learning?" ☆12 Updated 4 months ago
- Learning to Skip the Middle Layers of Transformers ☆17 Updated 5 months ago
- ☆16 Updated last year
- ☆19 Updated 7 months ago
- Unofficial Implementation of Selective Attention Transformer ☆20 Updated last year
- HGRN2: Gated Linear RNNs with State Expansion ☆56 Updated last year
- This is a simple torch implementation of the high performance Multi-Query Attention ☆16 Updated 2 years ago
- [ICML'25] "Rethinking Addressing in Language Models via Contextualized Equivariant Positional Encoding" by Jiajun Zhu, Peihao Wang, Ruisi… ☆14 Updated 7 months ago
- The official repo of continuous speculative decoding ☆31 Updated 10 months ago
- Official implementation of ECCV24 paper: POA ☆24 Updated last year
- Official PyTorch Implementation for Vision-Language Models Create Cross-Modal Task Representations, ICML 2025 ☆31 Updated 9 months ago
- The official implementation of ICLR 2025 paper "Polynomial Composition Activations: Unleashing the Dynamics of Large Language Models". ☆17 Updated 9 months ago
- Official code for the paper "Attention as a Hypernetwork" ☆47 Updated last year
- The official implementation of HybridNorm: Towards Stable and Efficient Transformer Training via Hybrid Normalization ☆18 Updated 10 months ago
- ☆20 Updated 3 months ago
- ☆20 Updated 2 months ago
- ☆19 Updated last year
- The official repository for our paper "The Dual Form of Neural Networks Revisited: Connecting Test Time Predictions to Training Patterns … ☆16 Updated 7 months ago
- User-friendly implementation of the Mixture-of-Sparse-Attention (MoSA). MoSA selects distinct tokens for each head with expert choice rou… ☆28 Updated 8 months ago
- [COLM 2025] "C3PO: Critical-Layer, Core-Expert, Collaborative Pathway Optimization for Test-Time Expert Re-Mixing" ☆19 Updated 9 months ago
- CatMAE ☆14 Updated 2 years ago
- Code for NOLA, an implementation of "nola: Compressing LoRA using Linear Combination of Random Basis" ☆57 Updated last year
- Measuring the Signal to Noise Ratio in Language Model Evaluation ☆28 Updated 5 months ago
- Multi-Agent Verification: Scaling Test-Time Compute with Multiple Verifiers ☆26 Updated 11 months ago
- ☆46 Updated last year