kpup1710 / CAMEx
[ICLR 2025] CAMEx: Curvature-Aware Merging of Experts
☆22 · Updated 6 months ago
Alternatives and similar repositories for CAMEx
Users interested in CAMEx are comparing it to the libraries listed below.
- LibMoE: A Library for Comprehensive Benchmarking Mixture of Experts in Large Language Models ☆40 · Updated 2 months ago
- ☆72 · Updated 6 months ago
- Official code for the paper "Beyond Matryoshka: Revisiting Sparse Coding for Adaptive Representation" ☆121 · Updated 2 months ago
- One Initialization to Rule them All: Fine-tuning via Explained Variance Adaptation ☆41 · Updated 10 months ago
- ☆148 · Updated 11 months ago
- ☆21 · Updated 11 months ago
- PyTorch implementation of the paper "ViP: A Differentially Private Foundation Model for Computer Vision" ☆36 · Updated 2 years ago
- ☆23 · Updated 7 months ago
- ☆182 · Updated 11 months ago
- Unofficial implementation of the Selective Attention Transformer ☆17 · Updated 10 months ago
- PyTorch implementation of the PEER block from the paper "Mixture of A Million Experts" by Xu Owen He at DeepMind ☆127 · Updated last year
- One-stop solutions for Mixture of Experts and Mixture of Depth modules in PyTorch. ☆24 · Updated 3 months ago
- Official PyTorch implementation and models for the paper "Diffusion Beats Autoregressive in Data-Constrained Settings". We find diffusion mod… ☆86 · Updated this week
- We study toy models of skill learning. ☆30 · Updated 7 months ago
- [NeurIPS 2024] 📈 Scaling Laws with Vocabulary: Larger Models Deserve Larger Vocabularies (https://arxiv.org/abs/2407.13623) ☆86 · Updated 11 months ago
- Official code for the ICML 2024 paper "The Entropy Enigma: Success and Failure of Entropy Minimization" ☆53 · Updated last year
- A More Fair and Comprehensive Comparison between KAN and MLP ☆172 · Updated last year
- Conference schedule, top papers, and analysis of the data for NeurIPS 2023! ☆120 · Updated last year
- Official code repository for the paper "Continuous Diffusion Model for Language Modeling" ☆40 · Updated 5 months ago
- Official implementation of Phi-Mamba, a MOHAWK-distilled model (Transformers to SSMs: Distilling Quadratic Knowledge to Subquadratic Mode… ☆114 · Updated 11 months ago
- Official implementation of "Equivariant Architectures for Learning in Deep Weight Spaces" [ICML 2023] ☆89 · Updated 2 years ago
- Official PyTorch implementation of DistiLLM-2: A Contrastive Approach Boosts the Distillation of LLMs (ICML 2025 Oral) ☆37 · Updated 2 months ago
- [ICLR 2025] When Attention Sink Emerges in Language Models: An Empirical View (Spotlight) ☆118 · Updated last month
- ☆30 · Updated last year
- User-friendly implementation of Mixture-of-Sparse-Attention (MoSA). MoSA selects distinct tokens for each head with expert-choice rou… ☆26 · Updated 3 months ago
- Official implementation of "DoRA: Weight-Decomposed Low-Rank Adaptation" ☆124 · Updated last year
- MambaFormer in-context learning experiments and implementation for https://arxiv.org/abs/2402.04248 ☆56 · Updated last year
- PyTorch implementation of Soft MoE by Google Brain in "From Sparse to Soft Mixtures of Experts" (https://arxiv.org/pdf/2308.00951.pdf) ☆77 · Updated last year
- Implementation of the paper "Mixture-of-Depths: Dynamically allocating compute in transformer-based language models" ☆104 · Updated last week
- Some personal experiments around routing tokens to different autoregressive attention, akin to mixture-of-experts ☆120 · Updated 10 months ago