kpup1710 / CAMEx
[ICLR 2025] CAMEx: Curvature-Aware Merging of Experts
☆22 · Updated 11 months ago
Alternatives and similar repositories for CAMEx
Users who are interested in CAMEx are comparing it to the libraries listed below.
- LibMoE: A Library for Comprehensive Benchmarking Mixture of Experts in Large Language Models ☆46 · Updated 3 weeks ago
- ☆79 · Updated 11 months ago
- ☆36 · Updated 10 months ago
- Official PyTorch implementation of DistiLLM-2: A Contrastive Approach Boosts the Distillation of LLMs (ICML 2025 Oral) ☆55 · Updated 7 months ago
- One Initialization to Rule them All: Fine-tuning via Explained Variance Adaptation ☆46 · Updated 3 months ago
- Unofficial Implementation of Selective Attention Transformer ☆20 · Updated last year
- ☆21 · Updated last year
- A More Fair and Comprehensive Comparison between KAN and MLP ☆178 · Updated last year
- Survey: A collection of awesome papers and resources on the latest research in Mixture of Experts. ☆141 · Updated last year
- User-friendly implementation of the Mixture-of-Sparse-Attention (MoSA). MoSA selects distinct tokens for each head with expert choice routing. ☆28 · Updated 8 months ago
- One-stop solutions for Mixture of Experts and Mixture of Depth modules in PyTorch. ☆26 · Updated 8 months ago
- ☆152 · Updated last year
- Conference schedule, top papers, and analysis of the data for NeurIPS 2023! ☆120 · Updated 2 years ago
- The repository for the HiRA paper ☆36 · Updated 3 weeks ago
- Official PyTorch implementation of "Vision-Language Models Create Cross-Modal Task Representations" (ICML 2025) ☆31 · Updated 9 months ago
- A regression-like loss to improve numerical reasoning in language models (ICML 2025) ☆27 · Updated 5 months ago
- [ICLR 2025] Large (Vision) Language Models are Unsupervised In-Context Learners ☆22 · Updated 7 months ago
- ☆191 · Updated last year
- Official PyTorch Implementation of "The Hidden Attention of Mamba Models" ☆231 · Updated 3 months ago
- LoRA-One: One-Step Full Gradient Could Suffice for Fine-Tuning Large Language Models, Provably and Efficiently (ICML 2025 Oral) ☆28 · Updated 3 months ago
- The official implementation of "DAPE: Data-Adaptive Positional Encoding for Length Extrapolation" ☆40 · Updated last year
- MambaFormer in-context learning experiments and implementation for https://arxiv.org/abs/2402.04248 ☆58 · Updated last year
- Defeating the Training-Inference Mismatch via FP16 ☆180 · Updated 2 months ago
- ☆19 · Updated 10 months ago
- Code for NeurIPS 2024 Spotlight: "Scaling Laws and Compute-Optimal Training Beyond Fixed Training Durations"