Taishi-N324 / Drop-UpcyclingLinks

[ICLR 2025] Drop-Upcycling: Training Sparse Mixture of Experts with Partial Re-initialization

☆19

Alternatives and similar repositories for Drop-Upcycling

Users that are interested in Drop-Upcycling are comparing it to the libraries listed below

Sorting:

VILA-Lab / GBLM-Pruner
Are gradient information useful for pruning of LLMs?
☆47Updated 2 months ago
pixeli99 / MixLN
[ICLR 2025] Official Pytorch Implementation of "Mix-LN: Unleashing the Power of Deeper Layers by Combining Pre-LN and Post-LN" by Pengxia…
☆27Updated 3 months ago
chuanyang-Zheng / DAPE
The this is the official implementation of "DAPE: Data-Adaptive Positional Encoding for Length Extrapolation"
☆39Updated last year
qiuzh20 / gated_attention
The official implementation for [NeurIPS2025 Oral] Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink…
☆105Updated 2 months ago
ldery / Bonsai
Code for "Everybody Prune Now: Structured Pruning of LLMs with only Forward Passes"
☆28Updated last year
kamanphoebe / Look-into-MoEs
[NAACL 2025] A Closer Look into Mixture-of-Experts in Large Language Models
☆55Updated 9 months ago
MikaStars39 / StableMask
PyTorch implementation of StableMask (ICML'24)
☆14Updated last year
Infini-AI-Lab / S2FT
☆19Updated 10 months ago
VijayLingam95 / SVFT
☆33Updated 9 months ago
Arenaa / Accelerated-Generation-Techniques
This repository contains papers for a comprehensive survey on accelerated generation techniques in Large Language Models (LLMs).
☆11Updated last year
abdelfattah-lab / TokenButler
☆26Updated this week
assafbk / DeciMamba
DeciMamba: Exploring the Length Extrapolation Potential of Mamba (ICLR 2025)
☆31Updated 7 months ago
czg1225 / dParallel
dParallel: Learnable Parallel Decoding for dLLMs
☆42Updated last month
thu-ml / ReMoE
[ICLR2025] Codebase for "ReMoE: Fully Differentiable Mixture-of-Experts with ReLU Routing", built on Megatron-LM.
☆98Updated 11 months ago
BaohaoLiao / ApiQ
[EMNLP 2024] Quantize LLM to extremely low-bit, and finetune the quantized LLMs
☆15Updated last year
OpenSparseLLMs / Linearization
☆61Updated 4 months ago
sail-sg / SkyLadder
The official repository for SkyLadder: Better and Faster Pretraining via Context Window Scheduling
☆40Updated last month
TianjinYellow / SPAM-Optimizer
☆35Updated 8 months ago
JieShibo / MoLE
[ICML 2025 Oral] Mixture of Lookup Experts
☆54Updated 6 months ago
locuslab / llava-token-compression
☆44Updated last year
MathGenie / MathGenie
☆13Updated last year
AkideLiu / MiniCache
☆10Updated last year
VITA-Group / Random-MoE-as-Dropout
[ICLR 2023] "Sparse MoE as the New Dropout: Scaling Dense and Self-Slimmable Transformers" by Tianlong Chen*, Zhenyu Zhang*, Ajay Jaiswal…
☆55Updated 2 years ago
joonkeekim / Instructive-Decoding
Official repository of "Distort, Distract, Decode: Instruction-Tuned Model Can Refine its Response from Noisy Instructions", ICLR 2024 Sp…
☆21Updated last year
nasosger / MuToR
[NeurIPS '25] Multi-Token Prediction Needs Registers
☆24Updated 2 months ago
sail-sg / Attention-Sink
[ICLR 2025] When Attention Sink Emerges in Language Models: An Empirical View (Spotlight)
☆135Updated 4 months ago
dmis-lab / Monet
[ICLR 2025] Monet: Mixture of Monosemantic Experts for Transformers
☆73Updated 4 months ago
kuleshov-group / remdm
Remasking Discrete Diffusion Models with Inference-Time Scaling
☆54Updated 8 months ago
OpenSparseLLMs / LLaMA-MoE-v2
🚀 LLaMA-MoE v2: Exploring Sparsity of LLaMA from Perspective of Mixture-of-Experts with Post-Training
☆88Updated 11 months ago
ShiZhengyan / InstructionModelling
[NeurIPS 2024 Main Track] Code for the paper titled "Instruction Tuning With Loss Over Instructions"
☆39Updated last year