UNITES-Lab / C2R-MoE
[NAACL'25 SAC Award] Official code for "Advancing MoE Efficiency: A Collaboration-Constrained Routing (C2R) Strategy for Better Expert Parallelism Design"
☆10 · Updated 7 months ago
Alternatives and similar repositories for C2R-MoE
Users interested in C2R-MoE are comparing it to the libraries listed below.
- (ACL 2025 oral) SCOPE: Optimizing KV Cache Compression in Long-context Generation · ☆34 · Updated 3 months ago
- [ICML24] Pruner-Zero: Evolving Symbolic Pruning Metric from scratch for LLMs · ☆92 · Updated 10 months ago
- SQUEEZED ATTENTION: Accelerating Long Prompt LLM Inference · ☆53 · Updated 10 months ago
- Implementation for the paper: CMoE: Fast Carving of Mixture-of-Experts for Efficient LLM Inference · ☆25 · Updated 6 months ago
- [ICML 2024] When Linear Attention Meets Autoregressive Decoding: Towards More Effective and Efficient Linearized Large Language Models · ☆33 · Updated last year
- [ICLR 2024] This is the official PyTorch implementation of "QLLM: Accurate and Efficient Low-Bitwidth Quantization for Large Language Mod… · ☆30 · Updated last year
- This repo contains the source code for: Model Tells You What to Discard: Adaptive KV Cache Compression for LLMs · ☆39 · Updated last year
- LongSpec: Long-Context Lossless Speculative Decoding with Efficient Drafting and Verification · ☆63 · Updated 2 months ago
- [ICML 2024] SPP: Sparsity-Preserved Parameter-Efficient Fine-Tuning for Large Language Models · ☆21 · Updated last year
- Accelerate LLM preference tuning via prefix sharing with a single line of code · ☆43 · Updated 2 months ago
- The official implementation of the paper: SimLayerKV: A Simple Framework for Layer-Level KV Cache Reduction · ☆49 · Updated 11 months ago
- Official Implementation of FastKV: KV Cache Compression for Fast Long-Context Processing with Token-Selective Propagation · ☆22 · Updated 4 months ago
- PyTorch implementation of our paper accepted by ICML 2024 -- CaM: Cache Merging for Memory-efficient LLMs Inference · ☆45 · Updated last year
- [ICML 2025] SparseLoRA: Accelerating LLM Fine-Tuning with Contextual Sparsity · ☆56 · Updated 2 months ago
- Official implementation of the paper: "A deeper look at depth pruning of LLMs" · ☆15 · Updated last year
- Beyond KV Caching: Shared Attention for Efficient LLMs · ☆19 · Updated last year
- Source code for the paper "LongGenBench: Long-context Generation Benchmark" · ☆23 · Updated 11 months ago
- Quantized Attention on GPU · ☆44 · Updated 10 months ago
- The open-source materials for the paper "Sparsing Law: Towards Large Language Models with Greater Activation Sparsity" · ☆25 · Updated 10 months ago
- [ICLR 2025] Linear Combination of Saved Checkpoints Makes Consistency and Diffusion Models Better · ☆16 · Updated 7 months ago
- Is gradient information useful for pruning of LLMs? · ☆46 · Updated last month
- [ICML 2024 Oral] This project is the official implementation of our Accurate LoRA-Finetuning Quantization of LLMs via Information Retenti… · ☆67 · Updated last year