Caiyun-AI / MUDDFormer
☆63 · Updated 2 months ago
Alternatives and similar repositories for MUDDFormer
Users interested in MUDDFormer are comparing it to the libraries listed below
- [ICML 2025] Fourier Position Embedding: Enhancing Attention’s Periodic Extension for Length Generalization ☆75 · Updated last month
- [COLM 2025] LoRI: Reducing Cross-Task Interference in Multi-Task Low-Rank Adaptation ☆136 · Updated last week
- ☆91 · Updated last month
- ☆211 · Updated 5 months ago
- Parameter-Efficient Fine-Tuning for Foundation Models ☆73 · Updated 3 months ago
- DeepSeek Native Sparse Attention PyTorch implementation ☆73 · Updated 4 months ago
- Lightning Attention-2: A Free Lunch for Handling Unlimited Sequence Lengths in Large Language Models ☆322 · Updated 4 months ago
- A generalized framework for subspace tuning methods in parameter-efficient fine-tuning ☆147 · Updated 2 weeks ago
- ZO2 (Zeroth-Order Offloading): Full Parameter Fine-Tuning 175B LLMs with 18GB GPU Memory ☆148 · Updated 3 weeks ago
- Implementation of Switch Transformers from the paper: "Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficien… ☆110 · Updated 3 months ago
- ☆147 · Updated 10 months ago
- Ring is a reasoning MoE LLM provided and open-sourced by InclusionAI, derived from Ling. ☆85 · Updated 3 weeks ago
- [ACL 2025] An official PyTorch implementation of the paper: Condor: Enhance LLM Alignment with Knowledge-Driven Data Synthesis and Refinement ☆30 · Updated last month
- qwen-nsa ☆68 · Updated 3 months ago
- tinybig for deep function learning ☆61 · Updated last month
- Scaling Preference Data Curation via Human-AI Synergy ☆69 · Updated last week
- ☆195 · Updated last year
- ☆64 · Updated last month
- [NeurIPS 2024] A Novel Rank-Based Metric for Evaluating Large Language Models ☆49 · Updated last month
- ☆77 · Updated 3 months ago
- CPPO: Accelerating the Training of Group Relative Policy Optimization-Based Reasoning Models ☆143 · Updated last month
- [ICML 2025] TokenSwift: Lossless Acceleration of Ultra Long Sequence Generation ☆110 · Updated last month
- Towards Economical Inference: Enabling DeepSeek's Multi-Head Latent Attention in Any Transformer-based LLMs ☆178 · Updated 3 weeks ago
- ☆108 · Updated last year
- Efficient Mixture of Experts for LLM Paper List ☆79 · Updated 7 months ago
- ☆202 · Updated 8 months ago
- Implementation of the paper: "Mixture-of-Depths: Dynamically allocating compute in transformer-based language models" ☆99 · Updated last week
- A repository for DenseSSMs ☆87 · Updated last year
- TransMLA: Multi-Head Latent Attention Is All You Need ☆327 · Updated last week
- [ICML'24] The official implementation of “Rethinking Optimization and Architecture for Tiny Language Models” ☆121 · Updated 6 months ago