Official implementation for DenseMixer: Improving MoE Post-Training with Precise Router Gradient
☆66 · Aug 3, 2025 · Updated 7 months ago
Alternatives and similar repositories for DenseMixer
Users who are interested in DenseMixer are comparing it to the libraries listed below.
- Official implementation for Text Generation Beyond Discrete Token Sampling ☆21 · Aug 11, 2025 · Updated 6 months ago
- BFloat16 Fused Adam Operator for PyTorch ☆16 · Nov 16, 2024 · Updated last year
- The official implementation of HybridNorm: Towards Stable and Efficient Transformer Training via Hybrid Normalization ☆18 · Mar 7, 2025 · Updated 11 months ago
- Does patch ordering affect context-limited vision transformers? ☆17 · Oct 10, 2025 · Updated 4 months ago
- A tiny FP8 multiplication unit written in Verilog. TinyTapeout 2 submission. ☆14 · Nov 23, 2022 · Updated 3 years ago
- Benchmark tests supporting the TiledCUDA library. ☆18 · Nov 19, 2024 · Updated last year
- Transformers components but in Triton ☆34 · May 9, 2025 · Updated 9 months ago
- Official repository for "BLEUBERI: BLEU is a surprisingly effective reward for instruction following" ☆31 · Jun 5, 2025 · Updated 8 months ago
- General Reasoner: Advancing LLM Reasoning Across All Domains [NeurIPS25] ☆221 · Nov 27, 2025 · Updated 3 months ago
- ☆17 · Nov 3, 2024 · Updated last year
- This repository contains code for the MicroAdam paper. ☆21 · Dec 14, 2024 · Updated last year
- Implementation and datasets for "Training Language Models to Generate Quality Code with Program Analysis Feedback" ☆41 · Jul 21, 2025 · Updated 7 months ago
- [NeurIPS 2024] OlympicArena: Benchmarking Multi-discipline Cognitive Reasoning for Superintelligent AI ☆107 · Mar 6, 2025 · Updated 11 months ago
- A collection of multimodal dialogue system papers I have read (with some notes) ☆22 · Jan 5, 2023 · Updated 3 years ago
- [ICLR 2026] PSFT is a trust-region–inspired fine-tuning objective that views SFT as a policy gradient method with constant advantages, co… ☆35 · Sep 9, 2025 · Updated 5 months ago
- Benchmarking Benchmark Leakage in Large Language Models ☆60 · May 20, 2024 · Updated last year
- Parsers for CUDA binary files ☆24 · Dec 29, 2023 · Updated 2 years ago
- ☆129 · Jun 6, 2025 · Updated 8 months ago
- Implementation for FP8/INT8 Rollout for RL training without performance drop. ☆293 · Nov 7, 2025 · Updated 3 months ago
- A simple no-install web UI for Ollama and OAI-Compatible APIs! ☆31 · Jan 30, 2025 · Updated last year
- [EMNLP 2025] Verification Engineering for RL in Instruction Following ☆50 · Jan 5, 2026 · Updated last month
- FlexAttention w/ FlashAttention3 Support ☆27 · Oct 5, 2024 · Updated last year
- Artifacts of EVT ASPLOS'24 ☆29 · Mar 6, 2024 · Updated last year
- Muon fsdp 2 ☆54 · Aug 8, 2025 · Updated 6 months ago
- PyTorch bindings for CUTLASS grouped GEMM. ☆144 · May 29, 2025 · Updated 9 months ago
- [ICML 2025] M-STAR (Multimodal Self-Evolving TrAining for Reasoning) Project. Diving into Self-Evolving Training for Multimodal Reasoning ☆71 · Jul 13, 2025 · Updated 7 months ago
- Segment Policy Optimization: Effective Segment-Level Credit Assignment in RL for Large Language Models ☆45 · Sep 19, 2025 · Updated 5 months ago
- [KernelGYM & Dr. Kernel] A distributed GPU environment and a collection of RL training methods to support RL for Kernel Generations ☆90 · Feb 6, 2026 · Updated 3 weeks ago
- An auxiliary project analysis of the characteristics of KV in DiT Attention. ☆33 · Nov 29, 2024 · Updated last year
- ☆27 · Mar 27, 2025 · Updated 11 months ago
- Sparse Backpropagation for Mixture-of-Expert Training ☆29 · Jul 2, 2024 · Updated last year
- mHC-lite: You Don’t Need 20 Sinkhorn-Knopp Iterations ☆70 · Jan 12, 2026 · Updated last month
- A curated list of awesome resources dedicated to Scaling Laws for LLMs ☆81 · Apr 10, 2023 · Updated 2 years ago
- [WIP] Better (FP8) attention for Hopper ☆32 · Feb 24, 2025 · Updated last year
- [NeurIPS 2023] Sparse Modular Activation for Efficient Sequence Modeling ☆40 · Dec 2, 2023 · Updated 2 years ago
- BeHonest: Benchmarking Honesty in Large Language Models ☆34 · Aug 15, 2024 · Updated last year
- ☆105 · Feb 19, 2026 · Updated last week
- NSA Triton Kernels written with GPT5 and Opus 4.1 ☆70 · Aug 12, 2025 · Updated 6 months ago
- Game-RL: Synthesizing Multimodal Verifiable Game Data to Boost VLMs' General Reasoning ☆135 · Updated this week