JinjieNi / dlms-are-super-data-learners

The official GitHub repo for "Diffusion Language Models are Super Data Learners".

☆200

Alternatives and similar repositories for dlms-are-super-data-learners

Users interested in dlms-are-super-data-learners are comparing it to the repositories listed below.

wmn-231314 / diffusion-data-constraint
Official PyTorch implementation and models for paper "Diffusion Beats Autoregressive in Data-Constrained Settings". We find diffusion mod…
☆108 Updated 3 weeks ago
Gen-Verse / dLLM-RL
TraceRL & TraDo-8B: Revolutionizing Reinforcement Learning Framework for Diffusion Large Language Models
☆317 Updated this week
s-sahoo / Eso-LMs
Esoteric Language Models
☆106 Updated last month
g-luo / vlm_cross_modal_reps
Official PyTorch Implementation for Vision-Language Models Create Cross-Modal Task Representations, ICML 2025
☆31 Updated 6 months ago
ChenWu98 / algorithmic-creativity
[ICML 2025] Roll the dice & look before you leap: Going beyond the creative limits of next-token prediction
☆75 Updated 5 months ago
HKUNLP / DiffuLLaMA
[ICLR2025] DiffuGPT and DiffuLLaMA: Scaling Diffusion Language Models via Adaptation from Autoregressive Models
☆334 Updated 5 months ago
zhixuan-lin / forgetting-transformer
[ICLR 2025 & COLM 2025] Official PyTorch implementation of the Forgetting Transformer and Adaptive Computation Pruning
☆133 Updated 2 weeks ago
fal-ai-community / nano-mdm
Tiny re-implementation of MDM in style of LLaDA and nano-gpt speedrun
☆57 Updated 8 months ago
facebookresearch / Mixture-of-Transformers
Mixture-of-Transformers: A Sparse and Scalable Architecture for Multi-Modal Foundation Models. TMLR 2025.
☆122 Updated 2 months ago
RobertCsordas / moeut
☆88 Updated last year
horseee / dKV-Cache
[NeurIPS'25] dKV-Cache: The Cache for Diffusion Language Models
☆117 Updated 6 months ago
lucidrains / coconut-pytorch
Implementation of 🥥 Coconut, Chain of Continuous Thought, in Pytorch
☆180 Updated 5 months ago
kuleshov-group / remdm
Remasking Discrete Diffusion Models with Inference-Time Scaling
☆54 Updated 8 months ago
Infini-AI-Lab / Multiverse
☆103 Updated 2 months ago
sail-sg / Precision-RL
Defeating the Training-Inference Mismatch via FP16
☆149 Updated last week
tilde-research / nsa-impl
An efficient implementation of the NSA (Native Sparse Attention) kernel
☆124 Updated 4 months ago
goombalab / phi-mamba
Official implementation of Phi-Mamba. A MOHAWK-distilled model (Transformers to SSMs: Distilling Quadratic Knowledge to Subquadratic Mode…
☆116 Updated last year
callsys / GMPO
Geometric-Mean Policy Optimization
☆92 Updated this week
facebookresearch / PhysicsLM4
Physics of Language Models, Part 4
☆260 Updated 3 months ago
HanGuo97 / log-linear-attention
☆254 Updated 5 months ago
thu-ml / ReMoE
[ICLR2025] Codebase for "ReMoE: Fully Differentiable Mixture-of-Experts with ReLU Routing", built on Megatron-LM.
☆98 Updated 11 months ago
sustcsonglin / linear-attention-and-beyond-slides
☆95 Updated 8 months ago
complex-reasoning / RPG
Official implementation of Regularized Policy Gradient (RPG) (https://arxiv.org/abs/2505.17508)
☆54 Updated last month
jxiw / MambaInLlama
[NeurIPS 2024] Official Repository of The Mamba in the Llama: Distilling and Accelerating Hybrid Models
☆231 Updated last month
jzhang38 / LongMamba
Some preliminary explorations of Mamba's context scaling.
☆216 Updated last year
Sphere-AI-Lab / fda
Model Merging with Functional Dual Anchors
☆33 Updated 3 weeks ago
amorehead / jvp_flash_attention
Flash Attention Triton kernel with support for second-order derivatives
☆112 Updated last month
aakaran / reasoning-with-sampling
☆317 Updated 2 weeks ago
JinjieNi / MegaDLMs
GPU-optimized framework for training diffusion language models at any scale. The backend of Quokka, Super Data Learners, and OpenMoE 2 tr…
☆272 Updated last week
sail-sg / SkyLadder
The official repository for SkyLadder: Better and Faster Pretraining via Context Window Scheduling
☆40 Updated last month