zwhong714 / weak-to-strong-preference-optimization
[ICLR 2025 Spotlight] Weak-to-strong preference optimization: stealing reward from weak aligned model
☆16Updated 11 months ago
Alternatives and similar repositories for weak-to-strong-preference-optimization
Users who are interested in weak-to-strong-preference-optimization are comparing it to the repositories listed below
- CoT-Valve: Length-Compressible Chain-of-Thought Tuning☆89Updated 11 months ago
- The official repository of NeurIPS'25 paper "Ada-R1: From Long-Cot to Hybrid-CoT via Bi-Level Adaptive Reasoning Optimization"☆21Updated 2 months ago
- ☆15Updated last year
- Code for "CREAM: Consistency Regularized Self-Rewarding Language Models", ICLR 2025.☆28Updated 11 months ago
- ☆33Updated 2 months ago
- The repository of the paper "REEF: Representation Encoding Fingerprints for Large Language Models," aims to protect the IP of open-source…☆73Updated last year
- Code for "Efficient Test-Time Scaling via Self-Calibration"☆19Updated 4 months ago
- Optimizing Anytime Reasoning via Budget Relative Policy Optimization☆51Updated 6 months ago
- [EMNLP 2025] LightThinker: Thinking Step-by-Step Compression☆131Updated 9 months ago
- Resources and paper list for 'Scaling Environments for Agents'. This repository accompanies our survey on how environments contribute to …☆57Updated this week
- ☆204Updated last month
- Official repository of the video reasoning benchmark MMR-V. Can Your MLLMs "Think with Video"?☆38Updated 7 months ago
- ☆20Updated 9 months ago
- ☆26Updated 5 months ago
- [NeurIPS 2025] Think Silently, Think Fast: Dynamic Latent Compression of LLM Reasoning Chains☆71Updated 6 months ago
- ☆34Updated 8 months ago
- [NeurIPS25 Spotlight] EMPO, A Fully Unsupervised RLVR Method☆93Updated 2 months ago
- JudgeLRM: Large Reasoning Models as a Judge☆40Updated last month
- [NeurIPS'25 Spotlight] ARM: Adaptive Reasoning Model☆64Updated 3 months ago
- [ICLR 25 Oral] RM-Bench: Benchmarking Reward Models of Language Models with Subtlety and Style☆73Updated 6 months ago
- Documentation at☆14Updated 10 months ago
- [ICLR 2025] Code and Data Repo for Paper "Latent Space Chain-of-Embedding Enables Output-free LLM Self-Evaluation"☆93Updated last year
- [ICLR 2025] When Attention Sink Emerges in Language Models: An Empirical View (Spotlight)☆152Updated 6 months ago
- ☆17Updated 6 months ago
- ☆23Updated 11 months ago
- [ICML'25] Official code of paper "Fast Large Language Model Collaborative Decoding via Speculation"☆28Updated 7 months ago
- This is the official implementation of the paper "S²R: Teaching LLMs to Self-verify and Self-correct via Reinforcement Learning"☆73Updated 9 months ago
- ☆45Updated last month
- ☆19Updated 7 months ago
- ☆25Updated 9 months ago