UCSC-VLAA / ReasoningEval
Official repo of "Knowledge or Reasoning? A Close Look at How LLMs Think Across Domains".
☆40 Updated 3 months ago
Alternatives and similar repositories for ReasoningEval
Users who are interested in ReasoningEval are comparing it to the repositories listed below.
- X-Reasoner: Towards Generalizable Reasoning Across Modalities and Domains ☆48 Updated 4 months ago
- SSRL: Self-Search Reinforcement Learning ☆144 Updated last month
- ☆40 Updated 3 months ago
- MegaScience: Pushing the Frontiers of Post-Training Datasets for Science Reasoning ☆100 Updated 2 months ago
- ☆51 Updated 3 months ago
- Revisiting Mid-training in the Era of Reinforcement Learning Scaling ☆176 Updated 2 months ago
- ☆215 Updated 7 months ago
- ☆48 Updated 7 months ago
- Dynamic Cheatsheet: Test-Time Learning with Adaptive Memory ☆74 Updated 4 months ago
- [ACL 2025] Agentic Reward Modeling: Integrating Human Preferences with Verifiable Correctness Signals for Reliable Reward Systems ☆105 Updated 3 months ago
- ☆94 Updated last month
- Public code repo for paper "SaySelf: Teaching LLMs to Express Confidence with Self-Reflective Rationales" ☆108 Updated 11 months ago
- [ICLR 2025] SuperCorrect: Advancing Small LLM Reasoning with Thought Template Distillation and Self-Correction ☆80 Updated 6 months ago
- General Reasoner: Advancing LLM Reasoning Across All Domains [NeurIPS25] ☆172 Updated 3 months ago
- Process Reward Models That Think ☆51 Updated 2 months ago
- [ICML 2025] Flow of Reasoning: Training LLMs for Divergent Reasoning with Minimal Examples ☆106 Updated 2 months ago
- Repo for "Z1: Efficient Test-time Scaling with Code" ☆64 Updated 5 months ago
- Code for Paper: Autonomous Evaluation and Refinement of Digital Agents [COLM 2024] ☆143 Updated 10 months ago
- [NeurIPS 2025 Spotlight] Scaling Computer-Use Grounding via UI Decomposition and Synthesis ☆110 Updated 3 months ago
- ☆61 Updated 3 months ago
- Codes and datasets for the paper Measuring and Enhancing Trustworthiness of LLMs in RAG through Grounded Attributions and Learning to Ref… ☆66 Updated 6 months ago
- ☆141 Updated last year
- ☆35 Updated 4 months ago
- m1: Unleash the Potential of Test-Time Scaling for Medical Reasoning in Large Language Models ☆42 Updated 5 months ago
- ☆76 Updated last week
- ☆73 Updated 6 months ago
- Resources for our paper: "EvoAgent: Towards Automatic Multi-Agent Generation via Evolutionary Algorithms" ☆128 Updated 11 months ago
- ☆38 Updated 8 months ago
- B-STAR: Monitoring and Balancing Exploration and Exploitation in Self-Taught Reasoners ☆85 Updated 4 months ago
- Code for "Critique Fine-Tuning: Learning to Critique is More Effective than Learning to Imitate" [COLM 2025] ☆172 Updated 2 months ago