open-compass / CompassJudger
The All-in-one Judge Models introduced by OpenCompass
☆116Updated 6 months ago
Alternatives and similar repositories for CompassJudger
Users that are interested in CompassJudger are comparing it to the libraries listed below
- ☆320Updated last year
- [ICML 2025] Programming Every Example: Lifting Pre-training Data Quality Like Experts at Scale☆264Updated 6 months ago
- Implementation for OAgents: An Empirical Study of Building Effective Agents☆304Updated 3 months ago
- Self-Evolved Diverse Data Sampling for Efficient Instruction Tuning☆86Updated 2 years ago
- ☆180Updated 9 months ago
- [ICML 2025] Satori: Reinforcement Learning with Chain-of-Action-Thought Enhances LLM Reasoning via Autoregressive Search☆108Updated 7 months ago
- [ACL 2024] LLM2LLM: Boosting LLMs with Novel Iterative Data Enhancement☆193Updated last year
- ☆92Updated 8 months ago
- ☆322Updated last year
- (ICCV 2025) OCR Hinders RAG: Evaluating the Cascading Impact of OCR on Retrieval-Augmented Generation☆95Updated last month
- ☆104Updated last year
- [ACL2024] T-Eval: Evaluating Tool Utilization Capability of Large Language Models Step by Step☆304Updated last year
- General Reasoner: Advancing LLM Reasoning Across All Domains [NeurIPS25]☆214Updated 2 months ago
- HelloBench: Evaluating Long Text Generation Capabilities of Large Language Models☆53Updated last year
- Repo of ACL 2025 Paper "Quantification of Large Language Model Distillation"☆93Updated 6 months ago
- Benchmark and research code for the paper SWEET-RL Training Multi-Turn LLM Agents on Collaborative Reasoning Tasks☆260Updated 8 months ago
- [NeurIPS 2024] Spider2-V: How Far Are Multimodal Agents From Automating Data Science and Engineering Workflows?☆136Updated last year
- Reformatted Alignment☆111Updated last year
- Pre-trained, Scalable, High-performance Reward Models via Policy Discriminative Learning.☆164Updated 4 months ago
- WideSearch: Benchmarking Agentic Broad Info-Seeking☆114Updated 3 months ago
- The official implementation of "Ada-LEval: Evaluating long-context LLMs with length-adaptable benchmarks"☆56Updated 8 months ago
- ☆328Updated 8 months ago
- MTU-Bench: A Multi-granularity Tool-Use Benchmark for Large Language Models☆58Updated 6 months ago
- xVerify: Efficient Answer Verifier for Reasoning Model Evaluations☆143Updated 2 months ago
- Scaling Preference Data Curation via Human-AI Synergy☆137Updated 6 months ago
- [ICML 2025] TokenSwift: Lossless Acceleration of Ultra Long Sequence Generation☆120Updated 8 months ago
- MMSearch-R1 is an end-to-end RL framework that enables LMMs to perform on-demand, multi-turn search with real-world multimodal search too…☆384Updated 5 months ago
- [ICLR 2025] A trinity of environments, tools, and benchmarks for general virtual agents☆222Updated 7 months ago
- ☆517Updated last month
- [ACL 2025] Agentic Reward Modeling: Integrating Human Preferences with Verifiable Correctness Signals for Reliable Reward Systems☆124Updated 7 months ago