EQ-bench / eqbench3
☆48 Updated 5 months ago
Alternatives and similar repositories for eqbench3
Users who are interested in eqbench3 are comparing it to the repositories listed below.
- Verifiers for LLM Reinforcement Learning ☆80 Updated 9 months ago
- Accompanying material for sleep-time compute paper ☆119 Updated 9 months ago
- ☆131 Updated 9 months ago
- Systematic evaluation framework that automatically rates overthinking behavior in large language models. ☆96 Updated 8 months ago
- GPT-4 Level Conversational QA Trained In a Few Hours ☆65 Updated last year
- Multi-Granularity LLM Debugger [ICSE2026] ☆96 Updated 7 months ago
- [EMNLP 2025] The official implementation for paper "Agentic-R1: Distilled Dual-Strategy Reasoning" ☆102 Updated 5 months ago
- ☆63 Updated last year
- [ACL 2025] Agentic Knowledgeable Self-awareness ☆91 Updated 7 months ago
- Implementation of the paper: "AssistantBench: Can Web Agents Solve Realistic and Time-Consuming Tasks?" ☆69 Updated last year
- Analysis code for NeurIPS 2025 paper "SciArena: An Open Evaluation Platform for Foundation Models in Scientific Literature Tasks" ☆56 Updated 6 months ago
- ☆93 Updated 8 months ago
- ☆132 Updated 8 months ago
- Public Goods Game (PGG) Benchmark: Contribute & Punish is a multi-agent benchmark that tests cooperative and self-interested strategies a… ☆39 Updated 10 months ago
- Code for the paper: CodeTree: Agent-guided Tree Search for Code Generation with Large Language Models ☆31 Updated 10 months ago
- ☆102 Updated last month
- ☆39 Updated last year
- LLMs as Collaboratively Edited Knowledge Bases ☆46 Updated last year
- Data preparation code for CrystalCoder 7B LLM ☆45 Updated last year
- Data Synthesis for Deep Research Based on Semi-Structured Data ☆198 Updated last month
- ☆34 Updated last year
- LLM reads a paper and produces a working prototype ☆60 Updated 9 months ago
- Code Implementation, Evaluations, Documentation, Links and Resources for Min P paper ☆46 Updated 5 months ago
- ☆96 Updated last year
- Evaluating tool-augmented LLMs in conversation settings ☆88 Updated last year
- Training setup for Langchain's Open Deep Research ☆75 Updated 5 months ago
- ☆67 Updated 10 months ago
- Nexusflow function call, tool use, and agent benchmarks. ☆30 Updated last year
- Open Implementations of LLM Analyses ☆107 Updated last year
- 🔧 Compare how Agent systems perform on several benchmarks. 📊🚀 ☆103 Updated 6 months ago