open-compass / CriticBench
A comprehensive benchmark for evaluating critique ability of LLMs
Related projects:
- This repo contains evaluation code for the paper "MileBench: Benchmarking MLLMs in Long Context"
- This is the repo for our paper "Mr-Ben: A Comprehensive Meta-Reasoning Benchmark for Large Language Models"
- This is the implementation of LeCo
- Source code of "Reasons to Reject? Aligning Language Models with Judgments"
- Evaluating Mathematical Reasoning Beyond Accuracy
- Official repository for the paper "Weak-to-Strong Extrapolation Expedites Alignment"
- Paper list and datasets for the paper "A Survey on Data Selection for LLM Instruction Tuning"
- This is the official repo of "QuickLLaMA: Query-aware Inference Acceleration for Large Language Models"
- [ACL 2024] MT-Bench-101: A Fine-Grained Benchmark for Evaluating Large Language Models in Multi-Turn Dialogues
- GPT as Human
- Feeling confused about superalignment? Here is a reading list
- Official implementation of the paper "From Complex to Simple: Enhancing Multi-Constraint Complex Instruction Following Ability of Large L…"
- Code for "Math-LLaVA: Bootstrapping Mathematical Reasoning for Multimodal Large Language Models"
- [ICML 2024] Can AI Assistants Know What They Don't Know?
- Watch Every Step! LLM Agent Learning via Iterative Step-level Process Refinement
- Source code for MMEvalPro, a more trustworthy and efficient benchmark for evaluating LMMs
- [ICLR'24 Spotlight] "Adaptive Chameleon or Stubborn Sloth: Revealing the Behavior of Large Language Models in Knowledge Conflicts"
- [ICLR'24 Spotlight] Tool-Augmented Reward Modeling
- ToolEyes: Fine-Grained Evaluation for Tool Learning Capabilities of Large Language Models in Real-world Scenarios
- [ACL 2024 Findings] MathBench: A Comprehensive Multi-Level Difficulty Mathematics Evaluation Dataset
- CharXiv: Charting Gaps in Realistic Chart Understanding in Multimodal LLMs
- [ICML 2024] Selecting High-Quality Data for Training Language Models
- [arXiv:2406.17419] Leave No Document Behind: Benchmarking Long-Context LLMs with Extended Multi-Doc QA
- The official implementation of "Ada-LEval: Evaluating long-context LLMs with length-adaptable benchmarks"
- [ICLR 2024] CLEX: Continuous Length Extrapolation for Large Language Models
- The source code of "Merging Experts into One: Improving Computational Efficiency of Mixture of Experts" (EMNLP 2023)