open-compass / CIBench
Official Repo of "CIBench: Evaluation of LLMs as Code Interpreter"
☆14Updated last year
Alternatives and similar repositories for CIBench
Users who are interested in CIBench are comparing it to the libraries listed below
- The code and data for the paper JiuZhang3.0☆49Updated last year
 - ☆107Updated 3 months ago
 - Benchmarking Complex Instruction-Following with Multiple Constraints Composition (NeurIPS 2024 Datasets and Benchmarks Track)☆95Updated 8 months ago
 - Implementations of online merging optimizers proposed by Online Merging Optimizers for Boosting Rewards and Mitigating Tax in Alignment☆79Updated last year
 - This is the repo for our paper "Mr-Ben: A Comprehensive Meta-Reasoning Benchmark for Large Language Models"☆50Updated last year
 - ☆17Updated 2 years ago
 - [ICLR'24 spotlight] Tool-Augmented Reward Modeling☆51Updated 4 months ago
 - [ICLR 2025] 🧬 RegMix: Data Mixture as Regression for Language Model Pre-training (Spotlight)☆174Updated 8 months ago
 - [ACL 2024 Findings] MathBench: A Comprehensive Multi-Level Difficulty Mathematics Evaluation Dataset☆107Updated 5 months ago
 - [ICML 2024] Selecting High-Quality Data for Training Language Models☆192Updated last year
 - [2024-ACL]: TextBind: Multi-turn Interleaved Multimodal Instruction-following in the Wild☆46Updated 2 years ago
 - [ACL 2025] We introduce ScaleQuest, a scalable, novel and cost-effective data synthesis method to unleash the reasoning capability of LLM…☆68Updated last year
 - [NeurIPS'24] Weak-to-Strong Search: Align Large Language Models via Searching over Small Language Models☆62Updated 10 months ago
 - [ICLR 2024] CLEX: Continuous Length Extrapolation for Large Language Models☆78Updated last year
 - [ICML'2024] Can AI Assistants Know What They Don't Know?☆83Updated last year
 - [ACL 2024] FollowBench: A Multi-level Fine-grained Constraints Following Benchmark for Large Language Models☆117Updated 4 months ago
 - Towards Systematic Measurement for Long Text Quality☆36Updated last year
 - Implementation of ICML 23 Paper: Specializing Smaller Language Models towards Multi-Step Reasoning.☆131Updated 2 years ago
 - [ACL 2024] MT-Bench-101: A Fine-Grained Benchmark for Evaluating Large Language Models in Multi-Turn Dialogues☆122Updated last year
 - Server GPU monitoring program that sends a notification via WeChat when GPU attributes meet preset conditions☆32Updated 4 years ago
 - Official Repository of MMLONGBENCH-DOC: Benchmarking Long-context Document Understanding with Visualizations☆101Updated last month
 - [EMNLP 2025] Verification Engineering for RL in Instruction Following☆40Updated 3 weeks ago
 - SimpleDeepSearcher: Deep Information Seeking via Web-Powered Reasoning Trajectory Synthesis☆109Updated 5 months ago
 - ☆30Updated 10 months ago
 - [ACL 2024 Findings] CriticBench: Benchmarking LLMs for Critique-Correct Reasoning☆27Updated last year
 - ☆17Updated 11 months ago
 - ☆48Updated last year
 - [AAAI 2025 oral] Evaluating Mathematical Reasoning Beyond Accuracy☆74Updated 3 weeks ago
 - ☆69Updated last year
 - ☆39Updated 3 months ago