open-compass / DevBench
A Comprehensive Benchmark for Software Development.
Related projects:
- A tool-learning benchmark aiming at a good balance between stability and realism, based on ToolBench.
- [ACL 2024] AUTOACT: Automatic Agent Learning from Scratch for QA via Self-Planning
- Official GitHub repo for the paper "Compression Represents Intelligence Linearly"
- ReST-MCTS*: LLM Self-Training via Process Reward Guided Tree Search
- [ACL 2024] LooGLE: Long Context Evaluation for Long-Context Language Models
- ToolkenGPT: Augmenting Frozen Language Models with Massive Tools via Tool Embeddings - NeurIPS 2023 (oral)
- Reformatted Alignment
- [ICML 2023] Data and code release for the paper "DS-1000: A Natural and Reliable Benchmark for Data Science Code Generation"
- ToolEyes: Fine-Grained Evaluation for Tool Learning Capabilities of Large Language Models in Real-world Scenarios
- CodeRAG-Bench: Can Retrieval Augment Code Generation?
- A reading list on LLM-based synthetic data generation 🔥
- The repository for the paper "DebugBench: Evaluating Debugging Capability of Large Language Models"
- Spider2-V: How Far Are Multimodal Agents From Automating Data Science and Engineering Workflows?
- xCodeEval: A Large Scale Multilingual Multitask Benchmark for Code Understanding, Generation, Translation and Retrieval
- [ACL 2024 Findings] MathBench: A Comprehensive Multi-Level Difficulty Mathematics Evaluation Dataset
- Official repo for the ICLR 2024 paper "MINT: Evaluating LLMs in Multi-turn Interaction with Tools and Language Feedback" by Xingyao Wang*, Ziha…
- A simple toolkit for benchmarking LLMs on mathematical reasoning tasks. 🧮✨
- Official implementation of "Dynamic LLM-Agent Network: An LLM-Agent Collaboration Framework with Agent Team Optimization"
- Code and data for "Scaling Relationship on Learning Mathematical Reasoning with Large Language Models"
- Code and data for "Long-context LLMs Struggle with Long In-context Learning"
- Generative Judge for Evaluating Alignment
- [ICLR 2024] Evaluating Large Language Models at Evaluating Instruction Following
- The official implementation of "Ada-LEval: Evaluating long-context LLMs with length-adaptable benchmarks"
- Awesome LLM Plaza: daily tracking of all sorts of awesome LLM topics, e.g. LLMs for coding, robotics, reasoning, multimodality, etc.
- ToolBench, an evaluation suite for LLM tool-manipulation capabilities
- Token-level visualization tools for large language models