open-compass / DevBench
A Comprehensive Benchmark for Software Development.
Related projects:
- A tool-learning benchmark aiming at a good balance between stability and realism, based on ToolBench.
- [ACL 2024] AUTOACT: Automatic Agent Learning from Scratch for QA via Self-Planning
- Official GitHub repo for the paper "Compression Represents Intelligence Linearly"
- ReST-MCTS*: LLM Self-Training via Process Reward Guided Tree Search
- [ACL 2024] LooGLE: Long Context Evaluation for Long-Context Language Models
- ToolkenGPT: Augmenting Frozen Language Models with Massive Tools via Tool Embeddings - NeurIPS 2023 (oral)
- Reformatted Alignment
- [ICML 2023] Data and code release for the paper "DS-1000: A Natural and Reliable Benchmark for Data Science Code Generation"
- ToolEyes: Fine-Grained Evaluation for Tool Learning Capabilities of Large Language Models in Real-world Scenarios
- CodeRAG-Bench: Can Retrieval Augment Code Generation?
- A reading list on LLM-based synthetic data generation 🔥
- The repository for the paper "DebugBench: Evaluating Debugging Capability of Large Language Models"
- Spider2-V: How Far Are Multimodal Agents From Automating Data Science and Engineering Workflows?
- xCodeEval: A Large Scale Multilingual Multitask Benchmark for Code Understanding, Generation, Translation and Retrieval
- [ACL 2024 Findings] MathBench: A Comprehensive Multi-Level Difficulty Mathematics Evaluation Dataset
- Official repo for the ICLR 2024 paper "MINT: Evaluating LLMs in Multi-turn Interaction with Tools and Language Feedback" by Xingyao Wang*, Ziha…
- A simple toolkit for benchmarking LLMs on mathematical reasoning tasks. 🧮✨
- Official implementation of "Dynamic LLM-Agent Network: An LLM-Agent Collaboration Framework with Agent Team Optimization"
- Code and data for "Scaling Relationship on Learning Mathematical Reasoning with Large Language Models"
- Code and data for "Long-context LLMs Struggle with Long In-context Learning"
- Generative Judge for Evaluating Alignment
- [ICLR 2024] Evaluating Large Language Models at Evaluating Instruction Following
- The official implementation of "Ada-LEval: Evaluating long-context LLMs with length-adaptable benchmarks"
- Awesome LLM Plaza: daily tracking of all sorts of awesome LLM topics, e.g. LLMs for coding, robotics, reasoning, multimodality, etc.
- ToolBench, an evaluation suite for LLM tool-manipulation capabilities
- Token-level visualization tools for large language models