aymeric-roucher / agent_reasoning_benchmark
Compare how Agent systems perform on several benchmarks.
☆41 · Updated 2 months ago
Related projects:
- ☆90 · Updated last month
- Source code for our paper: "SelfGoal: Your Language Agents Already Know How to Achieve High-level Goals". ☆62 · Updated 2 months ago
- AWM: Agent Workflow Memory ☆121 · Updated this week
- Evaluating LLMs with CommonGen-Lite ☆83 · Updated 5 months ago
- Code repo for "Agent Instructs Large Language Models to be General Zero-Shot Reasoners" ☆68 · Updated last week
- Codebase accompanying the Summary of a Haystack paper. ☆65 · Updated 2 months ago
- Beating the GAIA benchmark with Transformers Agents. ☆56 · Updated 2 weeks ago
- ☆130 · Updated last week
- Attribute (or cite) statements generated by LLMs back to in-context information. ☆107 · Updated 2 weeks ago
- ☆85 · Updated 7 months ago
- ☆82 · Updated 3 weeks ago
- Meta-CoT: Generalizable Chain-of-Thought Prompting in Mixed-task Scenarios with Large Language Models ☆84 · Updated 11 months ago
- RAGElo is a set of tools that helps you select the best RAG-based LLM agents by using an Elo ranker ☆101 · Updated last week
- ☆75 · Updated 3 weeks ago
- Repository for "AppWorld: A Controllable World of Apps and People for Benchmarking Interactive Coding Agent", ACL'24 Best Resource Pap… ☆81 · Updated last month
- Resources for our paper: "EvoAgent: Towards Automatic Multi-Agent Generation via Evolutionary Algorithms" ☆73 · Updated 2 months ago
- ☆74 · Updated 9 months ago
- Automating enterprise workflows with multimodal agents ☆83 · Updated last month
- Formal-LLM: Integrating Formal Language and Natural Language for Controllable LLM-based Agents ☆102 · Updated 3 months ago
- Just a bunch of benchmark logs for different LLMs ☆112 · Updated last month
- ARAGOG- Advanced RAG Output Grading. Exploring and comparing various Retrieval-Augmented Generation (RAG) techniques on AI research paper… ☆91 · Updated 5 months ago
- CRAB: Cross-environment Agent Benchmark for Multimodal Language Model Agents. https://crab.camel-ai.org/ ☆167 · Updated this week
- Testing speed and accuracy of RAG with, and without Cross Encoder Reranker. ☆45 · Updated 8 months ago
- WebLINX is a benchmark for building web navigation agents with conversational capabilities ☆111 · Updated 2 months ago
- ☆105 · Updated this week
- Official repo for the paper PHUDGE: Phi-3 as Scalable Judge. Evaluate your LLMs with or without custom rubric, reference answer, absolute… ☆48 · Updated 2 months ago
- This repository implements the chain of verification paper by Meta AI ☆151 · Updated 11 months ago
- Official Implementation of "Multi-Head RAG: Solving Multi-Aspect Problems with LLMs" ☆155 · Updated 2 months ago
- Doing simple retrieval from LLM models at various context lengths to measure accuracy ☆93 · Updated 5 months ago
- Code for the paper "Learn Beyond The Answer: Training Language Models with Reflection for Mathematical Reasoning" ☆30 · Updated 3 months ago