olly-styles / WorkBench
WorkBench: a Benchmark Dataset for Agents in a Realistic Workplace Setting.
☆41 Updated 10 months ago
Alternatives and similar repositories for WorkBench
Users who are interested in WorkBench are comparing it to the repositories listed below.
- Codebase accompanying the Summary of a Haystack paper. ☆78 Updated 8 months ago
- Code repo for "Agent Instructs Large Language Models to be General Zero-Shot Reasoners" ☆110 Updated 8 months ago
- Mixing Language Models with Self-Verification and Meta-Verification ☆104 Updated 5 months ago
- ☆120 Updated 8 months ago
- Implementation of the paper: "AssistantBench: Can Web Agents Solve Realistic and Time-Consuming Tasks?" ☆56 Updated 5 months ago
- Evaluating LLMs with CommonGen-Lite ☆90 Updated last year
- A repository for transformer critique learning and generation ☆90 Updated last year
- Google DeepMind's PromptBreeder for automated prompt engineering, implemented in LangChain Expression Language. ☆116 Updated 10 months ago
- Doing simple retrieval from LLMs at various context lengths to measure accuracy ☆99 Updated last year
- ☆38 Updated 10 months ago
- [Data + code] ExpertQA: Expert-Curated Questions and Attributed Answers ☆128 Updated last year
- A set of utilities for running few-shot prompting experiments on large language models ☆121 Updated last year
- ☆27 Updated this week
- Simple replication of [ColBERT-v1](https://arxiv.org/abs/2004.12832). ☆80 Updated last year
- Complex Function Calling Benchmark. ☆112 Updated 4 months ago
- Just a bunch of benchmark logs for different LLMs ☆119 Updated 10 months ago
- Scalable Meta-Evaluation of LLMs as Evaluators ☆42 Updated last year
- Meta-CoT: Generalizable Chain-of-Thought Prompting in Mixed-task Scenarios with Large Language Models ☆96 Updated last year
- Evaluating LLMs with fewer examples ☆155 Updated last year
- ☆114 Updated 3 months ago
- Source code for the collaborative reasoner research project at Meta FAIR. ☆87 Updated last month
- LOFT: A 1 Million+ Token Long-Context Benchmark ☆198 Updated last month
- ☆75 Updated 4 months ago
- Functional Benchmarks and the Reasoning Gap ☆86 Updated 8 months ago
- Source code for our paper: "SelfGoal: Your Language Agents Already Know How to Achieve High-level Goals". ☆67 Updated 11 months ago
- A DSPy-based implementation of the tree of thoughts method (Yao et al., 2023) for generating persuasive arguments ☆81 Updated 8 months ago
- ☆40 Updated 10 months ago
- Flacuna was developed by fine-tuning Vicuna on Flan-mini, a comprehensive instruction collection encompassing various tasks. Vicuna is al… ☆111 Updated last year
- Verifiers for LLM Reinforcement Learning ☆56 Updated last month
- Code for ExploreToM ☆83 Updated 5 months ago