olly-styles / WorkBench

WorkBench: a Benchmark Dataset for Agents in a Realistic Workplace Setting.

☆41

Alternatives and similar repositories for WorkBench

Users who are interested in WorkBench are comparing it to the repositories listed below.

salesforce / summary-of-a-haystack
Codebase accompanying the Summary of a Haystack paper.
☆78 Updated 8 months ago
wang-research-lab / agentinstruct
Code repo for "Agent Instructs Large Language Models to be General Zero-Shot Reasoners"
☆110 Updated 8 months ago
automix-llm / automix
Mixing Language Models with Self-Verification and Meta-Verification
☆104 Updated 5 months ago
SALT-NLP / demonstrated-feedback
☆120 Updated 8 months ago
oriyor / assistantbench
Implementation of the paper: "AssistantBench: Can Web Agents Solve Realistic and Time-Consuming Tasks?"
☆56 Updated 5 months ago
allenai / CommonGen-Eval
Evaluating LLMs with CommonGen-Lite
☆90 Updated last year
CarperAI / autocrit
A repository for transformer critique learning and generation
☆90 Updated last year
vaughanlove / PromptBreeder
Google DeepMind's PromptBreeder for automated prompt engineering, implemented in LangChain Expression Language.
☆116 Updated 10 months ago
Arize-ai / LLMTest_NeedleInAHaystack
Doing simple retrieval from LLM models at various context lengths to measure accuracy
☆99 Updated last year
patronus-ai / Lynx-hallucination-detection
☆38 Updated 10 months ago
chaitanyamalaviya / ExpertQA
[Data + code] ExpertQA: Expert-Curated Questions and Attributed Answers
☆128 Updated last year
reasoning-machines / prompt-lib
A set of utilities for running few-shot prompting experiments on large language models
☆121 Updated last year
austrian-code-wizard / c3po
☆27 Updated this week
Hannibal046 / nanoColBERT
Simple replication of [ColBERT-v1](https://arxiv.org/abs/2004.12832).
☆80 Updated last year
THUDM / ComplexFuncBench
Complex Function Calling Benchmark.
☆112 Updated 4 months ago
teknium1 / LLM-Benchmark-Logs
Just a bunch of benchmark logs for different LLMs
☆119 Updated 10 months ago
GAIR-NLP / scaleeval
Scalable Meta-Evaluation of LLMs as Evaluators
☆42 Updated last year
Anni-Zou / Meta-CoT
Meta-CoT: Generalizable Chain-of-Thought Prompting in Mixed-task Scenarios with Large Language Models
☆96 Updated last year
felipemaiapolo / tinyBenchmarks
Evaluating LLMs with fewer examples
☆155 Updated last year
SalesforceAIResearch / LaTRO
☆114 Updated 3 months ago
facebookresearch / collaborative-reasoner
Source code for the collaborative reasoner research project at Meta FAIR.
☆87 Updated last month
google-deepmind / loft
LOFT: A 1 Million+ Token Long-Context Benchmark
☆198 Updated last month
SalesforceAIResearch / SFR-RAG
☆75 Updated 4 months ago
ConsequentAI / fneval
Functional Benchmarks and the Reasoning Gap
☆86 Updated 8 months ago
rhyang2021 / SELFGOAL
Source code for our paper: "SelfGoal: Your Language Agents Already Know How to Achieve High-level Goals".
☆67 Updated 11 months ago
zbambergerNLP / strategic-debate-tot
A DSPy-based implementation of the tree of thoughts method (Yao et al., 2023) for generating persuasive arguments
☆81 Updated 8 months ago
asappresearch / webagents-step
☆40 Updated 10 months ago
declare-lab / flacuna
Flacuna was developed by fine-tuning Vicuna on Flan-mini, a comprehensive instruction collection encompassing various tasks. Vicuna is al…
☆111 Updated last year
bespokelabsai / verifiers
Verifiers for LLM Reinforcement Learning
☆56 Updated last month
facebookresearch / ExploreToM
Code for ExploreToM
☆83 Updated 5 months ago