olly-styles / WorkBench
WorkBench: a Benchmark Dataset for Agents in a Realistic Workplace Setting.
☆41 Updated 9 months ago
Alternatives and similar repositories for WorkBench:
Users who are interested in WorkBench are comparing it to the repositories listed below.
- Code repo for "Agent Instructs Large Language Models to be General Zero-Shot Reasoners" ☆107 Updated 7 months ago
- Mixing Language Models with Self-Verification and Meta-Verification ☆104 Updated 4 months ago
- Official repository for the paper "ReasonIR: Training Retrievers for Reasoning Tasks". ☆112 Updated last week
- [Data + code] ExpertQA: Expert-Curated Questions and Attributed Answers ☆128 Updated last year
- Codebase accompanying the Summary of a Haystack paper. ☆77 Updated 7 months ago
- Just a bunch of benchmark logs for different LLMs ☆119 Updated 9 months ago
- ☆36 Updated 9 months ago
- Complex Function Calling Benchmark. ☆99 Updated 3 months ago
- Simple replication of [ColBERT-v1](https://arxiv.org/abs/2004.12832). ☆80 Updated last year
- Evaluating LLMs with CommonGen-Lite ☆90 Updated last year
- ☆120 Updated 7 months ago
- Code for our paper PAPILLON: PrivAcy Preservation from Internet-based and Local Language MOdel ENsembles ☆30 Updated this week
- Implementation of the paper: "AssistantBench: Can Web Agents Solve Realistic and Time-Consuming Tasks?" ☆54 Updated 5 months ago
- A set of utilities for running few-shot prompting experiments on large-language models ☆120 Updated last year
- Archon provides a modular framework for combining different inference-time techniques and LMs with just a JSON config file. ☆172 Updated 2 months ago
- ☆166 Updated 8 months ago
- XTR/WARP (SIGIR'25) is an extremely fast and accurate retrieval engine based on Stanford's ColBERTv2/PLAID and Google DeepMind's XTR. ☆127 Updated this week
- Doing simple retrieval from LLM models at various context lengths to measure accuracy ☆99 Updated last year
- Official repo for the paper PHUDGE: Phi-3 as Scalable Judge. Evaluate your LLMs with or without custom rubric, reference answer, absolute… ☆49 Updated 9 months ago
- ☆73 Updated this week
- ☆74 Updated 3 months ago
- Evaluating LLMs with fewer examples ☆151 Updated last year
- [NeurIPS 2024] Spider2-V: How Far Are Multimodal Agents From Automating Data Science and Engineering Workflows? ☆124 Updated 8 months ago
- CRMArena: Understanding the Capacity of LLM Agents to Perform Professional CRM Tasks in Realistic Environments ☆55 Updated 2 months ago
- A DSPy-based implementation of the tree of thoughts method (Yao et al., 2023) for generating persuasive arguments ☆79 Updated 7 months ago
- Backtracing: Retrieving the Cause of the Query, EACL 2024 Long Paper, Findings. ☆89 Updated 9 months ago
- LongEmbed: Extending Embedding Models for Long Context Retrieval (EMNLP 2024) ☆134 Updated 6 months ago
- ☆123 Updated last month
- Evaluating tool-augmented LLMs in conversation settings ☆84 Updated 11 months ago
- Functional Benchmarks and the Reasoning Gap ☆85 Updated 7 months ago