olly-styles / WorkBench
WorkBench: a Benchmark Dataset for Agents in a Realistic Workplace Setting.
☆31 Updated 3 months ago
Related projects
Alternatives and complementary repositories for WorkBench
- ☆76 Updated 10 months ago
- Codebase accompanying the Summary of a Haystack paper. ☆71 Updated last month
- Just a bunch of benchmark logs for different LLMs ☆113 Updated 3 months ago
- Mixing Language Models with Self-Verification and Meta-Verification ☆97 Updated last year
- Doing simple retrieval from LLM models at various context lengths to measure accuracy ☆97 Updated 7 months ago
- Initiative to evaluate and rank the most popular LLMs across common task types based on their propensity to hallucinate. ☆100 Updated last month
- Official repo for the paper PHUDGE: Phi-3 as Scalable Judge. Evaluate your LLMs with or without custom rubric, reference answer, absolute… ☆48 Updated 4 months ago
- Generalist and Lightweight Model for Text Classification ☆48 Updated 2 months ago
- ☆111 Updated last month
- The code for the paper ROUTERBENCH: A Benchmark for Multi-LLM Routing System ☆91 Updated 4 months ago
- ☆38 Updated 3 months ago
- Official code for the paper "ADaPT: As-Needed Decomposition and Planning with Language Models" ☆71 Updated 10 months ago
- ☆91 Updated last month
- Code repo for "Agent Instructs Large Language Models to be General Zero-Shot Reasoners" ☆73 Updated 2 months ago
- Google DeepMind's PromptBreeder for automated prompt engineering implemented in LangChain Expression Language. ☆63 Updated 3 months ago
- Experimental Code for StructuredRAG: Structured Outputs in Retrieval-Augmented Generation ☆90 Updated this week
- Beating the GAIA benchmark with Transformers Agents. 🚀 ☆62 Updated last week
- ModuleFormer is a MoE-based architecture that includes two different types of experts: stick-breaking attention heads and feedforward exp… ☆216 Updated 7 months ago
- ☆64 Updated last month
- Meta-CoT: Generalizable Chain-of-Thought Prompting in Mixed-task Scenarios with Large Language Models ☆87 Updated last year
- Evaluating LLMs with CommonGen-Lite ☆84 Updated 7 months ago
- Backtracing: Retrieving the Cause of the Query, EACL 2024 Long Paper, Findings. ☆87 Updated 3 months ago
- ☆46 Updated 9 months ago
- TapeAgents is a framework that facilitates all stages of the LLM Agent development lifecycle ☆115 Updated this week
- The first dense retrieval model that can be prompted like an LM ☆62 Updated last month
- [Data + code] ExpertQA: Expert-Curated Questions and Attributed Answers ☆122 Updated 7 months ago
- ☆131 Updated 3 months ago
- 📝 Reference-Free automatic summarization evaluation with potential hallucination detection ☆99 Updated 9 months ago
- Let's build better datasets, together! ☆202 Updated 3 months ago
- Leverage your LangChain trace data for fine tuning ☆37 Updated 3 months ago