olly-styles / WorkBench
WorkBench: a Benchmark Dataset for Agents in a Realistic Workplace Setting.
☆33 Updated 3 months ago
Related projects
Alternatives and complementary repositories for WorkBench
- Codebase accompanying the Summary of a Haystack paper. ☆72 Updated 2 months ago
- Doing simple retrieval from LLM models at various context lengths to measure accuracy ☆97 Updated 7 months ago
- Code repo for "Agent Instructs Large Language Models to be General Zero-Shot Reasoners" ☆80 Updated 2 months ago
- TapeAgents is a framework that facilitates all stages of the LLM Agent development lifecycle ☆124 Updated this week
- The code for the paper ROUTERBENCH: A Benchmark for Multi-LLM Routing System ☆92 Updated 5 months ago
- Evaluating LLMs with CommonGen-Lite ☆85 Updated 8 months ago
- Synthetic Data for LLM Fine-Tuning ☆97 Updated 11 months ago
- [Data + code] ExpertQA: Expert-Curated Questions and Attributed Answers ☆122 Updated 8 months ago
- Let's build better datasets, together! ☆205 Updated this week
- Scalable Meta-Evaluation of LLMs as Evaluators ☆41 Updated 9 months ago
- RAGElo is a set of tools that helps you select the best RAG-based LLM agents by using an Elo ranker ☆106 Updated 3 weeks ago
- Evaluating LLMs with fewer examples ☆134 Updated 7 months ago
- Simple replication of [ColBERT-v1](https://arxiv.org/abs/2004.12832). ☆77 Updated 8 months ago
- Lightweight demos for finetuning LLMs. Powered by 🤗 transformers and open-source datasets. ☆64 Updated last month
- Official repo for the paper PHUDGE: Phi-3 as Scalable Judge. Evaluate your LLMs with or without custom rubric, reference answer, absolute… ☆48 Updated 4 months ago
- Steer LLM outputs towards a certain topic/subject and enhance response capabilities using activation engineering by adding steering vecto… ☆203 Updated 6 months ago
- Code accompanying "How I learned to start worrying about prompt formatting". ☆95 Updated last month
- Code and data for "Lumos: Learning Agents with Unified Data, Modular Design, and Open-Source LLMs" ☆448 Updated 8 months ago
- Functional Benchmarks and the Reasoning Gap ☆78 Updated last month
- awesome synthetic (text) datasets ☆242 Updated 3 weeks ago
- Beating the GAIA benchmark with Transformers Agents. 🚀 ☆62 Updated 3 weeks ago
- Code for the EMNLP 2024 paper "Detecting and Mitigating Contextual Hallucinations in Large Language Models Using Only Attention Maps" ☆109 Updated 3 months ago
- 📝 Reference-Free automatic summarization evaluation with potential hallucination detection ☆98 Updated 10 months ago
- Vision Document Retrieval (ViDoRe): Benchmark. Evaluation code for the ColPali paper. ☆130 Updated this week
- A simple unified framework for evaluating LLMs ☆145 Updated last week