Su-Sea / ydc-deep-research-evals

You.com's framework for evaluating deep research systems.

☆32

Alternatives and similar repositories for ydc-deep-research-evals

Users who are interested in ydc-deep-research-evals are comparing it to the repositories listed below.

SalesforceAIResearch / CRMArena
Official Repo for CRMArena and CRMArena-Pro
☆114 Updated 2 months ago
ZeroSumEval / ZeroSumEval
A framework for pitting LLMs against each other in an evolving library of games ⚔
☆33 Updated 5 months ago
salesforce / summary-of-a-haystack
Codebase accompanying the Summary of a Haystack paper.
☆79 Updated 11 months ago
automix-llm / automix
Mixing Language Models with Self-Verification and Meta-Verification
☆110 Updated 9 months ago
goncalorafaria / qalign
QAlign is a new test-time alignment approach that improves language model performance by using Markov chain Monte Carlo methods.
☆24 Updated last week
davanstrien / haiku-dpo
Using open source LLMs to build synthetic datasets for direct preference optimization
☆65 Updated last year
oriyor / assistantbench
Implementation of the paper: "AssistantBench: Can Web Agents Solve Realistic and Time-Consuming Tasks?"
☆62 Updated 9 months ago
microsoft / llm-steer-instruct
A method for steering LLMs to better follow instructions
☆50 Updated last month
patronus-ai / Lynx-hallucination-detection
☆43 Updated last year
Arize-ai / LLMTest_NeedleInAHaystack
Doing simple retrieval from LLM models at various context lengths to measure accuracy
☆103 Updated last year
weaviate-tutorials / Hurricane
Writing Blog Posts with Generative Feedback Loops!
☆50 Updated last year
SalesforceAIResearch / MCPEval
MCP-based Agent Deep Evaluation System
☆129 Updated this week
allenai / infinigram-api
☆81 Updated 2 weeks ago
Alignment-Lab-AI / datagen
A pipeline for using API calls to agnostically convert unstructured data into structured training data
☆31 Updated 11 months ago
JoshuaPurtell / SmallBench
Small, simple agent task environments for training and evaluation
☆18 Updated 10 months ago
stunningpixels / lou-eval
Track the progress of LLM context utilisation
☆55 Updated 5 months ago
bespokelabsai / verifiers
Verifiers for LLM Reinforcement Learning
☆72 Updated 5 months ago
teknium1 / LLM-Benchmark-Logs
Just a bunch of benchmark logs for different LLMs
☆119 Updated last year
arcee-ai / DAM
☆54 Updated 10 months ago
pacman100 / peft-codegen-25
☆23 Updated 2 years ago
deshwalmahesh / PHUDGE
Official repo for the paper PHUDGE: Phi-3 as Scalable Judge. Evaluate your LLMs with or without custom rubric, reference answer, absolute…
☆49 Updated last year
phunterlau / paper_without_code
LLM reads a paper and produces a working prototype
☆56 Updated 5 months ago
s-smits / grpo-optuna
Optimizing Causal LMs through GRPO with weighted reward functions and automated hyperparameter tuning using Optuna
☆55 Updated 7 months ago
plastic-labs / dspy-opentom
Exploration using DSPy to optimize modules to maximize performance on the OpenToM dataset
☆19 Updated last year
S1M0N38 / dspy-arxiv
Explore the use of DSPy for extracting features from PDFs 🔎
☆45 Updated last year
pgasawa / BARE
Leveraging Base Language Models for Few-Shot Synthetic Data Generation
☆34 Updated last month
Columbia-NLP-Lab / PAPILLON
Code for our paper PAPILLON: PrivAcy Preservation from Internet-based and Local Language MOdel ENsembles
☆55 Updated 4 months ago
ContextualAI / CLAIR_and_APO
Anchored Preference Optimization and Contrastive Revisions: Addressing Underspecification in Alignment
☆60 Updated last year
facebookresearch / matrix
Matrix (Multi-Agent daTa geneRation Infra and eXperimentation framework) is a versatile engine for multi-agent conversational data genera…
☆95 Updated this week
mungg / FABLES
☆57 Updated 11 months ago