Su-Sea / ydc-deep-research-evals
You.com's framework for evaluating deep research systems.
☆32 Updated 4 months ago
Alternatives and similar repositories for ydc-deep-research-evals
Users who are interested in ydc-deep-research-evals are comparing it to the repositories listed below.
- Official Repo for CRMArena and CRMArena-Pro ☆114 Updated 2 months ago
- A framework for pitting LLMs against each other in an evolving library of games ⚔ ☆33 Updated 5 months ago
- Codebase accompanying the Summary of a Haystack paper. ☆79 Updated 11 months ago
- Mixing Language Models with Self-Verification and Meta-Verification ☆110 Updated 9 months ago
- QAlign is a new test-time alignment approach that improves language model performance by using Markov chain Monte Carlo methods. ☆24 Updated last week
- Using open source LLMs to build synthetic datasets for direct preference optimization ☆65 Updated last year
- Implementation of the paper: "AssistantBench: Can Web Agents Solve Realistic and Time-Consuming Tasks?" ☆62 Updated 9 months ago
- A method for steering LLMs to better follow instructions ☆50 Updated last month
- ☆43 Updated last year
- Doing simple retrieval from LLM models at various context lengths to measure accuracy ☆103 Updated last year
- Writing Blog Posts with Generative Feedback Loops! ☆50 Updated last year
- MCP-based Agent Deep Evaluation System ☆129 Updated this week
- ☆81 Updated 2 weeks ago
- A pipeline for using API calls to agnostically convert unstructured data into structured training data ☆31 Updated 11 months ago
- Small, simple agent task environments for training and evaluation ☆18 Updated 10 months ago
- Track the progress of LLM context utilisation ☆55 Updated 5 months ago
- Verifiers for LLM Reinforcement Learning ☆72 Updated 5 months ago
- Just a bunch of benchmark logs for different LLMs ☆119 Updated last year
- ☆54 Updated 10 months ago
- ☆23 Updated 2 years ago
- Official repo for the paper PHUDGE: Phi-3 as Scalable Judge. Evaluate your LLMs with or without custom rubric, reference answer, absolute… ☆49 Updated last year
- LLM reads a paper and produces a working prototype ☆56 Updated 5 months ago
- Optimizing Causal LMs through GRPO with weighted reward functions and automated hyperparameter tuning using Optuna ☆55 Updated 7 months ago
- Exploration using DSPy to optimize modules to maximize performance on the OpenToM dataset ☆19 Updated last year
- Explore the use of DSPy for extracting features from PDFs 🔎 ☆45 Updated last year
- Leveraging Base Language Models for Few-Shot Synthetic Data Generation ☆34 Updated last month
- Code for our paper PAPILLON: PrivAcy Preservation from Internet-based and Local Language MOdel ENsembles ☆55 Updated 4 months ago
- Anchored Preference Optimization and Contrastive Revisions: Addressing Underspecification in Alignment ☆60 Updated last year
- Matrix (Multi-Agent daTa geneRation Infra and eXperimentation framework) is a versatile engine for multi-agent conversational data genera… ☆95 Updated this week
- ☆57 Updated 11 months ago