Asaf-Yehudai / LLM-Agent-Evaluation-Survey
Top papers related to LLM-based agent evaluation
☆70 Updated 2 weeks ago
Alternatives and similar repositories for LLM-Agent-Evaluation-Survey
Users who are interested in LLM-Agent-Evaluation-Survey are comparing it to the repositories listed below
- Repository for "Attribute First, then Generate: Locally-attributable Grounded Text Generation", ACL 2024 ☆29 Updated 6 months ago
- ☆61 Updated 3 weeks ago
- ☆65 Updated 2 months ago
- Anchored Preference Optimization and Contrastive Revisions: Addressing Underspecification in Alignment ☆57 Updated 9 months ago
- ReBase: Training Task Experts through Retrieval Based Distillation ☆29 Updated 4 months ago
- Combining Base and Instruction-Tuned Language Models for Better Synthetic Data Generation ☆33 Updated 4 months ago
- Dynamic Cheatsheet: Test-Time Learning with Adaptive Memory ☆62 Updated last month
- A package dedicated to running benchmark agreement testing ☆16 Updated last month
- Dataset and evaluation suite enabling LLM instruction-following for scientific literature understanding. ☆40 Updated 3 months ago
- ☆35 Updated 3 weeks ago
- Verifiers for LLM Reinforcement Learning ☆60 Updated 2 months ago
- General Reasoner: Advancing LLM Reasoning Across All Domains ☆142 Updated 2 weeks ago
- SiriuS: Self-improving Multi-agent Systems via Bootstrapped Reasoning ☆57 Updated 2 months ago
- A framework for pitting LLMs against each other in an evolving library of games ⚔ ☆32 Updated 2 months ago
- Stanford NLP Python library for benchmarking the utility of LLM interpretability methods ☆95 Updated 3 weeks ago
- Source code for the collaborative reasoner research project at Meta FAIR. ☆91 Updated 2 months ago
- minimal GRPO implementation from scratch ☆90 Updated 3 months ago
- Maya: An Instruction Finetuned Multilingual Multimodal Model using Aya ☆112 Updated last month
- Official Repo for InSTA: Towards Internet-Scale Training For Agents ☆42 Updated this week
- DeepResearch Bench: A Comprehensive Benchmark for Deep Research Agents ☆135 Updated last week
- Open-Source LLM Coders with Co-Evolving Reinforcement Learning ☆83 Updated 3 weeks ago
- Improving Text Embedding of Language Models Using Contrastive Fine-tuning ☆64 Updated 10 months ago
- Code, results and other artifacts from the paper introducing the WildChat-50m dataset and the Re-Wild model family. ☆29 Updated 2 months ago
- The official implementation of Regularized Policy Gradient (RPG) (https://arxiv.org/abs/2505.17508) ☆35 Updated this week
- QAlign is a new test-time alignment approach that improves language model performance by using Markov chain Monte Carlo methods. ☆23 Updated 2 months ago
- Codebase accompanying the Summary of a Haystack paper. ☆78 Updated 9 months ago
- Source code for GreaTer ICLR 2025 - Gradient Over Reasoning makes Smaller Language Models Strong Prompt Optimizers ☆29 Updated 2 months ago
- PyTorch library for Active Fine-Tuning ☆80 Updated 4 months ago
- Official repository for the paper "ReasonIR: Training Retrievers for Reasoning Tasks". ☆172 Updated this week
- Simple GRPO scripts and configurations. ☆58 Updated 4 months ago