Asaf-Yehudai/LLM-Agent-Evaluation-Survey

Top papers related to LLM-based agent evaluation

☆97

Alternatives and similar repositories for LLM-Agent-Evaluation-Survey

Users who are interested in LLM-Agent-Evaluation-Survey are comparing it to the repositories listed below.

shahariel / TEAL
TEAL: New Selection Strategy for Small Buffers in Experience Replay Class Incremental Learning
☆18 · Jan 21, 2025 · Updated last year
MoSalama98 / DSiRe
Official implementation of the "Dataset Size Recovery from LoRA Weights" paper.
☆34 · Jun 30, 2024 · Updated 2 years ago
eliahuhorwitz / MoTHer
Official PyTorch Implementation for the "Unsupervised Model Tree Heritage Recovery" paper (ICLR 2025).
☆62 · Jul 1, 2025 · Updated last year
slp-rl / slamkit
SlamKit is an open-source toolkit for efficient training of SpeechLMs. It was used for "Slamming: Training a Speech Language Model on On…
☆230 · Mar 14, 2026 · Updated 3 months ago
jonkahana / ProbeGen
An official implementation of ProbeGen
☆13 · Oct 20, 2024 · Updated last year
oriern / SuperPAL
☆24 · May 31, 2024 · Updated 2 years ago
IBM / unitxt
🦄 Unitxt is a Python library for enterprise-grade evaluation of AI performance, offering the world's largest catalog of tools and data …
☆216 · May 27, 2026 · Updated last month
jonkahana / CLIPPR
An official PyTorch implementation for CLIPPR
☆31 · Jul 22, 2023 · Updated 2 years ago
AsafShul / PoDD
Official PyTorch Implementation for the "Distilling Datasets Into Less Than One Image" paper.
☆39 · Jun 6, 2024 · Updated 2 years ago
eliahuhorwitz / 3D-ADS
Official Implementation for the "Back to the Feature: Classical 3D Features are (Almost) All You Need for 3D Anomaly Detection" paper (VA…
☆141 · Nov 28, 2022 · Updated 3 years ago
MaLA-LM / GlotEval
GlotEval: a unified evaluation toolkit designed to benchmark multilingual Large Language Models (LLMs) in a language-specific way
☆18 · Nov 4, 2025 · Updated 8 months ago
govtech-responsibleai / KnowOrNot
☆28 · Feb 11, 2026 · Updated 4 months ago
lovodkin93 / attribute-first-then-generate
Repository for "Attribute First, then Generate: Locally-attributable Grounded Text Generation", ACL 2024
☆30 · Dec 19, 2024 · Updated last year
panilya / awesome-ai-benchmarks
Awesome AI Benchmarks
☆36 · Jan 16, 2026 · Updated 5 months ago
the-crypt-keeper / llm-webapps
jQuery, React and Streamlit applications written by LLMs
☆15 · Dec 24, 2023 · Updated 2 years ago
barcavia / RealTime-DeepfakeDetection-in-the-RealWorld
Real-Time Deepfake Detection in the Real-World
☆50 · Nov 30, 2024 · Updated last year
ShmuelRonen / Hebrew-Mistral-7B-GUI
This repository contains a user-friendly Graphical User Interface (GUI) for interacting with the Hebrew-Mistral-7B language model.
☆15 · May 3, 2024 · Updated 2 years ago
google-research-datasets / QuoteSum
QuoteSum is a textual QA dataset containing Semi-Extractive Multi-source Question Answering (SEMQA) examples written by humans, based on …
☆13 · Mar 25, 2024 · Updated 2 years ago
google-research-datasets / widget-caption
The dataset includes widget captions that describe UI elements' functionalities. It is used for training and evaluation of the widget ca…
☆23 · Jun 24, 2021 · Updated 5 years ago
nlp-tlp / llm-fmc
Experiments on using ChatGPT for failure mode classification
☆12 · Sep 20, 2023 · Updated 2 years ago
DQle38 / Fair-Feature-Distillation-for-Visual-Recognition
Official implementation of the paper 'Fair Feature Distillation for Visual Recognition'
☆17 · Jun 23, 2021 · Updated 5 years ago
KalyanKS-NLP / LLM-Survey-Papers-Collection
A category-wise collection of 200+ LLM survey papers.
☆296 · Apr 7, 2025 · Updated last year
iyttor / GPNN
PyTorch implementation of the paper: "Drop the GAN: In Defense of Patches Nearest Neighbors as Single Image Generative Models"
☆67 · Nov 27, 2021 · Updated 4 years ago
google-research-datasets / LLAMA1-Test-Set
We introduce the LLAMA1 Test Set, a comprehensive open-domain world knowledge QA dataset for evaluating question-answering systems. We pr…
☆23 · Mar 14, 2024 · Updated 2 years ago
RoyiRa / GRADE-Quantifying-sample-diversity-in-text-to-image-models
☆12 · Mar 5, 2025 · Updated last year
avivga / lord-pytorch
Official PyTorch re-implementation of "Demystifying Inter-Class Disentanglement", ICLR 2020.
☆13 · Dec 2, 2021 · Updated 4 years ago
IBM / awesome-agentic-workflow-optimization
Survey paper: From Static Templates to Dynamic Runtime Graphs: A Survey of Workflow Optimization for LLM Agents.
☆72 · Apr 3, 2026 · Updated 3 months ago
0xSero / minimax-m2-proxy
A proxy for minimax-m2, enabling interleaved thinking and tool calls.
☆39 · Nov 21, 2025 · Updated 7 months ago
BIU-NLP / iFACETSUM
Corpus exploration platform using advanced tools such as interactive summarization and multi-document coreference resolution
☆12 · Jun 15, 2023 · Updated 3 years ago
lalaliat / Agent-Oriented-Planning
☆26 · Feb 28, 2025 · Updated last year
SjJ1017 / CiteLab
The predecessor of CiteLab.
☆18 · Feb 3, 2026 · Updated 5 months ago
IBM / FailureSensorIQ
FailureSensorIQ, a dataset and benchmark to probe LLMs’ reasoning and comprehension of sensor–failure relationships in industrial systems…
☆46 · Jul 2, 2026 · Updated last week
IBM / AIMMX
Automated AI Model Metadata eXtractor - automatically extracts and infers AI model-related metadata from software repositories
☆11 · Sep 21, 2025 · Updated 9 months ago
ShovalMessica / NAST
Official repository for NAST: Noise Aware Speech Tokenization for Speech Language Models (Interspeech 2024) https://arxiv.org/abs/2406.11…
☆46 · Jul 2, 2024 · Updated 2 years ago
shengchaochen82 / FFTS
[AAAI'25] The implementation of paper "Federated Foundation Models on Heterogeneous Time Series" | The first work to explore time series …
☆24 · May 10, 2026 · Updated 2 months ago
IBM / mlflow-watsonml
MLflow deployment plugin for IBM-cloud-watson-ml
☆15 · May 7, 2025 · Updated last year
SkyworkAI / agent-studio
Environments, tools, and benchmarks for general computer agents
☆17 · Dec 3, 2024 · Updated last year
ServiceNow / EnterpriseOps-Gym
Codebase for EnterpriseOps-Gym from ServiceNow
☆99 · Updated this week
odedlaz / uberlogs
make logging fun again
☆20 · Apr 9, 2017 · Updated 9 years ago