Top papers related to LLM-based agent evaluation
☆89Oct 21, 2025Updated 5 months ago
Alternatives and similar repositories for LLM-Agent-Evaluation-Survey
Users that are interested in LLM-Agent-Evaluation-Survey are comparing it to the libraries listed below.
- The official repo of the paper "StressTest: Can YOUR Speech LM Handle the Stress?"☆20Jul 9, 2025Updated 8 months ago
- Official PyTorch Implementation for the "A Deep Inverse-Mapping Model for a Flapping Robotic Wing" Paper (ICLR 2025)☆21Dec 16, 2025Updated 3 months ago
- FastFit ⚡ When LLMs are Unfit Use FastFit ⚡ Fast and Effective Text Classification with Many Classes☆216Sep 18, 2025Updated 6 months ago
- ☆24May 31, 2024Updated last year
- 🦄 Unitxt is a Python library for enterprise-grade evaluation of AI performance, offering the world's largest catalog of tools and data …☆211Feb 16, 2026Updated last month
- A Lossless Compression Library for AI pipelines☆312Mar 20, 2026Updated last week
- A repository to get acquainted with basic training tasks in natural language processing and machine learning☆11Dec 27, 2023Updated 2 years ago
- GlotEval: a unified evaluation toolkit designed to benchmark multilingual Large Language Models (LLMs) in a language-specific way☆18Nov 4, 2025Updated 4 months ago
- This repository contains a user-friendly Graphical User Interface (GUI) for interacting with the Hebrew-Mistral-7B language model.☆15May 3, 2024Updated last year
- The AI Alliance project to define a reference stack for AI model and system evaluation, with evaluations, benchmarks, and leaderboards.☆13Mar 9, 2026Updated 3 weeks ago
- Python framework which enables you to transform how a user calls or infers an IBM Granite model and how the output from the model is retu…☆57Mar 20, 2026Updated last week
- jQuery, React and Streamlit applications written by LLMs☆16Dec 24, 2023Updated 2 years ago
- ☆10Jan 31, 2026Updated last month
- The dataset includes widget captions that describe UI elements' functionalities. It is used for training and evaluation of the widget ca…☆23Jun 24, 2021Updated 4 years ago
- QuoteSum is a textual QA dataset containing Semi-Extractive Multi-source Question Answering (SEMQA) examples written by humans, based on …☆13Mar 25, 2024Updated 2 years ago
- Leveraging Base Language Models for Few-Shot Synthetic Data Generation☆41Oct 18, 2025Updated 5 months ago
- Codebase for EnterpriseOps-Gym from ServiceNow☆71Mar 22, 2026Updated last week
- Sample application demonstrating how to use the Vanilla AI Agents framework to build a basic call center in the context of a generic TelCo …☆20Updated this week
- Corpus exploration platform using advanced tools such as interactive summarization and multi document coreference resolution☆12Jun 15, 2023Updated 2 years ago
- ☆21Feb 28, 2025Updated last year
- Implementation of KDR-Agent, the AAAI 2025 accepted paper, focusing on knowledge-driven reasoning for autonomous agents.☆18Nov 24, 2025Updated 4 months ago
- De-Identification of Medical Imaging Data: A Comprehensive Tool for Ensuring Patient Privacy☆21Updated this week
- FailureSensorIQ, a dataset and benchmark to probe LLMs’ reasoning and comprehension of sensor–failure relationships in industrial systems…☆35Mar 18, 2026Updated last week
- A quick and optimized solution to manage Llama-based GGUF quantized models, download GGUF files, retrieve message formatting, add more mo…☆12Jan 13, 2024Updated 2 years ago
- Environments, tools, and benchmarks for general computer agents☆14Dec 3, 2024Updated last year
- make logging fun again☆19Apr 9, 2017Updated 8 years ago
- ☆27Sep 11, 2024Updated last year
- Fast search index for SPLADE sparse retrieval models implemented in Python using Numpy and Numba☆37Oct 16, 2025Updated 5 months ago
- Source code for paper "Trajectory of Alternating Direction Method of Multipliers and Adaptive Acceleration" of NeurIPS 2019☆10Jan 25, 2024Updated 2 years ago
- ☆10Nov 12, 2024Updated last year
- ScreenExplorer: Training a Vision-Language Model for Diverse Exploration in Open GUI World☆25Jun 17, 2025Updated 9 months ago
- [KDD 2025] AgentGen: Enhancing Planning Abilities for Large Language Model based Agent via Environment and Task Generation☆33Nov 18, 2025Updated 4 months ago
- ☆32Mar 20, 2026Updated last week
- This is the official repository of the paper "Atomic-to-Compositional Generalization for Mobile Agents with A New Benchmark and Schedulin…☆13Jul 27, 2025Updated 8 months ago
- Python wrapper for ONLP YAP https://github.com/OnlpLab/yap☆16Jan 27, 2023Updated 3 years ago
- A JAX-like function transformation engine, but micro: microjax☆34Oct 25, 2024Updated last year
- Training and Benchmarking LLMs for Code Preference.☆38Nov 15, 2024Updated last year
- CUGA is an open-source generalist agent for the enterprise, supporting complex task execution on web and APIs, OpenAPI/MCP integrations, …☆694Updated this week
- [ICLR 2023] PyTorch code of Summarization Programs: Interpretable Abstractive Summarization with Neural Modular Trees☆23Jun 19, 2023Updated 2 years ago