uiuc-kang-lab/agentic-benchmarks

☆60

Alternatives and similar repositories for agentic-benchmarks

Users who are interested in agentic-benchmarks compare it to the repositories listed below.

agentica-project / R2E-Gym
☆23 · Jul 10, 2025 · Updated last year
mrconter1 / BenchmarkAggregator
Comprehensive LLM evaluation framework: GPQA Diamond to Chatbot Arena. Tests all major models equally, easily extensible.
☆17 · Aug 22, 2024 · Updated last year
uiuc-kang-lab / AdaptiveAttackAgent
☆39 · Mar 12, 2025 · Updated last year
mbzuai-oryx / Agent-X
ICLR 2026: Agent-X Evaluating Deep Multimodal Reasoning in Vision-Centric Agentic Tasks
☆43 · Apr 28, 2026 · Updated 3 months ago
siegelz / core-bench
☆78 · Nov 23, 2025 · Updated 8 months ago
EmpathYang / ADEPT
Source code and data for ADEPT: A DEbiasing PrompT Framework (AAAI-23).
☆15 · Dec 13, 2024 · Updated last year
uiuc-kang-lab / InjecAgent
☆153 · Jul 2, 2024 · Updated 2 years ago
0xSero / sglang-moet
SGLang-native serving for the Moet sign-symmetric W2 expert format with SM120 W2/W4 kernels, GLM-5.2 NVFP4 TP4 on 4x RTX PRO 6000
☆20 · Jul 10, 2026 · Updated 2 weeks ago
Ying1123 / llm-caching-multiplexing
☆19 · Jun 3, 2023 · Updated 3 years ago
felixbinder / introspection_self_prediction
Code for experiments on self-prediction as a way to measure introspection in LLMs
☆16 · Dec 10, 2024 · Updated last year
Tomiinek / Aargh
☆12 · Jan 2, 2024 · Updated 2 years ago
rrgeorge-pdcontributions / NSFW-Words-List
Text file containing NSFW words aggregated from various sources.
☆12 · Aug 23, 2020 · Updated 5 years ago
GAIR-NLP / OctoThinker
Revisiting Mid-training in the Era of Reinforcement Learning Scaling
☆189 · Jul 23, 2025 · Updated last year
renll / SparseLT
[EMNLP 2022] Language Model Pre-Training with Sparse Latent Typing
☆14 · Feb 10, 2023 · Updated 3 years ago
allenai / AskOlmo
☆15 · Nov 19, 2025 · Updated 8 months ago
Victorwz / LaViA
☆10 · Jul 13, 2024 · Updated 2 years ago
zhaodongsun / rppg_biometrics
rPPG-based Biometric Authentication
☆11 · Jun 3, 2025 · Updated last year
ZQS1943 / DOCIE
☆17 · Jun 15, 2022 · Updated 4 years ago
amrrs / python-codegen-ai
Python code generation from a natural-language prompt
☆15 · Jun 30, 2022 · Updated 4 years ago
Tencent-Hunyuan / Hunyuan-4B
☆16 · Aug 5, 2025 · Updated 11 months ago
2prime / OpenBlackBox
☆12 · Nov 5, 2019 · Updated 6 years ago
allenai / infinigram-api
☆102 · Jul 16, 2026 · Updated last week
fra31 / rlhf-trojan-competition-submission
☆19 · Feb 25, 2024 · Updated 2 years ago
reka-ai / rekaquant
☆63 · Jul 10, 2025 · Updated last year
databricks / officeqa
Repository for getting started with the OfficeQA Benchmark.
☆164 · Jul 21, 2026 · Updated last week
nikhilchandak / answer-matching
Code for 'Answer Matching Outperforms Multiple Choice for Language Model Evaluation' paper
☆18 · Jul 4, 2025 · Updated last year
tansey / smoothfdr
False discovery rate smoothing
☆14 · May 8, 2020 · Updated 6 years ago
wizard1203 / FuseFL
FuseFL: One-Shot Federated Learning through the Lens of Causality with Progressive Model Fusion (NeurIPS 2024 Spotlight)
☆15 · Mar 31, 2025 · Updated last year
FrontierCS / FrontierSmith
FrontierSmith, a new system that uses AI to synthesize open-ended coding problems at scale
☆50 · May 30, 2026 · Updated last month
liziniu / policy_optimization
Code for Paper (Policy Optimization in RLHF: The Impact of Out-of-preference Data)
☆29 · Dec 19, 2023 · Updated 2 years ago
GeneralUserModels / napsack
☆16 · Apr 4, 2026 · Updated 3 months ago
FreedomIntelligence / MyPhoneBench
MyPhoneBench: Do Phone-Use Agents Respect Your Privacy?
☆24 · Apr 3, 2026 · Updated 3 months ago
abbyvansoest / maxent
☆14 · May 30, 2019 · Updated 7 years ago
zhenyuhe00 / SWE-Swiss
SWE-Swiss: A Multi-Task Fine-Tuning and RL Recipe for High-Performance Issue Resolution
☆105 · Sep 24, 2025 · Updated 10 months ago
ise-uiuc / NablaFuzz
Fuzzing Automatic Differentiation in Deep-Learning Libraries (ICSE'23)
☆27 · Mar 2, 2024 · Updated 2 years ago
marukosan93 / ORPDAD
This is the official code repository of our dataset and ECCV 2024 paper entitled "Oulu Remote-photoplethysmography Physical Domain Attac…
☆14 · Jul 9, 2025 · Updated last year
apartresearch / DarkBench
Benchmarking Dark Patterns in LLMs (ICLR 2025)
☆18 · Mar 29, 2025 · Updated last year
cyfer0618 / kaldi-pytorch-rnnlm
Enable RNNLM lattice rescoring with PyTorch [kaldi]
☆12 · Jun 5, 2020 · Updated 6 years ago
t-vi / lod2021
PyTorch Tutorial at the LOD2021 conference
☆21 · Oct 7, 2021 · Updated 4 years ago