GaryStack/Trustworthy-Evaluation

Repository of paper "Establishing Trustworthy LLM Evaluation via Shortcut Neuron Analysis" (ACL 2025 Main)

☆19

Alternatives and similar repositories for Trustworthy-Evaluation

Users that are interested in Trustworthy-Evaluation are comparing it to the libraries listed below.

THU-KEG / DICE
DICE: Detecting In-distribution Data Contamination with LLM's Internal State
☆12 · Sep 21, 2024 · Updated last year
THU-KEG / DeepPrune
🌿 DeepPrune: Parallel Scaling without Inter-trace Redundancy
☆21 · Apr 20, 2026 · Updated 3 months ago
HongbangYuan / OmniReward
☆47 · Dec 16, 2025 · Updated 7 months ago
jinzhuoran / RAG-RewardBench
RAG-RewardBench: Benchmarking Reward Models in Retrieval Augmented Generation for Preference Alignment
☆18 · Dec 19, 2024 · Updated last year
THU-KEG / WaterBench
[ACL2024-Main] Data and Code for WaterBench: Towards Holistic Evaluation of LLM Watermarks
☆32 · Nov 14, 2023 · Updated 2 years ago
chrisliu298 / llm-unlearn-eco
[NeurIPS 2024] Large Language Model Unlearning via Embedding-Corrupted Prompts
☆41 · Sep 26, 2024 · Updated last year
CSHaitao / LegalAgentBench
The official repo for our paper: LegalAgentBench: Evaluating LLM Agents in Legal Domain
☆49 · Apr 10, 2026 · Updated 3 months ago
Trae1ounG / Pretrain_Space_RLVR
[arxiv: 2604.14142] From P(y|x) to P(y): Investigating Reinforcement Learning in Pre-train Space
☆17 · Apr 16, 2026 · Updated 3 months ago
ModalMinds / gym-v
A unified framework for vision-language environments with Gymnasium-compatible interface
☆35 · Mar 17, 2026 · Updated 4 months ago
cjj826 / GoalAct
The repo for our paper: Enhancing LLM-Based Agents via Global Planning and Hierarchical Execution (NCIIP 2025 Best Paper)
☆17 · Aug 18, 2025 · Updated 11 months ago
refkxh / DUSA
[ACM MM 2023] Official implementation of DUSA: Decoupled Unsupervised Sim2Real Adaptation for Vehicle-to-Everything Collaborative Percept…
☆12 · Nov 17, 2023 · Updated 2 years ago
jinzhuoran / MiNer
A Good Neighbor, A Found Treasure: Mining Treasured Neighbors for Knowledge Graph Entity Typing. EMNLP 2022
☆11 · Feb 1, 2023 · Updated 3 years ago
jinzhuoran / CogIE
CogIE: An Information Extraction Toolkit for Bridging Text and CogNet. ACL 2021
☆71 · Aug 27, 2022 · Updated 3 years ago
chenlong-clock / DTELS-Bench
[NAACL 2025 Main] DTELS: Towards Dynamic Granularity of Timeline Summarization
☆17 · Oct 9, 2025 · Updated 9 months ago
holarissun / RewardModelingBeyondBradleyTerry
official implementation of ICLR'2025 paper: Rethinking Bradley-Terry Models in Preference-based Reward Modeling: Foundations, Theory, and…
☆73 · Apr 2, 2025 · Updated last year
codephage2020 / slock-desktop
Slock workspace client for macOS.
☆27 · May 11, 2026 · Updated 2 months ago
JianyuanZhong / StableDRL
☆15 · Updated this week
CSHaitao / LegalOne
LegalOne: A Family of Foundation Models for Reliable Legal Reasoning
☆66 · Feb 3, 2026 · Updated 5 months ago
huawei-lin / RapidIn
RapidIn: Scalable Influence Estimation for Large Language Models (LLMs). The implementation for paper "Token-wise Influential Training Da…
☆22 · Mar 10, 2026 · Updated 4 months ago
chenlong-clock / RULE-Unlearn
[NeurIPS25] RULE: Reinforcement UnLEarning Achieves Forget-retain Pareto Optimality
☆20 · Oct 22, 2025 · Updated 9 months ago
Timothyxxx / TestTimeTrainingPapers
☆59 · Apr 13, 2026 · Updated 3 months ago
zhaosuifeng / FinRAGBench-V
FinRAGBench-V: A Benchmark for Multimodal RAG with Visual Citation in the Financial Domain (EMNLP 2025)
☆19 · Jan 13, 2026 · Updated 6 months ago
Junjie-Ye / ToolEyes
[COLING 2025] ToolEyes: Fine-Grained Evaluation for Tool Learning Capabilities of Large Language Models in Real-world Scenarios
☆74 · May 13, 2025 · Updated last year
interactivebench / InteractiveBench
Official Project Page for Interactive Benchmarks
☆31 · May 12, 2026 · Updated 2 months ago
refkxh / BiCo
[CVPR 2026 Highlight] Official implementation of BiCo: Composing Concepts from Images and Videos via Concept-prompt Binding
☆86 · May 31, 2026 · Updated last month
CLR-Lab / SimKO
SimKO: Simple Pass@K Policy Optimization
☆31 · Oct 24, 2025 · Updated 8 months ago
amao0o0 / awesome-AI-Math-Datasets
A collection of recent open-source math datasets for training and evaluating Math LLMs
☆32 · Apr 26, 2026 · Updated 2 months ago
THU-Team-Eureka / EurekAgent
EurekAgent: an autonomous research system for metric-driven tasks, built with Claude Code. Define the problem and metric. Get breakthroug…
☆73 · Updated this week
THU-KEG / Xlore2.0
Xlore2.0 Code [BaiduExtractor, HudongExtractor, WikiExtractor, XloreData, XloreWeb]
☆12 · Apr 5, 2017 · Updated 9 years ago
kuleshov-group / d2
d2: Improved Techniques for Training Reasoning Diffusion Language Models
☆16 · Mar 25, 2026 · Updated 3 months ago
DingWu1021 / Promsa
Promsa: Search Agent Research
☆80 · Jul 7, 2026 · Updated 2 weeks ago
THU-KEG / Event-Level-Knowledge-Editing
☆12 · Apr 25, 2024 · Updated 2 years ago
THU-KEG / Entity-Linking-Trends-and-History
Papers about the trend of Entity Linking in recent years.
☆11 · Sep 5, 2022 · Updated 3 years ago
yunzhusong / NAACL2022-REFLECT
Code for the paper: Improving Multi-Document Summarization through Referenced Flexible Extraction with Credit-Awareness
☆12 · Oct 22, 2023 · Updated 2 years ago
ShuyangCao / hibrids_summ
Code for ACL 2022 paper "HIBRIDS: Attention with Hierarchical Biases for Structure-aware Long Document Summarization".
☆13 · May 24, 2022 · Updated 4 years ago
MoonshotAI / Kimi-Researcher
☆80 · Jun 20, 2025 · Updated last year
ShadeCloak / ADORA
☆47 · Apr 9, 2025 · Updated last year
refkxh / C-Instructor
[ECCV 2024] Official implementation of C-Instructor: Controllable Navigation Instruction Generation with Chain of Thought Prompting
☆31 · Dec 16, 2024 · Updated last year
DA-Open / DV-World
[ICML 2026] DV-World: Benchmarking Data Visualization Agents in Real-World Scenarios
☆69 · Apr 29, 2026 · Updated 2 months ago