IDEA-FinAI / LLM-as-a-Judge

☆117

Alternatives and similar repositories for LLM-as-a-Judge

Users who are interested in LLM-as-a-Judge are comparing it to the repositories listed below.

xlang-ai / BRIGHT
BRIGHT: A Realistic and Challenging Benchmark for Reasoning-Intensive Retrieval
☆141 Updated last month
OSU-NLP-Group / LLM-Knowledge-Conflict
[ICLR'24 Spotlight] "Adaptive Chameleon or Stubborn Sloth: Revealing the Behavior of Large Language Models in Knowledge Conflicts"
☆69 Updated last year
ZitongYang / Synthetic_Continued_Pretraining
Code implementation of synthetic continued pretraining
☆114 Updated 5 months ago
Ayanami0730 / deep_research_bench
DeepResearch Bench: A Comprehensive Benchmark for Deep Research Agents
☆82 Updated this week
TIGER-AI-Lab / MAmmoTH2
Official code for "MAmmoTH2: Scaling Instructions from the Web" [NeurIPS 2024]
☆144 Updated 7 months ago
dwzhu-pku / LongEmbed
LongEmbed: Extending Embedding Models for Long Context Retrieval (EMNLP 2024)
☆137 Updated 7 months ago
shizhediao / R-Tuning
[NAACL 2024 Outstanding Paper] Source code for the NAACL 2024 paper entitled "R-Tuning: Instructing Large Language Models to Say 'I Don't…
☆114 Updated 11 months ago
ScalerLab / JudgeBench
☆85 Updated 7 months ago
facebookresearch / ReasonIR
Official repository for the paper "ReasonIR: Training Retrievers for Reasoning Tasks".
☆170 Updated 2 weeks ago
weizhepei / InstructRAG
[ICLR 2025] InstructRAG: Instructing Retrieval-Augmented Generation via Self-Synthesized Rationales
☆98 Updated 4 months ago
zankner / CLoud
Critique-out-Loud Reward Models
☆66 Updated 8 months ago
QwenLM / ProcessBench
Official repository for ACL 2025 paper "ProcessBench: Identifying Process Errors in Mathematical Reasoning"
☆158 Updated last month
kyegomez / Lets-Verify-Step-by-Step
"Improving Mathematical Reasoning with Process Supervision" by OpenAI
☆108 Updated this week
ParticleMedia / RAGTruth
GitHub repository for "RAGTruth: A Hallucination Corpus for Developing Trustworthy Retrieval-Augmented Language Models"
☆185 Updated 6 months ago
THUDM / ComplexFuncBench
Complex Function Calling Benchmark.
☆114 Updated 5 months ago
MozerWang / Loong
[EMNLP 2024 (Oral)] Leave No Document Behind: Benchmarking Long-Context LLMs with Extended Multi-Doc QA
☆134 Updated 7 months ago
GAIR-NLP / scaleeval
Scalable Meta-Evaluation of LLMs as Evaluators
☆42 Updated last year
TIGER-AI-Lab / CritiqueFineTuning
Code for "Critique Fine-Tuning: Learning to Critique is More Effective than Learning to Imitate"
☆157 Updated 2 weeks ago
WHGTyen / BIG-Bench-Mistake
A dataset of LLM-generated chain-of-thought steps annotated with mistake location.
☆81 Updated 10 months ago
princeton-nlp / LLMBar
[ICLR 2024] Evaluating Large Language Models at Evaluating Instruction Following
☆127 Updated 11 months ago
xsc1234 / Search-in-the-Chain
Code for Search-in-the-Chain: Towards Accurate, Credible and Traceable Large Language Models for Knowledge-intensive Tasks
☆57 Updated last year
GAIR-NLP / ReAlign
Reformatted Alignment
☆113 Updated 8 months ago
rxlqn / awesome-llm-self-reflection
Augmented LLM with self-reflection
☆126 Updated last year
Hannibal046 / xRAG
[NeurIPS 2024] Source code for xRAG: Extreme Context Compression for Retrieval-augmented Generation with One Token
☆142 Updated 11 months ago
GAIR-NLP / AIME-Preview
☆68 Updated 3 months ago
icip-cas / Verifier-Engineering
Search, Verify and Feedback: Towards Next Generation Post-training Paradigm of Foundation Models via Verifier Engineering
☆59 Updated 6 months ago
LiqiangJing / DSBench
DSBench: How Far are Data Science Agents from Becoming Data Science Experts?
☆55 Updated 4 months ago
THU-KEG / AdaptThink
☆112 Updated 3 weeks ago
tianyang-x / SaySelf
Public code repo for paper "SaySelf: Teaching LLMs to Express Confidence with Self-Reflective Rationales"
☆106 Updated 8 months ago
zjunlp / WKM
[NeurIPS 2024] Agent Planning with World Knowledge Model
☆141 Updated 6 months ago