Re-Align / just-eval

A simple GPT-based evaluation tool for multi-aspect, interpretable assessment of LLMs.

☆87

Alternatives and similar repositories for just-eval

Users interested in just-eval are comparing it to the libraries listed below:

wwxu21 / CUT
Source code of "Reasons to Reject? Aligning Language Models with Judgments"
☆58 Updated last year
liyucheng09 / Contamination_Detector
Lightweight tool to identify Data Contamination in LLMs evaluation
☆52 Updated last year
TIGER-AI-Lab / MAmmoTH2
Official code for "MAmmoTH2: Scaling Instructions from the Web" [NeurIPS 2024]
☆148 Updated 11 months ago
GAIR-NLP / ReAlign
Reformatted Alignment
☆112 Updated last year
abhika-m / FAVA
☆74 Updated last year
GAIR-NLP / OPO
☆51 Updated last year
chujiezheng / LLM-Extrapolation
Official repository for ACL 2025 paper "Model Extrapolation Expedites Alignment"
☆75 Updated 5 months ago
GAIR-NLP / scaleeval
Scalable Meta-Evaluation of LLMs as Evaluators
☆42 Updated last year
GasolSun36 / Iter-CoT
[NAACL 2024] Enhancing Chain-of-Thoughts Prompting with Iterative Bootstrapping in Large Language Models
☆86 Updated last year
WHGTyen / BIG-Bench-Mistake
A dataset of LLM-generated chain-of-thought steps annotated with mistake location.
☆82 Updated last year
IBM / SALMON
Self-Alignment with Principle-Following Reward Models
☆168 Updated last month
ernie-research / Tool-Augmented-Reward-Model
[ICLR'24 spotlight] Tool-Augmented Reward Modeling
☆51 Updated 4 months ago
DAMO-NLP-SG / contrastive-cot
Contrastive Chain-of-Thought Prompting
☆68 Updated last year
google / sycophancy-intervention
Scripts for generating synthetic finetuning data for reducing sycophancy.
☆116 Updated 2 years ago
OSU-NLP-Group / llm-planning-eval
[ACL'24] Code and data of paper "When is Tree Search Useful for LLM Planning? It Depends on the Discriminator"
☆54 Updated last year
clinicalml / co-llm
Co-LLM: Learning to Decode Collaboratively with Multiple Language Models
☆121 Updated last year
thunlp / Prompt-Transferability
On Transferability of Prompt Tuning for Natural Language Processing
☆100 Updated last year
UKPLab / acl2025-diverse-cot
Code for the 2025 ACL publication "Fine-Tuning on Diverse Reasoning Chains Drives Within-Inference CoT Refinement in LLMs"
☆33 Updated 3 months ago
csitfun / LogiCoT
The instructions and demonstrations for building a formal logical reasoning capable GLM
☆54 Updated last year
swj0419 / detect-pretrain-code
This repository provides an original implementation of Detecting Pretraining Data from Large Language Models by *Weijia Shi, *Anirudh Aji…
☆232 Updated last year
Junjie-Ye / ToolEyes
[COLING 2025] ToolEyes: Fine-Grained Evaluation for Tool Learning Capabilities of Large Language Models in Real-world Scenarios
☆69 Updated 5 months ago
QingruZhang / PASTA
PASTA: Post-hoc Attention Steering for LLMs
☆123 Updated 11 months ago
gpt4life / alpagasus
Unofficial implementation of AlpaGasus
☆93 Updated 2 years ago
shizhediao / R-Tuning
[NAACL 2024 Outstanding Paper] Source code for the NAACL 2024 paper entitled "R-Tuning: Instructing Large Language Models to Say 'I Don't…
☆121 Updated last year
GAIR-NLP / OlympicArena
[NeurIPS 2024] OlympicArena: Benchmarking Multi-discipline Cognitive Reasoning for Superintelligent AI
☆105 Updated 7 months ago
lifan-yuan / CRAFT
Code for ICLR 2024 paper "CRAFT: Customizing LLMs by Creating and Retrieving from Specialized Toolsets"
☆59 Updated last year
cambridgeltl / PairS
Aligning with Human Judgement: The Role of Pairwise Preference in Large Language Model Evaluators (Liu et al.; COLM 2024)
☆49 Updated 9 months ago
chenhongqiao / ToolDec
Syntax Error-Free and Generalizable Tool Use for LLMs via Finite-State Decoding
☆27 Updated last year
GAIR-NLP / Entropy-ABF
Official implementation for 'Extending LLMs’ Context Window with 100 Samples'
☆80 Updated last year
princeton-nlp / LLMBar
[ICLR 2024] Evaluating Large Language Models at Evaluating Instruction Following
☆131 Updated last year