OpenStellarTeam / DeltaBenchLinks

☆44

Alternatives and similar repositories for DeltaBench

Users that are interested in DeltaBench are comparing it to the libraries listed below

Sorting:

GAIR-NLP / weak-to-strong-reasoning
☆58Updated last year
THU-KEG / RM-Bench
[ICLR 25 Oral] RM-Bench: Benchmarking Reward Models of Language Models with Subtlety and Style
☆64Updated 3 months ago
yuzhaouoe / SAE-based-representation-engineering
[NAACL'25 Oral] Steering Knowledge Selection Behaviours in LLMs via SAE-Based Representation Engineering
☆66Updated 11 months ago
sail-sg / CPO
[NeurIPS 2024] The official implementation of paper: Chain of Preference Optimization: Improving Chain-of-Thought Reasoning in LLMs.
☆131Updated 7 months ago
GAIR-NLP / ReasonEval
[AAAI 2025 oral] Evaluating Mathematical Reasoning Beyond Accuracy
☆73Updated 3 weeks ago
alisawuffles / proxy-tuning
Code associated with Tuning Language Models by Proxy (Liu et al., 2024)
☆121Updated last year
HKUNLP / critic-rl
[ICML 2025] Teaching Language Models to Critique via Reinforcement Learning
☆114Updated 5 months ago
SihengLi99 / LLM-Honesty-Survey
[2025-TMLR] A Survey on the Honesty of Large Language Models
☆61Updated 10 months ago
qiuzh20 / EMoE
Official PyTorch Implementation of EMoE: Unlocking Emergent Modularity in Large Language Models [main conference @ NAACL2024]
☆35Updated last year
TianHongZXY / RLVR-Decomposed
[NeurIPS 2025] Implementation for the paper "The Surprising Effectiveness of Negative Reinforcement in LLM Reasoning"
☆115Updated 2 months ago
KbsdJames / omni-math-rule
The rule-based evaluation subset and code implementation of Omni-MATH
☆24Updated 10 months ago
GAIR-NLP / alignment-for-honesty
☆75Updated last year
chujiezheng / LLM-Extrapolation
Official repository for ACL 2025 paper "Model Extrapolation Expedites Alignment"
☆75Updated 5 months ago
princeton-nlp / MQuAKE
[EMNLP 2023] MQuAKE: Assessing Knowledge Editing in Language Models via Multi-Hop Questions
☆117Updated last year
test-time-interaction / TTI
☆63Updated 4 months ago
icip-cas / Verifier-Engineering
Search, Verify and Feedback: Towards Next Generation Post-training Paradigm of Foundation Models via Verifier Engineering
☆62Updated 10 months ago
GAIR-NLP / MetaCritique
Evaluate the Quality of Critique
☆36Updated last year
SihengLi99 / SEALONG
Large Language Models Can Self-Improve in Long-context Reasoning
☆73Updated 11 months ago
GAIR-NLP / BeHonest
BeHonest: Benchmarking Honesty in Large Language Models
☆34Updated last year
SparkJiao / dpo-trajectory-reasoning
[EMNLP 2024] Source code for the paper "Learning Planning-based Reasoning with Trajectory Collection and Process Rewards Synthesizing".
☆82Updated 9 months ago
GAIR-NLP / Preference-Dissection
☆25Updated last year
hkust-nlp / B-STaR
B-STAR: Monitoring and Balancing Exploration and Exploitation in Self-Taught Reasoners
☆85Updated 5 months ago
QwenLM / ProcessBench
Official repository for ACL 2025 paper "ProcessBench: Identifying Process Errors in Mathematical Reasoning"
☆174Updated 5 months ago
princeton-nlp / LLMBar
[ICLR 2024] Evaluating Large Language Models at Evaluating Instruction Following
☆132Updated last year
yyDing1 / ScaleQuest
[ACL 2025] We introduce ScaleQuest, a scalable, novel and cost-effective data synthesis method to unleash the reasoning capability of LLM…
☆68Updated last year
tatsu-lab / test_set_contamination
☆41Updated last year
zepingyu0512 / neuron-attribution
code for EMNLP 2024 paper: Neuron-Level Knowledge Attribution in Large Language Models
☆45Updated 11 months ago
dvlab-research / Mr-Ben
This is the repo for our paper "Mr-Ben: A Comprehensive Meta-Reasoning Benchmark for Large Language Models"
☆50Updated last year
shizhediao / R-Tuning
[NAACL 2024 Outstanding Paper] Source code for the NAACL 2024 paper entitled "R-Tuning: Instructing Large Language Models to Say 'I Don't…
☆124Updated last year
eric-mitchell / serac
Semi-Parametric Editing with a Retrieval-Augmented Counterfactual Model
☆68Updated 3 years ago