A simple GPT-based evaluation tool for multi-aspect, interpretable assessment of LLMs.
☆90 · Jan 29, 2024 · Updated 2 years ago
Alternatives and similar repositories for just-eval
Users who are interested in just-eval are comparing it to the repositories listed below.
- ☆313 · Jun 9, 2024 · Updated last year
- Your finetuned model's back to its original safety standards faster than you can say "SafetyLock"! ☆11 · Oct 16, 2024 · Updated last year
- An Empirical Study On Contrastive Search And Contrastive Decoding For Open-ended Text Generation ☆27 · Jun 7, 2024 · Updated last year
- This repository contains data, code and models for contextual noncompliance. ☆25 · Jul 18, 2024 · Updated last year
- ☆19 · Sep 16, 2025 · Updated 5 months ago
- Official Repository for ACL 2024 Paper SafeDecoding: Defending against Jailbreak Attacks via Safety-Aware Decoding ☆151 · Jul 19, 2024 · Updated last year
- [NAACL 2025] Representing Rule-based Chatbots with Transformers ☆23 · Feb 9, 2025 · Updated last year
- ☆16 · Jul 23, 2024 · Updated last year
- [EMNLP 2024] Official implementation of "Hierarchical Deconstruction of LLM Reasoning: A Graph-Based Framework for Analyzing Knowledge Ut… ☆23 · Dec 4, 2024 · Updated last year
- Research work aimed at addressing the problem of modeling infinite-length context ☆46 · Dec 18, 2025 · Updated 2 months ago
- ☆50 · Feb 5, 2023 · Updated 3 years ago
- ☆22 · Sep 2, 2025 · Updated 6 months ago
- Source code of "Reasons to Reject? Aligning Language Models with Judgments" ☆58 · Feb 29, 2024 · Updated 2 years ago
- InstructIR, a novel benchmark specifically designed to evaluate the instruction-following ability of information retrieval models. Our foc… ☆32 · Jun 13, 2024 · Updated last year
- Momentum Decoding: Open-ended Text Generation as Graph Exploration ☆19 · Jan 27, 2023 · Updated 3 years ago
- A framework for few-shot evaluation of autoregressive language models. ☆12 · Jul 14, 2025 · Updated 7 months ago
- Source code for SWIFT, an efficient reward model. ☆18 · Jan 13, 2026 · Updated last month
- SummScreen: A Dataset for Abstractive Screenplay Summarization (ACL 2022) ☆41 · May 22, 2022 · Updated 3 years ago
- ☆23 · Jul 5, 2024 · Updated last year
- ☆50 · Jun 7, 2025 · Updated 8 months ago
- CodeUltraFeedback: aligning large language models to coding preferences (TOSEM 2025) ☆73 · Jun 25, 2024 · Updated last year
- [ACL 2024] Code for the paper "ALaRM: Align Language Models via Hierarchical Rewards Modeling" ☆25 · Mar 28, 2024 · Updated last year
- Code for paper: "Executing Arithmetic: Fine-Tuning Large Language Models as Turing Machines" ☆11 · Oct 11, 2024 · Updated last year
- LongAttn: Selecting Long-context Training Data via Token-level Attention ☆15 · Jul 16, 2025 · Updated 7 months ago
- ☆11 · Jan 3, 2024 · Updated 2 years ago
- ☆11 · Apr 6, 2024 · Updated last year
- Open-source repository for the OOPSLA'24 paper "CYCLE: Learning to Self-Refine Code Generation" ☆10 · Mar 8, 2024 · Updated last year
- Align, a general text alignment function ☆15 · Dec 7, 2023 · Updated 2 years ago
- [EMNLP'23] Execution-Based Evaluation for Open Domain Code Generation ☆49 · Dec 22, 2023 · Updated 2 years ago
- ☆46 · Feb 8, 2024 · Updated 2 years ago
- ☆20 · Aug 14, 2025 · Updated 6 months ago
- Data and Code for Paper "Reflect Not Reflex: Inference-Based Common Ground Improves Dialogue Response Quality" (EMNLP 2022) ☆11 · Nov 28, 2022 · Updated 3 years ago
- Know2BIO: A Comprehensive Dual-View Benchmark for Evolving Biomedical Knowledge Graphs ☆14 · Feb 10, 2026 · Updated 3 weeks ago
- [ICLR 2025] On Evaluating the Durability of Safeguards for Open-Weight LLMs ☆13 · Jun 20, 2025 · Updated 8 months ago
- Knowledge Graph based Question Answering benchmark. ☆10 · Feb 1, 2020 · Updated 6 years ago
- Implementation for the paper "Fictitious Synthetic Data Can Improve LLM Factuality via Prerequisite Learning" ☆11 · Jan 10, 2025 · Updated last year
- ☆51 · Mar 2, 2024 · Updated 2 years ago
- A native-Chinese, tiered benchmark for code ability ☆15 · Apr 11, 2024 · Updated last year
- ☆13 · Jul 2, 2025 · Updated 8 months ago