[ICLR 2024 Spotlight] FLASK: Fine-grained Language Model Evaluation based on Alignment Skill Sets
☆218 · Updated Dec 24, 2023
Alternatives and similar repositories for FLASK
Users interested in FLASK are comparing it to the repositories listed below.
- ☆11 · Updated Sep 19, 2025
- [ICLR 2024 & NeurIPS 2023 WS] An Evaluator LM that is open-source, offers reproducible evaluation, and is inexpensive to use. Specifically d… ☆311 · Updated Nov 11, 2023
- ☆24 · Updated Dec 2, 2023
- This repository contains the official code for the paper: "Prompt Injection: Parameterization of Fixed Inputs" ☆32 · Updated Sep 13, 2024
- Evaluating a language model's responses using a reward model ☆29 · Updated Feb 23, 2024
- InstructIR, a novel benchmark specifically designed to evaluate the instruction-following ability of information retrieval models. Our foc… ☆32 · Updated Jun 13, 2024
- [EMNLP 2023] The CoT Collection: Improving Zero-shot and Few-shot Learning of Language Models via Chain-of-Thought Fine-Tuning ☆254 · Updated Oct 31, 2023
- [NeurIPS 2025] Reasoning Models Better Express Their Confidence ☆22 · Updated Nov 19, 2025
- [ACL 2023] Gradient Ascent Post-training Enhances Language Model Generalization ☆29 · Updated Sep 12, 2024
- [ACL 2024 Findings & ICLR 2024 WS] An Evaluator VLM that is open-source, offers reproducible evaluation, and is inexpensive to use. Specific… ☆80 · Updated Sep 13, 2024
- Evaluate your LLM's response with Prometheus and GPT4 💯 ☆1,046 · Updated Apr 25, 2025
- Advanced Reasoning Benchmark Dataset for LLMs ☆47 · Updated Nov 19, 2023
- Repository for "Scaling Evaluation-time Compute with Reasoning Models as Process Evaluators" ☆12 · Updated Mar 25, 2025
- MetricEval: A framework that conceptualizes and operationalizes four main components of metric evaluation, in terms of reliability and va… ☆12 · Updated Nov 6, 2023
- All-in-one repository for Fine-tuning & Pretraining (Large) Language Models ☆15 · Updated Mar 8, 2023
- [ICML 2023] Exploring the Benefits of Training Expert Language Models over Instruction Tuning ☆98 · Updated Apr 26, 2023
- [EMNLP 2023] Official repository for Dialogue Chain-of-Thought Distillation (DONUT & DOCTOR) ☆11 · Updated Nov 15, 2023
- Official codebase for "SelFee: Iterative Self-Revising LLM Empowered by Self-Feedback Generation" ☆228 · Updated Jun 6, 2023
- [NAACL 2024] Official repository for "KTRL+F: Knowledge-Augmented In-Document Search" ☆23 · Updated Oct 11, 2024
- Source Code of Paper "GPTScore: Evaluate as You Desire" ☆258 · Updated Feb 21, 2023
- Official repository for KoMT-Bench built by LG AI Research ☆71 · Updated Aug 8, 2024
- Personalized Soups: Personalized Large Language Model Alignment via Post-hoc Parameter Merging ☆118 · Updated Oct 23, 2023
- This is the official repository for "Safer-Instruct: Aligning Language Models with Automated Preference Data" ☆17 · Updated Feb 22, 2024
- Official implementation of "OffsetBias: Leveraging Debiased Data for Tuning Evaluators" ☆25 · Updated Sep 11, 2024
- CLIcK: A Benchmark Dataset of Cultural and Linguistic Intelligence in Korean ☆48 · Updated Dec 23, 2024
- [EACL 2023] CoTEVer: Chain of Thought Prompting Annotation Toolkit for Explanation Verification ☆42 · Updated Apr 29, 2023
- A collection of public Korean instruction datasets for training language models ☆452 · Updated Apr 13, 2025
- Code and data for "KoDialogBench: Evaluating Conversational Understanding of Language Models with Korean Dialogue Benchmark" (LREC-COLING… ☆17 · Updated Apr 15, 2025
- [NAACL 2024] Vision language model that reduces hallucinations through self-feedback guided revision. Visualizes attentions on image feat… ☆47 · Updated Aug 21, 2024
- Dromedary: towards helpful, ethical and reliable LLMs ☆1,144 · Updated Sep 18, 2025
- Arena-Hard-Auto: An automatic LLM benchmark ☆1,003 · Updated Jun 21, 2025
- [AAAI 2024] Investigating the Effectiveness of Task-Agnostic Prefix Prompt for Instruction Following ☆79 · Updated Sep 13, 2024
- Holistic Evaluation of Language Models (HELM) is an open source Python framework created by the Center for Research on Foundation Models … ☆2,684 · Updated this week
- ☆34 · Updated Jan 7, 2026
- Benchmarking LLMs with Challenging Tasks from Real Users ☆246 · Updated Nov 3, 2024
- Fine-tuning llama2 using the chain-of-thought approach ☆10 · Updated Nov 18, 2023
- [TACL 2024] Improving Probability-based Prompt Selection Through Unified Evaluation and Analysis ☆11 · Updated Nov 14, 2024
- ☆11 · Updated Jun 5, 2024
- Reimplementation of the task generation part from the Alpaca paper ☆119 · Updated Apr 4, 2023