baaivision / JudgeLM
[ICLR 2025 Spotlight] An open-sourced LLM judge for evaluating LLM-generated answers.
☆415 · Updated 11 months ago
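As context for the listing below: JudgeLM-style evaluation asks a judge model to score two candidate answers to the same question. The following is a minimal sketch of that pairwise-judge pattern; the prompt template, score format, and function names are illustrative assumptions, not JudgeLM's actual interface.

```python
# Sketch of the pairwise LLM-judge pattern (prompt format and parsing are
# hypothetical, not JudgeLM's real template).

def build_judge_prompt(question: str, answer_a: str, answer_b: str) -> str:
    """Format a pairwise comparison prompt for a judge model."""
    return (
        "You are a judge evaluating two answers to the same question.\n"
        f"Question: {question}\n"
        f"Answer A: {answer_a}\n"
        f"Answer B: {answer_b}\n"
        "Reply with two scores from 1 to 10, e.g. '7 9'."
    )

def parse_scores(judge_output: str) -> tuple[int, int]:
    """Parse a '7 9'-style judge reply into a pair of integer scores."""
    a, b = judge_output.strip().split()[:2]
    return int(a), int(b)

prompt = build_judge_prompt("What is 2+2?", "4", "five")
score_a, score_b = parse_scores("9 2")  # stand-in for a judge model's reply
winner = "A" if score_a > score_b else "B"
```

In practice the prompt would be sent to the judge model and its reply parsed; the stub reply above only illustrates the scoring flow.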
Alternatives and similar repositories for JudgeLM
Users interested in JudgeLM are comparing it to the repositories listed below
- Official repository for ORPO ☆469 · Updated last year
- Data and code for FreshLLMs (https://arxiv.org/abs/2310.03214) ☆386 · Updated 2 months ago
- [ACL'24] Selective Reflection-Tuning: Student-Selected Data Recycling for LLM Instruction-Tuning ☆366 · Updated last year
- LLMs can generate feedback on their work, use it to improve the output, and repeat this process iteratively. ☆778 · Updated last year
- Official implementation for the paper "DoLa: Decoding by Contrasting Layers Improves Factuality in Large Language Models" ☆535 · Updated last year
- [COLM 2024] LoraHub: Efficient Cross-Task Generalization via Dynamic LoRA Composition ☆667 · Updated last year
- FuseAI Project ☆587 · Updated last year
- Generative Representational Instruction Tuning ☆685 · Updated 7 months ago
- [ICLR 2024 & NeurIPS 2023 WS] An Evaluator LM that is open-source, offers reproducible evaluation, and is inexpensive to use. Specifically d… ☆308 · Updated 2 years ago
- [NeurIPS 2024 Spotlight] Buffer of Thoughts: Thought-Augmented Reasoning with Large Language Models ☆677 · Updated 7 months ago
- ToolQA, a new dataset to evaluate the capabilities of LLMs in answering challenging questions with external tools. It offers two levels … ☆285 · Updated 2 years ago
- [ACL 2024] Progressive LLaMA with Block Expansion. ☆514 · Updated last year
- Benchmarking LLMs with Challenging Tasks from Real Users ☆246 · Updated last year
- ToolkenGPT: Augmenting Frozen Language Models with Massive Tools via Tool Embeddings - NeurIPS 2023 (oral) ☆270 · Updated last year
- ☆564 · Updated last year
- Implementation of the paper "Data Engineering for Scaling Language Models to 128K Context" ☆484 · Updated last year
- Repo for Rho-1: Token-level Data Selection & Selective Pretraining of LLMs. ☆456 · Updated last year
- This repository contains code to quantitatively evaluate instruction-tuned models such as Alpaca and Flan-T5 on held-out tasks. ☆551 · Updated last year
- MathVista: data, code, and evaluation for Mathematical Reasoning in Visual Contexts ☆354 · Updated 4 months ago
- Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them ☆546 · Updated last year
- ☆123 · Updated last year
- Benchmarking long-form factuality in large language models. Original code for our paper "Long-form factuality in large language models". ☆665 · Updated this week
- [ICLR 2024] Lemur: Open Foundation Models for Language Agents ☆556 · Updated 2 years ago
- ☆313 · Updated last year
- An Analytical Evaluation Board of Multi-turn LLM Agents [NeurIPS 2024 Oral] ☆390 · Updated last year
- [ACL 2024] T-Eval: Evaluating Tool Utilization Capability of Large Language Models Step by Step ☆304 · Updated last year
- Code for the paper 🌳 Tree Search for Language Model Agents ☆219 · Updated last year
- OpenICL is an open-source framework to facilitate research, development, and prototyping of in-context learning. ☆584 · Updated 2 years ago
- GitHub repository for "RAGTruth: A Hallucination Corpus for Developing Trustworthy Retrieval-Augmented Language Models" ☆223 · Updated last year
- Official repository of NEFTune: Noisy Embeddings Improve Instruction Finetuning ☆409 · Updated last year