baaivision / JudgeLM
[ICLR 2025 Spotlight] An open-sourced LLM judge for evaluating LLM-generated answers.
☆403 · Updated 9 months ago
Alternatives and similar repositories for JudgeLM
Users that are interested in JudgeLM are comparing it to the libraries listed below
- Official repository for ORPO ☆465 · Updated last year
- FuseAI Project ☆583 · Updated 9 months ago
- Generative Representational Instruction Tuning ☆678 · Updated 4 months ago
- Implementation of paper Data Engineering for Scaling Language Models to 128K Context ☆477 · Updated last year
- Data and code for FreshLLMs (https://arxiv.org/abs/2310.03214) ☆377 · Updated 3 weeks ago
- [ACL 2024] Progressive LLaMA with Block Expansion. ☆511 · Updated last year
- [ACL'24] Selective Reflection-Tuning: Student-Selected Data Recycling for LLM Instruction-Tuning ☆365 · Updated last year
- LLMs can generate feedback on their work, use it to improve the output, and repeat this process iteratively. ☆753 · Updated last year
- Repo for Rho-1: Token-level Data Selection & Selective Pretraining of LLMs. ☆444 · Updated last year
- ☆552 · Updated 11 months ago
- ToolQA, a new dataset to evaluate the capabilities of LLMs in answering challenging questions with external tools. It offers two levels … ☆281 · Updated 2 years ago
- ☆313 · Updated last year
- Official implementation for the paper "DoLa: Decoding by Contrasting Layers Improves Factuality in Large Language Models" ☆523 · Updated 10 months ago
- Benchmarking LLMs with Challenging Tasks from Real Users ☆244 · Updated last year
- MathVista: data, code, and evaluation for Mathematical Reasoning in Visual Contexts ☆344 · Updated last month
- An Analytical Evaluation Board of Multi-turn LLM Agents [NeurIPS 2024 Oral] ☆360 · Updated last year
- [COLM 2024] LoraHub: Efficient Cross-Task Generalization via Dynamic LoRA Composition ☆657 · Updated last year
- [ACL2024] T-Eval: Evaluating Tool Utilization Capability of Large Language Models Step by Step ☆297 · Updated last year
- FireAct: Toward Language Agent Fine-tuning ☆284 · Updated 2 years ago
- The official evaluation suite and dynamic data release for MixEval. ☆252 · Updated last year
- Code and data for "Lumos: Learning Agents with Unified Data, Modular Design, and Open-Source LLMs" ☆471 · Updated last year
- AWM: Agent Workflow Memory ☆353 · Updated 9 months ago
- ToolkenGPT: Augmenting Frozen Language Models with Massive Tools via Tool Embeddings - NeurIPS 2023 (oral) ☆263 · Updated last year
- Complex Function Calling Benchmark. ☆147 · Updated 9 months ago
- [ICLR 2024 & NeurIPS 2023 WS] An Evaluator LM that is open-source, offers reproducible evaluation, and inexpensive to use. Specifically d… ☆306 · Updated 2 years ago
- ☆156 · Updated last year
- An implementation of Everything of Thoughts (XoT). ☆154 · Updated last year
- VisualWebArena is a benchmark for multimodal agents. ☆402 · Updated last year
- Codebase for Merging Language Models (ICML 2024) ☆859 · Updated last year
- [Preprint] Learning to Filter Context for Retrieval-Augmented Generation ☆197 · Updated last year