PAIR-code / llm-comparator
LLM Comparator is an interactive data visualization tool for evaluating and analyzing LLM responses side-by-side, developed by the PAIR team.
☆365 · Updated 3 months ago
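The tool consumes a single JSON file of paired model responses. As a rough orientation, below is a minimal Python sketch of what such a file might look like; the field names (`models`, `examples`, `input_text`, `output_text_a`, `output_text_b`, `score`) are recalled from the project's documentation and should be treated as assumptions to verify against the repo's README.

```python
import json

# Minimal sketch of side-by-side comparison data for LLM Comparator.
# Field names are assumptions based on the project's documented JSON
# format; check the repo's README before relying on them.
data = {
    "models": [{"name": "model_a"}, {"name": "model_b"}],
    "examples": [
        {
            "input_text": "Summarize the plot of Hamlet in one sentence.",
            "output_text_a": "A Danish prince avenges his father's murder.",
            "output_text_b": "Hamlet is a play about a prince in Denmark.",
            # Assumed convention: positive score favors model A,
            # negative favors model B (e.g., an LLM-judge preference).
            "score": 0.5,
        },
    ],
}

with open("comparison_data.json", "w") as f:
    json.dump(data, f, indent=2)
```

The resulting file can then be loaded into the tool's web app for side-by-side inspection.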
Alternatives and similar repositories for llm-comparator:
Users who are interested in llm-comparator are comparing it to the libraries listed below.
- Evaluate your LLM's response with Prometheus and GPT4 💯 (☆854 · Updated 3 weeks ago)
- Automated Evaluation of RAG Systems (☆532 · Updated 2 months ago)
- Awesome synthetic (text) datasets (☆256 · Updated 3 months ago)
- Lighteval is your all-in-one toolkit for evaluating LLMs across multiple backends (☆1,022 · Updated this week)
- [ICLR 2024 & NeurIPS 2023 WS] An Evaluator LM that is open-source, offers reproducible evaluation, and is inexpensive to use. Specifically d… (☆295 · Updated last year)
- Automatically evaluate your LLMs in Google Colab (☆583 · Updated 8 months ago)
- Banishing LLM Hallucinations Requires Rethinking Generalization (☆269 · Updated 6 months ago)
- Agent-as-a-Judge and DevAI dataset (☆313 · Updated last week)
- Framework for enhancing LLMs for RAG tasks using fine-tuning (☆522 · Updated last month)
- A lightweight, low-dependency, unified API to use all common reranking and cross-encoder models (☆1,260 · Updated last week)
- Easily embed, cluster and semantically label text datasets (☆493 · Updated 10 months ago)
- An Open Source Toolkit For LLM Distillation (☆439 · Updated 3 weeks ago)
- Code for Husky, an open-source language agent that solves complex, multi-step reasoning tasks. Husky v1 addresses numerical, tabular and … (☆332 · Updated 7 months ago)
- Official repo for "LongRAG: Enhancing Retrieval-Augmented Generation with Long-context LLMs" (☆213 · Updated 5 months ago)
- AWM: Agent Workflow Memory (☆233 · Updated 2 months ago)
- SelfCheckGPT: Zero-Resource Black-Box Hallucination Detection for Generative Large Language Models (☆491 · Updated 7 months ago)
- Official repository for ORPO (☆432 · Updated 7 months ago)
- Let's build better datasets, together! (☆249 · Updated last month)
- Sample notebooks and prompts for LLM evaluation (☆119 · Updated 2 months ago)
- Python SDK for running evaluations on LLM-generated responses (☆255 · Updated 2 weeks ago)
- Code for explaining and evaluating late chunking (chunked pooling) (☆313 · Updated last month)
- Generative Representational Instruction Tuning (☆588 · Updated last week)
- Sharing both practical insights and theoretical knowledge about LLM evaluation that we gathered while managing the Open LLM Leaderboard a… (☆971 · Updated 3 weeks ago)
- [ACL'24] Selective Reflection-Tuning: Student-Selected Data Recycling for LLM Instruction-Tuning (☆347 · Updated 4 months ago)
- Distilabel is a framework for synthetic data and AI feedback for engineers who need fast, reliable and scalable pipelines based on verifi… (☆2,064 · Updated this week)
- Attribute (or cite) statements generated by LLMs back to in-context information (☆190 · Updated 3 months ago)