PAIR-code / llm-comparator
LLM Comparator is an interactive data visualization tool for evaluating and analyzing LLM responses side-by-side, developed by the PAIR team.
☆415 · Updated 2 months ago
Alternatives and similar repositories for llm-comparator:
Users interested in llm-comparator are comparing it to the libraries listed below.
- Evaluate your LLM's response with Prometheus and GPT4 🎯 ☆911 · Updated last month
- awesome synthetic (text) datasets ☆272 · Updated 5 months ago
- Automatically evaluate your LLMs in Google Colab ☆615 · Updated 11 months ago
- This project showcases an LLMOps pipeline that fine-tunes a small LLM to serve as a fallback during outages of the primary service LLM. ☆303 · Updated 3 weeks ago
- SelfCheckGPT: Zero-Resource Black-Box Hallucination Detection for Generative Large Language Models ☆509 · Updated 9 months ago
- Automatic evals for LLMs ☆373 · Updated this week
- Automated Evaluation of RAG Systems ☆579 · Updated 3 weeks ago
- Lighteval is your all-in-one toolkit for evaluating LLMs across multiple backends ☆1,438 · Updated last week
- Model, Code & Data for the EMNLP'23 paper "Making Large Language Models Better Data Creators" ☆131 · Updated last year
- ☆515 · Updated 5 months ago
- A Lightweight Library for AI Observability ☆241 · Updated 2 months ago
- A comprehensive guide to LLM evaluation methods designed to assist in identifying the most suitable evaluation techniques for various use… ☆114 · Updated last week
- TapeAgents is a framework that facilitates all stages of the LLM Agent development lifecycle ☆257 · Updated this week
- Banishing LLM Hallucinations Requires Rethinking Generalization ☆273 · Updated 9 months ago
- GitHub repository for "RAGTruth: A Hallucination Corpus for Developing Trustworthy Retrieval-Augmented Language Models" ☆170 · Updated 4 months ago
- Code and Data for Tau-Bench ☆437 · Updated 3 months ago
- 🤗 Benchmark Large Language Models Reliably On Your Data ☆240 · Updated last week
- Let's build better datasets, together! ☆259 · Updated 4 months ago
- Tutorial for building an LLM router ☆193 · Updated 9 months ago
- ☆162 · Updated 4 months ago
- Sharing both practical insights and theoretical knowledge about LLM evaluation that we gathered while managing the Open LLM Leaderboard a… ☆1,143 · Updated 3 months ago
- Generative Representational Instruction Tuning ☆620 · Updated last month
- A reading list on LLM based Synthetic Data Generation 🔥 ☆1,246 · Updated 2 months ago
- An agent benchmark with tasks in a simulated software company. ☆294 · Updated 2 weeks ago
- Benchmark various LLM Structured Output frameworks: Instructor, Mirascope, Langchain, LlamaIndex, Fructose, Marvin, Outlines, etc on task… ☆163 · Updated 7 months ago
- DataDreamer: Prompt. Generate Synthetic Data. Train & Align Models. ☆1,010 · Updated 2 months ago
- Vision Document Retrieval (ViDoRe): Benchmark. Evaluation code for the ColPali paper. ☆197 · Updated 2 weeks ago
- [ICLR 2025] Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing. Your efficient and high-quality synthetic data… ☆681 · Updated last month
- Initiative to evaluate and rank the most popular LLMs across common task types based on their propensity to hallucinate. ☆107 · Updated 7 months ago
- Official repository for ORPO ☆448 · Updated 10 months ago