PAIR-code / llm-comparator
LLM Comparator is an interactive data visualization tool for evaluating and analyzing LLM responses side-by-side, developed by the PAIR team.
☆415 · Updated 2 months ago
Alternatives and similar repositories for llm-comparator:
Users interested in llm-comparator are comparing it to the libraries listed below.
- Evaluate your LLM's response with Prometheus and GPT4 🎯 ☆911 · Updated last month
- awesome synthetic (text) datasets ☆272 · Updated 5 months ago
- Automatically evaluate your LLMs in Google Colab ☆615 · Updated 11 months ago
- This project showcases an LLMOps pipeline that fine-tunes a small LLM to serve as a fallback during outages of the primary service LLM. ☆303 · Updated 3 weeks ago
- SelfCheckGPT: Zero-Resource Black-Box Hallucination Detection for Generative Large Language Models ☆509 · Updated 9 months ago
- Automatic evals for LLMs ☆373 · Updated this week
- Automated Evaluation of RAG Systems ☆579 · Updated 3 weeks ago
- Lighteval is your all-in-one toolkit for evaluating LLMs across multiple backends ☆1,438 · Updated last week
- Model, Code & Data for the EMNLP'23 paper "Making Large Language Models Better Data Creators" ☆131 · Updated last year
- ☆515 · Updated 5 months ago
- A Lightweight Library for AI Observability ☆241 · Updated 2 months ago
- A comprehensive guide to LLM evaluation methods designed to assist in identifying the most suitable evaluation techniques for various use… ☆114 · Updated last week
- TapeAgents is a framework that facilitates all stages of the LLM Agent development lifecycle ☆257 · Updated this week
- Banishing LLM Hallucinations Requires Rethinking Generalization ☆273 · Updated 9 months ago
- GitHub repository for "RAGTruth: A Hallucination Corpus for Developing Trustworthy Retrieval-Augmented Language Models" ☆170 · Updated 4 months ago
- Code and Data for Tau-Bench ☆437 · Updated 3 months ago
- 🤗 Benchmark Large Language Models Reliably On Your Data ☆240 · Updated last week
- Let's build better datasets, together! ☆259 · Updated 4 months ago
- Tutorial for building an LLM router ☆193 · Updated 9 months ago
- ☆162 · Updated 4 months ago
- Sharing both practical insights and theoretical knowledge about LLM evaluation that we gathered while managing the Open LLM Leaderboard a… ☆1,143 · Updated 3 months ago
- Generative Representational Instruction Tuning ☆620 · Updated last month
- A reading list on LLM based Synthetic Data Generation 🔥 ☆1,246 · Updated 2 months ago
- An agent benchmark with tasks in a simulated software company. ☆294 · Updated 2 weeks ago
- Benchmark various LLM Structured Output frameworks: Instructor, Mirascope, Langchain, LlamaIndex, Fructose, Marvin, Outlines, etc on task… ☆163 · Updated 7 months ago
- DataDreamer: Prompt. Generate Synthetic Data. Train & Align Models. ☆1,010 · Updated 2 months ago
- Vision Document Retrieval (ViDoRe): Benchmark. Evaluation code for the ColPali paper. ☆197 · Updated 2 weeks ago
- [ICLR 2025] Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing. Your efficient and high-quality synthetic data… ☆681 · Updated last month
- Initiative to evaluate and rank the most popular LLMs across common task types based on their propensity to hallucinate. ☆107 · Updated 7 months ago
- Official repository for ORPO ☆448 · Updated 10 months ago