alopatenko / LLMEvaluationLinks
A comprehensive guide to LLM evaluation methods, designed to help identify the most suitable evaluation techniques for a given use case, promote best practices in LLM assessment, and critically assess the effectiveness of these methods.
☆149 · Updated this week
Alternatives and similar repositories for LLMEvaluation
Users interested in LLMEvaluation are comparing it to the libraries listed below.
- awesome synthetic (text) datasets ☆305 · Updated 4 months ago
- ☆146 · Updated last year
- A small library of LLM judges (see the judge sketch after this list) ☆301 · Updated 3 months ago
- Sample notebooks and prompts for LLM evaluation ☆153 · Updated last week
- ARAGOG: Advanced RAG Output Grading. Exploring and comparing various Retrieval-Augmented Generation (RAG) techniques on AI research paper… ☆114 · Updated last year
- RAGElo is a set of tools that helps you select the best RAG-based LLM agents by using an Elo ranker (see the Elo sketch after this list) ☆122 · Updated 2 weeks ago
- Attribute (or cite) statements generated by LLMs back to in-context information. ☆297 · Updated last year
- Starter pack for the NeurIPS LLM Efficiency Challenge 2023. ☆126 · Updated 2 years ago
- Codebase accompanying the Summary of a Haystack paper. ☆79 · Updated last year
- Benchmark various LLM structured-output frameworks: Instructor, Mirascope, Langchain, LlamaIndex, Fructose, Marvin, Outlines, etc., on task… ☆179 · Updated last year
- Banishing LLM Hallucinations Requires Rethinking Generalization ☆275 · Updated last year
- [ACL'25] Official code for LlamaDuo: LLMOps Pipeline for Seamless Migration from Service LLMs to Small-Scale Local LLMs ☆314 · Updated 4 months ago
- ☆225 · Updated 11 months ago
- Official repo for the paper PHUDGE: Phi-3 as Scalable Judge. Evaluate your LLMs with or without custom rubric, reference answer, absolute… ☆50 · Updated last year
- In-Context Learning for eXtreme Multi-Label Classification (XMC) using only a handful of examples. ☆442 · Updated last year
- This is the reproduction repository for my 🤗 Hugging Face blog post on synthetic data ☆68 · Updated last year
- A set of scripts and notebooks on LLM fine-tuning and dataset creation ☆111 · Updated last year
- Initiative to evaluate and rank the most popular LLMs across common task types based on their propensity to hallucinate. ☆115 · Updated 3 months ago
- Low-latency, high-accuracy custom query routers for humans and agents. Built by Prithivi Da ☆117 · Updated 7 months ago
- Notebooks for training universal zero-shot classifiers on many different tasks ☆136 · Updated 10 months ago
- LLM Comparator is an interactive data visualization tool for evaluating and analyzing LLM responses side-by-side, developed by the PAIR t… ☆495 · Updated 9 months ago
- ☆43 · Updated last year
- Vision Document Retrieval (ViDoRe): Benchmark. Evaluation code for the ColPali paper. ☆247 · Updated 3 months ago
- LangFair is a Python library for conducting use-case level LLM bias and fairness assessments ☆241 · Updated 2 weeks ago
- Model, Code & Data for the EMNLP'23 paper "Making Large Language Models Better Data Creators" ☆134 · Updated 2 years ago
- ☆96 · Updated 7 months ago
- Benchmarking library for RAG ☆239 · Updated last month
- Automatically evaluate your LLMs in Google Colab ☆667 · Updated last year
- Let's build better datasets, together! ☆264 · Updated 10 months ago
- Simple UI for debugging correlations of text embeddings ☆299 · Updated 5 months ago
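
Several repositories above package LLM-as-judge evaluation. As a point of reference for what such libraries automate, here is a minimal pairwise-judge sketch in Python; the prompt, the `judge_pair` helper, and the model name are illustrative assumptions, not the API of any repository listed here.

```python
# Minimal pairwise LLM-as-judge sketch (assumed prompt and model name;
# the judge libraries above each ship their own interfaces).
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_PROMPT = """You are an impartial judge. Given a question and two answers,
reply with exactly "A", "B", or "tie" for whichever answer is better.

Question: {question}
Answer A: {answer_a}
Answer B: {answer_b}"""

def judge_pair(question: str, answer_a: str, answer_b: str) -> str:
    """Return the judge's verdict: "A", "B", or "tie"."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumption: any capable judge model works here
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            question=question, answer_a=answer_a, answer_b=answer_b)}],
        temperature=0.0,  # deterministic verdicts make evaluations repeatable
    )
    return response.choices[0].message.content.strip()
```

In practice, judges are sensitive to answer order, so swapping A and B and re-judging is a common debiasing step.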
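
RAGElo's core idea, ranking RAG agents from pairwise judge verdicts with an Elo rating system, reduces to the standard Elo update. A minimal sketch under that assumption; the `elo_update` helper and agent names are hypothetical, not RAGElo's actual interface:

```python
# Standard Elo update applied to pairwise comparisons between RAG agents.
def elo_update(rating_a: float, rating_b: float, score_a: float,
               k: float = 32.0) -> tuple[float, float]:
    """score_a is 1.0 if A wins the comparison, 0.5 for a tie, 0.0 if A loses."""
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))
    new_a = rating_a + k * (score_a - expected_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return new_a, new_b

# Start every agent at the same rating and fold in judge verdicts one at a time.
ratings = {"agent_bm25": 1000.0, "agent_hybrid": 1000.0}
ratings["agent_bm25"], ratings["agent_hybrid"] = elo_update(
    ratings["agent_bm25"], ratings["agent_hybrid"], score_a=0.0  # hybrid judged better
)
```

The K-factor controls how quickly ratings move after each comparison; running many shuffled pairwise comparisons and sorting the final ratings yields the leaderboard.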