aiverify-foundation / LLM-Evals-Catalogue
This repository stems from our paper, “Cataloguing LLM Evaluations”, and serves as a living, collaborative catalogue of LLM evaluation frameworks, benchmarks and papers.
☆18 · Updated last year
Alternatives and similar repositories for LLM-Evals-Catalogue
Users interested in LLM-Evals-Catalogue are comparing it to the libraries listed below.
- ☆73 · Updated 11 months ago
- ARAGOG - Advanced RAG Output Grading. Exploring and comparing various Retrieval-Augmented Generation (RAG) techniques on AI research paper… ☆112 · Updated last year
- Sample notebooks and prompts for LLM evaluation ☆138 · Updated 3 months ago
- A comprehensive guide to LLM evaluation methods designed to assist in identifying the most suitable evaluation techniques for various use… ☆140 · Updated this week
- Initiative to evaluate and rank the most popular LLMs across common task types based on their propensity to hallucinate. ☆115 · Updated last month
- ☆146 · Updated last year
- A framework for fine-tuning retrieval-augmented generation (RAG) systems. ☆130 · Updated this week
- LangFair is a Python library for conducting use-case level LLM bias and fairness assessments ☆232 · Updated 2 weeks ago
- Benchmark various LLM structured-output frameworks: Instructor, Mirascope, LangChain, LlamaIndex, Fructose, Marvin, Outlines, etc. on task… ☆179 · Updated last year
- ☆20 · Updated last year
- EvalAssist is an open-source project that simplifies using large language models as evaluators (LLM-as-a-Judge) of the output of other la… ☆80 · Updated this week
- ☆37 · Updated last year
- Research repository on interfacing LLMs with Weaviate APIs. Inspired by the Berkeley Gorilla LLM. ☆135 · Updated last month
- ☆89 · Updated 4 months ago
- ☆78 · Updated last week
- Optimized Large Language Models for Financial Applications – Efficient, Scalable, and Domain-Specific AI for Finance. ☆51 · Updated 2 months ago
- 🔧 Compare how Agent systems perform on several benchmarks. 📊🚀 ☆102 · Updated last month
- Ranking LLMs on agentic tasks ☆188 · Updated 2 weeks ago
- RAGElo is a set of tools that helps you select the best RAG-based LLM agents using an Elo ranker ☆117 · Updated this week
- Testing and evaluation framework for voice agents ☆150 · Updated 3 months ago
- Official repo for the paper PHUDGE: Phi-3 as Scalable Judge. Evaluate your LLMs with or without custom rubric, reference answer, absolute… ☆49 · Updated last year
- Repository to demonstrate Chain of Table reasoning with multiple tables powered by LangGraph ☆147 · Updated last year
- ☆35 · Updated 4 months ago
- A semantic research engine to get relevant papers based on a user query. Application frontend with Chainlit Copilot. Observability with L… ☆83 · Updated last year
- ☆53 · Updated 11 months ago
- ☆124 · Updated 7 months ago
- RAGArch is a Streamlit-based application that empowers users to experiment with various components and parameters of Retrieval-Augmented … ☆85 · Updated last year
- ☆30 · Updated last year
- Enterprise-grade memory framework for LLMs featuring GPU-optimized inference, vector storage, and automated scaling. Enables hyper-person… ☆89 · Updated 4 months ago
- A curated list of awesome synthetic data tools (open source and commercial). ☆208 · Updated last year