arthur-ai / benchLinks

A tool for evaluating LLMs

☆423

Alternatives and similar repositories for bench

Users that are interested in bench are comparing it to the libraries listed below

Sorting:

tigerlab-ai / tiger
Open Source LLM toolkit to build trustworthy LLM applications. TigerArmor (AI safety), TigerRAG (embedding, RAG), TigerTune (fine-tuning)
☆398Updated last year
athina-ai / athina-evals
Python SDK for running evaluations on LLM generated responses
☆289Updated last month
fiddler-labs / fiddler-auditor
Fiddler Auditor is a tool to evaluate language models.
☆184Updated last year
whylabs / langkit
🔍 LangKit: An open-source toolkit for monitoring Large Language Models (LLMs). 📚 Extracts signals from prompts & responses, ensuring sa…
☆930Updated 8 months ago
TonicAI / tonic_validate
Metrics to evaluate the quality of responses of your Retrieval Augmented Generation (RAG) applications.
☆314Updated 3 weeks ago
arcee-ai / DALM
Domain Adapted Language Modeling Toolkit - E2E RAG
☆325Updated 8 months ago
philschmid / easyllm
☆461Updated last year
KarelDO / xmc.dspy
In-Context Learning for eXtreme Multi-Label Classification (XMC) using only a handful of examples.
☆433Updated last year
taylorai / galactic
data cleaning and curation for unstructured text
☆328Updated 11 months ago
langchain-ai / langchain-benchmarks
🦜💯 Flex those feathers!
☆252Updated 9 months ago
run-llama / ai-engineer-workshop
☆185Updated last year
stanford-futuredata / ARES
Automated Evaluation of RAG Systems
☆633Updated 4 months ago
MagnivOrg / prompt-layer-library
🍰 PromptLayer - Maintain a log of your prompts and OpenAI API requests. Track, debug, and replay old completions.
☆649Updated last week
scaleapi / llm-engine
Scale LLM Engine public repository
☆808Updated 2 weeks ago
mlabonne / llm-autoeval
Automatically evaluate your LLMs in Google Colab
☆649Updated last year
run-llama / finetune-embedding
Fine-Tuning Embedding for RAG with Synthetic Data
☆504Updated last year
v7labs / benchllm
Continuous Integration for LLM powered applications
☆247Updated last year
langchain-ai / auto-evaluator
☆772Updated last month
whyhow-ai / rule-based-retrieval
The Rule-based Retrieval package is a Python package that enables you to create and manage Retrieval Augmented Generation (RAG) applicati…
☆246Updated 9 months ago
wandb / wandbot
wandbot is a technical support bot for Weights & Biases' AI developer tools that can run in Discord, Slack, ChatGPT and Zendesk
☆302Updated this week
alopatenko / LLMEvaluation
A comprehensive guide to LLM evaluation methods designed to assist in identifying the most suitable evaluation techniques for various use…
☆129Updated 3 weeks ago
h2oai / h2o-wizardlm
Open-Source Implementation of WizardLM to turn documents into Q:A pairs for LLM fine-tuning
☆312Updated 9 months ago
epfl-dlab / aiflows
🤖🌊 aiFlows: The building blocks of your collaborative AI
☆258Updated last year
misbahsy / RAGTune
Tuning and Evaluation of RAG pipeline. (Automated optimization to be added soon)
☆264Updated last year
rajshah4 / LLM-Evaluation
Sample notebooks and prompts for LLM evaluation
☆136Updated last month
pchunduri6 / rag-demystified
An LLM-powered advanced RAG pipeline built from scratch
☆845Updated last year
ganarajpr / awesome-dspy
An Awesome list of curated DSPy resources.
☆386Updated 5 months ago
modal-labs / llm-finetuning
Guide for fine-tuning Llama/Mistral/CodeLlama models and more
☆613Updated 2 months ago
relari-ai / continuous-eval
Data-Driven Evaluation for LLM-Powered Applications
☆501Updated 6 months ago
jxnl / n-levels-of-rag
☆195Updated last year