arthur-ai / bench
A tool for evaluating LLMs
☆407 Updated 10 months ago
Alternatives and similar repositories for bench:
Users who are interested in bench compare it to the libraries listed below.
- Domain Adapted Language Modeling Toolkit - E2E RAG ☆316 Updated 4 months ago
- Fiddler Auditor is a tool to evaluate language models. ☆176 Updated last year
- Python SDK for running evaluations on LLM generated responses ☆272 Updated last week
- Automated Evaluation of RAG Systems ☆562 Updated 4 months ago
- data cleaning and curation for unstructured text ☆329 Updated 7 months ago
- 🔍 LangKit: An open-source toolkit for monitoring Large Language Models (LLMs). 📚 Extracts signals from prompts & responses, ensuring sa… ☆891 Updated 4 months ago
- Data-Driven Evaluation for LLM-Powered Applications ☆484 Updated 2 months ago
- ☆761 Updated last year
- In-Context Learning for eXtreme Multi-Label Classification (XMC) using only a handful of examples. ☆414 Updated last year
- OpenTelemetry Instrumentation for AI Observability ☆339 Updated this week
- The Rule-based Retrieval package is a Python package that enables you to create and manage Retrieval Augmented Generation (RAG) applicati… ☆235 Updated 5 months ago
- Tuning and Evaluation of RAG pipeline. (Automated optimization to be added soon) ☆263 Updated last year
- ☆184 Updated last year
- 🦜💯 Flex those feathers! ☆242 Updated 5 months ago
- LLM Comparator is an interactive data visualization tool for evaluating and analyzing LLM responses side-by-side, developed by the PAIR t… ☆396 Updated last month
- This open-source repository offers reference code for integrating workplace datastores with Cohere's LLMs, enabling developers and busine… ☆148 Updated 5 months ago
- Metrics to evaluate the quality of responses of your Retrieval Augmented Generation (RAG) applications. ☆288 Updated 4 months ago
- Repository to demonstrate Chain of Table reasoning with multiple tables powered by LangGraph ☆144 Updated 11 months ago
- Sample notebooks and prompts for LLM evaluation ☆123 Updated 3 months ago
- wandbot is a technical support bot for Weights & Biases' AI developer tools that can run in Discord, Slack, ChatGPT and Zendesk ☆289 Updated 3 weeks ago
- Code and data for "Lumos: Learning Agents with Unified Data, Modular Design, and Open-Source LLMs" ☆463 Updated last year
- Evaluate your LLM's response with Prometheus and GPT4 💯 ☆883 Updated this week
- Fine-Tuning Embedding for RAG with Synthetic Data ☆489 Updated last year
- Open-Source Implementation of WizardLM to turn documents into Q:A pairs for LLM fine-tuning ☆300 Updated 5 months ago
- SelfCheckGPT: Zero-Resource Black-Box Hallucination Detection for Generative Large Language Models ☆501 Updated 8 months ago
- 🍰 PromptLayer - Maintain a log of your prompts and OpenAI API requests. Track, debug, and replay old completions. ☆565 Updated this week
- Open Source LLM toolkit to build trustworthy LLM applications. TigerArmor (AI safety), TigerRAG (embedding, RAG), TigerTune (fine-tuning) ☆392 Updated last year
- Meta-Prompting: Enhancing Language Models with Task-Agnostic Scaffolding ☆375 Updated last year
- LangSmith Client SDK Implementations ☆503 Updated this week
- LLM Prompt Injection Detector ☆1,215 Updated 7 months ago