IBM / eval-assist
EvalAssist is an open-source project that simplifies using large language models as judges (LLM-as-a-Judge) of the output of other large language models, supporting users in iteratively refining evaluation criteria through a web-based user experience.
★27 · Updated this week
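For context on the LLM-as-a-Judge pattern that EvalAssist builds on, here is a minimal generic sketch in Python. This is not EvalAssist's own API: the criterion text, the `judge` helper, and the model name are illustrative assumptions, and the OpenAI client stands in for any chat-completions endpoint.

```python
# Minimal LLM-as-a-Judge sketch (illustrative; not EvalAssist's API).
# Assumes the `openai` package and an OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()

# A criterion you would iteratively refine, e.g. in a tool like EvalAssist.
CRITERION = (
    "Rate the response for conciseness on a 1-5 scale: "
    "5 = no redundant content, 1 = mostly filler."
)

def judge(prompt: str, response: str, model: str = "gpt-4o-mini") -> str:
    """Ask a judge model to score one response against the criterion."""
    completion = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": CRITERION},
            {
                "role": "user",
                "content": f"Prompt:\n{prompt}\n\nResponse:\n{response}\n\n"
                           "Reply with the score and a one-sentence justification.",
            },
        ],
    )
    return completion.choices[0].message.content

if __name__ == "__main__":
    print(judge("Summarize photosynthesis.",
                "Plants convert light into chemical energy."))
```

Tools like EvalAssist wrap this loop in a UI so the criterion itself, rather than the plumbing, becomes the thing you iterate on.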
Alternatives and similar repositories for eval-assist
Users interested in eval-assist are comparing it to the libraries listed below.
- 📦 Unitxt is a Python library for enterprise-grade evaluation of AI performance, offering the world's largest catalog of tools and data … (see the usage sketch after this list) ★199 · Updated this week
- Codebase release for an EMNLP 2023 paper ★19 · Updated last month
- Synthetic Data Generation for Foundation Models ★21 · Updated 4 months ago
- A package dedicated to running benchmark agreement testing ★16 · Updated last month
- The Granite Guardian models are designed to detect risks in prompts and responses. ★88 · Updated 3 months ago
- LM engine is a library for pretraining/finetuning LLMs ★57 · Updated this week
- Python framework that enables you to transform how a user calls or infers an IBM Granite model and how the output from the model is returned… ★30 · Updated this week
- TARGET is a benchmark for evaluating Table Retrieval for Generative Tasks such as Fact Verification and Text-to-SQL ★22 · Updated 2 weeks ago
- Contains all assets to run with the Moonshot Library (Connectors, Datasets and Metrics) ★35 · Updated this week
- Collection of tuning recipes with HuggingFace SFTTrainer and PyTorch FSDP. ★47 · Updated this week
- Embedding Recycling for Language Models ★38 · Updated last year
- Interpretable and efficient predictors using pre-trained language models. Scikit-learn compatible. ★42 · Updated 3 months ago
- We believe the ability of an LLM to attribute the text that it generates is likely to be crucial for both system developers and users in … ★54 · Updated last year
- Project Debater Early Access Program Tutorial ★24 · Updated last month
- Combining Base and Instruction-Tuned Language Models for Better Synthetic Data Generation ★33 · Updated 4 months ago
- Efficient multi-prompt evaluation of LLMs ★19 · Updated 6 months ago
- ★17 · Updated 3 months ago
- Code associated with the paper "Entropy-based Attention Regularization Frees Unintended Bias Mitigation from Lists" ★49 · Updated 3 years ago
- Application code for a generative AI analytics platform ★26 · Updated last month
- PyTorch package to train and audit ML models for Individual Fairness ★66 · Updated last month
- Official repo for SAC3: Reliable Hallucination Detection in Black-Box Language Models via Semantic-aware Cross-check Consistency ★35 · Updated 5 months ago
- Truly flash implementation of the DeBERTa disentangled attention mechanism ★58 · Updated last month
- Evaluate uncertainty, calibration, accuracy, and fairness of LLMs on real-world survey data! ★22 · Updated 2 months ago
- Multi-Turn RAG Benchmark ★58 · Updated last month
- ★41 · Updated 5 months ago
- Code Prompting Elicits Conditional Reasoning Abilities in Text+Code LLMs (EMNLP 2024) ★23 · Updated 7 months ago
- A framework for fine-tuning retrieval-augmented generation (RAG) systems ★112 · Updated this week
- This repository contains data, code, and models for contextual noncompliance ★23 · Updated 11 months ago
- Official Repository for Dataset Inference for LLMs ★34 · Updated 11 months ago
- Counterfactual Local Explanations of AI systems ★28 · Updated 3 years ago
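Since Unitxt tops the list above, a quick sketch of its load/evaluate flow may help when comparing it to eval-assist. This follows Unitxt's documented recipe pattern, but the specific card and template names, and the shape of the returned results, are assumptions; check the Unitxt catalog and docs for your installed version.

```python
# Sketch of a Unitxt evaluation loop (card/template names are assumptions;
# consult the Unitxt catalog for your installed version).
from unitxt import load_dataset, evaluate

# Build a dataset from a catalog card and template recipe.
dataset = load_dataset(
    card="cards.wnli",
    template="templates.classification.multi_class.relation.default",
)

# Predictions would normally come from the model under evaluation;
# here we fake them to keep the sketch self-contained.
test_set = dataset["test"]
predictions = ["entailment" for _ in test_set]

# Score predictions with the metrics the card specifies.
# (The exact structure of `results` varies across Unitxt versions.)
results = evaluate(predictions=predictions, data=test_set)
print(results)
```

The design difference from eval-assist is worth noting: Unitxt scores outputs against a catalog of predefined metrics, while EvalAssist focuses on interactively authoring and refining the judging criteria themselves.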