h2oai / h2o-LLM-evalLinks

Large-language Model Evaluation framework with Elo Leaderboard and A-B testing

☆52

Alternatives and similar repositories for h2o-LLM-eval

Users that are interested in h2o-LLM-eval are comparing it to the libraries listed below

Sorting:

reasoning-machines / prompt-lib
A set of utilities for running few-shot prompting experiments on large-language models
☆122Updated last year
salesforce / summary-of-a-haystack
Codebase accompanying the Summary of a Haystack paper.
☆79Updated 10 months ago
zetaalphavector / RAGElo
RAGElo is a set of tools that helps you selecting the best RAG-based LLM agents by using an Elo ranker
☆114Updated 3 weeks ago
wang-research-lab / agentinstruct
Code repo for "Agent Instructs Large Language Models to be General Zero-Shot Reasoners"
☆115Updated 10 months ago
chaitanyamalaviya / ExpertQA
[Data + code] ExpertQA : Expert-Curated Questions and Attributed Answers
☆131Updated last year
automix-llm / automix
Mixing Language Models with Self-Verification and Meta-Verification
☆105Updated 7 months ago
daniel-furman / sft-demos
Lightweight demos for finetuning LLMs. Powered by 🤗 transformers and open-source datasets.
☆77Updated 9 months ago
neulab / ragged
Retrieval Augmented Generation Generalized Evaluation Dataset
☆54Updated 3 weeks ago
explodinggradients / nemesis
Reward Model framework for LLM RLHF
☆61Updated 2 years ago
togethercomputer / Llama-2-7B-32K-Instruct
☆85Updated last year
Anni-Zou / Meta-CoT
Meta-CoT: Generalizable Chain-of-Thought Prompting in Mixed-task Scenarios with Large Language Models
☆97Updated last year
allenai / catwalk
This project studies the performance and robustness of language models and task-adaptation methods.
☆150Updated last year
SALT-NLP / demonstrated-feedback
☆125Updated 10 months ago
jakespringer / echo-embeddings
☆152Updated last year
aymeric-roucher / agent_reasoning_benchmark
🔧 Compare how Agent systems perform on several benchmarks. 📊🚀
☆99Updated this week
SalesforceAIResearch / SFR-RAG
☆77Updated 6 months ago
DaoD / INTERS
This is the repository for our paper "INTERS: Unlocking the Power of Large Language Models in Search with Instruction Tuning"
☆204Updated 7 months ago
TIGER-AI-Lab / StructLM
Code and data for "StructLM: Towards Building Generalist Models for Structured Knowledge Grounding" (COLM 2024)
☆75Updated 9 months ago
yueyu1030 / AttrPrompt
[NeurIPS 2023] This is the code for the paper `Large Language Model as Attributed Training Data Generator: A Tale of Diversity and Bias`.
☆152Updated last year
declare-lab / flacuna
Flacuna was developed by fine-tuning Vicuna on Flan-mini, a comprehensive instruction collection encompassing various tasks. Vicuna is al…
☆111Updated last year
bhargaviparanjape / language-programmes
☆172Updated 2 years ago
sail-sg / sailcraft
🚢 Data Toolkit for Sailor Language Models
☆94Updated 5 months ago
olly-styles / WorkBench
WorkBench: a Benchmark Dataset for Agents in a Realistic Workplace Setting.
☆43Updated last year
deshwalmahesh / PHUDGE
Official repo for the paper PHUDGE: Phi-3 as Scalable Judge. Evaluate your LLMs with or without custom rubric, reference answer, absolute…
☆49Updated last year
patronus-ai / Lynx-hallucination-detection
☆41Updated last year
msclar / formatspread
Code accompanying "How I learned to start worrying about prompt formatting".
☆108Updated 2 months ago
sileod / tasksource
Datasets collection and preprocessings framework for NLP extreme multitask learning
☆185Updated last month
pygongnlp / CoSearchAgent
[SIGIR 2024 (Demo)] CoSearchAgent: A Lightweight Collborative Search Agent with Large Language Models
☆28Updated last year
princeton-nlp / LitSearch
[EMNLP 2024] A Retrieval Benchmark for Scientific Literature Search
☆93Updated 8 months ago
gersteinlab / Struc-Bench
[NAACL 2024] Struc-Bench: Are Large Language Models Good at Generating Complex Structured Tabular Data? https://aclanthology.org/2024.naa…
☆54Updated last week