open-evals / evals

Evals is a framework for evaluating OpenAI models and an open-source registry of benchmarks.

☆19

Alternatives and similar repositories for evals

Users who are interested in evals are comparing it to the repositories listed below.

SjJ1017 / CiteLab
☆17 Updated 2 months ago
xlang-ai / batch-prompting
[EMNLP 2023 Industry Track] A simple prompting approach that enables the LLMs to run inference in batches.
☆74 Updated last year
kyleliang919 / Online-Subspace-Descent
This repo is based on https://github.com/jiaweizzhao/GaLore
☆29 Updated 9 months ago
kernelmachine / cbtm
Code repository for the c-BTM paper
☆106 Updated last year
argilla-io / distilabel-spin-dibt
Repository containing the SPIN experiments on the DIBT 10k ranked prompts
☆24 Updated last year
dvlab-research / MR-GSM8K
Challenge LLMs to Reason About Reasoning: A Benchmark to Unveil Cognitive Depth in LLMs
☆48 Updated 11 months ago
EleutherAI / stackexchange-dataset
Python tools for processing the stackexchange data dumps into a text dataset for Language Models
☆81 Updated last year
xhan77 / in-context-alignment
In-Context Alignment: Chat with Vanilla Language Models Before Fine-Tuning
☆35 Updated last year
allenai / sso
Repository for Skill Set Optimization
☆13 Updated 11 months ago
wwxu21 / CUT
Source code of "Reasons to Reject? Aligning Language Models with Judgments"
☆58 Updated last year
facebookresearch / lss_eval
This is a new metric that can be used to evaluate faithfulness of text generated by LLMs. The work behind this repository can be found he…
☆31 Updated last year
jdf-prog / LLM-Engines
☆50 Updated 3 weeks ago
Harry-Chan / seq2seqlm-on-qg
☆14 Updated 3 years ago
changjonathanc / tw_rouge
ROUGE score calculator with Traditional Chinese word segmentation
☆9 Updated 4 years ago
Zyphra / Zyda_processing
☆35 Updated last year
chtmp223 / suri
Suri: Multi-constraint instruction following for long-form text generation (EMNLP’24)
☆23 Updated 7 months ago
WHGTyen / BIG-Bench-Mistake
A dataset of LLM-generated chain-of-thought steps annotated with mistake location.
☆81 Updated 10 months ago
sunyt32 / torchscale
Transformers at any scale
☆41 Updated last year
princeton-nlp / PTP
Improving Language Understanding from Screenshots. Paper: https://arxiv.org/abs/2402.14073
☆28 Updated 11 months ago
Gen-Verse / CURE
Open-Source LLM Coders with Co-Evolving Reinforcement Learning
☆87 Updated 3 weeks ago
ContextualAI / CLAIR_and_APO
Anchored Preference Optimization and Contrastive Revisions: Addressing Underspecification in Alignment
☆57 Updated 9 months ago
kyegomez / Infini-attention
Implementation of the paper: "Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention" from Google in pyTO…
☆55 Updated this week
AlexWan0 / infini-gram
An unofficial implementation of the Infini-gram model proposed by Liu et al. (2024)
☆33 Updated last year
GSYfate / knnlm-limits
Official code repo for paper "Great Memory, Shallow Reasoning: Limits of kNN-LMs"
☆23 Updated last month
da03 / WildVisualizer
☆22 Updated this week
scottlogic-alex / prm800k-denorm
Script for processing OpenAI's PRM800K process supervision dataset into an Alpaca-style instruction-response format
☆27 Updated last year
TheDuckAI / arb
Advanced Reasoning Benchmark Dataset for LLMs
☆47 Updated last year
kyegomez / Reka-Torch
Implementation of the model: "Reka Core, Flash, and Edge: A Series of Powerful Multimodal Language Models" in PyTorch
☆30 Updated this week
austrian-code-wizard / c3po
☆27 Updated this week
john-hewitt / implicit-ins
Codebase for Instruction Following without Instruction Tuning
☆34 Updated 9 months ago