openai / simple-evalsLinks

☆4,125

Alternatives and similar repositories for simple-evals

Users that are interested in simple-evals are comparing it to the libraries listed below

Sorting:

allenai / open-instruct
AllenAI's post-training codebase
☆3,263Updated this week
gkamradt / LLMTest_NeedleInAHaystack
Doing simple retrieval from LLM models at various context lengths to measure accuracy
☆2,056Updated last year
zou-group / textgrad
TextGrad: Automatic ''Differentiation'' via Text -- using large language models to backpropagate textual gradients. Published in Nature.
☆3,012Updated 2 months ago
argilla-io / distilabel
Distilabel is a framework for synthetic data and AI feedback for engineers who need fast, reliable and scalable pipelines based on verifi…
☆2,903Updated this week
rllm-org / rllm
Democratizing Reinforcement Learning for LLMs
☆4,534Updated this week
maitrix-org / llm-reasoners
A library for advanced large language model reasoning
☆2,291Updated 4 months ago
arcee-ai / mergekit
Tools for merging pretrained large language models.
☆6,394Updated last month
meta-pytorch / torchtune
PyTorch native post-training library
☆5,547Updated this week
huggingface / lighteval
Lighteval is your all-in-one toolkit for evaluating LLMs across multiple backends
☆2,021Updated this week
openai / transformer-debugger
☆4,100Updated last year
openai / mle-bench
MLE-bench is a benchmark for measuring how well AI agents perform at machine learning engineering
☆1,029Updated last week
mlfoundations / dclm
DataComp for Language Models
☆1,381Updated last month
SWE-bench / SWE-bench
SWE-bench: Can Language Models Resolve Real-world Github Issues?
☆3,692Updated last week
tatsu-lab / alpaca_eval
An automatic evaluator for instruction-following language models. Human-validated, high-quality, cheap, and fast.
☆1,877Updated 2 months ago
NovaSky-AI / SkyThought
Sky-T1: Train your own O1 preview model within $450
☆3,341Updated 3 months ago
allenai / dolma
Data and tools for generating and inspecting OLMo pre-training data.
☆1,332Updated last month
mlabonne / llm-datasets
Curated list of datasets and tools for post-training.
☆3,792Updated 2 months ago
huggingface / datatrove
Freeing data processing from scripting madness by providing a set of platform-agnostic customizable pipeline processing blocks.
☆2,687Updated last week
huggingface / alignment-handbook
Robust recipes to align language models with human and AI preferences
☆5,406Updated last month
openai / prm800k
800,000 step-level correctness labels on LLM solutions to MATH problems
☆2,058Updated 2 years ago
allenai / OLMo
Modeling, training, eval, and inference code for OLMo
☆6,044Updated last week
huggingface / search-and-learn
Recipes to scale inference-time compute of open models
☆1,111Updated 5 months ago
open-thoughts / open-thoughts
Fully open data curation for reasoning models
☆2,120Updated last month
stanford-crfm / helm
Holistic Evaluation of Language Models (HELM) is an open source Python framework created by the Center for Research on Foundation Models …
☆2,519Updated this week
ShengranHu / ADAS
[ICLR 2025] Automated Design of Agentic Systems
☆1,438Updated 8 months ago
Open-Source-O1 / Open-O1
☆1,350Updated 11 months ago
AIDC-AI / Marco-o1
An Open Large Reasoning Model for Real-World Solutions
☆1,522Updated 4 months ago
tencent-ailab / persona-hub
Official repo for the paper "Scaling Synthetic Data Creation with 1,000,000,000 Personas"
☆1,362Updated 8 months ago
THUDM / AgentBench
A Comprehensive Benchmark to Evaluate LLMs as Agents (ICLR'24)
☆2,864Updated last week
PrimeIntellect-ai / verifiers
Environments for LLM Reinforcement Learning
☆3,338Updated this week