NVIDIA-NeMo / Evaluator

Open-source library for scalable, reproducible evaluation of AI models and benchmarks.

☆106

Alternatives and similar repositories for Evaluator

Users who are interested in Evaluator are comparing it to the libraries listed below.

snowflakedb / ArcticTraining
ArcticTraining is a framework designed to simplify and accelerate the post-training process for large language models (LLMs)
☆257 Updated this week
huggingface / llm-swarm
Manage scalable open LLM inference endpoints in Slurm clusters
☆277 Updated last year
NVIDIA / logits-processor-zoo
A collection of LogitsProcessors to customize and enhance LLM behavior for specific tasks.
☆375 Updated 5 months ago
allenai / olmes
Reproducible, flexible LLM evaluations
☆293 Updated 2 weeks ago
RulinShao / retrieval-scaling
Official repository for "Scaling Retrieval-Based Language Models with a Trillion-Token Datastore".
☆220 Updated last month
facebookresearch / LayerSkip
Code for "LayerSkip: Enabling Early Exit Inference and Self-Speculative Decoding", ACL 2024
☆349 Updated 7 months ago
naver / bergen
Benchmarking library for RAG
☆248 Updated last month
allenai / wimbd
What's In My Big Data (WIMBD) - a toolkit for analyzing large text datasets
☆224 Updated last year
llm-efficiency-challenge / neurips_llm_efficiency_challenge
NeurIPS Large Language Model Efficiency Challenge: 1 LLM + 1GPU + 1Day
☆258 Updated 2 years ago
allenai / OLMo-core
PyTorch building blocks for the OLMo ecosystem
☆482 Updated this week
zai-org / ComplexFuncBench
Complex Function Calling Benchmark.
☆149 Updated 10 months ago
lm-sys / llm-decontaminator
Code for the paper "Rethinking Benchmark and Contamination for Language Models with Rephrased Samples"
☆315 Updated last year
mlcommons / modelbench
Run safety benchmarks against AI models and view detailed reports showing how well they performed.
☆112 Updated this week
booydar / babilong
BABILong is a benchmark for LLM evaluation using the needle-in-a-haystack approach.
☆226 Updated 3 months ago
princeton-nlp / HELMET
The HELMET Benchmark
☆187 Updated 3 months ago
jxmorris12 / cde
Code for training and evaluating Contextual Document Embedding models
☆201 Updated 6 months ago
mlfoundations / evalchemy
Automatic evals for LLMs
☆559 Updated 5 months ago
ServiceNow / PipelineRL
A scalable asynchronous reinforcement learning implementation with in-flight weight updates.
☆322 Updated this week
bminixhofer / zett
Code for Zero-Shot Tokenizer Transfer
☆142 Updated 10 months ago
Cohere-Labs-Community / m-rewardbench
Official Code for M-RᴇᴡᴀʀᴅBᴇɴᴄʜ: Evaluating Reward Models in Multilingual Settings (ACL 2025 Main)
☆38 Updated 6 months ago
ServiceNow / Fast-LLM
Accelerating your LLM training to full speed! Made with ❤️ by ServiceNow Research
☆265 Updated this week
google-deepmind / loft
LOFT: A 1 Million+ Token Long-Context Benchmark
☆218 Updated 5 months ago
withmartian / routerbench
The code for the paper ROUTERBENCH: A Benchmark for Multi-LLM Routing System
☆151 Updated last year
NVlabs / Minitron
A family of compressed models obtained via pruning and knowledge distillation
☆359 Updated last month
llm-merging / LLM-Merging
LLM-Merging: Building LLMs Efficiently through Merging
☆207 Updated last year
CodeCreator / WebOrganizer
Organize the Web: Constructing Domains Enhances Pre-Training Data Curation
☆69 Updated 7 months ago
arcee-ai / EvolKit
EvolKit is an innovative framework designed to automatically enhance the complexity of instructions used for fine-tuning Large Language M…
☆243 Updated last year
facebookresearch / ReasonIR
Official repository for the paper "ReasonIR: Training Retrievers for Reasoning Tasks".
☆209 Updated 5 months ago
Mohammadjafari80 / GSM8K-RLVR
A simplified implementation for experimenting with RLVR on GSM8K. This repository provides a starting point for exploring reasoning.
☆145 Updated 10 months ago
allenai / DataDecide
☆36 Updated 3 months ago