Open-source library for scalable, reproducible evaluation of AI models and benchmarks.
☆240Mar 20, 2026Updated this week
Alternatives and similar repositories for Evaluator
Users who are interested in Evaluator are comparing it to the libraries listed below.
- ☆10Jun 5, 2025Updated 9 months ago
- AIPerf is a comprehensive benchmarking tool that measures the performance of generative AI models served by your preferred inference solu…☆182Updated this week
- [EMNLP 2025] The official implementation of "Zero-shot Multimodal Document Retrieval via Cross-Modal Question Generation"☆15Aug 26, 2025Updated 6 months ago
- BERT score for text generation☆12Jan 15, 2025Updated last year
- 🤗 Transformers: State-of-the-art Natural Language Processing for TensorFlow 2.0 and PyTorch.☆17Jun 5, 2025Updated 9 months ago
- [COLING'25] Gen-SQL: Efficient Text-to-SQL By Bridging Natural Language Question And Database Schema With Pseudo-Schema☆24Jul 9, 2025Updated 8 months ago
- KoCommonGEN v2: A Benchmark for Navigating Korean Commonsense Reasoning Challenges in Large Language Models☆25Aug 24, 2024Updated last year
- ☆54Feb 11, 2025Updated last year
- YetAnotherWandbClient☆13Mar 16, 2026Updated last week
- This repository contains data, code and models for contextual noncompliance.☆25Jul 18, 2024Updated last year
- Translation of the StrategyQA dataset☆23Apr 12, 2024Updated last year
- A Finance Dataset Benchmark for Natural Language Queries☆26Dec 7, 2020Updated 5 years ago
- 🎨 NeMo Data Designer: A general library for generating high-quality synthetic data from scratch or based on seed data.☆866Updated this week
- A framework for standardizing evaluations of large foundation models, beyond single-score reporting and rankings.☆180Feb 26, 2026Updated 3 weeks ago
- KURE: An embedding model specialized for Korean retrieval, developed at Korea University☆209Feb 26, 2026Updated 3 weeks ago
- Rust crate for some audio utilities☆27Mar 8, 2025Updated last year
- Programmatically edit the W&B UI☆22Updated this week
- Code for the paper "Closing the Curious Case of Neural Text Degeneration"☆12Apr 9, 2025Updated 11 months ago
- Korean psychological counseling dataset☆81Jun 20, 2023Updated 2 years ago
- Bias, Hate classification with KoELECTRA 👿☆27Jun 12, 2023Updated 2 years ago
- Reproducible and flexible LLM evaluations for scientific reasoning.☆26Jul 23, 2025Updated 8 months ago
- Comprehensive LLM evaluation at scale: A production-ready framework for evaluating large language models across multiple benchmarks.☆38Updated this week
- The constitution for a decentralized autonomous organization for accelerating clinical research through open-source software collaboratio…☆11Apr 21, 2022Updated 3 years ago
- CLIcK: A Benchmark Dataset of Cultural and Linguistic Intelligence in Korean☆48Dec 23, 2024Updated last year
- A small rust-based data loader☆36Feb 20, 2026Updated last month
- Benchmark in Korean Context☆138Sep 26, 2023Updated 2 years ago
- The tool facilitates debugging convergence issues and testing new algorithms and recipes for training LLMs using Nvidia libraries such as…☆19Sep 17, 2025Updated 6 months ago
- KoCLIP: Korean port of OpenAI CLIP, in Flax☆155Dec 28, 2025Updated 2 months ago
- Developer Asset Hub for NVIDIA Nemotron — A one-stop resource for training recipes, usage cookbooks, datasets, and full end-to-end refere…☆725Updated this week
- Human rights corpus (#인권코퍼스)☆31Oct 6, 2023Updated 2 years ago
- SLM-SQL: An Exploration of Small Language Models for Text-to-SQL☆31Aug 12, 2025Updated 7 months ago
- A recipe for constituency parsing, disfluency tagging, and obtaining fluent transcripts of the English Fisher dataset☆13May 2, 2021Updated 4 years ago
- Test LLMs against jailbreaks and unprecedented harms☆40Oct 19, 2024Updated last year
- REverse-Engineered Reasoning for Open-Ended Generation☆94Sep 10, 2025Updated 6 months ago
- Codebase for the paper "The Remarkable Robustness of LLMs: Stages of Inference?"☆19Jun 11, 2025Updated 9 months ago
- An unofficial implementation of the SOLAR-10.7B model and the newly proposed interlocked-DUS (iDUS), with implementation and experiment details.☆14Mar 20, 2024Updated 2 years ago
- ☆39Feb 7, 2025Updated last year
- Retrieval-Augmented Generation battle!☆64Updated this week
- [ACL25] FEA-Bench: A Benchmark for Evaluating Repository-Level Code Generation for Feature Implementation☆47Jan 28, 2026Updated last month