simple-bench / SimpleBench

☆170

Alternatives and similar repositories for SimpleBench

Users who are interested in SimpleBench are comparing it to the repositories listed below.

NousResearch / Open-Reasoning-Tasks
A comprehensive repository of reasoning tasks for LLMs (and beyond)
☆450 Updated last year
EQ-bench / EQ-Bench
A benchmark for emotional intelligence in large language models
☆369 Updated last year
Mihaiii / backtrack_sampler
An easy-to-understand framework for LLM samplers that rewind and revise generated tokens
☆145 Updated 8 months ago
QuixiAI / OpenChatML
☆162 Updated 2 months ago
willccbb / mlx_parallm
Fast parallel LLM inference for MLX
☆224 Updated last year
aidanmclaughlin / AidanBench
Aidan Bench attempts to measure <big_model_smell> in LLMs.
☆312 Updated 4 months ago
casper-hansen / OpenCoconut
OpenCoconut implements a latent reasoning paradigm where we generate thoughts before decoding.
☆172 Updated 9 months ago
haizelabs / verdict
Inference-time scaling for LLMs-as-a-judge.
☆304 Updated 3 weeks ago
teknium1 / LLM-Benchmark-Logs
Just a bunch of benchmark logs for different LLMs
☆118 Updated last year
migtissera / Sensei
Generate Synthetic Data Using OpenAI, MistralAI or AnthropicAI
☆221 Updated last year
arcprize / arc-agi-benchmarking
Testing baseline LLMs performance across various models
☆319 Updated 2 weeks ago
microsoft / GRIN-MoE
GRadient-INformed MoE
☆264 Updated last year
normal-computing / extended-mind-transformers
☆123 Updated last year
Aider-AI / polyglot-benchmark
Coding problems used in aider's polyglot benchmark
☆184 Updated 10 months ago
epang080516 / arc_agi
SoTA Approach for ARC-AGI 2
☆126 Updated last month
xjdr-alt / entropix-local
smol models are fun too
☆93 Updated 11 months ago
jerber / arc_agi
☆62 Updated 3 months ago
haizelabs / Awesome-LLM-Judges
⚖️ Awesome LLM Judges ⚖️
☆132 Updated 6 months ago
OpenPipe / deductive-reasoning
Train your own SOTA deductive reasoning model
☆109 Updated 7 months ago
Danau5tin / calculator_agent_rl
Training an LLM to use a calculator with multi-turn reinforcement learning, achieving a 62% absolute increase in evaluation accuracy.
☆57 Updated 5 months ago
javirandor / anthropic-tokenizer
Approximation of the Claude 3 tokenizer by inspecting generation stream
☆142 Updated last year
SWE-agent / SWE-ReX
Sandboxed code execution for AI agents, locally or on the cloud. Massively parallel, easy to extend. Powering SWE-agent and more.
☆349 Updated this week
harishsg993010 / LLM-Research-Scripts
☆434 Updated last year
lechmazur / confabulations
Hallucinations (Confabulations) Document-Based Benchmark for RAG. Includes human-verified questions and answers.
☆230 Updated 2 months ago
METR / eval-analysis-public
Public repository containing METR's DVC pipeline for eval data analysis
☆124 Updated 6 months ago
teknium1 / ShareGPT-Builder
☆116 Updated 10 months ago
adobe-research / dynasaur
Official repository for "DynaSaur: Large Language Agents Beyond Predefined Actions"
☆349 Updated 10 months ago
PrimeIntellect-ai / prime-environments
Training-Ready RL Environments + Evals
☆132 Updated this week
magicproduct / hash-hop
Long context evaluation for large language models
☆224 Updated 7 months ago
anyscale / llm-router
Tutorial for building LLM router
☆231 Updated last year