arcprize / model_baseline

Testing baseline LLMs performance across various models

☆260

Alternatives and similar repositories for model_baseline

Users that are interested in model_baseline are comparing it to the libraries listed below

Sorting:

goodfire-ai / r1-interpretability
Open source interpretability artefacts for R1.
☆109Updated 3 weeks ago
arcprize / ARC-AGI-2
☆247Updated last month
LeonGuertler / TextArena
A Collection of Competitive Text-Based Games for Language Model Evaluation and Reinforcement Learning
☆156Updated this week
NousResearch / Open-Reasoning-Tasks
A comprehensive repository of reasoning tasks for LLMs (and beyond)
☆439Updated 7 months ago
jerber / lang-jepa
☆111Updated 4 months ago
open-thought / reasoning-gym
procedural reasoning datasets
☆580Updated this week
casper-hansen / OpenCoconut
OpenCoconut implements a latent reasoning paradigm where we generate thoughts before decoding.
☆172Updated 4 months ago
PrimeIntellect-ai / genesys
☆125Updated last month
aidanmclaughlin / AidanBench
Aidan Bench attempts to measure <big_model_smell> in LLMs.
☆299Updated 3 weeks ago
groundlight / r1_vlm
Build your own visual reasoning model
☆362Updated this week
NousResearch / atropos
Atropos is a Language Model Reinforcement Learning Environments framework for collecting and evaluating LLM trajectories through diverse …
☆357Updated this week
SWE-Gym / SWE-Gym
Code for Paper: Training Software Engineering Agents and Verifiers with SWE-Gym [ICML 2025]
☆455Updated last week
facebookresearch / swe-rl
Official codebase for "SWE-RL: Advancing LLM Reasoning via Reinforcement Learning on Open Software Evolution"
☆517Updated 2 months ago
PrimeIntellect-ai / prime-rl
prime-rl is a codebase for decentralized RL training at scale
☆211Updated this week
ekinakyurek / marc
Public repository for "The Surprising Effectiveness of Test-Time Training for Abstract Reasoning"
☆307Updated 5 months ago
OSU-NLP-Group / GrokkedTransformer
Code for NeurIPS'24 paper 'Grokked Transformers are Implicit Reasoners: A Mechanistic Journey to the Edge of Generalization'
☆191Updated 5 months ago
doomslide / hyperobject
☆97Updated 7 months ago
agora-protocol / paper-demo
☆150Updated 2 months ago
ScalingIntelligence / Archon
Archon provides a modular framework for combining different inference-time techniques and LMs with just a JSON config file.
☆173Updated 2 months ago
NovaSky-AI / SkyRL
SkyRL-v0: Train Real-World Long-Horizon Agents via Reinforcement Learning
☆261Updated this week
facebookresearch / MLGym
MLGym A New Framework and Benchmark for Advancing AI Research Agents
☆492Updated this week
WildEval / ZeroEval
A simple unified framework for evaluating LLMs
☆211Updated last month
willccbb / verifiers
Verifiers for LLM Reinforcement Learning
☆953Updated this week
haizelabs / Awesome-LLM-Judges
⚖️ Awesome LLM Judges ⚖️
☆97Updated 2 weeks ago
magicproduct / hash-hop
Long context evaluation for large language models
☆208Updated 2 months ago
facebookresearch / memory
Memory layers use a trainable key-value lookup mechanism to add extra parameters to a model without increasing FLOPs. Conceptually, spars…
☆327Updated 5 months ago
scicode-bench / SciCode
A benchmark that challenges language models to code solutions for scientific problems
☆119Updated this week
brendanhogan / DeepSeekRL-Extended
Exploring Applications of GRPO
☆212Updated last week
rgreenblatt / arc_draw_more_samples_pub
Draw more samples
☆189Updated 10 months ago
emergent-misalignment / emergent-misalignment
☆148Updated 2 months ago