princeton-pli / hal-harness

☆102

Alternatives and similar repositories for hal-harness

Users who are interested in hal-harness are comparing it to the repositories listed below.

ScalingIntelligence / Archon
Archon provides a modular framework for combining different inference-time techniques and LMs with just a JSON config file.
☆175 Updated 4 months ago
METR / RE-Bench
☆94 Updated 3 months ago
UKGovernmentBEIS / inspect_evals
Collection of evals for Inspect AI
☆198 Updated this week
LeonGuertler / TextArena
A Collection of Competitive Text-Based Games for Language Model Evaluation and Reinforcement Learning
☆225 Updated this week
princeton-nlp / intercode
[NeurIPS 2023 D&B] Code repository for InterCode benchmark https://arxiv.org/abs/2306.14898
☆223 Updated last year
kanishkg / stream-of-search
Repository for the paper Stream of Search: Learning to Search in Language
☆149 Updated 6 months ago
ConsequentAI / fneval
Functional Benchmarks and the Reasoning Gap
☆88 Updated 10 months ago
Yu-Fangxu / FoR
[ICML 2025] Flow of Reasoning: Training LLMs for Divergent Reasoning with Minimal Examples
☆103 Updated last week
SalesforceAIResearch / LaTRO
☆118 Updated 5 months ago
data-for-agents / insta
Official Repo for InSTA: Towards Internet-Scale Training For Agents
☆52 Updated 3 weeks ago
withmartian / routerbench
The code for the paper ROUTERBENCH: A Benchmark for Multi-LLM Routing System
☆131 Updated last year
aorwall / moatless-tree-search
☆99 Updated last month
zorazrw / agent-workflow-memory
AWM: Agent Workflow Memory
☆297 Updated 6 months ago
StonyBrookNLP / appworld
🌍 Repository for "AppWorld: A Controllable World of Apps and People for Benchmarking Interactive Coding Agent", ACL'24 Best Resource Pap…
☆232 Updated 2 months ago
SALT-NLP / collaborative-gym
Framework and toolkits for building and evaluating collaborative agents that can work together with humans.
☆90 Updated 3 months ago
WildEval / ZeroEval
A simple unified framework for evaluating LLMs
☆235 Updated 3 months ago
OSU-NLP-Group / GrokkedTransformer
Code for NeurIPS'24 paper 'Grokked Transformers are Implicit Reasoners: A Mechanistic Journey to the Edge of Generalization'
☆225 Updated 2 weeks ago
ServiceNow / TapeAgents
TapeAgents is a framework that facilitates all stages of the LLM Agent development lifecycle
☆288 Updated last week
princeton-nlp / USACO
Can Language Models Solve Olympiad Programming?
☆119 Updated 6 months ago
MadryLab / context-cite
Attribute (or cite) statements generated by LLMs back to in-context information.
☆261 Updated 9 months ago
ryoungj / ToolEmu
[ICLR'24 Spotlight] A language model (LM)-based emulation framework for identifying the risks of LM agents with tool use
☆152 Updated last year
sher222 / LeReT
Learning to Retrieve by Trying - Source code for Grounding by Trying: LLMs with Reinforcement Learning-Enhanced Retrieval
☆49 Updated 9 months ago
apple / ToolSandbox
☆194 Updated 11 months ago
xiaowu0162 / LongMemEval
Benchmarking Chat Assistants on Long-Term Interactive Memory (ICLR 2025)
☆159 Updated 3 months ago
felipemaiapolo / tinyBenchmarks
Evaluating LLMs with fewer examples
☆160 Updated last year
ucl-dark / llm_debate
Code release for "Debating with More Persuasive LLMs Leads to More Truthful Answers"
☆113 Updated last year
anthropics / sleeper-agents-paper
Contains random samples referenced in the paper "Sleeper Agents: Training Robustly Deceptive LLMs that Persist Through Safety Training".
☆111 Updated last year
METR / vivaria
Vivaria is METR's tool for running evaluations and conducting agent elicitation research.
☆100 Updated this week
kohjingyu / search-agents
Code for the paper 🌳 Tree Search for Language Model Agents
☆208 Updated last year
meg-tong / sycophancy-eval
Datasets from the paper "Towards Understanding Sycophancy in Language Models"
☆86 Updated last year