hud-evals / hud-sdk

HUD SDK

☆71

Alternatives and similar repositories for hud-sdk

Users who are interested in hud-sdk are comparing it to the libraries listed below.

convergence-ai / webgames
Challenges for general-purpose web-browsing AI agents
☆60 Updated last month
PrimeIntellect-ai / genesys
☆129 Updated 3 months ago
ai8hyf / OpenResearchAssistant
An automated tool for discovering insights from research paper corpora
☆138 Updated last year
haizelabs / Awesome-LLM-Judges
⚖️ Awesome LLM Judges ⚖️
☆107 Updated 2 months ago
google-deepmind / latent-multi-hop-reasoning
[ACL 2024] Do Large Language Models Latently Perform Multi-Hop Reasoning?
☆71 Updated 3 months ago
haizelabs / verdict
Inference-time scaling for LLMs-as-a-judge.
☆251 Updated this week
agora-protocol / paper-demo
☆162 Updated 4 months ago
brendanhogan / picoDeepResearch
☆64 Updated last month
McGill-NLP / weblinx
WebLINX is a benchmark for building web navigation agents with conversational capabilities
☆153 Updated 5 months ago
casper-hansen / OpenCoconut
OpenCoconut implements a latent reasoning paradigm where we generate thoughts before decoding.
☆173 Updated 6 months ago
OpenPipe / deductive-reasoning
Train your own SOTA deductive reasoning model
☆99 Updated 4 months ago
Columbia-NLP-Lab / PAPILLON
Code for our paper PAPILLON: PrivAcy Preservation from Internet-based and Local Language MOdel ENsembles
☆52 Updated 2 months ago
princeton-pli / hal-harness
☆97 Updated 2 weeks ago
goodfire-ai / r1-interpretability
Open source interpretability artefacts for R1.
☆154 Updated 2 months ago
haizelabs / j1-micro
j1-micro (1.7B) & j1-nano (600M) are absurdly tiny but mighty reward models.
☆91 Updated last month
yueqis / API-Based-Agent
☆51 Updated 3 weeks ago
Ziems / arbor
A framework for optimizing DSPy programs with RL
☆89 Updated this week
facebookresearch / ZeroSumEval
A framework for pitting LLMs against each other in an evolving library of games ⚔
☆32 Updated 2 months ago
MinorJerry / OpenWebVoyager
☆78 Updated 8 months ago
colonylabs / ScribeAgent
Code for ScribeAgent paper
☆58 Updated 4 months ago
jerber / arc_agi
☆55 Updated this week
Danau5tin / calculator_agent_rl
Training an LLM to use a calculator with multi-turn reinforcement learning, achieving a 62% absolute increase in evaluation accuracy.
☆42 Updated 2 months ago
HazyResearch / eclair-agents
Automating enterprise workflows with multimodal agents
☆108 Updated 9 months ago
willccbb / claude-deep-research
Claude Deep Research config for Claude Code.
☆196 Updated 4 months ago
Not-Diamond / RoRF
Routing on Random Forest (RoRF)
☆176 Updated 9 months ago
ScalingIntelligence / Archon
Archon provides a modular framework for combining different inference-time techniques and LMs with just a JSON config file.
☆173 Updated 4 months ago
arcprize / arc-agi-benchmarking
Testing baseline LLM performance across various models
☆284 Updated this week
data-for-agents / insta
Official Repo for InSTA: Towards Internet-Scale Training For Agents
☆50 Updated this week
ahstat / episodic-memory-benchmark
Synthetic data generation and benchmark implementation for "Episodic Memories Generation and Evaluation Benchmark for Large Language Mode…
☆48 Updated 3 months ago
benpry / why-think-step-by-step
Code and data for the paper "Why think step by step? Reasoning emerges from the locality of experience"
☆60 Updated 3 months ago