hud-evals / hud-sdk
HUD SDK
☆71Updated this week
Alternatives and similar repositories for hud-sdk
Users that are interested in hud-sdk are comparing it to the libraries listed below
- Challenges for general-purpose web-browsing AI agents☆60Updated last month
- ☆129Updated 3 months ago
- An automated tool for discovering insights from research paper corpora☆138Updated last year
- ⚖️ Awesome LLM Judges ⚖️☆107Updated 2 months ago
- [ACL 2024] Do Large Language Models Latently Perform Multi-Hop Reasoning?☆71Updated 3 months ago
- Inference-time scaling for LLMs-as-a-judge.☆251Updated this week
- ☆162Updated 4 months ago
- ☆64Updated last month
- WebLINX is a benchmark for building web navigation agents with conversational capabilities☆153Updated 5 months ago
- OpenCoconut implements a latent reasoning paradigm where we generate thoughts before decoding.☆173Updated 6 months ago
- Train your own SOTA deductive reasoning model☆99Updated 4 months ago
- Code for our paper PAPILLON: PrivAcy Preservation from Internet-based and Local Language MOdel ENsembles☆52Updated 2 months ago
- ☆97Updated 2 weeks ago
- Open source interpretability artefacts for R1.☆154Updated 2 months ago
- j1-micro (1.7B) & j1-nano (600M) are absurdly tiny but mighty reward models.☆91Updated last month
- ☆51Updated 3 weeks ago
- A framework for optimizing DSPy programs with RL☆89Updated this week
- A framework for pitting LLMs against each other in an evolving library of games ⚔☆32Updated 2 months ago
- ☆78Updated 8 months ago
- Code for ScribeAgent paper☆58Updated 4 months ago
- ☆55Updated this week
- Training an LLM to use a calculator with multi-turn reinforcement learning, achieving a **62% absolute increase in evaluation accuracy**.☆42Updated 2 months ago
- Automating enterprise workflows with multimodal agents☆108Updated 9 months ago
- Claude Deep Research config for Claude Code.☆196Updated 4 months ago
- Routing on Random Forest (RoRF)☆176Updated 9 months ago
- Archon provides a modular framework for combining different inference-time techniques and LMs with just a JSON config file.☆173Updated 4 months ago
- Testing baseline LLM performance across various models☆284Updated this week
- Official Repo for InSTA: Towards Internet-Scale Training For Agents☆50Updated this week
- Synthetic data generation and benchmark implementation for "Episodic Memories Generation and Evaluation Benchmark for Large Language Mode…☆48Updated 3 months ago
- Code and data for the paper "Why think step by step? Reasoning emerges from the locality of experience"☆60Updated 3 months ago