poking-agents / modular-public
☆22 Updated last month
Alternatives and similar repositories for modular-public
Users who are interested in modular-public are comparing it to the libraries listed below.
- Vivaria is METR's tool for running evaluations and conducting agent elicitation research. ☆99 Updated 2 weeks ago
- ☆92 Updated 2 months ago
- ☆55 Updated 9 months ago
- METR Task Standard ☆151 Updated 5 months ago
- Investigating the generalization behavior of LM probes trained to predict truth labels: (1) from one annotator to another, and (2) from e… ☆28 Updated last year
- ☆134 Updated 3 months ago
- Public repository containing METR's DVC pipeline for eval data analysis ☆78 Updated 3 months ago
- Simple repository for training small reasoning models ☆33 Updated 5 months ago
- ☆38 Updated 11 months ago
- An attribution library for LLMs ☆42 Updated 9 months ago
- A framework for optimizing DSPy programs with RL ☆89 Updated this week
- A Collection of Competitive Text-Based Games for Language Model Evaluation and Reinforcement Learning ☆207 Updated this week
- Inference-time scaling for LLMs-as-a-judge. ☆250 Updated last week
- Archon provides a modular framework for combining different inference-time techniques and LMs with just a JSON config file. ☆173 Updated 4 months ago
- Open source interpretability artefacts for R1. ☆154 Updated 2 months ago
- A framework for pitting LLMs against each other in an evolving library of games ⚔ ☆32 Updated 2 months ago
- ☆86 Updated 6 months ago
- Draw more samples ☆192 Updated last year
- ☆90 Updated this week
- QAlign is a new test-time alignment approach that improves language model performance by using Markov chain Monte Carlo methods. ☆23 Updated 3 months ago
- Simple GRPO scripts and configurations. ☆59 Updated 5 months ago
- ☆99 Updated 4 months ago
- Just a bunch of benchmark logs for different LLMs ☆119 Updated 11 months ago
- Synthetic data generation and benchmark implementation for "Episodic Memories Generation and Evaluation Benchmark for Large Language Mode… ☆46 Updated 3 months ago
- ☆14 Updated 3 months ago
- j1-micro (1.7B) & j1-nano (600M) are absurdly tiny but mighty reward models. ☆91 Updated last month
- Official Repo for InSTA: Towards Internet-Scale Training For Agents ☆48 Updated last week
- gzip Predicts Data-dependent Scaling Laws ☆35 Updated last year
- Long context evaluation for large language models ☆220 Updated 4 months ago
- Repository for the paper Stream of Search: Learning to Search in Language ☆149 Updated 5 months ago