Qurrent-AI / RES-Q
RES-Q: Evaluating the Code-Editing Capability of Large Language Model Systems at the Repository Scale
☆27 Updated last year
Alternatives and similar repositories for RES-Q
Users who are interested in RES-Q are comparing it to the libraries listed below:
- ☆118 Updated 2 weeks ago
- ☆152 Updated 5 months ago
- Vivaria is METR's tool for running evaluations and conducting agent elicitation research. ☆133 Updated last week
- ☆217 Updated last week
- Sphynx Hallucination Induction ☆53 Updated last year
- ☆133 Updated 3 months ago
- METR Task Standard ☆173 Updated last year
- Archon provides a modular framework for combining different inference-time techniques and LMs with just a JSON config file. ☆189 Updated 11 months ago
- Red-Teaming Language Models with DSPy ☆250 Updated 11 months ago
- Just a bunch of benchmark logs for different LLMs ☆119 Updated last year
- A library for benchmarking the Long Term Memory and Continual learning capabilities of LLM based agents. With all the tests and code you… ☆82 Updated last year
- ☆65 Updated this week
- Lightly-reviewed collection of community environments ☆210 Updated last week
- ☆137 Updated 10 months ago
- Contains random samples referenced in the paper "Sleeper Agents: Training Robustly Deceptive LLMs that Persist Through Safety Training". ☆124 Updated last year
- Repository for the paper Stream of Search: Learning to Search in Language ☆153 Updated last year
- CodeSage: Code Representation Learning At Scale (ICLR 2024) ☆116 Updated last year
- Approximation of the Claude 3 tokenizer by inspecting generation stream ☆150 Updated last year
- ☆33 Updated 8 months ago
- ☆59 Updated last year
- Inference-time scaling for LLMs-as-a-judge. ☆327 Updated 3 months ago
- Scaling is a distributed training library and installable dependency designed to scale up neural networks, with a dedicated module for tr… ☆66 Updated 2 months ago
- A subset of jailbreaks automatically discovered by the Haize Labs haizing suite. ☆100 Updated 9 months ago
- Functional Benchmarks and the Reasoning Gap ☆89 Updated last year
- Source code for the collaborative reasoner research project at Meta FAIR. ☆112 Updated 9 months ago
- Open sourced predictions, execution logs, trajectories, and results from model inference + evaluation runs on the SWE-bench task. ☆246 Updated last week
- The Granite Guardian models are designed to detect risks in prompts and responses. ☆130 Updated 4 months ago
- A better way of testing, inspecting, and analyzing AI Agent traces. ☆46 Updated 3 weeks ago
- OpenCoconut implements a latent reasoning paradigm where we generate thoughts before decoding. ☆175 Updated last year
- ☆123 Updated 11 months ago