Qurrent-AI / RES-Q
RES-Q: Evaluating the Code-Editing Capability of Large Language Model Systems at the Repository Scale
☆26 Updated last year
Alternatives and similar repositories for RES-Q
Users who are interested in RES-Q are comparing it to the repositories listed below.
- Vivaria is METR's tool for running evaluations and conducting agent elicitation research. ☆99 Updated this week
- Just a bunch of benchmark logs for different LLMs ☆119 Updated 11 months ago
- ☆134 Updated 3 months ago
- Inference-time scaling for LLMs-as-a-judge. ☆251 Updated this week
- ☆97 Updated 2 weeks ago
- Scaling is a distributed training library and installable dependency designed to scale up neural networks, with a dedicated module for tr… ☆62 Updated 8 months ago
- Sphynx Hallucination Induction ☆53 Updated 5 months ago
- Archon provides a modular framework for combining different inference-time techniques and LMs with just a JSON config file. ☆173 Updated 4 months ago
- Code for NeurIPS'24 paper 'Grokked Transformers are Implicit Reasoners: A Mechanistic Journey to the Edge of Generalization' ☆224 Updated this week
- ☆99 Updated 4 months ago
- Website for hosting the Open Foundation Models Cheat Sheet. ☆267 Updated 2 months ago
- Code to reproduce "Transformers Can Do Arithmetic with the Right Embeddings", McLeish et al (NeurIPS 2024) ☆190 Updated last year
- ☆92 Updated 2 months ago
- Code for ExploreTom ☆84 Updated 3 weeks ago
- Train your own SOTA deductive reasoning model ☆99 Updated 4 months ago
- ☆128 Updated 3 months ago
- ☆129 Updated 3 months ago
- Repository for the paper Stream of Search: Learning to Search in Language ☆149 Updated 5 months ago
- [NeurIPS 2023 D&B] Code repository for InterCode benchmark https://arxiv.org/abs/2306.14898 ☆222 Updated last year
- Manage scalable open LLM inference endpoints in Slurm clusters ☆267 Updated last year
- Functional Benchmarks and the Reasoning Gap ☆88 Updated 9 months ago
- Contains random samples referenced in the paper "Sleeper Agents: Training Robustly Deceptive LLMs that Persist Through Safety Training". ☆109 Updated last year
- OpenCoconut implements a latent reasoning paradigm where we generate thoughts before decoding. ☆173 Updated 6 months ago
- Open source interpretability artefacts for R1. ☆154 Updated 2 months ago
- Approximation of the Claude 3 tokenizer by inspecting generation stream ☆131 Updated 11 months ago
- ⚖️ Awesome LLM Judges ⚖️ ☆107 Updated 2 months ago
- Code for the paper "Fishing for Magikarp" ☆159 Updated 2 months ago
- ☆117 Updated 4 months ago
- Red-Teaming Language Models with DSPy ☆202 Updated 5 months ago
- ☆171 Updated 4 months ago