Qurrent-AI / RES-Q

RES-Q: Evaluating the Code-Editing Capability of Large Language Model Systems at the Repository Scale

☆26

Alternatives and similar repositories for RES-Q:

Users that are interested in RES-Q are comparing it to the libraries listed below

nuprl / CanItEdit
Can It Edit? Evaluating the Ability of Large Language Models to Follow Code Editing Instructions
☆41Updated 8 months ago
METR / RE-Bench
☆73Updated this week
teknium1 / LLM-Benchmark-Logs
Just a bunch of benchmark logs for different LLMs
☆119Updated 9 months ago
Aleph-Alpha / scaling
Scaling is a distributed training library and installable dependency designed to scale up neural networks, with a dedicated module for tr…
☆58Updated 5 months ago
ScalingIntelligence / codemonkeys
☆37Updated 3 months ago
METR / vivaria
Vivaria is METR's tool for running evaluations and conducting agent elicitation research.
☆90Updated this week
google-deepmind / mishax
☆128Updated 3 weeks ago
google-deepmind / dangerous-capability-evaluations
☆54Updated 7 months ago
AblateIt / finetune-study
Comprehensive analysis of difference in performance of QLora, Lora, and Full Finetunes.
☆82Updated last year
JoshuaPurtell / SmallBench
Small, simple agent task environments for training and evaluation
☆18Updated 5 months ago
PrimeIntellect-ai / prime-rl
prime-rl is a codebase for decentralized RL training at scale
☆79Updated this week
ScalingIntelligence / Archon
Archon provides a modular framework for combining different inference-time techniques and LMs with just a JSON config file.
☆171Updated last month
ConsequentAI / fneval
Functional Benchmarks and the Reasoning Gap
☆85Updated 6 months ago
kanishkg / stream-of-search
Repository for the paper Stream of Search: Learning to Search in Language
☆145Updated 2 months ago
SalesforceAIResearch / LaTRO
☆114Updated 2 months ago
InflectionAI / Inflection-Benchmarks
Public Inflection Benchmarks
☆68Updated last year
automix-llm / automix
Mixing Language Models with Self-Verification and Meta-Verification
☆104Updated 4 months ago
METR / public-tasks
☆90Updated last month
OpenPipe / deductive-reasoning
Train your own SOTA deductive reasoning model
☆88Updated last month
haizelabs / sphynx
Sphynx Hallucination Induction
☆53Updated 2 months ago
aorwall / SWE-bench-docker
☆91Updated 9 months ago
princeton-pli / hal-harness
☆72Updated this week
casper-hansen / OpenCoconut
OpenCoconut implements a latent reasoning paradigm where we generate thoughts before decoding.
☆171Updated 3 months ago
magicproduct / hash-hop
Long context evaluation for large language models
☆207Updated last month
evalplus / repoqa
RepoQA: Evaluating Long-Context Code Understanding
☆107Updated 5 months ago
allenai / infinigram-api
☆53Updated last week
poking-agents / modular-public
☆21Updated 3 weeks ago
JD-P / minihf
MiniHF is an inference, human preference data collection, and fine-tuning tool for local language models. It is intended to help the user…
☆171Updated this week
EleutherAI / nanoGPT-mup
The simplest, fastest repository for training/finetuning medium-sized GPTs.
☆105Updated 5 months ago
haizelabs / verdict
Verdict is a library for scaling judge-time compute.
☆199Updated last week