[NeurIPS 2024] Evaluation harness for SWT-Bench, a benchmark for evaluating LLM repository-level test-generation
☆71Jan 15, 2026Updated last month
Alternatives and similar repositories for swt-bench
Users who are interested in swt-bench are comparing it to the repositories listed below
- ☆45Jan 21, 2026Updated last month
- The official repo for the paper Can ChatGPT replace StackOverflow? A Study on Robustness and Reliability of Large Language Model Code Gen…☆20Feb 27, 2024Updated 2 years ago
- [ACL'25] UTBoost: Rigorous Evaluation of Coding Agents on SWE-Bench☆35Aug 12, 2025Updated 6 months ago
- TDD-Bench-Verified is a new benchmark for generating test cases for test-driven development (TDD)☆27Sep 18, 2025Updated 5 months ago
- ☆28Nov 10, 2025Updated 3 months ago
- Look Back to Reason Forward: Revisitable Memory for Long-Context LLM Agents☆23Feb 21, 2026Updated last week
- ☆12Nov 5, 2024Updated last year
- ☆60Jan 28, 2025Updated last year
- CodeRepoQA dataset☆15Feb 19, 2025Updated last year
- Code for "Incorporating Relevance Feedback for Information-Seeking Retrieval using Few-Shot Document Re-Ranking" (https://arxiv.org/abs/2…☆14Feb 2, 2026Updated last month
- Data and Code for Paper "Reflect Not Reflex: Inference-Based Common Ground Improves Dialogue Response Quality" (EMNLP 2022)☆11Nov 28, 2022Updated 3 years ago
- ☆13May 23, 2025Updated 9 months ago
- Knowledge Graph based Question Answering benchmark.☆10Feb 1, 2020Updated 6 years ago
- The Swiss Federal Chancellery Fedlex portal (www.fedlex.admin.ch) crawled, prettified and presented as a git repository.☆19Jan 10, 2026Updated last month
- Interface for GenAI-Arena [NeurIPS24]☆17Feb 27, 2024Updated 2 years ago
- ☆14Mar 13, 2021Updated 4 years ago
- [NeurIPS 2025 D&B Spotlight] Scaling Data for SWE-agents☆584Updated this week
- ☆628Sep 1, 2025Updated 6 months ago
- Open sourced predictions, execution logs, trajectories, and results from model inference + evaluation runs on the SWE-bench task.☆251Updated this week
- Code for our paper Resources and Evaluations for Multi-Distribution Dense Information Retrieval☆16Jan 16, 2024Updated 2 years ago
- AutoLog: A Log Sequence Synthesis Framework for Anomaly Detection [ASE'23]☆41Feb 20, 2024Updated 2 years ago
- ☆69Dec 15, 2024Updated last year
- Reinforcement Learning for Repository-Level Code Completion☆42Aug 19, 2024Updated last year
- Analyzing LLM Alignment via Token distribution shift☆17Jan 26, 2024Updated 2 years ago
- A lightweight library for Bayesian analysis of LLM evals (ICML 2025 Spotlight Position Paper)☆22May 28, 2025Updated 9 months ago
- junit tools contest infrastructure☆13Feb 9, 2024Updated 2 years ago
- A tool for REST API test coverage computation☆21Nov 12, 2025Updated 3 months ago
- [TOSEM 2026] A Systematic Literature Review on Large Language Models for Automated Program Repair☆232Updated this week
- ☆44Jun 24, 2025Updated 8 months ago
- ☆18Apr 15, 2024Updated last year
- ☆23Jan 25, 2023Updated 3 years ago
- [DL4C @ ICLR 2025] A Benchmark for Automated Environment Setup☆34Nov 9, 2025Updated 3 months ago
- The repository for the paper "DebugBench: Evaluating Debugging Capability of Large Language Models".☆85Jul 13, 2024Updated last year
- Code for "[COLM'25] RepoST: Scalable Repository-Level Coding Environment Construction with Sandbox Testing"☆23Mar 18, 2025Updated 11 months ago
- Agentless🐱: an agentless approach to automatically solve software development problems☆2,010Dec 22, 2024Updated last year
- Code for our paper "Applying CodeBERT for Automated Program Repair of Java Simple Bugs", accepted to MSR 2021.☆52Nov 27, 2022Updated 3 years ago
- Repository for the CODAH dataset☆22Oct 29, 2022Updated 3 years ago
- Code for Paper: Training Software Engineering Agents and Verifiers with SWE-Gym [ICML 2025]☆644Jul 29, 2025Updated 7 months ago
- Official code implementation for the ACL 2025 paper: 'Dynamic Scaling of Unit Tests for Code Reward Modeling'☆27May 16, 2025Updated 9 months ago