A comprehensive review of code-domain benchmarks in LLM research.
☆229 · Jun 25, 2026 · Updated last week
Alternatives and similar repositories for Awesome-Code-Benchmark
Users who are interested in Awesome-Code-Benchmark are comparing it to the repositories listed below.
- Source code to accompany research paper on training multi token prediction language models using self-distillation. ☆39 · Feb 21, 2026 · Updated 4 months ago
- [NeurIPS 2025 D&B] SWE-bench Goes Live! ☆200 · Jun 11, 2026 · Updated 3 weeks ago
- ☆12 · Nov 5, 2024 · Updated last year
- Multi-agent synthetic data generation pipeline capable of generating and validating long horizon terminal/coding tasks for RL training ☆67 · Jul 28, 2025 · Updated 11 months ago
- JDCallgraph - Dynamic call graph generation for Java. ☆20 · Oct 12, 2020 · Updated 5 years ago
- ☆15 · Feb 24, 2021 · Updated 5 years ago
- Must-read papers on Repository-level Code Generation & Issue Resolution 🔥 ☆313 · Jun 15, 2026 · Updated 2 weeks ago
- A curated list of products, benchmarks, and research papers on autonomous code agents. Beyond coding, they're redefining how software ch… ☆107 · Jun 20, 2026 · Updated last week
- Source code for Grounded Adaptation for Zero-shot Executable Semantic Parsing ☆21 · Feb 1, 2021 · Updated 5 years ago
- [NeurIPS 2025 D&B Spotlight] Scaling Data for SWE-agents ☆688 · Updated this week
- This repository contains code and data of the paper **On the Limitations of Continual Learning for Malware Classification**, accepted to … ☆20 · Dec 29, 2023 · Updated 2 years ago
- Automated Benchmarking of LLM Agents on Real-World Software Security Tasks [NeurIPS 2025] ☆79 · Jan 27, 2026 · Updated 5 months ago
- ✨ RepoBench: Benchmarking Repository-Level Code Auto-Completion Systems - ICLR 2024 ☆208 · Aug 16, 2024 · Updated last year
- Code repository for "RL Grokking Recipe: How RL Unlocks and Transfers New Algorithms in LLMs" ☆35 · Oct 12, 2025 · Updated 8 months ago
- ☆58 · Jun 30, 2023 · Updated 3 years ago
- Sparkles: Unlocking Chats Across Multiple Images for Multimodal Instruction-Following Models ☆45 · Jun 14, 2024 · Updated 2 years ago
- [ACL25] FEA-Bench: A Benchmark for Evaluating Repository-Level Code Generation for Feature Implementation ☆56 · Jan 28, 2026 · Updated 5 months ago
- Replication of AST Neural Network from Zhang J. et al. (2019) and application to software vulnerability detection ☆12 · Jan 13, 2020 · Updated 6 years ago
- [COLM 2025] Official repository for R2E-Gym: Procedural Environment Generation and Hybrid Verifiers for Scaling Open-Weights SWE Agents ☆294 · Jul 13, 2025 · Updated 11 months ago
- Supervised Local Modeling for Interpretability ☆29 · Oct 27, 2018 · Updated 7 years ago
- CodeMind is a generic framework for evaluating inductive code reasoning of LLMs. It is equipped with a static analysis component that ena… ☆42 · Feb 18, 2026 · Updated 4 months ago
- The Infibench variant of bigcode-evaluation-harness --- a framework for the evaluation of autoregressive code generation language models. ☆14 · Oct 19, 2024 · Updated last year
- ☆110 · Oct 13, 2025 · Updated 8 months ago
- Official repository for the paper "LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code" ☆894 · Jul 16, 2025 · Updated 11 months ago
- An Algorithm to Quantify Robustness of Recurrent Neural Networks ☆49 · Apr 24, 2020 · Updated 6 years ago
- MetaAgent: Toward Self-Evolving Agent via Tool Meta-Learning ☆47 · Sep 3, 2025 · Updated 10 months ago
- [ACL 2025] Graph Aligned Large Language Models for Improved Source Code Understanding ☆45 · May 18, 2025 · Updated last year
- Reproducing R1 for Code with Reliable Rewards ☆313 · May 5, 2025 · Updated last year
- Dataset and baseline for Coling 2022 long paper (oral): "ConFiguRe: Exploring Discourse-level Chinese Figures of Speech" ☆13 · Jul 27, 2023 · Updated 2 years ago
- Custom Tree-sitter toolkit for extracting functions and classes from raw source files ☆52 · Jul 1, 2024 · Updated 2 years ago
- Crashbench is an LLM benchmark to measure bug-finding and reporting capabilities of LLMs ☆14 · Mar 8, 2026 · Updated 3 months ago
- A website to store all my tests for ease of access. ☆23 · Feb 28, 2025 · Updated last year
- ☆176 · Apr 23, 2025 · Updated last year
- Docker images of Ubuntu 16.04 LTS with different versions of Java ☆14 · Dec 8, 2021 · Updated 4 years ago
- Gephi tutorials for data visualisation lecture. A Network Tour of Data Science 2019 Fall semester ☆12 · Apr 11, 2021 · Updated 5 years ago
- Cyber-Zero: Training Cybersecurity Agents Without Runtime ☆96 · Feb 13, 2026 · Updated 4 months ago
- Leaderboard of Frontier Models for Program Repair https://repairbench.github.io/ ☆11 · Oct 26, 2025 · Updated 8 months ago
- LLM-based approach to find regression bugs. It checks the behavioral changes introduced by a pull request against its title, description,… ☆17 · Mar 12, 2026 · Updated 3 months ago
- ☆19 · Sep 29, 2024 · Updated last year