tongye98 / Awesome-Code-BenchmarkLinks

A comprehensive code domain benchmark review of LLM researches.

☆67

Alternatives and similar repositories for Awesome-Code-Benchmark

Users that are interested in Awesome-Code-Benchmark are comparing it to the libraries listed below

Sorting:

YerbaPage / Awesome-Repo-Level-Code-Generation
Must-read papers on Repository-level Code Generation & Issue Resolution 🔥
☆126Updated 3 weeks ago
amazon-science / Repoformer
Repoformer: Selective Retrieval for Repository-Level Code Completion (ICML 2024)
☆55Updated last month
code-rag-bench / code-rag-bench
CodeRAG-Bench: Can Retrieval Augment Code Generation?
☆148Updated 8 months ago
ganler / code-r1
Reproducing R1 for Code with Reliable Rewards
☆240Updated 2 months ago
open-compass / DevEval
A Comprehensive Benchmark for Software Development.
☆111Updated last year
seketeam / EvoCodeBench
An Evolving Code Generation Benchmark Aligned with Real-world Code Repositories
☆62Updated 11 months ago
SWE-Perf / SWE-Perf
☆30Updated 2 weeks ago
multi-swe-bench / multi-swe-bench
Multi-SWE-bench: A Multilingual Benchmark for Issue Resolving
☆218Updated last week
zhangxjohn / LLM-Agent-Benchmark-List
A banchmark list for evaluation of large language models.
☆134Updated last month
amazon-science / cceval
CrossCodeEval: A Diverse and Multilingual Benchmark for Cross-File Code Completion (NeurIPS 2023)
☆151Updated last year
Leolty / repobench
✨ RepoBench: Benchmarking Repository-Level Code Auto-Completion Systems - ICLR 2024
☆168Updated 11 months ago
llm-as-a-judge / Awesome-LLM-as-a-judge
☆388Updated last week
R2E-Gym / R2E-Gym
Official repository for R2E-Gym: Procedural Environment Generation and Hybrid Verifiers for Scaling Open-Weights SWE Agents
☆136Updated 3 weeks ago
MozerWang / Loong
[EMNLP 2024 (Oral)] Leave No Document Behind: Benchmarking Long-Context LLMs with Extended Multi-Doc QA
☆139Updated 8 months ago
thunlp / DebugBench
The repository for paper "DebugBench: "Evaluating Debugging Capability of Large Language Models".
☆79Updated last year
facebookresearch / cruxeval
CRUXEval: Code Reasoning, Understanding, and Execution Evaluation
☆151Updated 9 months ago
SalesforceAIResearch / CodeChain
Official code for the paper "CodeChain: Towards Modular Code Generation Through Chain of Self-revisions with Representative Sub-modules"
☆45Updated 6 months ago
LCLM-Horizon / A-Comprehensive-Survey-For-Long-Context-Language-Modeling
A Comprehensive Survey on Long Context Language Modeling
☆166Updated 3 weeks ago
allanj / repo-level-codegen-papers
Repo-Level Code generation papers
☆195Updated 2 weeks ago
xyliu-cs / RISE
Official Implementation of RISE (Reinforcing Reasoning with Self-Verification)
☆28Updated 3 weeks ago
StonyBrookNLP / appworld
🌍 Repository for "AppWorld: A Controllable World of Apps and People for Benchmarking Interactive Coding Agent", ACL'24 Best Resource Pap…
☆231Updated 2 months ago
Saibo-creator / Awesome-LLM-Constrained-Decoding
A curated list of papers related to constrained decoding of LLM, along with their relevant code and resources.
☆241Updated 2 weeks ago
zorazrw / awesome-tool-llm
☆237Updated 11 months ago
lyy1994 / awesome-data-contamination
The Paper List on Data Contamination for Large Language Models Evaluation.
☆98Updated 2 weeks ago
microsoft / SWE-bench-Live
🚀 SWE-bench Goes Live!
☆103Updated last week
newfacade / LeetCodeDataset
LeetCode Training and Evaluation Dataset
☆28Updated 3 months ago
SWE-bench / SWE-smith
Scaling Data for SWE-agents
☆328Updated this week
zankner / CLoud
Critique-out-Loud Reward Models
☆70Updated 9 months ago
yuzhu-cai / rSDE-Bench
☆26Updated 2 months ago
xingyaoww / mint-bench
Official Repo for ICLR 2024 paper MINT: Evaluating LLMs in Multi-turn Interaction with Tools and Language Feedback by Xingyao Wang*, Ziha…
☆128Updated last year