bytedance / FullStackBench

Official repository for our paper "FullStack Bench: Evaluating LLMs as Full Stack Coders"

☆92

Alternatives and similar repositories for FullStackBench

Users who are interested in FullStackBench are comparing it to the libraries listed below:

ganler / code-r1
Reproducing R1 for Code with Reliable Rewards
☆221 Updated last month
open-compass / DevEval
A Comprehensive Benchmark for Software Development.
☆110 Updated last year
thunlp / DebugBench
The repository for the paper "DebugBench: Evaluating Debugging Capability of Large Language Models".
☆78 Updated 11 months ago
THUDM / NaturalCodeBench
NaturalCodeBench (Findings of ACL 2024)
☆65 Updated 8 months ago
thinkwee / AgentsMeetRL
An Awesome List of Reinforcement Learning-based Large Language Agent Works. Collected directly from official code bases.
☆154 Updated this week
MCEVAL / McEval
☆41 Updated 6 months ago
multi-swe-bench / multi-swe-bench
Multi-SWE-bench: A Multilingual Benchmark for Issue Resolving
☆195 Updated this week
facebookresearch / cruxeval
CRUXEval: Code Reasoning, Understanding, and Execution Evaluation
☆145 Updated 8 months ago
YerbaPage / Awesome-Repo-Level-Code-Generation
Must-read papers on Repository-level Code Generation & Issue Resolution 🔥
☆101 Updated this week
LCLM-Horizon / A-Comprehensive-Survey-For-Long-Context-Language-Modeling
A Comprehensive Survey on Long Context Language Modeling
☆152 Updated 3 weeks ago
sail-sg / oat-zero
A lightweight reproduction of DeepSeek-R1-Zero with in-depth analysis of self-reflection behavior.
☆241 Updated 2 months ago
bytedance / SandboxFusion
☆431 Updated this week
LingmaTongyi / Lingma-SWE-GPT
Inference code of Lingma SWE-GPT
☆226 Updated 6 months ago
seketeam / EvoCodeBench
An Evolving Code Generation Benchmark Aligned with Real-world Code Repositories
☆61 Updated 10 months ago
TIGER-AI-Lab / AceCoder
The official repo for "AceCoder: Acing Coder RL via Automated Test-Case Synthesis" [ACL25]
☆85 Updated 2 months ago
bytedance / trae-agent
☆36 Updated last week
yanweiyue / masrouter
☆67 Updated last month
CodeEditorBench / CodeEditorBench
☆47 Updated last year
amazon-science / cceval
CrossCodeEval: A Diverse and Multilingual Benchmark for Cross-File Code Completion (NeurIPS 2023)
☆143 Updated 10 months ago
microsoft / SWE-bench-Live
🚀 SWE-bench Goes Live!
☆80 Updated last week
InternLM / InternBootcamp
☆157 Updated last week
APEXLAB / CodeApex
☆49 Updated last year
codefuse-ai / CodeFuse-CGE
☆21 Updated 2 months ago
ntunlp / xCodeEval
xCodeEval: A Large Scale Multilingual Multitask Benchmark for Code Understanding, Generation, Translation and Retrieval
☆83 Updated 9 months ago
thu-coai / CritiqueLLM
☆142 Updated 11 months ago
codefuse-ai / CodeFuse-CGM
☆191 Updated this week
codefuse-ai / codefuse-evaluation
Industrial-level evaluation benchmarks for coding LLMs across the full life cycle of AI-native software development. An enterprise-grade evaluation system for code LLMs, with releases ongoing.
☆96 Updated 2 months ago
GAIR-NLP / ProX
[ICML 2025] Programming Every Example: Lifting Pre-training Data Quality Like Experts at Scale
☆251 Updated 3 weeks ago
AlphaPav / mem-kk-logic
On Memorization of Large Language Models in Logical Reasoning
☆67 Updated 2 months ago
thu-coai / AutoDetect
Official GitHub repo for AutoDetect, an automated weakness detection framework for LLMs.
☆42 Updated last year