bytedance / FullStackBench
Official repository for our paper "FullStack Bench: Evaluating LLMs as Full Stack Coders"
☆73Updated 3 months ago
Alternatives and similar repositories for FullStackBench:
Users who are interested in FullStackBench are comparing it to the repositories listed below
- ☆173Updated last month
- The repository for the paper "DebugBench: Evaluating Debugging Capability of Large Language Models".☆68Updated 8 months ago
- A Comprehensive Survey on Long Context Language Modeling☆86Updated 2 weeks ago
- Reproducing R1 for Code with Reliable Rewards☆140Updated 3 weeks ago
- A lightweight reproduction of DeepSeek-R1-Zero with in-depth analysis of self-reflection behavior.☆216Updated this week
- Inference code of Lingma SWE-GPT☆199Updated 3 months ago
- [ACL 2024 Demo] Official GitHub repo for UltraEval: An open source framework for evaluating foundation models.☆237Updated 5 months ago
- A Comprehensive Benchmark for Software Development.☆100Updated 10 months ago
- ☆49Updated last year
- GitHub page for "Large Language Model-Brained GUI Agents: A Survey"☆136Updated 3 weeks ago
- Official Repo for "Programming Every Example: Lifting Pre-training Data Quality Like Experts at Scale"☆229Updated last month
- NaturalCodeBench (Findings of ACL 2024)☆62Updated 5 months ago
- ☆142Updated 8 months ago
- ☆60Updated 4 months ago
- ☆124Updated 3 weeks ago
- [ICLR 2025] Benchmarking Agentic Workflow Generation☆63Updated last month
- ☆102Updated 3 months ago
- connecting humans and agents☆80Updated 3 months ago
- ACL 2024 | LooGLE: Long Context Evaluation for Long-Context Language Models☆179Updated 5 months ago
- AutoCoA (Automatic generation of Chain-of-Action) is an agent model framework that enhances the multi-turn tool usage capability of reaso…☆75Updated last week
- This repo aims to record resources on role-playing abilities in LLMs, including datasets, papers, applications, etc.☆107Updated 6 months ago
- ☆122Updated last year
- InfiAgent-DABench: Evaluating Agents on Data Analysis Tasks (ICML 2024)☆113Updated 3 months ago
- Neural Code Intelligence Survey 2024; Reading lists and resources☆257Updated last week
- ☆113Updated 2 months ago
- ☆28Updated 4 months ago
- CodeRAG-Bench: Can Retrieval Augment Code Generation?☆119Updated 4 months ago
- ☆101Updated 11 months ago
- CrossCodeEval: A Diverse and Multilingual Benchmark for Cross-File Code Completion (NeurIPS 2023)☆136Updated 8 months ago
- Official implementation of the paper "How to Understand Whole Repository? New SOTA on SWE-bench Lite (21.3%)"☆74Updated this week