bytedance / FullStackBench
Official repository for our paper "FullStack Bench: Evaluating LLMs as Full Stack Coders"
☆79 · Updated 4 months ago
Alternatives and similar repositories for FullStackBench:
Users interested in FullStackBench are comparing it to the repositories listed below
- Reproducing R1 for Code with Reliable Rewards ☆179 · Updated this week
- The repository for the paper "DebugBench: Evaluating Debugging Capability of Large Language Models" ☆72 · Updated 9 months ago
- ☆232 · Updated 2 months ago
- A Comprehensive Survey on Long Context Language Modeling ☆131 · Updated 3 weeks ago
- Multi-SWE-bench: A Multilingual Benchmark for Issue Resolving ☆128 · Updated last week
- NaturalCodeBench (Findings of ACL 2024) ☆63 · Updated 6 months ago
- A Comprehensive Benchmark for Software Development ☆102 · Updated 10 months ago
- ☆39 · Updated 4 months ago
- Abacus Code LLM (珠算代码大模型) ☆56 · Updated 7 months ago
- A lightweight reproduction of DeepSeek-R1-Zero with in-depth analysis of self-reflection behavior ☆229 · Updated last week
- [ACL 2024 Demo] Official GitHub repo for UltraEval: An open-source framework for evaluating foundation models ☆240 · Updated 5 months ago
- Codev-Bench (Code Development Benchmark), a fine-grained, real-world, repository-level, and developer-centric evaluation framework. Codev… ☆41 · Updated 5 months ago
- Inference code of Lingma SWE-GPT ☆213 · Updated 4 months ago
- ☆63 · Updated 5 months ago
- ☆146 · Updated last month
- [LREC-COLING'24] HumanEval-XL: A Multilingual Code Generation Benchmark for Cross-lingual Natural Language Generalization ☆39 · Updated last month
- Token-level visualization tools for large language models ☆79 · Updated 3 months ago
- CodeRAG-Bench: Can Retrieval Augment Code Generation? ☆128 · Updated 5 months ago
- A Comprehensive Benchmark for Routing LLMs to Explore Model-level Scaling Up in Large Language Models ☆38 · Updated last month
- A visualization tool for deeper understanding and easier debugging of RLHF training ☆187 · Updated 2 months ago
- ☆101 · Updated 4 months ago
- ☆81 · Updated last year
- Official Repo for "Programming Every Example: Lifting Pre-training Data Quality Like Experts at Scale" ☆236 · Updated last week
- Neural Code Intelligence Survey 2024; reading lists and resources ☆259 · Updated last month
- Advancing LLM with Diverse Coding Capabilities ☆69 · Updated 9 months ago
- xVerify: Efficient Answer Verifier for Reasoning Model Evaluations ☆75 · Updated last week
- ☆52 · Updated 2 months ago
- A survey of long-context LLMs from four perspectives: architecture, infrastructure, training, and evaluation ☆46 · Updated 3 weeks ago
- Ling is an MoE LLM open-sourced by InclusionAI ☆143 · Updated last week
- ☆19 · Updated 4 months ago