bigcode-project / bigcodebench
BigCodeBench: Benchmarking Code Generation Towards AGI
☆184 · Updated this week
Related projects:
- CRUXEval: Code Reasoning, Understanding, and Execution Evaluation ☆99 · Updated last month
- Official repository for the paper "LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code" ☆173 · Updated 3 weeks ago
- RepoQA: Evaluating Long-Context Code Understanding ☆96 · Updated this week
- ☆131 · Updated last month
- [NeurIPS 2023 D&B] Code repository for InterCode benchmark https://arxiv.org/abs/2306.14898 ☆182 · Updated 4 months ago
- 🐙 OctoPack: Instruction Tuning Code Large Language Models ☆421 · Updated last month
- Arena-Hard-Auto: An automatic LLM benchmark. ☆421 · Updated 2 weeks ago
- ✨ RepoBench: Benchmarking Repository-Level Code Auto-Completion Systems - ICLR 2024 ☆129 · Updated last month
- [ICML 2023] Data and code release for the paper "DS-1000: A Natural and Reliable Benchmark for Data Science Code Generation". ☆211 · Updated last month
- A simple unified framework for evaluating LLMs ☆121 · Updated this week
- ☆268 · Updated this week
- Enhancing AI Software Engineering with Repository-level Code Graph ☆60 · Updated 3 weeks ago
- Benchmarking LLMs with Challenging Tasks from Real Users ☆182 · Updated last month
- A multi-programming language benchmark for LLMs ☆189 · Updated this week
- An Analytical Evaluation Board of Multi-turn LLM Agents ☆227 · Updated 3 months ago
- Expert Specialized Fine-Tuning ☆129 · Updated last month
- ☆284 · Updated 3 months ago
- A Comprehensive Benchmark for Software Development. ☆84 · Updated 3 months ago
- CrossCodeEval: A Diverse and Multilingual Benchmark for Cross-File Code Completion (NeurIPS 2023) ☆114 · Updated last month
- Codes for the paper "∞Bench: Extending Long Context Evaluation Beyond 100K Tokens": https://arxiv.org/abs/2402.13718 ☆244 · Updated last week
- Implementation of paper Data Engineering for Scaling Language Models to 128K Context ☆416 · Updated 6 months ago
- This repository provides an original implementation of Detecting Pretraining Data from Large Language Models by *Weijia Shi, *Anirudh Aji… ☆198 · Updated 10 months ago
- ☆170 · Updated last month
- Code for the paper 🌳 Tree Search for Language Model Agents ☆124 · Updated last month
- StarCoder2-Instruct: Fully Transparent and Permissive Self-Alignment for Code Generation ☆221 · Updated 2 months ago
- ToolBench, an evaluation suite for LLM tool manipulation capabilities. ☆134 · Updated 6 months ago
- xCodeEval: A Large Scale Multilingual Multitask Benchmark for Code Understanding, Generation, Translation and Retrieval ☆70 · Updated 8 months ago
- Generative Judge for Evaluating Alignment ☆208 · Updated 8 months ago
- RewardBench: the first evaluation tool for reward models. ☆352 · Updated last week
- ☆86 · Updated last year