ProsusAI / stack-eval

Official implementation for the paper, StackEval: Benchmarking LLMs in Coding Assistance, https://arxiv.org/abs/2412.05288

☆15

Alternatives and similar repositories for stack-eval

Users who are interested in stack-eval are comparing it to the repositories listed below.

wasiahmad / AVATAR
Official code of our work, AVATAR: A Parallel Corpus for Java-Python Program Translation.
☆54 Updated 10 months ago
zorazrw / odex
[EMNLP'23] Execution-Based Evaluation for Open Domain Code Generation
☆48 Updated last year
SalesforceAIResearch / swecomm
☆27 Updated 5 months ago
amazon-science / recode
Releasing code for "ReCode: Robustness Evaluation of Code Generation Models"
☆53 Updated last year
rizwan09 / REDCODER
☆45 Updated 4 months ago
Jun-jie-Huang / CoCLR
Source Code for ACL-21 main conference paper "CoSQA: 20,000+ Web Queries for Code Search and Question Answering".
☆45 Updated 2 years ago
ntunlp / ExecEval
A distributed, extensible, secure solution for evaluating machine generated code with unit tests in multiple programming languages.
☆55 Updated 8 months ago
panthap2 / deep-jit-inconsistency-detection
Deep Just-In-Time Inconsistency Detection Between Comments and Source Code: Artifact
☆22 Updated 2 years ago
facebookresearch / mbr-exec
code for "Natural Language to Code Translation with Execution"
☆41 Updated 2 years ago
terryyz / ice-score
[EACL 2024] ICE-Score: Instructing Large Language Models to Evaluate Code
☆76 Updated last year
gangiswag / cornstack
☆34 Updated this week
multi-swe-bench / multi-swe-bench-env
☆1 Updated 9 months ago
JetBrains-Research / lca-baselines
Baselines for all tasks from Long Code Arena benchmarks 🏟️
☆30 Updated 2 months ago
google-research-datasets / great
The dataset for the variable-misuse task, used in the ICLR 2020 paper 'Global Relational Models of Source Code' [https://openreview.net/f…
☆22 Updated 4 years ago
shrivastavadisha / repo_level_prompt_generation
☆125 Updated 2 years ago
SalesforceAIResearch / CodeChain
Official code for the paper "CodeChain: Towards Modular Code Generation Through Chain of Self-revisions with Representative Sub-modules"
☆45 Updated 5 months ago
amazon-science / llm-code-preference
Training and Benchmarking LLMs for Code Preference.
☆33 Updated 7 months ago
logic-star-ai / swt-bench
[NeurIPS 2024] Evaluation harness for SWT-Bench, a benchmark for evaluating LLM repository-level test-generation
☆50 Updated 3 weeks ago
crux-eval / eval-arena
☆26 Updated last week
squaresLab / VarCLR
VarCLR: Variable Semantic Representation Pre-training via Contrastive Learning
☆39 Updated 2 years ago
nyu-mll / ILF-for-code-generation
☆76 Updated 3 months ago
psunlpgroup / ReaLMistake
This repository includes a benchmark and code for the paper "Evaluating LLMs at Detecting Errors in LLM Responses".
☆29 Updated 10 months ago
rajasagashe / JuICe
Code for generating the JuICe dataset.
☆37 Updated 3 years ago
jamesmurdza / humaneval-results
Evaluation results of code generation LLMs
☆31 Updated last year
Ablustrund / APPS_Plus
StepCoder: Improve Code Generation with Reinforcement Learning from Compiler Feedback
☆65 Updated 9 months ago
jadecxliu / CodeQA
Dataset and code for Findings of EMNLP'21 paper "CodeQA: A Question Answering Dataset for Source Code Comprehension".
☆41 Updated last year
martin-wey / CodeUltraFeedback
CodeUltraFeedback: aligning large language models to coding preferences
☆71 Updated last year
esteng / regal_program_learning
☆24 Updated 9 months ago
jianguda / mrncs
☆23 Updated 2 years ago
RosaliaTufano / code_review
☆36 Updated 3 years ago