ProsusAI / stack-eval
Official implementation for the paper "StackEval: Benchmarking LLMs in Coding Assistance" (https://arxiv.org/abs/2412.05288)
☆15 · Updated 7 months ago
Alternatives and similar repositories for stack-eval
Users interested in stack-eval are comparing it to the repositories listed below.
- Official code of our work, AVATAR: A Parallel Corpus for Java-Python Program Translation. ☆54 · Updated 10 months ago
- [EMNLP'23] Execution-Based Evaluation for Open Domain Code Generation ☆48 · Updated last year
- ☆27 · Updated 5 months ago
- Releasing code for "ReCode: Robustness Evaluation of Code Generation Models" ☆53 · Updated last year
- ☆45 · Updated 4 months ago
- Source Code for ACL-21 main conference paper "CoSQA: 20,000+ Web Queries for Code Search and Question Answering". ☆45 · Updated 2 years ago
- A distributed, extensible, secure solution for evaluating machine-generated code with unit tests in multiple programming languages. ☆55 · Updated 8 months ago
- Deep Just-In-Time Inconsistency Detection Between Comments and Source Code: Artifact ☆22 · Updated 2 years ago
- Code for "Natural Language to Code Translation with Execution" ☆41 · Updated 2 years ago
- [EACL 2024] ICE-Score: Instructing Large Language Models to Evaluate Code ☆76 · Updated last year
- ☆34 · Updated this week
- ☆1 · Updated 9 months ago
- Baselines for all tasks from Long Code Arena benchmarks 🏟️ ☆30 · Updated 2 months ago
- The dataset for the variable-misuse task, used in the ICLR 2020 paper "Global Relational Models of Source Code" [https://openreview.net/f…] ☆22 · Updated 4 years ago
- ☆125 · Updated 2 years ago
- Official code for the paper "CodeChain: Towards Modular Code Generation Through Chain of Self-revisions with Representative Sub-modules" ☆45 · Updated 5 months ago
- Training and Benchmarking LLMs for Code Preference. ☆33 · Updated 7 months ago
- [NeurIPS 2024] Evaluation harness for SWT-Bench, a benchmark for evaluating LLM repository-level test generation ☆50 · Updated 3 weeks ago
- ☆26 · Updated last week
- VarCLR: Variable Semantic Representation Pre-training via Contrastive Learning ☆39 · Updated 2 years ago
- ☆76 · Updated 3 months ago
- This repository includes a benchmark and code for the paper "Evaluating LLMs at Detecting Errors in LLM Responses". ☆29 · Updated 10 months ago
- Code for generating the JuICe dataset. ☆37 · Updated 3 years ago
- Evaluation results of code generation LLMs ☆31 · Updated last year
- StepCoder: Improve Code Generation with Reinforcement Learning from Compiler Feedback ☆65 · Updated 9 months ago
- Dataset and code for the Findings of EMNLP'21 paper "CodeQA: A Question Answering Dataset for Source Code Comprehension". ☆41 · Updated last year
- CodeUltraFeedback: aligning large language models to coding preferences ☆71 · Updated last year
- ☆24 · Updated 9 months ago
- ☆23 · Updated 2 years ago
- ☆36 · Updated 3 years ago