microsoft / prose-benchmarks
PROSE Public Benchmark Suite
☆24 · Updated last month
Related projects
Alternatives and complementary repositories for prose-benchmarks
- Astraios: Parameter-Efficient Instruction Tuning Code Language Models ☆57 · Updated 7 months ago
- InstructCoder: Instruction Tuning Large Language Models for Code Editing | Oral, ACL 2024 SRW ☆52 · Updated last month
- NaturalCodeBench (Findings of ACL 2024) ☆56 · Updated last month
- [EMNLP'23] Execution-Based Evaluation for Open Domain Code Generation ☆44 · Updated 11 months ago
- ☆75 · Updated last year
- CodeUltraFeedback: aligning large language models to coding preferences ☆65 · Updated 5 months ago
- Implementation of the model: "Reka Core, Flash, and Edge: A Series of Powerful Multimodal Language Models" in PyTorch ☆29 · Updated this week
- ☆39 · Updated 5 months ago
- xCodeEval: A Large Scale Multilingual Multitask Benchmark for Code Understanding, Generation, Translation and Retrieval ☆74 · Updated 2 months ago
- ☆42 · Updated 4 months ago
- ☆21 · Updated 3 weeks ago
- ☆50 · Updated 5 months ago
- Self-Reflection in LLM Agents: Effects on Problem-Solving Performance ☆29 · Updated 6 months ago
- ☆25 · Updated 2 years ago
- Script for processing OpenAI's PRM800K process supervision dataset into an Alpaca-style instruction-response format ☆27 · Updated last year
- Repo for Llatrieval ☆29 · Updated 3 months ago
- StepCoder: Improve Code Generation with Reinforcement Learning from Compiler Feedback ☆56 · Updated 2 months ago
- A distributed, extensible, secure solution for evaluating machine generated code with unit tests in multiple programming languages. ☆42 · Updated last month
- ☆47 · Updated 9 months ago
- [ICLR'24 spotlight] Tool-Augmented Reward Modeling ☆36 · Updated 8 months ago
- This repository includes a benchmark and code for the paper "Evaluating LLMs at Detecting Errors in LLM Responses". ☆26 · Updated 3 months ago
- Code for paper 'Data-Efficient FineTuning' ☆29 · Updated last year
- ☆24 · Updated 5 months ago
- LMTuner: Make the LLM Better for Everyone ☆33 · Updated last year
- On The Planning Abilities of OpenAI's o1 Models: Feasibility, Optimality, and Generalizability ☆23 · Updated last month
- Dataset and code for Findings of EMNLP'21 paper "CodeQA: A Question Answering Dataset for Source Code Comprehension". ☆38 · Updated 11 months ago
- [ICML 2023] "Outline, Then Details: Syntactically Guided Coarse-To-Fine Code Generation", Wenqing Zheng, S P Sharan, Ajay Kumar Jaiswal, … ☆37 · Updated last year
- Lightweight tool to identify Data Contamination in LLMs evaluation ☆42 · Updated 8 months ago
- ☆33 · Updated 5 months ago
- [NAACL 2024 Findings] Evaluation suite for the systematic evaluation of instruction selection methods. ☆23 · Updated last year