openai / human-eval
Code for the paper "Evaluating Large Language Models Trained on Code"
☆3,114 · Updated last year
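For context, human-eval scores model-generated completions for 164 hand-written Python programming problems by executing each completion against unit tests and reporting the pass@k metric from the paper. A minimal usage sketch, assuming the package is installed as described in its README; `generate_one_completion` is a hypothetical stand-in for your own model call:

```python
from human_eval.data import read_problems, write_jsonl

def generate_one_completion(prompt: str) -> str:
    # Hypothetical placeholder: call your model here and return only the
    # code that continues the given function signature/prompt.
    return "    pass\n"

problems = read_problems()  # dict of HumanEval problems keyed by task_id

# Draw several samples per task so pass@k with k > 1 can be estimated.
num_samples_per_task = 5
samples = [
    dict(task_id=task_id,
         completion=generate_one_completion(problems[task_id]["prompt"]))
    for task_id in problems
    for _ in range(num_samples_per_task)
]
write_jsonl("samples.jsonl", samples)

# Score the samples with the CLI installed by the package:
#   $ evaluate_functional_correctness samples.jsonl
```

The evaluator then runs each completion in a sandboxed subprocess against the problem's tests and reports the unbiased pass@k estimate.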
Alternatives and similar repositories for human-eval
Users interested in human-eval are comparing it to the repositories listed below.
- Measuring Massive Multitask Language Understanding | ICLR 2021 ☆1,550 · Updated 2 years ago
- ☆1,387 · Updated 2 years ago
- Rigorous evaluation of LLM-synthesized code - NeurIPS 2023 & COLM 2024 ☆1,683 · Updated 4 months ago
- A framework for the evaluation of autoregressive code generation language models. ☆1,020 · Updated 6 months ago
- An automatic evaluator for instruction-following language models. Human-validated, high-quality, cheap, and fast. ☆1,940 · Updated 6 months ago
- 800,000 step-level correctness labels on LLM solutions to MATH problems ☆2,087 · Updated 2 years ago
- The hub for EleutherAI's work on interpretability and learning dynamics ☆2,725 · Updated 2 months ago
- Benchmarking large language models' complex reasoning ability with chain-of-thought prompting ☆2,769 · Updated last year
- Human preference data for "Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback" ☆1,811 · Updated 7 months ago
- Official implementation for "Automatic Chain of Thought Prompting in Large Language Models" (stay tuned & more will be updated) ☆2,003 · Updated last year
- A Comprehensive Benchmark to Evaluate LLMs as Agents (ICLR'24) ☆3,151 · Updated 2 months ago
- Holistic Evaluation of Language Models (HELM) is an open source Python framework created by the Center for Research on Foundation Models … ☆2,662 · Updated this week
- The MATH Dataset (NeurIPS 2021) ☆1,298 · Updated 5 months ago
- [NeurIPS 2023] Reflexion: Language Agents with Verbal Reinforcement Learning ☆3,052 · Updated last year
- ☆671 · Updated last year
- Aligning pretrained language models with instruction data generated by themselves. ☆4,573 · Updated 2 years ago
- Beyond the Imitation Game collaborative benchmark for measuring and extrapolating the capabilities of language models ☆3,195 · Updated last year
- CodeGen is a family of open-source models for program synthesis. Trained on TPU-v4. Competitive with OpenAI Codex. ☆5,171 · Updated 3 months ago
- ☆1,560 · Updated this week
- SWE-bench: Can Language Models Resolve Real-world GitHub Issues? ☆4,232 · Updated this week
- ☆772 · Updated last year
- ☆1,505 · Updated 2 years ago
- ☆1,338 · Updated last year
- ☆1,634 · Updated 2 years ago
- Home of CodeT5: Open Code LLMs for Code Understanding and Generation ☆3,099 · Updated 2 years ago
- ☆1,069 · Updated last year
- TruthfulQA: Measuring How Models Imitate Human Falsehoods ☆878 · Updated last year
- APPS: Automated Programming Progress Standard (NeurIPS 2021) ☆501 · Updated last year
- ☆489 · Updated last year
- 👨‍💻 An awesome and curated list of the best code LLMs for research. ☆1,277 · Updated last year