openai / human-eval

Code for the paper "Evaluating Large Language Models Trained on Code"

☆2,839

Alternatives and similar repositories for human-eval

Users who are interested in human-eval compare it to the libraries listed below.

hendrycks / test
Measuring Massive Multitask Language Understanding | ICLR 2021
☆1,457 Updated 2 years ago
evalplus / evalplus
Rigorous evaluation of LLM-synthesized code - NeurIPS 2023 & COLM 2024
☆1,518 Updated 2 weeks ago
bigcode-project / bigcode-evaluation-harness
A framework for the evaluation of autoregressive code generation language models.
☆962 Updated this week
tatsu-lab / alpaca_eval
An automatic evaluator for instruction-following language models. Human-validated, high-quality, cheap, and fast.
☆1,806 Updated 6 months ago
anthropics / hh-rlhf
Human preference data for "Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback"
☆1,762 Updated last month
openai / prm800k
800,000 step-level correctness labels on LLM solutions to MATH problems
☆2,026 Updated 2 years ago
EleutherAI / pythia
The hub for EleutherAI's work on interpretability and learning dynamics
☆2,570 Updated last month
openai / grade-school-math
☆1,298 Updated last year
yizhongw / self-instruct
Aligning pretrained language models with instruction data generated by themselves.
☆4,423 Updated 2 years ago
stanford-crfm / helm
Holistic Evaluation of Language Models (HELM) is an open source Python framework created by the Center for Research on Foundation Models …
☆2,344 Updated this week
FranxYao / chain-of-thought-hub
Benchmarking large language models' complex reasoning ability with chain-of-thought prompting
☆2,741 Updated 11 months ago
bigscience-workshop / promptsource
Toolkit for creating, sharing and using natural language prompts.
☆2,904 Updated last year
ysymyth / ReAct
[ICLR 2023] ReAct: Synergizing Reasoning and Acting in Language Models
☆2,831 Updated last year
microsoft / CodeXGLUE
CodeXGLUE
☆1,705 Updated last year
microsoft / LMOps
General technology for enabling AI capabilities w/ LLMs and MLLMs
☆4,067 Updated 3 weeks ago
sahil280114 / codealpaca
☆1,478 Updated 2 years ago
THUDM / AgentBench
A Comprehensive Benchmark to Evaluate LLMs as Agents (ICLR'24)
☆2,683 Updated 5 months ago
google / BIG-bench
Beyond the Imitation Game collaborative benchmark for measuring and extrapolating the capabilities of language models
☆3,082 Updated last year
microsoft / CodeT
☆661 Updated 8 months ago
CarperAI / trlx
A repo for distributed training of language models with Reinforcement Learning via Human Feedback (RLHF)
☆4,680 Updated last year
amazon-science / auto-cot
Official implementation for "Automatic Chain of Thought Prompting in Large Language Models" (stay tuned; more updates to come)
☆1,893 Updated last year
lucidrains / toolformer-pytorch
Implementation of Toolformer, Language Models That Can Use Tools, by MetaAI
☆2,042 Updated 11 months ago
google-research / FLAN
☆1,529 Updated last week
noahshinn / reflexion
[NeurIPS 2023] Reflexion: Language Agents with Verbal Reinforcement Learning
☆2,793 Updated 6 months ago
hendrycks / math
The MATH Dataset (NeurIPS 2021)
☆1,154 Updated 11 months ago
AetherCortex / Llama-X
Open Academic Research on Improving LLaMA to SOTA LLM
☆1,618 Updated last year
openai / lm-human-preferences
Code for the paper "Fine-Tuning Language Models from Human Preferences"
☆1,347 Updated last year
hendrycks / apps
APPS: Automated Programming Progress Standard (NeurIPS 2021)
☆474 Updated last year
huybery / Awesome-Code-LLM
👨‍💻 An awesome, curated list of the best code LLMs for research.
☆1,216 Updated 7 months ago
gkamradt / LLMTest_NeedleInAHaystack
Simple retrieval from LLMs at various context lengths to measure accuracy
☆1,940 Updated 11 months ago