bigcode-project / bigcode-evaluation-harnessLinks

A framework for the evaluation of autoregressive code generation language models.

☆986

Alternatives and similar repositories for bigcode-evaluation-harness

Users that are interested in bigcode-evaluation-harness are comparing it to the libraries listed below

Sorting:

microsoft / CodeT
☆668Updated 11 months ago
bigcode-project / octopack
🐙 OctoPack: Instruction Tuning Code Large Language Models
☆471Updated 8 months ago
declare-lab / instruct-eval
This repository contains code to quantitatively evaluate instruction-tuned models such as Alpaca and Flan-T5 on held-out tasks.
☆548Updated last year
abacaj / code-eval
Run evaluation on LLMs using human-eval benchmark
☆420Updated 2 years ago
bigcode-project / bigcode-dataset
☆475Updated last year
tatsu-lab / alpaca_eval
An automatic evaluator for instruction-following language models. Human-validated, high-quality, cheap, and fast.
☆1,885Updated 2 months ago
hendrycks / test
Measuring Massive Multitask Language Understanding | ICLR 2021
☆1,505Updated 2 years ago
suzgunmirac / BIG-Bench-Hard
Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them
☆519Updated last year
LiveCodeBench / LiveCodeBench
Official repository for the paper "LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code"
☆688Updated 3 months ago
tatsu-lab / alpaca_farm
A simulation framework for RLHF and alternatives. Develop your RLHF method without collecting human data.
☆826Updated last year
xlang-ai / DS-1000
[ICML 2023] Data and code release for the paper "DS-1000: A Natural and Reliable Benchmark for Data Science Code Generation".
☆256Updated 11 months ago
sylinrl / TruthfulQA
TruthfulQA: Measuring How Models Imitate Human Falsehoods
☆824Updated 9 months ago
OpenLMLab / LEval
[ACL'24 Outstanding] Data and code for L-Eval, a comprehensive long context language models evaluation benchmark
☆390Updated last year
ContextualAI / HALOs
A library with extensible implementations of DPO, KTO, PPO, ORPO, and other human-aware loss functions (HALOs).
☆890Updated 3 weeks ago
allenai / reward-bench
RewardBench: the first evaluation tool for reward models.
☆643Updated 4 months ago
likenneth / honest_llama
Inference-Time Intervention: Eliciting Truthful Answers from a Language Model
☆553Updated 9 months ago
epfLLM / Megatron-LLM
distributed trainer for LLMs
☆581Updated last year
Leolty / repobench
✨ RepoBench: Benchmarking Repository-Level Code Auto-Completion Systems - ICLR 2024
☆174Updated last year
nuprl / MultiPL-E
A multi-programming language benchmark for LLMs
☆278Updated 2 months ago
princeton-nlp / LLM-Shearing
[ICLR 2024] Sheared LLaMA: Accelerating Language Model Pre-training via Structured Pruning
☆631Updated last year
bigcode-project / bigcodebench
[ICLR'25] BigCodeBench: Benchmarking Code Generation Towards AGI
☆440Updated last week
jquesnelle / yarn
YaRN: Efficient Context Window Extension of Large Language Models
☆1,619Updated last year
DachengLi1 / LongChat
Official repository for LongChat and LongEval
☆531Updated last year
ruixiangcui / AGIEval
☆767Updated last year
yuchenlin / LLM-Blender
[ACL2023] We introduce LLM-Blender, an innovative ensembling framework to attain consistently superior performance by leveraging the dive…
☆966Updated last year
madaan / self-refine
LLMs can generate feedback on their work, use it to improve the output, and repeat this process iteratively.
☆750Updated last year
hao-ai-lab / LookaheadDecoding
[ICML 2024] Break the Sequential Dependency of LLM Inference Using Lookahead Decoding
☆1,288Updated 7 months ago
SinclairCoder / Instruction-Tuning-Papers
Reading list of Instruction-tuning. A trend starts from Natrural-Instruction (ACL 2022), FLAN (ICLR 2022) and T0 (ICLR 2022).
☆770Updated 2 years ago
glgh / awesome-llm-human-preference-datasets
A curated list of Human Preference Datasets for LLM fine-tuning, RLHF, and eval.
☆381Updated 2 years ago
anthropics / hh-rlhf
Human preference data for "Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback"
☆1,788Updated 4 months ago