wellecks / lm-evaluation-harness

A framework for few-shot evaluation of autoregressive language models.

☆24

Alternatives and similar repositories for lm-evaluation-harness

Users who are interested in lm-evaluation-harness are comparing it to the libraries listed below.

protagolabs / odyssey-math
☆85 Updated 10 months ago
mlfoundations / scaling
Language models scale reliably with over-training and on downstream tasks
☆100 Updated last year
IBM / SALMON
Self-Alignment with Principle-Following Reward Models
☆169 Updated 2 months ago
ars22 / scaling-LLM-math-synthetic-data
Code and data used in the paper: "Training on Incorrect Synthetic Data via RL Scales LLM Math Reasoning Eight-Fold"
☆31 Updated last year
ryoungj / ObsScaling
[NeurIPS'24 Spotlight] Observational Scaling Laws
☆59 Updated last year
Edward-Sun / easy-to-hard
Easy-to-Hard Generalization: Scalable Alignment Beyond Human Supervision
☆124 Updated last year
princeton-nlp / Collie
[ICLR 2024] COLLIE: Systematic Construction of Constrained Text Generation Tasks
☆57 Updated 2 years ago
chujiezheng / LLM-Extrapolation
Official repository for ACL 2025 paper "Model Extrapolation Expedites Alignment"
☆76 Updated 6 months ago
allenai / Lila
A unified benchmark for math reasoning
☆89 Updated 2 years ago
SynthLabsAI / big-math
A Large-Scale, High-Quality Math Dataset for Reinforcement Learning in Language Models
☆68 Updated 9 months ago
mukhal / GRACE
[EMNLP '23] Discriminator-Guided Chain-of-Thought Reasoning
☆49 Updated last year
hkust-nlp / llm-compression-intelligence
Official github repo for the paper "Compression Represents Intelligence Linearly" [COLM 2024]
☆143 Updated last year
YuxiXie / SelfEval-Guided-Decoding
☆103 Updated last year
UFO-101 / auto-circuit
A library for efficient patching and automatic circuit discovery.
☆80 Updated 4 months ago
qtli / GSM-Plus
GSM-Plus: Data, Code, and Evaluation for Enhancing Robust Mathematical Reasoning in Math Word Problems.
☆63 Updated last year
hughbzhang / o1_inference_scaling_laws
Replicating O1 inference-time scaling laws
☆90 Updated last year
genrm-star / genrm-critiques
GenRM-CoT: Data release for verification rationales
☆66 Updated last year
shunzh / Code-AI-Tree-Search
☆120 Updated last year
lee-ny / teaching_arithmetic
☆84 Updated 2 years ago
bigcode-project / bigcodebench-annotation
BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions
☆25 Updated last year
asaparov / prontoqa
Synthetic question-answering dataset to formally analyze the chain-of-thought output of large language models on a reasoning task.
☆154 Updated 2 months ago
SimengSun / alpaca_farm_lora
☆22 Updated 2 years ago
princeton-nlp / ProLong
Homepage for ProLong (Princeton long-context language models) and paper "How to Train Long-Context Language Models (Effectively)"
☆240 Updated 2 months ago
tianjunz / HIR
☆159 Updated 2 years ago
wellecks / naturalprover
NaturalProver: Grounded Mathematical Proof Generation with Language Models
☆38 Updated 2 years ago
mnoukhov / async_rlhf
Code and Configs for Asynchronous RLHF: Faster and More Efficient RL for Language Models
☆67 Updated 7 months ago
Zayne-sprague / MuSR
☆56 Updated last year
anadim / the-little-retrieval-test
☆34 Updated 2 years ago
swj0419 / in-context-pretraining
☆54 Updated last year
microsoft / SparseMixer
Sparse Backpropagation for Mixture-of-Expert Training
☆29 Updated last year