princeton-nlp / HELMET

The HELMET Benchmark

☆178

Alternatives and similar repositories for HELMET

Users who are interested in HELMET are comparing it to the repositories listed below.

princeton-nlp / ProLong
Homepage for ProLong (Princeton long-context language models) and paper "How to Train Long-Context Language Models (Effectively)"
☆231 Updated last month
nightdessert / Retrieval_Head
open-source code for paper: Retrieval Head Mechanistically Explains Long-Context Factuality
☆215 Updated last year
google-deepmind / loft
LOFT: A 1 Million+ Token Long-Context Benchmark
☆218 Updated 4 months ago
allenai / olmes
Reproducible, flexible LLM evaluations
☆257 Updated last week
TIGER-AI-Lab / LongICLBench
Code and Data for "Long-context LLMs Struggle with Long In-context Learning" [TMLR2025]
☆108 Updated 8 months ago
CodeCreator / WebOrganizer
Organize the Web: Constructing Domains Enhances Pre-Training Data Curation
☆67 Updated 5 months ago
QwenLM / ProcessBench
Official repository for ACL 2025 paper "ProcessBench: Identifying Process Errors in Mathematical Reasoning"
☆174 Updated 5 months ago
TIGER-AI-Lab / General-Reasoner
General Reasoner: Advancing LLM Reasoning Across All Domains [NeurIPS25]
☆185 Updated 4 months ago
da03 / Internalize_CoT_Step_by_Step
☆195 Updated 6 months ago
hkust-nlp / llm-compression-intelligence
Official github repo for the paper "Compression Represents Intelligence Linearly" [COLM 2024]
☆142 Updated last year
ScalerLab / JudgeBench
☆102 Updated 11 months ago
Leooyii / LCEG
Long Context Extension and Generalization in LLMs
☆62 Updated last year
booydar / babilong
BABILong is a benchmark for LLM evaluation using the needle-in-a-haystack approach.
☆215 Updated last month
SalesforceAIResearch / GemFilter
☆85 Updated 9 months ago
HKUNLP / STRING
[ICLR'25] Data and code for our paper "Why Does the Effective Context Length of LLMs Fall Short?"
☆78 Updated 11 months ago
allenai / WildBench
Benchmarking LLMs with Challenging Tasks from Real Users
☆242 Updated 11 months ago
eddycmu / demystify-long-cot
☆323 Updated 4 months ago
huggingface / ioi
☆40 Updated 6 months ago
bigai-nlco / LooGLE
ACL 2024 | LooGLE: Long Context Evaluation for Long-Context Language Models
☆185 Updated last year
princeton-nlp / QuRating
[ICML 2024] Selecting High-Quality Data for Training Language Models
☆189 Updated last year
TIGER-AI-Lab / CritiqueFineTuning
Code for "Critique Fine-Tuning: Learning to Critique is More Effective than Learning to Imitate" [COLM 2025]
☆178 Updated 3 months ago
WildEval / ZeroEval
A simple unified framework for evaluating LLMs
☆251 Updated 6 months ago
hkust-nlp / dart-math
[NeurIPS'24] Official code for *🎯DART-Math: Difficulty-Aware Rejection Tuning for Mathematical Problem-Solving*
☆115 Updated 10 months ago
xlang-ai / BRIGHT
[ICLR 2025] BRIGHT: A Realistic and Challenging Benchmark for Reasoning-Intensive Retrieval
☆168 Updated last month
ryoungj / ObsScaling
[NeurIPS'24 Spotlight] Observational Scaling Laws
☆57 Updated last year
zhiyuanhubj / LongRecipe
LongRecipe: Recipe for Efficient Long Context Generalization in Large Language Models
☆76 Updated last year
GAIR-NLP / OctoThinker
Revisiting Mid-training in the Era of Reinforcement Learning Scaling
☆177 Updated 3 months ago
zjunlp / LightThinker
[EMNLP 2025] LightThinker: Thinking Step-by-Step Compression
☆112 Updated 6 months ago
agentica-project / verl-pipeline
Async pipelined version of Verl
☆121 Updated 6 months ago
PKU-ML / LongPPL
Code for ICLR 2025 Paper "What is Wrong with Perplexity for Long-context Language Modeling?"
☆102 Updated 2 weeks ago