richardodliu / OpenCodeEval
☆48 · Updated 3 months ago
Alternatives and similar repositories for OpenCodeEval
Users interested in OpenCodeEval are comparing it to the libraries listed below.
- [NeurIPS 2024] Fast Best-of-N Decoding via Speculative Rejection ☆52 · Updated last year
- [NeurIPS'24] Official code for "🎯DART-Math: Difficulty-Aware Rejection Tuning for Mathematical Problem-Solving" ☆119 · Updated 11 months ago
- Async pipelined version of Verl ☆125 · Updated 7 months ago
- Homepage for ProLong (Princeton long-context language models) and paper "How to Train Long-Context Language Models (Effectively)" ☆240 · Updated 2 months ago
- Repository of LV-Eval Benchmark ☆71 · Updated last year
- Evaluation utilities based on SymPy ☆20 · Updated 11 months ago
- Official repository for ACL 2025 paper "ProcessBench: Identifying Process Errors in Mathematical Reasoning" ☆179 · Updated 6 months ago
- LongProc: Benchmarking Long-Context Language Models on Long Procedural Generation ☆32 · Updated last month
- A Comprehensive Survey on Long Context Language Modeling ☆204 · Updated this week
- CodeElo: Benchmarking Competition-level Code Generation of LLMs with Human-comparable Elo Ratings ☆56 · Updated 9 months ago
- Based on the R1-Zero method, using rule-based rewards and GRPO on the Code Contests dataset ☆18 · Updated 7 months ago
- The official repository of the Omni-MATH benchmark ☆88 · Updated 11 months ago
- [ACL 2024] LooGLE: Long Context Evaluation for Long-Context Language Models ☆191 · Updated last year
- Reproducing R1 for Code with Reliable Rewards ☆272 · Updated 6 months ago
- ☆120 · Updated 5 months ago
- [ICML 2025] Teaching Language Models to Critique via Reinforcement Learning ☆116 · Updated 6 months ago
- Official Implementation of SAM-Decoding: Speculative Decoding via Suffix Automaton ☆36 · Updated 9 months ago
- ☆65 · Updated last year
- Revisiting Mid-training in the Era of Reinforcement Learning Scaling ☆180 · Updated 4 months ago
- Ouroboros: Speculative Decoding with Large Model Enhanced Drafting (EMNLP 2024 main) ☆112 · Updated 8 months ago
- A simple toolkit for benchmarking LLMs on mathematical reasoning tasks 🧮✨ ☆267 · Updated last year
- ☆77 · Updated 8 months ago
- The HELMET Benchmark ☆186 · Updated 3 months ago
- [ACL 2024] MT-Bench-101: A Fine-Grained Benchmark for Evaluating Large Language Models in Multi-Turn Dialogues ☆128 · Updated last year
- The code and data for the paper JiuZhang3.0 ☆49 · Updated last year
- ☆76 · Updated last year
- Open-source code for the paper "Retrieval Head Mechanistically Explains Long-Context Factuality" ☆222 · Updated last year
- ☆34 · Updated last year
- ☆46 · Updated 6 months ago
- ☆20 · Updated last month