RUCAIBox / ICPC-Eval
A new benchmark of 118 ICPC problems for evaluating LLM reasoning in competitive coding, featuring a realistic ICPC competition scenario, robust local evaluation, and an iterative repair metric, Refine@K
☆16 Updated 7 months ago
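For context on the Refine@K metric mentioned above: below is a minimal sketch of how an iterative-repair metric of this kind could be computed, where a problem counts as solved if the model produces a passing solution within K generate-test-repair rounds. The function names and interfaces (`generate`, `run_tests`) are hypothetical illustrations, not ICPC-Eval's actual API.

```python
# Hypothetical sketch of an iterative-repair metric in the spirit of Refine@K.
# Interfaces here are illustrative assumptions, not ICPC-Eval's real code.
from typing import Callable, List, Tuple

def refine_at_k(
    problems: List[dict],
    generate: Callable[[dict, str], str],        # (problem, feedback) -> candidate code
    run_tests: Callable[[dict, str], Tuple[bool, str]],  # (problem, code) -> (passed, feedback)
    k: int,
) -> float:
    """Fraction of problems solved within k repair iterations."""
    solved = 0
    for problem in problems:
        feedback = ""  # the first attempt has no repair feedback yet
        for _ in range(k):
            candidate = generate(problem, feedback)
            passed, feedback = run_tests(problem, candidate)
            if passed:
                solved += 1
                break
    return solved / len(problems) if problems else 0.0
```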
Alternatives and similar repositories for ICPC-Eval
Users interested in ICPC-Eval are comparing it to the repositories listed below.
- LeetCode Training and Evaluation Dataset ☆46 Updated 8 months ago
- A research repo for experiments about Reinforcement Finetuning ☆53 Updated 9 months ago
- Related work and background techniques for OpenAI o1 ☆221 Updated last year
- Reproducing R1 for Code with Reliable Rewards ☆279 Updated 8 months ago
- ☆50 Updated 4 months ago
- Trinity-RFT is a general-purpose, flexible and scalable framework designed for reinforcement fine-tuning (RFT) of large language models (… ☆473 Updated this week
- [AAAI 2025] The official code of the paper "InverseCoder: Unleashing the Power of Instruction-Tuned Code LLMs with Inverse-Instruct" (http… ☆14 Updated last year
- Official repository for our paper "FullStack Bench: Evaluating LLMs as Full Stack Coders" ☆109 Updated 8 months ago
- A comprehensive review of code-domain benchmarks in LLM research. ☆182 Updated 3 months ago
- ☆326 Updated 7 months ago
- A Comprehensive Survey on Long Context Language Modeling ☆215 Updated last month
- xVerify: Efficient Answer Verifier for Reasoning Model Evaluations ☆142 Updated 2 months ago
- ☆255 Updated 5 months ago
- ResearcherBench: Evaluating Deep AI Research Systems on the Frontiers of Scientific Inquiry ☆41 Updated last week
- ☆52 Updated 10 months ago
- Evaluation utilities based on SymPy. ☆21 Updated last year
- Repository of LV-Eval Benchmark ☆73 Updated last year
- Must-read papers on Repository-level Code Generation & Issue Resolution 🔥 ☆234 Updated 3 weeks ago
- ☆457 Updated 5 months ago
- A Comprehensive Benchmark for Software Development. ☆124 Updated last year
- Official repository for ACL 2025 paper "ProcessBench: Identifying Process Errors in Mathematical Reasoning" ☆181 Updated 7 months ago
- CodeRAG-Bench: Can Retrieval Augment Code Generation? ☆164 Updated last year
- A simple toolkit for benchmarking LLMs on mathematical reasoning tasks. 🧮✨ ☆270 Updated last year
- Official repository for the paper "COAST: Enhancing the Code Debugging Ability of LLMs through Communicative Agent Based Data Synthesis". ☆17 Updated 10 months ago
- ☆217 Updated last week
- A repository sharing the literature on large language models ☆107 Updated 3 weeks ago
- A lightweight reproduction of DeepSeek-R1-Zero with in-depth analysis of self-reflection behavior. ☆249 Updated 8 months ago
- Code, benchmark and environment for "ScienceBoard: Evaluating Multimodal Autonomous Agents in Realistic Scientific Workflows" ☆117 Updated last month
- [ICLR 2025] Benchmarking Agentic Workflow Generation ☆142 Updated 10 months ago
- [EMNLP 2024 (Oral)] Leave No Document Behind: Benchmarking Long-Context LLMs with Extended Multi-Doc QA ☆144 Updated 3 weeks ago