infi-coder / infibench-evaluatorLinks

The evaluation framework for the InfiCoder-Eval benchmark.

☆21

Alternatives and similar repositories for infibench-evaluator

Users that are interested in infibench-evaluator are comparing it to the libraries listed below

Sorting:

evalplus / repoqa
RepoQA: Evaluating Long-Context Code Understanding
☆117Updated 10 months ago
amazon-science / llm-code-preference
Training and Benchmarking LLMs for Code Preference.
☆36Updated 10 months ago
CodeEditorBench / CodeEditorBench
☆53Updated last year
alonj / Same-Task-More-Tokens
The code for the paper: "Same Task, More Tokens: the Impact of Input Length on the Reasoning Performance of Large Language Models"
☆54Updated last year
facebookresearch / cruxeval
CRUXEval: Code Reasoning, Understanding, and Execution Evaluation
☆153Updated 11 months ago
qishenghu / InstructCoder
InstructCoder: Instruction Tuning Large Language Models for Code Editing | Oral ACL-2024 srw
☆62Updated 11 months ago
SalesforceAIResearch / swecomm
☆28Updated 8 months ago
crux-eval / eval-arena
☆28Updated 3 weeks ago
zhenyuhe00 / SWE-Swiss
SWE-Swiss: A Multi-Task Fine-Tuning and RL Recipe for High-Performance Issue Resolution
☆84Updated last week
Ablustrund / APPS_Plus
StepCoder: Improve Code Generation with Reinforcement Learning from Compiler Feedback
☆68Updated last year
InternLM / SWE-Fixer
☆117Updated 4 months ago
bigcode-project / astraios
Astraios: Parameter-Efficient Instruction Tuning Code Language Models
☆62Updated last year
Gen-Verse / CURE
[NeurIPS 2025 Spotlight] ReasonFlux-Coder: Open-Source LLM Coders with Co-Evolving Reinforcement Learning
☆117Updated last week
allenai / easy-to-hard-generalization
Code for the arXiv preprint "The Unreasonable Effectiveness of Easy Training Data"
☆48Updated last year
lfsszd / CS-Drafting
Cascade Speculative Drafting
☆30Updated last year
SparksofAGI / MHPP
☆32Updated last week
casmlab / NPHardEval
Repository for NPHardEval, a quantified-dynamic benchmark of LLMs
☆59Updated last year
Infini-AI-Lab / APE
☆33Updated 7 months ago
open-compass / DevEval
A Comprehensive Benchmark for Software Development.
☆113Updated last year
ise-uiuc / xft
XFT: Unlocking the Power of Code Instruction Tuning by Simply Merging Upcycled Mixture-of-Experts
☆35Updated last year
hughbzhang / o1_inference_scaling_laws
Replicating O1 inference-time scaling laws
☆90Updated 9 months ago
TIGER-AI-Lab / LongICLBench
Code and Data for "Long-context LLMs Struggle with Long In-context Learning" [TMLR2025]
☆107Updated 7 months ago
SalesforceAIResearch / GemFilter
☆86Updated 8 months ago
martin-wey / CodeUltraFeedback
CodeUltraFeedback: aligning large language models to coding preferences (TOSEM 2025)
☆72Updated last year
r2e-project / r2e
[ICML '24] r2e: turn any github repository into a programming agent environment
☆129Updated 5 months ago
MLE-Dojo / MLE-Dojo
☆73Updated last month
huggingface / ioi
☆38Updated 6 months ago
zorazrw / odex
[EMNLP'23] Execution-Based Evaluation for Open Domain Code Generation
☆49Updated last year
WHGTyen / BIG-Bench-Mistake
A dataset of LLM-generated chain-of-thought steps annotated with mistake location.
☆81Updated last year
jeffreysijuntan / lloco
The official repo for "LLoCo: Learning Long Contexts Offline"
☆116Updated last year