THUDM / NaturalCodeBench

NaturalCodeBench (Findings of ACL 2024)

☆67

Alternatives and similar repositories for NaturalCodeBench

Users who are interested in NaturalCodeBench are comparing it to the repositories listed below.

CodeEditorBench / CodeEditorBench
☆53 Updated last year
ntunlp / xCodeEval
xCodeEval: A Large Scale Multilingual Multitask Benchmark for Code Understanding, Generation, Translation and Retrieval
☆86 Updated last year
qishenghu / InstructCoder
InstructCoder: Instruction Tuning Large Language Models for Code Editing | Oral ACL-2024 srw
☆62 Updated last year
THUDM / ChatGLM-Math
☆83 Updated last year
meowpass / FollowComplexInstruction
Official implementation of the paper "From Complex to Simple: Enhancing Multi-Constraint Complex Instruction Following Ability of Large L…
☆51 Updated last year
SparksofAGI / MHPP
☆32 Updated last month
MCEVAL / McEval
☆44 Updated 10 months ago
ernie-research / Tool-Augmented-Reward-Model
[ICLR'24 spotlight] Tool-Augmented Reward Modeling
☆51 Updated 4 months ago
Ablustrund / APPS_Plus
StepCoder: Improve Code Generation with Reinforcement Learning from Compiler Feedback
☆71 Updated last year
GAIR-NLP / ReAlign
Reformatted Alignment
☆112 Updated last year
liyucheng09 / Contamination_Detector
Lightweight tool to identify Data Contamination in LLMs evaluation
☆52 Updated last year
GAIR-NLP / OPO
☆51 Updated last year
TIGER-AI-Lab / AceCoder
The official repo for "AceCoder: Acing Coder RL via Automated Test-Case Synthesis" [ACL25]
☆91 Updated 6 months ago
Junjie-Ye / ToolEyes
[COLING 2025] ToolEyes: Fine-Grained Evaluation for Tool Learning Capabilities of Large Language Models in Real-world Scenarios
☆69 Updated 5 months ago
bigcode-project / astraios
Astraios: Parameter-Efficient Instruction Tuning Code Language Models
☆62 Updated last year
WHGTyen / BIG-Bench-Mistake
A dataset of LLM-generated chain-of-thought steps annotated with mistake location.
☆82 Updated last year
thu-coai / CritiqueLLM
☆147 Updated last year
OpenLMLab / LongWanjuan
Towards Systematic Measurement for Long Text Quality
☆36 Updated last year
Open-Source-O1 / o1_Reasoning_Patterns_Study
☆104 Updated 10 months ago
wwxu21 / CUT
Source code of "Reasons to Reject? Aligning Language Models with Judgments"
☆58 Updated last year
icip-cas / awesome-auto-alignment
Collection of papers for scalable automated alignment.
☆94 Updated last year
NumberChiffre / mcts-llm
☆96 Updated 10 months ago
open-compass / MathBench
[ACL 2024 Findings] MathBench: A Comprehensive Multi-Level Difficulty Mathematics Evaluation Dataset
☆107 Updated 5 months ago
princeton-nlp / LLMBar
[ICLR 2024] Evaluating Large Language Models at Evaluating Instruction Following
☆131 Updated last year
facebookresearch / cruxeval
CRUXEval: Code Reasoning, Understanding, and Execution Evaluation
☆154 Updated last year
QwenLM / online_merging_optimizers
Implementations of the online merging optimizers proposed in "Online Merging Optimizers for Boosting Rewards and Mitigating Tax in Alignment"
☆77 Updated last year
TIGER-AI-Lab / MAmmoTH2
Official code for "MAmmoTH2: Scaling Instructions from the Web" [NeurIPS 2024]
☆148 Updated 11 months ago
thu-coai / ComplexBench
Benchmarking Complex Instruction-Following with Multiple Constraints Composition (NeurIPS 2024 Datasets and Benchmarks Track)
☆95 Updated 8 months ago
yegcjs / mixinglaws
☆106 Updated 3 months ago
ntunlp / ExecEval
A distributed, extensible, secure solution for evaluating machine-generated code with unit tests in multiple programming languages.
☆56 Updated last year