alphadl / OOP-eval
The first Object-Oriented Programming (OOP) Evaluation Benchmark for LLMs
☆24Updated 5 months ago
Alternatives and similar repositories for OOP-eval
Users who are interested in OOP-eval are comparing it to the repositories listed below.
- [ICLR 2022] Official repository for "Knowledge Removal in Sampling-based Bayesian Inference"☆17Updated 3 years ago
- ☆28Updated 10 months ago
- [ICML 2024] Code for the paper "Confronting Reward Overoptimization for Diffusion Models: A Perspective of Inductive and Primacy Biases"☆35Updated 11 months ago
- ☆184Updated last week
- FusionBench: A Comprehensive Benchmark/Toolkit of Deep Model Fusion☆143Updated last week
- ☆16Updated 8 months ago
- MathFusion: Enhancing Mathematical Problem-solving of LLM through Instruction Fusion (ACL 2025)☆25Updated last month
- ICML 2024 - Official Repository for EXO: Towards Efficient Exact Optimization of Language Model Alignment☆57Updated last year
- 🚀enhanced GRPO with more verifiable rewards and real-time evaluators☆35Updated 2 weeks ago
- ☆40Updated 2 weeks ago
- The official implementation for Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink-Free☆44Updated last month
- SLED: Self Logits Evolution Decoding for Improving Factuality in Large Language Model https://arxiv.org/pdf/2411.02433☆26Updated 6 months ago
- ☆15Updated 2 months ago
- ☆32Updated last year
- [ACL'24] Can LLMs Speak For Diverse People? Tuning LLMs via Debate to Generate Controllable Controversial Statements☆23Updated 9 months ago
- [ACL 2024] Self-Training with Direct Preference Optimization Improves Chain-of-Thought Reasoning☆45Updated 10 months ago
- ☆19Updated 9 months ago
- Official Repo of Paper "Eliminating Position Bias of Language Models: A Mechanistic Approach"☆14Updated last week
- Code and models for EMNLP 2024 paper "WPO: Enhancing RLHF with Weighted Preference Optimization"☆40Updated 9 months ago
- [ACL 2024 Findings] CriticBench: Benchmarking LLMs for Critique-Correct Reasoning☆25Updated last year
- ☆17Updated 4 months ago
- This is an official implementation of the Reward rAnked Fine-Tuning Algorithm (RAFT), also known as iterative best-of-n fine-tuning or re…☆32Updated 9 months ago
- Benchmarking Benchmark Leakage in Large Language Models☆52Updated last year
- Repository for NPHardEval, a quantified-dynamic benchmark of LLMs☆54Updated last year
- An official implementation of "Catastrophic Failure of LLM Unlearning via Quantization" (ICLR 2025)☆27Updated 4 months ago
- Code, benchmark and environment for "ScienceBoard: Evaluating Multimodal Autonomous Agents in Realistic Scientific Workflows"☆71Updated this week
- A dataset of LLM-generated chain-of-thought steps annotated with mistake location.☆81Updated 10 months ago
- Exploration of automated dataset selection approaches at large scales.☆45Updated 3 months ago
- [EMNLP-2022 Findings] Code for paper “ProGen: Progressive Zero-shot Dataset Generation via In-context Feedback”.☆26Updated 2 years ago
- Code accompanying the paper "Noise Contrastive Alignment of Language Models with Explicit Rewards" (NeurIPS 2024)☆54Updated 7 months ago