RyanLoil / ProjectEval
☆23 · Updated 3 months ago
Alternatives and similar repositories for ProjectEval
Users interested in ProjectEval are comparing it to the libraries listed below.
- [ACL 2024] MT-Bench-101: A Fine-Grained Benchmark for Evaluating Large Language Models in Multi-Turn Dialogues ☆140 · Updated last year
- ☆322 · Updated last year
- A survey of long-context LLMs from four perspectives: architecture, infrastructure, training, and evaluation ☆61 · Updated 10 months ago
- [ACL'24] Superfiltering: Weak-to-Strong Data Filtering for Fast Instruction-Tuning ☆188 · Updated 7 months ago
- Code and data for "ConflictBank: A Benchmark for Evaluating the Influence of Knowledge Conflicts in LLM" (NeurIPS 2024 Track Datasets and… ☆64 · Updated 8 months ago
- A new tool learning benchmark aiming at well-balanced stability and reality, based on ToolBench. ☆214 · Updated 9 months ago
- The implementation of ACL 2024: "MAPO: Advancing Multilingual Reasoning through Multilingual Alignment-as-Preference Optimization" ☆43 · Updated last year
- ACL 2024 | LooGLE: Long Context Evaluation for Long-Context Language Models ☆194 · Updated last year
- Official repository for the paper "COAST: Enhancing the Code Debugging Ability of LLMs through Communicative Agent Based Data Synthesis". ☆17 · Updated 11 months ago
- ☆165 · Updated 3 months ago
- WritingBench: A Comprehensive Benchmark for Generative Writing ☆156 · Updated last month
- Extrapolating RLVR to General Domains without Verifiers ☆200 · Updated 5 months ago
- [ACL 2024 Demo] Official GitHub repo for UltraEval: An open source framework for evaluating foundation models. ☆256 · Updated last year
- Awesome-Long2short-on-LRMs is a collection of state-of-the-art, novel, exciting long2short methods on large reasoning models. It contains… ☆258 · Updated 5 months ago
- [ACL '25] The official code repository for PRMBench: A Fine-grained and Challenging Benchmark for Process-Level Reward Models. ☆88 · Updated 11 months ago
- ☆11 · Updated last year
- Repository for "Label Words are Anchors: An Information Flow Perspective for Understanding In-Context Learning" ☆168 · Updated 2 years ago
- [EMNLP 2023] Adapting Language Models to Compress Long Contexts ☆328 · Updated last year
- [EMNLP 2024 (Oral)] Leave No Document Behind: Benchmarking Long-Context LLMs with Extended Multi-Doc QA ☆146 · Updated last month
- InsTag: A Tool for Data Analysis in LLM Supervised Fine-tuning ☆284 · Updated 2 years ago
- xVerify: Efficient Answer Verifier for Reasoning Model Evaluations ☆143 · Updated 2 months ago
- Code for the paper "∞Bench: Extending Long Context Evaluation Beyond 100K Tokens": https://arxiv.org/abs/2402.13718 ☆373 · Updated last year
- [ACL 2024 (Oral)] A Prospector of Long-Dependency Data for Large Language Models ☆59 · Updated last year
- [ICML 2025] Programming Every Example: Lifting Pre-training Data Quality Like Experts at Scale ☆265 · Updated 7 months ago
- [ACL 2024] FollowBench: A Multi-level Fine-grained Constraints Following Benchmark for Large Language Models ☆119 · Updated 8 months ago
- [ACL 2024] T-Eval: Evaluating Tool Utilization Capability of Large Language Models Step by Step ☆304 · Updated last year
- Scaling Agentic Reinforcement Learning with a Multi-Turn, Multi-Task Framework ☆208 · Updated 3 weeks ago
- Heuristic filtering framework for RefineCode ☆82 · Updated 10 months ago
- Deita: Data-Efficient Instruction Tuning for Alignment [ICLR 2024] ☆587 · Updated last year
- OMGEval😮: An Open Multilingual Generative Evaluation Benchmark for Foundation Models ☆36 · Updated last year