VILA-Lab / Open-LLM-Leaderboard
Open-LLM-Leaderboard: Open-Style Question Evaluation. Paper at https://arxiv.org/abs/2406.07545
☆50 · Updated last year
Alternatives and similar repositories for Open-LLM-Leaderboard
Users interested in Open-LLM-Leaderboard are comparing it to the repositories listed below.
- [NAACL 2025] A Closer Look into Mixture-of-Experts in Large Language Models ☆60 · Updated last year
- [NeurIPS 2024] A Novel Rank-Based Metric for Evaluating Large Language Models ☆57 · Updated 8 months ago
- ☆145 · Updated 4 months ago
- ☆64 · Updated last year
- ☆20 · Updated last year
- ☆30 · Updated last year
- [ACL 2025] Are Your LLMs Capable of Stable Reasoning? ☆32 · Updated 6 months ago
- ☆104 · Updated last year
- Code and Data for "Long-context LLMs Struggle with Long In-context Learning" [TMLR 2025] ☆111 · Updated 11 months ago
- Exploration of automated dataset selection approaches at large scales. ☆52 · Updated 11 months ago
- Codebase for Instruction Following without Instruction Tuning ☆36 · Updated last year
- ☆142 · Updated 10 months ago
- DocBench: A Benchmark for Evaluating LLM-based Document Reading Systems ☆65 · Updated last year
- ☆14 · Updated 2 years ago
- A dataset of LLM-generated chain-of-thought steps annotated with mistake location. ☆85 · Updated last year
- Co-LLM: Learning to Decode Collaboratively with Multiple Language Models ☆126 · Updated last year
- ☆108 · Updated 2 months ago
- ☆80 · Updated 10 months ago
- [NeurIPS'24] Weak-to-Strong Search: Align Large Language Models via Searching over Small Language Models ☆65 · Updated last year
- [ACL'25 Oral] What Happened in LLMs Layers when Trained for Fast vs. Slow Thinking: A Gradient Perspective ☆75 · Updated 7 months ago
- [ACL 2025] We introduce ScaleQuest, a scalable, novel and cost-effective data synthesis method to unleash the reasoning capability of LLM… ☆68 · Updated last year
- [COLING'25] Exploring Concept Depth: How Large Language Models Acquire Knowledge at Different Layers? ☆82 · Updated last year
- Code implementation of synthetic continued pretraining ☆148 · Updated last year
- HelloBench: Evaluating Long Text Generation Capabilities of Large Language Models ☆53 · Updated last year
- Organize the Web: Constructing Domains Enhances Pre-Training Data Curation ☆77 · Updated 9 months ago
- [NeurIPS 2024] A comprehensive benchmark for evaluating critique ability of LLMs ☆49 · Updated last year
- "A Survey on Agent-as-a-Judge" ☆87 · Updated 3 weeks ago
- Official PyTorch Implementation of EMoE: Unlocking Emergent Modularity in Large Language Models [main conference @ NAACL 2024] ☆39 · Updated last year
- Long Context Extension and Generalization in LLMs ☆62 · Updated last year
- [ICML 2025] Predictive Data Selection: The Data That Predicts Is the Data That Teaches ☆60 · Updated 11 months ago