guijinSON / MM-Eval
Official implementation for "MM-Eval: A Multilingual Meta-Evaluation Benchmark for LLM-as-a-Judge and Reward Models"
☆17Updated last year
Alternatives and similar repositories for MM-Eval
Users who are interested in MM-Eval are comparing it to the libraries listed below.
- Official Code for M-RᴇᴡᴀʀᴅBᴇɴᴄʜ: Evaluating Reward Models in Multilingual Settings (ACL 2025 Main)☆40Updated 8 months ago
- [ACL 2024] LangBridge: Multilingual Reasoning Without Multilingual Supervision☆95Updated last year
- Multilingual Large Language Models Evaluation Benchmark☆133Updated last year
- The geometry of multilingual language model representations (EMNLP 2022).☆22Updated 3 years ago
- [ACL 2025 Main] Official Repository for "Evaluating Language Models as Synthetic Data Generators"☆40Updated last year
- ☆45Updated last year
- [NeurIPS 2025] Reasoning Models Better Express Their Confidence☆22Updated 2 months ago
- Code for Zero-Shot Tokenizer Transfer☆142Updated last year
- ☆187Updated 7 months ago
- ☆22Updated 3 years ago
- Repository for "Scaling Evaluation-time Compute with Reasoning Models as Process Evaluators"☆12Updated 10 months ago
- Code associated with ACL 2021 DExperts paper☆118Updated 2 years ago
- ☆145Updated last year
- What's In My Big Data (WIMBD) - a toolkit for analyzing large text datasets☆226Updated last year
- BLEnD: A Benchmark for LLMs on Everyday Knowledge in Diverse Cultures and Languages☆45Updated 5 months ago
- [ICLR 2022] Towards Continual Knowledge Learning of Language Models☆92Updated 3 years ago
- A curated list of research papers and resources on Cultural LLM.☆53Updated last year
- ☆13Updated last year
- [EMNLP 2022] TemporalWiki: A Lifelong Benchmark for Training and Evaluating Ever-Evolving Language Models☆74Updated last year
- ☆68Updated 2 years ago
- Repo accompanying our paper "Do Llamas Work in English? On the Latent Language of Multilingual Transformers".☆80Updated last year
- Crosslingual Reasoning through Test-Time Scaling☆20Updated 8 months ago
- ☆27Updated last year
- ☆11Updated 4 months ago
- Code for "Tracing Knowledge in Language Models Back to the Training Data"☆39Updated 3 years ago
- ☆55Updated last year
- ☆75Updated 2 years ago
- 👻 Code and benchmark for our EMNLP 2023 paper - "FANToM: A Benchmark for Stress-testing Machine Theory of Mind in Interactions"☆59Updated last year
- ☆85Updated last year
- ☆88Updated 3 years ago