YangLinyi / GLUE-X
We leverage 14 datasets as OOD test data and evaluate 21 widely used models on 8 NLU tasks. Our findings confirm that OOD accuracy in NLP deserves more attention: significant performance decay relative to ID accuracy appears in every setting.
☆117 · Updated last year
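The headline result is the gap between in-distribution (ID) and out-of-distribution (OOD) accuracy. Below is a minimal, hypothetical sketch of that comparison for a single sentiment model, pairing its ID test set (SST-2) with an OOD one (IMDB); the checkpoint, dataset slices, label map, and `accuracy` helper are illustrative assumptions, not GLUE-X's actual evaluation code.

```python
# Hypothetical sketch only: GLUE-X evaluates 21 models across 8 tasks and 14 OOD sets;
# this just illustrates measuring one model's ID -> OOD accuracy decay.
from datasets import load_dataset
from transformers import pipeline

LABEL_TO_ID = {"NEGATIVE": 0, "POSITIVE": 1}  # label map of the chosen checkpoint

def accuracy(classifier, dataset, text_key):
    """Fraction of examples whose predicted label matches the gold label."""
    correct = 0
    for example in dataset:
        pred = classifier(example[text_key], truncation=True)[0]["label"]
        correct += int(LABEL_TO_ID[pred] == example["label"])
    return correct / len(dataset)

# An SST-2 classifier: SST-2 validation is its ID data, IMDB is OOD sentiment data.
classifier = pipeline(
    "text-classification",
    model="distilbert-base-uncased-finetuned-sst-2-english",
)
id_data = load_dataset("glue", "sst2", split="validation[:200]")
ood_data = load_dataset("imdb", split="test").shuffle(seed=0).select(range(200))

id_acc = accuracy(classifier, id_data, text_key="sentence")
ood_acc = accuracy(classifier, ood_data, text_key="text")
print(f"ID acc: {id_acc:.3f} | OOD acc: {ood_acc:.3f} | decay: {id_acc - ood_acc:.3f}")
```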
Related projects
Alternatives and complementary repositories for GLUE-X
- The repository for the paper "Evaluating Open-QA Evaluation" ☆23 · Updated 7 months ago
- DialCoT Meets PPO: Decomposing and Exploring Reasoning Paths in Smaller Language Models ☆13 · Updated last year
- Implementation of the paper "Making Retrieval-Augmented Language Models Robust to Irrelevant Context" ☆62 · Updated 3 months ago
- Semi-Parametric Editing with a Retrieval-Augmented Counterfactual Model ☆65 · Updated 2 years ago
- [NeurIPS 2022] Generating Training Data with Language Models: Towards Zero-Shot Language Understanding ☆62 · Updated 2 years ago
- Code and data for the paper "Context-faithful Prompting for Large Language Models" ☆39 · Updated last year
- [NAACL 2024 Outstanding Paper] Source code for the paper "R-Tuning: Instructing Large Language Models to Say 'I Don't… ☆83 · Updated 4 months ago
- [EMNLP 2023] Official code of "JointMatch: A Unified Approach for Diverse and Collaborative Pseudo-Labeling to Semi-Supervised Text Class… ☆17 · Updated 6 months ago
- Source code of "Reasons to Reject? Aligning Language Models with Judgments" ☆56 · Updated 8 months ago
- Official code for the paper "PiVe: Prompting with Iterative Verification Improving Graph-based Generative Capability of LLMs" ☆31 · Updated 2 months ago
- Code and data for our paper "Alleviating Hallucinations of Large Language Models through Induced Hallucinations" ☆60 · Updated 8 months ago
- Evaluating the Ripple Effects of Knowledge Editing in Language Models ☆50 · Updated 7 months ago
- Lightweight tool to identify data contamination in LLM evaluation ☆42 · Updated 8 months ago
- In-Context Sharpness as Alerts: An Inner Representation Perspective for Hallucination Mitigation (ICML 2024) ☆45 · Updated 7 months ago
- [ICLR'24 spotlight] Tool-Augmented Reward Modeling ☆36 · Updated 8 months ago
- [ICML 2023] Tuning Language Models as Training Data Generators for Augmentation-Enhanced Few-Shot Learning ☆39 · Updated last year
- [NeurIPS 2023 D&B Track] Code and data for the paper "Revisiting Out-of-distribution Robustness in NLP: Benchmarks, Analysis, and LLMs Evalua… ☆29 · Updated last year
- GitHub repository for "FELM: Benchmarking Factuality Evaluation of Large Language Models" (NeurIPS 2023) ☆54 · Updated 10 months ago
- Repo for Llatrieval ☆28 · Updated 3 months ago
- [NeurIPS 2024] Knowledge Circuits in Pretrained Transformers ☆75 · Updated last month
- A method of ensemble learning for heterogeneous large language models ☆30 · Updated 3 months ago
- Repo for our paper "Mr-Ben: A Comprehensive Meta-Reasoning Benchmark for Large Language Models" ☆43 · Updated 3 weeks ago
- Paper list of "The Life Cycle of Knowledge in Big Language Models: A Survey" ☆61 · Updated last year
- [EMNLP 2024] The official GitHub repo for the survey paper "Knowledge Conflicts for LLMs: A Survey" ☆86 · Updated 2 months ago
- Towards Systematic Measurement for Long Text Quality ☆28 · Updated 2 months ago
- Repository for the paper "Cognitive Mirage: A Review of Hallucinations in Large Language Models" ☆46 · Updated last year