YangLinyi / GLUE-X

We use 14 datasets as out-of-distribution (OOD) test data and evaluate 21 widely used models on 8 NLU tasks. Our findings confirm that OOD accuracy in NLP tasks deserves more attention, since a significant performance drop relative to in-distribution (ID) accuracy appears in all settings.

Related projects

Alternatives and complementary repositories for GLUE-X