Christian Vincenzo Traina
Explaining bias in modern Deep Language Models.
Advisors: Elena Maria Baralis, Giuseppe Attanasio. Politecnico di Torino, Master's degree programme in Ingegneria Informatica (Computer Engineering), 2022
PDF (Tesi_di_laurea), 7MB. License: Creative Commons Attribution Non-commercial.
Abstract
In recent years, episodes of hate speech on the Internet have increased. Hate speech manifests as misogyny, racism, and attacks on minorities. To analyze large amounts of data and curb the spread of hurtful content, modern language models such as BERT are currently employed for automatic hate speech detection. Although these models have outperformed previous solutions, several recent works have shown that they still suffer from unintended bias. Biased models tend to be over-sensitive to a limited set of words, basing their entire decision on those words and ignoring the context. Much recent work has therefore focused on explaining these models, i.e., on understanding their output and how it is obtained. Explanation methods can be based either on exploiting the inner workings of the neural network or on analyzing the output while perturbing the input.
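The second family of methods mentioned above can be illustrated with a minimal sketch of perturbation-based token attribution: remove one token at a time and measure how much the classifier's score drops. The `toxicity_score` function below is a hypothetical stand-in for a real classifier (a production setup would query a fine-tuned BERT model instead); it is only there to keep the example self-contained.

```python
def toxicity_score(tokens):
    # Hypothetical stand-in for a hate-speech classifier's probability
    # output; a real setup would call a fine-tuned BERT model here.
    lexicon = {"hate": 0.6, "stupid": 0.3}
    return min(1.0, sum(lexicon.get(t.lower(), 0.0) for t in tokens))

def token_importance(tokens, score_fn):
    """Importance of each token = drop in score when that token is removed."""
    base = score_fn(tokens)
    return {
        tok: base - score_fn(tokens[:i] + tokens[i + 1:])
        for i, tok in enumerate(tokens)
    }

sentence = "I hate this stupid weather".split()
print(token_importance(sentence, toxicity_score))
```

A model that is over-sensitive to identity terms would show large importance scores for those terms regardless of context, which is exactly the kind of unintended bias the abstract describes.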
