Explaining bias in modern Deep Language Models

Christian Vincenzo Traina

Supervisors: Elena Maria Baralis, Giuseppe Attanasio. Politecnico di Torino, Master's degree programme in Computer Engineering (Corso di laurea magistrale in Ingegneria Informatica), 2022

License: Creative Commons Attribution Non-commercial.

Abstract:	In recent years, episodes of hate speech on the Internet have increased. Hate speech manifests as misogyny, racism, and attacks on minorities. To analyze large amounts of data and curb the spread of hurtful content, modern language models such as BERT are currently employed for automatic hate speech detection. Although these models have outperformed previous solutions, several recent works have shown that they still suffer from unintended bias. Biased models tend to be over-sensitive to a limited set of words, so they base the entire decision on those words alone and ignore the context. Much recent work has focused on explaining these models, that is, on understanding their output and how it is obtained. Explanation methods either exploit the inner workings of the neural network or analyze the output while perturbing the input. In this thesis, several techniques are used to explain neural networks: attention maps, as extracted from BERT inference; their transformation, called effective attention maps; Hidden Token Attribution (HTA), a gradient-based explainer; Sampling-And-Occlusion (SOC), a hierarchical explainer; Minimal Contrastive Editing (MiCE), a recent algorithm that produces counterfactual explanations; and two SHAP variants, KernelSHAP and DeepSHAP. The main contributions of this thesis are the selection of the best explanation methods for detecting unintended bias in modern neural networks, evidence that most explainers express different types of explanations, and evidence that peaks in contribution scores are more common in false-positive samples. The analysis was performed on two English-language hate speech detection datasets, collected from Twitter and manually annotated. They concern misogyny and hatred against immigrants.
The explanations were evaluated according to their quality, measured as the deviation from human explanations. The latter were collected through a parallel survey in which 25 volunteers were asked to indicate the most important words in a sentence. Other criteria included lead time, ease of reading, and theoretical background. The results on these datasets show that attention maps and effective attention maps express the same type of explanation in most of the cases considered. Both achieve very high quality, although their output is not easily understood by non-experts, since reading it requires technical knowledge of the network. With MiCE and SOC, whose outputs are contrastive and hierarchical respectively, bias can be identified quickly, so their use is encouraged despite their high lead time. HTA is fast to compute, and its quality remained consistent across experiments. Finally, both DeepSHAP and KernelSHAP detected the bias in most cases, but the quality of their explanations was significantly lower than that of the other methods. Two additional experiments tested whether peaks in the word contribution scores (derived from the attention features) were more frequent in false-positive samples and whether the explainers used were correlated. The results show that false-positive samples do correlate with peaks in attention features and that none of the explanation methods used is redundant, with the sole exception of attention maps and effective attention maps.
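
The perturbation-based idea mentioned in the abstract (explaining a prediction by changing the input and watching the output) can be illustrated with a minimal leave-one-out occlusion sketch. This is not the thesis's implementation, and it is far simpler than SOC, KernelSHAP, or MiCE. The classifier `toy_score`, its `TRIGGER_WEIGHTS`, the helper names `occlusion_attributions` and `has_peak`, and the 0.5 peak threshold are all hypothetical, chosen only to mimic a model that is over-sensitive to a single word.

```python
from typing import Callable, List, Tuple

# Hypothetical stand-in for a biased hate speech classifier: it reacts only
# to a few trigger words and ignores the rest of the sentence.
TRIGGER_WEIGHTS = {"women": 0.6, "hate": 0.3}


def toy_score(tokens: List[str]) -> float:
    """Return a pseudo-probability of the 'hateful' class in [0, 1]."""
    return min(1.0, sum(TRIGGER_WEIGHTS.get(t.lower(), 0.0) for t in tokens))


def occlusion_attributions(
    tokens: List[str], score_fn: Callable[[List[str]], float]
) -> List[Tuple[str, float]]:
    """Leave-one-out attribution: the score drop when each token is removed."""
    base = score_fn(tokens)
    attributions = []
    for i, token in enumerate(tokens):
        occluded = tokens[:i] + tokens[i + 1:]
        attributions.append((token, base - score_fn(occluded)))
    return attributions


def has_peak(attributions: List[Tuple[str, float]], threshold: float = 0.5) -> bool:
    """True if one token holds more than `threshold` of the positive contribution.

    The threshold is an illustrative assumption, not a value from the thesis.
    """
    positive = [max(score, 0.0) for _, score in attributions]
    total = sum(positive)
    return total > 0 and max(positive) / total > threshold


# A non-hateful sentence that the toy model still scores as hateful.
sentence = "I admire strong women".split()
scores = occlusion_attributions(sentence, toy_score)
for token, contribution in scores:
    print(f"{token:>8}: {contribution:.2f}")
print("peak detected:", has_peak(scores))
```

In this toy setting the whole score is attributed to one word of a harmless sentence, which is the kind of contribution peak the abstract associates with false-positive samples.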
Supervisors:	Elena Maria Baralis, Giuseppe Attanasio
Academic year:	2021/22
Publication type:	Electronic
Number of pages:	109
Subjects:
Degree programme:	Master's degree programme in Computer Engineering (Corso di laurea magistrale in Ingegneria Informatica)
Degree class:	New organization > Master's degree > LM-32 - COMPUTER ENGINEERING
Collaborating companies:	Not specified
URI:	http://webthesis.biblio.polito.it/id/eprint/22587
