Nicola Maddalozzo
Methods and Measures for bias detection in natural language processing: A study on word embeddings and masked models.
Supervisors: Eliana Pastor, Laura Alonso Alemany. Politecnico di Torino, Master's degree programme in Data Science And Engineering, 2023
PDF (Tesi_di_laurea) - Thesis, 2MB. License: Creative Commons Attribution Non-commercial No Derivatives.
Abstract:

The society in which we live is influenced by prejudices that discriminate against specific groups of the population. In recent years, the presence of these biases has been detected in the textual data used to train natural language processing algorithms. As a result, tools based on these algorithms exhibit biases that harm specific categories of people. Beyond the harm they cause to the people affected, these tools violate the fundamental right to non-discrimination, which may expose the companies and institutions that created them to legal action. To detect and characterize this type of bias in natural language processing tools, the scientific community has developed dedicated methods and metrics. In this thesis, we apply these methods to analyze two different types of tools used in natural language processing. The first is word embeddings, which map words to vector representations, obtained with a Word2Vec or GloVe model. The second is based on large BERT-type masked language models trained on textual data. We apply bias detection methods both to word embeddings and to a model's output sequences, and we analyze their capabilities and limitations. We then propose a novel evaluation to assess whether a metric measures other phenomena besides a possible bias, and a novel approach to assess whether a model of the BERT family can be effectively evaluated in terms of the biases it contains. In the analysis phase, we conducted experiments for bias detection and measurement using these two types of tools on Spanish texts. This study is of interest to the community, since most existing assessments focus on evaluating bias in English texts due to the prevalence of models and resources in that language.
| Field | Value |
|---|---|
| Supervisors | Eliana Pastor, Laura Alonso Alemany |
| Academic year | 2023/24 |
| Publication type | Electronic |
| Number of pages | 110 |
| Subjects | |
| Degree programme | Master's degree programme in Data Science And Engineering |
| Degree class | New system > Master's degree > LM-32 - Computer Engineering |
| Joint supervision with | Universidad Nacional de Cordoba (ARGENTINA) |
| Collaborating institutions | FAMAF (Facultad de Matemática, Astronomía, Física y Computación) |
| URI | http://webthesis.biblio.polito.it/id/eprint/28495 |