Sensitive attributes disproportion as a risk indicator of algorithmic unfairness

Federico D'Asaro

Sensitive attributes disproportion as a risk indicator of algorithmic unfairness.

Supervisors: Antonio Vetro', Juan Carlos De Martin. Politecnico di Torino, Master's degree programme in Data Science and Engineering, 2021

PDF (Tesi_di_laurea) - Thesis
License: Creative Commons Attribution Non-commercial No Derivatives.
File size: 5MB

Archive (ZIP) (Documenti_allegati) - Other
License: Creative Commons Attribution Non-commercial No Derivatives.
File size: 43MB

Abstract:	Candidate: Federico d’Asaro. Supervisor: Antonio Vetrò (researcher). Co-supervisor: Prof. Juan Carlos De Martin. Date: 07/09/2021.
AI is increasingly used in highly sensitive areas such as health care and hiring, which has drawn wider attention to the bias and unfairness it can embed. One might assume that automating decisions with data makes them fair, but this is not the case: bias can enter through societal bias embedded in training datasets, through decisions made during the machine learning development process, and through other channels. Our aim is to anticipate unfairness before any algorithm is applied, by studying how balanced protected attributes such as age, ethnicity, and gender are in the data. We start by replicating the results of [1], analyzing the relationships between balance and unfairness indices. We first evaluate balance indexes (in the interval [0,1], where 0 is imbalance and 1 is balance), namely Gini, Simpson, Shannon, Imbalance Ratio (IIR), Rényi, and Hill, on the training data of 9 datasets. Then, on test data, we compute Independence, Separation (True Positive Rate, TPR, and False Positive Rate, FPR), Sufficiency (Positive Predictive Value, PPV, and Negative Predictive Value, NPV), and Overall Accuracy Equality (OAE) as measures of discrimination (in the interval [0,1], where 0 is fairness and 1 is unfairness), with respect to the same sensitive attributes used for the balance assessment. The study is conducted on several levels: different models (Logistic Regression, LR; Support Vector Machine, SVM; K-Nearest Neighbors, KNN; Random Forest, RF) and variants thereof (baseline, SMOTE) are considered. Balance indexes and unfairness indexes were first analyzed separately.
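To make the balance indexes concrete, here is a minimal Python sketch of four of them (Gini, Shannon, Simpson, imbalance ratio) for a single categorical attribute. The normalizations used are common textbook forms that map each index onto [0,1]; they are an assumption of this sketch and may differ from the exact definitions in the thesis or in [1]. The function names are hypothetical, the number of classes m is taken to be the number of classes observed in the data, and Rényi and Hill are omitted because they depend on an extra order parameter.

```python
import math
from collections import Counter


def class_frequencies(values):
    """Relative frequency of each observed class of a categorical attribute."""
    counts = Counter(values)
    total = sum(counts.values())
    return [count / total for count in counts.values()]


def balance_indexes(values):
    """Return balance indexes in [0, 1]: 0 = fully imbalanced, 1 = balanced.

    Assumed normalizations (not confirmed by the source):
      gini            = m / (m - 1) * (1 - sum(f_i^2))
      shannon         = -sum(f_i * ln f_i) / ln m
      simpson         = (1 / sum(f_i^2) - 1) / (m - 1)
      imbalance_ratio = min(f_i) / max(f_i)
    """
    freqs = class_frequencies(values)
    m = len(freqs)  # number of classes actually observed in the data
    if m < 2:
        raise ValueError("need at least two observed classes")
    sum_sq = sum(f * f for f in freqs)
    return {
        "gini": m / (m - 1) * (1 - sum_sq),
        "shannon": -sum(f * math.log(f) for f in freqs) / math.log(m),
        "simpson": (1 / sum_sq - 1) / (m - 1),
        "imbalance_ratio": min(freqs) / max(freqs),
    }


if __name__ == "__main__":
    # Hypothetical attribute with a 90/10 split between two classes.
    print(balance_indexes(["a"] * 9 + ["b"]))
```

Under these assumed formulas, a 90/10 split between two classes scores roughly 0.36 (Gini), 0.47 (Shannon), 0.22 (Simpson), and 0.11 (imbalance ratio), while a 50/50 split scores 1 on all four.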
Further investigations examined the relationships between the two kinds of indices, to evaluate how well the former serve as indicators of discrimination. As regards balance measures, Gini and Shannon penalize disproportion less than the other indexes. Furthermore, they benefit from lower unfairness risk-level thresholds in terms of TPR, PPV, and OAE. As regards unfairness, the baseline and SMOTE variants of the algorithms differ: the former favours Independence and Separation (low unfairness), while the latter reaches lower discrimination on Sufficiency. Looking at the distribution of unfairness across two balance risk levels (with a threshold at 33%, under which imbalance is classified as ‘high risk’), IIR is the index that best anticipates discrimination; it fails only on OAE. Shannon is the second-best index; it fails to discriminate between the two risk levels on FPR and NPV. Most of these observations hold under an extended assessment with additional datasets, especially for the Random Forest model. As for the correlation between balance measures and unfairness measures, Independence, TPR, and PPV are the easiest to correlate with. For Independence, SVM is the model that obtains the highest values across the four balance indexes. RF performs very well on Independence and Separation, KNN on Sufficiency and OAE. A correlation comparison by attribute cardinality was also carried out; it showed that IIR takes on an undesired positive correlation on attributes with 8 classes.
[1] Vetrò, A., Torchiano, M., & Mecati, M. (2021). https://doi.org/10.1016/j.giq.2021.101619
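The unfairness measures listed in the abstract can be illustrated with a short Python sketch that computes, for each group of a sensitive attribute, the rates behind Independence, Separation (TPR, FPR), Sufficiency (PPV, NPV), and OAE, and then summarizes each as a gap in [0,1]. The aggregation used here (largest minus smallest per-group rate) is an assumption chosen for simplicity; the abstract does not say how the thesis aggregates across groups. The function names and the binary 0/1 label encoding are likewise hypothetical.

```python
from collections import defaultdict

METRICS = ("independence", "tpr", "fpr", "ppv", "npv", "oae")


def _ratio(num, den):
    """Safe division: None when the denominator is zero (rate undefined)."""
    return num / den if den else None


def group_rates(y_true, y_pred, groups):
    """Per-group rates from binary labels (1 = positive, 0 = negative)."""
    cells = defaultdict(lambda: {"tp": 0, "fp": 0, "tn": 0, "fn": 0})
    for t, p, g in zip(y_true, y_pred, groups):
        # "t"/"f" = prediction correct or not, "p"/"n" = predicted class
        key = ("t" if t == p else "f") + ("p" if p == 1 else "n")
        cells[g][key] += 1
    rates = {}
    for g, c in cells.items():
        n = sum(c.values())
        rates[g] = {
            "independence": _ratio(c["tp"] + c["fp"], n),  # P(pred = 1)
            "tpr": _ratio(c["tp"], c["tp"] + c["fn"]),
            "fpr": _ratio(c["fp"], c["fp"] + c["tn"]),
            "ppv": _ratio(c["tp"], c["tp"] + c["fp"]),
            "npv": _ratio(c["tn"], c["tn"] + c["fn"]),
            "oae": _ratio(c["tp"] + c["tn"], n),  # per-group accuracy
        }
    return rates


def unfairness_gaps(rates):
    """Assumed aggregation: max minus min of each rate across groups.

    0 means all groups share the same rate; 1 is the largest possible gap.
    A metric is None if fewer than two groups have a defined rate.
    """
    gaps = {}
    for metric in METRICS:
        vals = [r[metric] for r in rates.values() if r[metric] is not None]
        gaps[metric] = max(vals) - min(vals) if len(vals) >= 2 else None
    return gaps


if __name__ == "__main__":
    # Hypothetical toy data: two groups, four samples each.
    y_true = [1, 1, 0, 0, 1, 1, 0, 0]
    y_pred = [1, 1, 0, 0, 1, 0, 0, 0]
    groups = ["A", "A", "A", "A", "B", "B", "B", "B"]
    print(unfairness_gaps(group_rates(y_true, y_pred, groups)))
```

Pairing these gaps with the balance value of the same attribute on the training data gives the kind of (balance, unfairness) observations that the correlation and risk-level analyses described above operate on.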
Supervisors:	Antonio Vetro', Juan Carlos De Martin
Academic year:	2021/22
Publication type:	Electronic
Number of pages:	122
Subjects:
Degree programme:	Master's degree programme in Data Science and Engineering
Degree class:	New system > Master's degree > LM-32 - COMPUTER ENGINEERING
Collaborating companies:	NOT SPECIFIED
URI:	http://webthesis.biblio.polito.it/id/eprint/20558