
classification of imbalanced data applied to insurance market

Miriam Dessi'

classification of imbalanced data applied to insurance market.

Supervisor: Luca Cagliero. Politecnico di Torino, Corso di laurea magistrale in Ingegneria Matematica, 2019

PDF (Tesi_di_laurea) - Thesis
License: Creative Commons Attribution Non-commercial No Derivatives.

Abstract:

The class imbalance problem can be considered one of the top problems in data mining today, as it is present in many real-world domains such as computer science, epidemiology, finance and so on. This has brought growing attention from both academia and industry. In this master's thesis, a critical study of the nature of the problem, the state-of-the-art solutions, an explanation of specific performance measures and a real application of this problem has been carried out. In particular, the first part of the work presents a discussion of the data imbalance problem itself. We analyze how the skewed distribution affects standard classification learning algorithms, which are generally biased towards the majority class. The reason is generally rooted in the structure of the classifier's learning process, which is often built with the aim of optimizing global metrics such as accuracy. This might lead to distorted conclusions about performance: for example, on a data set with an imbalance ratio (number of majority class instances divided by number of minority class instances) of 99, a classifier can achieve an accuracy of 99% simply by classifying all elements as belonging to the majority class, so its performance is not as good as the accuracy suggests. However, the imbalanced distribution of the data is not the only factor that hinders the learning task. Several intrinsic data characteristics have a strong impact on classification performance. The problems of small disjuncts, overlap between classes, noise and borderline examples are explained, showing how they affect the learning process. In the second part of the thesis, state-of-the-art solutions to these issues are presented. They can be divided into four groups: data level, algorithm level, cost-sensitive and ensemble methods.
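The gap between accuracy and minority-class performance described above can be checked with a few lines of arithmetic. The following is a minimal sketch under assumed, purely illustrative counts (990 majority and 10 minority instances, not taken from the thesis); the helper names `accuracy` and `f_measure` are hypothetical and written only for this example.

```python
# Illustrative counts only (an assumption, not data from the thesis):
# 990 majority instances (label 0) and 10 minority instances (label 1).

def accuracy(y_true, y_pred):
    """Fraction of instances whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def f_measure(y_true, y_pred, positive=1):
    """F-measure (harmonic mean of precision and recall) for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

y_true = [0] * 990 + [1] * 10
# A trivial classifier that assigns every instance to the majority class.
y_pred = [0] * len(y_true)

imbalance_ratio = y_true.count(0) / y_true.count(1)
acc = accuracy(y_true, y_pred)
f1 = f_measure(y_true, y_pred)

print(f"imbalance ratio: {imbalance_ratio:.0f}")
print(f"accuracy:        {acc:.2f}")
print(f"F-measure:       {f1:.2f}")
```

Under these assumed counts the trivial classifier reaches high accuracy while never detecting a single minority instance, which is why class-specific measures such as the F-measure are preferred on skewed data.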
Data level approaches [\cite{chawla2002smote},\cite{han2005borderline},\cite{bunkhumpornpat2009safe},\cite{he2008adasyn},\cite{yun2016automatic},\cite{wilson1972asymptotic},\cite{tomek1976two},\cite{kubat1997addressing},\cite{laurikkala2001improving}] use sampling methods to balance the class distribution. Resampling techniques can be categorized into three groups: undersampling, oversampling and hybrids. Algorithm level or internal approaches aim to improve the learning process, acting on the classifier itself or on the training data. Cost-sensitive approaches can operate at the data level, at the algorithm level, or both. The objective of this kind of solution is to assign a different misclassification cost to each class. Ensembles combine all these approaches: they consist in training several classifiers and then aggregating their predictions in order to handle overfitting problems. The two most famous ensemble techniques are Bagging and Boosting. Finally, as an application, a case study developed during an internship at Reale Mutua Assicurazioni is provided. In this final part several experiments are conducted to cope with the imbalance problem. First, the performance of standard classifiers such as SVM, logistic regression, decision tree and random forest is analyzed, highlighting the weaknesses of the different classifiers; then their performance is improved by employing data level techniques such as SMOTE, ADASYN, RUS, ROS, Tomek links, K-means SMOTE [\cite{last2017oversampling}]. The experimental results will show that the decision tree classifier outperforms the other classifiers in terms of F-measure when ROS is used as re
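Random oversampling (ROS), one of the data level techniques named in the abstract, can be sketched in plain Python as follows. This is a minimal illustration and not the implementation used in the thesis: the function name `random_oversample`, the toy data and the fixed seed are all assumptions made for the example.

```python
import random
from collections import Counter

def random_oversample(X, y, seed=0):
    """Duplicate randomly chosen instances of each smaller class (sampling
    with replacement) until every class is as large as the largest one.
    Hypothetical helper written for illustration only."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_res, y_res = list(X), list(y)
    for label, count in counts.items():
        # Indices of the original instances belonging to this class.
        indices = [i for i, lbl in enumerate(y) if lbl == label]
        for _ in range(target - count):
            i = rng.choice(indices)
            X_res.append(X[i])
            y_res.append(label)
    return X_res, y_res

# Toy data (an assumption): 10 majority instances and 2 minority instances.
X = [[float(i)] for i in range(12)]
y = [0] * 10 + [1] * 2

X_res, y_res = random_oversample(X, y)
balanced_counts = Counter(y_res)
print(dict(balanced_counts))
```

Resampling of this kind is normally applied to the training split only, so that the test set keeps the original class distribution and the reported measures are not inflated by duplicated instances.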

Supervisors: Luca Cagliero
Academic year: 2019/20
Publication type: Electronic
Number of pages: 63
Subjects:
Degree programme: Corso di laurea magistrale in Ingegneria Matematica
Degree class: New organization > Master's degree > LM-44 - MATHEMATICAL-PHYSICAL MODELLING FOR ENGINEERING
Collaborating companies: REALE MUTUA ASSICURAZIONI
URI: http://webthesis.biblio.polito.it/id/eprint/12731