Francesca Avidano
Machine Learning techniques. Clustering and Classification: a project with Banca di Asti.
Supervisor: Daniele Apiletti. Politecnico di Torino, Master's degree programme in Data Science And Engineering, 2022
Abstract: |
This master's thesis explores and applies the best-known machine learning algorithms to a dataset provided by Banca di Asti, a local bank in Northern Italy. The data come from observations collected in 2021 on a subset of the bank’s customers. To restrict the analysis to clients who can benefit from the services offered by Filiale OnLine, some restrictions were necessary: only private customers who hold at least one bank account and the internet banking product “Banca Semplice Home” were selected. After an analysis of the dataset's dimensions and features, two kinds of machine learning techniques were applied: customer clustering as the unsupervised task, and classification as the supervised task. In the first part of the thesis, a clustering pipeline is followed with the goal of identifying the number of clusters that best represents the customer subset. The chapter opens by illustrating the kinds of clustering and the types of clusters. The principal clustering models, K-Means, hierarchical clustering and DBSCAN, are then studied along with their strengths and weaknesses, and the idea behind each algorithm is explained. In addition, some internal indices for cluster validation are considered. Before the models are applied to Banca di Asti’s dataset, some preprocessing steps are performed, such as outlier removal and standardization. Finally, the results are shown and compared. The goal of the second part of the project is instead a classification task: predicting whether a customer will purchase a product or sign a document through the digital services offered by Filiale OnLine. Although the same original dataset is used, data exploration and data preparation follow a different approach from the previous part. 
Feature distribution, correlation and balance are investigated in order to choose the appropriate data cleaning algorithms. As in the clustering chapter, the best-known and most widely used classifiers are described and applied: Logistic Regression, Decision Tree, Random Forest, K-Nearest Neighbors and Support Vector Machine. The results are then evaluated through metrics computed on the confusion matrices. The last chapter draws some concluding considerations. Even though this project concerns a very specific kind of customer and data from 2021, it can be rerun on more recent observations, and it offers a good starting point for increasing the number of people who prefer the online channel to the physical branch for document signing. With the proper changes, it could also be adapted in due time to include companies and firms in the classification task. |
---|---|
Supervisors: | Daniele Apiletti |
Academic year: | 2022/23 |
Publication type: | Electronic |
Number of Pages: | 88 |
Additional Information: | Thesis under embargo. Full text not available |
Subjects: | |
Degree programme: | Master's degree programme in Data Science And Engineering |
Degree class: | New organization > Master science > LM-32 - COMPUTER SYSTEMS ENGINEERING |
Collaborating companies: | Cassa di Risparmio di Asti |
URI: | http://webthesis.biblio.polito.it/id/eprint/25528 |
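
The abstract states that the classifiers were evaluated through metrics computed on confusion matrices. Since the full text is not available, the following is only a minimal sketch of how such metrics are commonly derived for a binary task; it is not taken from the thesis. The names `ConfusionMatrix` and `confusion_matrix` are hypothetical, the labels are made-up toy data, and treating "customer signs online" as the positive class (1) is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionMatrix:
    """Counts for a binary classifier; positive class (1) is an assumption."""
    tp: int  # predicted 1, actually 1
    fp: int  # predicted 1, actually 0
    fn: int  # predicted 0, actually 1
    tn: int  # predicted 0, actually 0

    @property
    def accuracy(self) -> float:
        total = self.tp + self.fp + self.fn + self.tn
        return (self.tp + self.tn) / total if total else 0.0

    @property
    def precision(self) -> float:
        denom = self.tp + self.fp
        return self.tp / denom if denom else 0.0

    @property
    def recall(self) -> float:
        denom = self.tp + self.fn
        return self.tp / denom if denom else 0.0

    @property
    def f1(self) -> float:
        p, r = self.precision, self.recall
        return 2 * p * r / (p + r) if (p + r) else 0.0


def confusion_matrix(y_true, y_pred) -> ConfusionMatrix:
    """Build the four counts from two equal-length sequences of 0/1 labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == 1 and truth == 1:
            tp += 1
        elif pred == 1 and truth == 0:
            fp += 1
        elif pred == 0 and truth == 1:
            fn += 1
        else:
            tn += 1
    return ConfusionMatrix(tp=tp, fp=fp, fn=fn, tn=tn)


if __name__ == "__main__":
    # Toy labels for illustration only; these are not results from the thesis.
    y_true = [1, 0, 0, 1, 0, 0, 1, 0]
    y_pred = [1, 0, 1, 0, 0, 0, 1, 0]
    cm = confusion_matrix(y_true, y_pred)
    print(cm)
    print(f"accuracy={cm.accuracy:.3f} precision={cm.precision:.3f} "
          f"recall={cm.recall:.3f} f1={cm.f1:.3f}")
```

When one class is much rarer than the other, as may happen with customers who sign online, precision, recall and F1 are usually more informative than accuracy alone, which is one reason to look at feature balance before choosing a metric.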