Francesco Conforte
Automatic Classification of textual reviews to enhance mobile application services.
Supervisor: Luca Cagliero. Politecnico di Torino, Master's degree programme in Ict For Smart Societies (Ict Per La Società Del Futuro), 2022
PDF (Tesi_di_laurea)
- Thesis
Access restricted to: staff users only until 19 December 2025 (embargo date). License: Creative Commons Attribution Non-commercial No Derivatives. File size: 7MB |
Abstract: |
Mobile application developers can improve their services by extracting useful information from the reviews that users post on mobile distribution platforms. The classification of such reviews can be automated with Machine Learning techniques that implement a Natural Language Processing task: Text Classification. This thesis, developed in cooperation with Ariston S.p.A., an Italian corporation that produces heating systems and related products, faces several challenges. They concern the data distribution and the taxonomy of classes, both provided and defined by the company itself. In particular, the adopted taxonomy has two levels: five macro-categories branch out into several sub-categories, for a total of 24 classes. This is a considerably high number given the small number of available review texts and their markedly imbalanced distribution across classes. Moreover, some sub-categories partially overlap, because they concern the same topics observed from different points of view. A further constraint is the total absence, among AI communities, of pre-trained word embedding models for this specific context. Given all these constraints, the purpose of this thesis is to develop a classifier capable of automatically classifying a review. The workflow consists of three independent experiments. The first follows the classic multiclass approach, in which all the implemented machine learning models are trained and evaluated on the original dataset. The second applies the multiclass approach again after data balancing, testing different techniques for balancing the dataset distribution. The third implements and evaluates the One-vs-Rest approach. 
Because very little data is available for some sub-categories (only two in some cases), the automatic classification is performed by grouping reviews by macro-category and training the models to distinguish among the macro-categories. Several metrics exist to evaluate the quality of the results, such as the balanced accuracy score and the confusion matrix. Among them, greater importance is given to those that measure the system's ability to recognize specific classes: precision, recall and F1-score. Some models achieve satisfactory performance in distinguishing the majority classes while properly handling the imbalance, but they achieve very poor F1-scores on the minority classes. The reasons come down to two key points. First, the amount of available data is meager and does not allow all models to be trained accurately. Second, the taxonomy of labels needs to be revised because, once sub-categories are grouped, reviews concerning the same topics overlap, and it becomes difficult for a classifier to pick up on the details and classify the text correctly. |
---|---|
Supervisors: | Luca Cagliero |
Academic year: | 2022/23 |
Publication type: | Electronic |
Number of pages: | 93 |
Subjects: | |
Degree programme: | Master's degree programme in Ict For Smart Societies (Ict Per La Società Del Futuro) |
Degree class: | New system > Master's degree > LM-27 - TELECOMMUNICATIONS ENGINEERING |
Collaborating companies: | ARISTON THERMO S.P.A. |
URI: | http://webthesis.biblio.polito.it/id/eprint/25616 |
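
The abstract gives more weight to per-class precision, recall and F1-score than to overall scores. The minimal sketch below illustrates why on an imbalanced toy example. It is not the thesis code: the function names, the two category names and the label lists are hypothetical and chosen only for illustration.

```python
from collections import Counter


def per_class_scores(y_true, y_pred):
    """Return {label: (precision, recall, f1)} for every label seen."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted label gains a false positive
            fn[truth] += 1  # true label gains a false negative
    scores = {}
    for label in labels:
        predicted = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        precision = tp[label] / predicted if predicted else 0.0
        recall = tp[label] / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores[label] = (precision, recall, f1)
    return scores


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall over the classes present in y_true."""
    scores = per_class_scores(y_true, y_pred)
    true_labels = set(y_true)
    return sum(scores[label][1] for label in true_labels) / len(true_labels)


# Hypothetical toy data: 8 majority-class reviews, 2 minority-class reviews.
y_true = ["usability"] * 8 + ["billing"] * 2
y_pred = ["usability"] * 8 + ["usability", "billing"]

plain_accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
scores = per_class_scores(y_true, y_pred)
bal_acc = balanced_accuracy(y_true, y_pred)
```

On this toy data the plain accuracy is 0.9, yet the minority class reaches a recall of only 0.5, which lowers the balanced accuracy to 0.75. This is the kind of gap between majority and minority classes that per-class metrics are meant to expose.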