ADVANCEMENTS IN TOPIC MODELING TECHNIQUES: A COMPREHENSIVE STUDY ON ALGORITHM COMPARISON, AND NOVEL METRICS FOR OUTCOME EVALUATION, USING SOCIAL MEDIA DATA

Riccardo Prestigiacomo

Supervisor: Paolo Garza. Politecnico di Torino, Master's degree programme in Data Science and Engineering, 2023

Abstract:

This master's thesis was carried out at Claravista, a Paris-based consulting firm specializing in digital marketing and data science. It belongs to the field of natural language processing and presents a comprehensive comparison of topic modeling algorithms. Individuals generate a very large volume of content on social media platforms every day, and analyzing this data is valuable across many domains, particularly in digital marketing, the fast-paced sector in which Claravista operates. The study evaluates three machine learning algorithms for topic modeling: Latent Dirichlet Allocation (LDA), Top2Vec, and BERTopic. Social media content is the primary data source, and the data is obtained in two stages. First, a generative AI is used to construct a labeled dataset, which enables the exploration of a supervised learning scenario. Second, real-world social media data is used, which moves the analysis to an unsupervised learning scenario. The study has two objectives: a comparative analysis of the three algorithms that identifies their individual strengths and weaknesses, and a quantitative assessment that relies on quantitative metrics to evaluate the efficacy of the models. The thesis thus contributes to natural language processing an in-depth investigation of the performance of LDA, Top2Vec, and BERTopic for topic modeling, together with quantitative metrics for evaluating that performance on social media data.
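As an illustration of what a quantitative assessment can look like in the labeled (supervised) scenario, the sketch below computes cluster purity: the fraction of documents whose true label matches the majority label of the topic they were assigned to. This is a generic, well-known measure chosen here only as an example; the abstract does not state which metrics the thesis uses, and the function name `topic_purity` and the toy data are hypothetical.

```python
from collections import Counter, defaultdict


def topic_purity(predicted_topics, true_labels):
    """Return the purity of a topic assignment against known labels.

    Purity is the share of documents whose true label equals the most
    frequent true label inside their predicted topic. It is a generic
    clustering measure, not a metric taken from the thesis.
    """
    if len(predicted_topics) != len(true_labels):
        raise ValueError("predicted_topics and true_labels must have the same length")
    if not true_labels:
        raise ValueError("at least one document is required")

    # Count how often each true label occurs inside each predicted topic.
    labels_by_topic = defaultdict(Counter)
    for topic, label in zip(predicted_topics, true_labels):
        labels_by_topic[topic][label] += 1

    # For every topic, keep only the count of its majority label.
    majority_total = sum(
        counts.most_common(1)[0][1] for counts in labels_by_topic.values()
    )
    return majority_total / len(true_labels)


# Hypothetical toy data: six documents, three true labels, three predicted topics.
true_labels = ["sports", "sports", "food", "food", "travel", "travel"]
predicted_topics = [0, 0, 1, 1, 1, 2]

score = topic_purity(predicted_topics, true_labels)
print(f"purity = {score:.3f}")
```

The same function could be applied to the topic assignments produced by each of the three algorithms on the same labeled dataset, which gives one comparable number per model. Purity alone rewards models that produce many small topics, so in practice it would be read alongside other measures.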

Supervisors: Paolo Garza
Academic year: 2023/24
Publication type: Electronic
Number of pages: 76
Additional information: Confidential thesis. Full text not available
Subjects:
Degree programme: Master's degree programme in Data Science and Engineering
Degree class: New system > Master's degree > LM-32 - Computer Engineering
Joint supervision institution: TELECOM ParisTech (France)
Collaborating companies: Not specified
URI: http://webthesis.biblio.polito.it/id/eprint/28613