Detecting and understanding food risk factors from social and news data

Aurora Gensale

Detecting and understanding food risk factors from social and news data.

Supervisors: Luca Cagliero, Irene Benedetto. Politecnico di Torino, Master's degree programme in Data Science And Engineering, 2023

Abstract:

Foodborne illnesses pose an ongoing risk to public health, affecting millions of individuals each year, so food safety needs to be monitored and controlled. Recent developments in machine learning and deep learning have facilitated the prediction of food risks. Existing approaches often rely on data sources other than text, including images and structured data. Those that do use textual sources typically work with curated, already edited texts rather than near real-time content such as social media data. Moreover, these works still rely on outdated methodologies such as classical machine learning models (for example, Support Vector Machines (SVMs) or Bayesian networks). Although Language Models are highly capable of handling textual data, they remain little explored in this field. Only one study employs near real-time sources, but it neither analyses the models in depth nor investigates their explainability. This thesis aims to overcome these limitations by applying Deep Natural Language Processing techniques to two tasks, using textual data collected from different sources. Attention focuses first on Twitter (now X) data, given its near real-time nature. Later, government reports from authoritative sources (such as the US Department of Agriculture, the Food Safety Authority of Ireland, the Government of Canada and the Center for Food Safety of Hong Kong) and news articles from several public newspaper websites are considered. These sources differ from Twitter data in their greater length and authority. The first task is a binary classification that detects sentences related to food risks, while the second concerns the extraction of significant information from such sentences. To address these tasks, this work benchmarks several state-of-the-art models to assess the impact of pre-training and of model characteristics such as size and architecture. First, fine-tuning is performed on Twitter data.
Then, to test the models’ ability to handle different types of text, news and report data are also considered. Since approaches that rely on Language Models may suffer from a lack of interpretability, one of the main focuses of this research is an error analysis and an explainability study using the SHAP (SHapley Additive exPlanations) technique. This study is the first in the field of food safety risk prediction to employ heterogeneous data sources in order to evaluate how the models behave when the source changes. Our research indicates that language models are an effective means of detecting food-related risks and thereby supporting food safety. Specifically, we highlight the importance of choosing an appropriate approach to pre-training and fine-tuning in order to optimise model performance. Moreover, the use of heterogeneous textual data demonstrated the generalisation capabilities of the models, which gave good results on all sources. The explainability analysis also showed that the considered models are able to focus on the relevant words within whole texts.
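
As an illustration of the idea behind the SHAP technique mentioned in the abstract, the following minimal Python sketch computes exact Shapley values for the words of a single sentence. It is not the thesis's method: the scorer toy_score, its word weights, the interaction bonus and the example tokens are all hypothetical and invented for this sketch, and the SHAP library approximates these values for real language models rather than enumerating every subset as done here.

```python
from itertools import combinations
from math import factorial

# Hypothetical word weights for a toy "food risk" scorer.
# These numbers are invented for illustration, not learned from data.
TOY_WEIGHTS = {"recall": 2.0, "salmonella": 3.0, "delicious": -1.0}


def toy_score(words):
    """Score a set of present words; higher means more risk-like."""
    score = sum(TOY_WEIGHTS.get(w, 0.0) for w in words)
    # Hypothetical interaction: the two words together add extra evidence.
    if "salmonella" in words and "recall" in words:
        score += 1.0
    return score


def shapley_values(tokens, score=toy_score):
    """Exact Shapley value of each token (tokens assumed distinct):
    its weighted average marginal contribution to the score over all
    subsets of the other tokens."""
    n = len(tokens)
    values = {}
    for tok in tokens:
        others = [t for t in tokens if t != tok]
        total = 0.0
        for k in range(n):
            # Shapley weight for a coalition of size k out of n players.
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                present = set(subset)
                gain = score(present | {tok}) - score(present)
                total += weight * gain
        values[tok] = total
    return values


tokens = ["salmonella", "recall", "on", "delicious", "cheese"]
attributions = shapley_values(tokens)
```

In this toy setting the attributions sum to the difference between the score of the full sentence and the score of the empty one, and the interaction bonus is split equally between the two words that trigger it. Exact enumeration grows exponentially with the number of tokens, which is why approximations are used in practice.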

Supervisors: Luca Cagliero, Irene Benedetto
Academic year: 2023/24
Publication type: Electronic
Number of pages: 108
Additional information: Confidential thesis. Full text not available
Subjects:
Degree programme: Master's degree programme in Data Science And Engineering
Degree class: New system > Master's degree > LM-32 - COMPUTER ENGINEERING
Collaborating companies: MAIZE S.R.L.
URI: http://webthesis.biblio.polito.it/id/eprint/29319