
Propagation pattern mining algorithm: a parallel approach

Gaetano Riccardo Ricotta

Propagation pattern mining algorithm: a parallel approach.

Supervisors: Paolo Garza, Luca Colomba. Politecnico di Torino, Master's degree programme in Computer Engineering (Ingegneria Informatica), 2023

PDF (Tesi_di_laurea) - Thesis
License: Creative Commons Attribution Non-commercial No Derivatives.

Abstract:

With the widespread adoption of connected devices and increasingly precise sensors, the volume of spatiotemporal data has grown significantly in recent years. This growth calls for tools that can analyze large amounts of spatial and temporal data, both to extract relevant information and to train machine learning models. Such data is often heterogeneous and comes from various sources, such as weather measurements and conditions, traffic information, and road accidents.

This thesis aims to improve the efficiency of an existing framework for spatiotemporal data analysis by parallelizing most of its processing stages with the Spark framework. The goal is to process large amounts of heterogeneous data, including US road accident data and weather condition data, more efficiently in order to extract correlations between spatial and temporal events. The framework involves several stages, including event deduplication, parent-child event correlation, final tree construction, and frequent pattern extraction. Three of these stages (event deduplication, parent-child event correlation, and final tree construction) are parallelized with Spark, while frequent pattern extraction with the SLEUTH algorithm remains sequential. Spark was chosen for its ability to process large amounts of data quickly through parallel execution, and because it easily manages heterogeneous data and integrates various data sources.

The parallel framework was tested on a dataset of 37 million events, which it processed in about 24 hours. This is a significant improvement over the centralized version of the code, which required several days to process a smaller dataset. The evaluation covered three US cities: Boston, Los Angeles, and New York City. The spatiotemporal relationships between events in these cities were evaluated using three temporal thresholds: 10, 15, and 20 minutes. The temporal threshold is the maximum distance, in minutes, between the start times of two events considered during the parent-child relationship search phase.

The results show that using Spark significantly improved the efficiency of the entire spatiotemporal data analysis process, reducing processing times and improving scalability. Parallelizing the three stages made it possible to handle large amounts of data more quickly without compromising the accuracy of the analysis. Frequent pattern extraction with SLEUTH provided useful information on correlations between spatial and temporal events, with potential applications in road accident prevention and road safety improvement.
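
As an illustration of how the temporal threshold constrains the parent-child search, the following is a minimal sketch in plain Python, not the thesis's Spark implementation. It covers only the temporal criterion; any spatial criterion is omitted because the abstract does not specify it. The `Event` type, the `candidate_pairs` function, the convention that the earlier event is the candidate parent, and the inclusive comparison are all assumptions made for this example.

```python
from datetime import datetime, timedelta
from typing import List, NamedTuple, Tuple


class Event(NamedTuple):
    """Hypothetical event record: an identifier and a start time."""
    event_id: str
    start: datetime


def candidate_pairs(events: List[Event], threshold_minutes: int) -> List[Tuple[str, str]]:
    """Return (parent_id, child_id) pairs whose start times differ by at most
    threshold_minutes. Assumption: the earlier event is the candidate parent."""
    limit = timedelta(minutes=threshold_minutes)
    ordered = sorted(events, key=lambda e: e.start)
    pairs = []
    for i, parent in enumerate(ordered):
        for child in ordered[i + 1:]:
            if child.start - parent.start > limit:
                # Events are sorted by start time, so every later event
                # is even further away: stop scanning for this parent.
                break
            pairs.append((parent.event_id, child.event_id))
    return pairs


# Made-up sample data, for illustration only.
events = [
    Event("A", datetime(2023, 1, 1, 8, 0)),
    Event("B", datetime(2023, 1, 1, 8, 12)),
    Event("C", datetime(2023, 1, 1, 8, 19)),
]

for threshold in (10, 15, 20):
    print(threshold, candidate_pairs(events, threshold))
```

With this sample data, a larger threshold admits more candidate pairs: one pair at 10 minutes, two at 15, and three at 20. This is the expected trade-off when comparing the three thresholds used in the evaluation.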

Supervisors: Paolo Garza, Luca Colomba
Academic year: 2022/23
Publication type: Electronic
Number of pages: 64
Subjects:
Degree programme: Master's degree in Computer Engineering (Corso di laurea magistrale in Ingegneria Informatica)
Degree class: New system > Master's degree > LM-32 - Computer Engineering
Collaborating companies: Not specified
URI: http://webthesis.biblio.polito.it/id/eprint/26875