Gianvito Liturri
Temporal Co-location Pattern Discovery in Spatiotemporal Data through Parallel Computing.
Rel. Paolo Garza, Luca Colomba. Politecnico di Torino, Corso di laurea magistrale in Data Science And Engineering, 2023
|
PDF (Tesi_di_laurea)
- Tesi
Licenza: Creative Commons Attribution Non-commercial No Derivatives. Download (16MB) | Preview |
Abstract: |
This master thesis investigates co-location patterns: categories of entities that frequently appear close to each other. In the co-location pattern mining problem, the categories are called features and each entity that belongs to a specific category is named instance. In the urban field, an example can be "John's Restaurant" which is an instance of the feature "Restaurant". Co-location patterns are important since they reveal underlying correlations between entities. Several algorithms were proposed in literature to tackle the spatial co-location mining task. A step forward in state-of-the-art methods consists in implementing parallel approaches. The idea behind it is to divide the entire dataset into independent partitions that will be processed in parallel. Most of the proposed solutions consist in sequential algorithms. Instead, the few algorithms that use a parallel approach have a common issue: the partitions that should be processed in parallel are not independent and thus require information sharing, creating a bottleneck in the computation. This thesis explores one of the most recent parallel algorithms in the context of spatial colocation mining, introduced in the paper entitled "Parallel Co-location Pattern Mining based on Neighbor-Dependency Partition and Column Calculation". Given that the authors did not disclose their original parallel implementation, the first step in this thesis consisted in the reimplementation of the aforementioned algorithm using PySpark and Python. To verify the correctness of the algorithm, experimental evaluations were performed on a real dataset related to plants of Gaoligong Mountain and were run on the SmartData@PoliTO cluster. The framework is based on two main concepts: the "neighbor-dependency partition" and "Column Calculation". The first one divides the dataset into completely independent partitions. The second one, instead, can finds patterns efficiently. The results achieved in the experimental evaluations are comparable to the results reported in the paper. Moreover, this thesis proposes a novel extensions to such algorithm, adding the time dimension. The first version of the algorithm proposed wants to find spatial-temporal correlation between people and trajectories. The idea behind this is that if two people are friends or have any other relationship, they will have similar positions within the same timeframes. In the implementation we consider for each person a set of trajectories; when we want to find correlation between two people, we verify if they were close to each other in similar timestamps. We verify closeness in terms of both spatial and temporal distance and not exact position sharing. The first version is tested on a large dataset, named Geolife Trajectories, to recognize relationships among people in Beijing. Furthermore, in a second version of the algorithm we extended the analysis by including points of interests (POIs) in the mining process. Doing so, this second version of the algorithm is able to extract information about people and the places that they frequently attend together. The proposed methodology demonstrates that the implemented algorithm is effective and efficient for discovering spatial-temporal behavioural patterns using parallel computing techniques. |
---|---|
Relators: | Paolo Garza, Luca Colomba |
Academic year: | 2022/23 |
Publication type: | Electronic |
Number of Pages: | 81 |
Subjects: | |
Corso di laurea: | Corso di laurea magistrale in Data Science And Engineering |
Classe di laurea: | New organization > Master science > LM-32 - COMPUTER SYSTEMS ENGINEERING |
Aziende collaboratrici: | UNSPECIFIED |
URI: | http://webthesis.biblio.polito.it/id/eprint/26879 |
Modify record (reserved for operators) |