
A Big Data Solution for Silhouette Computation

Sara Prone

Supervisors: Paolo Garza, Eliana Pastor. Politecnico di Torino, Master's degree in Computer Engineering (Corso di laurea magistrale in Ingegneria Informatica), 2019

License: Creative Commons Attribution Non-commercial No Derivatives.

Abstract:

Partitioning data into groups according to their characteristics is a central task in data analysis. This process is called clustering, and its result is a set of groups of the original data in which objects in the same group are more similar to each other than to objects in other groups. Clustering only assigns objects to clusters, so at the end of the process the number of objects is unchanged; the only additional information is their group membership. Since real-world data sets are likely to contain a huge amount of data, this work presents a way to reduce that amount while preserving the most important features of the data. The idea is to summarize the already clustered data by dividing them into cells of a given size and computing a representative object for each cell. The representative object stands for all the original data contained in its cell and has a weight equal to the number of objects it represents. In this way, clusters of weighted objects are generated, yielding a so-called weighted clustering: a representation of the original clustering with reduced cardinality. Reducing the cardinality matters because operations on smaller amounts of data are faster and easier. The silhouette index is used to evaluate how well the clusters of weighted objects represent the original clustered data. This thesis introduces a modification of the index that takes object weights into account. This version, called weighted silhouette, is important because the time complexity of the silhouette index is quadratic in the number of objects considered, so the index cannot be computed for large data sets.
With the weighted version proposed in this work, the silhouette index can be computed for large amounts of clustered data, after the weighted clustering process has generated a reduced-cardinality representation of those data.
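
The two steps described in the abstract (cell-based summarization, then a silhouette that accounts for weights) can be illustrated with a minimal sketch. This is not the thesis implementation. The function names `summarize` and `weighted_silhouette` are hypothetical, the representative is assumed to be the centroid of the points in a cell, cells are assumed to be a regular grid applied per cluster, and the weighting scheme (weighted mean distances, with the final score averaged by weight) is one plausible choice that reduces to the standard silhouette when all weights are 1. The thesis may define each of these differently.

```python
from collections import defaultdict
from math import dist, floor


def summarize(points, labels, cell_size):
    """Replace clustered points by one weighted representative per (cluster, grid cell).

    Assumption: the representative is the centroid of the cell's points,
    and its weight is the number of points it stands for.
    """
    cells = defaultdict(list)
    for p, lab in zip(points, labels):
        key = (lab, tuple(floor(x / cell_size) for x in p))
        cells[key].append(p)

    reps, rep_labels, weights = [], [], []
    for (lab, _cell), members in cells.items():
        n = len(members)
        reps.append(tuple(sum(c) / n for c in zip(*members)))
        rep_labels.append(lab)
        weights.append(n)
    return reps, rep_labels, weights


def weighted_silhouette(reps, labels, weights):
    """Weight-aware silhouette (assumed definition, see lead-in).

    a(i): weighted mean distance from i to the other representatives
          of its own cluster.
    b(i): smallest weighted mean distance from i to another cluster.
    Score: weighted average of (b - a) / max(a, b) over representatives.
    Cost is quadratic in the number of representatives, not of raw points.
    """
    clusters = set(labels)
    if len(clusters) < 2:
        raise ValueError("the silhouette needs at least two clusters")

    score = 0.0
    for i, (p, lab, w) in enumerate(zip(reps, labels, weights)):
        dist_sum = defaultdict(float)
        weight_sum = defaultdict(float)
        for j, (q, lab_j, w_j) in enumerate(zip(reps, labels, weights)):
            if j != i:
                dist_sum[lab_j] += w_j * dist(p, q)
                weight_sum[lab_j] += w_j
        if weight_sum[lab] == 0:
            continue  # only representative of its cluster: contributes 0
        a = dist_sum[lab] / weight_sum[lab]
        b = min(dist_sum[c] / weight_sum[c] for c in clusters if c != lab)
        if max(a, b) > 0:
            score += w * (b - a) / max(a, b)
    return score / sum(weights)


if __name__ == "__main__":
    points = [(0.0, 0.0), (1.0, 0.0), (6.0, 0.0),
              (20.0, 0.0), (21.0, 0.0), (26.0, 0.0)]
    labels = [0, 0, 0, 1, 1, 1]
    reps, rep_labels, weights = summarize(points, labels, cell_size=5.0)
    print(reps, rep_labels, weights)
    print(weighted_silhouette(reps, rep_labels, weights))
```

One limitation of this particular sketch: distances between points that fall in the same cell are ignored, so a cluster summarized by a single representative contributes 0 to the score. A smaller cell size keeps more representatives per cluster at the price of a higher cost.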

Supervisors: Paolo Garza, Eliana Pastor
Academic year: 2018/19
Publication type: Electronic
Number of Pages: 121
Subjects:
Degree programme: Master's degree in Computer Engineering (Corso di laurea magistrale in Ingegneria Informatica)
Degree class: New organization > Master science > LM-32 - COMPUTER SYSTEMS ENGINEERING
Collaborating companies: UNSPECIFIED
URI: http://webthesis.biblio.polito.it/id/eprint/11065