BIG DATA AND CLUSTERING QUALITY INDEX COMPUTATION

Aynadis Temesgen Gebru

Supervisors: Paolo Garza, Tania Cerquitelli. Politecnico di Torino, Master's degree course in Computer Engineering (Ingegneria Informatica), 2019

License: Creative Commons Attribution Non-commercial No Derivatives.

Abstract:

Clustering analysis is an unsupervised machine learning technique that partitions a dataset into multiple groups, or clusters, so that instances within a cluster are highly similar to one another and dissimilar to instances in other clusters. A number of methods exist for performing clustering analysis. The quality of the results produced by a clustering method is measured by cluster evaluation. Some clustering methods require the number of clusters as an input, and cluster evaluation can be used to determine this number. Cluster evaluation can also be used to compare the results of different clustering methods. Traditional cluster evaluation algorithms are not applicable to big data because of size limitations and run-time cost. This work presents a technique that supports the evaluation of large datasets. It proposes a sampling approach, based on big data analysis, that reduces the size of the dataset so that traditional clustering validity indices can process it. The silhouette validity index is selected and adopted to test the sampling result. The sampling technique positions the instances in a space that is split into grid cells of equal size. It iterates over each cell and checks whether all the instances within the cell belong to the same cluster. The reduction is carried out only on cells whose instances all come from the same cluster, assigning an associated weight to the instances of that cell. The implementation is evaluated on both manually clustered and automatically clustered datasets. The manually clustered case uses three datasets containing 18000, 6500 and 3000 instances with 5, 8 and 31 clusters, respectively. The second test case (the automatically clustered dataset) uses a single dataset with 8000 instances.
To assess the performance of this work, each dataset is compared with the average over random samples taken at three different percentages. The silhouette index values computed on the clustering results of the original dataset and of the smart-sampled dataset are very close, with the original dataset scoring slightly higher. In general, the silhouette indices of the smart-sampled and original data are very close on all the datasets, which indicates that smart sampling can be considered a solution for cluster evaluation of very large datasets. The averaged random sample shows an index equal to or slightly higher than both the smart-sampled and the original dataset. However, it is important to consider that computing the average silhouette index on random samples requires a number of executions for each dataset, because the results vary from run to run. Smart sampling, on the other hand, is executed only once, which is an advantage.
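
The grid-based reduction described in the abstract can be illustrated with a minimal sketch. This is not the thesis implementation: the function name `grid_reduce`, the `cell_size` parameter, the use of the cell centroid as the representative point, and the choice of the instance count as the weight are all assumptions made for illustration only, since the abstract does not specify them.

```python
import math
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Point = Tuple[float, ...]


def grid_reduce(
    points: Sequence[Point], labels: Sequence[int], cell_size: float
) -> List[Tuple[Point, int, int]]:
    """Return (point, cluster_label, weight) triples after a grid-based reduction.

    Hypothetical sketch: cells containing a single cluster are collapsed into
    one weighted representative; mixed cells keep every instance with weight 1.
    """
    # Group instance indices by the equal-sized grid cell they fall into.
    cells: Dict[Tuple[int, ...], List[int]] = defaultdict(list)
    for idx, p in enumerate(points):
        key = tuple(math.floor(x / cell_size) for x in p)
        cells[key].append(idx)

    reduced: List[Tuple[Point, int, int]] = []
    for members in cells.values():
        cell_labels = {labels[i] for i in members}
        if len(cell_labels) == 1:
            # Pure cell: replace its instances by their centroid (assumption),
            # weighted by the number of instances it stands for (assumption).
            dim = len(points[members[0]])
            centroid = tuple(
                sum(points[i][d] for i in members) / len(members)
                for d in range(dim)
            )
            reduced.append((centroid, labels[members[0]], len(members)))
        else:
            # Mixed cell: no reduction, every instance is kept as it is.
            for i in members:
                reduced.append((tuple(points[i]), labels[i], 1))
    return reduced
```

With this sketch, the weights always sum to the number of original instances, so a validity index computed on the reduced data could in principle take the weights into account. How the thesis uses the weights inside the silhouette computation is not stated in the abstract.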

Supervisors: Paolo Garza, Tania Cerquitelli
Academic year: 2019/20
Publication type: Electronic
Number of pages: 57
Subjects:
Degree course: Master's degree course in Computer Engineering (Ingegneria Informatica)
Degree class: New regulations > Master's degree > LM-32 - INGEGNERIA INFORMATICA (Computer Engineering)
Collaborating companies: NOT SPECIFIED
URI: http://webthesis.biblio.polito.it/id/eprint/13164