A Data Compression Approach for Big Data Classification

Eskadmas Ayenew Tefera

A Data Compression Approach for Big Data Classification.

Supervisor: Paolo Garza. Politecnico di Torino, Master's degree programme in Computer Engineering (Ingegneria Informatica), 2020

PDF (Tesi_di_laurea) - Thesis
License: Creative Commons Attribution Non-commercial No Derivatives.

Abstract - A data compression approach for big data classification is used first to compress a large dataset into a small one, and then to build a classification model on the compressed dataset and evaluate the model's accuracy. Data compression algorithms reduce the size of a dataset. Compression mechanisms help improve operational efficiency, reduce costs, and lower risks for business operations. Processing large datasets is becoming costly, so they need to be reduced with compression techniques. A lossless compression algorithm operates on every bit of a dataset and reduces its size without losing any data, so the restored dataset is identical to the original. In this thesis, a classification model for each initial dataset is built with the J48 classification algorithm. The model built on the original dataset is evaluated to obtain its evaluation statistics, its accuracy, and its confusion matrix. From these results, the correctly and incorrectly classified instances of a given dataset can be extracted. The resulting decision tree has root, internal, and leaf nodes, which define the paths of the tree. A path runs from the root through internal nodes to a leaf, so the number of paths equals the number of leaves. Each path contains one or more instances. The paths of correctly and incorrectly classified instances can be extracted from the classifier model, and instances can in turn be extracted from the paths. To obtain the paths of the tree and an instance from a path, the source string of the tree must first be generated. Instances extracted from the tree, or from each path, form a small compressed dataset.
One compressed dataset can then be concatenated with another to create a further compressed dataset. As with the original dataset, classification models are built for the compressed datasets. These models are evaluated, and their evaluation results are compared with those of the original dataset model, which shows how the accuracy of the classification models differs between the original and compressed datasets. The initial datasets were selected from the UCI Machine Learning Repository.
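
The path-based compression described in the abstract can be illustrated with a minimal sketch. It rests on several assumptions that are not taken from the thesis: a small ID3-style tree on categorical attributes stands in for J48, the toy dataset and its labelling rule (`toy_label`) are invented for illustration, and the helper names (`build_tree`, `classify`, `compress`, `accuracy`) and the `per_path` parameter are hypothetical. The sketch keeps only correctly classified instances, one per path.

```python
from collections import Counter, defaultdict
from itertools import product
from math import log2

def entropy(rows):
    """Shannon entropy of the class labels in rows."""
    counts = Counter(label for _, label in rows)
    return -sum(c / len(rows) * log2(c / len(rows)) for c in counts.values())

def build_tree(rows, attrs):
    """ID3-style tree on categorical attributes (a stand-in for J48, not J48 itself)."""
    labels = Counter(label for _, label in rows)
    majority = labels.most_common(1)[0][0]
    if len(labels) == 1 or not attrs:
        return {"leaf": majority}

    def split(attr):
        groups = defaultdict(list)
        for row in rows:
            groups[row[0][attr]].append(row)
        return groups

    def split_entropy(attr):
        return sum(len(g) / len(rows) * entropy(g) for g in split(attr).values())

    best = min(attrs, key=split_entropy)  # lowest weighted entropy after the split
    rest = [a for a in attrs if a != best]
    children = {value: build_tree(group, rest) for value, group in split(best).items()}
    return {"attr": best, "default": majority, "children": children}

def classify(tree, features):
    """Return (predicted label, path), where path is the tuple of tests followed."""
    path = []
    while "leaf" not in tree:
        value = features[tree["attr"]]
        if value not in tree["children"]:  # attribute value unseen during training
            return tree["default"], tuple(path)
        path.append((tree["attr"], value))
        tree = tree["children"][value]
    return tree["leaf"], tuple(path)

def compress(tree, rows, per_path=1):
    """Keep up to per_path correctly classified instances for every path."""
    kept = defaultdict(list)
    for features, label in rows:
        predicted, path = classify(tree, features)
        if predicted == label and len(kept[path]) < per_path:
            kept[path].append((features, label))
    return [row for group in kept.values() for row in group]

def accuracy(tree, rows):
    """Fraction of rows whose label the tree predicts correctly."""
    return sum(classify(tree, f)[0] == y for f, y in rows) / len(rows)

def toy_label(outlook, humidity, windy):
    """Invented labelling rule used only to generate the toy data."""
    if outlook == "overcast":
        return "yes"
    if outlook == "sunny":
        return "yes" if humidity == "normal" else "no"
    return "no" if windy else "yes"

ATTRS = ["outlook", "humidity", "windy"]
original = [
    ({"outlook": o, "humidity": h, "windy": w}, toy_label(o, h, w))
    for _ in range(5)  # repeat every combination to mimic a larger dataset
    for o, h, w in product(["sunny", "overcast", "rainy"], ["high", "normal"], [True, False])
]

full_tree = build_tree(original, ATTRS)
compressed = compress(full_tree, original, per_path=1)
small_tree = build_tree(compressed, ATTRS)

print("original size:", len(original), "compressed size:", len(compressed))
print("accuracy of original model:", accuracy(full_tree, original))
print("accuracy of compressed model on original data:", accuracy(small_tree, original))
```

Because the compressed datasets are plain lists here, concatenating two of them is ordinary list concatenation. How many instances to keep per path, and whether to include incorrectly classified instances, are design choices this sketch does not settle.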

Supervisors: Paolo Garza
Academic year: 2020/21
Publication type: Electronic
Number of Pages: 87
Degree programme: Corso di laurea magistrale in Ingegneria Informatica (Computer Engineering)
Degree class: New organization > Master science > LM-32 - COMPUTER SYSTEMS ENGINEERING
Collaborating companies: UNSPECIFIED
URI: http://webthesis.biblio.polito.it/id/eprint/16660