Gpu accelerated ETL processes: a faster way to deal with big data

Edoardo Lardizzone

Gpu accelerated ETL processes: a faster way to deal with big data.

Supervisor: Daniele Apiletti. Politecnico di Torino, Master's degree programme in Data Science and Engineering, 2022


The world is in constant evolution, and in the last decade data have become the most valuable asset in economics and research. Their use is essential for a company that wants to keep up with the times, but because new methods to extract and analyze data appear every day, data volumes keep growing and the older frameworks are becoming obsolete in terms of execution time. The goal of this work is to survey the state of the art in the use of the GPU in a data pipeline. The focus is on the ETL part of the pipeline, because the use of GPUs for the Machine Learning part is already well advanced, while the preprocessing phase is still done mainly on CPUs. Since using the GPU to accelerate the Deep Learning phase has been a topic of discussion for many years and very good techniques have already been developed, I do not cover them here: doing so would be outside the scope of this work. The first part describes the data pipeline currently adopted by most data scientists, with the most widely used tools for each phase. I then explain how GPUs work and their history, and after that I compare the modern tools with the ones I try to exploit. The last part before the conclusions is dedicated to the infrastructures I used and how I configured them to let the jobs run on the GPU. Finally, I present some results in order to understand whether it is convenient for a company or other entity to switch to this new paradigm.
The results take into consideration both the execution time and the cost of the infrastructure, since GPU instances in cloud computing cost more than instances without a GPU. I do not give a universal solution, because some entities may prefer a cheaper solution while others would rather have a faster job even at a higher cost; the goal is to offer multiple solutions from which one can choose. Because the topic is new and few papers have been written on it, this is an experimental work that could probably be improved in the next few years by the changes that will be made, especially to the libraries I used.

Supervisors: Daniele Apiletti
Academic year: 2021/22
Publication type: Electronic
Number of Pages: 101
Additional Information: Confidential thesis. Full text not available
Degree programme: Master's degree programme in Data Science and Engineering
Degree class: New organization > Master science > LM-32 - COMPUTER SYSTEMS ENGINEERING
Collaborating companies: DATA Reply S.r.l. con Unico Socio
URI: http://webthesis.biblio.polito.it/id/eprint/22603