Politecnico di Torino

Data Quality for streaming applications

Andrei Robert Zannelli

Data Quality for streaming applications.

Supervisor: Paolo Garza. Politecnico di Torino, Master's degree programme in Computer Engineering (Ingegneria Informatica), 2021

PDF (Tesi_di_laurea) - Thesis
License: Creative Commons Attribution Non-commercial No Derivatives.


Big data has attracted a great deal of attention in recent years, along with all the problems it entails. The ability to analyze large amounts of data in innovative ways, together with the enormous volume of data produced by countless sources such as social media, sensors, industrial machines, or simply server logs, has driven the growth of the big data field. As the speed at which these kinds of data are produced increased, the term streaming data came into use to indicate data produced in near real time. As production accelerated, so did the need to analyze the data quickly, and the leading players in the sector proposed many answers, among them Apache Spark and its two components dedicated to streaming, DStreams and Structured Streaming. One of the fundamental problems with all this data is its quality, since large companies often rely on the data in their possession to decide which business choices to make. These two themes, streaming and data quality, are the subject of this thesis: data quality in streaming environments. The thesis builds on the Data Quality framework, an open source software project produced by Agile Lab S.r.l., which aims to bring the academic concepts of data quality into practical, everyday use. The objective of the thesis was to extend this framework so that it can analyze streaming sources, without changing the nature of the project and while solving the problems that streaming brings with it. The analysis and design of the solution began by identifying the main problems, the necessary abstractions, and the work carried out by the academic community of researchers and developers on streaming data, including the technologies and methods used in that research.
Next, the current structure of the framework is examined in detail, analyzing its modules and entities to show how it works and what it can do. The changes made to support the formats and file types most commonly used in streaming are then reported, together with the conceptual and implementation problems encountered. The research ends with an analysis and comparison of the standard model and the one created in this thesis, using a common database as a benchmark for both and showing how the various components work together to obtain the same result in both cases. Finally, the thesis outlines what could be implemented to continue optimizing the project and presents the most interesting possible improvements.

Supervisors: Paolo Garza
Academic year: 2020/21
Publication type: Electronic
Number of Pages: 75
Degree programme: Master's degree programme in Computer Engineering (Ingegneria Informatica)
Degree class: New organization > Master science > LM-32 - COMPUTER SYSTEMS ENGINEERING
Collaborating companies: Agile Lab S.r.l.
URI: http://webthesis.biblio.polito.it/id/eprint/18194