NoSQL Data Lake: Search Engine, Analytics and Machine Learning

Paolo Iannino

NoSQL Data Lake: Search Engine, Analytics and Machine Learning.

Supervisor: Paolo Garza. Politecnico di Torino, Master's degree programme in Computer Engineering (Corso di laurea magistrale in Ingegneria Informatica), 2018

Abstract:

The project aims to build a data lake that provides a comprehensive set of analytical tools for an R&D team at Amadeus, the leading IT provider for the travel industry. The system targets different data sources related to the reissue of a flight ticket, which are processed to achieve three main objectives: a search engine, a statistical framework and a business intelligence tool. These goals map onto three tasks: the development of an efficient preprocessing phase, the proper organization of the storage and the graphical user interface, and the elaboration of a machine learning solution. One of the main contributions is the use of current state-of-the-art technologies for scalable data processing and data storage. Indeed, the entire preprocessing phase is performed with the Spark framework, while the chosen database, which is the functional core of the system, employs a NoSQL approach. In addition, the project provides a data preparation phase that is also scalable in terms of data sources. Moreover, the structure of the input data can be fine-tuned to respond to variable and specific user needs. Another contribution is the development of a modern and responsive single-page application through which users access the tools and the results obtained. Furthermore, data visualization techniques are employed to make the user experience smoother and the use of the system more effective. The machine learning effort covers all stages of a data mining activity: from the implementation of an efficient data preparation phase, through model tuning, feature transformation and feature selection, to the development of a framework for the proper exploitation of the predictions and of the model. A further major contribution is an original conversion between categorical and numerical attributes.
Other challenges affect all parts of the project: the complexity of the source of information, both in its structure and in its functional meaning; the definition of requirements reflecting the team's needs; the flexibility required to face frequent changes in the specifications; the development of a usable tool and not just an abstract proof of concept; the automation of the different processing stages involved; and the handling of hardware constraints that force an efficient implementation. Even though the scope of the project is broad and ambitious, the time and hardware constraints are strict, and the context is complex and evolving, the work achieves all expected results, including the additional targets that emerged along the way. Furthermore, the tool receives good feedback from both the more technical and the more functional users. Moreover, the product is so appreciated, and the future possibilities it opens so attractive, that the company decides to finance it. Although the team supported the development with constant help and supervision, the author is the only developer in charge of every aspect of the project. From a personal point of view, the responsibilities undertaken and the resolution of the different challenges encountered bring great benefits in terms of professional growth: all the competences developed during the academic career are intensively exploited and enriched by a greater awareness of the data science field.

Supervisors: Paolo Garza
Academic year: 2018/19
Publication type: Electronic
Number of pages: 95
Additional information: Confidential thesis. Full text not available
Subjects:
Degree programme: Master's degree in Computer Engineering (Corso di laurea magistrale in Ingegneria Informatica)
Degree class: New regulations > Master's degree > LM-32 - Computer Engineering
Joint supervision institution: INP - Grenoble Institute of Technology - ENSIMAG (France)
Collaborating companies: SAS AMADEUS
URI: http://webthesis.biblio.polito.it/id/eprint/8351