
Pipeline for the automatic population of an automotive database: from retrieval to parsing of textual descriptions

Gennaro Petito

Pipeline for the automatic population of an automotive database: from retrieval to parsing of textual descriptions.

Supervisor: Paolo Garza. Politecnico di Torino, Master's degree program in Data Science And Engineering, 2022

Abstract:

Natural language processing has proven very effective in the automation of many processes, especially since the introduction of the Transformer and of large language models such as BERT, which produce contextual embeddings and can be fine-tuned to perform new tasks on specific datasets. JATO, like other companies, relies on human intervention to retrieve and extract data from text in order to populate certain databases, especially when the extraction requires natural language understanding rather than a simple rule-based system. Since this task is repetitive and time-consuming, in this thesis we implement a pipeline to automate the population of a database of cars' optional equipment.

We make use of two different sources of information provided by car manufacturers: configurators and brochures. Configurators present data through JSON files whose fields are the name of the option, its category, and its description, whereas in brochures the same information is embedded in paragraphs of text. For this reason, with brochures we first retrieve the passages belonging to a given category with BM25 and then extract the name of the option described in each passage with a BERT model. At this point there is no further need to differentiate between brochures and configurators, so we use a RoBERTa model to extract the features of each piece of optional equipment from the corresponding description. Before populating JATO's database we also add an output normalization step to comply with its rules.

The results for the retrieval step are very promising and show the power of BM25, despite its age, with perfect retrieval on the test set; however, the scarcity of labeled ground truth prevents a thorough analysis of these results. In the feature extraction step, which in this thesis we limited to audio systems due to the need for manual labeling, we achieved an outstanding 99.89% token accuracy on the configurator test set, with no missed extractions, which means the few errors can be fixed in the output normalization stage, and 97.39% on the brochure test set. We also tested BERT and DistilBERT, which achieved slightly worse results. The option name extraction task is particularly difficult even for a human reader, so we did not expect great results; nonetheless, we achieved a promising 88.03% token accuracy with the BERT model. Overall the pipeline proves effective and robust, and we trust it will be very useful to JATO once it is extended to all option categories.

In this thesis we also explore zero-shot question answering to test the ability of the T0 (T-zero) model to perform text comprehension on new domains and to answer closed-book questions about the automotive sector; the results showed that the model had already encountered text about this domain during training. Finally, since the model is also trained for machine translation, we tested its text comprehension skills in different languages, which showed how the model had implicitly learned this task from the English one. This was quite surprising and opens up many further possibilities and extensions.
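
To illustrate the brochure retrieval step described in the abstract, the following is a minimal sketch of category-based passage retrieval with BM25, assuming the rank_bm25 Python package; the example passages and the category query are invented for illustration and do not come from the thesis.

```python
from rank_bm25 import BM25Okapi

# Hypothetical brochure passages (placeholders, not taken from the thesis).
passages = [
    "The premium sound system features 12 speakers and a subwoofer.",
    "LED headlights with automatic high-beam assist are available as an option.",
    "A panoramic sunroof lets more light into the cabin.",
]

# BM25 operates on tokenized text; a whitespace split is enough for a sketch.
tokenized_passages = [p.lower().split() for p in passages]
bm25 = BM25Okapi(tokenized_passages)

# Query built from the option category whose passages we want to retrieve.
query = "audio sound system speakers".split()

# Rank the passages and keep the best match for this category.
top_passages = bm25.get_top_n(query, passages, n=1)
print(top_passages[0])
```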
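The feature extraction step is described as token-level extraction with a RoBERTa model. Below is a minimal sketch of how such a fine-tuned model could be applied at inference time with the Hugging Face transformers library; the checkpoint path, the label set, and the input description are hypothetical placeholders, since the thesis models are not publicly released.

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

# Hypothetical fine-tuned checkpoint for audio-system feature extraction
# (placeholder path; the thesis checkpoints are not public).
checkpoint = "path/to/roberta-audio-feature-extractor"

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForTokenClassification.from_pretrained(checkpoint)

# The token-classification pipeline groups consecutive tokens with the same
# label into spans such as brand, number of speakers, or amplifier power.
extractor = pipeline(
    "token-classification",
    model=model,
    tokenizer=tokenizer,
    aggregation_strategy="simple",
)

description = "Premium audio system with 12 speakers and a 600 W amplifier."
for span in extractor(description):
    print(span["entity_group"], "->", span["word"])
```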
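For the zero-shot question-answering exploration, a minimal sketch of prompting T0 with the Hugging Face transformers library is shown below, assuming the publicly released bigscience/T0_3B checkpoint (the thesis does not state which model size was used); the prompt is an invented example of a closed-book automotive question.

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Publicly released T0 checkpoint (assumed here; not specified in the thesis).
model_name = "bigscience/T0_3B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Closed-book question about the automotive domain, phrased as a natural prompt.
prompt = "Question: What does ABS stand for in a car? Answer:"
inputs = tokenizer(prompt, return_tensors="pt")

# Generate a short answer without any additional context or fine-tuning.
outputs = model.generate(**inputs, max_new_tokens=16)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```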

Supervisor: Paolo Garza
Academic year: 2021/22
Publication type: Electronic
Number of pages: 79
Additional information: Confidential thesis. Full text not available
Subjects:
Degree program: Master's degree program in Data Science And Engineering
Degree class: New system > Master's degree > LM-32 - COMPUTER ENGINEERING
Collaborating companies: Jato Dynamics Italia
URI: http://webthesis.biblio.polito.it/id/eprint/23647