Exploiting background knowledge for scene graph generation with Logic Tensor Networks

Silvia Giammarinaro

Exploiting background knowledge for scene graph generation with Logic Tensor Networks.

Supervisors: Fabrizio Lamberti, Lia Morra. Politecnico di Torino, Master's degree programme in Data Science And Engineering, 2021

License: Creative Commons Attribution Non-commercial No Derivatives.


In recent Deep Learning applications, complex models and algorithms are designed to understand the world around us. Every scene we see in real life can be represented as a set of objects and a set of predicates (actions, prepositions, etc.). From these sets, a graph can be defined with objects as nodes and predicates as edges, so that every relationship between two objects is a triplet (subject, predicate, object). This task is called scene graph generation and is divided into two phases: first, locate objects and predict their classes (object detection); then, create the set of possible triplets (relationship detection). In recent years, this topic has gained considerable attention from the research community, as it is part of more challenging machine learning problems. In this thesis, the entire scene graph generation pipeline is explored, focusing first on state-of-the-art object detection and then on scene graph generation models.

While object detection is already a consolidated task, scene graph generation still has open problems. One of them is biased annotations: humans have many linguistic biases, and generic words are used more often than specific ones. This bias can be seen in the Visual Relationship Detection (VRD) dataset, where the most frequent predicate is the generic preposition of place 'on'. Yet some predicates carry more information than others: instead of (man, on, chair), it is more informative to say (man, sitting on, chair). This bias negatively affects the model, which tends to use generic predicates in most triplets. To address this issue, injecting knowledge into the model can be beneficial. The Logic Tensor Network (LTN) model is considered, in which the training consists of building a knowledge base. The knowledge base is built upon objects and predicates. For objects, bounding box coordinates, class probabilities and geometric features are used. For predicates, a set of positive and negative logical axioms is created from the training set distribution. Moreover, LTN relies on fuzzy logic: the truth of a logical expression is measured with a value between zero and one. LTN has shown that the use of a knowledge base can improve performance on the scene graph generation problem.

The purpose of this work is to study the strengths and weaknesses of the LTN pipeline, from object detection to relationship detection. First, the impact of the object detection step on relationship detection is discussed: wrong object classes and locations can significantly affect the final scene graph. Then, different knowledge bases are used to determine the most promising combination of logical axioms. The results show that both phases of the pipeline need to be optimized to obtain good scene graphs. The two main datasets in the literature are analyzed: Visual Relationship Detection (VRD) and Visual Genome (VG). Both have been used by different models in the literature, which makes it possible to study which improvements are most promising for LTN.
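
The ideas in the abstract can be illustrated with a minimal Python sketch. It is not the thesis code. The triplets, the scores `p_sitting_on` and `p_on`, and the axiom "sitting on implies on" are invented for illustration. The abstract does not say which fuzzy operators the thesis uses, so the product conjunction and Reichenbach implication are assumptions. The sketch builds a scene graph from (subject, predicate, object) triplets, measures how often each predicate occurs (the annotation bias), and evaluates one logical axiom as a truth value between zero and one.

```python
from collections import Counter, defaultdict

# Toy (subject, predicate, object) triplets, invented for illustration only.
triplets = [
    ("man", "on", "chair"),
    ("man", "sitting on", "chair"),
    ("cup", "on", "table"),
    ("dog", "on", "sofa"),
    ("man", "wearing", "hat"),
]


def build_scene_graph(triplets):
    """Return an adjacency map: subject node -> list of (predicate, object) edges."""
    graph = defaultdict(list)
    for subj, pred, obj in triplets:
        graph[subj].append((pred, obj))
    return dict(graph)


def predicate_frequencies(triplets):
    """Return the relative frequency of each predicate in the annotations."""
    counts = Counter(pred for _, pred, _ in triplets)
    total = sum(counts.values())
    return {pred: n / total for pred, n in counts.items()}


# Fuzzy connectives over truth values in [0, 1].
# Assumption: product conjunction and Reichenbach implication; the source
# does not specify which operators are used.
def fuzzy_and(a, b):
    return a * b


def fuzzy_not(a):
    return 1.0 - a


def fuzzy_implies(a, b):
    return 1.0 - a + a * b


graph = build_scene_graph(triplets)
freqs = predicate_frequencies(triplets)

# Hypothetical predicate scores for the pair (man, chair).
p_sitting_on = 0.9
p_on = 0.6

# Truth value of the hypothetical axiom: sitting_on(x, y) -> on(x, y).
axiom_truth = fuzzy_implies(p_sitting_on, p_on)

print(graph["man"])
print(freqs)
print(round(axiom_truth, 2))
```

In this toy data the generic predicate 'on' covers three of the five annotations, which mirrors the kind of imbalance the abstract describes for VRD. A knowledge base of such axioms would let a model be penalized when its predicate scores contradict them.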

Supervisors: Fabrizio Lamberti, Lia Morra
Academic year: 2021/22
Publication type: Electronic
Number of Pages: 77
Degree programme: Master's degree programme in Data Science And Engineering
Degree class: New organization > Master science > LM-32 - COMPUTER SYSTEMS ENGINEERING
Collaborating companies: UNSPECIFIED
URI: http://webthesis.biblio.polito.it/id/eprint/21149