
Towards Context based Monocular Depth Estimation

Alessio Cappellato

Towards Context based Monocular Depth Estimation.

Supervisors: Barbara Caputo, Nicola Gatti, Sabine Süsstrunk. Politecnico di Torino, Master's degree course in Data Science And Engineering, 2021

PDF (Tesi_di_laurea) - Thesis
License: Creative Commons Attribution Non-commercial No Derivatives.


Monocular depth estimation is a classical computer vision task that consists of densely predicting, from a single RGB image, the distance between the camera and the object depicted by each pixel. This information is extremely useful in a variety of practical settings, such as 3D reconstruction, visual simultaneous localization and mapping (SLAM), and autonomous driving systems, because it permits reasoning about the geometrical structure of the environment and the relationships between the objects in it. In recent years, a number of fully convolutional encoder-decoder networks have been applied to this problem; their popularity is rooted in their locality and translation invariance properties, which allow parameter-efficient modelling of highly spatially correlated information. In this context, in the first part of this work, we improve the design of a convolutional decoder that incorporates the Laplacian pyramid decomposition of the input image to guide the progressive prediction of depth residuals; this additional feature provided to the decoder retains important information on the location of object boundaries, but also uninformative noise due to intra-object variations. To overcome this issue, we propose using contours extracted from instance segmentation masks to filter out the noise and keep only the semantically relevant Laplacian residuals. The resulting method achieves a performance improvement on most metrics and a reduction of visual artifacts. More recently, the success of transformers and the ability of their attention mechanism to model long-range dependencies (in contrast to the limited receptive field of convolutions) have sparked many studies proving their competitiveness with convolution-based methods. In the second part of this thesis, we thus adopt a vision transformer-based paradigm, first in the design of the encoder and subsequently in that of the decoder.
In particular, we propose a multitask setting with depth estimation and semantic segmentation to conduct a thorough study of the role of attention and its impact on cross-task interaction. We initially focus on the development of custom attention inside a columnar transformer encoder and employ double-head convolutional decoders for independent dense prediction, revealing that attention sharing benefits both tasks compared with their individual monotask performance. Moreover, we show that the extraction of task-invariant features in a single stream further improves the results on all metrics. Finally, we adopt a pyramidal transformer encoder with shifted windows, to better leverage the power of skip connections, and extend the use of transformers to the decoding stage by proposing various monotask and multitask decoders, thereby obtaining convolution-free networks. While in the monotask setting the performance of the proposed transformer decoders is comparable with that of the convolutional ones, the improvement brought by the interaction with the segmentation task is slightly lower. Overall, we systematically outperform the state of the art and our previous results, as shown by extensive experimentation on the official NYU Depth V2 dataset, and demonstrate that transformers can match or surpass convolutional methods even when trained with few samples.
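
The Laplacian pyramid decomposition and the contour-based filtering of its residuals mentioned in the abstract can be illustrated with a minimal NumPy sketch. This is not the thesis implementation: it assumes a single-channel image with even dimensions, replaces the usual Gaussian blur with 2x2 average pooling and nearest-neighbour upsampling, and derives contours from a toy instance mask by comparing neighbouring labels. All function names (`downsample`, `upsample`, `laplacian_pyramid`, `reconstruct`, `mask_contours`) are hypothetical.

```python
import numpy as np

def downsample(img):
    """Halve resolution by 2x2 average pooling (assumes even height and width)."""
    h, w = img.shape
    return img.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def upsample(img):
    """Double resolution by nearest-neighbour repetition."""
    return img.repeat(2, axis=0).repeat(2, axis=1)

def laplacian_pyramid(img, levels):
    """Return `levels` residuals (fine to coarse) and the coarsest image."""
    residuals = []
    current = img
    for _ in range(levels):
        coarse = downsample(current)
        # Residual = detail lost when going to the coarser level.
        residuals.append(current - upsample(coarse))
        current = coarse
    return residuals, current

def reconstruct(residuals, coarsest):
    """Invert the decomposition by adding residuals back, coarse to fine."""
    current = coarsest
    for res in reversed(residuals):
        current = upsample(current) + res
    return current

def mask_contours(instance_mask):
    """Mark pixels whose right or bottom neighbour has a different instance id."""
    contours = np.zeros(instance_mask.shape, dtype=bool)
    contours[:, :-1] |= instance_mask[:, :-1] != instance_mask[:, 1:]
    contours[:-1, :] |= instance_mask[:-1, :] != instance_mask[1:, :]
    return contours

# Toy data (assumed): random 8x8 image, one square "instance" in the middle.
rng = np.random.default_rng(0)
image = rng.random((8, 8))
instance_mask = np.zeros((8, 8), dtype=int)
instance_mask[2:6, 2:6] = 1

residuals, coarsest = laplacian_pyramid(image, levels=2)
contours = mask_contours(instance_mask)
# Keep the finest residual only on instance boundaries; zero it elsewhere.
filtered = residuals[0] * contours
```

In this sketch only the finest residual is filtered; applying the same idea to coarser levels would require a correspondingly downsampled contour map.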

Supervisors: Barbara Caputo, Nicola Gatti, Sabine Süsstrunk
Academic year: 2021/22
Publication type: Electronic
Number of Pages: 116
Degree course: Master's degree course in Data Science And Engineering
Degree class: New organization > Master science > LM-32 - COMPUTER SYSTEMS ENGINEERING
URI: http://webthesis.biblio.polito.it/id/eprint/20541