Building a large scale database for near-duplicate image detection from insurance claims

Huseyin Cagri Karakus

Building a large scale database for near-duplicate image detection from insurance claims.

Supervisors: Fabrizio Lamberti, Lia Morra. Politecnico di Torino, Master's degree programme in Mechatronic Engineering (Ingegneria Meccatronica), 2019


This work presents a strategy for building a large-scale database for near-duplicate image detection from insurance claims. The basic idea is to compare the images in a database semantically. As the database is expanded with new images, the chance of detecting near duplicates increases as well, so detection performance is closely tied to the size of the database. On the other hand, it is not always easy to expand a database with natural images, especially when the new samples arrive as raw data. The study is based on a large-scale private dataset provided by an insurance company. The dataset comprises a mixture of noise images and real photos. Some of the real photos are not in the correct orientation, and, in addition to the noise in the dataset, rotated photos also have a negative impact on near-duplicate detection. When the database is expanded with new samples, noise images and rotated photos are added as well, so the amount of noise grows in proportion to the number of images. Increased noise in the database drastically decreases the accuracy of near-duplicate detection. To create an effective dataset for near-duplicate image detection, a two-stage strategy is adopted. In the first stage, noise data is removed from the raw database. In the second stage, rotated images are detected and corrected. The first module uses basic image processing functions and a task-specific neural network (NN) based binary classifier to detect and discard the noise in the database. Duplicate images are also discarded at this stage. The second module aims to detect and correct the rotated images in the database. As in the first stage, an NN-based classifier is used for this purpose. Most of the images in the private dataset are already in their correct orientation, so high accuracy in detecting unrotated (zero-degree) images is the most crucial requirement at this stage.
For this reason, a considerable portion of the rotated images was left as is, to avoid rotating any image that was already correctly oriented. Even so, a significant improvement in near-duplicate detection was observed after applying the complete strategy to the provided dataset. The final results of this study demonstrate the effectiveness of NN-based classifiers for classifying real-life photos. Both networks were trained on an AWS EC2 instance. The complete project was developed in Python 3.6. The neural network based classifiers were created and trained with the Keras 2.2.4 framework on a TensorFlow backend. Test results for both classifiers show that they are accurate enough to achieve the main goal of this study.
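
The two-stage cleaning strategy described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the thesis implementation: the full text is not available, so every name here (`drop_exact_duplicates`, `detected_rotation`, `clean`), the use of SHA-256 hashing for duplicate removal, the four-angle output of the orientation classifier, and the 0.95 confidence threshold are assumptions. The two classifiers are passed in as plain callables standing in for the trained Keras models.

```python
import hashlib
from typing import Callable, Iterable, List, Sequence, Tuple

# Assumed label set for the orientation classifier (hypothetical).
ANGLES = (0, 90, 180, 270)


def drop_exact_duplicates(images: Iterable[bytes]) -> List[bytes]:
    """Keep the first copy of each byte-identical image (assumed strategy)."""
    seen = set()
    unique = []
    for data in images:
        digest = hashlib.sha256(data).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(data)
    return unique


def detected_rotation(probs: Sequence[float], threshold: float = 0.95) -> int:
    """Return the rotation to correct, or 0 to leave the image untouched.

    A non-zero angle is reported only when the classifier is very confident,
    so some rotated images are deliberately left as they are rather than
    risking a wrong rotation of a correctly oriented image.
    """
    best = max(range(len(ANGLES)), key=lambda i: probs[i])
    if ANGLES[best] != 0 and probs[best] >= threshold:
        return ANGLES[best]
    return 0


def clean(
    images: Iterable[bytes],
    is_noise: Callable[[bytes], bool],
    rotation_probs: Callable[[bytes], Sequence[float]],
    threshold: float = 0.95,
) -> List[Tuple[bytes, int]]:
    """Stage 1: drop duplicates and noise. Stage 2: flag confident rotations."""
    result = []
    for data in drop_exact_duplicates(images):
        if is_noise(data):
            continue
        result.append((data, detected_rotation(rotation_probs(data), threshold)))
    return result
```

Raising `threshold` trades recall on rotated images for fewer false corrections of upright ones, which matches the priority stated in the abstract; the actual decision rule used in the thesis may differ.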

Supervisors: Fabrizio Lamberti, Lia Morra
Academic year: 2018/19
Publication type: Electronic
Number of Pages: 95
Additional Information: Thesis under embargo. Full text not available
Degree programme: Master's degree programme in Mechatronic Engineering (Ingegneria Meccatronica)
Degree class: New organization > Master science > LM-25 - AUTOMATION ENGINEERING
Collaborating companies: UNSPECIFIED
URI: http://webthesis.biblio.polito.it/id/eprint/10894