
A benchmark protocol for evaluating near duplicate detection performance

Emanuel Poppa

A benchmark protocol for evaluating near duplicate detection performance.

Supervisors: Fabrizio Lamberti, Lia Morra. Politecnico di Torino, Master's degree course in Computer Engineering (Ingegneria Informatica), 2020


Image manipulation software is now easy to use and effective, and a very large number of photos is published on the Internet; as a result, falsification and alteration of existing images are increasingly widespread. The task of searching a large collection for images that have undergone different types of alteration is called near duplicate detection, and it is applied in a growing number of areas such as plagiarism detection, fraud prevention and forensic image identification. However, research on more precise and efficient near duplicate search algorithms is often hampered by the scarcity of appropriately annotated benchmark collections. For near duplicate detection in particular, it is important that a dataset addresses real-world challenges, that its pairs of near duplicates have been identified, and that it contains a very large number of images for which the absence of near duplicates has been established. In this context, this thesis aims to create a protocol for evaluating the performance of image descriptors for near duplicate detection, applied to the MIR-Flickr Near Duplicate (MFND) collection, a one-million-image dataset in which the near duplicate ground truth is already annotated and which meets all the requirements mentioned above. The proposed methodology estimates the performance of different descriptors using the Receiver Operating Characteristic (ROC) curve, which the literature shows to be sound both analytically and experimentally for this class of problems. It compares the distances of the annotated near duplicate pairs with those of negative pairs, which must be as difficult as possible.
For this reason, an analysis was performed to find further near duplicates to add to the ground truth, and experiments were conducted to determine statistically the minimum size a sub-sample of the collection must have in order to generate negative pairs difficult enough to yield performance comparable to that obtained on the whole collection. This last analysis was carried out with the aim of dividing MFND into training and validation sets, providing a complete benchmark for anyone wishing to evaluate the performance of descriptors, including those based on deep learning, for near duplicate image search. Finally, to demonstrate use in a real context, the methodology was also applied to the private dataset of an insurance company, with the purpose of finding near duplicate images used to commit fraud on reported claims.
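
As an illustration only, the ROC-based comparison described in the abstract can be sketched as follows. This is not the code of the thesis: the Euclidean distance, the toy descriptors, and the function names (`hardest_negative_distances`, `roc_curve`, `roc_auc`) are assumptions made for the example, and "hardest negative" is taken here to mean each image's nearest neighbour that is not an annotated near duplicate.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two descriptor vectors (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def hardest_negative_distances(descriptors, positive_pairs):
    """For each image, distance to its nearest non-duplicate neighbour."""
    positives = {frozenset(pair) for pair in positive_pairs}
    distances = []
    for i, d_i in enumerate(descriptors):
        candidates = [
            euclidean(d_i, d_j)
            for j, d_j in enumerate(descriptors)
            if j != i and frozenset((i, j)) not in positives
        ]
        if candidates:
            distances.append(min(candidates))
    return distances


def roc_curve(pos_dist, neg_dist):
    """(FPR, TPR) points obtained by sweeping a distance threshold."""
    points = [(0.0, 0.0)]
    for t in sorted(set(pos_dist) | set(neg_dist)):
        tpr = sum(d <= t for d in pos_dist) / len(pos_dist)
        fpr = sum(d <= t for d in neg_dist) / len(neg_dist)
        points.append((fpr, tpr))
    return points


def roc_auc(pos_dist, neg_dist):
    """Probability that a positive pair is closer than a negative pair.

    Ties count as one half; this equals the area under the ROC curve.
    """
    wins = 0.0
    for p in pos_dist:
        for n in neg_dist:
            if p < n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_dist) * len(neg_dist))


# Toy data (hypothetical): six 2-D descriptors, two annotated near duplicate pairs.
descriptors = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.0, 5.2), (10.0, 0.0), (0.0, 10.0)]
positive_pairs = [(0, 1), (2, 3)]

pos = [euclidean(descriptors[i], descriptors[j]) for i, j in positive_pairs]
neg = hardest_negative_distances(descriptors, positive_pairs)
auc = roc_auc(pos, neg)
print(f"AUC = {auc:.3f}")
```

In this toy case every annotated pair is closer than every hardest negative, so the descriptor separates the two classes perfectly. On a real collection the negatives would be mined from a large sub-sample, and an exhaustive nearest-neighbour scan like the one above would need to be replaced by an index.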

Supervisors: Fabrizio Lamberti, Lia Morra
Academic year: 2020/21
Publication type: Electronic
Number of Pages: 99
Additional Information: Thesis under embargo. Full text not available
Degree course: Master's degree course in Computer Engineering (Ingegneria Informatica)
Degree class: New organization > Master of science > LM-32 - COMPUTER SYSTEMS ENGINEERING
Collaborating companies: UNSPECIFIED
URI: http://webthesis.biblio.polito.it/id/eprint/15916