
Automated forms cleaning by considering forms with coloured content and coloured backgrounds

Mahsa Farjoo

Automated forms cleaning by considering forms with coloured content and coloured backgrounds.

Supervisor: Alessandro Savino. Politecnico di Torino, Master's degree programme in Computer Engineering (Ingegneria Informatica), 2023

PDF (Tesi_di_laurea) - Thesis
License: Creative Commons Attribution Non-commercial No Derivatives.


The main purpose of this project is to extract or erase the information filled into forms. Forms contain both static and dynamic content: when a form is completed by customers or clients, the static content stays the same, while the dynamic content differs for each customer. Since real datasets of this kind are scarce, I used simulation to generate a synthetic dataset. The empty forms are in PDF format, so I first convert them from PDF to PNG, then generate the dynamic content and insert it at different locations on the form. To generate the dynamic content, I first determine which dynamic data each form requires, and then use a Python library for fake data to generate random values. For instance, to fill an empty form, the library needs to generate German full names, addresses, bank account information, dates of birth, job titles, email addresses and some random text. Because the dataset is generated automatically, information extraction must also be automatic; relying on humans to extract the relevant information would be too slow. Such simulated datasets can be used to train a deep learning model that extracts the static form from a user-filled form. The task is then to learn to distinguish between static and dynamic content and to erase only the dynamic content, taking into account that these forms have coloured backgrounds, so the background must be recreated after the dynamic content is removed. In this way, static templates can be extracted automatically from synthetic completed documents. Dynamic content has two features: it has a coloured background and its location varies within the form. The deep learning model must therefore be able to distinguish static content from dynamic content.
We present a model that learns to distinguish between dynamic and static text in an image, erases the dynamic content present in a filled form, and recreates the background colour. Dynamic content erasure achieved an average SSIM score of 0.00003809 between the original empty form image and the generated form image.
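The data-generation step described above can be illustrated with a minimal sketch. The abstract does not name the fake-data library precisely, so this example substitutes Python's standard `random` module and small hard-coded sample lists. All names here (`fake_record`, `place_fields`, `PlacedField`, the box sizes and the page size) are hypothetical and are not taken from the thesis.

```python
import random
from dataclasses import dataclass

# Hypothetical sample values standing in for a fake-data library.
FIRST_NAMES = ["Anna", "Lukas", "Marie", "Jonas"]
LAST_NAMES = ["Schmidt", "Fischer", "Weber", "Wagner"]
STREETS = ["Hauptstr.", "Bahnhofstr.", "Gartenweg"]
JOBS = ["Ingenieurin", "Lehrer", "Kauffrau"]


@dataclass
class PlacedField:
    name: str  # which form field this is
    text: str  # generated dynamic content
    x: int     # left edge of the text box, in pixels
    y: int     # top edge of the text box, in pixels


def fake_record(rng):
    """Generate one set of random dynamic values for a form."""
    first, last = rng.choice(FIRST_NAMES), rng.choice(LAST_NAMES)
    day, month, year = rng.randint(1, 28), rng.randint(1, 12), rng.randint(1950, 2004)
    return {
        "full_name": f"{first} {last}",
        "address": f"{STREETS[rng.randrange(len(STREETS))]} {rng.randint(1, 120)}",
        # IBAN-shaped string only: the check digits are random, not valid.
        "iban": "DE" + "".join(str(rng.randint(0, 9)) for _ in range(20)),
        "date_of_birth": f"{day:02d}.{month:02d}.{year}",
        "job_title": rng.choice(JOBS),
        "email": f"{first}.{last}@example.com".lower(),
    }


def place_fields(record, page_w, page_h, box_w=400, box_h=40, rng=None):
    """Assign each field a random position so its assumed box stays on the page."""
    rng = rng or random.Random()
    return [
        PlacedField(name, text,
                    rng.randint(0, page_w - box_w),
                    rng.randint(0, page_h - box_h))
        for name, text in record.items()
    ]
```

For example, `place_fields(fake_record(random.Random(0)), 2480, 3508)` would position one record on a page of 2480 x 3508 pixels (an assumed A4 rendering at 300 dpi). Rendering the text onto the PNG is left out of the sketch.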
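For readers unfamiliar with SSIM (structural similarity index), the following is a minimal sketch of the standard formula computed over a single global window on flattened grayscale pixel values. The abstract does not say how the thesis computed its score, and common implementations average the index over local windows instead, so this is an illustration of the metric only. The function name `global_ssim` and the constants 0.01 and 0.03 (the usual defaults) are assumptions.

```python
def global_ssim(x, y, data_range=255.0):
    """SSIM of two equal-length pixel sequences, using one global window."""
    if len(x) != len(y) or not x:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(x)
    # Stabilising constants from the usual SSIM definition.
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    mx = sum(x) / n
    my = sum(y) / n
    # Population variances and covariance.
    vx = sum((a - mx) * (a - mx) for a in x) / n
    vy = sum((b - my) * (b - my) for b in y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    return ((2 * mx * my + c1) * (2 * cov + c2)) / (
        (mx * mx + my * my + c1) * (vx + vy + c2)
    )
```

Under this definition two identical images score 1, and the score decreases as luminance, contrast or structure diverge.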

Supervisors: Alessandro Savino
Academic year: 2022/23
Publication type: Electronic
Number of Pages: 73
Degree programme: Master's degree programme in Computer Engineering (Ingegneria Informatica)
Degree class: New organization > Master science > LM-32 - COMPUTER SYSTEMS ENGINEERING
Collaborating companies: Lang.Tec
URI: http://webthesis.biblio.polito.it/id/eprint/26757