Color-Conditioned Abstract Image Generation with Diffusion Models

Christian Bardella

Color-Conditioned Abstract Image Generation with Diffusion Models.

Supervisors: Tatiana Tommasi, Angelica Urbanelli, Giuseppe Rizzo. Politecnico di Torino, Master's degree programme in Ingegneria Informatica (Computer Engineering), 2023

PDF (Tesi_di_laurea) - Thesis
License: Creative Commons Attribution Non-commercial No Derivatives.

This thesis investigates Diffusion Models (DMs) by developing a prototype that meets the requirements for variability and quality of generated images. The ultimate goal is to initiate a technology transition by exploring DMs alongside the well-established StyleGAN. Specifically, I examine the behavior of DMs in the "Color to Image" task, that is, their ability to generate images conditioned on a color label, with the final goal of producing images at 512x512 resolution. I adopt a step-by-step approach to gain a thorough understanding of this new technology, both practically and theoretically. An essential step was understanding how to condition a diffusion model effectively so that the generative process can be controlled precisely.

I first implement a basic diffusion network built on a shallow vanilla U-Net in order to understand how the various components of the model work, and I successfully train it on the "Letters font dataset", focusing on conditional and unconditional generation at a resolution of 32x32. The limitation of this method is that the entire network operates at the pixel level: the diffusion process is applied to the whole input image, which quickly becomes impractical for high-resolution image generation. The Latent Diffusion Model by CompVis offers a suitable solution because it applies the diffusion process to a reduced representation of the input. This model has demonstrated remarkable results in conditional image generation and has been released as open source. It is a two-stage architecture with 400 million parameters, consisting of an encoder-decoder network and a diffusion network. Using a pre-trained encoder, I trained the diffusion network on a customized version of the Wiki-Art dataset. However, the available time and resources were insufficient for a full training run comparable to the state of the art.

This first working prototype produces well-conditioned images at a resolution of 256x256. It shows that DMs outperform the previous StyleGAN in matching the required color and suggests that, with longer training and more computational resources, comparable performance in terms of FID could be achieved. Given these resource constraints, I have also adapted the Latent Diffusion code to run in a multi-GPU environment with limited resources, exploiting a fully-sharded data-parallel strategy. Overall, this thesis offers a comprehensive exploration of diffusion technology, covering its mathematical foundations and the relevant literature. It highlights the strengths and limitations of this approach in the label-to-image task.
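
To make the pixel-level versus latent-level contrast in the abstract concrete, the following is a minimal NumPy sketch, not the thesis code. It shows the standard closed-form forward noising step of a DDPM-style diffusion process on a 32x32 image, then compares how many values the process has to handle in pixel space and in a latent space. The linear noise schedule, the downsampling factor of 8, the 4 latent channels, and all function names are assumptions chosen for illustration; the abstract does not state them.

```python
import numpy as np


def linear_beta_schedule(num_steps=1000, beta_start=1e-4, beta_end=2e-2):
    """Linearly spaced noise variances (a common DDPM-style choice, assumed here)."""
    return np.linspace(beta_start, beta_end, num_steps)


def forward_diffuse(x0, t, alphas_cumprod, rng):
    """Sample x_t from q(x_t | x_0) in closed form; returns (x_t, noise)."""
    noise = rng.standard_normal(x0.shape)
    a_bar = alphas_cumprod[t]
    x_t = np.sqrt(a_bar) * x0 + np.sqrt(1.0 - a_bar) * noise
    return x_t, noise


def num_values(height, width, channels):
    """Number of scalar values the diffusion process operates on."""
    return height * width * channels


betas = linear_beta_schedule()
alphas_cumprod = np.cumprod(1.0 - betas)  # cumulative signal retention per step

rng = np.random.default_rng(0)
x0 = rng.uniform(-1.0, 1.0, size=(3, 32, 32))  # stand-in for a 32x32 RGB image
x_t, noise = forward_diffuse(x0, t=500, alphas_cumprod=alphas_cumprod, rng=rng)

# Pixel-space diffusion at the 512x512 target resolution.
pixel_values = num_values(512, 512, 3)  # 786432
# Latent-space diffusion, ASSUMING an 8x downsampling encoder with 4 channels.
latent_values = num_values(512 // 8, 512 // 8, 4)  # 16384
reduction = pixel_values // latent_values  # 48
```

Under these assumed settings the diffusion network would process 48 times fewer values per image in latent space, which is the kind of saving that makes high-resolution generation tractable. The actual factor depends on the encoder configuration that was used.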

Supervisors: Tatiana Tommasi, Angelica Urbanelli, Giuseppe Rizzo
Academic year: 2022/23
Publication type: Electronic
Number of Pages: 77
Degree programme: Master's degree programme in Ingegneria Informatica (Computer Engineering)
Degree class: New organization > Master science > LM-32 - COMPUTER SYSTEMS ENGINEERING
Collaborating companies: FONDAZIONE LINKS
URI: http://webthesis.biblio.polito.it/id/eprint/30088