A Multimodal Encoder of Music and Image for Valence Arousal Prediction

Tianming Qu

Supervisors: Giuseppe Rizzo, Luca Barco, Angelica Urbanelli. Politecnico di Torino, Master's degree programme in Data Science And Engineering, 2023

License: Creative Commons Attribution Non-commercial No Derivatives.

Abstract:	Emotion analysis is a fundamental component of human-computer interaction and influences domains such as content recommendation, image generation, and psychological research. Images and music, as crystallizations of human culture, inherently carry the emotions embedded by their creators, and analyzing the emotions conveyed by these works has long been a prominent research direction. Recent research in emotion analysis falls broadly into two streams: emotion label classification and valence-arousal prediction. My work focuses on valence-arousal prediction. Valence represents the pleasure or displeasure elicited by a stimulus, while arousal indicates the degree of excitement or calmness; both dimensions are crucial for describing human emotions. In recent years, with the rapid development of computer vision research, breakthroughs have been made in image and audio analysis. At the same time, multimedia applications that combine music and images, from advertising to movies to virtual reality experiences, have become increasingly popular, and multi-modal analysis holds great promise for them. In this context, my research constructs a multi-modal emotion prediction model based on metric learning. Throughout the experiments, I compare two architectures for the encoders: one based on CNNs (i.e., ResNet) and one based on more recent transformers. Different training losses are applied so that the model not only learns a shared latent embedding space but also learns the label space of each modality. I assess the performance of both encoder types under this architecture, aiming to establish a foundation for subsequent research.
Supervisors:	Giuseppe Rizzo, Luca Barco, Angelica Urbanelli
Academic year:	2023/24
Publication type:	Electronic
Number of pages:	87
Subjects:
Degree programme:	Master's degree programme in Data Science And Engineering
Degree class:	New system > Master's degree > LM-32 - INGEGNERIA INFORMATICA (Computer Engineering)
Collaborating companies:	FONDAZIONE LINKS
URI:	http://webthesis.biblio.polito.it/id/eprint/29591
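
The abstract describes training losses that both shape a shared latent embedding space and let each modality learn its valence-arousal label space, but it does not name them. As a minimal sketch of how such an objective might be combined, the Python example below pairs a contrastive-style alignment term with a mean-squared-error regression term. The loss forms, the function names, and the parameters `va_radius`, `margin`, `w_align`, and `w_reg` are illustrative assumptions, not the losses used in the thesis. The encoders (ResNet or transformer) are omitted and represented only by their output arrays.

```python
import numpy as np


def l2_normalize(x, eps=1e-8):
    """Scale each row to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)


def pairwise_dist(a, b):
    """Euclidean distance between every row of a and every row of b."""
    return np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)


def alignment_loss(img_emb, mus_emb, img_va, mus_va, va_radius=0.25, margin=1.0):
    """Assumed contrastive-style metric loss over all image/music pairs.

    Pairs whose valence-arousal labels lie within va_radius are pulled
    together in embedding space; all other pairs are pushed at least
    margin apart. Both thresholds are hypothetical.
    """
    emb_d = pairwise_dist(l2_normalize(img_emb), l2_normalize(mus_emb))
    similar = pairwise_dist(img_va, mus_va) < va_radius
    pull = np.where(similar, emb_d ** 2, 0.0)
    push = np.where(similar, 0.0, np.maximum(0.0, margin - emb_d) ** 2)
    return float(np.mean(pull + push))


def regression_loss(pred_va, true_va):
    """Mean squared error between predicted and true (valence, arousal)."""
    return float(np.mean((pred_va - true_va) ** 2))


def total_loss(img_emb, mus_emb, img_pred, mus_pred, img_va, mus_va,
               w_align=1.0, w_reg=1.0):
    """Weighted sum of the shared-space term and the per-modality label terms."""
    align = alignment_loss(img_emb, mus_emb, img_va, mus_va)
    reg = regression_loss(img_pred, img_va) + regression_loss(mus_pred, mus_va)
    return w_align * align + w_reg * reg


if __name__ == "__main__":
    # Toy stand-ins for encoder outputs: 2 images and 2 music clips.
    emb = np.array([[1.0, 0.0], [0.0, 1.0]])
    va = np.array([[0.5, 0.5], [-0.5, -0.5]])
    print(total_loss(emb, emb, va, va, va, va))        # well-aligned pairs
    print(total_loss(emb, emb[::-1], va, va, va, va))  # misaligned pairs
```

In this toy setup the first call scores embeddings that already agree with the labels, and the second swaps the music embeddings so that emotionally similar pairs sit far apart, which yields a larger loss. The weights `w_align` and `w_reg` stand in for whatever balance between the two objectives a real training run would tune.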