Author Gender Profiling from texts and images in Twitter

Ciccone, Giovanni

Author Gender Profiling from texts and images in Twitter.

Supervisor: Paolo Garza. Politecnico di Torino, Master's degree programme in Computer Engineering (Ingegneria Informatica), 2018

PDF (Tesi_di_laurea) - Thesis
Document access: Open access
License: Creative Commons Attribution Non-commercial No Derivatives.


This thesis describes the work I carried out at the INSA/LIRIS research center in Lyon during my ERASMUS exchange semester. I started the project in February 2018. At first, I worked on information retrieval from micro-blogging platforms, specifically Twitter. The goal was event detection on a corpus of tweets. For this purpose, I implemented a Python script that collects real-time tweets containing keywords passed as input parameters. I used the script to create two datasets: the first consists of tweets collected during the Italian political elections of 4 March 2018, and the second contains tweets about the FA Cup football match between Tottenham and Rochdale played on 28 February 2018. I applied two different event detection techniques to these datasets and evaluated the quality of the results against the events that actually occurred during the elections and the match, respectively.

One month after the start of the project, my INSA/LIRIS supervisor offered me the chance to take part in the PAN-CLEF 2018 Author Profiling task. It is an international competition, organized by the Conference and Labs of the Evaluation Forum (CLEF), in which research teams compete on text forensics topics of growing interest such as Author Profiling (AP). The aim of AP is to infer information about authors from the content they produce. The PAN-CLEF 2018 task concerns predicting the gender of Twitter users from the texts and images they post. The challenge consists of three subtasks: gender prediction from texts only, from images only, and from both (combined approach).

1. The first subtask had already been proposed in previous PAN editions, so our aim was to reach the state-of-the-art level, represented by the PAN 2017 winning team, as soon as possible, in order to devote the remaining time to the second subtask, which is entirely new within PAN. We solved the first subtask using natural language processing (preprocessing, bag of words, TF-IDF) and machine learning (a linear Support Vector Machine classifier) techniques.
2. For the images subtask, we experimented with different techniques (color histograms, local binary patterns, face detection, object detection). Since these produced only weak predictors, we decided to use classifier stacking to obtain a more robust estimator. Specifically, it consists of a layered architecture:
- the first layer contains the weak predictors
- the second layer is the meta-classifier that combines the four results coming from the previous stage
- the third layer aggregates the predictions associated with different images of the same user into a single estimate per user
3. The third subtask is simply the combination of the results obtained in the first and second subtasks.

The final outcomes are very encouraging: my team ranked 4th out of the 23 participating teams. After the software submission and acceptance, we prepared a scientific paper describing our proposed method. Moreover, I performed further experiments to understand which weaker points can be improved in future work.
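
The data flow of the three-layer stacking architecture described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis implementation. The predictor identifiers, the logistic form of the meta-classifier, its weights, the averaging rule in the third layer, and all scores are assumptions made up for illustration; in a real system the meta-classifier would be trained on the first-layer outputs.

```python
from collections import defaultdict
from math import exp
from statistics import mean

# The four weak predictors named in the abstract (identifiers are hypothetical).
WEAK_PREDICTORS = ("color_histogram", "local_binary_patterns",
                   "face_detection", "object_detection")


def meta_classifier(scores, weights, bias=0.0):
    """Layer 2: fuse the four layer-1 scores of one image into one probability.

    Each score is assumed to be a layer-1 estimate of P(female) in [0, 1].
    The logistic combination is an assumption made for this sketch.
    """
    z = bias + sum(weights[name] * (scores[name] - 0.5) for name in WEAK_PREDICTORS)
    return 1.0 / (1.0 + exp(-z))


def aggregate_per_user(image_probabilities):
    """Layer 3: average image-level probabilities into one estimate per user."""
    by_user = defaultdict(list)
    for user_id, probability in image_probabilities:
        by_user[user_id].append(probability)
    return {user_id: mean(values) for user_id, values in by_user.items()}


def to_label(probability, threshold=0.5):
    """Turn a per-user probability into a gender label."""
    return "female" if probability >= threshold else "male"


# Made-up layer-1 outputs for three images posted by two users.
images = [
    ("user_a", {"color_histogram": 0.6, "local_binary_patterns": 0.7,
                "face_detection": 0.9, "object_detection": 0.6}),
    ("user_a", {"color_histogram": 0.4, "local_binary_patterns": 0.5,
                "face_detection": 0.8, "object_detection": 0.5}),
    ("user_b", {"color_histogram": 0.3, "local_binary_patterns": 0.4,
                "face_detection": 0.1, "object_detection": 0.4}),
]

# Hypothetical meta-classifier weights (not learned, not from the thesis).
weights = {"color_histogram": 1.0, "local_binary_patterns": 1.0,
           "face_detection": 2.0, "object_detection": 1.0}

image_probabilities = [(user_id, meta_classifier(scores, weights))
                       for user_id, scores in images]
user_probabilities = aggregate_per_user(image_probabilities)
labels = {user_id: to_label(p) for user_id, p in user_probabilities.items()}
print(labels)  # {'user_a': 'female', 'user_b': 'male'}
```

Keeping the per-user aggregation as a separate last step means the first two layers only ever see single images, so a user with many images and a user with few are handled by the same image-level model.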

Supervisors: Paolo Garza
Academic year: 2017/18
Publication type: Electronic
Degree programme: Master's degree programme in Computer Engineering (Ingegneria Informatica)
Degree class: New system > Master's degree > LM-32 - Computer Engineering (Ingegneria Informatica)
Joint supervision institution: Institut National des Sciences Appliquées de Lyon - INSA (France)
Collaborating companies: Not specified
URI: http://webthesis.biblio.polito.it/id/eprint/8030