
Ethical Manufacturing of Datasets for Artificial Intelligence: an Empirical Investigation into the State of Documentation Practice

Marco Rondina

Ethical Manufacturing of Datasets for Artificial Intelligence: an Empirical Investigation into the State of Documentation Practice.

Supervisors: Antonio Vetro', Juan Carlos De Martin. Politecnico di Torino, Master's degree programme in Ingegneria Informatica (Computer Engineering), 2022

PDF (Tesi_di_laurea) - Thesis
License: Creative Commons Attribution Non-commercial Share Alike.


Artificial Intelligence research and industrial developments have made great strides in recent years, becoming increasingly pervasive within society, given the diffusion of AI applications with the aim of automating processes and decisions. One of the key elements of AI-based technologies is data, which play a central role in the quality of software outcomes. It is therefore becoming increasingly important to ensure that AI practitioners are fully aware of the quality of datasets and of the process generating them, in such a way that all the (typically implicit) assumptions, ethical issues and modeling choices clearly and transparently emerge, and their impact on downstream effects can be tracked, analysed and possibly mitigated. One of the tools that can be useful in this perspective is dataset documentation, because it helps to discover ethical issues in data and how to manage them. The first aim of this thesis was to draw up a scheme of the relevant information that should always be attached to a dataset, starting from published proposals for standardising documentation. This scheme is designed to make it easier to check the presence of such information, and to work as a measure of the completeness of the documentation. The next step consisted in applying the proposed scheme to some of the most popular datasets in the AI community. To this aim, four different repositories were selected (Huggingface, Kaggle, OpenML and UC Irvine ML) and, within each of them, the top 25 datasets were chosen. The aim was to assess how readily accessible this information was in the very same place where the data can be accessed. For this reason, the research focused on the analysis of the dataset description pages in the hosting repositories. Since automatic assessment led to inaccurate or incomplete results, it was integrated with manual checking.
The results were then analysed with mixed methods (qualitative and quantitative), which allowed the identification of some correlations between the available documentation and dataset characteristics. On average, datasets containing people-related data showed equally or even less detailed documentation compared to other datasets. Information on how to use the datasets appears to be the most frequently present. The least present information concerns maintenance over time, data collection processes and, finally, pre-processing and labelling. In general, a lack of relevant information was observed, highlighting a paucity of transparency. This observation is even more significant when considering that the analysis was restricted to some of the most popular and well-known datasets. The scheme and procedure proposed here represent a useful tool to improve transparency and accountability. On the one hand, they can be used by dataset hosts and dataset consumers to quickly and simply check the completeness of a dataset's documentation. On the other hand, they can serve as a guideline for dataset creators, helping them to improve their documentation so that dataset consumers can verify the underlying choices and assumptions. Altogether, these results show that the AI community urgently needs to devote much more attention to the dataset documentation process. The recommended path should be supported by the investigation and experimentation of techniques to fully integrate documentation models and processes into the AI pipeline.
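
To make the idea of a documentation scheme that doubles as a completeness measure more concrete, the following is a minimal sketch in Python. It is not the scheme developed in the thesis: the section names only loosely follow the information types mentioned in the abstract (use, collection, pre-processing and labelling, maintenance), and every field name, as well as the `completeness` function itself, is a hypothetical placeholder chosen for illustration.

```python
# Hypothetical scheme: section -> fields that a dataset description page
# would be expected to provide. These names are illustrative assumptions,
# not the fields defined in the thesis.
SCHEME = {
    "uses": ["recommended_uses", "license"],
    "collection": ["collection_process", "sources"],
    "preprocessing_labelling": ["preprocessing_steps", "labelling_procedure"],
    "maintenance": ["maintainer", "update_policy"],
}


def completeness(doc):
    """Return the fraction of scheme fields present, per section and overall.

    `doc` maps field names to the text found on the description page.
    A field counts as present only if its text is non-empty after stripping.
    """
    def present(name):
        return bool((doc.get(name) or "").strip())

    scores = {
        section: sum(present(f) for f in fields) / len(fields)
        for section, fields in SCHEME.items()
    }
    all_fields = [f for fields in SCHEME.values() for f in fields]
    scores["overall"] = sum(present(f) for f in all_fields) / len(all_fields)
    return scores


# Example: a page that documents usage well but says little else.
example_doc = {
    "recommended_uses": "Image classification benchmarks",
    "license": "CC BY 4.0",
    "sources": "   ",          # whitespace only, treated as missing
    "maintainer": "Example Lab",
}
example_scores = completeness(example_doc)
```

A presence-based ratio like this only records whether a field is filled in, not whether its content is accurate or sufficient, which is one reason an automatic pass of this kind would still need to be complemented by manual checking, as the abstract reports.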

Supervisors: Antonio Vetro', Juan Carlos De Martin
Academic year: 2021/22
Publication type: Electronic
Number of Pages: 131
Degree programme: Master's degree programme in Ingegneria Informatica (Computer Engineering)
Degree class: New organization > Master science > LM-32 - COMPUTER SYSTEMS ENGINEERING
Collaborating companies: UNSPECIFIED
URI: http://webthesis.biblio.polito.it/id/eprint/23519