
Speeding up convergence while preserving privacy in Heterogeneous Federated Learning

Andrea Rizzardi


Supervisors: Barbara Caputo, Debora Caldarola, Marco Ciccone. Politecnico di Torino, Master's degree in Data Science And Engineering, 2022

PDF (Tesi_di_laurea). License: Creative Commons Attribution Non-commercial No Derivatives.
Abstract:

The ability of machine learning and deep learning models to learn from data has led to their widespread adoption in many real-world settings. Object recognition, autonomous driving, semantic segmentation, and natural language generation are just a few examples of tasks they can tackle. The typical "centralized" strategy learns a model from collected sample data with the goal of generalizing to unseen data. Although this method produces excellent results, it has a fundamental flaw: in many real-world situations, gathering the necessary data is not trivial, since ever more data is protected by privacy regulations and therefore inaccessible. The research community introduced an alternative approach to enable learning in privacy-constrained scenarios: Federated Learning (FL). The key idea of FL is to involve all the users (referred to as clients) holding privacy-protected data in the training process, rather than collecting their data centrally. In particular, a central server distributes a randomly initialized model to the clients, who train it on their personal data. The server then aggregates the trained models and updates the global one accordingly. The procedure iterates over communication rounds. It is important to note that privacy is preserved because clients share only model updates with the server, never their raw data. Due to its decentralized nature, however, the FL paradigm has significant drawbacks. For instance, each client may have a different local data distribution, leading to divergent local models that are ill-suited for aggregation. Additionally, each client might hold only a small number of data samples, causing its local model to perform poorly.
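The communication round described above (broadcast, local training, aggregation) can be sketched as follows. This is a minimal illustration in the style of FedAvg-like aggregation, not the thesis's actual implementation: the 1-D least-squares client update, the learning rate, and the dataset layout are all illustrative assumptions.

```python
def local_update(w, data, lr=0.05):
    # Hypothetical local training step: one gradient step of 1-D least squares,
    # standing in for each client's private training on its own data.
    grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
    return w - lr * grad

def fedavg_round(global_w, client_datasets):
    # 1) The server broadcasts the global model; 2) each client trains locally.
    local_models = [local_update(global_w, data) for data in client_datasets]
    # 3) The server aggregates the returned models, weighted by local dataset
    #    size; only model parameters travel, never the clients' raw data.
    sizes = [len(data) for data in client_datasets]
    return sum(w * n for w, n in zip(local_models, sizes)) / sum(sizes)
```

Running several rounds on toy clients whose data all follow y = 2x drives the global weight toward 2, mimicking convergence over communication rounds.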
In this thesis, we address this issue, known as statistical heterogeneity, and to this end introduce FedSeq, which groups clients with diverse data distributions into clusters and trains sequentially within each cluster. Training benefits from this approach in two ways: first, the total number of data samples inside each cluster exceeds the average number held by a single client; second, and most significantly, the data distribution within a cluster is more uniform, producing better-trained models. Combined, these two facts let the trained model "see" more of the data it is trained on, while avoiding a training process that could yield a model specialized only on data from a subset of the available classes. FedSeq has been evaluated on various image recognition and text prediction tasks, achieving performance comparable to state-of-the-art FL algorithms. Finally, this thesis presents a thorough investigation of FedSeq's robustness and adaptability to privacy attacks and defenses, as well as a survey of the principal privacy attacks and defenses in FL.
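The cluster-based sequential training idea can be sketched as follows. This is a simplified reading of the abstract, not the thesis's actual algorithm: within each cluster the model is passed from client to client, each continuing training from its predecessor's result, and the server then averages the per-cluster models. The 1-D least-squares client step, learning rate, and cluster layout are illustrative assumptions.

```python
def local_update(w, data, lr=0.05):
    # Same hypothetical 1-D least-squares client step as in the FL sketch above.
    grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
    return w - lr * grad

def fedseq_round(global_w, clusters):
    # Each cluster trains the model sequentially: every client continues
    # from the model produced by the previous client in its cluster, so the
    # cluster's model effectively "sees" the union of its clients' data.
    cluster_models = []
    for cluster in clusters:
        w = global_w
        for client_data in cluster:
            w = local_update(w, client_data)
        cluster_models.append(w)
    # The server then averages the per-cluster models, weighted by the
    # total number of samples in each cluster.
    sizes = [sum(len(data) for data in cluster) for cluster in clusters]
    return sum(w * n for w, n in zip(cluster_models, sizes)) / sum(sizes)
```

In a heterogeneous setting, each client's data covers only part of the overall distribution; sequential training inside a cluster exposes the model to the cluster's combined, more uniform distribution before aggregation.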

Supervisors: Barbara Caputo, Debora Caldarola, Marco Ciccone
Academic year: 2022/23
Publication type: Electronic
Number of pages: 79
Subjects:
Degree programme: Master's degree in Data Science And Engineering
Degree class: New regulations > Master's degree > LM-32 - Computer Engineering
Collaborating companies: Politecnico di Torino
URI: http://webthesis.biblio.polito.it/id/eprint/25564