
Generative models for protein structure: A comparison between Generative Adversarial and Autoregressive networks

Letizia Bergamasco

Generative models for protein structure: A comparison between Generative Adversarial and Autoregressive networks.

Supervisors: Enrico Magli, Stefano Tubaro, Andrea Pagnani. Politecnico di Torino, Master's degree programme in Ict For Smart Societies (Ict Per La Società Del Futuro), 2020

License: Creative Commons Attribution Non-commercial No Derivatives.

Abstract:

This thesis addresses the generation of synthetic protein sequences. Starting from a dataset of protein sequences that belong to the same protein family, the goal is to generate new sequences that are statistically indistinguishable from those in the family. This is made possible by recent advances in protein sequencing, which have made a large number of protein family datasets available. To this end, we use neural network generative models that learn the probability distribution of a dataset, so that we can sample from that distribution and generate new synthetic data. In particular, two kinds of models are proposed: generative adversarial networks (GANs) and autoregressive (AR) neural networks. Both approaches are implemented in Python using the PyTorch framework. They are tested on two datasets that can be downloaded from the Pfam database, namely the multiple sequence alignments of the Kunitz/Bovine pancreatic trypsin inhibitor domain and of the Cyclic nucleotide-binding domain. The evaluation is twofold. On the one hand, we use one-point and two-point correlation plots to check whether the empirical amino acid frequency counts in the initial dataset and in the generated dataset are similar. On the other hand, we use sensitivity plots based on a method called direct coupling analysis, which can summarise a protein family's contact map or, in other words, its three-dimensional folded structure. The results show that both generative models are able to generate protein sequences that are statistically similar to those in the original dataset of their family: in the correlation plots, the amino acid frequency counts of the generated dataset correlate with those of the initial dataset. Moreover, the sensitivity plots reveal that the models can also automatically learn the three-dimensional folded structure of the protein family under consideration.
In this respect, GANs perform better on the Kunitz/Bovine pancreatic trypsin inhibitor domain dataset, whereas AR networks show higher potential to capture the three-dimensional folded structure on the Cyclic nucleotide-binding domain dataset. Overall, AR networks require much shorter training times than GANs.
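
The one-point and two-point frequency comparison mentioned in the abstract can be illustrated with a minimal sketch. This is not the thesis code: the function names, the toy alignments, and the use of a Pearson coefficient to summarise the agreement between the two datasets are all assumptions made for illustration, and the sketch relies only on the Python standard library.

```python
from collections import Counter
from itertools import combinations
from math import sqrt


def one_point(msa):
    """Map (column i, symbol a) to the fraction of sequences with a at column i."""
    n = len(msa)
    counts = Counter((i, a) for seq in msa for i, a in enumerate(seq))
    return {key: c / n for key, c in counts.items()}


def two_point(msa):
    """Map (i, j, a, b), with i < j, to the fraction of sequences that have
    symbol a at column i and symbol b at column j."""
    n = len(msa)
    counts = Counter(
        (i, j, seq[i], seq[j])
        for seq in msa
        for i, j in combinations(range(len(seq)), 2)
    )
    return {key: c / n for key, c in counts.items()}


def pearson(freq_a, freq_b):
    """Pearson correlation over the union of keys; a missing key counts as 0.
    Assumes neither frequency vector is constant."""
    keys = sorted(set(freq_a) | set(freq_b))
    xs = [freq_a.get(k, 0.0) for k in keys]
    ys = [freq_b.get(k, 0.0) for k in keys]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical toy alignments (equal-length rows), standing in for a real
# Pfam multiple sequence alignment and for sequences sampled from a model.
natural = ["ACD", "ACE", "GCD", "ACD"]
generated = ["ACD", "GCE", "ACD", "ACE"]

print("one-point agreement:", pearson(one_point(natural), one_point(generated)))
print("two-point agreement:", pearson(two_point(natural), two_point(generated)))
```

In a correlation plot, each key would become one point, with the frequency in the original dataset on one axis and the frequency in the generated dataset on the other. Points close to the diagonal indicate that the generated sequences reproduce the statistics of the family.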

Supervisors: Enrico Magli, Stefano Tubaro, Andrea Pagnani
Academic year: 2020/21
Publication type: Electronic
Number of pages: 79
Subjects:
Degree programme: Master's degree programme in Ict For Smart Societies (Ict Per La Società Del Futuro)
Degree class: New system > Master's degree > LM-27 - TELECOMMUNICATIONS ENGINEERING
Collaborating companies: NOT SPECIFIED
URI: http://webthesis.biblio.polito.it/id/eprint/15944