Francesco Caredda
Attention Based Direct Coupling Analysis for Protein Structure Prediction.
Supervisor: Andrea Pagnani. Politecnico di Torino, Master's degree programme in Physics Of Complex Systems (Fisica Dei Sistemi Complessi), 2022
License: Creative Commons Attribution Non-commercial No Derivatives.
Abstract: |
Proteins are at the base of every biological function within the cell, carrying out a variety of transport, signaling and enzymatic tasks. Their function relies heavily on their three-dimensional structure, which is extremely difficult, time-consuming and expensive to determine. In this thesis we discuss Direct Coupling Analysis (DCA), the state-of-the-art statistical physics model used to learn structural information about co-evolving proteins from their amino-acid sequences. Phylogenetically related homologous sequences can be considered as belonging to a single protein family with specific structural properties defining their functionality. For our purposes such sequences, aligned and collected in a data structure called a Multiple Sequence Alignment (MSA), can be thought of as samples drawn from a probability distribution encoding the fundamental structural traits of the protein family they belong to. The form of the distribution is obtained by applying the Maximum Entropy Principle, imposing as empirical constraints the single-site and pairwise frequency counts of the amino acids in the MSA. The resulting probability distribution is a Potts model whose parameters, to be inferred, represent the direct-interaction tensor for any two given residues and the local biases for each position in the sequence. More precisely, DCA is an inverse Potts problem aimed at inferring the set of parameters that best describes the direct residue-residue interactions for a specific protein family. Among the possible methods for solving the inference problem, we consider the state-of-the-art architecture for contact prediction, PlmDCA, which computes a maximum likelihood estimate of the parameters by gradient ascent on a pseudo-log-likelihood function that depends on the specific MSA.
In particular, the purpose of this thesis is to develop a possible improvement of PlmDCA inspired by the Attention Mechanism, a deep learning technique developed in the context of Natural Language Processing. Attention has been gaining popularity in the computational biology community since the recent success of AlphaFold 2 by DeepMind, which used it in its deep learning architecture for protein structure prediction at the 2020 CASP competition. In this new model, the interaction tensor of the Potts model is written as a non-linear low-rank decomposition whose aim is to share amino-acid features, effectively reproducing the fact that different positions may be in contact due to similar chemical interactions. The validity of the Attention-Based PlmDCA is tested against the standard PlmDCA architecture using three MSAs whose structural data are fully available through the Pfam database. |
---|---|
Supervisors: | Andrea Pagnani |
Academic year: | 2021/22 |
Publication type: | Electronic |
Number of Pages: | 84 |
Subjects: | |
Degree programme: | Corso di laurea magistrale in Physics Of Complex Systems (Fisica Dei Sistemi Complessi) |
Degree class: | New organization > Master science > LM-44 - MATHEMATICAL MODELLING FOR ENGINEERING |
Collaborating companies: | UNSPECIFIED |
URI: | http://webthesis.biblio.polito.it/id/eprint/23617 |
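
The pipeline summarized in the abstract (frequency counts from an MSA, a Potts model with fields and couplings, and a pseudo-log-likelihood objective) can be illustrated with a minimal NumPy sketch. It is not the thesis code: the function names, array shapes and, in particular, the softmax-based factorization `J[i,j,a,b] = sum_h A_h[i,j] * V_h[a,b]` are assumptions chosen only to show what a non-linear low-rank coupling tensor with shared amino-acid features might look like. The MSA is assumed to be already encoded as integers in `0..q-1`, and no symmetry or regularization is imposed on the couplings.

```python
import numpy as np

def one_hot(msa, q):
    """Encode an integer MSA of shape (M, L) as a one-hot array (M, L, q)."""
    return np.eye(q)[msa]

def frequencies(msa, q):
    """Single-site and pairwise amino-acid frequencies of the MSA."""
    x = one_hot(msa, q)
    f1 = x.mean(axis=0)                                     # (L, q)
    f2 = np.einsum("mia,mjb->ijab", x, x) / msa.shape[0]    # (L, L, q, q)
    return f1, f2

def attention_couplings(Q, K, V):
    """Hypothetical low-rank couplings (an assumption, not the thesis formula).

    Q, K: (H, L, d) position embeddings; V: (H, q, q) amino-acid features
    shared by all position pairs. Returns J of shape (L, L, q, q).
    """
    scores = np.einsum("hid,hjd->hij", Q, K)                # (H, L, L)
    idx = np.arange(scores.shape[1])
    scores[:, idx, idx] = -np.inf                           # no self-coupling
    scores -= scores.max(axis=2, keepdims=True)             # numerical stability
    A = np.exp(scores)
    A /= A.sum(axis=2, keepdims=True)                       # softmax over j
    return np.einsum("hij,hab->ijab", A, V)

def pseudo_loglikelihood(msa, J, h):
    """Average over sequences of sum_i log P(a_i | all other residues)."""
    q = h.shape[1]
    x = one_hot(msa, q)
    # Logit of state a at site i: h[i, a] + sum_j J[i, j, a, a_j].
    # The j = i term vanishes when J[i, i] is zero, as it is above.
    logits = h[None] + np.einsum("ijab,mjb->mia", J, x)     # (M, L, q)
    top = logits.max(axis=2, keepdims=True)
    log_z = top + np.log(np.exp(logits - top).sum(axis=2, keepdims=True))
    log_p = logits - log_z
    observed = np.take_along_axis(log_p, msa[..., None], axis=2).squeeze(-1)
    return observed.sum(axis=1).mean()

# Toy usage on random data (not a real protein family).
rng = np.random.default_rng(0)
toy_msa = rng.integers(0, 21, size=(50, 6))                 # M=50, L=6, q=21
toy_J = attention_couplings(rng.normal(size=(2, 6, 3)),
                            rng.normal(size=(2, 6, 3)),
                            rng.normal(size=(2, 21, 21)))
toy_pll = pseudo_loglikelihood(toy_msa, toy_J, np.zeros((6, 21)))
```

In a sketch like this, the parameter count of the couplings would scale with `H * (2 * L * d + q * q)` rather than `L * L * q * q`, which is the sense in which amino-acid features are shared across position pairs; in practice the objective would be maximized by gradient ascent with an automatic-differentiation library, which is omitted here.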