
Maximum entropy modelling for inference in biological sequences analysis

Matteo De Leonardis

Maximum entropy modelling for inference in biological sequences analysis.

Supervisor: Andrea Pagnani. Politecnico di Torino, Master's degree programme in Physics Of Complex Systems (Fisica Dei Sistemi Complessi), 2021

License: Creative Commons Attribution Non-commercial No Derivatives.


Likelihood maximization and entropy maximization are two common techniques for inferring the parameters of a probability distribution. In recent years, they have shown outstanding performance in structural biology inference problems based on sequence data. My work addresses two main aspects of this subject. The first is the prediction of contacts in a protein family through the analysis of correlations between residues. Standard information-theoretic methods based on local correlation measures (e.g. Mutual Information), which are routinely used to evaluate the correlation between two random variables, often fail because they cannot disentangle direct from indirect interactions between variables. Global inference strategies such as entropy maximization can instead be used to define a quantity called "direct information", which is capable of ignoring statistical correlations between residues that are not linked to the presence of contacts between them. The second research direction undertaken in my thesis concerns a maximum likelihood strategy to model phage display experiments. Phage display is a widespread laboratory technique (2018 Nobel Prize in Chemistry) for the study of protein–protein, protein–peptide, and protein–DNA interactions that uses bacteriophages (viruses that infect bacteria) to connect proteins with the genetic information that encodes them. A coding gene is inserted into the phage genome so that the protein under study is exposed on the phage capsid. Typically, a population of 10^13 phages is grown to display variants of wild-type proteins encoded in biologically engineered combinatorial libraries. This allows for screening tests, repeated for a certain number of rounds, that probe the binding capability of the variants against a target. After each round, the phage population can be sequenced to inspect the abundance of sequences that are bound to the target.
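To make the local correlation measure named in the first part of the abstract concrete, the following is a minimal sketch of the mutual information between two columns of a sequence alignment, estimated from raw frequency counts. The toy alignment and the function name `mutual_information` are illustrative assumptions, not taken from the thesis, and the sketch omits refinements such as pseudocounts or sequence reweighting.

```python
import math
from collections import Counter


def mutual_information(col_i, col_j):
    """Mutual information (in nats) between two alignment columns.

    Uses plain empirical frequencies; no pseudocounts or reweighting.
    """
    if len(col_i) != len(col_j) or len(col_i) == 0:
        raise ValueError("columns must be non-empty and of equal length")
    n = len(col_i)
    f_i = Counter(col_i)               # single-site counts, column i
    f_j = Counter(col_j)               # single-site counts, column j
    f_ij = Counter(zip(col_i, col_j))  # pair counts
    mi = 0.0
    for (a, b), count in f_ij.items():
        p_ab = count / n
        # p_ab / (p_a * p_b), written with counts to limit rounding error
        mi += p_ab * math.log(count * n / (f_i[a] * f_j[b]))
    return mi


# Hypothetical toy alignment: 4 sequences, 3 positions.
alignment = ["ACD", "ACD", "GCE", "GCE"]
columns = list(zip(*alignment))

mi_02 = mutual_information(columns[0], columns[2])  # perfectly co-varying pair
mi_01 = mutual_information(columns[0], columns[1])  # column 1 is fully conserved
```

A pairwise score of this kind only sees the two columns it is given, which is why, as the abstract notes, it cannot by itself separate direct couplings from correlations mediated by other positions.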
Usually, supervised machine learning approaches are used to analyze phage display experiments in order to predict the selectivity of new sequences. Nevertheless, an unsupervised approach based on likelihood maximization can be developed by outlining a statistical mechanics model that describes the experiment and allows for the statistical inference of its relevant parameters. This is carried out through a multivariate optimization of a likelihood score. With this approach, the binding of a sequence to the target is modeled probabilistically as a two-state system by means of an "energy" function that depends on the amino acid sequence. Finally, this model can be extended to a three-state system in which the third state can be associated with the sequence being folded but still unable to bind to the target.
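The two-state and three-state picture described above can be sketched with Boltzmann weights over a small set of states. This is only an illustration under stated assumptions: the additive, site-independent energy, the reference energy of zero for the unbound state, the numerical values, and the names `additive_energy` and `state_probabilities` are all hypothetical and are not claimed to match the parametrization used in the thesis.

```python
import math


def additive_energy(sequence, site_energies):
    """Hypothetical additive energy: one term per position and amino acid."""
    return sum(site_energies[pos][aa] for pos, aa in enumerate(sequence))


def state_probabilities(energies):
    """Boltzmann probabilities (in units where kT = 1) for the given state energies."""
    e_min = min(energies)  # shift by the minimum for numerical stability
    weights = [math.exp(-(e - e_min)) for e in energies]
    z = sum(weights)
    return [w / z for w in weights]


# Hypothetical per-site energies for a length-2 peptide (illustrative values).
site_energies = [
    {"A": -1.0, "G": 0.5},
    {"C": -0.5, "D": 1.0},
]

e_bound = additive_energy("AC", site_energies)

# Two-state system: unbound (reference energy 0) versus bound.
p_unbound, p_bound = state_probabilities([0.0, e_bound])

# Three-state system: add a "folded but not bound" state with an assumed energy.
e_folded_unbound = -0.2
p3 = state_probabilities([0.0, e_folded_unbound, e_bound])
```

In a likelihood-based fit, parameters such as the entries of `site_energies` would be adjusted so that the resulting selection probabilities best explain the sequence abundances observed across rounds; that optimization step is not shown here.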

Supervisors: Andrea Pagnani
Academic year: 2020/21
Publication type: Electronic
Number of Pages: 54
Degree programme: Master's degree programme in Physics Of Complex Systems (Fisica Dei Sistemi Complessi)
Degree class: New organization > Master science > LM-44 - MATHEMATICAL MODELLING FOR ENGINEERING
Collaborating companies: UNSPECIFIED
URI: http://webthesis.biblio.polito.it/id/eprint/17915