Integrating New Data into Generative Models of Biomolecular Sequences

Giovanni Peinetti

Integrating New Data into Generative Models of Biomolecular Sequences.

Supervisors: Andrea Pagnani, Martin Weigt. Politecnico di Torino, degree program not specified, 2024

PDF (Tesi_di_laurea) - Thesis
License: Creative Commons Attribution Non-commercial No Derivatives.

Abstract:

The design of functional artificial biomolecules has been one of the main interests of biotechnology in recent years. The aim is to design sequences that have the same functionality as natural ones and comparable features. Data-driven approaches are among the most successful strategies. In machine learning, generative statistical models are used to generate artificial biomolecular sequences. They are trained on multiple sequence alignments (MSAs) of homologous families, which consist of positive, unlabelled sequences. The literature contains several examples of generative models that have successfully generated functional RNA and proteins. Relying on the maximum entropy principle, Direct Coupling Analysis (DCA) models take the form of the Boltzmann distribution from statistical physics. They are built by learning a Potts model from data via maximum likelihood, and they can be used to sample artificial sequences. With the advent of new quantitative high-throughput experiments, more and more quantitatively annotated sequences are becoming available. This abundance of information presents unprecedented opportunities to improve generative models, significantly enhancing their accuracy and efficacy in synthetic biology. Using the framework of energy-based models, this thesis develops a new statistical-physics-inspired algorithm that integrates these labelled data into the construction of a better generative model. A new objective function was designed to include the information from both the unlabelled and the labelled data. Maximising it is equivalent to adjusting the target frequencies used for training, and no back-propagation is needed: it can be thought of as a refinement of the original generative model. The goal of this feedback system is twofold: to minimise the production of non-functional sequences and to engineer new artificial sequences that exhibit specific desired characteristics, such as structural compatibility.
Our algorithm was applied to train models on both synthetic and real data, and it gave very good results in steering generation towards the desired features. To validate our techniques, a series of biological experiments is scheduled for the near future.
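For readers unfamiliar with DCA, the generic pipeline the abstract refers to (learn a Potts model by maximum likelihood, then sample from it) can be sketched in a few lines. This is a minimal toy illustration under stated assumptions, not the thesis's algorithm: sequences are lists of integers in `0..q-1`, the model has energy E(a) = -sum_i h_i(a_i) - sum_{i<j} J_ij(a_i, a_j), and all function names (`energy`, `sample`, `frequencies`, `ml_step`, `train`) are hypothetical. The only link to the new method is that, as the abstract states, labelled data would enter by adjusting the target frequencies; how they are adjusted is not given in the abstract and is not reproduced here.

```python
import math
import random


def energy(seq, h, J):
    """Potts energy E(a) = -sum_i h[i][a_i] - sum_{i<j} J[i][j][a_i][a_j]."""
    L = len(seq)
    e = -sum(h[i][seq[i]] for i in range(L))
    e -= sum(J[i][j][seq[i]][seq[j]] for i in range(L) for j in range(i + 1, L))
    return e


def sample(h, J, q, n_samples, n_sweeps=2, rng=None):
    """Draw sequences from P(a) proportional to exp(-E(a)) via single-site Metropolis."""
    rng = rng or random.Random(0)
    L = len(h)
    seq = [rng.randrange(q) for _ in range(L)]
    out = []
    for _ in range(n_samples):
        for _ in range(n_sweeps * L):
            i, new = rng.randrange(L), rng.randrange(q)
            proposal = seq[:i] + [new] + seq[i + 1:]
            # Full energies are recomputed for clarity; fine for toy sizes only.
            delta = energy(proposal, h, J) - energy(seq, h, J)
            if delta <= 0 or rng.random() < math.exp(-delta):
                seq = proposal
        out.append(list(seq))
    return out


def frequencies(seqs, q):
    """Empirical one-site f[i][a] and two-site f2[i][j][a][b] (i < j) frequencies."""
    L, n = len(seqs[0]), len(seqs)
    f = [[0.0] * q for _ in range(L)]
    f2 = [[[[0.0] * q for _ in range(q)] for _ in range(L)] for _ in range(L)]
    for s in seqs:
        for i in range(L):
            f[i][s[i]] += 1.0 / n
            for j in range(i + 1, L):
                f2[i][j][s[i]][s[j]] += 1.0 / n
    return f, f2


def ml_step(h, J, target, model, lr):
    """One log-likelihood ascent step: parameter += lr * (target freq - model freq)."""
    (f, f2), (p, p2) = target, model
    L, q = len(h), len(h[0])
    for i in range(L):
        for a in range(q):
            h[i][a] += lr * (f[i][a] - p[i][a])
            for j in range(i + 1, L):
                for b in range(q):
                    J[i][j][a][b] += lr * (f2[i][j][a][b] - p2[i][j][a][b])


def train(data, q, n_iter=5, n_samples=50, lr=0.5, seed=0):
    """Fit fields h and couplings J so that model frequencies approach the targets."""
    L = len(data[0])
    h = [[0.0] * q for _ in range(L)]
    J = [[[[0.0] * q for _ in range(q)] for _ in range(L)] for _ in range(L)]
    # Assumption: plain MSA frequencies; a method that adjusts the target
    # frequencies would substitute its own values for `target` here.
    target = frequencies(data, q)
    rng = random.Random(seed)
    for _ in range(n_iter):
        model = frequencies(sample(h, J, q, n_samples, rng=rng), q)
        ml_step(h, J, target, model, lr)
    return h, J
```

Calling `train(data, q)` on a small integer-encoded alignment returns the fitted parameters, and `sample(h, J, q, n_samples)` then draws artificial sequences from the learned model. Because the update depends on the data only through `target`, changing those frequencies changes what the model is pulled towards without any other modification to the training loop.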

Supervisors: Andrea Pagnani, Martin Weigt
Academic year: 2023/24
Publication type: Electronic
Number of pages: 70
Subjects:
Degree program: Not specified
Degree class: New system > Master's degree > LM-44 - MATHEMATICAL-PHYSICAL MODELLING FOR ENGINEERING
Joint supervision institution: Sorbonne University (UPMC) (France)
Collaborating organizations: Sorbonne Université
URI: http://webthesis.biblio.polito.it/id/eprint/31103