An interpretable BERT-based architecture for SARS-CoV-2 variant identification

Giorgia Ghione

An interpretable BERT-based architecture for SARS-CoV-2 variant identification.

Supervisors: Santa Di Cataldo, Marta Lovino, Giansalvo Cirrincione, Elisa Ficarra. Politecnico di Torino, Master's degree course in Ingegneria Informatica (Computer Engineering), 2022

PDF (Tesi_di_laurea) - Thesis
Access restricted to: staff users only until 29 July 2025 (embargo date).
Licence: Creative Commons Attribution Non-commercial No Derivatives.

Abstract:	The Covid-19 pandemic has posed many challenges for medical diagnostics, among them the need to continuously detect and monitor circulating SARS-CoV-2 variants. The most common way to identify a variant reliably is genomic analysis, made possible by the ongoing worldwide collection of viral genetic sequences. However, variant identification methods are usually resource-intensive, so small medical laboratories with limited diagnostic capacity can struggle to apply them. This thesis presents a deep learning method that identifies variants without requiring high computational resources or long delays. Its contribution is twofold: 1) the development of a Bidirectional Encoder Representations from Transformers (BERT) fine-tuning architecture for SARS-CoV-2 variant identification; 2) the mathematical and biological interpretation of the model through its self-attention mechanism. The method analyzes the spike gene of SARS-CoV-2 genome samples to determine their variant quickly. BERT is a Transformer-based model originally proposed for processing natural language sequences, but it has since been applied successfully to several other contexts, including DNA/RNA sequence analysis. BERT was therefore fine-tuned to adapt it to the genomic sequence domain, reaching an F1 score of 98.59% on the inference dataset and proving effective at recognizing the variants circulating to date. Because BERT relies on the self-attention mechanism, the interpretability of the model was investigated by analyzing its self-attention matrices and hidden weights. The resulting mathematical interpretation clarified the biological meaning of the attention patterns produced by the network: BERT extracts relevant biological information on variants by focusing on specific parts of the SARS-CoV-2 spike gene. In particular, the analysis examined how attention spreads across the domains of the spike protein and found that attention is often localized at the sites of the variants' defining mutations. The developed architecture therefore provides insight into both the distinctive characteristics of SARS-CoV-2 genetic sequences and the behaviour of the BERT neural network.
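The idea of measuring how attention spreads across regions of a sequence can be illustrated with a minimal sketch. This is not the thesis's procedure, which is not described on this page; it only shows generic scaled dot-product self-attention on toy vectors, followed by a per-region sum of the attention each token receives. The token embeddings, the region names, and the index ranges are all hypothetical and do not correspond to real spike protein domain coordinates.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(queries, keys):
    """Attention weights softmax(Q K^T / sqrt(d)); one row per query token."""
    d = len(queries[0])
    scores = [
        [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        for q in queries
    ]
    return [softmax(row) for row in scores]


def attention_per_region(attn, regions):
    """Share of attention received by each region.

    The attention received by token j is the mean of column j; a region's
    share is the sum over its tokens (half-open index range [start, end)).
    """
    n = len(attn)
    received = [sum(attn[i][j] for i in range(n)) / n for j in range(n)]
    return {name: sum(received[start:end]) for name, (start, end) in regions.items()}


# Toy 2-dimensional embeddings for six tokens (hypothetical values).
tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [0.5, 0.5], [0.4, 0.6]]
attn = self_attention(tokens, tokens)

# Hypothetical regions; NOT real spike protein domain boundaries.
regions = {"region_A": (0, 2), "region_B": (2, 4), "region_C": (4, 6)}
shares = attention_per_region(attn, regions)

for name, share in shares.items():
    print(f"{name}: {share:.3f}")
```

Because every attention row sums to 1, the region shares also sum to 1 when the regions cover all tokens, so they can be read as a distribution of attention over the sequence. In a real model the matrix would come from a specific layer and head rather than from toy vectors.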
Supervisors:	Santa Di Cataldo, Marta Lovino, Giansalvo Cirrincione, Elisa Ficarra
Academic year:	2021/22
Publication type:	Electronic
Number of pages:	119
Subjects:
Degree course:	Master's degree course in Ingegneria Informatica (Computer Engineering)
Degree class:	New system > Master's degree > LM-32 - Computer Engineering
Collaborating companies:	Not specified
URI:	http://webthesis.biblio.polito.it/id/eprint/23527
