An interpretable BERT-based architecture for SARS-CoV-2 variant identification

Giorgia Ghione

Supervisors: Santa Di Cataldo, Marta Lovino, Giansalvo Cirrincione, Elisa Ficarra. Politecnico di Torino, Master's degree programme in Computer Engineering (Ingegneria Informatica), 2022

PDF (Tesi_di_laurea) - Thesis
Restricted to: Repository staff only until 29 July 2025 (embargo date).
License: Creative Commons Attribution Non-commercial No Derivatives.

The COVID-19 pandemic has posed many challenges in the field of medical diagnostics. One of these has been the need for constant detection and monitoring of the circulating SARS-CoV-2 variants. The most common approach to reliably identifying a SARS-CoV-2 variant relies on genomics, and it has been enabled by the continuous global collection of genetic sequences of the virus. However, variant identification methods are usually resource-intensive, so small medical laboratories may struggle because of their limited diagnostic capacity. This thesis presents a deep learning method that identifies variants without requiring high computational resources or long delays. The contribution of this thesis is twofold: 1) the development of a Bidirectional Encoder Representations from Transformers (BERT) fine-tuning architecture for SARS-CoV-2 variant identification; 2) the mathematical and biological interpretation of the model by leveraging its self-attention mechanism. The developed method analyzes the spike gene of SARS-CoV-2 genome samples to determine their variant quickly. The chosen neural network, BERT, is a Transformer-based model initially proposed for processing natural language sequences. However, it has been successfully applied in several other contexts, such as DNA/RNA sequence analysis. Therefore, BERT was fine-tuned to adapt it to the genomic sequence domain, reaching an F1 score of 98.59% on the inference dataset: it proved effective in recognizing the variants circulating to date. Since BERT relies on the self-attention mechanism, the interpretability of the model was investigated by analyzing its self-attention matrices and hidden weights. The resulting mathematical interpretation made it possible to understand the biological meaning of the attention patterns produced by the network. Indeed, BERT extracts relevant biological information on variants by focusing on specific parts of the SARS-CoV-2 spike gene.
In particular, the thesis examines how attention spreads across the domains of the spike protein and finds that attention is often localized on the sites of the defining mutations of variants. Therefore, the developed architecture provides insights both into the distinctive characteristics of SARS-CoV-2 genetic sequences and into the behaviour of the BERT neural network.
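
The abstract does not describe how attention was aggregated over the spike protein domains, so the following is only a minimal sketch of one plausible way to do it, not the procedure used in the thesis. It assumes a single square self-attention matrix whose rows are query tokens and whose columns are key tokens, and it sums the attention each token receives within contiguous index ranges. The function names, the toy matrix, and the region boundaries (`region_A`, `region_B`) are all hypothetical and do not correspond to real spike domain coordinates.

```python
from typing import Dict, List, Tuple

Matrix = List[List[float]]


def attention_received(attn: Matrix) -> List[float]:
    """Average attention each token position receives over all query rows."""
    n_rows = len(attn)
    n_cols = len(attn[0])
    return [sum(row[j] for row in attn) / n_rows for j in range(n_cols)]


def attention_by_region(
    attn: Matrix, regions: Dict[str, Tuple[int, int]]
) -> Dict[str, float]:
    """Share of received attention that falls inside each [start, end) region."""
    received = attention_received(attn)
    total = sum(received)
    return {
        name: sum(received[start:end]) / total
        for name, (start, end) in regions.items()
    }


# Toy 4x4 attention matrix (illustrative values only): each row is a
# softmax output, so it sums to 1.
toy_attention = [
    [0.1, 0.1, 0.7, 0.1],
    [0.2, 0.1, 0.6, 0.1],
    [0.1, 0.2, 0.5, 0.2],
    [0.1, 0.1, 0.7, 0.1],
]

# Hypothetical token-index ranges standing in for protein domains.
toy_regions = {"region_A": (0, 2), "region_B": (2, 4)}

shares = attention_by_region(toy_attention, toy_regions)
print(shares)  # region_A is about 0.25, region_B about 0.75
```

In this toy case most of the attention lands on the third token, so the region containing it receives roughly three quarters of the total. With a real model, one such matrix would be available per layer and per attention head, and how to combine them is a separate design choice that this sketch leaves open.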

Supervisors: Santa Di Cataldo, Marta Lovino, Giansalvo Cirrincione, Elisa Ficarra
Academic year: 2021/22
Publication type: Electronic
Number of Pages: 119
Degree programme: Master's degree programme in Computer Engineering (Ingegneria Informatica)
Degree class: New organization > Master of Science > LM-32 - Computer Systems Engineering
Collaborating companies: Unspecified
URI: http://webthesis.biblio.polito.it/id/eprint/23527