Politecnico di Torino

Machine-Learning Techniques for the Diagnosis of COVID-19 from Exhaled-Breath Mass Spectra

Matteo Serra

Machine-Learning Techniques for the Diagnosis of COVID-19 from Exhaled-Breath Mass Spectra.

Supervisors: Giovanni Squillero, Nicolo' Bellarmino. Politecnico di Torino, Master's degree course in Computer Engineering (Ingegneria Informatica), 2023

License: Creative Commons Attribution Non-commercial No Derivatives.


The emergence of COVID-19, caused by SARS-CoV-2, created a global health crisis and a need for rapid, non-invasive diagnostic methods. Traditional approaches such as RT-PCR have limitations, so this study uses Machine Learning to detect COVID-19 from the mass spectra of patients' exhaled breath.

The study began by building a dataset of mass spectra stored in .ASC files. Each file contains multiple acquisitions, each corresponding to one mass spectrum. The first phase identified the zones in which the mass spectrometer is stable: zones whose first derivative stays within a tolerance band were considered flat, the standard deviation within each plateau was computed, and the acquisitions with the lowest values were selected for mass-spectrum extraction. This preliminary phase produced four datasets, one for each mass range.

Data exploration revealed a measurement bias: acquisitions taken on the same day or on nearby days were closer to one another. Normalization techniques such as total ion current (TIC) and Krypton normalization were applied to address this. High dimensionality was mitigated with feature selection methods such as PCA and gradient boosting. The shortage of data samples was addressed with a novel data augmentation technique that combines acquisitions of the same patient taken in different ranges. To improve signal quality, the spectra were pre-processed with baseline correction using an ALS algorithm, a Savitzky-Golay smoothing filter, and a peak alignment procedure. Outliers were detected and discarded with a z-score filter and a comparison against the NIST krypton isotopic ratios.

A Convolutional Autoencoder (CAE) was designed as a feature extractor and trained for 20 epochs, using various layers and regularization techniques.
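As an illustration of two of the pre-processing steps named above, the following is a minimal Python sketch of flat-zone detection and TIC normalization. It is not the thesis code: the function names `flat_mask` and `tic_normalize`, the use of `numpy.gradient` as the first-derivative estimate, and the choice of a one-value-per-acquisition trace as input are all assumptions made for the example.

```python
import numpy as np

def tic_normalize(spectrum):
    """Divide each intensity by the total ion current (sum of intensities)."""
    spectrum = np.asarray(spectrum, dtype=float)
    total = spectrum.sum()
    if total <= 0:
        raise ValueError("spectrum has a non-positive total ion current")
    return spectrum / total

def flat_mask(signal, tol):
    """Mark samples whose estimated first derivative lies within +/- tol.

    Assumption: `signal` holds one value per acquisition and the derivative
    is estimated with finite differences (numpy.gradient).
    """
    signal = np.asarray(signal, dtype=float)
    derivative = np.gradient(signal)
    return np.abs(derivative) <= tol

# Toy trace (invented values): a ramp, a plateau, then a drop.
trace = [0, 5, 10, 10, 10, 10, 3]
mask = flat_mask(trace, tol=0.5)
plateau = np.asarray(trace, dtype=float)[mask]
print(mask, plateau.std())          # flat samples and their standard deviation
print(tic_normalize([1, 1, 2]))     # intensities rescaled to sum to 1
```

Because central differences look at both neighbours, samples at the edges of a plateau are not marked as flat; a real implementation might prefer a different derivative estimate or a minimum plateau length.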
The CAE was trained on a padded, noise-corrupted version of the signal, with the goal of reconstructing the denoised version. The choice of L2 regularization was a key point in building the CAE. The network architecture is based on a basic block made of a 1D convolutional layer, a max pooling or up-sampling layer (depending on whether the block belongs to the encoder or the decoder), and a batch normalization layer, with different kernel and pooling sizes. The CAE achieved satisfactory signal reconstruction, and its encoder part is used as a feature extractor that reduces dimensionality by a factor of 6.

Several Machine Learning models, including KNN, RF, LR, XB, SVM, and an ensemble model, were evaluated in a 10-fold cross-validation protocol with stratification and outlier reduction. Variance thresholding, PCA, or XGBoost handled feature reduction, and oversampling addressed class imbalance. Experiments revealed range 2 as the most discriminating for classification. With PCA as feature selection and TIC normalization, accuracy and F1-score improved from 82% and 68% to 93% and 87%, respectively. Expanding the dataset to the whole mass range led to 95% accuracy and 92% F1-score. The CAE improved results, achieving 92% balanced accuracy and 90% F1-score by mitigating day bias.

This study introduced a framework for COVID-19 detection from breath mass spectra that uses Machine Learning and a CAE to handle high dimensionality. Range 2 was found to be the most informative, and the proposed method achieved 95% accuracy and 92% F1-score with a portable mass spectrometer, representing an improvement over invasive methods.
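The results above are reported as accuracy, balanced accuracy, and F1-score, which can differ noticeably on an imbalanced dataset. The sketch below shows how the three are computed for a binary problem; the helper name `binary_metrics` and the toy labels are invented for illustration and are not data from the thesis.

```python
def binary_metrics(y_true, y_pred):
    """Return accuracy, balanced accuracy and F1 for labels in {0, 1}."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    accuracy = (tp + tn) / len(pairs)
    # Recall on each class; 0.0 is used when a class is absent.
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    balanced_accuracy = (sensitivity + specificity) / 2
    # F1 is the harmonic mean of precision and recall on the positive class.
    f1 = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0
    return {"accuracy": accuracy,
            "balanced_accuracy": balanced_accuracy,
            "f1": f1}

# Toy example: 2 positives, 8 negatives, one positive missed.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
print(binary_metrics(y_true, y_pred))
```

In this toy case plain accuracy is 0.9, while balanced accuracy is 0.75 and F1 is about 0.67, which is why a gap between accuracy and F1 is expected when the positive class is the minority.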

Supervisors: Giovanni Squillero, Nicolo' Bellarmino
Academic year: 2023/24
Publication type: Electronic
Number of Pages: 89
Degree course: Master's degree course in Computer Engineering (Ingegneria Informatica)
Degree class: New organization > Master science > LM-32 - COMPUTER SYSTEMS ENGINEERING
Collaborating companies: NanoTech Analysis srl
URI: http://webthesis.biblio.polito.it/id/eprint/28458