
A modern reimplementation of an alignment pipeline for the analysis and quantification of small non-coding RNA and isoforms using C++ and Python

Marco Capettini


Supervisor: Gianvito Urgese. Politecnico di Torino, Master's degree programme in Computer Engineering (Ingegneria Informatica), 2020

License: Creative Commons Attribution Non-commercial No Derivatives.


In recent years, computer science has taken on an increasingly central role in the production and analysis of biological data. The continuous development of cutting-edge technologies such as next-generation sequencing (NGS) has enabled great progress in the field of genetic sequence analysis. For this reason, and because of the enormous amount of data these procedures produce daily, many algorithms and tools developed for the analysis need to be optimized to exploit the enhanced features of new computing systems. In this thesis I propose a modern reimplementation of an alignment tool called isomiR-SEA, which was developed with a precise objective in mind: overcoming some of the limitations of today’s general-purpose alignment algorithms, which usually lack accuracy and completeness in their results. The first version of the tool was designed to detect and quantify small non-coding RNA sequences (microRNAs) and their variants (isomiRs). Such sequences are provided to the program as simple text strings composed of the characters A, C, G and U, which represent very short segments of RNA made up of about 20-22 nucleotides. These small sequences play a critical role in gene expression because of their regulatory effect on protein production. It is in fact well established that they are fundamental to several cellular processes and, as a consequence, to the onset and progression of many diseases such as immune disorders and cancer. The isomiR-SEA algorithm was developed in C++14, written in a non-optimized way, and not completely tested. So, in order to make it usable by the bioinformatics community, there was a strong need for software re-engineering and bug fixing.
For this reason I decided to reimplement the software in conformance with the modern C++17 standard and with SeqAn3, currently the latest version of the library for the analysis of biological sequences, which replaces SeqAn2, used in the old version of isomiR-SEA. Besides fixing bugs, I implemented several new features, such as serialization of the input reference databases, which saves time in consecutive executions, and the option of providing a single input file that is the union of several smaller ones, so that one execution yields the same results that previously required many. These changes, together with a revised output-printing mechanism (the original wasted a large amount of resources by keeping temporary structures in memory), turned a first prototype of the tool into a working version tested in an environment very close to the intended one, providing a product that bioinformaticians can use today. The new version of isomiR-SEA achieves a significant performance gain, improving execution time (by up to ~60%) and drastically reducing peak RAM consumption (by ~75%). Finally, I dealt with the post-analysis of the output data, porting to Python 3 scripts what was previously implemented in KNIME, a platform for building and deploying data science workflows through a graphical environment. Although KNIME is very convenient for prototyping thanks to its intuitive, model-oriented graphical interface, it falls short in efficiency and performance when compared to Python.
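
The idea of serializing the reference database so that consecutive executions skip the parsing step can be illustrated with a minimal sketch. This is not the thesis implementation, which is written in C++ with SeqAn3. It is a Python illustration using the standard `pickle` module, and the function names, the FASTA-like input layout, and the file-timestamp freshness check are all assumptions made for the example.

```python
import pickle
from pathlib import Path


def parse_fasta(path):
    """Parse a FASTA-like text file into {record_id: sequence} (hypothetical helper)."""
    records, name, chunks = {}, None, []
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if name is not None:
                records[name] = "".join(chunks)
            header = line[1:].split()
            name, chunks = (header[0] if header else ""), []
        else:
            chunks.append(line.upper())
    if name is not None:
        records[name] = "".join(chunks)
    return records


def load_reference(fasta_path, cache_path):
    """Return the parsed reference, reusing the serialized copy when it is not older than the source."""
    fasta, cache = Path(fasta_path), Path(cache_path)
    if cache.exists() and cache.stat().st_mtime_ns >= fasta.stat().st_mtime_ns:
        with cache.open("rb") as fh:
            return pickle.load(fh)  # fast path: skip text parsing
    records = parse_fasta(fasta)
    with cache.open("wb") as fh:
        pickle.dump(records, fh, protocol=pickle.HIGHEST_PROTOCOL)
    return records
```

The first call parses the text file and writes the cache; later calls load the cache directly as long as the source file has not been modified since. Because `pickle` can execute arbitrary code on load, a cache of this kind should only be read from files the program wrote itself.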

Supervisors: Gianvito Urgese
Academic year: 2019/20
Publication type: Electronic
Number of Pages: 71
Degree programme: Master's degree programme in Computer Engineering (Ingegneria Informatica)
Degree class: New organization > Master science > LM-32 - COMPUTER SYSTEMS ENGINEERING
Collaborating companies: UNSPECIFIED
URI: http://webthesis.biblio.polito.it/id/eprint/14493