
Analyzing Advanced Code Representations with Machine Learning

Daniele Falcetta

Analyzing Advanced Code Representations with Machine Learning.

Supervisor: Paolo Garza. Politecnico di Torino, Master's degree course in Data Science And Engineering, 2023

PDF (Tesi_di_laurea) - Thesis
Restricted to: Repository staff only until 21 April 2026 (embargo date).
License: Creative Commons Attribution Non-commercial No Derivatives.


Code vulnerabilities are weaknesses in a software system that attackers can exploit to gain unauthorized access, steal sensitive data, or cause other harm. Recognizing and fixing these vulnerabilities protects the system and its users from attacks and breaches, and it is one of the most pressing security issues the software industry faces today. This work is motivated by the need to ensure that the open-source components used in SAP's products are free of (known) vulnerabilities. Vulnerability management for software with open-source components is challenging because it depends on standard sources of advisories and vulnerability data that are not always reliable. Previous efforts aimed to reduce this dependency by analyzing source code directly to automatically detect security-relevant commits. The first attempt treated source code changes as documents written in natural language, potentially ignoring the structured nature of source code. Later works incorporated structural information in the form of abstract syntax trees (ASTs). This work seeks to incorporate richer code representations into the analysis, e.g. semantic representations and graphs that capture control and data flow, such as control flow graphs (CFGs) and program dependence graphs (PDGs). The dataset used is an open-source, code-centric dataset that does not rely purely on metadata but tries to identify vulnerabilities at the code level. The goal is to explore different features for code representation (e.g. hand-crafted, tree-based, or graph-based features) and to benchmark the performance of these representations on a series of industry-relevant tasks, in particular classifying security-relevant commits, i.e. commits that are likely to fix a vulnerability.
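
To make the contrast between a natural-language view and a tree-based view of a code change concrete, the following is a minimal sketch, not the method used in the thesis. It assumes Python source and the standard-library `ast` module purely for convenience; the function names and the before/after snippets are hypothetical, and the record does not state which languages, tools, or features the thesis actually uses.

```python
import ast
from collections import Counter


def token_counts(source: str) -> Counter:
    """Text view: treat code as plain text and count whitespace-separated tokens."""
    return Counter(source.split())


def ast_node_counts(source: str) -> Counter:
    """Tree view: parse the code and count the AST node types it contains."""
    tree = ast.parse(source)
    return Counter(type(node).__name__ for node in ast.walk(tree))


def structural_delta(before: str, after: str) -> Counter:
    """Node types that occur more often after the change than before it."""
    # Counter subtraction keeps only strictly positive differences.
    return ast_node_counts(after) - ast_node_counts(before)


# Hypothetical before/after versions of a function touched by a commit.
BEFORE = "def read(buf, i):\n    return buf[i]\n"
AFTER = (
    "def read(buf, i):\n"
    "    if i >= len(buf):\n"
    "        raise IndexError(i)\n"
    "    return buf[i]\n"
)

# The structural view exposes that the change adds a guard (If, Compare, Raise),
# which a classifier could use as a feature for security relevance.
delta = structural_delta(BEFORE, AFTER)
```

In this sketch the text view only sees new words such as `if` and `raise`, while the tree view records that a conditional check and an exception were introduced. Graph-based representations such as CFGs or PDGs would additionally need a program analysis tool, which is outside the scope of this illustration.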

Supervisors: Paolo Garza
Academic year: 2022/23
Publication type: Electronic
Number of Pages: 109
Degree course: Master's degree course in Data Science And Engineering
Degree class: New organization > Master science > LM-32 - COMPUTER SYSTEMS ENGINEERING
Collaborating companies: SAP Labs France
URI: http://webthesis.biblio.polito.it/id/eprint/26681