Overhead Prediction in Obfuscated Programs

Ilio Di Pietro

Overhead Prediction in Obfuscated Programs.

Rel. Cataldo Basile, Antonio Lioy. Politecnico di Torino, Corso di laurea magistrale in Ingegneria Informatica (Computer Engineering), 2022

Preview

PDF (Tesi_di_laurea) - Tesi
Licenza: Creative Commons Attribution Non-commercial No Derivatives.
Download (10MB) | Preview

Abstract:	Software obfuscation represents one of the best defense techniques against Man- At-The-End cyber attacks, in which the attacker has legal access to the resource he wants to violate. It consists of code transformations that make a program more difficult to understand and reverse engineer by changing its structure while preserving its original semantics. Although it is a very popular technique, the main disadvantage is that it worsens the performance of the obfuscated program. In fact, in order to make the code incomprehensible, the application of an obfuscating transformation involves the addition of complex computations, most of which introduce loops, dead paths, or data structures for the sole purpose of confusing the attacker. Such additions negatively affect the execution of the program. Therefore, the aim of this thesis is to build a machine learning model capable of estimating the overhead caused by the application of various obfuscation techniques to given programs. Previously collected data, concerning both the extraction of information on the execution of certain real-life programs and the application of two obfuscation techniques (code flattening and opaque predicate) on such programs, are taken into consideration. This data collection consists of execution traces of non-obfuscated programs, as well as execution times for both non-obfuscated and obfuscated versions. Traces are analyzed and processed in such a way as to create a complete dataset, extracting only those parts that are useful to obtain a representation of the program in terms of a list of executed instructions. The overhead is then calculated using the aforementioned execution time information. The sequential nature of the traces led to the choice of a deep learning model based on Long-Short-Term-Memory, capable of taking into account the time dependency between instructions. To address the long but still limited memory of the LSTMs networks, classes of equality between instructions are established in order to simplify them in such a way that semantically similar traces have the same syntax. The following compression step takes these simplifications into account and uses identical instructions appearing in sequence to reduce the length of the traces. After taking care of the structure of the dataset, to ensure better data quality, a preprocessing phase follows, in which outliers were detected and eliminated from the dataset. A subsequent encoding phase addresses the problem of categorical features and allows the dataset to take on an appropriate conformation to be used by the deep learning algorithm. Eventually, after splitting the data set into train and test sets, the network will be trained and then evaluated in order to obtain overhead estimates. The analysis shows that there are certain limitations in the ability to predict overhead and discusses possible reasons for this, in order to continue to move towards a complete solution to the problem of performance prediction in obfuscated programs.
Relatori:	Cataldo Basile, Antonio Lioy
Anno accademico:	2021/22
Tipo di pubblicazione:	Elettronica
Numero di pagine:	100
Soggetti:
Corso di laurea:	Corso di laurea magistrale in Ingegneria Informatica (Computer Engineering)
Classe di laurea:	Nuovo ordinamento > Laurea magistrale > LM-32 - INGEGNERIA INFORMATICA
Aziende collaboratrici:	Politecnico di Torino
URI:	http://webthesis.biblio.polito.it/id/eprint/23525

Modifica (riservato agli operatori)