Federico Germinario
Identification of hard-coded secrets in GitHub through NLP-based scanners.
Supervisor: Giuseppe Rizzo. Politecnico di Torino, Master's degree programme in Data Science and Engineering, 2023
Abstract: | The exposure of hard-coded credentials in source code is listed as one of the most dangerous vulnerabilities, because it can give an attacker unauthorized access to internal and external services. Automatically identifying secrets in public and private repositories remains a challenging problem, owing to the varied nature of credentials and code snippets and to the lack of specific, reliable benchmark datasets. Previous work has focused on regular expressions and entropy-based approaches, which discover a limited set of structured strings with distinct formats, such as API keys, while ignoring unstructured credentials such as passwords. In this work we propose an NLP-based solution to identify both structured and unstructured hard-coded credentials in source code written in various programming languages. The solution identifies credentials by classifying the context, extracting the candidate credential, and checking its authenticity. Classification is performed by large pre-trained language models fine-tuned on two specific tasks: classifying the context of the snippet and classifying the credential. Extraction is instead performed through query-based matching against the corresponding snippet's Abstract Syntax Tree. The proposed solution achieves state-of-the-art performance on a set of handcrafted datasets collected from GitHub. To encourage the development of new solutions, we will publish two datasets containing labeled snippets of code and credentials that reflect the GitHub domain. |
---|---|
Supervisors: | Giuseppe Rizzo |
Academic year: | 2022/23 |
Publication type: | Electronic |
Number of Pages: | 98 |
Additional Information: | Confidential thesis. Full text not available |
Subjects: | |
Degree programme: | Master's degree programme in Data Science and Engineering |
Degree class: | New organization > Master science > LM-32 - COMPUTER SYSTEMS ENGINEERING |
Joint supervision institution: | INSTITUT EURECOM (FRANCE) |
Collaborating companies: | SAP Labs France |
URI: | http://webthesis.biblio.polito.it/id/eprint/26683 |