
Identification of hard-coded secrets in GitHub through NLP-based scanners

Federico Germinario


Supervisor: Giuseppe Rizzo. Politecnico di Torino, Master's degree programme in Data Science and Engineering, 2023

Abstract:

The exposure of hard-coded credentials in source code is listed among the most dangerous vulnerabilities, because it allows an attacker to gain unauthorized access to internal and external services. Automatically identifying secrets in public and private repositories remains a challenging problem, owing to the varied nature of credentials and code snippets and to the lack of specific, reliable benchmark datasets. Previous work has focused on regular expressions and entropy-based approaches, which discover a limited set of structured strings with distinct formats, such as API keys, but ignore unstructured credentials such as passwords. In this work we propose an NLP-based solution that identifies both structured and unstructured hard-coded credentials in source code written in various programming languages. The solution identifies credentials in three steps: classifying the context, extracting the candidate credential, and checking the authenticity of the retrieved credential. Classification is performed by large pre-trained language models, fine-tuned on two tasks: classifying the context of the snippet and classifying the credential. Extraction is performed through query-based matching against the Abstract Syntax Tree of the corresponding snippet. The proposed solution achieves state-of-the-art performance on a set of handcrafted datasets collected from GitHub. To encourage the development of new solutions, we will publish two datasets of labeled code snippets and credentials that reflect the GitHub domain.
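As a rough illustration of two ideas mentioned in the abstract, the sketch below walks a snippet's Abstract Syntax Tree to pull out string literals assigned to suspiciously named variables, and computes the Shannon entropy that entropy-based scanners typically threshold on. It is not the thesis's implementation: the keyword list `SUSPICIOUS_NAMES`, the helper names, and the sample snippet are all assumptions made for this example, a simple keyword match stands in for the fine-tuned language models, and Python's standard `ast` module stands in for a multi-language parser.

```python
import ast
import math
from collections import Counter
from typing import List, Tuple

# Hypothetical keyword list (an assumption for this sketch); the thesis
# classifies context with fine-tuned language models instead.
SUSPICIOUS_NAMES = ("password", "passwd", "secret", "token", "api_key")


def shannon_entropy(text: str) -> float:
    """Shannon entropy of the character distribution, in bits per character."""
    if not text:
        return 0.0
    counts = Counter(text)
    total = len(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def extract_candidates(source: str) -> List[Tuple[str, str]]:
    """Return (variable name, string literal) pairs for suspicious assignments."""
    candidates = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Assign):
            continue
        value = node.value
        # Keep only assignments whose right-hand side is a plain string literal.
        if not (isinstance(value, ast.Constant) and isinstance(value.value, str)):
            continue
        for target in node.targets:
            if isinstance(target, ast.Name) and any(
                key in target.id.lower() for key in SUSPICIOUS_NAMES
            ):
                candidates.append((target.id, value.value))
    return candidates


# Made-up sample snippet; neither value is a real credential.
snippet = '''
db_password = "summer2023"
API_KEY = "Zx9Qw3Lm7Pk2Vt8Rb5Nc"
greeting = "hello world"
'''

for name, literal in extract_candidates(snippet):
    print(name, literal, round(shannon_entropy(literal), 2))
```

In this sample the password-like literal has lower character entropy than the random-looking key, which suggests why a scanner relying only on an entropy threshold can miss unstructured credentials, while a context-aware extraction step still surfaces both candidates.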

Supervisors: Giuseppe Rizzo
Academic year: 2022/23
Publication type: Electronic
Number of pages: 98
Additional information: Confidential thesis. Full text not available
Subjects:
Degree programme: Master's degree programme in Data Science and Engineering
Degree class: New regulations > Master's degree > LM-32 - Computer Engineering
Joint supervision institution: Institut EURECOM (France)
Collaborating companies: SAP Labs France
URI: http://webthesis.biblio.polito.it/id/eprint/26683