Designing and engineering a Q&A LLM for network packet representation

Giovanni Dettori

Supervisors: Luca Vassio, Marco Mellia, Matteo Boffa. Politecnico di Torino, Master's degree in ICT for Smart Societies (Ict Per La Società Del Futuro), 2024

License: Creative Commons Attribution Non-commercial No Derivatives.

Abstract:	As internet traffic continues to grow exponentially, accurately classifying and analyzing it becomes increasingly important for network performance, security, and reliability. Traditional traffic classification methods often rely on static rules, which are becoming less effective given the increasing complexity of network environments, the rapid evolution of protocols, and the growth of encrypted traffic. This calls for more sophisticated techniques that represent and classify internet packets based on intrinsic characteristics derived from both the header and the payload. The primary challenge lies in creating a representation of each packet that summarizes its significant features while remaining computationally efficient and scalable. To address this problem, machine learning algorithms and deep learning models are now widely used for their ability to learn complex patterns and relationships within the data. This thesis proposes a training pipeline that produces a meaningful packet representation, a floating-point vector of dimension 768, by applying different fine-tuning methods to the pre-trained T5 model. The core fine-tuning approach emulates what the tool Wireshark already does: an LLM is asked different questions about the header and payload of an internet packet. The process starts from raw PCAP files, which are pre-processed to create a question-answering dataset used to fine-tune a modified version of the T5 model. Specifically, a bottleneck is introduced between the encoder and the decoder to obtain a representation of the input packet. For the objectives of the thesis, the component of interest is the encoder plus the bottleneck, which can be used to address many network problems, such as classification, novelty detection, and malicious pattern recognition.
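As an illustration of the pre-processing step described above, the following is a minimal sketch of how question-answer pairs could be derived from a raw packet header. It is not the thesis's actual pipeline: the function names `parse_ipv4_header` and `build_qa_pairs`, the question template, the choice of fields, and the use of a hex string as context are all assumptions made for this example, which only covers the fixed 20-byte IPv4 header.

```python
import socket
import struct


def parse_ipv4_header(raw: bytes) -> dict:
    """Decode the fixed 20-byte part of an IPv4 header into named fields."""
    if len(raw) < 20:
        raise ValueError("need at least 20 bytes for an IPv4 header")
    # Network byte order: version/IHL, TOS, total length, identification,
    # flags/fragment offset, TTL, protocol, checksum, source, destination.
    (ver_ihl, _tos, total_len, _ident, _flags_frag,
     ttl, proto, _checksum, src, dst) = struct.unpack("!BBHHHBBH4s4s", raw[:20])
    return {
        "version": ver_ihl >> 4,
        "header length": (ver_ihl & 0x0F) * 4,  # IHL is counted in 32-bit words
        "total length": total_len,
        "time to live": ttl,
        "protocol": proto,
        "source address": socket.inet_ntoa(src),
        "destination address": socket.inet_ntoa(dst),
    }


def build_qa_pairs(raw: bytes) -> list[dict]:
    """Turn one packet into question/context/answer records (hypothetical format)."""
    context = raw.hex()  # assumption: the model sees the packet as a hex string
    return [
        {
            "question": f"What is the {name} of the packet?",
            "context": context,
            "answer": str(value),
        }
        for name, value in parse_ipv4_header(raw).items()
    ]


# Hand-written sample IPv4 header (UDP, 192.168.0.1 -> 192.168.0.199).
SAMPLE_HEADER = bytes.fromhex("45000073000040004011b861c0a80001c0a800c7")

for pair in build_qa_pairs(SAMPLE_HEADER):
    print(pair["question"], "->", pair["answer"])
```

Records of this shape could then be fed to a sequence-to-sequence model as input text (question plus context) and target text (answer). A real pipeline would read packets from PCAP files and cover more protocols and payload questions.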
The proposed architecture is evaluated on a suite of classification tasks related to the application layer, such as application or service recognition. The results show that the approach is viable: accuracy and F1 score on the classification tasks drop by an average of only 5% with respect to state-of-the-art models, while the method has the advantage of yielding a numeric packet representation that is easily handled by a machine.
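For reference, the two metrics named in the evaluation can be computed as in the sketch below. The abstract does not state which F1 averaging is used, so macro averaging is an assumption here. The function names and the toy labels are hypothetical and are not results from the thesis.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores (macro averaging, an assumption)."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class p, but it was wrong
            fn[t] += 1  # true class t was missed
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never occurs
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy application-recognition labels, made up for illustration only.
y_true = ["web", "web", "dns", "mail"]
y_pred = ["web", "dns", "dns", "mail"]

print("accuracy:", accuracy(y_true, y_pred))
print("macro F1:", macro_f1(y_true, y_pred))
```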
Supervisors:	Luca Vassio, Marco Mellia, Matteo Boffa
Academic year:	2024/25
Publication type:	Electronic
Number of pages:	73
Subjects:
Degree programme:	Master's degree in ICT for Smart Societies (Ict Per La Società Del Futuro)
Degree class:	New system > Master's degree > LM-27 - TELECOMMUNICATIONS ENGINEERING
Collaborating companies:	Politecnico di Torino
URI:	http://webthesis.biblio.polito.it/id/eprint/33158
