Enhancing Arduino AI Assistant: Semi-supervised User Intent Classification for RAG Optimization

Shurui Chen

Enhancing Arduino AI Assistant: Semi-supervised User Intent Classification for RAG Optimization.

Supervisor: Luca Vassio. Politecnico di Torino, Master's degree programme in ICT for Smart Societies (Ict Per La Società Del Futuro), 2025

License: Creative Commons Attribution Non-commercial No Derivatives.

Abstract:	Recent advances in generative AI have enabled the integration of large language models (LLMs) into development environments to assist users in programming, interpreting, and debugging code. This thesis presents a complete data processing and classification pipeline for a GenAI Chat Assistant embedded in the Arduino Cloud Editor. We propose classifying user queries into distinct intent categories, specifically create code, explain, suggest, and fix errors, to optimize Retrieval-Augmented Generation (RAG) responses. The goal is to improve the relevance and efficiency of RAG by tailoring document retrieval and response generation strategies to the specific user intent, thereby reducing irrelevant content and optimizing token usage in LLM responses. To support this classification, we first construct a text preprocessing framework that filters out noise, prompts, code-only content, and non-English inputs, retaining only valid user queries for analysis. We benchmark multiple sentence embedding models (e.g., MiniLM, mpnet, E5, BGE) and ultimately select intfloat/e5-base-v2 for its balance between performance and efficiency. Because the raw dataset is unlabeled, we define explicit guidelines for manual labeling. The number of labeled examples per category needs to be balanced so that self-training can learn the characteristics of each category. After comparing the performance of several classification models, we build a semi-supervised classification framework based on a calibrated ensemble (LightGBM, CatBoost, logistic regression). To address label imbalance and the blurred boundaries between some categories, the per-category confidence threshold is adjusted dynamically, and high-confidence pseudo-labels are added to the training set so that the model continues to learn. This approach gradually expands the labeled dataset while maintaining a balanced category distribution.
Our method improves pseudo-label quality, learns underrepresented categories stably, and achieves good macro F1 scores on the held-out validation set. When applied to a test dataset extracted from the production database, the model predicts intent labels for new user queries and provides an estimated class distribution. The resulting labeled data supports intent-aware retrieval and more precise response generation within the GenAI system, enhancing user interaction, improving token efficiency, and enabling further optimization of assistant performance.
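The pseudo-labeling step described in the abstract can be illustrated with a minimal sketch. It is not the thesis implementation: the function names, the linear threshold rule, the `base` and `floor` values, and the per-class cap are all assumptions chosen for illustration. The sketch takes calibrated class probabilities for unlabeled queries, relaxes the confidence threshold for categories with fewer labeled examples, and keeps at most a fixed number of the most confident pseudo-labels per category so that the expanded training set stays balanced.

```python
# Hypothetical sketch of one pseudo-label selection round.
# The threshold rule and all parameter values are assumptions, not the thesis method.

def class_thresholds(label_counts, base=0.90, floor=0.70):
    """Return a confidence threshold per class.

    The largest class gets `base`; smaller classes get a proportionally
    lower threshold, never below `floor`.
    """
    largest = max(label_counts.values())
    return {
        label: floor + (base - floor) * (count / largest)
        for label, count in label_counts.items()
    }


def select_pseudo_labels(probs, thresholds, per_class_cap):
    """Pick high-confidence pseudo-labels, capped per class.

    probs: list of dicts mapping intent label -> calibrated probability.
    Returns a list of (sample_index, label, confidence), sorted by index.
    """
    candidates = {label: [] for label in thresholds}
    for index, distribution in enumerate(probs):
        label = max(distribution, key=distribution.get)
        confidence = distribution[label]
        if confidence >= thresholds[label]:
            candidates[label].append((index, label, confidence))

    selected = []
    for items in candidates.values():
        # Keep only the most confident predictions of each class.
        items.sort(key=lambda item: item[2], reverse=True)
        selected.extend(items[:per_class_cap])
    return sorted(selected)
```

In a full self-training loop, the selected samples would be appended to the labeled set, the classifier ensemble retrained, and the thresholds recomputed from the updated label counts before the next round.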
Supervisors:	Luca Vassio
Academic year:	2024/25
Publication type:	Electronic
Number of pages:	69
Subjects:
Degree programme:	Master's degree programme in ICT for Smart Societies (Ict Per La Società Del Futuro)
Degree class:	New system > Master's degree > LM-27 - Telecommunications Engineering
Collaborating companies:	Arduino srl
URI:	http://webthesis.biblio.polito.it/id/eprint/36554
