
A Modular LLM‑Based Framework for Semantic Navigation and Perception in Mobile Robotics

Giuseppe Antonio Gentile

Supervisors: Alessandro Rizzo, Pangcheng David Cen Cheng. Politecnico di Torino, Master's degree programme in Mechatronic Engineering (Ingegneria Meccatronica), 2025.

PDF (Tesi_di_laurea) - Thesis
License: Creative Commons Attribution Non-commercial No Derivatives.
Download (8 MB)
Abstract:

As robots become increasingly integrated into human environments, the ability to interact with them intuitively through natural language has become a crucial goal. Traditional robotic control systems require structured inputs and predefined behaviors, which limits their adaptability in dynamic, real-world environments. Recent advances in Large Language Models (LLMs) offer a new paradigm: harnessing language as a general interface for reasoning, perception, and decision making. However, integrating LLMs with embodied agents presents fundamental challenges, including grounding instructions in physical space, ensuring safety, and linking symbolic language to low-level robotic actions. This thesis explores the integration of LLMs into robotic systems for natural-language-driven control and perception. The proposed architecture connects a GPT-based reasoning agent with a ROS 2 navigation stack and a visual pipeline comprising BLIP for visual question answering and YOLOv8 with depth sensing for object localization. Natural language instructions are parsed into structured commands using prompt injection, verified for safety via an Abstract Syntax Tree (AST) parser, and dispatched to the robot for execution. The system is tested in a simulated environment with a TurtleBot4 platform. Experimental results demonstrate reliable performance in goal navigation and partial success in end-to-end instruction-to-verification tasks. While predefined commands achieve a 100% success rate, semantically inferred goals reveal the limits of symbolic mapping and perception coverage. The vision modules exhibit strong accuracy on binary visual questions and acceptable spatial detection within proximity thresholds. This work contributes a modular, extensible framework for embodied LLMs, validating its potential for grounded, multimodal interaction. Future extensions addressing real-time feedback integration, memory, and multi-turn dialogue are discussed.
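The record itself gives no implementation detail, but the AST-based safety check mentioned in the abstract is concrete enough to illustrate. The minimal Python sketch below shows one way such a verifier can work, assuming (hypothetically) that the LLM emits a single function-style command such as navigate_to(x=1.0, y=2.5); the function names and whitelist are illustrative, not taken from the thesis.

    import ast

    # Hypothetical whitelist of dispatchable robot commands (illustrative only).
    ALLOWED_CALLS = {"navigate_to", "rotate", "describe_scene", "locate_object"}

    def is_safe_command(source: str) -> bool:
        """Accept only a single call to a whitelisted function whose
        arguments are all literals (no names, attributes, or nested calls)."""
        try:
            tree = ast.parse(source, mode="eval")
        except SyntaxError:
            return False
        call = tree.body
        # The expression must be a plain call to a bare function name.
        if not isinstance(call, ast.Call) or not isinstance(call.func, ast.Name):
            return False
        if call.func.id not in ALLOWED_CALLS:
            return False
        # Reject any non-literal argument, which also blocks nested calls.
        args = list(call.args) + [kw.value for kw in call.keywords]
        return all(isinstance(a, ast.Constant) for a in args)

    print(is_safe_command("navigate_to(x=1.0, y=2.5)"))            # True
    print(is_safe_command("__import__('os').system('shutdown')"))  # False

A check of this kind rejects arbitrary code before anything reaches the robot, which matches the verify-then-dispatch flow the abstract describes.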

Supervisors: Alessandro Rizzo, Pangcheng David Cen Cheng
Academic year: 2024/25
Publication type: Electronic
Number of pages: 66
Subjects:
Degree programme: Master's degree programme in Mechatronic Engineering (Ingegneria Meccatronica)
Degree class: New regulations > Master's degree > LM-25 - INGEGNERIA DELL'AUTOMAZIONE (Automation Engineering)
Partner companies: NOT SPECIFIED
URI: http://webthesis.biblio.polito.it/id/eprint/36499