Vision-Language-Action models for industrial robotics

Francesco Antonio Novia

Vision-Language-Action models for industrial robotics.

Supervisor: Alessandro Rizzo. Politecnico di Torino, Master's degree programme in Ingegneria Informatica (Computer Engineering), 2026

Files: PDF (Tesi_di_laurea), thesis, 20MB; Archive (ZIP) (Documenti_allegati), other, 33MB. License: Creative Commons Attribution Non-commercial No Derivatives.

Abstract

Recent developments in Vision-Language-Action (VLA) models represent a remarkably innovative approach in the field of robotics. This class of AI models promises to play a notable role in robotics research, much as foundation models have deeply reshaped a large part of modern AI technology across several contexts. By combining multi-modal input understanding, the generative capabilities of LLMs, and an effective translation layer for performing real-world actions, VLAs aim to embody the concept of a unified physical intelligence that can be readily applied to different, unseen embodiments and can increase the versatility and robustness of modern autonomous robotic systems. The ability to adapt and fine-tune these models for various types of tasks and environments can potentially enhance the capabilities of common robotic systems, such as industrial manipulators, and allow users to control them in a more natural way.

The aim of this thesis is to explore and integrate existing state-of-the-art (SoTA) VLA models for industrial applications, specifically involving context-aware manipulation and pick-and-place use cases.