Environment and Embodiment adaptation of Vision-Language-Action models for robotic manipulation

Andrea Delli

Environment and Embodiment adaptation of Vision-Language-Action models for robotic manipulation.

Supervisors: Giuseppe Bruno Averta, Davide Buoso, Francesca Pistilli. Politecnico di Torino, Master's degree programme in Computer Engineering (Ingegneria Informatica), 2025

PDF (Tesi_di_laurea) - Thesis (19MB)
Restricted access: staff users only until 12 June 2027 (embargo date).
License: Creative Commons Attribution Non-commercial No Derivatives.

Abstract:

Vision-Language-Action (VLA) models are a recent and promising direction in robotics: they enable agents to understand natural language instructions, perceive complex visual scenes, and perform manipulation tasks. However, these models often struggle to generalize across robotic embodiments and environments, because changes in camera viewpoint, kinematics, or action space introduce significant distribution shifts. This thesis investigates robotic embodiment adaptation by evaluating the performance and adaptability of existing pre-trained VLA models on diverse robotic setups. The study fine-tunes and assesses, using imitation learning, multiple state-of-the-art VLA architectures: Diffusion Policy, OpenVLA, OpenVLA-OFT, SmolVLA, GR00T, and π0. Data were collected primarily in simulation with the RLBench environment, which provides standardized tasks for the 7-DoF Franka Panda arm, and further validated on a 6-DoF real-world manipulator developed by the DIANA student team. In total, approximately 500 simulated episodes and 50 real demonstrations were gathered. Fine-tuning relied mainly on the LeRobot framework, while the OpenVLA models required custom training pipelines based on datasets in the RLDS format. The experimental results highlight the difficulty of achieving robust cross-embodiment generalization: even with fine-tuning, performance often degrades under variations in viewpoint or control space, which emphasizes the need for specialized adaptation procedures. Nevertheless, this work provides a systematic evaluation of existing VLA models for manipulation and contributes to the open-source ecosystem by integrating RLBench compatibility into the official LeRobot repository. The findings show the importance of developing optimized fine-tuning strategies, such as those introduced in OpenVLA's Optimized Fine-Tuning (OFT) recipe, and of incorporating lightweight adaptation methods like LoRA to facilitate domain and embodiment transfer.
This research thus lays the groundwork for more scalable and generalizable VLA-based robotic learning.
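
The abstract names LoRA as a lightweight adaptation method without describing it. As a rough illustration only, and not the thesis's training code, the sketch below shows the general LoRA idea: the pre-trained weight W stays frozen and only a low-rank update (alpha / r) * B A is trained. All sizes, names, and initial values here are hypothetical and chosen for illustration.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_effective_weight(w, a, b, alpha):
    """Return W + (alpha / r) * B @ A, the weight used after LoRA adaptation.

    w: frozen pre-trained weight, shape (d_out, d_in)
    a: trainable down-projection, shape (r, d_in)
    b: trainable up-projection, shape (d_out, r)
    """
    r = len(a)
    delta = matmul(b, a)
    scale = alpha / r
    return [[w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]


# Hypothetical sizes, chosen only for illustration.
d_out, d_in, rank, alpha = 8, 8, 2, 4.0
rng = random.Random(0)
W = [[rng.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]
A = [[rng.gauss(0, 0.01) for _ in range(d_in)] for _ in range(rank)]
B = [[0.0] * rank for _ in range(d_out)]  # zero init: adaptation starts as a no-op

W_adapted = lora_effective_weight(W, A, B, alpha)
full_params = d_out * d_in            # parameters in the frozen weight
lora_params = rank * (d_in + d_out)   # parameters actually trained
```

Because B starts at zero, the adapted weight initially equals the pre-trained one, and only rank * (d_in + d_out) parameters are trained instead of d_out * d_in. The saving grows with layer size, which is why the approach is considered lightweight for large models.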

Supervisors: Giuseppe Bruno Averta, Davide Buoso, Francesca Pistilli
Academic year: 2025/26
Publication type: Electronic
Number of pages: 81
Subjects:
Degree programme: Master's degree programme in Computer Engineering (Ingegneria Informatica)
Degree class: New system > Master's degree > LM-32 - COMPUTER ENGINEERING
Collaborating companies: NOT SPECIFIED
URI: http://webthesis.biblio.polito.it/id/eprint/38614