
Adapting LLaMA 3.2 Vision for Unified Robotic Planning and Control

Vittorio Di Giorgio


Supervisors: Alessio Sacco, Guido Marchetto, Flavio Esposito. Politecnico di Torino, Master's degree programme in Ingegneria Informatica (Computer Engineering), 2025

PDF (Tesi_di_laurea) - Thesis, 27MB
License: Creative Commons Attribution Non-commercial No Derivatives.
Abstract:

Vision-Language-Action (VLA) models are emerging as powerful tools for embodied AI, allowing robots to merge visual perception with language understanding to execute complex tasks. Their potential lies in combining perception, reasoning, and control within a single framework, which could greatly enhance robotics pipelines and boost generalization. However, it remains uncertain how effectively today's open, mid-size vision-language models (VLMs) can be adapted into practical VLAs. Previous works like RT-2 have demonstrated impressive results but rely on large proprietary models and undisclosed training methods, while OpenVLA presents an open-source alternative based on composite architectures. In contrast, this study investigates whether a single, open, mid-size model like LLaMA 3.2 Vision Instruct can be fine-tuned into a functional VLA using only limited computational resources, along with carefully designed prompting and training strategies.

The approach is evaluated on two complementary benchmarks: ALFRED, which emphasizes high-level household reasoning and long-horizon planning, and Open X-Embodiment (OpenX), which concentrates on low-level robotic manipulation trajectories. For ALFRED, the model is fine-tuned to generate both a natural language plan and a discrete sequence of actions (e.g., GoToLocation, PickupObject). This is achieved by creating structured prompts that enforce the chronological order of observations and conditionally integrate scene objects, effectively introducing dropout to enhance robustness. This framework enables the model to produce coherent and aligned language–action plans, even though the inference time per sample remains significant.

For OpenX, a discrete action vocabulary is established by mapping 256 uncommon LLaMA tokens to 8-dimensional robot actions (termination, 3D position, rotation, gripper). While single-frame baselines struggle to follow meaningful trajectories, reframing the task as multi-frame sequence prediction with temporal windows and object-focused prompts allows the model to maintain trajectory consistency, termination, and gripper control.

The results show that these fine-tuned models significantly surpass baseline configurations that lack fine-tuning or prompt design. Even when using only a small subset of the original datasets and training with limited resources, the models display promising abilities in both high-level reasoning and low-level control. Overall, the findings indicate that mid-size, open models like LLaMA 3.2 Vision can function as effective foundations for embodied AI when combined with efficient fine-tuning and carefully engineered inputs. These insights illuminate future research directions, including scaling data and computational resources, refining prompting strategies, and ultimately fostering the development of general-purpose, resource-efficient VLA systems for robotics.
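A minimal sketch of the ALFRED-style prompt construction described above, assuming observations are listed in chronological order and detected scene objects are included only with some probability as a dropout-style augmentation. The template wording, field names, and the 0.5 keep-probability are illustrative assumptions, not the exact format used in the thesis.

import random

# Hypothetical prompt builder: chronological observations plus
# conditionally included scene objects (dropout-style augmentation).
def build_prompt(goal, observations, scene_objects, keep_prob=0.5, rng=random):
    lines = [f"Task goal: {goal}", "Observations (in order):"]
    for step, obs in enumerate(observations, start=1):
        lines.append(f"  {step}. {obs}")
    # Each object survives with probability keep_prob; dropping objects at
    # training time forces the model not to over-rely on the object list.
    kept = [obj for obj in scene_objects if rng.random() < keep_prob]
    if kept:
        lines.append("Visible objects: " + ", ".join(kept))
    lines.append("Output a natural language plan followed by the action sequence.")
    return "\n".join(lines)

print(build_prompt(
    goal="Put a clean mug on the coffee machine",
    observations=["agent is in the kitchen", "mug is on the table"],
    scene_objects=["Mug", "Sink", "CoffeeMachine", "Table"],
))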
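Likewise, a minimal sketch of the kind of discrete action encoding described for OpenX, assuming each of the 8 action dimensions (termination, 3D position, rotation, gripper) is quantized into 256 bins and each bin index is mapped onto one of 256 rarely used tokens reserved in the LLaMA vocabulary. The token-ID offset, the [-1, 1] bin range, and the function names below are placeholders for illustration, not values taken from the thesis.

import numpy as np

NUM_BINS = 256
RESERVED_TOKEN_OFFSET = 128_000  # assumed start of the reserved token range

def encode_action(action, low=-1.0, high=1.0):
    """Map a continuous 8-D action to 8 reserved token IDs."""
    action = np.clip(np.asarray(action, dtype=np.float32), low, high)
    bins = np.round((action - low) / (high - low) * (NUM_BINS - 1)).astype(int)
    return (RESERVED_TOKEN_OFFSET + bins).tolist()

def decode_action(token_ids, low=-1.0, high=1.0):
    """Invert the mapping: reserved token IDs back to a continuous action."""
    bins = np.asarray(token_ids) - RESERVED_TOKEN_OFFSET
    return low + bins / (NUM_BINS - 1) * (high - low)

# Example: [termination, dx, dy, dz, droll, dpitch, dyaw, gripper]
tokens = encode_action([0.0, 0.12, -0.05, 0.30, 0.0, 0.0, 0.1, 1.0])
print(tokens)
print(decode_action(tokens))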

Supervisors: Alessio Sacco, Guido Marchetto, Flavio Esposito
Academic year: 2025/26
Publication type: Electronic
Number of pages: 168
Subjects:
Degree programme: Master's degree programme in Ingegneria Informatica (Computer Engineering)
Degree class: New regulations > Master's degree > LM-32 - INGEGNERIA INFORMATICA
Partner companies: NOT SPECIFIED
URI: http://webthesis.biblio.polito.it/id/eprint/38639