Adapting LLaMA 3.2 Vision for Unified Robotic Planning and Control
Vittorio Di Giorgio
Rel. Alessio Sacco, Guido Marchetto, Flavio Esposito. Politecnico di Torino, Master of Science program in Computer Engineering, 2025.
PDF (Tesi_di_laurea) - Thesis, 27MB
Licence: Creative Commons Attribution Non-commercial No Derivatives.
Abstract
Vision-Language-Action (VLA) models are emerging as powerful tools for embodied AI, allowing robots to merge visual perception with language understanding to execute complex tasks. Their potential lies in combining perception, reasoning, and control within a single framework, which could greatly enhance robotics pipelines and boost generalization. However, it remains uncertain how effectively today’s open, mid-size vision-language models (VLMs) can be adapted into practical VLAs. Previous works like RT-2 have demonstrated impressive results but rely on large proprietary models and undisclosed training methods, while OpenVLA presents an open-source alternative based on composite architectures. In contrast, this study investigates whether a single, open, mid-size model like LLaMA 3.2 Vision Instruct can be fine-tuned into a functional VLA using only limited computational resources, along with carefully designed prompting and training strategies.
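As a rough illustration of what such an adaptation involves, the sketch below shows one plausible way to attach LoRA adapters to LLaMA 3.2 Vision Instruct using Hugging Face Transformers and PEFT, which is a common route for fine-tuning under limited compute. The checkpoint name, adapter hyperparameters, and target modules are assumptions for illustration and are not taken from the thesis.

```python
# Minimal sketch of parameter-efficient (LoRA) fine-tuning for LLaMA 3.2 Vision
# Instruct. The checkpoint name, adapter hyperparameters, and target modules are
# illustrative assumptions, not the configuration actually used in the thesis.
import torch
from transformers import MllamaForConditionalGeneration, AutoProcessor
from peft import LoraConfig, get_peft_model

MODEL_ID = "meta-llama/Llama-3.2-11B-Vision-Instruct"  # assumed checkpoint

processor = AutoProcessor.from_pretrained(MODEL_ID)
model = MllamaForConditionalGeneration.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,  # half-precision weights to fit limited GPU memory
    device_map="auto",
)

# Attach low-rank adapters to the attention projections so that only a small
# fraction of the parameters is updated during fine-tuning.
lora_cfg = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()  # typically well under 1% of all weights
```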
The approach is evaluated on two complementary benchmarks: ALFRED, which emphasizes high-level household reasoning and long-horizon planning, and Open X-Embodiment (OpenX), which concentrates on low-level robotic manipulation trajectories.
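Continuing from the sketch above, the fragment below illustrates how such an adapted model might be prompted for an ALFRED-style high-level plan; the prompt wording, image path, and one-step-per-line output format are likewise illustrative assumptions. A low-level OpenX-style setup would presumably emit action values rather than natural-language steps, which is not shown here.

```python
# Minimal sketch of querying the adapted model for a high-level plan on an
# ALFRED-style task; reuses `torch`, `model`, and `processor` from the sketch above.
# The prompt text, image path, and line-per-step output format are assumptions.
from PIL import Image

image = Image.open("frame.png")  # placeholder egocentric observation
messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text",
         "text": "Task: put a chilled apple on the table. "
                 "List the remaining high-level steps, one per line."},
    ]},
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=128)

# Decode only the newly generated tokens and split them into individual steps.
new_tokens = output[0][inputs["input_ids"].shape[-1]:]
text = processor.decode(new_tokens, skip_special_tokens=True)
plan = [line.strip() for line in text.splitlines() if line.strip()]
print(plan)
```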