Adapting LLaMA 3.2 Vision for Unified Robotic Planning and Control
Vittorio Di Giorgio
Rel. Alessio Sacco, Guido Marchetto, Flavio Esposito. Politecnico di Torino, Master of Science program in Computer Engineering, 2025.
PDF (Tesi_di_laurea) - Thesis, 27MB
Licence: Creative Commons Attribution Non-commercial No Derivatives.
Abstract
Vision-Language-Action (VLA) models are emerging as powerful tools for embodied AI, allowing robots to merge visual perception with language understanding to execute complex tasks. Their potential lies in combining perception, reasoning, and control within a single framework, which could greatly enhance robotics pipelines and boost generalization. However, it remains uncertain how effectively today’s open, mid-size vision-language models (VLMs) can be adapted into practical VLAs. Previous works like RT-2 have demonstrated impressive results but rely on large proprietary models and undisclosed training methods, while OpenVLA presents an open-source alternative based on composite architectures. In contrast, this study investigates whether a single, open, mid-size model like LLaMA 3.2 Vision Instruct can be fine-tuned into a functional VLA using only limited computational resources, along with carefully designed prompting and training strategies.
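As a rough illustration of what such an adaptation involves, the sketch below shows one plausible way to attach LoRA adapters to LLaMA 3.2 Vision Instruct using Hugging Face Transformers and PEFT, which is a common route for fine-tuning under limited compute. The checkpoint name, adapter hyperparameters, and target modules are assumptions for illustration and are not taken from the thesis.

```python
# Minimal sketch of parameter-efficient (LoRA) fine-tuning for LLaMA 3.2 Vision
# Instruct. The checkpoint name, adapter hyperparameters, and target modules are
# illustrative assumptions, not the configuration actually used in the thesis.
import torch
from transformers import MllamaForConditionalGeneration, AutoProcessor
from peft import LoraConfig, get_peft_model

MODEL_ID = "meta-llama/Llama-3.2-11B-Vision-Instruct"  # assumed checkpoint

processor = AutoProcessor.from_pretrained(MODEL_ID)
model = MllamaForConditionalGeneration.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,  # half-precision weights to fit limited GPU memory
    device_map="auto",
)

# Attach low-rank adapters to the attention projections so that only a small
# fraction of the parameters is updated during fine-tuning.
lora_cfg = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()  # typically well under 1% of all weights
```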
The approach is evaluated on two complementary benchmarks: ALFRED, which emphasizes high-level household reasoning and long-horizon planning, and Open X-Embodiment (OpenX), which concentrates on low-level robotic manipulation trajectories.
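Continuing from the sketch above, the fragment below illustrates how such an adapted model might be prompted for an ALFRED-style high-level plan; the prompt wording, image path, and one-step-per-line output format are likewise illustrative assumptions. A low-level OpenX-style setup would presumably emit action values rather than natural-language steps, which is not shown here.

```python
# Minimal sketch of querying the adapted model for a high-level plan on an
# ALFRED-style task; reuses `torch`, `model`, and `processor` from the sketch above.
# The prompt text, image path, and line-per-step output format are assumptions.
from PIL import Image

image = Image.open("frame.png")  # placeholder egocentric observation
messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text",
         "text": "Task: put a chilled apple on the table. "
                 "List the remaining high-level steps, one per line."},
    ]},
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=128)

# Decode only the newly generated tokens and split them into individual steps.
new_tokens = output[0][inputs["input_ids"].shape[-1]:]
text = processor.decode(new_tokens, skip_special_tokens=True)
plan = [line.strip() for line in text.splitlines() if line.strip()]
print(plan)
```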