
Real-World Fine-Tuning of Diffusion Policies for Autonomous Exploration Using Reinforcement Learning and Human Demonstrations

Alessandro De Marco


Supervisors: Raffaello Camoriano, Luca Benini, Michele Magno. Politecnico di Torino, Master's degree programme in Ingegneria Informatica (Computer Engineering), 2025

Abstract:

Autonomous exploration is a fundamental challenge in robotics, with broad implications for operations in remote or hazardous environments. Diffusion policies, generative models that predict robot actions, have emerged as powerful tools for navigation. However, these models are typically trained with imitation learning (IL) and often fail to generalize beyond their demonstrations. Furthermore, fine-tuning diffusion policies with reinforcement learning (RL) is challenging, as backpropagating through the denoising chain is non-trivial and sample collection in the real world is costly. This thesis addresses these challenges by adapting Q-weighted Variational Policy Optimization (QVPO) to fine-tune Navigation with Goal Masked Diffusion (NoMaD), a state-of-the-art diffusion-based navigation model that unifies goal-conditioned navigation and task-agnostic exploration through goal masking, predicting multimodal action sequences directly from past RGB frames. Fine-tuning is guided by an external critic that evaluates the sampled trajectories and reweights the diffusion loss according to their Q-values, enabling RL-based fine-tuning without backpropagating through the denoising process. Since accurate simulation was unavailable and sim-to-real transfer can introduce significant discrepancies, we perform all training directly in the real world. To improve sample efficiency in this setting, we extend NoMaD by integrating human demonstrations (HD) as high-reward examples and by employing multi-step temporal difference (TD) updates and Heuristic Delayed Reward Adjustment (HDRA). We carry out all evaluation experiments under real-world conditions on a Clearpath Jackal Unmanned Ground Vehicle (UGV). The exploration policy is deployed onboard and evaluated at 4 Hz within an asynchronous actor–learner architecture, while policy updates are computed on a separate GPU server. Our evaluation focuses on exploration in cluttered environments under seen and unseen obstacle conditions. In experiments with previously seen obstacles, RL+HD roughly doubles coverage (217 vs. 105 cells) and increases the average time-to-collision (TTC) from 83 to 198 actions relative to baseline NoMaD. In environments with unseen obstacles, RL+HD maintains strong generalization with +407% coverage and 4.9× TTC, while RL alone favors cautious, survival-oriented trajectories. Our results demonstrate that diffusion policies can be effectively fine-tuned with RL in the real world, and that combining RL with human guidance substantially enhances robustness, coverage, and generalization.
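For intuition, the Q-weighted update described in the abstract can be sketched as follows: each action sequence sampled by the policy receives a standard denoising loss, an external critic scores it, and the per-sample losses are reweighted by the resulting Q-values, so no gradient ever traverses the reverse diffusion chain. The sketch below is a minimal illustration assuming a PyTorch setup; noise_pred_net, critic, the simplified noise schedule, and the softmax reweighting are illustrative assumptions, not the thesis implementation (QVPO's exact weighting scheme differs in detail).

import torch
import torch.nn.functional as F

def q_weighted_diffusion_loss(noise_pred_net, critic, obs_emb, actions,
                              n_steps=100, temperature=1.0):
    """QVPO-style sketch: reweight the per-sample denoising loss by Q-values.

    noise_pred_net(noisy_actions, t, obs_emb) -> predicted noise
    critic(obs_emb, actions)                  -> Q-value per trajectory
    actions: (B, horizon, action_dim) action sequences sampled by the policy.
    """
    B = actions.shape[0]
    t = torch.randint(0, n_steps, (B,))
    noise = torch.randn_like(actions)

    # Forward diffusion with a simplified linear schedule (sketch only).
    alpha = 1.0 - t.float().view(B, 1, 1) / n_steps
    noisy_actions = alpha.sqrt() * actions + (1.0 - alpha).sqrt() * noise

    # Standard DDPM objective: predict the injected noise, per sample.
    pred = noise_pred_net(noisy_actions, t, obs_emb)
    per_sample = F.mse_loss(pred, noise, reduction="none").mean(dim=(1, 2))

    # The external critic scores each trajectory; gradients are blocked,
    # so the update never backpropagates through sampling or the critic.
    with torch.no_grad():
        q = critic(obs_emb, actions).reshape(-1)          # (B,)
        weights = torch.softmax(q / temperature, dim=0)   # higher Q -> larger weight

    return (weights * per_sample).sum()

# Shape smoke test with stand-in callables.
if __name__ == "__main__":
    net = lambda x, t, o: x * 0.0                     # dummy noise predictor
    crit = lambda o, a: a.abs().mean(dim=(1, 2))      # dummy critic
    loss = q_weighted_diffusion_loss(net, crit,
                                     obs_emb=torch.zeros(8, 256),
                                     actions=torch.randn(8, 8, 2))
    print(loss.item())

Because the weights are computed under torch.no_grad(), the only gradient path is the standard denoising loss, which is exactly what allows fine-tuning without differentiating through the reverse diffusion chain; high-reward human demonstrations can then be mixed into the same update as additional highly weighted samples.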

Supervisors: Raffaello Camoriano, Luca Benini, Michele Magno
Academic year: 2025/26
Publication type: Electronic
Number of pages: 77
Additional information: Embargoed thesis. Full text not available.
Subjects:
Degree programme: Master's degree in Ingegneria Informatica (Computer Engineering)
Degree class: New system > Master's degree > LM-32 - INGEGNERIA INFORMATICA
Joint supervision institution: ETH Zurich (SWITZERLAND)
Collaborating companies: ETH Zurich
URI: http://webthesis.biblio.polito.it/id/eprint/37677