
Adaptivity of Markovian and History-Based Reinforcement Learning Policies in Environments with Latent Dynamic Parameters

Francesco Giacometti

Supervised by Giuseppe Bruno Averta, Gabriele Tiboni. Politecnico di Torino, Master's degree programme in Data Science And Engineering, 2025

Abstract:

Simulated environments with latent dynamic parameters are used in robotics to train controllers with Reinforcement Learning by sampling the parameters at the start of each episode, a technique known as Domain Randomization. Controllers trained this way have been shown to transfer from simulation to the real world more reliably than controllers trained in non-randomized environments. In the Domain Randomization literature, adaptivity to the environment is routinely associated with history-based policies, whereas Markovian policies are assumed to be non-adaptive and are often described as robust. Adaptivity in this context refers to the ability to infer the value of the dynamic parameters and deploy an optimal strategy for the inferred value.

While it is true that in environments with latent parameters the optimal policy is not guaranteed to be Markovian, because the environment is not Markovian relative to the observed state, we challenge the notion that Markovian policies cannot show adaptive behavior. We hypothesize that policies in the Markovian class can exploit correlations between the state visitation distribution and the current value of the dynamic parameters, resulting in adaptive strategies.

To test our hypothesis we compare the performance, measured as reward, of Markovian and history-based policies in a range of randomized environments: a low-dimensional 2D navigation toy environment that we design in several variations, and popular high-dimensional environments from the gym library. For each policy class we also train a predictor of the dynamic parameters that receives the same inputs as the policy. We find that Markovian and history-based policies achieve similar results across the board, and that Markovian policies are indeed able to predict the dynamic parameters with some degree of accuracy when certain conditions are met. In particular, we study how the variance of the initial state distribution affects both the ability of Markovian policies to identify the latent dynamic parameters and their performance. Finally, we find that a much more significant factor in the performance of these policies in the high-dimensional environments is whether the training algorithm provides privileged information to the critic network.
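
As a concrete illustration of the Domain Randomization setup described in the abstract, the sketch below shows a gym wrapper that resamples latent dynamic parameters at the start of each episode. The parameter names, ranges, and the way the parameters are written back into the simulator are hypothetical placeholders, not details taken from the thesis.

```python
# Minimal sketch of per-episode Domain Randomization as a gym wrapper.
# Parameter names, ranges, and the write-back mechanism are illustrative.
import numpy as np
import gym


class DomainRandomizationWrapper(gym.Wrapper):
    """Resample latent dynamic parameters on every episode reset."""

    def __init__(self, env, param_ranges):
        super().__init__(env)
        # param_ranges maps a parameter name to a (low, high) interval,
        # e.g. {"mass": (0.5, 2.0), "friction": (0.1, 1.0)}.
        self.param_ranges = param_ranges
        self.current_params = {}

    def reset(self, **kwargs):
        # Sample a fresh value for each latent parameter; the policy
        # never observes these values directly.
        self.current_params = {
            name: np.random.uniform(low, high)
            for name, (low, high) in self.param_ranges.items()
        }
        self._apply_params(self.current_params)
        return self.env.reset(**kwargs)

    def _apply_params(self, params):
        # How parameters reach the simulator is environment-specific;
        # attributes on the unwrapped env serve as a stand-in here.
        for name, value in params.items():
            setattr(self.env.unwrapped, name, value)
```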
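
The abstract contrasts Markovian policies, which condition only on the current state, with history-based policies, which condition on the past trajectory, and mentions a dynamic-parameter predictor trained with the same inputs as each policy class. The following sketch, assuming PyTorch and illustrative layer sizes, shows one plausible shape for the two predictor classes; it is not the thesis implementation.

```python
# Two predictor heads for the latent dynamic parameters: one Markovian
# (current state only), one history-based (LSTM over state-action pairs).
import torch
import torch.nn as nn


class MarkovianPredictor(nn.Module):
    """Predict latent dynamic parameters from the current state alone."""

    def __init__(self, state_dim, param_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, param_dim),
        )

    def forward(self, state):            # state: (batch, state_dim)
        return self.net(state)


class HistoryPredictor(nn.Module):
    """Predict latent dynamic parameters from a state-action history."""

    def __init__(self, state_dim, action_dim, param_dim, hidden=64):
        super().__init__()
        self.rnn = nn.LSTM(state_dim + action_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, param_dim)

    def forward(self, states, actions):  # (batch, T, state/action_dim)
        x = torch.cat([states, actions], dim=-1)
        _, (h, _) = self.rnn(x)          # final hidden state summarizes the history
        return self.head(h[-1])
```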
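
The abstract's closing observation concerns training algorithms that provide privileged information to the critic network. A minimal sketch of that asymmetric idea, again assuming PyTorch and hypothetical dimensions: the actor sees only the observation, while the critic additionally receives the true latent dynamic parameters, which are available in simulation at training time.

```python
# Asymmetric actor-critic sketch: only the critic sees the latent
# parameters. Names and layer sizes are illustrative assumptions.
import torch
import torch.nn as nn


class Actor(nn.Module):
    """Policy network: observation in, action out; no latent parameters."""

    def __init__(self, obs_dim, action_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, action_dim),
        )

    def forward(self, obs):
        return self.net(obs)


class PrivilegedCritic(nn.Module):
    """Value function conditioned on observation AND latent parameters."""

    def __init__(self, obs_dim, param_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + param_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs, params):
        # params come from the simulator during training only; the
        # deployed policy never queries the critic.
        return self.net(torch.cat([obs, params], dim=-1))
```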

Supervisors: Giuseppe Bruno Averta, Gabriele Tiboni
Academic year: 2024/25
Publication type: Electronic
Number of pages: 49
Additional information: Embargoed thesis. Full text not available
Subjects:
Degree programme: Master's degree programme in Data Science And Engineering
Degree class: New system > Master's degree > LM-32 - COMPUTER ENGINEERING
Collaborating companies: NOT SPECIFIED
URI: http://webthesis.biblio.polito.it/id/eprint/35267