Bayesian Inverse Reinforcement Learning

In the last couple of months I have been studying Inverse Reinforcement Learning approaches, in particular the work of Ramachandran & Amir from 2007. Here is a short introduction to the topic and some notes about this inspiring work.

Reinforcement Learning is a computational approach to learning from interaction, grounded in the idea that we learn by observing how our environment responds to what we do. Reinforcement Learning problems involve learning what to do by maximizing a numerical reward signal (Sutton & Barto, 2018).

In situations where we are not explicitly given a reward function (i.e., apprenticeship), and in which knowledge of the reward is a goal in itself (i.e., preference elicitation), we turn to Inverse Reinforcement Learning approaches.

Inverse Reinforcement Learning is the problem of learning the reward function underlying a Markov Decision Process given the dynamics of the system and the behavior of an expert.

The work of Ramachandran & Amir models the IRL problem from a Bayesian perspective. Their model can be summarised as follows:

consider the actions of the expert as evidence to update a prior on reward functions;
solve reward learning and apprenticeship learning using this posterior;
perform inference for these tasks using a modified Markov Chain Monte Carlo (MCMC) algorithm

With this model, the authors show that the Markov Chain for the obtained distribution with a uniform prior mixes rapidly, and that the algorithm converges to the correct answer in polynomial time. Furthermore, they show that the original IRL is a special case of Bayesian IRL (BIRL) with a Laplacian prior. The advantages of this approach are:

it is not necessary to have a completely specified optimal policy as input to the IRL agent;
there is no need to assume that the expert is infallible;
it is possible to incorporate external information about specific IRL problems into the prior of the model, or use evidence from multiple experts.

Preliminaries:

Markov Decision Process (finite and infinite)
Stationary policy
Bellman Equation
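
As a quick reminder, the Bellman equations for a stationary policy π can be written as below. The notation is my own assumption: a finite MDP with transition probabilities T(s, a, s'), discount factor γ and a state-based reward R(s), which is the convention I believe the paper follows.

```latex
V^{\pi}(s) = R(s) + \gamma \sum_{s'} T(s, \pi(s), s') \, V^{\pi}(s')

Q^{\pi}(s, a) = R(s) + \gamma \sum_{s'} T(s, a, s') \, V^{\pi}(s')

\pi \text{ is optimal} \iff \pi(s) \in \arg\max_{a} Q^{\pi}(s, a) \quad \text{for all } s
```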

The intuition behind Bayesian Inverse Reinforcement Learning is that the behavior of an expert agent does not pin down a single reward function, so what we infer is a probability distribution over reward functions.

In Bayesian inference, evidence is used to update the probability that a hypothesis is true. Given a hypothesis H and evidence E, it is possible to compute the posterior probability (i.e. the revised probability of the hypothesis after taking the new information into consideration) from the prior probability (i.e. the probability of the hypothesis before the new data is collected).

Bayes’ Theorem – The posterior probability of hypothesis A given evidence B is equal to the likelihood of the evidence B if the hypothesis A is true, multiplied by the *a priori* probability that hypothesis A is true, and divided by the *a priori* probability of observing evidence B.
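
Written as a formula, with A as the hypothesis and B as the evidence, the statement above reads:

```latex
P(A \mid B) = \frac{P(B \mid A) \, P(A)}{P(B)}
```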

In Bayesian Inverse Reinforcement Learning (BIRL):

the hypothesis is the reward function that explains the agent’s behavior;
the evidence is the observations of the expert’s behavior;
the evidence is used to infer the probability that a hypothesis may be true (i.e., the posterior distribution of the rewards, from a prior distribution)
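
In symbols, and as far as I read the paper, the evidence is a set of state-action pairs O = {(s_1, a_1), ..., (s_k, a_k)} and the likelihood is exponential in the optimal Q-values of the demonstrated actions. The symbol α (a parameter expressing how confident we are in the expert) and the normalizing constants Z and Z' follow my recollection of the paper's notation, so treat the exact form as a sketch:

```latex
P(O \mid R) = \frac{1}{Z} \exp\Big( \alpha \sum_{i} Q^{*}(s_i, a_i, R) \Big)

P(R \mid O) = \frac{P(O \mid R) \, P(R)}{P(O)}
            = \frac{1}{Z'} \exp\Big( \alpha \sum_{i} Q^{*}(s_i, a_i, R) \Big) \, P(R)
```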

There are two main categories of BIRL: reward learning and apprenticeship learning. The objective of reward learning is to learn a reward function. The objective of apprenticeship learning is to learn a policy (i.e., how to act).

In reward learning the loss L(R, R̂) measures the difference between the actual reward R and the estimated reward R̂.

if R is drawn from the posterior distribution, then the expected squared-error loss L(R, R̂) is minimized by setting R̂ to the mean of the posterior;
alternatively, use a maximum a posteriori (MAP) estimator as the estimator, instead of maximum likelihood.
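
A small numerical check of the claim that the posterior mean minimizes the expected squared-error loss. The samples below are hypothetical: they stand in for posterior samples of a single reward value and are bimodal on purpose, so that the mean and the highest mode clearly differ.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical posterior samples of one reward value (two modes).
samples = np.concatenate([
    rng.normal(-1.0, 0.3, 500),
    rng.normal(2.0, 0.5, 1500),
])

# Candidate estimates on a grid with step 0.01.
candidates = np.linspace(-3.0, 4.0, 701)

# Expected squared-error loss of every candidate under the samples.
sq_loss = ((samples[None, :] - candidates[:, None]) ** 2).mean(axis=1)

best = candidates[sq_loss.argmin()]
print("grid minimizer:", best)
print("sample mean:   ", samples.mean())
```

The grid minimizer should agree with the sample mean up to the grid resolution, even though neither mode of the distribution sits there.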

In apprenticeship learning the policy loss function corresponds to the difference between the vector of optimal values for each state and the vector of values obtained in each state under the learned policy.
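
A minimal sketch of such a policy loss, assuming a finite MDP given as a transition array `T` of shape (actions, states, states), a state-based reward vector `R` and a discount `gamma`. The function names and the two-state toy MDP are made up for illustration and are not taken from the paper.

```python
import numpy as np

def optimal_values(T, R, gamma, iters=1000):
    """V*(R) by value iteration. T has shape (A, S, S), R has shape (S,)."""
    V = np.zeros(len(R))
    for _ in range(iters):
        V = R + gamma * (T @ V).max(axis=0)
    return V

def policy_values(T, R, gamma, policy):
    """V^pi(R), from the linear Bellman system of a fixed policy."""
    n = len(R)
    T_pi = T[policy, np.arange(n)]  # row s is T[policy[s], s, :]
    return np.linalg.solve(np.eye(n) - gamma * T_pi, R)

def policy_loss(T, R, gamma, policy, p=1):
    """p-norm of the gap between optimal values and the policy's values."""
    diff = optimal_values(T, R, gamma) - policy_values(T, R, gamma, policy)
    return np.linalg.norm(diff, ord=p)

# Toy MDP (hypothetical): action 0 stays, action 1 switches state.
T = np.array([[[1.0, 0.0], [0.0, 1.0]],
              [[0.0, 1.0], [1.0, 0.0]]])
R = np.array([0.0, 1.0])  # only state 1 is rewarding

loss_stay = policy_loss(T, R, 0.9, np.array([0, 0]))  # never leaves state 0
loss_opt = policy_loss(T, R, 0.9, np.array([1, 0]))   # moves to state 1, stays
print(loss_stay, loss_opt)
```

In this toy case the always-stay policy loses about 9 units of value in state 0, while the policy that moves to the rewarding state and stays there has a loss of about 0.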

Both reward learning and apprenticeship learning require computing the mean of the posterior distribution. To simplify the computation, the authors generate samples from the posterior and then return the sample mean as an estimate of the true mean of the distribution. They use the PolicyWalk algorithm as the sampling procedure.
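
To give a feeling for the sampling step, here is a simplified sketch and not the authors' PolicyWalk. It is a plain Metropolis random walk over a reward grid of resolution `delta` with a uniform prior on [-r_max, r_max], and it re-solves the MDP at every step, whereas PolicyWalk (as I understand it) keeps track of the current optimal policy and only re-solves when that policy stops being optimal. I also assume a per-state softmax likelihood over actions, a commonly used variant, instead of the paper's exact form. All names and the toy MDP are hypothetical.

```python
import numpy as np

def optimal_q(T, R, gamma, iters=100):
    """Approximate Q*(a, s) by value iteration. T: (A, S, S), R: (S,)."""
    V = np.zeros(len(R))
    for _ in range(iters):
        Q = R[None, :] + gamma * (T @ V)  # shape (A, S)
        V = Q.max(axis=0)
    return Q

def log_likelihood(T, R, gamma, demos, alpha):
    """Sum over demonstrated (s, a) of log softmax_a(alpha * Q*(s, a))."""
    Q = alpha * optimal_q(T, R, gamma)
    log_norm = np.logaddexp.reduce(Q, axis=0)  # log sum_b exp(alpha * Q*(s, b))
    return sum(Q[a, s] - log_norm[s] for s, a in demos)

def grid_walk(T, gamma, demos, alpha=5.0, delta=0.1, r_max=1.0,
              n_steps=2000, seed=0):
    """Metropolis walk on the reward grid; returns one sample per step."""
    rng = np.random.default_rng(seed)
    n_states = T.shape[1]
    k_max = int(round(r_max / delta))
    k = np.zeros(n_states, dtype=int)  # grid coordinates, R = delta * k
    logp = log_likelihood(T, delta * k, gamma, demos, alpha)
    samples = np.empty((n_steps, n_states))
    for t in range(n_steps):
        # Propose a neighbour: move one coordinate by +/- delta.
        k_new = k.copy()
        k_new[rng.integers(n_states)] += rng.choice([-1, 1])
        if np.abs(k_new).max() <= k_max:  # uniform prior: reject outside the box
            logp_new = log_likelihood(T, delta * k_new, gamma, demos, alpha)
            if np.log(rng.random()) < logp_new - logp:
                k, logp = k_new, logp_new
        samples[t] = delta * k
    return samples

# Toy MDP (hypothetical): action 0 stays, action 1 switches state.
T = np.array([[[1.0, 0.0], [0.0, 1.0]],
              [[0.0, 1.0], [1.0, 0.0]]])
demos = [(0, 1), (1, 0)]  # expert leaves state 0 and stays in state 1

samples = grid_walk(T, gamma=0.9, demos=demos)
mean_R = samples[500:].mean(axis=0)  # drop burn-in, then take the sample mean
print(mean_R)
```

Since the expert leaves state 0 and stays in state 1, the sample mean should assign a higher reward to state 1 than to state 0. The burn-in length of 500 is an arbitrary choice for this toy problem.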

Bayesian Inverse Reinforcement Learning

Papers

Tools

Additional Material

Published by Silvia Tulli
