As a step towards efficient transfer learning between two opposing agents, my colleagues and I ran into one of the most discussed problems in multi-agent settings: reasoning about the information that comes from opponent agents.
To clarify some aspects of this problem, I decided to write this short post about a terrific contribution by Jakob N. Foerster and his colleagues. Their algorithm, known by the acronym LOLA (Learning with Opponent-Learning Awareness), addresses the problem by giving agents the ability to shape the anticipated learning of their opponents in the environment.
A LOLA agent models how the opponent’s policy parameters depend on its own policy, and how the opponent’s update in turn impacts its own future expected reward. At the same time, a LOLA agent updates its own policy so as to make the opponent’s learning step more beneficial to its own goals [Learning to Model Other Minds].
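To make this concrete, here is the first-order update rule as I recall it from the paper; treat the exact constants and transposes as a sketch rather than a verbatim quote. Write $V^1(\theta^1, \theta^2)$ and $V^2(\theta^1, \theta^2)$ for the two agents' expected total discounted returns as functions of both agents' policy parameters, and let $\delta$ and $\eta$ be agent 1's learning rate and the learning rate it assumes for the opponent's naive step. A naive learner simply ascends its own gradient,

$$
\theta^1_{i+1} = \theta^1_i + \delta \, \nabla_{\theta^1} V^1(\theta^1_i, \theta^2_i),
$$

while a first-order (exact-gradient) LOLA learner adds a second-order correction that differentiates through the opponent's anticipated naive update:

$$
\theta^1_{i+1} = \theta^1_i + \delta \, \nabla_{\theta^1} V^1(\theta^1_i, \theta^2_i)
+ \delta \eta \, \big( \nabla_{\theta^2} V^1(\theta^1_i, \theta^2_i) \big)^{\top} \nabla_{\theta^1} \nabla_{\theta^2} V^2(\theta^1_i, \theta^2_i).
$$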
The experiments compared the learning behavior of Naive and LOLA agents in two classic infinitely iterated games and in the Coin Game (a more difficult two-player environment).
- The Naive learner iteratively optimizes its own expected total discounted return: at each step it updates its policy parameters towards the argmax of that return while treating the opponent’s current parameters as fixed.
- The LOLA learner instead optimizes the expected return it will obtain after the opponent has updated its policy with one naive learning step. By differentiating this anticipated update with respect to its own parameters, the LOLA learner aims to actively influence the opponent’s future policy update (a minimal code sketch follows this list).
- The Exact LOLA and Naive learners (LOLA-Ex, NL-Ex) have access to the gradients and Hessians of each agent’s expected total discounted return as a function of both agents’ policy parameters.
- The Policy Gradient LOLA and Naive learners (LOLA-PG, NL-PG) estimate the gradients and Hessians based on policy gradients.
- To account for the fact that in adversarial settings the opponent’s parameters are typically hidden and have to be inferred from its state-action trajectories, the authors compare the performance of policy-gradient-based LOLA agents with and without opponent modelling. The model of the opponent’s behavior is estimated from the opponent’s observed trajectories using maximum likelihood (a sketch of this fit also follows the list).
- The higher-order LOLA learner assumes that the opponent is itself a LOLA agent; this leads to third-order derivatives in the learning rule.
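As promised in the list above, here is a minimal PyTorch sketch of one exact-gradient update for a Naive learner versus a LOLA learner, mirroring the formulas given earlier. This is illustrative rather than the authors' code: `V1`, `V2`, `lr`, and `opp_lr` are my own names, with `V1` and `V2` standing for differentiable functions that return each agent's expected discounted return from both parameter vectors (for the iterated prisoners' dilemma the paper computes these in closed form), and `lr` / `opp_lr` playing the roles of $\delta$ and $\eta$.

```python
import torch

def naive_step(theta1, theta2, V1, lr=1.0):
    """One Naive-learner update: gradient ascent on the agent's own return,
    treating the opponent's parameters as fixed."""
    v1 = V1(theta1, theta2)
    grad1_v1 = torch.autograd.grad(v1, theta1)[0]
    return (theta1 + lr * grad1_v1).detach().requires_grad_()

def lola_step(theta1, theta2, V1, V2, lr=1.0, opp_lr=1.0):
    """One first-order LOLA update: the naive term plus a second-order
    correction that differentiates through the opponent's anticipated
    naive learning step."""
    v1 = V1(theta1, theta2)
    v2 = V2(theta1, theta2)
    # Naive term, plus the gradient of our return w.r.t. the opponent's parameters.
    grad1_v1, grad2_v1 = torch.autograd.grad(v1, (theta1, theta2))
    # Opponent's anticipated naive update direction, kept in the autograd graph
    # so that it can itself be differentiated w.r.t. our own parameters below.
    grad2_v2 = torch.autograd.grad(v2, theta2, create_graph=True)[0]
    # Second-order correction: (grad_{theta2} V1)^T * d(grad_{theta2} V2)/d(theta1).
    lola_correction = torch.autograd.grad((grad2_v1 * grad2_v2).sum(), theta1)[0]
    return (theta1 + lr * grad1_v1 + lr * opp_lr * lola_correction).detach().requires_grad_()
```

The `create_graph=True` flag is what makes the second-order term possible: it keeps the opponent's gradient inside the autograd graph so that it can be differentiated with respect to the agent's own parameters. In the paper both agents are updated simultaneously from the current parameter pair.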
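For the opponent-modelling variant, the opponent's parameters are not read off directly but fitted to its observed behavior. Below is a minimal sketch of that maximum-likelihood fit, assuming a hypothetical `opponent_model` network that maps observations to action logits (again, all names here are mine, not the paper's); it is plain behavioural cloning, minimizing the negative log-likelihood of the observed opponent actions.

```python
import torch
import torch.nn.functional as F

def fit_opponent_model(opponent_model, observations, actions, lr=1e-2, epochs=100):
    """Maximum-likelihood fit of the opponent's policy from its observed
    state-action pairs: adjust the model's parameters so that the observed
    actions become most probable under the model."""
    optimizer = torch.optim.Adam(opponent_model.parameters(), lr=lr)
    for _ in range(epochs):
        logits = opponent_model(observations)    # shape: (batch, num_actions)
        loss = F.cross_entropy(logits, actions)  # negative log-likelihood
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return opponent_model
```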
The policy of each player is parameterised with a deep recurrent neural network. In the policy-gradient experiments with LOLA, offline learning is assumed. The research questions investigated six different aspects of the agents’ behavior under the conditions listed above.
The results showed that when both agents have access to exact value functions and apply the LOLA learning rule, cooperation based on tit-for-tat emerges in the infinitely iterated prisoners’ dilemma, while independent naive learners defect. In addition, policy-gradient-based LOLA (LOLA-PG), which is applicable to deep RL settings, performed similarly to the version that uses the exact value functions.
To find other relevant papers on this topic I recommend Connected Papers (which is also super cool); here is a list of references related to the work of Foerster et al. 2018. As for the code, I suggest Alexis Jacq’s PyTorch implementation, LOLA_DiCE.