Inter-agent transfer learning

I recently followed the latest AAMAS conference online and found the work of Felipe Leno Da Silva and his colleagues particularly interesting. The paper, titled “Agents teaching agents: a survey on inter-agent transfer learning”, proposes a framework for inter-agent transfer learning.

The work is motivated by the fact that, although Reinforcement Learning (RL) has been successfully deployed in many complex scenarios such as autonomous driving and video games, it still has noticeable limitations in real-world scenarios.


Reinforcement learning agents need a large number of training samples (high sample complexity) to successfully learn a target function. Even relatively simple tasks such as Pommerman – reported as an example by the authors – still require a large amount of interaction with the environment, which can be excessively expensive in real-world applications.

This leads to the conclusion that RL needs ways to accelerate the agents' learning process. One way of doing that is to leverage the experience of another, more competent agent.

The proposed framework describes transfer learning methods depending on the source of knowledge.

Single-agent transfer consists in reusing knowledge from previously solved tasks. Some examples of groups of algorithms that do that are:

  • value function – a way to quantify the value of being in a state. Denoted by V(s), the value function measures the potential future rewards we may get from being in state s; transfer methods reuse a value function learned on a previous task.
  • policy reuse – a transfer learning approach that improves a reinforcement learner with guidance from previously learned, similar policies.
  • multi-task learning – the ability to transfer knowledge across tasks and generalize to novel ones.
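To make the first bullet more concrete, here is a minimal sketch of value-function reuse in a tabular setting. It is not taken from the survey: the function name `warm_start_q`, the hand-written `state_map` and the toy states are all hypothetical, and the sketch assumes that both tasks share the same action names.

```python
def warm_start_q(source_q, state_map, actions, default=0.0):
    """Initialise a target-task Q-table from a source-task Q-table.

    source_q  : dict mapping (source_state, action) -> value
    state_map : dict mapping each target state to the source state it
                resembles (a hypothetical, hand-written mapping)
    actions   : actions shared by both tasks (an assumption of this sketch)
    """
    target_q = {}
    for target_state, source_state in state_map.items():
        for action in actions:
            # Copy the source value when it exists, otherwise fall back
            # to the default initial value.
            target_q[(target_state, action)] = source_q.get(
                (source_state, action), default
            )
    return target_q


# Toy usage: one target state resembles a known source state, one does not.
source_q = {("near_goal", "right"): 0.9, ("near_goal", "left"): 0.1}
state_map = {"t0": "near_goal", "t1": "unknown"}
q = warm_start_q(source_q, state_map, ["left", "right"])
```

The learner would then continue training from `q` instead of from a table of zeros, which is the whole point of the transfer.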

Agents teaching agents consists in reusing the experience of another agent. Examples of groups of algorithms that do that include human feedback, action advising and learning from demonstrations.


The paper focuses on the second setting. The problem statement involves two agents: a reinforcement learning agent that we define as the learner, and a teacher agent. The learner is, by definition, learning. The teacher may or may not be learning, and it does not need to be a reinforcement learning agent; it might also be a human. The instructions are any information that the teacher communicates to accelerate the learning process. The instructions (A) are specialized to the task, (B) can be assimilated by the learner, (C) are made available during training, and (D) are devised without detailed knowledge of the learner’s internal representation. Examples of instructions that have been considered in the literature are demonstrations, action advice and scalar feedback.



In the literature we can find two main approaches:

  • Teacher-driven: the teacher observes the learner's behavior and initiates the instruction when it is most needed – the teacher is responsible for initiating the inter-agent learning process.
  • Learner-driven: the learner is proactive and asks for instructions when desired – the learner is responsible for initiating the inter-agent learning process.

In the learner-driven approach, the learner generates a behavior to explore the environment. When the learner wants to receive instructions, it defines a query and explicitly sends a message to the teacher asking for instruction. The teacher evaluates the utility of giving instruction at that moment and defines the instruction, which is then explicitly communicated to the learner. Finally, the learner updates its knowledge of the environment by incorporating the instructions received from the teacher.
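The message flow described above (query, utility evaluation, instruction, knowledge update) can be sketched in a few lines. This is only an illustration under my own assumptions: the class names `Query`, `Instruction`, `Teacher` and `Learner`, the `min_uncertainty` threshold and the dictionary-based teacher policy are all hypothetical and do not come from the survey.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Query:
    state: str
    uncertainty: float  # how unsure the learner is in this state (0..1)


@dataclass
class Instruction:
    state: str
    action: str  # action advice; other instruction types are possible


class Teacher:
    def __init__(self, policy, min_uncertainty=0.5):
        self.policy = policy                  # dict: state -> advised action
        self.min_uncertainty = min_uncertainty

    def answer(self, query: Query) -> Optional[Instruction]:
        # Utility evaluation: stay silent if the learner already seems
        # confident, or if the teacher has nothing to say about this state.
        if query.uncertainty < self.min_uncertainty:
            return None
        if query.state not in self.policy:
            return None
        return Instruction(query.state, self.policy[query.state])


class Learner:
    def __init__(self):
        self.advised = {}  # state -> action received from the teacher

    def ask(self, teacher: Teacher, state: str, uncertainty: float):
        # Query definition and explicit communication with the teacher.
        instruction = teacher.answer(Query(state, uncertainty))
        if instruction is not None:
            # Knowledge update: here we simply memorise the advice.
            self.advised[instruction.state] = instruction.action
        return instruction


teacher = Teacher({"s0": "right"})
learner = Learner()
first = learner.ask(teacher, "s0", uncertainty=0.9)   # uncertain: gets advice
second = learner.ask(teacher, "s0", uncertainty=0.1)  # confident: no advice
```

In a real system the knowledge update would feed into the learner's value function or policy rather than a lookup table; the table just keeps the flow visible.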


In the teacher-driven approach, the process is similar but the teacher agent decides when to give the instructions. In this approach, the teacher has to observe the behavior of the learner to establish when to communicate the instruction.
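As a rough sketch of how a teacher might decide when to step in, the snippet below uses a state-importance heuristic: the gap between the teacher's best and worst action values in the current state. The heuristic, the advice budget, the threshold and the function name `teacher_driven_advice` are assumptions made for illustration, not something this post or the survey prescribes.

```python
def teacher_driven_advice(teacher_q, state, learner_action, budget, threshold):
    """Decide whether the teacher should advise in `state`.

    teacher_q      : dict state -> {action: value} (the teacher's own estimates)
    learner_action : the action the teacher observed the learner choosing
    budget         : how many pieces of advice the teacher may still give
    threshold      : minimum importance required before advising

    Returns (advised_action_or_None, remaining_budget).
    """
    values = teacher_q[state]
    best_action = max(values, key=values.get)
    # Importance: how much the choice of action matters in this state.
    importance = max(values.values()) - min(values.values())
    if budget > 0 and importance >= threshold and learner_action != best_action:
        return best_action, budget - 1
    return None, budget


teacher_q = {
    "cliff": {"left": -10.0, "right": 1.0},  # a mistake here is costly
    "plain": {"left": 0.5, "right": 0.6},    # any action is roughly fine
}
advice_cliff, budget = teacher_driven_advice(teacher_q, "cliff", "left", 2, 5.0)
advice_plain, budget = teacher_driven_advice(teacher_q, "plain", "left", budget, 5.0)
```

The teacher spends its budget on the "cliff" state, where the learner's observed action is clearly wrong, and stays quiet in the "plain" state, where the difference between actions is negligible.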


In each of those approaches, the learner generates a behavior to explore the environment.

One way of doing so is to generate a random behavior, but it is also possible to reuse knowledge from previously solved tasks through single-agent transfer.

After generating the behavior, in the learner-driven approach, the learner has to decide when, to whom and how to ask for instruction. The first challenge is defining the query timing, which is crucial because (A) communication is a scarce resource, and (B) badly timed instructions can overcomplicate the learning process instead of accelerating it. Solutions to this problem include predefining the timing based on the agent's confidence. Once the query timing is defined, the learner has to identify the teacher. Some of the literature assumes that the teacher is always available, which is not always the case; adaptive teacher selection is another problem that needs further exploration.
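One simple way to picture confidence-based query timing is to treat rarely visited states as low-confidence states and to cap the number of queries with a budget. The sketch below does exactly that; the visit-count proxy for confidence, the class name `QueryTimer` and its parameters are my own assumptions for illustration.

```python
from collections import Counter


class QueryTimer:
    """Decide when the learner should ask, under a communication budget."""

    def __init__(self, budget, min_visits):
        self.budget = budget          # communication is a scarce resource
        self.min_visits = min_visits  # visits needed before feeling confident
        self.visits = Counter()

    def should_ask(self, state):
        self.visits[state] += 1
        # Low confidence: the state has been seen at most `min_visits` times.
        low_confidence = self.visits[state] <= self.min_visits
        if low_confidence and self.budget > 0:
            self.budget -= 1
            return True
        return False


timer = QueryTimer(budget=2, min_visits=1)
decisions = [timer.should_ask(s) for s in ["a", "a", "b", "c"]]
```

With a budget of two, the learner asks on its first visit to "a" and "b", skips the repeated visit to "a", and cannot ask in "c" because the budget is exhausted.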

The last challenge related to query definition consists in constructing the query. Previous work often assumes that the query protocol is already defined; however, it is hard to generalize a query protocol to all possible situations. In addition, once the protocol is defined, the agent has to decide what to include in the message. This is, of course, domain dependent: the agent might include its observations or its level of uncertainty, among other information.

The next step consists in the utility evaluation. In this step, the teacher decides whether or not the requested instruction is needed. The first challenge related to this task is observing the learner's behavior. This is linked to the instruction timing, where the teacher has to define the appropriate time for providing instruction.

After the timing is defined, the teacher also has to decide how to communicate the instructions. In the literature we can find many examples of instructions, such as action advice, rules, demonstrations, preferences and scalar feedback.

An important aspect to consider is how to interface and translate those instructions to overcome the mismatch between the way the agent processes the information and the action.

The last module consists of the knowledge update. The first challenge related to that is receiving the instruction. Most of the literature assumes that the teacher will always answer and that all the messages sent between the learner and the teacher will always be received; however, this might not be the case in all applications. Because of this, dealing with unreliable communication channels is still an open problem. Another challenge is instruction reliability: not all instructions are reliable and teachers might not be benevolent in all situations, which means that we still need some way of evaluating an instruction in order to decide whether it should be trusted. The last challenge is how to merge the instruction with the knowledge of the agent, and this obviously depends on the type of instruction that has been given. For example, the learner might bias its exploration according to the received instruction, or learn different models from scalar feedback.
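To illustrate the reliability challenge, here is a minimal sketch in which the learner keeps a running trust score for a teacher and only follows advice while that score stays above a threshold. The exponential-moving-average update, the class name `TrustTracker` and the default values are hypothetical choices of mine, not a method described in the survey; how to judge whether a piece of advice "helped" is itself domain dependent and left out here.

```python
class TrustTracker:
    """Running estimate of how much a teacher's advice can be trusted."""

    def __init__(self, initial=0.5, rate=0.2):
        self.trust = initial  # value in [0, 1]
        self.rate = rate      # how fast new evidence changes the estimate

    def update(self, advice_helped: bool):
        # Move the trust score towards 1 when the advice helped and
        # towards 0 when it did not (exponential moving average).
        target = 1.0 if advice_helped else 0.0
        self.trust += self.rate * (target - self.trust)

    def should_follow(self, threshold=0.5):
        return self.trust >= threshold


tracker = TrustTracker()
tracker.update(advice_helped=False)
follow_after_bad_advice = tracker.should_follow()
tracker.update(advice_helped=True)
tracker.update(advice_helped=True)
follow_after_good_advice = tracker.should_follow()
```

A single piece of bad advice pushes the score below the default threshold, and it takes a couple of helpful instructions to win the learner back, which is one way to stay cautious with teachers that might not be benevolent.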

The domains in which inter-agent learning might be useful are the ones in which exploring the environment is very expensive or dangerous, or where the task has a huge state space and sparse rewards. One example is multi-agent problems in which the agent configuration might change over time and good baseline behaviors are already available for testing other algorithms.



Inter-agent learning has played an important role in augmenting RL and making it scalable to complex domains. The survey describes two main approaches in the literature, the learner-driven and the teacher-driven, and highlights open problems and promising applications. Check it out!

