Learning Rewards from Linguistic Feedback

While discussing how to move forward with our project, my supervisor suggested that I look at the work of Sumers et al. (2020). The paper presents a novel contribution on the use of unconstrained linguistic feedback as a learning signal for autonomous agents. The challenge is for the agent to learn by interpreting naturalistic feedback in order to infer the teacher’s preferences.

The authors formalize this inference as linear regression over features of a Markov decision process (MDP). First, they draw on educational research to map utterances to elements of the MDP. Second, they develop two types of sentiment-based learner: a “literal” learner and a “pragmatic” learner, where the latter adds inductive biases on top of the explicit sentiment. These two learners are compared with a third model, which trains an inference network end-to-end to predict latent rewards from human interactions.

The main idea of the work is to perform Inverse Reinforcement Learning on linguistic input in the form of natural language commands. Unlike prior approaches, the authors (1) use unconstrained and unfiltered natural language (arbitrary language), and (2) seek to learn general latent preferences rather than infer command-contextual rewards.

The authors use a setting where the reward function is hidden from the learner agent but known to a teacher agent who is allowed to send natural-language messages (u).

The online learning task is formulated as Bayesian inference (Bayesian Linear Regression) over possible rewards: conditioning on the teacher’s language and recursively updating a belief state.

The regression weights represent the teacher’s preferences over features of the MDP.
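As a rough illustration of the recursive belief update, here is a minimal Python sketch of one Bayesian linear regression step with a Gaussian prior and known observation noise. The function name `blr_update`, the use of sentiment as the scalar observation, the noise variance of 1.0, the prior variance of 25 and the three-feature example are all assumptions made for illustration, not details confirmed by the paper.

```python
def blr_update(mu, sigma, f, y, noise_var=1.0):
    """One recursive Bayesian linear regression step (hypothetical helper).

    mu:        prior mean over reward weights (list of floats)
    sigma:     prior covariance over reward weights (list of lists, symmetric)
    f:         feature vector the feedback refers to
    y:         scalar observation (assumed here to be the utterance sentiment)
    noise_var: assumed observation noise variance
    """
    n = len(mu)
    # sigma @ f
    sigma_f = [sum(sigma[i][j] * f[j] for j in range(n)) for i in range(n)]
    # Predictive variance of the observation: noise + f^T sigma f
    s = noise_var + sum(f[i] * sigma_f[i] for i in range(n))
    # How far the observation is from what the prior mean predicts
    residual = y - sum(f[i] * mu[i] for i in range(n))
    new_mu = [mu[i] + sigma_f[i] * residual / s for i in range(n)]
    new_sigma = [
        [sigma[i][j] - sigma_f[i] * sigma_f[j] / s for j in range(n)]
        for i in range(n)
    ]
    return new_mu, new_sigma


# Illustrative prior: zero mean, independent features with variance 25.
prior_mu = [0.0, 0.0, 0.0]
prior_sigma = [[25.0 if i == j else 0.0 for j in range(3)] for i in range(3)]

# The learner collected two objects with feature 0 and one with feature 2,
# and the teacher's utterance carried positive sentiment.
post_mu, post_sigma = blr_update(prior_mu, prior_sigma, [2.0, 0.0, 1.0], 1.0)
```

The posterior returned by one call can be passed in as the prior for the next utterance, which is what makes the update recursive. Features that appeared in the trajectory move towards the sign of the feedback and become less uncertain, while the untouched feature keeps its prior.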

Learning unfolds over a series of interactive episodes:

The learner has a belief distribution over the teacher’s reward weights which it uses to identify its policy.
The learner has an opportunity to act in the world given this policy, sampling a trajectory.
The learner receives feedback in the form of a natural language utterance (u) from the teacher (and optionally a reward signal from the environment).
The learner uses the feedback to update its beliefs about the reward, which are then used for the next episode.
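The four steps above can be sketched as a small loop. This is only a toy under stated assumptions: the greedy policy, the simulated teacher that answers with +1 or -1, and the simple additive belief update are stand-ins chosen for brevity, and all function names are hypothetical.

```python
def choose_trajectory(belief_mean, candidates):
    # Steps 1 and 2: derive a (greedy) policy from the current belief and act.
    # Each candidate trajectory is summarised by its feature counts.
    return max(
        candidates,
        key=lambda feats: sum(w * x for w, x in zip(belief_mean, feats)),
    )


def teacher_feedback(true_weights, feats):
    # Step 3: stand-in for the teacher. A real teacher sends free-form text;
    # here the "utterance" is already reduced to a sentiment of +1 or -1.
    reward = sum(w * x for w, x in zip(true_weights, feats))
    return 1.0 if reward > 0 else -1.0


def update_belief(belief_mean, feats, sentiment, step=0.5):
    # Step 4: placeholder update that nudges the belief along the features
    # of the last trajectory (an assumption, not a Bayesian update).
    return [w + step * sentiment * x for w, x in zip(belief_mean, feats)]


def run_episodes(true_weights, candidates, n_episodes=5):
    belief = [0.0] * len(true_weights)
    history = []
    for _ in range(n_episodes):
        feats = choose_trajectory(belief, candidates)
        sentiment = teacher_feedback(true_weights, feats)
        belief = update_belief(belief, feats, sentiment)
        history.append((feats, sentiment))
    return belief, history


# The teacher likes feature 0 and dislikes feature 1; the learner does not know this.
final_belief, history = run_episodes([1.0, -1.0], [[0, 1], [1, 0]])
```

In this toy run the learner first tries the disliked trajectory, receives negative feedback, and switches to the preferred one for the remaining episodes.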

To infer latent rewards from feedback, the learner extracts the sentiment ζ and target features from the teacher’s utterance.

Intuitively, if the teacher says “Good job,” a learner could infer the teacher has positive weights on the features obtained by its prior trajectory. The authors then formalize this mapping from utterances to features.

The authors draw on educational research to extract target features from arbitrary language.

They first identify correspondences between the forms of feedback described in that literature and prior work in RL, and then show that each form targets a distinct element of the MDP (e.g., a prior trajectory). The identified classes of feedback are:

Evaluative Feedback – gives a scalar value in response to the agent’s action
Imperative Feedback – tells the learner what the correct action was
Descriptive Feedback – provides explicit information about how the learner should modify their behavior

fG refers to the grounding function, which maps an utterance to the features it targets, depending on which of the feedback forms above the utterance takes.
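To make the role of the grounding function more concrete, here is a minimal sketch that dispatches on the form of feedback. The function `ground`, the `FEATURE_LEXICON` of colors and shapes, and the representation of targets as feature-count dictionaries are assumptions for illustration; in particular, the form is passed in by hand rather than predicted from the utterance.

```python
import re

# Hypothetical lexicon of features that descriptive feedback may name.
FEATURE_LEXICON = ["red", "blue", "green", "circle", "square", "triangle"]


def ground(utterance, form, trajectory_counts, cluster_counts=None):
    """Map an utterance to its target features, given its feedback form."""
    if form == "evaluative":
        # Evaluative feedback is about what the learner just did.
        return dict(trajectory_counts)
    if form == "imperative":
        # Imperative feedback points at what the learner should have done,
        # represented here by the feature counts of the indicated objects.
        return dict(cluster_counts or {})
    if form == "descriptive":
        # Descriptive feedback names features directly in the utterance.
        tokens = re.findall(r"[a-z]+", utterance.lower())
        # Crude plural handling: "circles" -> "circle".
        tokens = {tok.rstrip("s") for tok in tokens}
        return {feat: 1 for feat in FEATURE_LEXICON if feat in tokens}
    raise ValueError(f"unknown feedback form: {form}")


evaluative_target = ground("Good job", "evaluative", {"red": 2, "circle": 2})
descriptive_target = ground("Blue circles are bad!", "descriptive", {"red": 2})
```

The same sentiment can then be attached to very different targets: “Good job” refers to everything in the last trajectory, while “Blue circles are bad!” refers only to the two named features.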

To build the dataset, the authors conducted a human-human experiment on MTurk. 104 pairs of participants (208 people in total) were asked to play a collaborative game.

One player (the learner) used a robot to collect a variety of colored shapes. Each yielded a reward, which the learner could not see. The second player (the teacher) watched the learner and could see the objects’ rewards.

The results of the experiment show that most episodes contained evaluative (63%) or descriptive (34%) feedback; only 6% used imperative feedback. The infrequency of imperative feedback might depend on the task.

To simulate different types of learners, the authors implemented a “literal” learner, a “pragmatic” learner and an “End-to-End Inference Network”.

The “Literal” Model:

The LM uses a supervised classifier to implement fG (the grounding function) and a small lexicon to extract target features.

VADER – used to extract sentiment
Grounding – the authors labeled utterances from a pilot experiment and trained a logistic regression on TF-IDF features. Evaluative feedback refers to feature counts from the learner’s trajectory, imperative language refers to a cluster of objects, and descriptive language refers to features named in the utterance (e.g., colors or shapes of the objects).
Belief updates – for each utterance, the model performs a Bayesian update to obtain the posterior.

The “Pragmatic” Model:

The PM augments the LM with pragmatic principles, implemented as heuristics.

Bias towards parsimony: the model interprets neutral sentiment as a positive update.
Bias towards information that is relevant to the task at hand: the model treats all features not referenced by the original update as receiving a negative update, gradually decaying the weights of unmentioned features.
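A minimal sketch of how these two heuristics could be layered on top of a literal reading is shown below. The function `pragmatic_observations`, the width of the neutral band, the value substituted for neutral sentiment and the size of the decay are all hypothetical choices for illustration, not values from the paper.

```python
def pragmatic_observations(
    sentiment,
    target,
    all_features,
    neutral_band=0.05,   # assumed: |sentiment| below this counts as neutral
    neutral_value=0.5,   # assumed: positive value substituted for neutral
    decay=0.1,           # assumed: size of the negative nudge for unmentioned features
):
    """Return a signed observation per feature under the two pragmatic biases."""
    # Heuristic 1: neutral sentiment is read as mildly positive.
    effective = neutral_value if abs(sentiment) <= neutral_band else sentiment

    observations = {}
    for feat in all_features:
        count = target.get(feat, 0)
        if count > 0:
            # Referenced features get the (possibly adjusted) sentiment.
            observations[feat] = effective * count
        else:
            # Heuristic 2: unreferenced features receive a small negative update.
            observations[feat] = -decay
    return observations


features = ["red", "blue", "circle"]
neutral_case = pragmatic_observations(0.0, {"red": 2}, features)
negative_case = pragmatic_observations(-0.8, {"red": 2}, features)
```

Applied repeatedly, the second heuristic is what produces the gradual decay: a feature the teacher never mentions drifts downwards a little on every utterance.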

End-to-End Inference Network:

The End-to-End Inference Network is a small network that predicts the teacher’s latent rewards. The model uses human data to learn an end-to-end mapping from (utterance, trajectory) tuples to the teacher’s reward parameters.

Feed-forward architecture: the trajectory is represented by its feature counts. The model concatenates the token embeddings with the feature counts, uses a single fully-connected hidden layer of width 128 with ReLU activations, then uses a linear layer to map down to a 9-dimensional output.
Training: stochastic gradient descent with a learning rate of 0.005 and weight decay of 0.0001, stopping when the validation set error increases. The network, including the embeddings, is trained end-to-end with an L2 loss on the true reward.
Belief updates: univariate Gaussian priors are initialized over each feature with µ0 = 0 and σ0 = 25, then updated (Bayesian update) with the inference network’s output on each interaction.
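The forward pass of such a network can be sketched in plain Python. Only the 128-unit ReLU hidden layer and the 9-dimensional output come from the description above; the vocabulary size, the embedding dimension, the 9-dimensional feature-count input, the mean-pooling of token embeddings and the random initialisation are assumptions, and the weights here are untrained.

```python
import random


def make_layer(n_in, n_out, rng):
    # Small random weights and zero biases (illustrative initialisation).
    scale = 1.0 / n_in ** 0.5
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


def linear(x, layer):
    weights, biases = layer
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]


def relu(x):
    return [max(0.0, v) for v in x]


def predict_rewards(token_ids, feature_counts, embeddings, hidden, output):
    """Map an (utterance, trajectory) pair to a 9-dimensional reward estimate."""
    dim = len(embeddings[0])
    # Assumption: token embeddings are mean-pooled into one utterance vector.
    # token_ids must be non-empty.
    pooled = [
        sum(embeddings[t][d] for t in token_ids) / len(token_ids) for d in range(dim)
    ]
    # Concatenate the utterance vector with the trajectory's feature counts.
    x = pooled + list(feature_counts)
    return linear(relu(linear(x, hidden)), output)


rng = random.Random(0)
VOCAB_SIZE, EMBED_DIM, N_FEATURES = 20, 8, 9   # hypothetical sizes
embeddings = [[rng.uniform(-1, 1) for _ in range(EMBED_DIM)] for _ in range(VOCAB_SIZE)]
hidden = make_layer(EMBED_DIM + N_FEATURES, 128, rng)
output = make_layer(128, 9, rng)

prediction = predict_rewards([3, 7, 1], [2, 0, 0, 1, 0, 0, 0, 0, 1], embeddings, hidden, output)
```

Training would then adjust the embeddings and both layers to minimise the squared error between this prediction and the teacher’s true reward weights.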

Research Questions

Do the models learn the humans’ reward function?
Does the sentiment approach provide an advantage over the end-to-end learned model?
Do the “pragmatic” augmentations improve the “literal” model?

The authors conducted a second interactive experiment pairing human teachers with their models, and the results answer all three questions positively: the models were able to learn the humans’ reward function, the sentiment approach performed better than the end-to-end model, and the “pragmatic” augmentations improved the “literal” model. They further analyzed how their models learn by testing the forms of feedback separately.

The experiment was conducted on Prolific as a between-subjects study with 148 additional participants, each paired with one model.

Learning from Live Interaction

Significant main effect of time: performance improves over successive levels
Significant interaction between interactivity and time
Sentiment models collectively outperform the neural network
The end-to-end model learns how to use most of the literal information in the data, while the “pragmatic” model leads to better performance, comparable to human learners.

Learning from Different Forms of Feedback

All models improve when sampling the entire corpus: feedback across teachers helps mitigate individual idiosyncrasies.
“Pragmatic” augmentations help on all forms, but most dramatically on “Descriptive” feedback
The inference network learns to contextualize feedback appropriately
Inspection of utterances reveals several failure modes on rarer speech patterns, most notably descriptive feedback with negative sentiment

Future Work

Improve sentiment models with theory-of-mind based pragmatics
Use stronger language models for the end-to-end inference network

Let’s take the opportunity to ask the authors some good questions during the AAAI 2021 conference!

Published by Silvia Tulli
