Explaining the Outputs of Transformers Models: A Working Example

Most of the information available worldwide is in text form, from legal documents to medical reports. A variety of deep learning models have been applied to improve and automate complex language tasks. Examples of such tasks include, but are not limited to, tokenization (i.e., splitting a character sequence, given a defined document unit, into pieces called tokens), text classification, speech recognition, and machine translation.

Due to the complexity of NLP models, a strand of explainability research focuses on facilitating the identification of different features that contribute to their outputs (Kokhlikyan et al. 2020). 

Danilevsky and colleagues clustered the XAI literature for NLP according to the type of explanation (i.e., post-hoc or self-explaining) and to whether the justification for the model's prediction concerns a specific input (i.e., local) or the functioning of the model as a whole (i.e., global). Their taxonomy identifies five main explanation techniques, namely: feature importance, surrogate model, example-driven, provenance-based, and declarative induction.

Among the existing NLP models, we analyze transformer models for two pivotal reasons: they rely on the attention mechanism (initially designed for neural machine translation), and they are exceptionally effective for common natural language understanding (NLU) and natural language generation (NLG) tasks (Vaswani et al. 2017).

Transformer models are general-purpose architectures that weigh the influence of different parts of the input data and aim at reducing sequential computation by depending on the self-attention mechanism to compute a representation of inputs and outputs.

In practice, the encoder represents the input as a set of key-value pairs, (K, V), of dimension dk and dv respectively. The decoder packs the previous output into a query Q of dimension m and obtains the next output by mapping this query to the set of keys and values (Weng et al. 2018). The resulting matrix of attention weights, also referred to as the score matrix, determines the importance that a specific word has with respect to other words.

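The score matrix is the result of a scaled dot-product: the weight assigned to each value is determined by the dot-product of the query with all keys, divided by the square root of dk and normalized with a softmax function. The output is the weighted sum of the values.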
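As a minimal sketch of the scaled dot-product described above, the following NumPy code computes the score matrix and the attention output for randomly generated queries, keys, and values. The function names and the toy shapes (two queries, three key-value pairs) are assumptions made for illustration; this is not code from any particular library.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - np.max(x, axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=axis, keepdims=True)


def scaled_dot_product_attention(Q, K, V):
    """Return (output, weights) for Q of shape (m, dk), K (n, dk), V (n, dv)."""
    dk = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(dk)       # raw compatibility of each query with each key
    weights = softmax(scores, axis=-1)   # score matrix: each row sums to 1
    return weights @ V, weights          # output is a weighted sum of the values


# Toy inputs (hypothetical sizes): 2 queries, 3 key-value pairs, dk = 4, dv = 5.
rng = np.random.default_rng(0)
Q = rng.normal(size=(2, 4))
K = rng.normal(size=(3, 4))
V = rng.normal(size=(3, 5))

output, weights = scaled_dot_product_attention(Q, K, V)
print(output.shape, weights.shape)
```

Each row of `weights` can be read as how strongly one query position attends to every key position, which is the quantity that attention-based explanations usually visualize.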

The attention mechanism repeats h times with different, learned linear projections of the queries, keys, and values to dk, dk, and dv dimensions, respectively. The independent attention outputs of each learned projection are then concatenated and linearly transformed into the expected dimension. 

In multi-head attention, h corresponds to the number of parallel attention layers (i.e., heads), and the Ws are all learnable parameter matrices (i.e., the per-head projections W_i^Q, W_i^K, and W_i^V, and the output projection W^O).
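For reference, the attention and multi-head computations described above can be written as follows, using the notation of Vaswani et al. (2017). The symbol d_model (the model's hidden size) does not appear in the text above and is taken from that paper.

```latex
\begin{align}
\mathrm{Attention}(Q, K, V) &= \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V \\
\mathrm{head}_i &= \mathrm{Attention}\!\left(Q W_i^{Q},\; K W_i^{K},\; V W_i^{V}\right), \quad i = 1, \dots, h \\
\mathrm{MultiHead}(Q, K, V) &= \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h)\, W^{O}
\end{align}
% Learnable projections:
% W_i^{Q}, W_i^{K} \in \mathbb{R}^{d_{model} \times d_k}, \quad
% W_i^{V} \in \mathbb{R}^{d_{model} \times d_v}, \quad
% W^{O} \in \mathbb{R}^{h d_v \times d_{model}}
```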

An Example of Transformer Architecture: Bidirectional Encoder Representations from Transformers (BERT)

Bidirectional Encoder Representations from Transformers (BERT) (Devlin et al. 2018) is a transformer-based machine learning technique for NLP pre-training developed by Google. The peculiarity of this technique is that it applies bidirectional training to language modeling.

In contrast to directional models that read the text input sequentially (e.g., OpenAI GPT, ELMo), bidirectional encoders process the entire sequence of words at once. This way of processing words allows BERT to learn to unambiguously contextualize words based on both left and right words and, by repeating this process multiple times (i.e., multi-head), to learn different contexts between different pairs of words. The figure below describes the BERT architecture.

To define the goal of the prediction, BERT makes use of two techniques: Masked Language Modeling (MLM), inspired by the Cloze procedure, and Next Sentence Prediction (NSP). The former consists of substituting approximately 15% of the tokens with a mask token and querying the model to predict the values of the masked tokens based on the surrounding words. This strategy requires adding a classification layer (e.g., a fully connected layer) on top of the encoder output, transforming the output vectors into the vocabulary dimension by multiplying them by the embedding matrix, and using a softmax function to compute the probability of each token.

The latter involves training the model on pairs of sentences so that it learns to predict whether the second sentence in the pair is the subsequent sentence in the original document.
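To make the masking step of MLM concrete, here is a simplified sketch in plain Python. It is an illustration under stated assumptions, not BERT's actual preprocessing: it splits on whitespace instead of using WordPiece tokens, always substitutes the mask token (the original paper replaces only 80% of the selected tokens with the mask, 10% with a random token, and leaves 10% unchanged), and the function name `mask_tokens` is hypothetical.

```python
import random

MASK_TOKEN = "[MASK]"


def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Replace about mask_prob of the tokens with [MASK].

    Returns (masked, labels): labels holds the original token at each
    masked position and None everywhere else, which is what the model
    would be trained to predict.
    """
    if not tokens:
        return [], []
    rng = random.Random(seed)
    n_to_mask = max(1, round(len(tokens) * mask_prob))
    positions = set(rng.sample(range(len(tokens)), n_to_mask))
    masked = [MASK_TOKEN if i in positions else tok for i, tok in enumerate(tokens)]
    labels = [tok if i in positions else None for i, tok in enumerate(tokens)]
    return masked, labels


sentence = "the quick brown fox jumps over the lazy dog near the quiet river bank today"
tokens = sentence.split()
masked, labels = mask_tokens(tokens)
print(masked)
```

During pre-training, the loss is computed only at the positions where `labels` is not None.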

High-level overview of the BERT Transformer model: the input is a sequence of tokens embedded into vectors, and the output is a sequence of vectors linked by index to the input tokens. The encoder multi-head attention mechanism computes queries, keys, and values from the encoder states. The encoder feed-forward network takes additional information from other tokens and integrates it into the model. The decoder masked multi-head attention mechanism computes queries, keys, and values from the decoder states. The decoder multi-head attention mechanism looks at the source of the target tokens, taking the queries from the decoder states and the keys and values from the encoder states. The decoder feed-forward network takes additional information from other tokens and integrates it into the model (original image from Vaswani et al. 2017).

Given the growing interest in Transformer models, a number of interfaces now exist for exploring their inner workings.

Captum is a multi-modal package for model interpretability built on PyTorch. Captum attribution algorithms can be grouped into three main categories: primary, layer, and neuron attribution algorithms. Primary attribution algorithms allow us to attribute output predictions to model inputs. Layer attribution algorithms allow us to attribute output predictions to all neurons in a hidden layer. Neuron attribution algorithms allow us to attribute an internal, hidden neuron to the input of the model (Kokhlikyan et al. 2020).

Together with other libraries (e.g., Ecco, Flair, Interpret-Flair), Captum has been broadly used to explain Transformer models. Inspired by Captum and Hugging Face, Transformer-Interpret is a dedicated tool for interpreting Transformer models. The default attribution method used by Transformer-Interpret is Integrated Gradients, which computes the integral of the gradients of the output with respect to the inputs along a straight-line path from a given baseline to the input.
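The idea behind Integrated Gradients can be sketched in a few lines of NumPy. This is an illustrative approximation, not Captum's implementation: the toy model `f`, its analytic gradient `grad_f`, and the all-zeros baseline are assumptions chosen so that the result is easy to verify by hand. The integral is approximated with a midpoint Riemann sum.

```python
import numpy as np


def integrated_gradients(grad_fn, x, baseline, n_steps=50):
    """Approximate Integrated Gradients along the straight path baseline -> x.

    grad_fn maps an input vector to the gradient of the model output
    with respect to that input.
    """
    alphas = (np.arange(n_steps) + 0.5) / n_steps        # midpoints in (0, 1)
    path = baseline + alphas[:, None] * (x - baseline)   # shape (n_steps, d)
    avg_grad = np.mean([grad_fn(p) for p in path], axis=0)
    return (x - baseline) * avg_grad


# Toy differentiable "model" (hypothetical): f(x) = (w . x)^2
w = np.array([1.0, -2.0, 0.5])


def f(x):
    return float(np.dot(w, x) ** 2)


def grad_f(x):
    return 2.0 * np.dot(w, x) * w


x = np.array([1.0, 0.5, 2.0])
baseline = np.zeros_like(x)

attributions = integrated_gradients(grad_f, x, baseline)
print(attributions, attributions.sum(), f(x) - f(baseline))
```

The attributions sum to the difference between the model output at the input and at the baseline (the completeness property of Integrated Gradients), which gives a quick sanity check for any implementation.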

A Zero-Shot Classification Explainer

A common text classification task is sentiment analysis, which aims at detecting positive, neutral, or negative sentiment in text. Transformer-Interpret helps us identify and visualize how each word contributes to the predicted class: a positive attribution score means the word pushes the prediction towards that class, while a negative score means it pushes against it.

The Sequence Classification Explainer is one of the Transformer-Interpret methods for explaining sequence classification tasks such as sentiment analysis. This explainer computes attributions for a text using a given model and tokenizer. Attributions can be forced along the axis of a particular output index or class name.

Therefore, when using this method we can optionally pass a valid index or class name for the outputs. Moreover, since attributions can be computed with respect to a particular embedding type, we can also select an embedding type.

The default value “0” is for word embeddings (i.e., a learned representation of text in which words with similar meanings have similar representations).

The method returns a list of tuples containing words and their associated attribution scores.
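The following sketch shows how the Sequence Classification Explainer might be called. It assumes the tool is installed from PyPI as `transformers-interpret` (imported as `transformers_interpret`), and the checkpoint name is an assumption picked for illustration, since the text above does not name a model. The helper `strongest_tokens` is hypothetical and only post-processes the returned list of tuples.

```python
from typing import List, Optional, Tuple


def explain_sentiment(
    text: str,
    model_name: str = "distilbert-base-uncased-finetuned-sst-2-english",  # assumed checkpoint
    class_name: Optional[str] = None,
) -> List[Tuple[str, float]]:
    """Return (token, attribution) pairs for a sentiment prediction."""
    # Imported lazily: these packages are heavy and download model weights.
    from transformers import AutoModelForSequenceClassification, AutoTokenizer
    from transformers_interpret import SequenceClassificationExplainer

    model = AutoModelForSequenceClassification.from_pretrained(model_name)
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    explainer = SequenceClassificationExplainer(model, tokenizer)
    # class_name forces attributions towards a given class (None = predicted class);
    # embedding_type=0 selects word embeddings.
    return explainer(text, class_name=class_name, embedding_type=0)


def strongest_tokens(
    word_attributions: List[Tuple[str, float]], k: int = 3
) -> List[Tuple[str, float]]:
    """Rank tokens by the absolute value of their attribution score."""
    return sorted(word_attributions, key=lambda pair: abs(pair[1]), reverse=True)[:k]
```

A call such as `strongest_tokens(explain_sentiment("I love this movie"))` would then list the tokens that most influenced the predicted class, with the sign of each score indicating the direction of the influence.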

In NLP, zero-shot classification consists of defining a list of candidate labels and then running a classifier to assign a probability to each label, without training the classifier on those labels. Similar to the Sequence Classification Explainer, the Zero-Shot Classification Explainer returns words with their attribution scores, but with a separate set of attributions for every label.
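A comparable sketch for the Zero-Shot Classification Explainer is given below. It again assumes the `transformers_interpret` package, and the NLI checkpoint name is an assumption chosen for illustration (zero-shot explanation relies on a model fine-tuned for natural language inference). The helper `top_token_per_label` is hypothetical and assumes the explainer returns a mapping from each label to a list of (token, attribution) tuples.

```python
from typing import Dict, List, Tuple

Attributions = Dict[str, List[Tuple[str, float]]]


def explain_zero_shot(
    text: str,
    labels: List[str],
    model_name: str = "cross-encoder/nli-deberta-base",  # assumed NLI checkpoint
) -> Attributions:
    """Return per-label (token, attribution) pairs for a zero-shot prediction."""
    # Imported lazily: these packages are heavy and download model weights.
    from transformers import AutoModelForSequenceClassification, AutoTokenizer
    from transformers_interpret import ZeroShotClassificationExplainer

    model = AutoModelForSequenceClassification.from_pretrained(model_name)
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    explainer = ZeroShotClassificationExplainer(model, tokenizer)
    return explainer(text, labels=labels)


def top_token_per_label(attributions_by_label: Attributions) -> Dict[str, str]:
    """For each label, pick the token with the largest positive attribution."""
    return {
        label: max(pairs, key=lambda pair: pair[1])[0]
        for label, pairs in attributions_by_label.items()
    }
```

For example, `top_token_per_label(explain_zero_shot(text, ["finance", "technology", "sports"]))` would show, for each candidate label, which token supports that label the most.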

The working example is in Colab

Transformer-Interpret suitability analysis: pros and cons

The Transformer-Interpret tool has the advantage of relying on two well-documented packages and frameworks (i.e., Captum and Hugging Face Transformers). It provides simple methods to explain the most common natural language processing tasks performed by Transformer models, such as sequence classification, zero-shot classification, and question answering. However, methods to explain causal language models and multiple-choice models are still missing.

Find out more about this and related topics on our Practical Tutorial on Explainable AI Techniques written together with Adrien Bennetot.
