A paper a day #2: What Does BERT Look At? An Analysis of BERT’s Attention

Peratham Wiriyathammabhum
4 min read · Jan 13, 2020


ArXiv link: https://arxiv.org/pdf/1906.04341v1.pdf

A paper a day: I am picking some interesting papers (to me/IMO) from arXiv and summarizing them informally here on Medium.

Summary: This paper conducts an analysis of BERT models. In addition to existing analysis methods such as language model surprisal (on the outputs) or probing classifiers (on the internal vector representations), the authors propose an analytical framework for BERT’s attention mechanism. BERT’s attention heads exhibit different patterns, such as attending to delimiter tokens, attending to specific positional offsets, or attending broadly over the whole sentence, and heads in the same layer often exhibit similar behavior. Furthermore, the authors find that syntactic and coreference information is indeed captured by BERT’s heads (verb-object, noun-determiner, preposition-object relations, etc.).

[This paper is an analysis paper, so I feel the organization will be hypothesis-experiments-conclusion instead of the typical example-approach-experiments.]

Surface-level patterns: First, section 3 details surface-level patterns in the attention maps, which are extracted from 1000 random Wikipedia segments. Each segment has at most 128 tokens and consists of two paragraphs in the format [CLS]<paragraph-1>[SEP]<paragraph-2>[SEP]. The BERT model analyzed (BERT-base) has 12 layers with 12 attention heads in each layer.
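To make the setup concrete, below is a minimal sketch of how such per-head attention maps can be extracted. This is my own illustration using the HuggingFace transformers library, not the authors' original code (linked at the end of this post).

```python
# Minimal sketch (HuggingFace transformers, not the authors' original code):
# extract the 12 x 12 per-head attention maps for a [CLS] <p1> [SEP] <p2> [SEP] segment.
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased", output_attentions=True)
model.eval()

paragraph_1 = "The cat sat on the mat."   # placeholder paragraphs for illustration
paragraph_2 = "It seemed very comfortable there."

# Passing two text arguments produces [CLS] p1 [SEP] p2 [SEP], truncated to 128 tokens.
inputs = tokenizer(paragraph_1, paragraph_2, truncation=True, max_length=128,
                   return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# outputs.attentions is a tuple of 12 tensors (one per layer),
# each of shape (batch, num_heads=12, seq_len, seq_len).
attentions = torch.stack(outputs.attentions).squeeze(1)  # (12, 12, seq_len, seq_len)
print(attentions.shape)
```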

Relative positions are attended to heavily in the earlier layers of the network. Four attention heads (in layers 2, 4, 7, and 8) put most of their attention (more than 50% on average) on the previous token, and five heads (in layers 1, 2, 2, 3, and 6; layer 2 contributes two such heads, which explains the repeated 2) put most of their attention on the next token.
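As a rough sketch of how such positional statistics can be computed (my own code, continuing from the attentions tensor extracted above), one can average, for each head, the attention weight that every token puts on its neighbor at a fixed offset:

```python
# Average attention each head assigns to a fixed positional offset
# (continues from the `attentions` tensor of shape (layers, heads, seq, seq) above).
import torch

def avg_attention_to_offset(attentions: torch.Tensor, offset: int) -> torch.Tensor:
    """Mean of attention[i, i + offset] over all valid positions i, per layer and head."""
    seq_len = attentions.shape[-1]
    idx = torch.arange(seq_len)
    valid = (idx + offset >= 0) & (idx + offset < seq_len)
    src = idx[valid]
    tgt = src + offset
    return attentions[:, :, src, tgt].mean(dim=-1)   # (num_layers, num_heads)

prev_token = avg_attention_to_offset(attentions, offset=-1)
next_token = avg_attention_to_offset(attentions, offset=+1)
# Heads with values above 0.5 correspond to the "previous/next token" heads in Figure 1.
print((prev_token > 0.5).nonzero(), (next_token > 0.5).nonzero())
```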

Figure 1 from the paper.

More than half of a head’s total attention can go to a few special tokens: [CLS] in early layers, [SEP] in middle layers, and periods/commas in deep layers. The authors hypothesize that these tokens are used as a no-op for a head when its attention function is not applicable.

Figure 2 from the paper.

They also apply gradient-based measures of feature importance. The results show that the attention to special tokens is high, but the corresponding gradient magnitudes are low, meaning that this attention does not substantially change BERT’s outputs. This further supports the no-op hypothesis.
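Here is a rough PyTorch approximation of that measurement (my own sketch, not the authors' implementation): compute the masked language modeling loss and keep the gradients with respect to the attention maps.

```python
# Sketch: magnitude of d(MLM loss) / d(attention weight) per head.
import torch
from transformers import BertForMaskedLM, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
mlm_model = BertForMaskedLM.from_pretrained("bert-base-uncased", output_attentions=True)
mlm_model.eval()

inputs = tokenizer("The cat sat on the [MASK].", return_tensors="pt")
labels = torch.full_like(inputs["input_ids"], -100)          # ignore unmasked positions
mask_pos = inputs["input_ids"] == tokenizer.mask_token_id
labels[mask_pos] = tokenizer.convert_tokens_to_ids("mat")    # pretend the masked word was "mat"

outputs = mlm_model(**inputs, labels=labels)
for attn in outputs.attentions:   # one (1, heads, seq, seq) tensor per layer
    attn.retain_grad()            # keep gradients for these non-leaf tensors
outputs.loss.backward()

# Low gradient magnitudes at heavily attended [SEP] positions would support
# the "no-op" interpretation of that attention.
grad_magnitudes = torch.stack([a.grad.abs().mean(dim=(-2, -1)) for a in outputs.attentions])
print(grad_magnitudes.squeeze(1).shape)   # (12 layers, 12 heads)
```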

Figure 3 from the paper.

Lastly, they compute the average entropy of each attention head to measure how many words a head focuses on. Some attention heads in the lower layers spread their attention very broadly, putting at most around 10% of their attention mass on any single word. The output of these heads is roughly a bag-of-vectors representation of the sentence. For the [CLS] token, the entropy looks similar to the average, except in the last layer where it is very high (broad attention). The reason might be that [CLS] is the only input to the ‘next sentence prediction’ task during pre-training, so it needs to aggregate information from the whole input.
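A minimal sketch of the entropy computation (my own code, again reusing the attentions tensor from the earlier sketch):

```python
# Average entropy of each head's attention distributions; broad heads -> high entropy.
import torch

def attention_entropy(attentions: torch.Tensor, eps: float = 1e-12) -> torch.Tensor:
    """attentions: (layers, heads, seq, seq); returns per-head average entropy (layers, heads)."""
    entropy = -(attentions * torch.log(attentions + eps)).sum(dim=-1)  # entropy per attending token
    return entropy.mean(dim=-1)

avg_entropy = attention_entropy(attentions)
# Entropy of the attention coming only from the [CLS] token (position 0).
cls_attn = attentions[:, :, 0, :]
cls_entropy = -(cls_attn * torch.log(cls_attn + 1e-12)).sum(dim=-1)
print(avg_entropy.shape, cls_entropy.shape)   # both (12, 12)
```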

Figure 4 from the paper.

Probing individual attention heads:

Next, section 4 probes individual attention heads to see what aspects of language they have learned. The evaluations use labeled datasets, such as dependency parsing data (the WSJ portion of the Penn Treebank annotated with Stanford Dependencies). BERT uses byte-pair tokenization, but the analysis needs word-word attention maps, so attention to a split-up word is summed over its word pieces and attention from a split-up word is averaged over its word pieces, keeping the attention from each word summing to 1.
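The conversion itself is simple; below is a small numpy sketch of it (my own code, assuming a word_ids list mapping each word piece to its word index):

```python
# Token-level -> word-level attention: attention TO a split-up word is summed over its
# pieces, attention FROM a split-up word is averaged, so each row still sums to 1.
import numpy as np

def token_to_word_attention(attn: np.ndarray, word_ids) -> np.ndarray:
    """attn: (seq, seq) token-token map; word_ids[t] = index of the word that token t belongs to."""
    num_words = max(word_ids) + 1
    # Sum the attention going TO all pieces of the same word.
    to_words = np.zeros((attn.shape[0], num_words))
    for t, w in enumerate(word_ids):
        to_words[:, w] += attn[:, t]
    # Average the attention coming FROM all pieces of the same word.
    word_attn = np.zeros((num_words, num_words))
    counts = np.zeros(num_words)
    for t, w in enumerate(word_ids):
        word_attn[w] += to_words[t]
        counts[w] += 1
    return word_attn / counts[:, None]

# e.g. "playing outside" -> ["play", "##ing", "outside"] gives word_ids = [0, 0, 1]
```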

Figure 5 from the paper.

No single attention head models syntax much better than a simple right-branching (fixed-offset) baseline, but certain heads specialize in specific dependency relations and achieve high accuracy, substantially outperforming that baseline. Also, since the heads are learned purely through self-supervision, some of the learned relations differ from the annotation conventions while still performing well.
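The evaluation protocol can be sketched as follows (my own code, following the paper's description: a word's predicted syntactic head is simply the word it attends to most, compared against a fixed-offset baseline):

```python
# Evaluate one attention head as a dependency-head predictor vs. a fixed-offset baseline.
import numpy as np

def head_accuracy(word_attn: np.ndarray, gold_heads) -> float:
    """word_attn: (num_words, num_words) word-level map for one attention head.
    gold_heads[i] = index of word i's syntactic head, or None for the root."""
    correct = total = 0
    for i, gold in enumerate(gold_heads):
        if gold is None:
            continue
        predicted = int(np.argmax(word_attn[i]))   # most-attended-to word
        correct += int(predicted == gold)
        total += 1
    return correct / max(total, 1)

def offset_baseline_accuracy(gold_heads, offset: int = 1) -> float:
    """Always predict that word i's head is word i + offset (a right-branching baseline)."""
    correct = total = 0
    for i, gold in enumerate(gold_heads):
        if gold is None:
            continue
        correct += int(i + offset == gold)
        total += 1
    return correct / max(total, 1)
```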

Table 1

For coreference resolution, they use the CoNLL-2012 dataset and measure antecedent selection accuracy. One of the heads performs remarkably well, coming close to a rule-based coreference system.

Table 2

Probing attention head combinations:

They conduct further experiments to see whether the syntactic knowledge distributed across different heads is useful as a whole. They train attention-based probing classifiers for dependency head prediction while keeping all of BERT’s parameters frozen. The results still confirm the hypotheses from the previous sections.
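A minimal PyTorch rendering of such an attention-only probe (my own sketch of the idea; the paper also has a variant that adds GloVe word embeddings): the probability that word c is word d's syntactic head is a softmax over a learned linear combination of all 144 heads' attention weights, in both directions.

```python
# Attention-only probing classifier: only the per-head weights are trained; BERT is frozen.
import torch
import torch.nn as nn

class AttentionOnlyProbe(nn.Module):
    def __init__(self, num_layers: int = 12, num_heads: int = 12):
        super().__init__()
        self.w_from = nn.Parameter(torch.zeros(num_layers * num_heads))  # weight for attention d -> c
        self.w_to = nn.Parameter(torch.zeros(num_layers * num_heads))    # weight for attention c -> d

    def forward(self, attn: torch.Tensor) -> torch.Tensor:
        """attn[h, i, j]: attention from word i to word j for flattened head h.
        Returns log p(word c is the syntactic head of word d), shape (num_words, num_words)."""
        scores = torch.einsum("h,hdc->dc", self.w_from, attn) \
               + torch.einsum("h,hcd->dc", self.w_to, attn)
        return scores.log_softmax(dim=-1)

probe = AttentionOnlyProbe()
# Training: NLL loss against gold head indices, e.g.
#   log_probs = probe(word_level_attention)             # (num_words, num_words)
#   loss = nn.functional.nll_loss(log_probs, gold_head_indices)
```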

Table 3

Clustering attention heads:

They apply multidimensional scaling (MDS) to embed the heads so that the Euclidean distance between two heads approximates the Jensen-Shannon divergence between their attention distributions, placing behaviorally similar heads near each other. The results also confirm the earlier observation that heads in the same layer tend to behave similarly.
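A small sketch of that computation (my own, using scipy and scikit-learn; here the divergence is averaged over tokens of a single segment):

```python
# Embed the 144 heads so that Euclidean distance approximates Jensen-Shannon divergence.
import numpy as np
from scipy.spatial.distance import jensenshannon
from sklearn.manifold import MDS

def head_distance(attn_a: np.ndarray, attn_b: np.ndarray) -> float:
    """attn_*: (seq, seq) attention maps of two heads on the same input.
    scipy's jensenshannon returns the square root of the JS divergence, so square it."""
    return float(np.mean([jensenshannon(attn_a[i], attn_b[i]) ** 2
                          for i in range(attn_a.shape[0])]))

def embed_heads(attn_maps: np.ndarray) -> np.ndarray:
    """attn_maps: (num_heads_total, seq, seq) = all heads' maps on one segment."""
    n = attn_maps.shape[0]
    dist = np.zeros((n, n))
    for a in range(n):
        for b in range(a + 1, n):
            dist[a, b] = dist[b, a] = head_distance(attn_maps[a], attn_maps[b])
    return MDS(n_components=2, dissimilarity="precomputed").fit_transform(dist)
```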

Figure 6 from the paper.

GitHub link: https://github.com/clarkkev/attention-analysis

Paper bib: Clark et al., “What Does BERT Look At? An Analysis of BERT’s Attention”, BlackboxNLP Workshop at ACL, 2019. https://arxiv.org/pdf/1906.04341v1.pdf
