Understanding BERT's Multi-Head Attention Mechanism

Introduction to BERT

BERT (Bidirectional Encoder Representations from Transformers) is a pre-trained model designed for natural language understanding. It encodes text with deep bidirectional context, meaning every layer can attend to tokens on both the left and the right of each position. BERT uses only the transformer encoder, stacking 12 layers in BERT-base and 24 in BERT-large, to produce contextual representations of tokens.
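
As a quick sanity check, the layer and head counts above can be read directly from the model configuration. This is a minimal sketch assuming the Hugging Face transformers package is installed and using the "bert-base-uncased" checkpoint for illustration:

```python
# Inspect BERT-base's architecture via its configuration,
# assuming the Hugging Face `transformers` package is installed.
from transformers import AutoConfig

config = AutoConfig.from_pretrained("bert-base-uncased")
print(config.num_hidden_layers)    # 12 transformer encoder layers
print(config.num_attention_heads)  # 12 attention heads per layer
print(config.hidden_size)          # 768-dimensional token representations
```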

Multi-Head Attention in BERT

To understand BERT's multi-head attention mechanism, consider the sentence: "The cat sat on the mat."

Self-Attention Mechanism

In self-attention, each token in the sentence computes how much focus it should give to every other token, including itself, to build its representation. Let's break it down:

Token Embeddings

The sentence is split into tokens: ["The", "cat", "sat", "on", "the", "mat"]. Each token is represented by a vector. For simplicity, let's denote them as T1, T2, T3, T4, T5, T6 respectively.
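
In practice BERT uses a WordPiece tokenizer, lowercases the input (for the uncased model), and wraps the sequence in [CLS] and [SEP] special tokens when encoding; for this short sentence the result closely mirrors the simplified list above. A small sketch, again assuming the Hugging Face transformers package:

```python
# Tokenize the example sentence with the "bert-base-uncased" tokenizer.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# WordPiece keeps these common words whole, so the tokens match T1..T6.
print(tokenizer.tokenize("The cat sat on the mat"))
# ['the', 'cat', 'sat', 'on', 'the', 'mat']

# encode() additionally adds [CLS] ... [SEP] and maps tokens to vocabulary ids.
print(tokenizer.encode("The cat sat on the mat"))
```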

Creating Q, K, and V Vectors

For each token Ti, three vectors are created through learned linear projections of its embedding: Qi (Query), Ki (Key), and Vi (Value).
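
Here is a minimal NumPy sketch of these projections for a single attention head. The dimensions are toy values (d_model = 8, d_k = 4) rather than BERT-base's real ones (768, with 64 per head), and random matrices stand in for the trained weights:

```python
import numpy as np

rng = np.random.default_rng(0)

seq_len, d_model, d_k = 6, 8, 4          # 6 tokens: "The cat sat on the mat"
X = rng.normal(size=(seq_len, d_model))  # token embeddings T1..T6, stacked row-wise

# Learned projection matrices (random stand-ins for trained weights).
W_Q = rng.normal(size=(d_model, d_k))
W_K = rng.normal(size=(d_model, d_k))
W_V = rng.normal(size=(d_model, d_k))

Q = X @ W_Q   # one query vector per token (Q1..Q6)
K = X @ W_K   # one key vector per token   (K1..K6)
V = X @ W_V   # one value vector per token (V1..V6)
print(Q.shape, K.shape, V.shape)  # (6, 4) each
```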

Attention Calculation

Dot-Product Scores: Compute the dot product of the query vector of one token with the key vectors of all tokens to get attention scores. For instance, the attention scores for the token "cat" are computed by taking the dot product of Qcat with each of the key vectors KThe, Kcat, Ksat, Kon, Kthe, Kmat.

Scaling and Softmax: Each score is divided by the square root of the key dimension (8 for a BERT-base head, since d_k = 64) to keep the dot products in a stable range, and a softmax over the scores turns them into attention weights that sum to 1.

Weighted Sum: The new representation of "cat" is the weighted sum of all value vectors VThe, Vcat, ..., Vmat, using those attention weights, so the tokens that "cat" attends to most strongly contribute most to its contextual representation.
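
Putting these steps together, here is a compact NumPy sketch of scaled dot-product attention for a single head. The sizes are toy values, and random arrays stand in for the projected Q, K, and V vectors from the previous sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_k = 6, 4                  # toy sizes, as in the previous sketch
Q = rng.normal(size=(seq_len, d_k))
K = rng.normal(size=(seq_len, d_k))
V = rng.normal(size=(seq_len, d_k))

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

# Scaled dot-product attention: scores -> softmax weights -> weighted sum of values.
scores = Q @ K.T / np.sqrt(d_k)      # (6, 6): every token's score against every token
weights = softmax(scores, axis=-1)   # each row is a probability distribution
output = weights @ V                 # new contextual representation per token

print(weights[1].round(2))           # how much "cat" (token 2) attends to each token
print(output.shape)                  # (6, 4)
```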

Multi-Head Attention

Multi-head attention allows the model to focus on different parts of the sequence simultaneously, learning various contextual relationships between words. Rather than computing attention once, BERT-base projects each token into 12 separate sets of Q, K, and V vectors (one per head, each 64-dimensional), runs the attention calculation above independently for every head, then concatenates the 12 head outputs and passes them through a final linear projection. For example, one head might link the verb "sat" to its subject "cat", while another connects the preposition "on" to its object "mat", and a third tracks which "the" belongs to which noun.

By employing multiple attention heads, BERT can capture a wide range of linguistic features, enhancing its understanding of language context.
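
The sketch below shows the multi-head wiring in NumPy, again with toy dimensions (2 heads of size 4 instead of BERT-base's 12 heads of size 64) and random matrices in place of learned weights:

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model, n_heads = 6, 8, 2
d_k = d_model // n_heads

X = rng.normal(size=(seq_len, d_model))          # token embeddings T1..T6

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    weights = softmax(Q @ K.T / np.sqrt(Q.shape[-1]), axis=-1)
    return weights @ V

head_outputs = []
for _ in range(n_heads):
    # Each head has its own Q/K/V projections, so it can learn to attend
    # to a different pattern of relationships between the tokens.
    W_Q, W_K, W_V = (rng.normal(size=(d_model, d_k)) for _ in range(3))
    head_outputs.append(attention(X @ W_Q, X @ W_K, X @ W_V))

# Concatenate the per-head outputs and mix them with a final linear projection.
W_O = rng.normal(size=(d_model, d_model))
multi_head_output = np.concatenate(head_outputs, axis=-1) @ W_O
print(multi_head_output.shape)  # (6, 8): one contextual vector per token
```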
