The capital of France is ___

When a large model generates a new word, how do the matrices inside the model change?

Embedding

We start with the embedding layer. For simplicity, assume one word equals one token (in reality, GPT-2's BPE tokenizer can split a word into several tokens).

Input sentence: The capital of France is ___

-> [‘The’, ‘capital’, ‘of’, ‘France’, ‘is’]

-> [464, 3361, 295, 2238, 318], which are their Token IDs

This forms a sequence of length 5.

GPT-2 small is a good model for this kind of breakdown; other models follow the same principles, just with bigger matrices.

GPT-2 has a vocabulary of roughly 50,000 tokens (50,257 to be exact; we round to 50,000 throughout), with each token mapped to a 768-dimensional vector.

This means there exists an embedding table of size [50,000, 768].

Looking up the embedding table:

  • The = [0.316, 0.524, 0.063, 0.481, 0.266, …]
  • capital = [0.123, 0.234, 0.345, 0.456, 0.567, …]
  • of = [0.432, 0.543, 0.654, 0.765, 0.876, …]
  • France = [0.543, 0.654, 0.765, 0.876, 0.987, …]
  • is = [0.654, 0.765, 0.876, 0.987, 0.098, …]

-> Thus, we obtain a [5, 768] matrix!
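Here is a minimal NumPy sketch of that lookup. The table is random, standing in for GPT-2's learned weights, so only the shapes are meaningful:

```python
import numpy as np

vocab_size, d_model = 50_000, 768            # rounded figures used throughout this post
rng = np.random.default_rng(0)

# Random stand-in for the learned [50,000, 768] embedding table
embedding_table = rng.normal(size=(vocab_size, d_model))

token_ids = np.array([464, 3361, 295, 2238, 318])   # "The capital of France is"
x = embedding_table[token_ids]                       # one row per token

print(x.shape)                                       # (5, 768)
```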

Attention

In GPT-2 small, Multi-Head Attention has 12 heads.

A key mathematical relationship: 12 * 64 = 768.

So [5, 768] can be reshaped to [5, 12, 64], i.e. 12 per-head slices of [5, 64].

We will have three weight matrices, W_Q, W_K, and W_V; each is [768, 768]. Multiplying the embeddings by them gives Q, K, and V.

Let's start with Q.

W_Q's size is [768, 768] == [768, 12 * 64].

We split the second 768 into 12 heads, so each head's slice of W_Q is [768, 64].

Q for one head = [5, 768] * [768, 64] = [5, 64]

Q = 12 heads concatenated side by side: [5, 64], [5, 64], …, [5, 64] -> [5, 12 * 64] = [5, 768]

K and V are computed the same way.

K = [5, 768]

V = [5, 768]
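A shape-only sketch of these projections and the head split, with random weights standing in for the trained W_Q, W_K, W_V:

```python
import numpy as np

d_model, n_heads, d_head, seq_len = 768, 12, 64, 5
rng = np.random.default_rng(0)

x = rng.normal(size=(seq_len, d_model))      # the [5, 768] embeddings from the previous step

# Random stand-ins for the trained weight matrices W_Q, W_K, W_V, each [768, 768]
W_q = rng.normal(size=(d_model, d_model))
W_k = rng.normal(size=(d_model, d_model))
W_v = rng.normal(size=(d_model, d_model))

Q = x @ W_q                                  # [5, 768]
K = x @ W_k                                  # [5, 768]
V = x @ W_v                                  # [5, 768]

# Split the last dimension into 12 heads: [5, 768] -> [5, 12, 64] -> [12, 5, 64]
Q_heads = Q.reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)
print(Q.shape, Q_heads.shape)                # (5, 768) (12, 5, 64)
```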

Here’s the most important attention formula:

Attention(Q, K, V) = Softmax(QK^T / √d) · V

In practice this is computed per head, on the [5, 64] slices (d is the per-head dimension, 64):

Q_head = [5, 64]

K_head^T = [64, 5]

Q_head K_head^T = [5, 5] (one attention map per head, so 12 of them in total)

Softmax(Q_head K_head^T / √d) = [5, 5]

Softmax(Q_head K_head^T / √d) * V_head = [5, 5] * [5, 64] = [5, 64]

Concatenating the 12 heads gives A = [5, 768]

We also have an output projection matrix = [768, 768], which does not change the shape of A.

Output projection(A) = [5, 768] * [768, 768] = [5, 768]

This is still [5, 768].
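And a per-head sketch of the attention itself, with random stand-in tensors; the causal mask GPT-2 applies to the scores is omitted to keep the shapes in focus:

```python
import numpy as np

d_model, n_heads, d_head, seq_len = 768, 12, 64, 5
rng = np.random.default_rng(0)

# Stand-ins for the per-head Q, K, V slices from the previous step: [12, 5, 64] each
Q = rng.normal(size=(n_heads, seq_len, d_head))
K = rng.normal(size=(n_heads, seq_len, d_head))
V = rng.normal(size=(n_heads, seq_len, d_head))

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)      # [12, 5, 5]: one attention map per head
weights = softmax(scores)                                 # each row sums to 1
A_heads = weights @ V                                     # [12, 5, 64]

A = A_heads.transpose(1, 0, 2).reshape(seq_len, d_model)  # concatenate heads -> [5, 768]
W_o = rng.normal(size=(d_model, d_model))                 # output projection, [768, 768]
out = A @ W_o                                             # still [5, 768]
print(scores.shape, out.shape)                            # (12, 5, 5) (5, 768)
```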

Multi-Layer Perceptron (MLP)

Attention output = [5, 768]

The MLP has two layers: one expands, [768, 3072], and one contracts, [3072, 768], with a GELU nonlinearity in between.

Change: [5, 768] -> [5, 3072] -> [5, 768]

This is still [5, 768].
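A sketch of the MLP, again with random stand-in weights:

```python
import numpy as np

d_model, d_ff, seq_len = 768, 3072, 5
rng = np.random.default_rng(0)

x = rng.normal(size=(seq_len, d_model))      # attention output, [5, 768]
W_up = rng.normal(size=(d_model, d_ff))      # expand: [768, 3072]
W_down = rng.normal(size=(d_ff, d_model))    # contract: [3072, 768]

def gelu(z):
    # tanh approximation of GELU, the nonlinearity GPT-2 uses between the two layers
    return 0.5 * z * (1 + np.tanh(np.sqrt(2 / np.pi) * (z + 0.044715 * z**3)))

h = gelu(x @ W_up)                           # [5, 3072]
y = h @ W_down                               # back to [5, 768]
print(h.shape, y.shape)                      # (5, 3072) (5, 768)
```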

Repeating Attention + MLP for 12 Layers

[5, 768] -> [5, 768] -> [5, 768] … -> [5, 768]

This is still [5, 768].
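Conceptually, the whole stack is just this loop. The blocks below are hypothetical, shape-preserving stand-ins for the attention and MLP sub-layers sketched above; real GPT-2 also applies a layer norm before each sub-layer:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff, n_layers = 768, 3072, 12
x = rng.normal(size=(5, d_model))            # embeddings, [5, 768]

def attention_block(h):
    """Stand-in for multi-head attention: shape-preserving [5, 768] -> [5, 768]."""
    W_o = rng.normal(size=(d_model, d_model)) * 0.01
    return h @ W_o

def mlp_block(h):
    """Stand-in for the MLP: [5, 768] -> [5, 3072] -> [5, 768]."""
    W_up = rng.normal(size=(d_model, d_ff)) * 0.01
    W_down = rng.normal(size=(d_ff, d_model)) * 0.01
    return np.tanh(h @ W_up) @ W_down

for _ in range(n_layers):
    x = x + attention_block(x)   # residual connection; shape stays [5, 768]
    x = x + mlp_block(x)

print(x.shape)                   # (5, 768)
```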

Final Output

After a final layer norm, we extract the last row of the [5, 768] output, the vector sitting at "is": [768]

Then we multiply it by the transpose of the original embedding table, [768, 50,000] (GPT-2 ties its input and output embeddings), which gives a score for every token in the vocabulary; the highest-scoring Token ID maps back to the final word: Paris
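A sketch of this last step with random stand-in weights (a trained model would of course produce the ID for "Paris"):

```python
import numpy as np

vocab_size, d_model, seq_len = 50_000, 768, 5
rng = np.random.default_rng(0)

embedding_table = rng.normal(size=(vocab_size, d_model))  # the same table used at the input (weight tying)
x = rng.normal(size=(seq_len, d_model))                   # output of the 12 Transformer layers

last = x[-1]                               # [768]: the vector at the position of "is"
logits = last @ embedding_table.T          # [768] @ [768, 50,000] -> [50,000] scores
next_token_id = int(np.argmax(logits))     # greedy pick; sampling with temperature/top-k is also common

print(logits.shape, next_token_id)         # (50000,) <a random ID here, since the weights are random>
```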

Parameter Count Calculation

50000 * 768 + (768 * 768 * 3 + 768 * 768 + 768 * 3072 * 2) * 12

= 38.4M + 7.08M × 12 ≈ 123M, which matches GPT-2 small's official 124M once positional embeddings, biases, and layer norms are added.

They are: the embedding table (50,000, 768); and, per layer, the three QKV weight matrices (768, 768), the output projection (768, 768), and the two MLP weight matrices (768, 3072) and (3072, 768), repeated across 12 layers.

Breakdown of Parameters:

  • 38.4M → Embedding
  • 21.2M → QKV weights (12 layers)
  • 7.1M → Output projection (12 layers)
  • 56.6M → MLP (12 layers)
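A quick check of this arithmetic in code:

```python
d_model, d_ff, n_layers, vocab = 768, 3072, 12, 50_000

embedding = vocab * d_model                      # 38.4M
qkv       = 3 * d_model * d_model * n_layers     # 21.2M
out_proj  = d_model * d_model * n_layers         # 7.1M
mlp       = 2 * d_model * d_ff * n_layers        # 56.6M

total = embedding + qkv + out_proj + mlp
print(f"{total / 1e6:.1f}M")   # 123.3M; positional embeddings, biases and layer norms bring it to ~124M
```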

A comparison of GPT-2 model sizes:

[Figure: parameter counts for the GPT-2 model family]

Summary

  • The embedding layer converts the sentence into a [5, 768] matrix.
  • The multi-head attention mechanism calculates Q, K, V, all with shape [5, 768].
  • Applying Softmax(QK^T / √d) * V gives a new [5, 768].
  • The MLP expands and contracts dimensions, outputting another [5, 768].
  • Repeating this process for 12 Transformer layers keeps the final shape [5, 768].
  • The last row [768] is mapped back to a token using the embedding table.
  • The final predicted word: “Paris”.
  • Total parameters ~124M, primarily distributed across Embedding, Attention, and MLP.

Note that 5 is the sequence length of this example; GPT-2's maximum context length is 1024, so in practice sequences can be up to 1024 tokens (and are padded to a fixed length when batched).

768 is the dimensionality of each token, also known as d_model.

12 is the number of attention heads, also referred to as h.

12 is also the number of Transformer layers, known as n_layers.

This entire process is truly fascinating! Even after writing and reviewing it multiple times, I still find it incredible how this leads to intelligence. The way these transformations give rise to meaningful text remains a mystery!

I’ll write a comparison of the parameter composition of famous models (ChatGPT, Claude, Llama, DeepSeek, Qwen, etc.) soon! Wait, did I forget Gemini?