Transformer
Figure 1: Applying the Transformer to machine translation. Source: Google AI Blog.
That’s a lot to digest; the goal of this tutorial is to break it down into easy-to-understand parts.
In this tutorial you will:
• Prepare the data.
• Implement necessary components:
– Positional embeddings.
– Attention layers.
– The encoder and decoder.
• Build & train the Transformer.
• Generate translations.
• Export the model.
To get the most out of this tutorial, it helps if you know the basics of text generation and
attention mechanisms.
A Transformer is a sequence-to-sequence encoder-decoder model similar to the model in the NMT
with attention tutorial. A single-layer Transformer takes a little more code to write, but is almost
identical to that encoder-decoder RNN model. The only difference is that the RNN layers are
replaced with self-attention layers. This tutorial builds a 4-layer Transformer, which is larger and
more powerful, but not fundamentally more complex.
Figures: The RNN+Attention model and a 1-layer Transformer.
After training the model in this notebook, you will be able to input a Portuguese sentence and
return the English translation.
Figure 2: Visualized attention weights that you can generate at the end of this tutorial.
1.2 Setup
Begin by installing TensorFlow Datasets for loading the dataset and TensorFlow Text for text
preprocessing:
[2]: # Install the most recent version of TensorFlow to use the improved
# masking support for `tf.keras.layers.MultiHeadAttention`.
!apt install --allow-change-held-packages libcudnn8=8.1.0.77-1+cuda11.2
!pip uninstall -y -q tensorflow keras tensorflow-estimator tensorflow-text
!pip install protobuf~=3.20.3
!pip install -q tensorflow_datasets
!pip install -q -U tensorflow-text tensorflow
import numpy as np
import matplotlib.pyplot as plt
import tensorflow_datasets as tfds
import tensorflow as tf
import tensorflow_text
2023-11-16 12:37:14.029081: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2023-11-16 12:37:14.029127: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2023-11-16 12:37:14.030685: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
1.3 Data handling
This section downloads the dataset and the subword tokenizer from this tutorial, then wraps it all
up in a tf.data.Dataset for training.
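The dataset can be loaded with tfds.load; a minimal version that produces the train_examples and val_examples used below:

examples, metadata = tfds.load('ted_hrlr_translate/pt_to_en',
                               with_info=True,
                               as_supervised=True)

train_examples, val_examples = examples['train'], examples['validation']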
The tf.data.Dataset object returned by TensorFlow Datasets yields pairs of text examples:
[5]: for pt_examples, en_examples in train_examples.batch(3).take(1):
  print('> Examples in Portuguese:')
  for pt in pt_examples.numpy():
    print(pt.decode('utf-8'))
  print()

  print('> Examples in English:')
  for en in en_examples.numpy():
    print(en.decode('utf-8'))
Tokens can represent sentence-pieces, words, subwords, or characters. To learn more about
tokenization, visit this guide.
This tutorial uses the tokenizers built in the subword tokenizer tutorial. That tutorial optimizes two
text.BertTokenizer objects (one for English, one for Portuguese) for this dataset and exports
them in a TensorFlow saved_model format.
Note: This is different from the original paper, section 5.1, where they used a single
byte-pair tokenizer for both the source and target with a vocabulary-size of 37000.
Download, extract, and import the saved_model:
[6]: model_name = 'ted_hrlr_translate_pt_en_converter'

tf.keras.utils.get_file(
    f'{model_name}.zip',
    f'https://storage.googleapis.com/download.tensorflow.org/models/{model_name}.zip',
    cache_dir='.', cache_subdir='', extract=True
)
[6]: './ted_hrlr_translate_pt_en_converter.zip'
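Then load the exported tokenizers with tf.saved_model.load:

tokenizers = tf.saved_model.load(model_name)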
The tf.saved_model contains two text tokenizers, one for English and one for Portuguese. Both
have the same methods:
[8]: [item for item in dir(tokenizers.en) if not item.startswith('_')]
[8]: ['detokenize',
'get_reserved_tokens',
'get_vocab_path',
'get_vocab_size',
'lookup',
'tokenize',
'tokenizer',
'vocab']
The tokenize method converts a batch of strings to a padded batch of token IDs. This method
splits punctuation, lowercases, and unicode-normalizes the input before tokenizing. That
standardization is not visible here because the input data is already standardized.
[9]: print('> This is a batch of strings:')
for en in en_examples.numpy():
print(en.decode('utf-8'))
> This is a batch of strings:
and when you improve searchability , you actually take away the one advantage of
print , which is serendipity .
but what if it were active ?
but they did n't test for curiosity .
The output demonstrates the “subword” aspect of the subword tokenization.
For example, the word 'searchability' is decomposed into 'search' and '##ability', and the
word 'serendipity' into 's', '##ere', '##nd', '##ip' and '##ity'.
Note that the tokenized text includes '[START]' and '[END]' tokens.
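The matching detokenize method converts the token IDs back to human-readable text; a quick round-trip check using the tokenizers loaded above:

encoded = tokenizers.en.tokenize(en_examples)
round_trip = tokenizers.en.detokenize(encoded)

for line in round_trip.numpy():
  print(line.decode('utf-8'))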
The distribution of tokens per example in the dataset is as follows:
[13]: lengths = []

for pt_examples, en_examples in train_examples.batch(1024):
  pt_tokens = tokenizers.pt.tokenize(pt_examples)
  lengths.append(pt_tokens.row_lengths())

  en_tokens = tokenizers.en.tokenize(en_examples)
  lengths.append(en_tokens.row_lengths())
  print('.', end='', flush=True)
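The collected row lengths can then be plotted as a histogram; a minimal sketch (the bin edges here are a choice, not the tutorial's exact cell):

all_lengths = np.concatenate(lengths)

plt.hist(all_lengths, np.linspace(0, 500, 101))
plt.ylim(plt.ylim())
max_length = max(all_lengths)
plt.plot([max_length, max_length], plt.ylim())
plt.title(f'Maximum tokens per example: {max_length}');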
1.3.3 Set up a data pipeline with tf.data
The following function takes batches of text as input, and converts them to a format suitable for
training.
1. It tokenizes them into ragged batches.
2. It trims each to be no longer than MAX_TOKENS.
3. It splits the target (English) tokens into inputs and labels. These are shifted by one step so
that at each input location the label is the id of the next token.
4. It converts the RaggedTensors to padded dense Tensors.
5. It returns an (inputs, labels) pair.
[15]: MAX_TOKENS = 128

def prepare_batch(pt, en):
  pt = tokenizers.pt.tokenize(pt)     # Output is ragged.
  pt = pt[:, :MAX_TOKENS]             # Trim to MAX_TOKENS.
  pt = pt.to_tensor()                 # Convert to 0-padded dense Tensor.

  en = tokenizers.en.tokenize(en)
  en = en[:, :(MAX_TOKENS+1)]
  en_inputs = en[:, :-1].to_tensor()  # Drop the [END] tokens.
  en_labels = en[:, 1:].to_tensor()   # Drop the [START] tokens.

  return (pt, en_inputs), en_labels
The function below converts a dataset of text examples into batches suitable for training. A sketch of the helper follows the list.
1. It tokenizes the text, and filters out the sequences that are too long. (The batch/unbatch is
included because the tokenizer is much more efficient on large batches.)
2. The cache method ensures that work is only executed once.
3. Then shuffle and dense_to_ragged_batch randomize the order and assemble batches of
examples.
4. Finally prefetch runs the dataset in parallel with the model to ensure that data is available
when needed. See Better performance with the tf.data API for details.
[16]: BUFFER_SIZE = 20000
BATCH_SIZE = 64
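A make_batches helper putting those steps together might look like the following; this is a minimal sketch consistent with the list above (the cache and dense_to_ragged_batch details of the original cell may differ), with train_examples and val_examples coming from the dataset loaded earlier:

def make_batches(ds):
  return (
      ds
      .shuffle(BUFFER_SIZE)
      .batch(BATCH_SIZE)
      .map(prepare_batch, tf.data.AUTOTUNE)
      .prefetch(buffer_size=tf.data.AUTOTUNE))

# Create the training and validation batches.
train_batches = make_batches(train_examples)
val_batches = make_batches(val_examples)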
The resulting tf.data.Dataset objects are set up for training with Keras. Keras Model.fit training
expects (inputs, labels) pairs. The inputs are pairs of tokenized Portuguese and English
sequences, (pt, en). The labels are the same English sequences shifted by 1, so that at each
location in the input en sequence, the label is the ID of the next token.
Inputs at the bottom, labels at the top.
This is the same as the text generation tutorial, except here you have additional input “context”
(the Portuguese sequence) that the model is “conditioned” on.
This setup is called “teacher forcing” because regardless of the model’s output at each timestep, it
gets the true value as input for the next timestep. This is a simple and efficient way to train a text
generation model. It’s efficient because you don’t need to run the model sequentially, the outputs
at the different sequence locations can be computed in parallel.
You might have expected the (input, output) pairs to simply be the (Portuguese, English)
sequences: given the Portuguese sequence, the model would try to generate the English sequence.
It’s possible to train a model that way. You’d need to write out the inference loop and pass the
model’s output back to the input. It’s slower (time steps can’t run in parallel), and a harder task to
learn (the model can’t get the end of a sentence right until it gets the beginning right), but it can
give a more stable model because the model has to learn to correct its own errors during training.
[19]: for (pt, en), en_labels in train_batches.take(1):
break
print(pt.shape)
print(en.shape)
print(en_labels.shape)
(64, 62)
(64, 58)
(64, 58)
The en and en_labels are the same, just shifted by 1:
[20]: print(en[0][:10])
print(en_labels[0][:10])
A Transformer adds a “Positional Encoding” to the embedding vectors. It uses a set of sines
and cosines at different frequencies (across the sequence). By definition nearby elements will have
similar position encodings.
The original paper uses the following formula for calculating the positional encoding:
$$PE_{(pos, 2i)} = \sin\left(pos / 10000^{2i/d_{model}}\right)$$
$$PE_{(pos, 2i+1)} = \cos\left(pos / 10000^{2i/d_{model}}\right)$$
Note: The code below implements it, but instead of interleaving the sines and cosines, the vectors
of sines and cosines are simply concatenated. Permuting the channels like this is functionally
equivalent, and just a little easier to implement and show in the plots below.
[21]: def positional_encoding(length, depth):
  depth = depth/2

  positions = np.arange(length)[:, np.newaxis]    # (seq, 1)
  depths = np.arange(depth)[np.newaxis, :]/depth  # (1, depth)

  angle_rates = 1 / (10000**depths)               # (1, depth)
  angle_rads = positions * angle_rates            # (pos, depth)

  pos_encoding = np.concatenate(
      [np.sin(angle_rads), np.cos(angle_rads)],
      axis=-1)

  return tf.cast(pos_encoding, dtype=tf.float32)
The position encoding function is a stack of sines and cosines that vibrate at different frequencies
depending on their location along the depth of the embedding vector. They vibrate across the
position axis.
[22]: pos_encoding = positional_encoding(length=2048, depth=512)

# Check the shape.
print(pos_encoding.shape)
(2048, 512)
By definition these vectors align well with nearby vectors along the position axis. Below the position
encoding vectors are normalized and the vector from position 1000 is compared, by dot-product,
to all the others:
[23]: pos_encoding /= tf.norm(pos_encoding, axis=1, keepdims=True)
p = pos_encoding[1000]
dots = tf.einsum('pd,d -> p', pos_encoding, p)
plt.subplot(2,1,1)
plt.plot(dots)
plt.ylim([0,1])
plt.plot([950, 950, float('nan'), 1050, 1050],
[0,1,float('nan'),0,1], color='k', label='Zoom')
plt.legend()
plt.subplot(2,1,2)
plt.plot(dots)
plt.xlim([950, 1050])
plt.ylim([0,1])
So use this to create a PositionalEmbedding layer that looks up a token’s embedding vector and
adds the position vector:
[24]: class PositionalEmbedding(tf.keras.layers.Layer):
  def __init__(self, vocab_size, d_model):
    super().__init__()
    self.d_model = d_model
    self.embedding = tf.keras.layers.Embedding(vocab_size, d_model, mask_zero=True)
    self.pos_encoding = positional_encoding(length=2048, depth=d_model)

  def compute_mask(self, *args, **kwargs):
    return self.embedding.compute_mask(*args, **kwargs)

  def call(self, x):
    length = tf.shape(x)[1]
    x = self.embedding(x)
    # This factor sets the relative scale of the embedding and positional_encoding.
    x *= tf.math.sqrt(tf.cast(self.d_model, tf.float32))
    x = x + self.pos_encoding[tf.newaxis, :length, :]
    return x
Note: The original paper, section 3.4 and 5.1, uses a single tokenizer and weight matrix
for both the source and target languages. This tutorial uses two separate tokenizers
and weight matrices.
[25]: embed_pt = PositionalEmbedding(vocab_size=tokenizers.pt.get_vocab_size(), d_model=512)
embed_en = PositionalEmbedding(vocab_size=tokenizers.en.get_vocab_size(), d_model=512)

pt_emb = embed_pt(pt)
en_emb = embed_en(en)
[26]: en_emb._keras_mask
[27]: class BaseAttention(tf.keras.layers.Layer):
  def __init__(self, **kwargs):
    super().__init__()
    self.mha = tf.keras.layers.MultiHeadAttention(**kwargs)
    self.layernorm = tf.keras.layers.LayerNormalization()
    self.add = tf.keras.layers.Add()
Attention refresher Before you get into the specifics of each usage, here is a quick refresher on
how attention works:
The base attention layer
There are two inputs:
1. The query sequence; the sequence being processed; the sequence doing the attending (bottom).
2. The context sequence; the sequence being attended to (left).
The output has the same shape as the query-sequence.
The common comparison is that this operation is like a dictionary lookup. A fuzzy, differentiable,
vectorized dictionary lookup.
Here’s a regular Python dictionary, with 3 keys and 3 values, being passed a single query:
d = {'color': 'blue', 'age': 22, 'type': 'pickup'}
result = d['color']
• The query is what you’re trying to find.
• The key is the sort of information the dictionary has.
• The value is that information.
When you look up a query in a regular dictionary, the dictionary finds the matching key, and
returns its associated value. The query either has a matching key or it doesn’t. You can imagine
a fuzzy dictionary where the keys don’t have to match perfectly. If you looked up d["species"]
in the dictionary above, maybe you’d want it to return "pickup" since that’s the best match for
the query.
An attention layer does a fuzzy lookup like this, but it’s not just looking for the best key. It
combines the values based on how well the query matches each key.
How does that work? In an attention layer the query, key, and value are each vectors. Instead of
doing a hash lookup, the attention layer combines the query and key vectors to determine how well
they match: the “attention score”. The layer returns the average across all the values, weighted
by the “attention scores”.
Each location in the query-sequence provides a query vector. Each location in the context sequence
provides a key and value vector. The input vectors are not used directly: the
layers.MultiHeadAttention layer includes layers.Dense layers to project the input vectors
before using them.
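As a rough sketch of that core computation (this is not the tutorial’s code; it ignores the learned projections, multiple heads, and batching), the weighted lookup looks like this:

def dot_product_attention(query, keys, values):
  # query: (num_queries, depth); keys: (num_keys, depth);
  # values: (num_keys, value_depth).
  depth = tf.cast(tf.shape(keys)[-1], tf.float32)
  scores = tf.matmul(query, keys, transpose_b=True) / tf.sqrt(depth)
  weights = tf.nn.softmax(scores, axis=-1)  # One weight per (query, key) pair.
  return tf.matmul(weights, values)         # Weighted average of the values.

q = tf.random.normal([2, 8])    # 2 queries.
k = tf.random.normal([5, 8])    # 5 keys.
v = tf.random.normal([5, 16])   # 5 values.
print(dot_product_attention(q, k, v).shape)  # (2, 16)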
1.5.4 The cross attention layer
At the literal center of the Transformer is the cross-attention layer. This layer connects the encoder
and decoder. This layer is the most straightforward use of attention in the model; it performs the
same task as the attention block in the NMT with attention tutorial.
The cross attention layer
To implement this you pass the target sequence x as the query and the context sequence as the
key/value when calling the mha layer:
[28]: class CrossAttention(BaseAttention):
  def call(self, x, context):
    attn_output, attn_scores = self.mha(
        query=x,
        key=context,
        value=context,
        return_attention_scores=True)

    # Cache the attention scores for plotting later.
    self.last_attn_scores = attn_scores

    x = self.add([x, attn_output])
    x = self.layernorm(x)
    return x
The caricature below shows how information flows through this layer. The columns represent the
weighted sum over the context sequence.
For simplicity the residual connections are not shown.
The cross attention layer
The output length is the length of the query sequence, and not the length of the context key/value
sequence.
The diagram is further simplified, below. There’s no need to draw the entire “Attention weights”
matrix. The point is that each query location can see all the key/value pairs in the context, but
no information is exchanged between the queries.
Each query sees the whole context.
Test run it on sample inputs:
[29]: sample_ca = CrossAttention(num_heads=2, key_dim=512)
print(pt_emb.shape)
print(en_emb.shape)
print(sample_ca(en_emb, pt_emb).shape)
(64, 62, 512)
(64, 58, 512)
(64, 58, 512)
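The encoder processes its input with a global self-attention layer, where the sequence attends to itself. A minimal version following the BaseAttention pattern above (mirroring CrossAttention, but with the sequence supplying the query, key, and value), along with a sample layer for the test below:

class GlobalSelfAttention(BaseAttention):
  def call(self, x):
    attn_output = self.mha(
        query=x,
        value=x,
        key=x)
    x = self.add([x, attn_output])
    x = self.layernorm(x)
    return x

sample_gsa = GlobalSelfAttention(num_heads=2, key_dim=512)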
print(pt_emb.shape)
print(sample_gsa(pt_emb).shape)
Again, the residual connections are omitted for clarity.
It’s more compact, and just as accurate to draw it like this:
The global self attention layer
The causal mask ensures that each location only has access to the locations that come before it:
The causal self attention layer
Again, the residual connections are omitted for simplicity.
The more compact representation of this layer would be:
The causal self attention layer
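A minimal CausalSelfAttention, again following the BaseAttention pattern; the causal masking is handled by passing use_causal_mask=True to the underlying MultiHeadAttention call:

class CausalSelfAttention(BaseAttention):
  def call(self, x):
    attn_output = self.mha(
        query=x,
        value=x,
        key=x,
        use_causal_mask=True)
    x = self.add([x, attn_output])
    x = self.layernorm(x)
    return x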
Test out the layer:
[33]: sample_csa = CausalSelfAttention(num_heads=2, key_dim=512)
print(en_emb.shape)
print(sample_csa(en_emb).shape)
Because of the causal mask, outputs for early sequence elements don’t depend on later elements, so
it shouldn’t matter whether you trim elements before or after applying the layer:
[34]: out1 = sample_csa(embed_en(en[:, :3]))
out2 = sample_csa(embed_en(en))[:, :3]

tf.reduce_max(abs(out1 - out2)).numpy()
[34]: 4.7683716e-07
Note: When using Keras masks, the output values at invalid locations are not well defined. So the
above may not hold for masked regions.
The feed-forward network consists of two dense layers with a residual connection and a layer
normalization:
[35]: class FeedForward(tf.keras.layers.Layer):
  def __init__(self, d_model, dff, dropout_rate=0.1):
    super().__init__()
    self.seq = tf.keras.Sequential([
        tf.keras.layers.Dense(dff, activation='relu'),
        tf.keras.layers.Dense(d_model),
        tf.keras.layers.Dropout(dropout_rate)])
    self.add = tf.keras.layers.Add()
    self.layer_norm = tf.keras.layers.LayerNormalization()

  def call(self, x):
    x = self.add([x, self.seq(x)])
    x = self.layer_norm(x)
    return x
Test the layer; the output is the same shape as the input:
[36]: sample_ffn = FeedForward(512, 2048)
print(en_emb.shape)
print(sample_ffn(en_emb).shape)
The encoder layer combines a GlobalSelfAttention layer with a FeedForward network:
[37]: class EncoderLayer(tf.keras.layers.Layer):
  def __init__(self, *, d_model, num_heads, dff, dropout_rate=0.1):
    super().__init__()

    self.self_attention = GlobalSelfAttention(
        num_heads=num_heads,
        key_dim=d_model,
        dropout=dropout_rate)

    self.ffn = FeedForward(d_model, dff)

  def call(self, x):
    x = self.self_attention(x)
    x = self.ffn(x)
    return x
And a quick test; the output will have the same shape as the input:
[38]: sample_encoder_layer = EncoderLayer(d_model=512, num_heads=8, dff=2048)
print(pt_emb.shape)
print(sample_encoder_layer(pt_emb).shape)
1.5.9 The encoder
Next build the encoder.
The encoder
The encoder consists of:
• A PositionalEmbedding layer at the input.
• A stack of EncoderLayer layers.
[39]: class Encoder(tf.keras.layers.Layer):
  def __init__(self, *, num_layers, d_model, num_heads,
               dff, vocab_size, dropout_rate=0.1):
    super().__init__()

    self.d_model = d_model
    self.num_layers = num_layers

    self.pos_embedding = PositionalEmbedding(
        vocab_size=vocab_size, d_model=d_model)

    self.enc_layers = [
        EncoderLayer(d_model=d_model,
                     num_heads=num_heads,
                     dff=dff,
                     dropout_rate=dropout_rate)
        for _ in range(num_layers)]
    self.dropout = tf.keras.layers.Dropout(dropout_rate)

  def call(self, x):
    # `x` is token IDs, shape (batch, seq_len).
    x = self.pos_embedding(x)  # Shape: `(batch_size, seq_len, d_model)`.

    # Add dropout.
    x = self.dropout(x)

    for i in range(self.num_layers):
      x = self.enc_layers[i](x)

    return x  # Shape: `(batch_size, seq_len, d_model)`.
Instantiate the encoder and test it:
[40]: sample_encoder = Encoder(num_layers=4, d_model=512, num_heads=8,
                          dff=2048, vocab_size=8500)

sample_encoder_output = sample_encoder(pt, training=False)

# Print the shapes.
print(pt.shape)
print(sample_encoder_output.shape)
(64, 62)
(64, 62, 512)
The decoder layer combines a CausalSelfAttention layer, a CrossAttention layer, and a
FeedForward network:
[41]: class DecoderLayer(tf.keras.layers.Layer):
  def __init__(self, *, d_model, num_heads, dff, dropout_rate=0.1):
    super().__init__()

    self.causal_self_attention = CausalSelfAttention(
        num_heads=num_heads,
        key_dim=d_model,
        dropout=dropout_rate)

    self.cross_attention = CrossAttention(
        num_heads=num_heads,
        key_dim=d_model,
        dropout=dropout_rate)

    self.ffn = FeedForward(d_model, dff)

  def call(self, x, context):
    x = self.causal_self_attention(x=x)
    x = self.cross_attention(x=x, context=context)

    # Cache the last attention scores for plotting later.
    self.last_attn_scores = self.cross_attention.last_attn_scores

    x = self.ffn(x)  # Shape: `(batch_size, seq_len, d_model)`.
    return x
Test the decoder layer:
[42]: sample_decoder_layer = DecoderLayer(d_model=512, num_heads=8, dff=2048)
sample_decoder_layer_output = sample_decoder_layer(
x=en_emb, context=pt_emb)
print(en_emb.shape)
print(pt_emb.shape)
print(sample_decoder_layer_output.shape) # `(batch_size, seq_len, d_model)`
[43]: class Decoder(tf.keras.layers.Layer):
  def __init__(self, *, num_layers, d_model, num_heads, dff, vocab_size,
               dropout_rate=0.1):
    super().__init__()

    self.d_model = d_model
    self.num_layers = num_layers

    self.pos_embedding = PositionalEmbedding(vocab_size=vocab_size,
                                             d_model=d_model)
    self.dropout = tf.keras.layers.Dropout(dropout_rate)
    self.dec_layers = [
        DecoderLayer(d_model=d_model, num_heads=num_heads,
                     dff=dff, dropout_rate=dropout_rate)
        for _ in range(num_layers)]

    self.last_attn_scores = None
  def call(self, x, context):
    # `x` is token IDs, shape (batch, target_seq_len).
    x = self.pos_embedding(x)  # Shape: `(batch_size, target_seq_len, d_model)`.

    x = self.dropout(x)

    for i in range(self.num_layers):
      x = self.dec_layers[i](x, context)

    self.last_attn_scores = self.dec_layers[-1].last_attn_scores

    return x  # Shape: `(batch_size, target_seq_len, d_model)`.
Instantiate the decoder and test it:
[44]: sample_decoder = Decoder(num_layers=4, d_model=512, num_heads=8,
                          dff=2048, vocab_size=8000)

output = sample_decoder(
    x=en,
    context=pt_emb)

# Print the shapes.
print(en.shape)
print(pt_emb.shape)
print(output.shape)
x=en,
context=pt_emb)
(64, 58)
(64, 62, 512)
(64, 58, 512)
Having created the Transformer encoder and decoder, it’s time to build the Transformer model and
train it.
Figures: A 1-layer Transformer, a 4-layer Transformer, and the RNN+Attention model.
Create the Transformer by extending tf.keras.Model:
Note: The original paper, section 3.4, shares the weight matrix between the embedding
layer and the final linear layer. To keep things simple, this tutorial uses two separate
weight matrices.
[46]: class Transformer(tf.keras.Model):
  def __init__(self, *, num_layers, d_model, num_heads, dff,
               input_vocab_size, target_vocab_size, dropout_rate=0.1):
    super().__init__()

    self.encoder = Encoder(num_layers=num_layers, d_model=d_model,
                           num_heads=num_heads, dff=dff,
                           vocab_size=input_vocab_size,
                           dropout_rate=dropout_rate)

    self.decoder = Decoder(num_layers=num_layers, d_model=d_model,
                           num_heads=num_heads, dff=dff,
                           vocab_size=target_vocab_size,
                           dropout_rate=dropout_rate)

    self.final_layer = tf.keras.layers.Dense(target_vocab_size)

  def call(self, inputs):
    # To use a Keras model with `.fit`, pass all inputs in the first argument.
    context, x = inputs

    context = self.encoder(context)  # `(batch_size, context_len, d_model)`
    x = self.decoder(x, context)     # `(batch_size, target_len, d_model)`

    # The final linear layer produces the logits.
    logits = self.final_layer(x)  # `(batch_size, target_len, target_vocab_size)`

    try:
      # Drop the keras mask, so it doesn't scale the losses/metrics.
      # b/250038731
      del logits._keras_mask
    except AttributeError:
      pass

    # Return the final output.
    return logits
1.6.1 Hyperparameters
To keep this example small and relatively fast, the number of layers (num_layers), the dimension-
ality of the embeddings (d_model), and the internal dimensionality of the FeedForward layer (dff)
have been reduced.
The base model described in the original Transformer paper used num_layers=6, d_model=512,
and dff=2048.
The number of self-attention heads remains the same (num_heads=8).
[47]: num_layers = 4
d_model = 128
dff = 512
num_heads = 8
dropout_rate = 0.1
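Then instantiate the Transformer with these hyperparameters, taking the vocabulary sizes from the tokenizers (a minimal sketch consistent with the test below):

transformer = Transformer(
    num_layers=num_layers,
    d_model=d_model,
    num_heads=num_heads,
    dff=dff,
    input_vocab_size=tokenizers.pt.get_vocab_size().numpy(),
    target_vocab_size=tokenizers.en.get_vocab_size().numpy(),
    dropout_rate=dropout_rate)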
Test it:
[49]: output = transformer((pt, en))
print(en.shape)
print(pt.shape)
print(output.shape)
(64, 58)
(64, 62)
(64, 58, 7010)
Model: "transformer"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
encoder_1 (Encoder) multiple 3632768
=================================================================
Total params: 10184162 (38.85 MB)
Trainable params: 10184162 (38.85 MB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
1.7 Training
It’s time to prepare the model and start training it.
$$lrate = d_{model}^{-0.5} \cdot \min\left(step\_num^{-0.5},\ step\_num \cdot warmup\_steps^{-1.5}\right)$$
Implement this as a custom learning-rate schedule:
class CustomSchedule(tf.keras.optimizers.schedules.LearningRateSchedule):
  def __init__(self, d_model, warmup_steps=4000):
    super().__init__()

    self.d_model = d_model
    self.d_model = tf.cast(self.d_model, tf.float32)

    self.warmup_steps = warmup_steps

  def __call__(self, step):
    step = tf.cast(step, dtype=tf.float32)
    arg1 = tf.math.rsqrt(step)
    arg2 = step * (self.warmup_steps ** -1.5)

    return tf.math.rsqrt(self.d_model) * tf.math.minimum(arg1, arg2)
Instantiate the optimizer with this schedule:
learning_rate = CustomSchedule(d_model)

optimizer = tf.keras.optimizers.Adam(learning_rate, beta_1=0.9, beta_2=0.98,
                                     epsilon=1e-9)
Since the target sequences are padded, it’s important to apply a padding mask when calculating
the loss and accuracy:
def masked_loss(label, pred):
  mask = label != 0
  loss_object = tf.keras.losses.SparseCategoricalCrossentropy(
      from_logits=True, reduction='none')
  loss = loss_object(label, pred)

  mask = tf.cast(mask, dtype=loss.dtype)
  loss *= mask

  loss = tf.reduce_sum(loss)/tf.reduce_sum(mask)
  return loss


def masked_accuracy(label, pred):
  pred = tf.argmax(pred, axis=2)
  label = tf.cast(label, pred.dtype)
  match = label == pred

  mask = label != 0

  match = match & mask

  match = tf.cast(match, dtype=tf.float32)
  mask = tf.cast(mask, dtype=tf.float32)
  return tf.reduce_sum(match)/tf.reduce_sum(mask)
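Compile the model with the masked loss and accuracy, and the optimizer defined above:

transformer.compile(
    loss=masked_loss,
    optimizer=optimizer,
    metrics=[masked_accuracy])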
[57]: transformer.fit(train_batches,
epochs=20,
validation_data=val_batches)
Epoch 1/20
WARNING: All log messages before absl::InitializeLog() is called are written to
STDERR
I0000 00:00:1700138271.584619 39293 device_compiler.h:186] Compiled cluster
using XLA! This line is logged at most once for the lifetime of the process.
810/810 [==============================] - 233s 254ms/step - loss: 6.5998 -
masked_accuracy: 0.1435 - val_loss: 5.0554 - val_masked_accuracy: 0.2485
Epoch 2/20
810/810 [==============================] - 194s 239ms/step - loss: 4.5772 -
masked_accuracy: 0.2972 - val_loss: 4.1541 - val_masked_accuracy: 0.3407
Epoch 3/20
810/810 [==============================] - 194s 239ms/step - loss: 3.8242 -
masked_accuracy: 0.3798 - val_loss: 3.5155 - val_masked_accuracy: 0.4196
Epoch 4/20
810/810 [==============================] - 194s 239ms/step - loss: 3.2928 -
masked_accuracy: 0.4375 - val_loss: 3.0246 - val_masked_accuracy: 0.4807
Epoch 5/20
810/810 [==============================] - 193s 238ms/step - loss: 2.9003 -
masked_accuracy: 0.4818 - val_loss: 2.7157 - val_masked_accuracy: 0.5229
Epoch 6/20
810/810 [==============================] - 192s 236ms/step - loss: 2.5720 -
masked_accuracy: 0.5215 - val_loss: 2.5050 - val_masked_accuracy: 0.5463
Epoch 7/20
810/810 [==============================] - 194s 239ms/step - loss: 2.2960 -
masked_accuracy: 0.5575 - val_loss: 2.4090 - val_masked_accuracy: 0.5624
Epoch 8/20
810/810 [==============================] - 193s 237ms/step - loss: 2.1058 -
masked_accuracy: 0.5833 - val_loss: 2.2977 - val_masked_accuracy: 0.5793
Epoch 9/20
810/810 [==============================] - 193s 238ms/step - loss: 1.9573 -
masked_accuracy: 0.6036 - val_loss: 2.1928 - val_masked_accuracy: 0.5937
Epoch 10/20
810/810 [==============================] - 193s 237ms/step - loss: 1.8407 -
masked_accuracy: 0.6204 - val_loss: 2.1310 - val_masked_accuracy: 0.6077
Epoch 11/20
810/810 [==============================] - 194s 239ms/step - loss: 1.7406 -
masked_accuracy: 0.6353 - val_loss: 2.1256 - val_masked_accuracy: 0.6066
Epoch 12/20
810/810 [==============================] - 193s 238ms/step - loss: 1.6556 -
masked_accuracy: 0.6480 - val_loss: 2.1054 - val_masked_accuracy: 0.6122
Epoch 13/20
810/810 [==============================] - 193s 238ms/step - loss: 1.5841 -
masked_accuracy: 0.6585 - val_loss: 2.0783 - val_masked_accuracy: 0.6151
Epoch 14/20
810/810 [==============================] - 190s 234ms/step - loss: 1.5220 -
masked_accuracy: 0.6682 - val_loss: 2.0700 - val_masked_accuracy: 0.6180
Epoch 15/20
810/810 [==============================] - 192s 236ms/step - loss: 1.4666 -
masked_accuracy: 0.6766 - val_loss: 2.0557 - val_masked_accuracy: 0.6194
Epoch 16/20
810/810 [==============================] - 194s 239ms/step - loss: 1.4141 -
masked_accuracy: 0.6855 - val_loss: 2.0391 - val_masked_accuracy: 0.6268
Epoch 17/20
810/810 [==============================] - 193s 238ms/step - loss: 1.3688 -
masked_accuracy: 0.6924 - val_loss: 2.0478 - val_masked_accuracy: 0.6255
Epoch 18/20
810/810 [==============================] - 192s 236ms/step - loss: 1.3287 -
masked_accuracy: 0.6987 - val_loss: 2.0598 - val_masked_accuracy: 0.6251
Epoch 19/20
810/810 [==============================] - 194s 239ms/step - loss: 1.2885 -
masked_accuracy: 0.7053 - val_loss: 2.0598 - val_masked_accuracy: 0.6258
Epoch 20/20
810/810 [==============================] - 193s 238ms/step - loss: 1.2529 -
masked_accuracy: 0.7107 - val_loss: 2.0712 - val_masked_accuracy: 0.6274
With the model trained, write a Translator class that runs the inference loop: tokenize the input
sentence, then repeatedly run the decoder, feeding its predictions back in as input.
class Translator(tf.Module):
  def __init__(self, tokenizers, transformer):
    self.tokenizers = tokenizers
    self.transformer = transformer

  def __call__(self, sentence, max_length=MAX_TOKENS):
    # The input sentence is Portuguese; the tokenizer adds `[START]`/`[END]` tokens.
    assert isinstance(sentence, tf.Tensor)
    if len(sentence.shape) == 0:
      sentence = sentence[tf.newaxis]

    sentence = self.tokenizers.pt.tokenize(sentence).to_tensor()

    encoder_input = sentence

    # The output language is English; initialize the output with its
    # `[START]` token.
    start_end = self.tokenizers.en.tokenize([''])[0]
    start = start_end[0][tf.newaxis]
    end = start_end[1][tf.newaxis]
    # `tf.TensorArray` is required here (instead of a Python list), so that the
    # dynamic loop can be traced by `tf.function`.
    output_array = tf.TensorArray(dtype=tf.int64, size=0, dynamic_size=True)
    output_array = output_array.write(0, start)
    for i in tf.range(max_length):
      output = tf.transpose(output_array.stack())
      predictions = self.transformer([encoder_input, output], training=False)

      # Select the last token from the `seq_len` dimension.
      predictions = predictions[:, -1:, :]  # Shape: `(batch_size, 1, vocab_size)`.
      predicted_id = tf.argmax(predictions, axis=-1)

      # Concatenate the `predicted_id` to the output, which is given to the
      # decoder as its input.
      output_array = output_array.write(i+1, predicted_id[0])

      if predicted_id == end:
        break
    output = tf.transpose(output_array.stack())
    # The output shape is `(1, tokens)`.
    text = tokenizers.en.detokenize(output)[0]  # Shape: `()`.
    tokens = tokenizers.en.lookup(output)[0]

    # `tf.function` prevents us from using the attention weights calculated on
    # the last iteration of the loop, so recalculate them outside the loop.
    self.transformer([encoder_input, output[:, :-1]], training=False)
    attention_weights = self.transformer.decoder.last_attn_scores

    return text, tokens, attention_weights
Note: This function uses an unrolled loop, not a dynamic loop. It generates MAX_TOKENS on every
call. Refer to the NMT with attention tutorial for an example implementation with a dynamic
loop, which can be much more efficient.
Create an instance of this Translator class, and try it out a few times:
[59]: translator = Translator(tokenizers, transformer)
print(f'{"Ground truth":15s}: {ground_truth}')
Example 1:
[61]: sentence = 'este é um problema que temos que resolver.'
ground_truth = 'this is a problem we have to solve .'

translated_text, translated_tokens, attention_weights = translator(
    tf.constant(sentence))
print_translation(sentence, translated_text, ground_truth)
ground_truth = "so i'll just share with you some stories very quickly of some␣
↪magical things that have happened."
You can use the attention weights the Translator returns to visualize the internal workings of
the model. For example:
[64]: sentence = 'este é o primeiro livro que eu fiz.'
ground_truth = "this is the first book i've ever done."

translated_text, translated_tokens, attention_weights = translator(
    tf.constant(sentence))
print_translation(sentence, translated_text, ground_truth)
Create a function that plots the attention weights for a single head:
[65]: def plot_attention_head(in_tokens, translated_tokens, attention):
  # The model didn't generate `[START]` in the output. Skip it.
  translated_tokens = translated_tokens[1:]

  ax = plt.gca()
  ax.matshow(attention)

  ax.set_xticks(range(len(in_tokens)))
  ax.set_yticks(range(len(translated_tokens)))

  labels = [label.decode('utf-8') for label in in_tokens.numpy()]
  ax.set_xticklabels(labels, rotation=90)

  labels = [label.decode('utf-8') for label in translated_tokens.numpy()]
  ax.set_yticklabels(labels)
[66]: head = 0
# Shape: `(batch=1, num_heads, seq_len_q, seq_len_k)`.
attention_heads = tf.squeeze(attention_weights, 0)
attention = attention_heads[head]
attention.shape
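These are the input (Portuguese) tokens; a minimal sketch following the same lookup pattern used in plot_attention_weights below:

in_tokens = tf.convert_to_tensor([sentence])
in_tokens = tokenizers.pt.tokenize(in_tokens).to_tensor()
in_tokens = tokenizers.pt.lookup(in_tokens)[0]
in_tokens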
And these are the output (English translation) tokens:
[68]: translated_tokens
[69]: plot_attention_head(in_tokens, translated_tokens, attention)

The function below plots a grid with one subplot per attention head:
[70]: def plot_attention_weights(sentence, translated_tokens, attention_heads):
  in_tokens = tf.convert_to_tensor([sentence])
  in_tokens = tokenizers.pt.tokenize(in_tokens).to_tensor()
  in_tokens = tokenizers.pt.lookup(in_tokens)[0]

  fig = plt.figure(figsize=(16, 8))

  for h, head in enumerate(attention_heads):
    ax = fig.add_subplot(2, 4, h+1)

    plot_attention_head(in_tokens, translated_tokens, head)

    ax.set_xlabel(f'Head {h+1}')

  plt.tight_layout()
  plt.show()
[71]: plot_attention_weights(sentence,
translated_tokens,
attention_weights[0])
The model can handle unfamiliar words: neither 'triceratops' nor 'enciclopédia' appears in the
input dataset, and the model attempts to transliterate them even without a shared vocabulary. For
example:
[72]: sentence = 'Eu li sobre triceratops na enciclopédia.'
ground_truth = 'I read about triceratops in the encyclopedia.'

translated_text, translated_tokens, attention_weights = translator(
    tf.constant(sentence))
print_translation(sentence, translated_text, ground_truth)

plot_attention_weights(sentence, translated_tokens, attention_weights[0])
1.10 Export the model
You have tested the model, and the inference is working. Next, you can export it as a
tf.saved_model. To learn about saving and loading a model in the SavedModel format, see
this guide.
Create a class called ExportTranslator by subclassing tf.Module, with a tf.function on the
__call__ method:
[73]: class ExportTranslator(tf.Module):
  def __init__(self, translator):
    self.translator = translator

  @tf.function(input_signature=[tf.TensorSpec(shape=[], dtype=tf.string)])
  def __call__(self, sentence):
    (result,
     tokens,
     attention_weights) = self.translator(sentence, max_length=MAX_TOKENS)

    return result
In the above tf.function, only the output sentence is returned. Thanks to the non-strict execution
in tf.function, any unnecessary values are never computed.
Wrap translator in the newly created ExportTranslator:
[74]: translator = ExportTranslator(translator)
Since the model decodes the predictions using tf.argmax, the predictions are deterministic. The
original model and one reloaded from its SavedModel should give identical predictions:
[75]: translator('este é o primeiro livro que eu fiz.').numpy()
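To verify, save the model with tf.saved_model.save, reload it, and run the same sentence through the reloaded copy (a minimal sketch; the export directory name 'translator' is a choice):

tf.saved_model.save(translator, export_dir='translator')
reloaded = tf.saved_model.load('translator')

reloaded('este é o primeiro livro que eu fiz.').numpy()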
1.11 Conclusion
In this tutorial you learned about:
• Transformers and their significance in machine learning
• Attention, self-attention and multi-head attention
• Positional encoding with embeddings
• The encoder-decoder architecture of the original Transformer
• Masking in self-attention
• How to put it all together to translate text
The downsides of this architecture are:
• For a time-series, the output for a time-step is calculated from the entire history instead of
only the inputs and current hidden-state. This may be less efficient.
• If the input has a temporal/spatial relationship, like text or images, some positional encoding
must be added or the model will effectively see a bag of words.
If you want to practice, there are many things you could try with it. For example:
• Use a different dataset to train the Transformer.
• Create the “Base Transformer” or “Transformer XL” configurations from the original paper
by changing the hyperparameters.
• Use the layers defined here to create an implementation of BERT
• Use Beam search to get better predictions.
There are a wide variety of Transformer-based models, many of which improve upon the 2017 version
of the original Transformer with encoder-decoder, encoder-only and decoder-only architectures.
Some of these models are covered in the following research publications:
• “Efficient Transformers: a survey” (Tay et al., 2022)
• “Formal algorithms for Transformers” (Phuong and Hutter, 2022)
• T5 (“Exploring the limits of transfer learning with a unified text-to-text Transformer”) (Raffel
et al., 2019)
You can learn more about other models in the following Google blog posts:
• PaLM
• LaMDA
• MUM
• Reformer
• BERT
If you’re interested in studying how attention-based models have been applied in tasks outside of
natural language processing, check out the following resources:
• Vision Transformer (ViT): Transformers for image recognition at scale
• Multi-task multitrack music transcription (MT3) with a Transformer
• Code generation with AlphaCode
• Reinforcement learning with multi-game decision Transformers
• Protein structure prediction with AlphaFold
• OptFormer: Towards universal hyperparameter optimization with Transformers