
Positional Encoding in Transformers

Last Updated : 15 May, 2024

In the domain of natural language processing (NLP), Transformer models have fundamentally reshaped our approach to sequence-to-sequence tasks. However, unlike conventional recurrent neural networks (RNNs) or convolutional neural networks (CNNs), Transformers have no inherent awareness of token order. In this article, we will look at positional encoding, the technique that gives Transformer models an understanding of sequence order.

Why are positional encodings important?

Positional encodings are crucial in Transformer models for several reasons:

  • Preserving Sequence Order: Transformer models process tokens in parallel, lacking inherent knowledge of token order. Positional encodings provide the model with information about the position of tokens in the sequence, ensuring that the model can differentiate between tokens based on their position. This is essential for tasks where word order matters, such as language translation and text generation.
  • Maintaining Contextual Information: In natural language processing tasks, the meaning of a word often depends on its position in the sentence. For example, in the sentence "The cat sat on the mat," the word "cat" plays a different role than in "The mat sat on the cat." Positional encodings let the Transformer preserve this contextual information.
  • Enhancing Generalization: By incorporating positional information, transformer models can generalize better across sequences of different lengths. This is particularly important for tasks where the length of the input sequence varies, such as document summarization or question answering. Positional encodings enable the model to handle input sequences of varying lengths without sacrificing performance.
  • Mitigating Symmetry: Without positional encodings, the self-attention mechanism in Transformer models would treat tokens symmetrically, potentially leading to ambiguous representations. Positional encodings introduce an asymmetry into the model, ensuring that tokens at different positions are treated differently, thereby improving the model's ability to capture long-range dependencies. A short demonstration of this symmetry follows this list.
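
To make the symmetry point concrete, here is a minimal NumPy sketch of scaled dot-product self-attention with identity query/key/value projections (a simplification chosen purely to keep the demo short, not the full multi-head layer): shuffling the input tokens merely shuffles the output rows, so without positional encodings the layer cannot tell positions apart.

import numpy as np

def self_attention(X):
    # Scaled dot-product self-attention with Q = K = V = X (identity projections,
    # a simplification for this demo) and a row-wise softmax over the scores
    d = X.shape[-1]
    scores = X @ X.T / np.sqrt(d)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ X

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))        # 6 tokens, 4-dimensional embeddings, no positional information
perm = rng.permutation(6)          # a random reordering of the tokens

out = self_attention(X)
out_shuffled = self_attention(X[perm])

# Shuffling the input only shuffles the output: the attention layer itself is order-blind
print(np.allclose(out[perm], out_shuffled))  # True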

In summary, positional encodings are essential in Transformer models for preserving sequence order, maintaining contextual information, enhancing generalization, and mitigating symmetry. They enable Transformer models to effectively process and understand input sequences, leading to improved performance across a wide range of natural language processing tasks.

Example of Positional Encoding:

Let's consider a simple example to illustrate the concept of positional encoding in the context of a Transformer model.

Suppose we have a Transformer model tasked with translating English sentences into French. One of the sentences in English is:

"The cat sat on the mat."

Before the sentence is fed into the Transformer model, it undergoes tokenization, where each word is converted into a token. Let's assume the tokens for this sentence are:

["The", "cat" , "sat", "on", "the" ,"mat"]

Next, each token is mapped to a high-dimensional vector representation through an embedding layer. These embeddings encode semantic information about the words in the sentence. However, they lack information about the order of the words.

\text{Embeddings} = \{E_1, E_2, E_3, E_4, E_5, E_6\}

where each E_i is a 4-dimensional vector.

This is where positional encoding comes into play. To ensure that the model understands the order of the words in the sequence, positional encodings are added to the word embeddings. These encodings provide each token with a unique positional representation.
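
Concretely, the vector fed into the first Transformer layer for the i-th token is the element-wise sum of its embedding and its positional encoding:

\text{Input}_i = E_i + \text{PE}(i)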

Calculating Positional Encodings

  • Let's say the embedding dimensionality is 4 for simplicity.
  • We'll use sine and cosine functions to generate positional encodings. Consider the following positional encodings for the tokens in our example sentence (the first two are evaluated numerically just after this list):

\text{PE}(p) = \left[\sin\left(\frac{p}{10000^{2 \times 0/4}}\right), \cos\left(\frac{p}{10000^{2 \times 0/4}}\right), \sin\left(\frac{p}{10000^{2 \times 1/4}}\right), \cos\left(\frac{p}{10000^{2 \times 1/4}}\right)\right] \quad \text{for } p = 1, 2, \ldots, 6

  • These positional encodings are added element-wise to the word embeddings. The resulting vectors contain both semantic and positional information, allowing the Transformer model to understand not only the meaning of each word but also its position in the sequence.
  • This example illustrates how positional encoding ensures that the Transformer model can effectively process and understand input sequences by incorporating information about the order of the tokens.
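
Working the first two of these out by hand (with d_model = 4, the two frequencies are 1/10000^{0} = 1 and 1/10000^{0.5} = 1/100):

\text{PE}(1) = [\sin(1), \cos(1), \sin(0.01), \cos(0.01)] \approx [0.841, 0.540, 0.010, 1.000] \newline \text{PE}(2) = [\sin(2), \cos(2), \sin(0.02), \cos(0.02)] \approx [0.909, -0.416, 0.020, 1.000]

Every position therefore receives a distinct pattern of values, and nearby positions receive similar (but not identical) patterns.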

Positional Encoding Layer in Transformers

The Positional Encoding layer in Transformers plays a critical role by providing necessary positional information to the model. This is particularly important because the Transformer architecture, unlike RNNs or LSTMs, processes input sequences in parallel and lacks inherent mechanisms to account for the sequential order of tokens. The mathematical intuition behind the Positional Encoding layer in Transformers is centered on enabling the model to incorporate information about the order of tokens in a sequence.

Positional encodings utilize a specific mathematical formula to generate a unique encoding for each position in the input sequence. Here’s a closer look at the methodology:

  • Formula for Positional Encoding: For each position p in the sequence, and for each pair of dimensions 2i and 2i+1 in the encoding vector:
    • Even-indexed dimensions: \text{PE}(p, 2i) = \sin\left(\frac{p}{10000^{\frac{2i}{d_{\text{model}}}}}\right)
    • Odd-indexed dimensions: \text{PE}(p, 2i+1) = \cos\left(\frac{p}{10000^{\frac{2i}{d_{\text{model}}}}}\right)

These formulas use sine and cosine functions to create wave-like patterns that vary across the sequence positions. The usage of sine for even indices and cosine for odd indices ensures a rich combination of features that can effectively represent positional information across different sequence lengths.

Code Implementation of Positional Encoding in Transformers

The positional_encoding function defined below generates a positional encoding matrix of the kind widely used in models like the Transformer to give the model information about the relative or absolute position of tokens in a sequence. Here is a breakdown of what each part does.

  1. Function Parameters:
    • position: Total positions or length of the sequence.
    • d_model: Dimensionality of the model's output.
  2. Generating the Base Matrix:
    • angle_rads: Creates a matrix where rows represent sequence positions and columns represent feature dimensions. Each position index is divided by 10000 raised to (2 * (dimension_index // 2) / d_model), so each sine/cosine pair of dimensions shares one frequency.
  3. Applying Sine and Cosine Functions:
    • Even indices: Apply the sine function to encode positions.
    • Odd indices: Apply the cosine function for a phase-shifted encoding.
  4. Creating the Positional Encoding Tensor:
    • The matrix is expanded to match input shape expectations of models like Transformers and cast to tf.float32.
  5. Output:
    • Returns a TensorFlow tensor of shape (1, position, d_model), ready to be added to input embeddings to incorporate positional information.
import numpy as np
import tensorflow as tf

def positional_encoding(position, d_model):

    # Each position index is divided by 10000^(2 * (i // 2) / d_model), so every
    # sine/cosine pair of feature dimensions shares one frequency
    positions = np.arange(position)[:, np.newaxis]    # shape: (position, 1)
    dims = np.arange(d_model)[np.newaxis, :]          # shape: (1, d_model)
    angle_rads = positions / np.power(10000, (2 * (dims // 2)) / np.float32(d_model))

    # Apply sine to even indices in the array; 2i
    angle_rads[:, 0::2] = np.sin(angle_rads[:, 0::2])

    # Apply cosine to odd indices in the array; 2i+1
    angle_rads[:, 1::2] = np.cos(angle_rads[:, 1::2])

    # Add a leading batch dimension: (1, position, d_model)
    pos_encoding = angle_rads[np.newaxis, ...]

    return tf.cast(pos_encoding, dtype=tf.float32)

# Example use
position = 50  # Length of the sequence
d_model = 512  # Dimensionality of the model's output
pos_encoding = positional_encoding(position, d_model)

print(pos_encoding.shape)  # Output the shape to verify the dimensions

Output:

(1, 50, 512)

The output above is the positional encoding tensor generated by the positional_encoding function for a sequence of length 50 and a model dimensionality of 512. Each row of the matrix corresponds to a position in the sequence, and each column represents a dimension of the model.
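
As a quick sanity check, we can rerun the function with the toy sizes from the hand-worked example (6 positions, 4 dimensions). Note that the code numbers positions from 0, so row 1 corresponds to PE(1):

# Reuse the positional_encoding function defined above with the toy sizes
small = positional_encoding(position=6, d_model=4)

# Row 1 is position 1 in the hand-worked example
print(small[0, 1].numpy())
# Approximately [0.8415 0.5403 0.01 1.], i.e. [sin(1), cos(1), sin(0.01), cos(0.01)]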

This (1, 50, 512) tensor supplies the position information that Transformer models, which do not inherently capture the order of sequential data, would otherwise lack. Here's how it functions:

  1. Enhancing Input Features: Adds a unique signal to each position, ensuring distinct representations for identical tokens at different positions.
  2. Modeling Sequential Information: Reintroduces order information to the model, essential for processing sequences effectively.
  3. Compatibility with Model Architecture: Designed to be added directly to the model's embedding layer, matching the sequence length and feature size (see the sketch after this list).
  4. Improving Model Performance: Helps the model learn patterns related to token order more efficiently, enhancing training speed and effectiveness in sequence-dependent tasks.
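
As a minimal sketch of point 3 (the vocabulary size, batch size, and layer setup below are illustrative, not from a specific codebase), the encoding is broadcast-added to the output of the embedding layer; the original Transformer paper additionally scales the embeddings by the square root of d_model before this addition:

# Minimal sketch: adding the positional encoding to token embeddings
vocab_size, d_model, seq_len = 8000, 512, 50
embedding = tf.keras.layers.Embedding(vocab_size, d_model)

token_ids = tf.random.uniform((2, seq_len), maxval=vocab_size, dtype=tf.int32)  # dummy batch of 2 sequences
x = embedding(token_ids)                            # (2, 50, 512) semantic embeddings
x = x * tf.math.sqrt(tf.cast(d_model, tf.float32))  # scaling used in the original paper
x = x + positional_encoding(seq_len, d_model)       # broadcast add of the (1, 50, 512) encodings
print(x.shape)                                      # (2, 50, 512)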

Positional Encoding in Transformers - FAQs

Why are sine and cosine functions used in positional encoding?

Sine and cosine functions are used because they provide a continuous and differentiable method to encode position information, which helps in training deep learning models. Their periodic nature allows the model to learn and generalize across different positions effectively, and their alternating use across dimensions helps in maintaining unique encodings for each position.
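
A further property, noted in the original "Attention Is All You Need" paper, is that for any fixed offset k the encoding of position p + k is a linear function (a rotation) of the encoding of position p, which is believed to make relative positions easy for the model to use. For each frequency \omega_i = \frac{1}{10000^{2i/d_{\text{model}}}}:

\begin{pmatrix} \sin(\omega_i (p+k)) \\ \cos(\omega_i (p+k)) \end{pmatrix} = \begin{pmatrix} \cos(\omega_i k) & \sin(\omega_i k) \\ -\sin(\omega_i k) & \cos(\omega_i k) \end{pmatrix} \begin{pmatrix} \sin(\omega_i p) \\ \cos(\omega_i p) \end{pmatrix}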

How are positional encodings added to input embeddings?

Positional encodings are added directly to the input embeddings at the base of the Transformer model. This means that each token’s embedding, representing semantic information, is combined with its positional encoding, ensuring that the resulting representation includes both contextual and positional information.

Can Positional Encoding Generalize to Longer Sequences Than Seen During Training?

Yes, in principle. Sinusoidal positional encodings are computed from a deterministic formula, so well-defined encodings exist for any position, including positions beyond those seen during training. How well the model actually generalizes to much longer sequences still depends on the task and training data, but the encoding itself imposes no length limit.

