Exploring Sequence Models: From RNNs to Transformers

Sequence models are deep learning models specifically designed to process sequential data. Unlike traditional CNNs, which are geared towards data organized in a grid-like structure (such as images), sequence models rely on the context provided by previous elements to make accurate predictions.

The applications of sequence modeling span various fields. For instance, it plays a vital role in Natural Language Processing (NLP) for tasks like language translation, text generation, and sentiment classification. Sequence models are also extensively used in speech recognition, where converting spoken language into text is crucial, as well as in music generation and stock market forecasting.

This blog aims to delve into different types of sequential architectures, exploring their functionality, distinctions, and real-world applications.

About Us: We are Viso.ai, the driving force behind Viso Suite, a comprehensive end-to-end computer vision platform. Our team of AI experts offers a wide array of computer vision services and an unparalleled AI vision experience. Reach out to us to schedule a demo and discover the key features of our platform.

Evolution of Sequence Models

The evolution of sequence models mirrors the overall progress in deep learning, marked by incremental advancements and significant breakthroughs in handling sequential data. These models have empowered machines to process and generate complex data sequences with increased accuracy and efficiency. This blog will shed light on the following sequence models:

  1. Recurrent Neural Networks (RNNs): Conceptualized by John Hopfield and others in the 1980s.
  2. Long Short-Term Memory (LSTM): Proposed by Sepp Hochreiter and Jürgen Schmidhuber in 1997.
  3. Gated Recurrent Unit (GRU): Introduced by Kyunghyun Cho and colleagues in 2014 as a simplified version of LSTM.
  4. Transformers: Pioneered by Vaswani et al. in 2017, marking a significant paradigm shift in sequence modeling.

Sequence Model 1: Recurrent Neural Networks (RNN)

RNNs are neural networks equipped with an internal memory that aids in predicting subsequent elements in a sequence. This memory arises from the recurrent nature of RNNs, where a hidden state is leveraged to capture context from the input sequence.

Unlike conventional feed-forward networks that process input data without memory, RNNs utilize their internal memory to analyze inputs. Therefore, the model’s prediction is influenced by what it has learned from previous time steps.

One of the key applications of RNNs is in predicting the next word (e.g., Google autocomplete) and speech recognition, where understanding the previous context is crucial for accurate predictions.

Let’s delve into the architecture of RNNs.

Input

The input provided to the model at time step t is typically denoted as x_t.

For instance, in the word “kittens,” each letter can be regarded as a distinct time step.

Hidden State

The hidden state in RNNs is pivotal for processing sequential data. Represented as h_t at time t, the hidden state acts as a memory bank. Thus, during predictions, the model factors in its accumulated knowledge over time (hidden state) along with the current input.

RNNs vs. Feed-Forward Networks

In a typical feed-forward neural network or Multi-Layer Perceptron, data flows unidirectionally from the input layer through the hidden layers to the output layer. There are no feedback loops, and the output of any layer does not impact the same layer in subsequent iterations. Each input is independent and does not influence the next input, implying a lack of long-term dependencies.

In contrast, an RNN model features a loop where information cycles through. When making predictions, the model considers the current input and the knowledge gained from preceding inputs.

Weights

RNNs employ three different weights:

  • Input-to-Hidden Weights (W_xh): These weights link the input to the hidden state.
  • Hidden-to-Hidden Weights (W_hh): These weights connect the previous hidden state to the current hidden state; like the other weights, they are learned by the network and shared across all time steps.
  • Hidden-to-Output Weights (W_hy): These weights connect the hidden state to the output.

Bias Vectors

Two bias vectors are utilized, one for the hidden state and the other for the output.

Activation Functions

The network commonly employs tanh or ReLU activations, with tanh typically used to compute the hidden state.

A single pass through the network entails:

At time step t, given input x_t and previous hidden state h_{t-1}:

  1. The network calculates the intermediate value z_t using the input, previous hidden state, weights, and biases.
  2. Subsequently, the model applies the tanh activation function to z_t to derive the new hidden state h_t.
  3. The network then computes the output y_t using the new hidden state, output weights, and output biases.

This process is iterated for each time step in the sequence, with the subsequent letter or word being predicted in the sequence.
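
As a rough illustration, here is a minimal NumPy sketch of a single RNN step, assuming small illustrative dimensions and randomly initialized weights (a trained network would learn W_xh, W_hh, W_hy, and the biases):

```python
import numpy as np

d_in, d_h, d_out = 3, 4, 2                # input, hidden, and output sizes (illustrative)

W_xh = np.random.randn(d_h, d_in) * 0.1   # input-to-hidden weights
W_hh = np.random.randn(d_h, d_h) * 0.1    # hidden-to-hidden weights
W_hy = np.random.randn(d_out, d_h) * 0.1  # hidden-to-output weights
b_h, b_y = np.zeros(d_h), np.zeros(d_out) # bias vectors

def rnn_step(x_t, h_prev):
    """One time step: compute the new hidden state h_t and the output y_t."""
    z_t = W_xh @ x_t + W_hh @ h_prev + b_h   # step 1: intermediate value
    h_t = np.tanh(z_t)                       # step 2: new hidden state
    y_t = W_hy @ h_t + b_y                   # step 3: output
    return h_t, y_t

# Iterate over a toy sequence of 5 time steps.
h = np.zeros(d_h)
for x in np.random.randn(5, d_in):
    h, y = rnn_step(x, h)
```

Note that the same weights are reused at every time step; only the hidden state h changes as the sequence is consumed.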

Backpropagation through Time

Backpropagation in a neural network involves updating the weights to minimize loss. However, in RNNs, this process is more intricate than in a standard feed-forward network. Therefore, the conventional backpropagation algorithm is tailored to accommodate the recurrent nature of RNNs.

In a feed-forward network, the backpropagation process entails:

  • Forward Pass: The model computes the activations and outputs of each layer sequentially.
  • Backward Pass: It calculates the gradients of the loss with respect to the weights for all the layers.
  • Parameter Update: The weights and biases are adjusted using the gradient descent algorithm.

However, in RNNs, this process is adjusted to accommodate sequential data. To learn accurate predictions for the subsequent word, the model must discern the weights in previous time steps that led to correct or incorrect predictions.

Consequently, an unrolling process is executed. Unrolling RNNs involves expanding the entire network for each time step, representing the weights at that specific time step. For instance, if there are t time steps, there will be t unrolled versions.

Once this unrolling is completed, losses are computed for each time step, and the model then calculates the gradients of the loss for hidden states, weights, and biases, propagating the error throughout the unrolled network.
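
Under simplifying assumptions (a toy sequence, a squared-error loss, illustrative dimensions, and no gradient clipping), a NumPy sketch of this unroll-and-backpropagate procedure might look like the following; the backward loop passes an error term from each time step back to the previous one:

```python
import numpy as np

np.random.seed(0)
d_in, d_h, d_out, T = 3, 4, 2, 5          # input size, hidden size, output size, time steps

W_xh = np.random.randn(d_h, d_in) * 0.1
W_hh = np.random.randn(d_h, d_h) * 0.1
W_hy = np.random.randn(d_out, d_h) * 0.1
b_h, b_y = np.zeros(d_h), np.zeros(d_out)

xs = [np.random.randn(d_in) for _ in range(T)]    # toy input sequence
ys = [np.random.randn(d_out) for _ in range(T)]   # toy target sequence

# ----- forward pass: unroll the network over T time steps -----
hs = {-1: np.zeros(d_h)}
outs, loss = {}, 0.0
for t in range(T):
    hs[t] = np.tanh(W_xh @ xs[t] + W_hh @ hs[t - 1] + b_h)
    outs[t] = W_hy @ hs[t] + b_y
    loss += 0.5 * np.sum((outs[t] - ys[t]) ** 2)  # per-step squared-error loss

# ----- backward pass: propagate gradients through the unrolled network -----
dW_xh, dW_hh, dW_hy = np.zeros_like(W_xh), np.zeros_like(W_hh), np.zeros_like(W_hy)
db_h, db_y = np.zeros_like(b_h), np.zeros_like(b_y)
dh_next = np.zeros(d_h)                    # gradient flowing in from the future
for t in reversed(range(T)):
    dy = outs[t] - ys[t]
    dW_hy += np.outer(dy, hs[t]); db_y += dy
    dh = W_hy.T @ dy + dh_next             # from the output and from the next time step
    dz = (1 - hs[t] ** 2) * dh             # tanh derivative
    dW_xh += np.outer(dz, xs[t]); dW_hh += np.outer(dz, hs[t - 1]); db_h += dz
    dh_next = W_hh.T @ dz                  # pass the gradient to the previous time step
```

The repeated multiplication by W_hh.T in the backward loop is precisely what causes gradients to vanish or explode over long sequences.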

That covers the inner workings of RNNs. In practice, however, they encounter challenges such as exploding and vanishing gradients and limited memory. These constraints made RNNs difficult to train and led to the inception of LSTMs, which build upon the foundations of RNNs with several enhancements.

Sequence Model 2: Long Short-Term Memory Networks (LSTM)

LSTM networks represent a specialized variant of RNN-based sequence models aimed at mitigating issues related to vanishing and exploding gradients. They find application in tasks like sentiment analysis. LSTM models, similar to RNNs, leverage the concept of memory but incorporate a gating mechanism to retain information over extended periods.

An LSTM network comprises the following elements:

Cell State

The cell state in an LSTM network functions as a vector that serves as the network’s memory by carrying information across various time steps. It traverses the entire sequence chain through specific linear transformations governed by the forget gate, input gate, and output gate.

Hidden State

Compared to the cell state, the hidden state operates as short-term memory, retaining information for relatively shorter durations. The hidden state acts as a message conduit, transferring information from the preceding time step to the subsequent one, akin to RNNs. It updates based on the previous hidden state, the current input, and the current cell state.

LSTMs employ three gates to govern the information stored in the cell state:

Forget Gate Operation

The forget gate determines which information from the prior cell state should be retained and which should be discarded. It yields an output value in the range of 0 to 1 for each element in the cell state. A value of 0 implies complete erasure of the information, while a value of 1 indicates full retention.

This decision is made through element-wise multiplication of the forget gate output with the previous cell state.
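
In the usual LSTM notation (σ is the sigmoid function, ⊙ denotes element-wise multiplication, and W_f, b_f are the forget gate’s weights and bias), this step is commonly written as follows, where the second expression is the element-wise scaling of the previous cell state:

```latex
f_t = \sigma\left(W_f \cdot [h_{t-1},\, x_t] + b_f\right), \qquad f_t \odot c_{t-1}
```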

Input Gate Operation

The input gate regulates the addition of new information to the cell state. It comprises two components: an input gate layer and a cell candidate layer. The input gate layer uses a sigmoid function to generate values between 0 and 1, determining the significance of the new information, while the cell candidate layer uses a tanh function to propose the new values to be added.

The gate outputs are not discrete but continuous, ranging from 0 to 1. This is attributed to the sigmoid activation function, which compresses any value into the 0 to 1 range.
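
Using the same notation as above, the input gate i_t and the cell candidate combine with the forget gate to produce the updated cell state:

```latex
i_t = \sigma\left(W_i \cdot [h_{t-1},\, x_t] + b_i\right), \qquad
\tilde{c}_t = \tanh\left(W_c \cdot [h_{t-1},\, x_t] + b_c\right), \qquad
c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t
```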

Output Gate Operation

The output gate determines the subsequent hidden state by dictating the extent to which the cell state information is exposed to the hidden state.
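
In the same notation, the output gate o_t scales a tanh of the cell state to produce the next hidden state:

```latex
o_t = \sigma\left(W_o \cdot [h_{t-1},\, x_t] + b_o\right), \qquad h_t = o_t \odot \tanh(c_t)
```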

Let’s explore how these components collaborate to make predictions; a minimal code sketch follows the steps below.

  1. At each time step t, the network receives an input x_t.
  2. For each input, the LSTM computes the values of the various gates. Notably, the gate computations use learnable weights, so the model improves its ability to determine appropriate gate values over time.
  3. The model calculates the Forget Gate.
  4. Subsequently, the model computes the Input Gate.
  5. The Cell State is updated by incorporating the previous cell state with the new information, determined by the gate values.
  6. Next, the Output Gate is computed to ascertain the degree of cell state information that should be revealed to the hidden state.
  7. The hidden state is passed to a fully connected layer to produce the final output.
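
Putting the gates together, here is a minimal NumPy sketch of one LSTM step, assuming small illustrative dimensions and random parameters (a real network learns these weights during training):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

d_in, d_h = 3, 4                                    # illustrative input and hidden sizes
rng = np.random.default_rng(0)
W_f, W_i, W_c, W_o = (rng.normal(0, 0.1, (d_h, d_in + d_h)) for _ in range(4))
b_f, b_i, b_c, b_o = (np.zeros(d_h) for _ in range(4))

def lstm_step(x_t, h_prev, c_prev):
    concat = np.concatenate([h_prev, x_t])          # [h_{t-1}, x_t]
    f_t = sigmoid(W_f @ concat + b_f)               # forget gate (step 3)
    i_t = sigmoid(W_i @ concat + b_i)               # input gate (step 4)
    c_tilde = np.tanh(W_c @ concat + b_c)           # cell candidate
    c_t = f_t * c_prev + i_t * c_tilde              # cell state update (step 5)
    o_t = sigmoid(W_o @ concat + b_o)               # output gate (step 6)
    h_t = o_t * np.tanh(c_t)                        # new hidden state, fed to a dense layer (step 7)
    return h_t, c_t

h, c = np.zeros(d_h), np.zeros(d_h)
for x in rng.normal(size=(5, d_in)):                # toy sequence of 5 time steps
    h, c = lstm_step(x, h, c)
```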

Sequence Model 3: Gated Recurrent Unit (GRU)

LSTMs and Gated Recurrent Units are both variants of Recurrent Networks. However, GRUs distinguish themselves from LSTMs by employing a reduced number of gates. GRUs are simpler compared to LSTMs and incorporate only two gates instead of the three gates found in LSTMs.

GRUs also exhibit simplicity concerning memory, as they rely solely on the hidden state for retention. The key components of a GRU are listed below (a minimal sketch follows the list):

  • The update gate in GRUs regulates the retention of past information.
  • The reset gate determines the amount of information in the memory that should be forgotten.
  • The hidden state stores information from the previous time step.
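
For comparison, here is a minimal NumPy sketch of one GRU step under the same illustrative assumptions; note that only a hidden state is carried between steps, and the update gate decides how much of the old state to keep versus replace with the candidate:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

d_in, d_h = 3, 4                                           # illustrative sizes
rng = np.random.default_rng(0)
W_z, W_r, W_h = (rng.normal(0, 0.1, (d_h, d_in + d_h)) for _ in range(3))
b_z, b_r, b_h = (np.zeros(d_h) for _ in range(3))

def gru_step(x_t, h_prev):
    concat = np.concatenate([h_prev, x_t])
    z_t = sigmoid(W_z @ concat + b_z)                      # update gate: how much past to keep
    r_t = sigmoid(W_r @ concat + b_r)                      # reset gate: how much past to forget
    h_tilde = np.tanh(W_h @ np.concatenate([r_t * h_prev, x_t]) + b_h)  # candidate state
    return (1 - z_t) * h_prev + z_t * h_tilde              # new hidden state

h = np.zeros(d_h)
for x in rng.normal(size=(5, d_in)):                       # toy sequence of 5 time steps
    h = gru_step(x, h)
```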

Sequence Model 4: Transformer Models

Transformer models have revolutionized the field of deep learning, garnering significant attention globally. Prominent Large Language Model (LLM) implementations such as ChatGPT and Google’s Gemini leverage the transformer architecture.

The transformer architecture distinguishes itself from previous models by its capacity to assign varying importance to different parts of the input sequence of words. This unique ability, known as the self-attention mechanism, proves invaluable for capturing long-range dependencies in text data.

Self-Attention Model

The self-attention mechanism enables the model to assign diverse weights to different segments of the input data, allowing it to extract critical features effectively.

This mechanism operates by calculating attention scores between each word and every other word in the sequence, determining their relative significance. This process empowers the model to focus on pertinent segments and facilitates a nuanced understanding of natural language, setting it apart from conventional models.
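
A minimal NumPy sketch of scaled dot-product self-attention is shown below, assuming a toy sequence of four token embeddings and randomly initialized projection matrices (in a real Transformer, W_q, W_k, and W_v are learned):

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 4, 8, 8            # 4 tokens, embedding size 8 (illustrative)

X = rng.normal(size=(seq_len, d_model))    # token embeddings
W_q, W_k, W_v = (rng.normal(0, 0.1, (d_model, d_k)) for _ in range(3))

Q, K, V = X @ W_q, X @ W_k, X @ W_v        # queries, keys, values
scores = Q @ K.T / np.sqrt(d_k)            # attention scores between every pair of tokens
weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)  # row-wise softmax
output = weights @ V                       # weighted sum of values for each token
```

Each row of `weights` sums to 1 and tells the model how much every other token should contribute to that token’s new representation.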

Architecture of the Transformer Model

A hallmark feature of the Transformer model lies in its self-attention mechanisms, enabling parallel data processing rather than sequential processing as seen in Recurrent Neural Networks (RNNs) or Long Short-Term Memory Networks (LSTMs).

The Transformer architecture comprises an encoder and a decoder.

Encoder

The Encoder comprises multiple layers, each containing two sub-layers:

  1. Multi-head self-attention mechanism.
  2. Fully connected feed-forward network.

The output of each sub-layer undergoes a residual connection and layer normalization before being passed to the subsequent sub-layer.
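
In the notation of the original Transformer paper, each sub-layer’s output is therefore:

```latex
\mathrm{LayerNorm}\big(x + \mathrm{Sublayer}(x)\big)
```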

The term “Multi-head” signifies that the model incorporates multiple sets (or “heads”) of learned linear transformations applied to the input. This feature enhances the network’s modeling capabilities.

For instance, consider the sentence: “The cat, which already ate, was full.” Through multi-head attention, different heads can attend to different relationships:

  1. Head 1 focuses on the relationship between “cat” and “ate,” aiding the model in understanding the eating context.
  2. Head 2 centers on the connection between “ate” and “full,” aiding the model in comprehending why the cat is full.

Consequently, the network can process the input, extracting context more effectively in parallel.
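
Here is a minimal NumPy sketch of multi-head self-attention with two heads, assuming random projections and a toy embedding size; each head attends to the sequence through its own projected subspace before the results are concatenated and projected back:

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model, n_heads = 6, 8, 2
d_head = d_model // n_heads

X = rng.normal(size=(seq_len, d_model))           # e.g. embeddings of the tokens in the example sentence

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

heads = []
for _ in range(n_heads):                          # each head has its own learned projections
    W_q, W_k, W_v = (rng.normal(0, 0.1, (d_model, d_head)) for _ in range(3))
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    heads.append(softmax(Q @ K.T / np.sqrt(d_head)) @ V)

W_o = rng.normal(0, 0.1, (d_model, d_model))
output = np.concatenate(heads, axis=-1) @ W_o     # concatenate the heads and project back to d_model
```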

Decoder

The Decoder follows a structure similar to the Encoder, but with a crucial distinction. Here, a Masked Multi-head Attention mechanism is employed. The primary components include:

  • Masked Self-Attention Layer: Analogous to the self-attention layer in the Encoder, but with masking applied so that each position can only attend to earlier positions.
  • Encoder-Decoder Attention Layer: Attends over the Encoder’s output, allowing the Decoder to focus on the relevant parts of the input sequence.
  • Feed-Forward Neural Network

The “masked” aspect pertains to a training technique where future tokens are concealed from the model.

This practice is essential because, during training, the entire sequence (sentence) is fed into the model at once. Without masking, the model could simply look at the subsequent word, which would hinder it from learning to predict. Masking hides the future words from the model’s attention, so it must predict each word from the preceding context alone.

For example, in a machine translation task, translating the English sentence “I am a student” to French, “Je suis un étudiant”:

[START] Je suis un étudiant [END]

Here’s how the masked layer aids in prediction (a small sketch of the mask follows these steps):

  1. While predicting the initial word “Je,” all other words are masked out, allowing the model to focus solely on [START] without knowledge of the subsequent words.
  2. When predicting the subsequent word “suis,” the model masks out the words to its right (“un étudiant [END]”), so it only processes “[START] Je.”
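
A minimal NumPy sketch of this look-ahead mask is shown below; the attention scores are stand-ins (zeros) purely to show the masking pattern:

```python
import numpy as np

tokens = ["[START]", "Je", "suis", "un", "étudiant", "[END]"]
seq_len = len(tokens)

mask = np.triu(np.ones((seq_len, seq_len)), k=1) * -1e9   # large negative values above the diagonal
scores = np.zeros((seq_len, seq_len))                     # stand-in for Q @ K.T / sqrt(d_k)
masked = scores + mask

weights = np.exp(masked) / np.exp(masked).sum(axis=-1, keepdims=True)
# Row 0 ("[START]") can only attend to itself; row 1 ("Je") can attend to
# "[START]" and "Je"; every future token receives (near-)zero attention weight.
```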

Summary

This blog has delved into diverse neural network architectures tailored for sequence modeling. We began with RNNs, the foundational model upon which LSTM and GRU build; RNNs diverge from traditional feed-forward networks due to their memory capabilities rooted in their recurrent nature. However, training RNNs posed challenges, prompting the introduction of LSTM and GRU, whose gating mechanisms sustain information for prolonged periods.

Subsequently, we explored the Transformer model, an architecture prevalent in prominent LLMs such as ChatGPT and Gemini. Transformers stand out from other sequence models owing to their self-attention mechanism, which allows the model to assign varying importance to different segments of the sequence and capture long-range dependencies in text.

For further insights into the topics covered in this blog, feel free to peruse our additional blog posts.