1. What is a Residual Connection?
A residual connection is an architectural design that lets information flow directly between network layers without passing through all of the intermediate nonlinear transformations. Specifically, it adds a layer's input directly to that layer's output, forming a "shortcut" path.
Formula representation:
If the input to a layer is $x$ and the layer's transformation of that input is $F(x)$, then:
$\text{Output} = F(x) + x$
The main purpose of this structure is to alleviate vanishing and exploding gradients during the training of deep neural networks, allowing information (and gradients) to propagate more smoothly through the network.
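The computation is easy to express in code. Below is a minimal sketch, assuming PyTorch; the wrapped module, the class name `ResidualWrapper`, and the dimensions are illustrative choices rather than anything prescribed by the text:

```python
import torch
import torch.nn as nn

class ResidualWrapper(nn.Module):
    """Wraps an arbitrary transformation F and adds a shortcut: output = F(x) + x."""
    def __init__(self, transform: nn.Module):
        super().__init__()
        self.transform = transform  # F(x); its output shape must match its input shape

    def forward(self, x):
        return self.transform(x) + x  # the residual ("shortcut") connection

# Toy usage: a two-layer MLP plays the role of F(x); sizes are arbitrary
f = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 64))
block = ResidualWrapper(f)
x = torch.randn(8, 64)
print(block(x).shape)  # torch.Size([8, 64])
```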
2. The Role and Logic of Residual Connections in ResNet
In 2015, Kaiming He et al. proposed ResNet (Residual Network), which is the origin of residual connections. ResNet successfully trained deep neural networks with more than 100 layers by introducing residual connections. The core ideas are as follows:
Solving the Degradation Problem: As plain networks get deeper, their training error eventually becomes higher than that of shallower networks; this is known as the degradation problem (and it is not an overfitting effect, since training error itself rises). Residual connections alleviate this problem.
Direct Information Transmission: By using residual connections, the input information can be directly transmitted to deeper layers, making it easier for the network to learn the identity mapping.
Equation Explanation: In ResNet, the learning objective becomes the residual function $F(x) = H(x) - x$, where $H(x)$ is the desired underlying mapping. Learning a residual close to zero (i.e., an identity mapping) is easier than learning $H(x)$ from scratch, which makes the network easier to optimize.
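To make the idea concrete, here is a simplified sketch of a ResNet-style basic block, again assuming PyTorch; the channel count is arbitrary, and the stride/downsampling handling of the original paper is omitted for brevity:

```python
import torch
import torch.nn as nn

class BasicBlock(nn.Module):
    """Simplified ResNet-style basic block: two 3x3 convolutions plus a shortcut.

    The convolutions learn the residual F(x) = H(x) - x; the forward pass returns F(x) + x.
    """
    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        residual = self.relu(self.bn1(self.conv1(x)))  # first conv-BN-ReLU
        residual = self.bn2(self.conv2(residual))      # second conv-BN (no ReLU yet)
        return self.relu(residual + x)                 # add the shortcut, then activate

# Toy usage with made-up shapes
block = BasicBlock(channels=16)
x = torch.randn(1, 16, 32, 32)
print(block(x).shape)  # torch.Size([1, 16, 32, 32])
```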
3. Why Use Residual Connections in Transformers and Their Role
The Transformer, proposed by Vaswani et al. in 2017, is a neural network architecture used mainly for natural language processing (NLP) tasks. Its core components are the self-attention mechanism and the feed-forward network layer. Each sub-layer in the Transformer is wrapped in a residual connection for the following reasons:
Stabilizing the Training of Deep Networks: Although Transformers are typically not as deep as ResNet, residual connections still help alleviate the vanishing-gradient problem and ensure that information is transmitted effectively.
Accelerating Model Convergence: Residual connections allow gradients to flow more easily to preceding layers, promoting faster model convergence.
Preserving Original Information: By directly adding the input to the output, the model can learn complex transformations without destroying the original information.
Implementation of Residual Connections in Transformers:
For input $x$ and the output of a sub-layer (self-attention or feed-forward network) $Sublayer(x)$, the output of the residual connection is:
$\text{Output} = \text{LayerNorm}(x + \text{Sublayer}(x))$
Here, layer normalization (LayerNorm) is also applied to further stabilize the training process.
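A minimal sketch of this "add then normalize" pattern, assuming PyTorch; `SublayerConnection` is an illustrative name, and the feed-forward module below merely stands in for either sub-layer (self-attention or feed-forward network):

```python
import torch
import torch.nn as nn

class SublayerConnection(nn.Module):
    """Residual connection followed by layer normalization: LayerNorm(x + Sublayer(x))."""
    def __init__(self, d_model: int):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x, sublayer):
        return self.norm(x + sublayer(x))  # add the shortcut, then normalize

# Toy usage: a feed-forward sub-layer with an illustrative model dimension of 32
d_model = 32
ffn = nn.Sequential(nn.Linear(d_model, 64), nn.ReLU(), nn.Linear(64, d_model))
connection = SublayerConnection(d_model)
x = torch.randn(2, 10, d_model)  # (batch, sequence length, d_model)
print(connection(x, ffn).shape)  # torch.Size([2, 10, 32])
```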
4. Similarities and Differences Between ResNet and Transformer
Similarities:
Same Objective: In both ResNet and Transformer, the main purpose of residual connections is to solve the difficulties in training deep networks and promote effective transmission of information.
Similar Structure: Both add the input to the output of a nonlinear transformation.
Differences:
Different Application Fields: ResNet is mainly used for image recognition tasks in the field of computer vision, while Transformer is mainly used for natural language processing tasks.
Normalization Methods: In ResNet, batch normalization (BatchNorm) is usually performed after the convolutional layers, while in Transformer, layer normalization (LayerNorm) is performed after the residual connection.
Different Components: The sub-layers in ResNet are mainly convolutional layers, while the sub-layers in Transformer are self-attention mechanisms and feed-forward neural networks.
5. Understanding the Practical Application of Residual Connections Through a Case Study
Case Study: Transformer in Machine Translation Tasks
Suppose we want to translate an English sentence into Chinese. The Transformer's encoder processes it as follows:
Step 1: Input Embedding
Convert each word in the English sentence into a vector representation to form the input matrix $X$.
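In code, this step is typically an embedding lookup. A minimal sketch, assuming PyTorch; the vocabulary size, model dimension, and token IDs are made up for illustration (positional encodings, which a real Transformer also adds here, are omitted):

```python
import torch
import torch.nn as nn

vocab_size, d_model = 10000, 512          # illustrative sizes
embedding = nn.Embedding(vocab_size, d_model)

token_ids = torch.tensor([[12, 845, 3, 97, 2]])  # one sentence of 5 made-up token IDs
X = embedding(token_ids)                         # the input matrix X
print(X.shape)  # torch.Size([1, 5, 512]) -> (batch, sequence length, d_model)
```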
Step 2: Self-Attention Sub-layer
Calculate the correlations between various positions in the input sequence to generate a new representation $Sublayer_{1}(X)$.
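A sketch of this step using PyTorch's built-in multi-head attention; the model dimension and head count are illustrative choices, not values taken from the text:

```python
import torch
import torch.nn as nn

d_model, num_heads = 512, 8
self_attention = nn.MultiheadAttention(d_model, num_heads, batch_first=True)

X = torch.randn(1, 5, d_model)              # stands in for the embedded sentence from Step 1
# Self-attention: queries, keys, and values all come from the same input X
sublayer1_out, attn_weights = self_attention(X, X, X)
print(sublayer1_out.shape)  # torch.Size([1, 5, 512])
print(attn_weights.shape)   # torch.Size([1, 5, 5]) -> correlations between positions
```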
Step 3: First Residual Connection
Add the input $X$ to the output of the self-attention sub-layer and perform layer normalization:
$H_1 = \text{LayerNorm}(X + \text{Sublayer}_1(X))$
Explanation:
Preserving Original Input Information: Even if the self-attention sub-layer does not learn well, the residual connection ensures that the input information is not lost.
Stable Training: It helps in the backward propagation of gradients.
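Expressed in code, this step is a single add-and-normalize; the sketch below reuses the same illustrative shapes and again assumes PyTorch:

```python
import torch
import torch.nn as nn

d_model = 512
self_attention = nn.MultiheadAttention(d_model, num_heads=8, batch_first=True)
layer_norm1 = nn.LayerNorm(d_model)

X = torch.randn(1, 5, d_model)              # embedded input from Step 1
sublayer1_out, _ = self_attention(X, X, X)  # Step 2: self-attention output
H1 = layer_norm1(X + sublayer1_out)         # Step 3: residual connection + LayerNorm
print(H1.shape)  # torch.Size([1, 5, 512])
```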
Step 4: Feed-forward Neural Network Sub-layer
Apply a position-wise nonlinear transformation (the same transformation at every position, applied independently) to $H_{1}$ to obtain $Sublayer_{2}(H_{1})$.
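A sketch of this feed-forward sub-layer, assuming PyTorch; the hidden size of 2048 echoes the original paper's configuration but is used here only for illustration:

```python
import torch
import torch.nn as nn

d_model, d_ff = 512, 2048
# nn.Linear acts on the last dimension, so each position is transformed independently
feed_forward = nn.Sequential(
    nn.Linear(d_model, d_ff),
    nn.ReLU(),
    nn.Linear(d_ff, d_model),
)

H1 = torch.randn(1, 5, d_model)   # stands in for the output of Step 3
sublayer2_out = feed_forward(H1)
print(sublayer2_out.shape)  # torch.Size([1, 5, 512])
```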
Step 5: Second Residual Connection
Add $H_{1}$ to the output of the feed-forward neural network and perform layer normalization:
$H_2 = \text{LayerNorm}(H_1 + \text{Sublayer}_2(H_1))$
Explanation:
Enhancing Feature Representation: Further extracting features while preserving the result of self-attention processing.
Reducing Over-reliance on a Single Sub-layer: Because the shortcut keeps the previous representation available, the network does not have to depend entirely on what this one sub-layer learns.
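In code, the second residual connection mirrors the first; the sketch below uses the same illustrative shapes and again assumes PyTorch:

```python
import torch
import torch.nn as nn

d_model, d_ff = 512, 2048
feed_forward = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
layer_norm2 = nn.LayerNorm(d_model)

H1 = torch.randn(1, 5, d_model)          # output of the first residual connection (Step 3)
H2 = layer_norm2(H1 + feed_forward(H1))  # Step 5: residual connection + LayerNorm
print(H2.shape)  # torch.Size([1, 5, 512])
```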
Step 6: Output Result
$H_{2}$ is used as the output of the encoder and passed to the decoder for translation.
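Putting Steps 1 through 6 together, a compact encoder-layer sketch might look like the following. This is a simplified illustration assuming PyTorch (no positional encoding, dropout, or masking), not a full reference implementation:

```python
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    """One Transformer encoder layer with two residual connections (post-LayerNorm)."""
    def __init__(self, d_model=512, num_heads=8, d_ff=2048):
        super().__init__()
        self.self_attention = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.feed_forward = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        attn_out, _ = self.self_attention(x, x, x)    # Step 2: self-attention
        h1 = self.norm1(x + attn_out)                 # Step 3: first residual connection
        h2 = self.norm2(h1 + self.feed_forward(h1))   # Steps 4-5: FFN + second residual
        return h2                                     # Step 6: passed on to the decoder

# Toy usage with an illustrative 5-token "sentence"
layer = EncoderLayer()
X = torch.randn(1, 5, 512)   # Step 1 would produce this via embeddings
print(layer(X).shape)        # torch.Size([1, 5, 512])
```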
Through this case study, we can see:
Importance of Residual Connections: They ensure that the input information at each layer can be directly transmitted to subsequent layers, preventing information loss.
Similarity to ResNet: In deep networks, residual connections alleviate training difficulties.
Practical Effects: By introducing residual connections and other mechanisms, Transformers achieve excellent performance in tasks such as machine translation, text summarization, and sentiment analysis.
Summary
Role of Residual Connections: In deep networks, residual connections can effectively mitigate the problems of gradient vanishing or explosion, promote the effective transmission of information, and stabilize the training process.
Relationship with ResNet: Residual connections in Transformers and ResNet are conceptually and logically consistent, both aiming to solve the difficulties in training deep networks.
In-depth Understanding: Through step-by-step analysis and case explanations, we can see the important role of residual connections in model performance and training stability.
Expectations and Acknowledgments
We hope that in future network architectures, residual connections will play a more extensive and in-depth role, becoming an indispensable part of the field of deep learning. We welcome readers to continue following this series to learn more about Transformers. If you have any questions, please feel free to ask!