ResNet (Residual Network)

ResNet is a deep convolutional neural network architecture that introduces skip connections (residual connections) to enable training of extremely deep networks. It won the ILSVRC 2015 (ImageNet) classification challenge and became a standard backbone for vision models; its residual blocks are also the building block of the U-Net used in [[Diffusion Model|diffusion models]].


1. Core Concept: Residual Learning

1.1 The Degradation Problem

Before ResNet, stacking more layers onto a plain (skip-free) network eventually made it perform worse than a shallower one. The cause was not overfitting, since training error rose too; optimization simply became harder as depth increased:

| Network Depth | Training Error | Test Error |
|---|---|---|
| 20 layers | Lower | Lower |
| 56 layers | Higher | Higher |

This counterintuitive phenomenon is called the degradation problem: adding more layers increases training error even though the deeper network can theoretically represent the shallower one (by setting extra layers to identity).

[!NOTE] Root Cause
Vanishing/exploding gradients compound across many layers, but normalized initialization and Batch Normalization largely address them, so they do not fully explain the degradation. The core issue is that very deep networks struggle to learn identity mappings, which are surprisingly difficult for stacked nonlinear layers to represent.

1.2 The Residual Solution

Instead of learning a direct mapping $H(x)$, ResNet learns the residual $F(x) = H(x) - x$:

$$y = F(x, \{W_i\}) + x$$

where $F(x, \{W_i\})$ is the residual function (typically 2-3 weight layers), and $x$ is the skip connection that bypasses these layers.

[!NOTE] Why This Works
If the optimal mapping is close to identity ($H(x) \approx x$), the residual $F(x) \approx 0$ is much easier to learn than the full mapping. The skip connection provides a gradient highway that allows error signals to flow directly backward without attenuation.
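To make the "residual near zero" argument concrete, here is a minimal pure-Python sketch. It assumes a toy two-layer block with hypothetical helpers `linear`, `plain_block` and `residual_block` (none of these come from the ResNet paper), and omits BatchNorm and the final ReLU so that it matches $y = F(x) + x$ exactly.

```python
def relu(v):
    return [max(0.0, a) for a in v]


def linear(v, weights):
    """Multiply vector v by a weight matrix given as a list of rows."""
    return [sum(w * a for w, a in zip(row, v)) for row in weights]


def plain_block(x, w1, w2):
    """H(x): two stacked layers with a ReLU in between, no skip connection."""
    return linear(relu(linear(x, w1)), w2)


def residual_block(x, w1, w2):
    """y = F(x) + x, where F is the same two-layer stack."""
    return [f + a for f, a in zip(plain_block(x, w1, w2), x)]


zeros = [[0.0] * 3 for _ in range(3)]
x = [1.0, -2.0, 0.5]

print(plain_block(x, zeros, zeros))     # all-zero weights erase the input
print(residual_block(x, zeros, zeros))  # all-zero weights give the identity
```

With all weights at zero the residual block is already the identity, whereas the plain block would have to learn specific non-zero weights to pass its input through.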

1.3 Mathematical Intuition

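Standard network (hard to optimize):

$$y = H(x)$$

Residual network (easier to optimize):

$$y = F(x) + x$$

Gradient flow through skip connection:

$$\frac{\partial \mathcal{L}}{\partial x} = \frac{\partial \mathcal{L}}{\partial y}\left(\frac{\partial F(x)}{\partial x} + 1\right)$$

The additive identity term $1$ keeps a direct gradient path open: even if $\partial F / \partial x$ becomes very small, the gradient reaching $x$ stays close to $\partial \mathcal{L} / \partial y$ instead of vanishing.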
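The "+1" term can be checked numerically. The sketch below assumes a hypothetical scalar residual branch $F(x) = w \tanh(x)$ with a deliberately small weight, and compares central finite-difference derivatives with and without the skip connection.

```python
import math


def residual_fn(x, w=0.01):
    """Hypothetical scalar residual branch F(x) = w * tanh(x)."""
    return w * math.tanh(x)


def numerical_derivative(fn, x, eps=1e-6):
    """Central finite-difference estimate of d fn / d x."""
    return (fn(x + eps) - fn(x - eps)) / (2 * eps)


x = 0.5
plain_grad = numerical_derivative(residual_fn, x)                      # dF/dx
residual_grad = numerical_derivative(lambda v: residual_fn(v) + v, x)  # dF/dx + 1

print(f"without skip: {plain_grad:.6f}")
print(f"with skip:    {residual_grad:.6f}")
```

The branch on its own contributes a derivative below 0.01, while the block with the skip connection keeps a derivative just above 1.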


2. Architecture Design

2.1 Basic Residual Block

The fundamental building block (used in ResNet-18/34):

```python
import torch.nn as nn
import torch.nn.functional as F


class BasicBlock(nn.Module):
    """Basic residual block: 3x3 conv → BN → ReLU → 3x3 conv → BN → +skip → ReLU"""

    def __init__(self, in_channels, out_channels, stride=1):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels, out_channels, 3,
                               stride=stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_channels)
        self.conv2 = nn.Conv2d(out_channels, out_channels, 3,
                               stride=1, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_channels)

        # Skip connection with 1x1 conv when dimensions change
        self.shortcut = nn.Sequential()
        if stride != 1 or in_channels != out_channels:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_channels, out_channels, 1, stride=stride, bias=False),
                nn.BatchNorm2d(out_channels)
            )

    def forward(self, x):
        residual = self.shortcut(x)

        out = self.conv1(x)
        out = self.bn1(out)
        out = F.relu(out)
        out = self.conv2(out)
        out = self.bn2(out)

        out += residual  # Element-wise addition
        out = F.relu(out)
        return out
```

Structure:

Input → [3×3 Conv → BN → ReLU → 3×3 Conv → BN] → + Input → ReLU → Output
└───────────────── Residual Path ────────┘

2.2 Bottleneck Block

Used in deeper variants (ResNet-50/101/152) to reduce computation:

```python
import torch.nn as nn
import torch.nn.functional as F


class Bottleneck(nn.Module):
    """Bottleneck: 1x1 → 3x3 → 1x1 conv, reducing and restoring channels"""

    expansion = 4  # Output channels = bottleneck_channels × 4

    def __init__(self, in_channels, bottleneck_channels, stride=1):
        super().__init__()
        # 1x1 compress: 256 → 64
        self.conv1 = nn.Conv2d(in_channels, bottleneck_channels, 1, bias=False)
        self.bn1 = nn.BatchNorm2d(bottleneck_channels)
        # 3x3 spatial: 64 → 64
        self.conv2 = nn.Conv2d(bottleneck_channels, bottleneck_channels, 3,
                               stride=stride, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(bottleneck_channels)
        # 1x1 expand: 64 → 256
        self.conv3 = nn.Conv2d(bottleneck_channels,
                               bottleneck_channels * self.expansion, 1, bias=False)
        self.bn3 = nn.BatchNorm2d(bottleneck_channels * self.expansion)

        # Project the shortcut when the spatial size or channel count changes
        self.shortcut = nn.Sequential()
        if stride != 1 or in_channels != bottleneck_channels * self.expansion:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_channels, bottleneck_channels * self.expansion,
                          1, stride=stride, bias=False),
                nn.BatchNorm2d(bottleneck_channels * self.expansion)
            )

    def forward(self, x):
        residual = self.shortcut(x)
        out = F.relu(self.bn1(self.conv1(x)))
        out = F.relu(self.bn2(self.conv2(out)))
        out = self.bn3(self.conv3(out))
        out += residual
        out = F.relu(out)
        return out
```

Bottleneck structure:

256-d → [1×1, 64] → ReLU → [3×3, 64] → ReLU → [1×1, 256] → + skip → ReLU

| Block Type | Convolutions | Parameter Count | Used In |
|---|---|---|---|
| Basic | 3×3, 3×3 | $2 \cdot 3^2 \cdot C_{in} C_{out}$ | ResNet-18, 34 |
| Bottleneck | 1×1, 3×3, 1×1 | $C_{in} C_{mid} + 3^2 C_{mid}^2 + C_{mid} C_{out}$ | ResNet-50, 101, 152 |
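The parameter-count column can be evaluated for a concrete case. This is a small sketch that counts convolution weights only (no BatchNorm, no bias, no projection shortcut); the function names are illustrative, and the 256/64 channel sizes are taken from the bottleneck diagram above.

```python
def conv_params(c_in, c_out, kernel):
    """Weight count of a bias-free kernel×kernel convolution."""
    return kernel * kernel * c_in * c_out


def basic_block_params(c_in, c_out):
    """Two 3×3 convolutions: c_in → c_out → c_out."""
    return conv_params(c_in, c_out, 3) + conv_params(c_out, c_out, 3)


def bottleneck_params(c_in, c_mid, c_out):
    """1×1 compress, 3×3 at reduced width, 1×1 expand."""
    return (conv_params(c_in, c_mid, 1)
            + conv_params(c_mid, c_mid, 3)
            + conv_params(c_mid, c_out, 1))


print(basic_block_params(256, 256))     # 1179648
print(bottleneck_params(256, 64, 256))  # 69632
```

At 256 channels, a basic block would need roughly 17 times more convolution weights than the bottleneck, which is why the deeper variants use the bottleneck design.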

2.3 Full Architecture Zoo

| Model | Layers | Blocks per Stage | Block Type | Parameters |
|---|---|---|---|---|
| ResNet-18 | 18 | [2, 2, 2, 2] | Basic | 11.7M |
| ResNet-34 | 34 | [3, 4, 6, 3] | Basic | 21.8M |
| ResNet-50 | 50 | [3, 4, 6, 3] | Bottleneck | 25.6M |
| ResNet-101 | 101 | [3, 4, 23, 3] | Bottleneck | 44.5M |
| ResNet-152 | 152 | [3, 8, 36, 3] | Bottleneck | 60.2M |
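The "Layers" column follows from the "Blocks per Stage" column. A minimal sketch, assuming the usual counting convention (one stem convolution, two convolutions per basic block or three per bottleneck block, one final fully connected layer, projection shortcuts not counted); the `CONFIGS` dictionary is just a restatement of the table:

```python
# (blocks per stage, weight layers per block) for each model in the table
CONFIGS = {
    "ResNet-18": ([2, 2, 2, 2], 2),
    "ResNet-34": ([3, 4, 6, 3], 2),
    "ResNet-50": ([3, 4, 6, 3], 3),
    "ResNet-101": ([3, 4, 23, 3], 3),
    "ResNet-152": ([3, 8, 36, 3], 3),
}


def count_layers(blocks_per_stage, convs_per_block):
    """Stem conv + all block convolutions + final fully connected layer."""
    return 1 + sum(blocks_per_stage) * convs_per_block + 1


for name, (blocks, convs) in CONFIGS.items():
    print(name, count_layers(blocks, convs))
```

Each printed count matches the number in the model's name.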

Overall architecture flow:

Input (224×224×3)
↓
7×7 Conv, 64, stride=2 → BN → ReLU
↓
3×3 MaxPool, stride=2
↓
Stage 1: [ResBlock × n₁], 64/256 channels, stride=1
↓
Stage 2: [ResBlock × n₂], 128/512 channels, stride=2
↓
Stage 3: [ResBlock × n₃], 256/1024 channels, stride=2
↓
Stage 4: [ResBlock × n₄], 512/2048 channels, stride=2
↓
Global Average Pooling → FC 1000 → Softmax
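The spatial resolution at each step of this flow can be traced with the standard convolution output-size formula. The sketch below assumes padding 3 for the 7×7 stem and padding 1 for the 3×3 layers; those padding values are not stated in the flow above and are an assumption borrowed from common implementations.

```python
def conv_output_size(size, kernel, stride, padding):
    """Spatial output size of a convolution or pooling layer."""
    return (size + 2 * padding - kernel) // stride + 1


def trace_resolution(input_size=224):
    """Feature-map side length after the stem, the max pool and each stage."""
    sizes = []
    size = conv_output_size(input_size, kernel=7, stride=2, padding=3)  # stem conv
    sizes.append(size)
    size = conv_output_size(size, kernel=3, stride=2, padding=1)        # max pool
    sizes.append(size)
    for stride in (1, 2, 2, 2):                                         # stages 1-4
        size = conv_output_size(size, kernel=3, stride=stride, padding=1)
        sizes.append(size)
    return sizes


print(trace_resolution())  # [112, 56, 56, 28, 14, 7]
```

Under these assumptions the final feature map is 7×7, which global average pooling reduces to a single vector per image.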

3. Why ResNet Works: Theoretical Analysis

3.1 Gradient Highway

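The skip connection creates an uninterrupted gradient path. Because $x_L = x_l + \sum_{i=l}^{L-1} F(x_i)$, backpropagating from a deep layer $L$ to any shallower layer $l$ gives:

$$\frac{\partial \mathcal{L}}{\partial x_l} = \frac{\partial \mathcal{L}}{\partial x_L}\left(1 + \sum_{i=l}^{L-1} \frac{\partial F(x_i)}{\partial x_l}\right)$$

The term $1$ means the gradient at layer $L$ reaches layer $l$ directly, without passing through any weight layer, so it is not multiplicatively attenuated.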
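The effect of depth can be sketched with a scalar toy network. The code below assumes 50 identical hypothetical layers $F(x) = 0.5 \tanh(x)$ and tracks $\partial x_L / \partial x_0$ through the chain rule, once for a plain stack and once for a residual stack; it illustrates the formula and is not a model of a real ResNet.

```python
import math


def forward(x0, weights, residual=True):
    """Run a scalar toy network; return the output and d(output)/d(x0)."""
    x, grad = x0, 1.0
    for w in weights:
        branch = w * math.tanh(x)                    # F(x_i)
        branch_grad = w * (1.0 - math.tanh(x) ** 2)  # dF/dx_i
        if residual:
            x, grad = x + branch, grad * (1.0 + branch_grad)
        else:
            x, grad = branch, grad * branch_grad
    return x, grad


weights = [0.5] * 50
_, plain_grad = forward(0.3, weights, residual=False)
_, res_grad = forward(0.3, weights, residual=True)

print(f"plain network:    {plain_grad:.3e}")
print(f"residual network: {res_grad:.3e}")
```

In the plain stack every factor is at most 0.5, so the product collapses toward zero; in the residual stack every factor is at least 1, so the gradient cannot shrink.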

3.2 Unrolled View as Ensemble

A ResNet with $L$ residual blocks can be viewed as an implicit ensemble of $2^L$ paths. Unrolling the recursion $x_{i+1} = x_i + F(x_i)$ gives:

$$x_L = x_0 + \sum_{i=0}^{L-1} F(x_i)$$

Each path corresponds to including or skipping each residual block, which is effectively an ensemble of shallower networks that share parameters. Removing any single block during inference causes only a mild performance drop (≈0.3% accuracy loss), which supports this ensemble interpretation.
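The path count is easiest to see in the linear case. This sketch assumes hypothetical scalar blocks $F_i(x) = w_i x$, so the network computes $x_0 \prod_i (1 + w_i)$; expanding that product yields one term per subset of blocks, which is exactly the set of paths.

```python
from itertools import combinations
from math import prod


def unrolled_paths(weights):
    """Map each subset of block indices to the gain of the path through them."""
    n = len(weights)
    paths = {}
    for k in range(n + 1):
        for subset in combinations(range(n), k):
            paths[subset] = prod(weights[i] for i in subset)  # empty subset → 1
    return paths


weights = [0.5, -0.2, 0.1]
paths = unrolled_paths(weights)
direct = prod(1 + w for w in weights)  # gain of the stacked residual blocks

print(len(paths))            # 8 paths for 3 blocks
print(sum(paths.values()))   # matches `direct` up to rounding
print(direct)
```

The empty subset is the pure skip path, and most paths pass through only some of the blocks, which is why they behave like shallower networks.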

3.3 Identity Mapping Importance

The pre-activation design (He et al., 2016) showed that keeping the skip connection a pure identity mapping (no ReLU after the addition) is critical:

| Design | Block layout | Reported CIFAR-10 test error |
|---|---|---|
| Original | Conv → BN → ReLU → Conv → BN → +x → ReLU | 6.43% (ResNet-110) |
| Pre-activation | BN → ReLU → Conv → BN → ReLU → Conv → +x | 4.62% (ResNet-1001) |

[!NOTE] Pre-Activation Insight
Moving BatchNorm and ReLU before the convolution (instead of after) makes the information flow through the skip connection completely clean, enabling even 1000+ layer networks to be trained effectively.

3.4 Pre-activation ResNet Code

```python
import torch.nn as nn
import torch.nn.functional as F


class PreActBlock(nn.Module):
    """Pre-activation residual block (He et al., 2016)"""

    def __init__(self, in_channels, out_channels, stride=1):
        super().__init__()
        self.bn1 = nn.BatchNorm2d(in_channels)
        self.conv1 = nn.Conv2d(in_channels, out_channels, 3,
                               stride=stride, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_channels)
        self.conv2 = nn.Conv2d(out_channels, out_channels, 3,
                               stride=1, padding=1, bias=False)

        # Projection shortcut only when the shape changes; no BN on the skip path
        self.shortcut = nn.Sequential()
        if stride != 1 or in_channels != out_channels:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_channels, out_channels, 1, stride=stride, bias=False)
            )

    def forward(self, x):
        residual = self.shortcut(x)
        out = F.relu(self.bn1(x))
        out = self.conv1(out)
        out = F.relu(self.bn2(out))
        out = self.conv2(out)
        return out + residual  # No ReLU after addition → pure identity path
```

4. Connection to [[Neural ODE]]

4.1 ResNet as Discrete ODE

A ResNet block with step size $\Delta t$:

$$h_{t+\Delta t} = h_t + \Delta t \, f_\theta(h_t)$$

In the limit $\Delta t \to 0$, this becomes a [[Neural ODE]]:

$$\frac{dh(t)}{dt} = f_\theta(h(t), t)$$

| Aspect | ResNet | [[Neural ODE]] |
|---|---|---|
| Layer index | Discrete $t = 0, 1, 2, \dots, L$ | Continuous $t \in [0, T]$ |
| Depth | Fixed $L$ | ODE solver adapts steps |
| Step size | $\Delta t = 1$ | Adaptive, solver-determined |
| Parameters | Per-layer or shared | Naturally shared across “depth” |
| Memory (backprop) | $O(L)$ | $O(1)$ (adjoint method) |
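The correspondence can be tried on an ODE with a known solution. In this sketch the dynamics $f(h) = -h$ stand in for a learned $f_\theta$ (an assumption made purely so the exact answer $e^{-1}$ is available), and every loop iteration is a residual update with a shared $f$.

```python
import math


def euler_resnet(h0, f, num_blocks, total_time=1.0):
    """Apply num_blocks residual updates h ← h + Δt·f(h) with Δt = T / num_blocks."""
    dt = total_time / num_blocks
    h = h0
    for _ in range(num_blocks):
        h = h + dt * f(h)  # one residual block = one forward Euler step
    return h


def decay(h):
    """Toy dynamics dh/dt = -h, whose exact solution is h(t) = h(0)·exp(-t)."""
    return -h


exact = math.exp(-1.0)
for blocks in (10, 100, 1000):
    approx = euler_resnet(1.0, decay, blocks)
    print(blocks, abs(approx - exact))
```

The error shrinks as the number of blocks grows, which is the discrete-to-continuous limit described above.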

4.2 The Continuum Bridge

ResNet provides the discrete scaffold from which [[Neural ODE]] emerged as the continuous limit. This connection means:

  1. ResNet training insights (initialization, normalization) transfer to Neural ODEs
  2. Neural ODEs can be discretized back into ResNet-like architectures for deployment
  3. ResNet’s skip connections are the discrete analog of numerical ODE integrators (specifically, the forward Euler method)

ResNet:     h_{t+1} = h_t + f(h_t)           ←  Forward Euler (ODE discretization)
Neural ODE: dh/dt = f(h(t), t)               ←  Continuous limit (Δt → 0)
DenseNet:   h_{t+1} = concat(h_t, f(h_t))    ←  Runge-Kutta-like (loose analogy to higher-order methods)

5. ResNet in [[Diffusion Model|Diffusion Models]]

5.1 U-Net Backbone

Modern diffusion models (DDPM, Stable Diffusion) use a U-Net architecture whose encoder and decoder are built from ResNet blocks:

U-Net for Diffusion
═══════════════════════════════════════
Encoder (Downsampling):
ResBlock × 2 → Downsample ────────────────────┐
ResBlock × 2 → Downsample ────────────┐ │
ResBlock × 2 → Downsample ────┐ │ │
ResBlock × 2 │ │ │
┌──┘ │ │
Middle: │ │ │
ResBlock + Self-Attention │ │ │
└──┐ │ │
Decoder (Upsampling): │ │ │
ResBlock × 2 ← Concat ←───────┘ │ │
ResBlock × 2 ← Concat ←───────────────┘ │
ResBlock × 2 ← Concat ←───────────────────────┘
ResBlock × 2
═══════════════════════════════════════

Each ResBlock in the U-Net combines three ingredients:

  1. Time embedding: injects the diffusion timestep $t$ via per-channel scaling and shifting
  2. Skip connection: preserves fine-grained spatial information
  3. Group Normalization: replaces BatchNorm; its statistics are computed per sample, so it behaves the same at any batch size
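The batch-size independence of Group Normalization can be seen from a minimal pure-Python sketch. The function below is a simplified stand-in (no learned scale or bias, spatial positions flattened into a list per channel) and takes a single sample as input, so no other sample in the batch can influence the result.

```python
import math


def group_norm(sample, num_groups, eps=1e-5):
    """Normalize one sample, given as a list of channels (each a list of floats)."""
    per_group = len(sample) // num_groups
    normalized = []
    for g in range(num_groups):
        group = sample[g * per_group:(g + 1) * per_group]
        values = [v for channel in group for v in channel]
        mean = sum(values) / len(values)
        var = sum((v - mean) ** 2 for v in values) / len(values)
        scale = 1.0 / math.sqrt(var + eps)
        normalized.extend([(v - mean) * scale for v in channel] for channel in group)
    return normalized


# 4 channels, 2 spatial positions each; the second group is 10× the first
sample = [[1.0, 2.0], [3.0, 4.0], [10.0, 20.0], [30.0, 40.0]]
out = group_norm(sample, num_groups=2)
for channel in out:
    print([round(v, 3) for v in channel])
```

Both groups end up with nearly identical normalized values despite the 10× difference in scale, because each group is standardized with its own mean and variance.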

5.2 Diffusion ResBlock Code

```python
import torch.nn as nn
import torch.nn.functional as F


class DiffusionResBlock(nn.Module):
    """Residual block used in diffusion model U-Nets."""

    def __init__(self, channels, emb_channels, out_channels=None):
        super().__init__()
        out_channels = out_channels or channels

        # GroupNorm(32, C) requires C to be divisible by 32
        self.norm1 = nn.GroupNorm(32, channels)
        self.conv1 = nn.Conv2d(channels, out_channels, 3, padding=1)
        self.norm2 = nn.GroupNorm(32, out_channels)
        self.conv2 = nn.Conv2d(out_channels, out_channels, 3, padding=1)

        # Time embedding projection: one scale and one shift per output channel,
        # hence 2 * out_channels outputs (chunk(2) below splits them)
        self.time_emb_proj = nn.Linear(emb_channels, 2 * out_channels)

        # Skip connection when channels change
        self.skip = (nn.Conv2d(channels, out_channels, 1)
                     if channels != out_channels else nn.Identity())

    def forward(self, x, t_emb):
        # Time embedding → scale & shift, each of shape (batch, out_channels)
        t_scale_shift = self.time_emb_proj(F.silu(t_emb))
        scale, shift = t_scale_shift.chunk(2, dim=1)

        residual = self.skip(x)

        h = self.norm1(x)
        h = F.silu(h)
        h = self.conv1(h)
        # Broadcast the per-channel scale/shift over the spatial dimensions
        h = h * (1 + scale[:, :, None, None]) + shift[:, :, None, None]

        h = self.norm2(h)
        h = F.silu(h)
        h = self.conv2(h)

        return h + residual
```
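The scale-and-shift step can be looked at in isolation. This is a minimal pure-Python sketch for a single sample, assuming the time-embedding projection emits 2C values for C feature channels; `split_scale_shift` and `modulate` are hypothetical helper names, not part of any library.

```python
def split_scale_shift(projection):
    """Split a length-2C projection into C scales followed by C shifts."""
    half = len(projection) // 2
    return projection[:half], projection[half:]


def modulate(features, scale, shift):
    """Apply h * (1 + scale) + shift per channel; features is a list of channels."""
    return [[v * (1 + s) + b for v in channel]
            for channel, s, b in zip(features, scale, shift)]


features = [[1.0, 2.0], [3.0, 4.0]]  # C = 2 channels, 2 spatial positions

scale, shift = split_scale_shift([0.0, 0.0, 0.0, 0.0])
print(modulate(features, scale, shift))  # zero projection leaves features unchanged

scale, shift = split_scale_shift([1.0, 0.0, 0.5, -1.0])
print(modulate(features, scale, shift))  # [[2.5, 4.5], [2.0, 3.0]]
```

Writing the modulation as `1 + scale` means a zero projection output leaves the features untouched, so the timestep conditioning starts from a neutral point.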

6. Beyond ResNet: Architecture Family Tree

6.1 Direct Descendants

| Architecture | Innovation | Relationship to ResNet |
|---|---|---|
| ResNeXt | Grouped convolutions (“cardinality”) | Multi-branch ResNet |
| DenseNet | Dense skip connections (all previous layers) | Extreme skip connection variant |
| Wide ResNet | Wider layers, less depth | Trades depth for width |
| SE-ResNet | Squeeze-and-Excitation attention | Adds channel attention to ResBlocks |
| ResNeSt | Split-attention blocks | Combines ResNeXt + SE-Net |
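To illustrate what grouped convolutions buy ResNeXt, here is a short weight-count sketch. It assumes a bias-free convolution whose input channels are split evenly across groups; the 128-channel, 32-group example is chosen for illustration and is not a figure from this table.

```python
def grouped_conv_params(c_in, c_out, kernel, groups=1):
    """Weight count of a bias-free grouped convolution."""
    if c_in % groups or c_out % groups:
        raise ValueError("channel counts must be divisible by groups")
    # Each output channel only sees c_in / groups input channels
    return kernel * kernel * (c_in // groups) * c_out


dense = grouped_conv_params(128, 128, 3)
grouped = grouped_conv_params(128, 128, 3, groups=32)
print(dense, grouped, dense // grouped)  # 147456 4608 32
```

The saved parameters are what allow a multi-branch block to use more channels at a similar cost.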

6.2 Conceptual Influences

| Architecture | Relationship to ResNet |
|---|---|
| U-Net | Skip connections between encoder-decoder (same spirit) |
| [[Neural ODE]] | Continuous limit of discrete ResNet layers |
| Transformer | Residual connections around every attention and feed-forward sub-layer |
| Highway Networks | Learned gating on skip connections (predecessor, 2015) |
| FPN (Feature Pyramid) | Lateral skip connections for multi-scale features |

6.3 The Residual Principle in Modern ML

ResNet (2015) ────→ U-Net with residual blocks ────→ Diffusion U-Net (2020+)
│
├──→ Transformer (2017): residual connection around every sub-layer
│    └──→ ViT, GPT, BERT, LLaMA
│
├──→ DenseNet (2017): Dense skip connections
│
└──→ Neural ODE (2018): Continuous residual limit
     └──→ FFJORD, Latent ODE, Neural CDE

7. Practical Training Insights

7.1 Key Hyperparameters

| Parameter | ResNet-50 (ImageNet) | Notes |
|---|---|---|
| Initialization | He (Kaiming) Normal | Designed for ReLU |
| BatchNorm momentum | 0.1 | Standard |
| Weight decay | 1e-4 | L2 regularization |
| Learning rate | 0.1 → divided by 10 at 30, 60, 80 epochs | Step decay |
| Batch size | 256 | 8× V100 GPUs |
| Epochs | 90 | Standard ImageNet schedule |
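The step-decay row can be written as a small function. This sketch hard-codes the values from the table as defaults; `step_lr` is a hypothetical helper, not a library scheduler.

```python
def step_lr(epoch, base_lr=0.1, milestones=(30, 60, 80), gamma=0.1):
    """Learning rate at a given epoch: multiplied by gamma at each milestone passed."""
    passed = sum(1 for m in milestones if epoch >= m)
    return base_lr * gamma ** passed


for epoch in (0, 29, 30, 59, 60, 80, 89):
    print(epoch, f"{step_lr(epoch):.0e}")
```

The rate stays at 0.1 for the first 30 epochs and ends at 1e-4 for the final ten.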

7.2 Initialization Matters

ResNet uses Kaiming initialization specifically designed for layers followed by ReLU:

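```python
import torch.nn as nn


def kaiming_init(m):
    if isinstance(m, nn.Conv2d):
        # He initialization; with mode='fan_out', std = sqrt(2 / fan_out)
        nn.init.kaiming_normal_(m.weight, mode='fan_out', nonlinearity='relu')
    elif isinstance(m, nn.BatchNorm2d):
        nn.init.constant_(m.weight, 1)
        nn.init.constant_(m.bias, 0)


# `model` can be any nn.Module; this small stack is only a stand-in
model = nn.Sequential(nn.Conv2d(3, 64, 3, bias=False), nn.BatchNorm2d(64), nn.ReLU())
model.apply(kaiming_init)
```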
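The standard deviation that He initialization draws from can be computed by hand. A minimal sketch, assuming a ReLU gain of $\sqrt{2}$ and the usual definitions of fan-in and fan-out for a convolution; `kaiming_std` is an illustrative helper, and the 7×7 stem (3 → 64 channels) is used as the example layer.

```python
import math


def kaiming_std(c_in, c_out, kernel, mode="fan_out"):
    """Std of He-normal weights for a kernel×kernel conv followed by ReLU."""
    channels = c_out if mode == "fan_out" else c_in
    fan = channels * kernel * kernel
    return math.sqrt(2.0 / fan)


print(f"fan_out: {kaiming_std(3, 64, 7, 'fan_out'):.5f}")
print(f"fan_in:  {kaiming_std(3, 64, 7, 'fan_in'):.5f}")
```

For layers where input and output widths differ strongly, such as the stem, the two modes give noticeably different scales, so the `mode` argument matters.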

7.3 Batch Normalization After Addition

In the original ResNet design, BatchNorm is applied before the skip connection addition:

Wrong (common mistake):
Conv → BN → ReLU → Conv → +x → BN → ReLU

Correct (original ResNet):
Conv → BN → ReLU → Conv → BN → +x → ReLU

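Placing BN after the addition also normalizes the shortcut signal, so the identity path is no longer clean. Even in the correct layout the final ReLU still sits on that path, which is what pre-activation ResNet removes.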


8. Theoretical Properties

8.1 Loss Landscape Smoothing

Residual connections smooth the loss landscape, making optimization easier:

  • Without skip connections: the loss surface of a deep network tends to be chaotic, with sharp minima and strongly non-convex regions
  • With skip connections: the loss surface becomes smoother and more convex-like
  • This is consistent with ResNets converging faster and to better minima

8.2 Shattered Gradients Problem

In very deep plain networks, gradients resemble white noise: the correlation between the gradients at nearby inputs decays exponentially with depth. Residual connections slow this decay substantially, so gradients stay informative across the full network depth.

8.3 Depth Efficiency

| Depth | Plain Network Accuracy | ResNet Accuracy |
|---|---|---|
| 20 layers | 91.25% | 91.75% |
| 56 layers | 90.10% (degradation!) | 93.03% |
| 110 layers | Untrainable | 93.57% |

This showed that depth is still valuable; it just requires the right architecture to unlock it.


9. Core Formula Cards

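[!QUOTE] Residual Block

$$y = F(x, \{W_i\}) + x$$

[!QUOTE] Gradient Flow with Skip Connection

$$\frac{\partial \mathcal{L}}{\partial x_l} = \frac{\partial \mathcal{L}}{\partial x_L}\left(1 + \sum_{i=l}^{L-1} \frac{\partial F(x_i)}{\partial x_l}\right)$$

[!QUOTE] Unrolled ResNet (Ensemble View)

$$x_L = x_0 + \sum_{i=0}^{L-1} F(x_i, \{W_i\})$$

[!QUOTE] ResNet as Discrete ODE (Forward Euler)

$$h_{t+1} = h_t + f_\theta(h_t)$$

[!QUOTE] Continuous Limit → [[Neural ODE]]

$$\frac{dh(t)}{dt} = f_\theta(h(t), t)$$

[!QUOTE] Bottleneck Block Dimensions

$$C_{in} \xrightarrow{1 \times 1} \frac{C_{in}}{4} \xrightarrow{3 \times 3} \frac{C_{in}}{4} \xrightarrow{1 \times 1} C_{in}$$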

10. Summary

| Aspect | Description |
|---|---|
| Core idea | Learn residual $F(x)$ rather than full mapping $H(x)$ |
| Key mechanism | Skip connections create gradient highways |
| Breakthrough | Made networks with 100+ layers trainable, with accuracy that keeps improving with depth |
| Impact | Residual blocks underpin the diffusion U-Net, the Transformer (LLMs), and Neural ODEs |
| Variants | Pre-activation, Bottleneck, Wide ResNet, ResNeXt, DenseNet |
| Role in diffusion | Building block of diffusion U-Net encoders/decoders |
| Continuous limit | [[Neural ODE]] (forward Euler discretization → continuous ODE) |

  • [[Neural ODE]]
  • [[Diffusion Model]]
  • [[U-Net]]
  • [[Convolutional Neural Network (CNN)]]
  • [[Vision Transformer (ViT)]]
  • [[Batch Normalization]]
  • [[DenseNet]]
  • [[Transformer]]
  • [[Welford]]

Dataview Query

LIST
FROM #resnet OR #skip_connection OR #residual_learning
SORT file.ctime DESC

References

  • Paper: Deep Residual Learning for Image Recognition (He et al., CVPR 2016 — Best Paper)
  • Paper: Identity Mappings in Deep Residual Networks (He et al., ECCV 2016 — Pre-activation ResNet)
  • Paper: Aggregated Residual Transformations for Deep Neural Networks (Xie et al., 2017 — ResNeXt)
  • Paper: Wide Residual Networks (Zagoruyko & Komodakis, 2016)
  • Paper: Neural Ordinary Differential Equations (Chen et al., NeurIPS 2018 — Best Paper)
  • Paper: Denoising Diffusion Probabilistic Models (Ho et al., 2020 — DDPM, U-Net backbone)
  • Blog: Understanding ResNet and its Variants — Towards Data Science
  • Blog: The Annotated ResNet — Aman Arora
  • Course: CS231n Convolutional Neural Networks for Visual Recognition (Stanford)
  • Course: CS236 Deep Generative Models (Stanford)
  • Code: https://github.com/pytorch/vision/blob/main/torchvision/models/resnet.py