2026-06-30

ResNet (Residual Network)

ResNet is a deep convolutional neural network architecture that introduces skip connections (residual connections) to enable training of extremely deep networks. It won the ILSVRC 2015 (ImageNet) classification task and became a standard backbone for modern vision architectures; its residual blocks are also the building block of the U-Net used in [[Diffusion Model|diffusion models]].

1. Core Concept: Residual Learning

1.1 The Degradation Problem

Before ResNet, deeper networks performed worse than shallower ones — not due to overfitting, but because optimization became harder as depth increased:

Network Depth	Training Error	Test Error
20 layers	Lower	Lower
56 layers	Higher	Higher

This counterintuitive phenomenon is called the degradation problem: adding more layers increases training error even though the deeper network can theoretically represent the shallower one (by setting extra layers to identity).

[!NOTE] Root Cause
Vanishing/exploding gradients compound across many layers, but normalized initialization and Batch Normalization largely address them, and the degradation persists anyway. The remaining difficulty is one of optimization: stacked nonlinear layers struggle to learn identity mappings, which are surprisingly hard to approximate with plain layers.

1.2 The Residual Solution

Instead of learning a direct mapping $H(x)$, ResNet learns the residual $F(x) = H(x) - x$:

$y = F(x, \{W_{i}\}) + x$

where $F(x, \{W_{i}\})$ is the residual function (typically 2-3 weight layers), and $x$ is the skip connection that bypasses these layers.

[!NOTE] Why This Works
If the optimal mapping is close to identity ($H(x) \approx x$), the residual $F(x) \approx 0$ is much easier to learn than the full mapping. The skip connection provides a gradient highway that allows error signals to flow directly backward without attenuation.
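
The point about identity being easy for a residual block can be seen in a toy setting. The sketch below is illustrative only: the two-layer branch `W2 · relu(W1 · x)` and the helper names are assumptions made for this example, not part of any ResNet implementation.

```python
def relu(v):
    return [max(0.0, a) for a in v]

def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * a for w, a in zip(row, v)) for row in W]

def plain_block(x, W1, W2):
    """H(x) = W2 · relu(W1 · x): the layers must learn the whole mapping."""
    return matvec(W2, relu(matvec(W1, x)))

def residual_block(x, W1, W2):
    """y = F(x) + x with F(x) = W2 · relu(W1 · x)."""
    return [f + a for f, a in zip(plain_block(x, W1, W2), x)]

zeros = [[0.0, 0.0], [0.0, 0.0]]
x = [1.5, -2.0]
print(plain_block(x, zeros, zeros))     # [0.0, 0.0]: the input is lost
print(residual_block(x, zeros, zeros))  # [1.5, -2.0]: exact identity
```

With all weights at zero the residual block already computes the identity, whereas the plain block would need specific non-trivial weights to pass even part of its input through the ReLU.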

1.3 Mathematical Intuition

Standard network (hard to optimize):

$y = H(x)$

Residual network (easier to optimize):

$y = F(x) + x$

Gradient flow through skip connection:

$\frac{\partial L}{\partial x} = \frac{\partial L}{\partial y} \cdot \left(\frac{\partial F(x)}{\partial x} + 1\right)$

The additive identity term $1$ gives the gradient a direct path: even if $\frac{\partial F}{\partial x}$ becomes very small, $\frac{\partial L}{\partial x} \approx \frac{\partial L}{\partial y}$ instead of shrinking toward zero.
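
A scalar toy model makes the effect of the additive term concrete. This is a sketch under assumptions: the layer function `w * tanh(x)`, the weight `w = 0.5`, and the depth of 50 are arbitrary choices for illustration.

```python
import math

def plain_gradient(x, w, depth):
    """d x_depth / d x_0 for the plain stack x_{k+1} = w * tanh(x_k)."""
    grad = 1.0
    for _ in range(depth):
        grad *= w * (1.0 - math.tanh(x) ** 2)  # local derivative of w * tanh(x)
        x = w * math.tanh(x)
    return grad

def residual_gradient(x, w, depth):
    """d x_depth / d x_0 for the residual stack x_{k+1} = x_k + w * tanh(x_k)."""
    grad = 1.0
    for _ in range(depth):
        grad *= 1.0 + w * (1.0 - math.tanh(x) ** 2)  # the "1 +" comes from the skip
        x = x + w * math.tanh(x)
    return grad

plain = plain_gradient(x=0.5, w=0.5, depth=50)
residual = residual_gradient(x=0.5, w=0.5, depth=50)
print(f"plain: {plain:.3e}  residual: {residual:.3e}")
```

Every factor in the plain product is at most `w`, so the gradient decays geometrically with depth; every factor in the residual product is at least 1 for this choice of `w`, so the gradient cannot shrink.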

2. Architecture Design

2.1 Basic Residual Block

The fundamental building block (used in ResNet-18/34):

import torch.nn as nn
import torch.nn.functional as F


class BasicBlock(nn.Module):
    """Basic residual block: 3x3 conv → BN → ReLU → 3x3 conv → BN → +skip → ReLU"""
    
    def __init__(self, in_channels, out_channels, stride=1):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels, out_channels, 3, 
                               stride=stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_channels)
        self.conv2 = nn.Conv2d(out_channels, out_channels, 3, 
                               stride=1, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_channels)
        
        # Skip connection with 1x1 conv when dimensions change
        self.shortcut = nn.Sequential()
        if stride != 1 or in_channels != out_channels:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_channels, out_channels, 1, stride=stride, bias=False),
                nn.BatchNorm2d(out_channels)
            )
    
    def forward(self, x):
        residual = self.shortcut(x)
        
        out = self.conv1(x)
        out = self.bn1(out)
        out = F.relu(out)
        out = self.conv2(out)
        out = self.bn2(out)
        
        out += residual        # Element-wise addition
        out = F.relu(out)
        return out

Structure:

Input → [3×3 Conv → BN → ReLU → 3×3 Conv → BN] → + Input → ReLU → Output
        └───────────── Residual Path ─────────────┘

2.2 Bottleneck Block

Used in deeper variants (ResNet-50/101/152) to reduce computation:

class Bottleneck(nn.Module):
    """Bottleneck: 1x1 → 3x3 → 1x1 conv, reducing and restoring channels"""
    
    expansion = 4  # Output channels = bottleneck_channels × 4
    
    def __init__(self, in_channels, bottleneck_channels, stride=1):
        super().__init__()
        # 1x1 compress: 256 → 64
        self.conv1 = nn.Conv2d(in_channels, bottleneck_channels, 1, bias=False)
        self.bn1 = nn.BatchNorm2d(bottleneck_channels)
        # 3x3 spatial: 64 → 64
        self.conv2 = nn.Conv2d(bottleneck_channels, bottleneck_channels, 3,
                               stride=stride, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(bottleneck_channels)
        # 1x1 expand: 64 → 256
        self.conv3 = nn.Conv2d(bottleneck_channels, 
                               bottleneck_channels * self.expansion, 1, bias=False)
        self.bn3 = nn.BatchNorm2d(bottleneck_channels * self.expansion)
        
        self.shortcut = nn.Sequential()
        if stride != 1 or in_channels != bottleneck_channels * self.expansion:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_channels, bottleneck_channels * self.expansion,
                          1, stride=stride, bias=False),
                nn.BatchNorm2d(bottleneck_channels * self.expansion)
            )
    
    def forward(self, x):
        residual = self.shortcut(x)
        out = F.relu(self.bn1(self.conv1(x)))
        out = F.relu(self.bn2(self.conv2(out)))
        out = self.bn3(self.conv3(out))
        out += residual
        out = F.relu(out)
        return out

Bottleneck structure:

256-d → [1×1, 64] → ReLU → [3×3, 64] → ReLU → [1×1, 256] → + skip → ReLU

Block Type	Convolutions	Parameter Count (conv weights only)	Used In
Basic	3×3, 3×3	$3^{2} C_{in} C_{out} + 3^{2} C_{out}^{2}$	ResNet-18, 34
Bottleneck	1×1, 3×3, 1×1	$C_{in} C_{mid} + 3^{2} C_{mid}^{2} + C_{mid} C_{out}$	ResNet-50, 101, 152
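
To see what the bottleneck saves, the counts in the table can be evaluated for concrete widths. The sketch below counts convolution weights only (no BatchNorm parameters, no projection shortcut); the function names are made up for this example.

```python
def conv_params(c_in, c_out, kernel):
    """Weights in a bias-free kernel x kernel convolution."""
    return kernel * kernel * c_in * c_out

def basic_block_params(c_in, c_out):
    """Two 3x3 convolutions: c_in -> c_out -> c_out."""
    return conv_params(c_in, c_out, 3) + conv_params(c_out, c_out, 3)

def bottleneck_params(c_in, c_mid, c_out):
    """1x1 compress, 3x3 at reduced width, 1x1 expand."""
    return (conv_params(c_in, c_mid, 1)
            + conv_params(c_mid, c_mid, 3)
            + conv_params(c_mid, c_out, 1))

print(basic_block_params(64, 64))        # 73728
print(bottleneck_params(256, 64, 256))   # 69632
print(basic_block_params(256, 256))      # 1179648
```

A bottleneck operating on 256 channels costs about as much as a basic block on 64 channels, and roughly 17 times less than a basic block applied directly at 256 channels.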

2.3 Full Architecture Zoo

Model	Layers	Blocks per Stage	Block Type	Parameters
ResNet-18	18	`[2, 2, 2, 2]`	Basic	11.7M
ResNet-34	34	`[3, 4, 6, 3]`	Basic	21.8M
ResNet-50	50	`[3, 4, 6, 3]`	Bottleneck	25.6M
ResNet-101	101	`[3, 4, 23, 3]`	Bottleneck	44.5M
ResNet-152	152	`[3, 8, 36, 3]`	Bottleneck	60.2M
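
The layer counts in the model names can be recovered from the block configuration. A minimal sketch, assuming the usual counting convention (stem convolution, the convolutions inside each block, and the final fully connected layer; projection shortcuts are not counted):

```python
# (blocks per stage, weighted layers per block) taken from the table above
CONFIGS = {
    "ResNet-18": ([2, 2, 2, 2], 2),
    "ResNet-34": ([3, 4, 6, 3], 2),
    "ResNet-50": ([3, 4, 6, 3], 3),
    "ResNet-101": ([3, 4, 23, 3], 3),
    "ResNet-152": ([3, 8, 36, 3], 3),
}

def count_layers(blocks_per_stage, convs_per_block):
    """Stem conv + convs inside residual blocks + final FC layer."""
    return 1 + convs_per_block * sum(blocks_per_stage) + 1

for name, (blocks, convs) in CONFIGS.items():
    print(name, count_layers(blocks, convs))
```

ResNet-34 and ResNet-50 share the same stage layout; the difference in depth comes entirely from swapping two-layer basic blocks for three-layer bottlenecks.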

Overall architecture flow:

Input (224×224×3)
  ↓
7×7 Conv, 64, stride=2 → BN → ReLU
  ↓
3×3 MaxPool, stride=2
  ↓
Stage 1: [ResBlock × n₁], 64/256 channels, stride=1
  ↓
Stage 2: [ResBlock × n₂], 128/512 channels, stride=2
  ↓
Stage 3: [ResBlock × n₃], 256/1024 channels, stride=2
  ↓
Stage 4: [ResBlock × n₄], 512/2048 channels, stride=2
  ↓
Global Average Pooling → FC 1000 → Softmax
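
The spatial sizes implied by this flow can be checked with the standard convolution output-size formula. This is a sketch: the padding values (3 for the 7×7 stem, 1 for the max-pool and the 3×3 convolutions) are assumptions, since the flow lists only kernel sizes and strides, and `conv_out` is a name chosen for illustration.

```python
def conv_out(size, kernel, stride, padding):
    """Output width of a convolution or pooling layer (floor division)."""
    return (size + 2 * padding - kernel) // stride + 1

# (name, kernel, stride, padding); paddings are assumed, not given in the text
LAYERS = [
    ("7x7 conv, stride 2", 7, 2, 3),
    ("3x3 max-pool, stride 2", 3, 2, 1),
    ("stage 1 (stride 1)", 3, 1, 1),
    ("stage 2 (stride 2)", 3, 2, 1),
    ("stage 3 (stride 2)", 3, 2, 1),
    ("stage 4 (stride 2)", 3, 2, 1),
]

size = 224
sizes = [size]
for name, kernel, stride, padding in LAYERS:
    size = conv_out(size, kernel, stride, padding)
    sizes.append(size)
    print(f"{name:<24} -> {size}x{size}")
```

Under these assumptions the resolution goes 224 → 112 → 56 → 56 → 28 → 14 → 7, so global average pooling reduces a 7×7 feature map to a single vector.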

3. Why ResNet Works: Theoretical Analysis

3.1 Gradient Highway

The skip connection creates an uninterrupted gradient path. Backpropagating from a deeper layer $L$ to a shallower layer $l$ through a stack of residual blocks gives:

$\frac{\partial L}{\partial x_{l}} = \frac{\partial L}{\partial x_{L}} \left(1 + \sum_{i = l}^{L - 1} \frac{\partial F(x_{i})}{\partial x_{l}}\right)$

The term $1$ guarantees that gradients from layer $L$ can reach layer $l$ without multiplicative attenuation.
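
The formula follows from unrolling the residual recursion, which is worth spelling out. The derivation below assumes identity skip connections with no nonlinearity applied after the addition (the pre-activation form); as in the text, $L$ denotes both the loss and the index of the deepest layer.

```latex
\begin{aligned}
x_{i+1} &= x_{i} + F(x_{i}), \qquad i = l, \dots, L-1 \\
x_{L} &= x_{l} + \sum_{i=l}^{L-1} F(x_{i}) \\
\frac{\partial x_{L}}{\partial x_{l}} &= 1 + \sum_{i=l}^{L-1} \frac{\partial F(x_{i})}{\partial x_{l}} \\
\frac{\partial L}{\partial x_{l}} &= \frac{\partial L}{\partial x_{L}} \cdot \frac{\partial x_{L}}{\partial x_{l}}
 = \frac{\partial L}{\partial x_{L}} \left(1 + \sum_{i=l}^{L-1} \frac{\partial F(x_{i})}{\partial x_{l}}\right)
\end{aligned}
```

In a plain network the corresponding Jacobian is a product of per-layer terms rather than a sum with a leading $1$, which is where the multiplicative attenuation comes from.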

3.2 Unrolled View as Ensemble

A ResNet with $L$ residual blocks can be viewed as an implicit ensemble of $2^{L}$ paths. Unrolling the recursion $x_{i+1} = x_{i} + F(x_{i})$ gives:

$x_{L} = x_{0} + \sum_{i = 0}^{L - 1} F(x_{i})$

Each path corresponds to including or skipping each residual block, so the network behaves like an ensemble of shallower networks that share parameters. Removing any single block during inference causes only a mild performance drop (≈0.3% accuracy loss), which supports this ensemble interpretation.
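
The path count can be made explicit for a small case. The sketch below uses a deliberately simplified linear scalar block, $x_{k+1} = x_{k} + a_{k} x_{k}$, which is an assumption for illustration: with nonlinear blocks the paths interact and do not separate this cleanly.

```python
from itertools import product
from math import isclose, prod

def unroll_paths(gains):
    """Expand prod(1 + a_k) into one term per subset of blocks.

    Each mask marks which blocks a path passes through (1) or skips (0);
    the path's contribution is the product of the gains it passes through.
    """
    terms = []
    for mask in product([0, 1], repeat=len(gains)):
        contribution = prod(a for a, used in zip(gains, mask) if used)
        terms.append((mask, contribution))
    return terms

gains = [0.5, -0.2, 0.1]
terms = unroll_paths(gains)
print(len(terms))  # 8 paths for 3 blocks

total = sum(c for _, c in terms)
direct = prod(1 + a for a in gains)
print(isclose(total, direct))  # True: summing all paths reproduces the network

# Number of paths of each length (0 to 3 blocks traversed)
lengths = [sum(mask) for mask, _ in terms]
print([lengths.count(k) for k in range(len(gains) + 1)])  # [1, 3, 3, 1]
```

The length histogram is binomial, so in a deep network most paths traverse about half of the blocks and are much shorter than the nominal depth.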

3.3 Identity Mapping Importance

The pre-activation design (He et al., 2016) showed empirically that keeping the skip connection a pure identity mapping (no ReLU after the addition) matters for very deep networks:

Design	Block Layout	Test Error (CIFAR-10)
Original	Conv → BN → ReLU → Conv → BN → +x → ReLU	6.43% (ResNet-110)
Pre-activation	BN → ReLU → Conv → BN → ReLU → Conv → +x	4.62% (ResNet-1001)

[!NOTE] Pre-Activation Insight
Moving BatchNorm and ReLU before the convolution (instead of after) makes the information flow through the skip connection completely clean, enabling even 1000+ layer networks to be trained effectively.

3.4 Pre-activation ResNet Code

class PreActBlock(nn.Module):
    """Pre-activation residual block (He et al., 2016)"""
    
    def __init__(self, in_channels, out_channels, stride=1):
        super().__init__()
        self.bn1 = nn.BatchNorm2d(in_channels)
        self.conv1 = nn.Conv2d(in_channels, out_channels, 3,
                               stride=stride, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_channels)
        self.conv2 = nn.Conv2d(out_channels, out_channels, 3,
                               stride=1, padding=1, bias=False)
        
        self.shortcut = nn.Sequential()
        if stride != 1 or in_channels != out_channels:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_channels, out_channels, 1, stride=stride, bias=False)
            )
    
    def forward(self, x):
        residual = self.shortcut(x)
        out = F.relu(self.bn1(x))
        out = self.conv1(out)
        out = F.relu(self.bn2(out))
        out = self.conv2(out)
        return out + residual  # No ReLU after addition → pure identity path

4. Connection to [[Neural ODE]]

4.1 ResNet as Discrete ODE

A ResNet block with step size $Δ t$:

$h_{t + Δ t} = h_{t} + Δ t \cdot f_{θ}(h_{t})$

In the limit $Δ t \to 0$, this becomes a [[Neural ODE]]:

$\frac{d h(t)}{d t} = f_{θ}(h(t), t)$
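
The limit can be observed numerically. The sketch below applies the residual update with a shrinking step size to a hypothetical test problem, $dh/dt = -h$, chosen because its exact solution $h(t) = h_{0} e^{-t}$ is known; `residual_euler` and `decay` are names invented for this example.

```python
import math

def residual_euler(h0, f, total_time, num_blocks):
    """Apply num_blocks residual updates h <- h + dt * f(h), with dt = total_time / num_blocks."""
    dt = total_time / num_blocks
    h = h0
    for _ in range(num_blocks):
        h = h + dt * f(h)
    return h

def decay(h):
    """Hypothetical dynamics dh/dt = -h."""
    return -h

exact = math.exp(-1.0)  # h(1) for h(0) = 1
for n in (1, 10, 100, 1000):
    approx = residual_euler(1.0, decay, total_time=1.0, num_blocks=n)
    print(f"{n:>5} blocks: error = {abs(approx - exact):.2e}")
```

The error shrinks roughly in proportion to the step size, as expected for forward Euler: more, smaller residual steps track the continuous trajectory more closely.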

Aspect	ResNet	[[Neural ODE]]
Layer index	Discrete $t = 0, 1, 2, \dots, L$	Continuous $t \in [0, T]$
Depth	Fixed $L$	ODE solver adapts steps
Step size	$Δ t = 1$	Adaptive, solver-determined
Parameters	Per-layer or shared	Naturally shared across “depth”
Memory (backprop)	$O (L)$	$O (1)$ (adjoint method)

4.2 The Continuum Bridge

ResNet provides the discrete scaffold from which [[Neural ODE]] emerged as the continuous limit. This connection means:

ResNet training insights (initialization, normalization) transfer to Neural ODEs
Neural ODEs can be discretized back into ResNet-like architectures for deployment
ResNet’s skip connections are the discrete analog of numerical ODE integrators (specifically, the forward Euler method)

ResNet:    h_{t+1} = h_t + f(h_t)        ←  Forward Euler (ODE discretization)
Neural ODE: dh/dt = f(h(t), t)           ←  Continuous limit (Δt → 0)
DenseNet:  h_{t+1} = concat(h_t, f(h_t)) ←  Runge-Kutta-like (higher-order)

5. ResNet in [[Diffusion Model|Diffusion Models]]

5.1 U-Net Backbone

Modern diffusion models (DDPM, Stable Diffusion) use a U-Net architecture whose encoder and decoder are built from ResNet blocks:

U-Net for Diffusion
═══════════════════════════════════════
Encoder (Downsampling):
  ResBlock × 2 → Downsample ────────────────────┐
  ResBlock × 2 → Downsample ────────────┐       │
  ResBlock × 2 → Downsample ────┐       │       │
  ResBlock × 2                  │       │       │
                             ┌──┘       │       │
Middle:                      │          │       │
  ResBlock + Self-Attention   │          │       │
                             └──┐       │       │
Decoder (Upsampling):           │       │       │
  ResBlock × 2 ← Concat ←───────┘       │       │
  ResBlock × 2 ← Concat ←───────────────┘       │
  ResBlock × 2 ← Concat ←───────────────────────┘
  ResBlock × 2
═══════════════════════════════════════

Each ResBlock in the U-Net processes:

Time embedding: Injects diffusion timestep $t$ via scaling/shifting
Skip connection: Preserves fine-grained spatial information
Group Normalization: Replaces BatchNorm (better for varying batch sizes)

5.2 Diffusion ResBlock Code

class DiffusionResBlock(nn.Module):
    """Residual block used in diffusion model U-Nets."""
    
    def __init__(self, channels, emb_channels, out_channels=None):
        super().__init__()
        out_channels = out_channels or channels
        
        self.norm1 = nn.GroupNorm(32, channels)
        self.conv1 = nn.Conv2d(channels, out_channels, 3, padding=1)
        self.norm2 = nn.GroupNorm(32, out_channels)
        self.conv2 = nn.Conv2d(out_channels, out_channels, 3, padding=1)
        
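        # Time embedding projection: 2 * out_channels so the output can be
        # split into a per-channel scale and a per-channel shift
        self.time_emb_proj = nn.Linear(emb_channels, 2 * out_channels)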
        
        # Skip connection when channels change
        self.skip = nn.Conv2d(channels, out_channels, 1) \
                    if channels != out_channels else nn.Identity()
    
    def forward(self, x, t_emb):
        # Time embedding → scale & shift
        t_scale_shift = self.time_emb_proj(F.silu(t_emb))
        scale, shift = t_scale_shift.chunk(2, dim=1)
        
        residual = self.skip(x)
        
        h = self.norm1(x)
        h = F.silu(h)
        h = self.conv1(h)
        h = h * (1 + scale[:, :, None, None]) + shift[:, :, None, None]
        
        h = self.norm2(h)
        h = F.silu(h)
        h = self.conv2(h)
        
        return h + residual
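
The modulation line `h * (1 + scale[:, :, None, None]) + shift[:, :, None, None]` applies one scale and one shift per channel, broadcast over all spatial positions. A dependency-free sketch of that broadcast for a single sample, using nested lists in place of tensors (the function name and the toy values are assumptions for illustration):

```python
def scale_shift(h, scale, shift):
    """Per-channel modulation h[c] * (1 + scale[c]) + shift[c] for one sample.

    h is a list of channels, each an H x W nested list; scale and shift hold
    one value per channel, mirroring the [:, :, None, None] broadcast.
    """
    return [
        [[v * (1.0 + s) + b for v in row] for row in channel]
        for channel, s, b in zip(h, scale, shift)
    ]

h = [[[1.0, 2.0], [3.0, 4.0]],   # channel 0
     [[1.0, 1.0], [1.0, 1.0]]]   # channel 1

# Zero scale and shift leave the features unchanged
print(scale_shift(h, [0.0, 0.0], [0.0, 0.0]) == h)  # True

# Channel 0 is doubled, channel 1 is shifted by 0.5
print(scale_shift(h, [1.0, 0.0], [0.0, 0.5]))
# [[[2.0, 4.0], [6.0, 8.0]], [[1.5, 1.5], [1.5, 1.5]]]
```

Writing the factor as `1 + scale` means a projection that outputs zeros leaves the block's features untouched, so the timestep conditioning starts as a small perturbation of an unconditioned block.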

6. Beyond ResNet: Architecture Family Tree

6.1 Direct Descendants

Architecture	Innovation	Relationship to ResNet
ResNeXt	Grouped convolutions (“cardinality”)	Multi-branch ResNet
DenseNet	Dense skip connections (all previous layers)	Extreme skip connection variant
Wide ResNet	Wider layers, less depth	Trades depth for width
SE-ResNet	Squeeze-and-Excitation attention	Adds channel attention to ResBlocks
ResNeSt	Split-attention blocks	Combines ResNeXt + SE-Net

6.2 Conceptual Influences

Architecture	How ResNet Influenced It
U-Net	Skip connections between encoder and decoder (same spirit, developed independently in 2015)
[[Neural ODE]]	Continuous limit of discrete ResNet layers
Transformer	Residual connections in every attention block
Highway Networks	Learned gating on skip connections (predecessor, 2015)
FPN (Feature Pyramid)	Lateral skip connections for multi-scale features

6.3 The Residual Principle in Modern ML

ResNet (2015) ────→ U-Net (2015) ────→ Diffusion U-Net (2020+)
     │
     ├──→ Transformer (2017): residual connection around every sublayer
     │         └──→ ViT, GPT, BERT, LLaMA
     │
     ├──→ DenseNet (2017): Dense residual connections
     │
     └──→ Neural ODE (2018): Continuous residual limit
               └──→ FFJORD, Latent ODE, Neural CDE

7. Practical Training Insights

7.1 Key Hyperparameters

Parameter	ResNet-50 (ImageNet)	Notes
Initialization	He (Kaiming) Normal	Designed for ReLU
BatchNorm momentum	0.1	Standard
Weight decay	1e-4	L2 regularization
Learning rate	0.1 → divided by 10 at 30, 60, 80 epochs	Step decay
Batch size	256	8× V100 GPUs
Epochs	90	Standard ImageNet schedule
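
The step-decay schedule in the table can be written as a small function. A minimal sketch, assuming the learning rate is multiplied by 0.1 at the start of each milestone epoch; `step_lr` is a name chosen for this example, not a library call.

```python
def step_lr(epoch, base_lr=0.1, milestones=(30, 60, 80), gamma=0.1):
    """Learning rate at a given epoch: base_lr * gamma^(milestones already reached)."""
    passed = sum(epoch >= m for m in milestones)
    return base_lr * gamma ** passed

for epoch in (0, 29, 30, 59, 60, 79, 80, 89):
    print(f"epoch {epoch:>2}: lr = {step_lr(epoch):.4f}")
```

Over the 90-epoch schedule the rate takes four values: 0.1, 0.01, 0.001, and 0.0001 for the final ten epochs.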

7.2 Initialization Matters

ResNet uses Kaiming initialization specifically designed for layers followed by ReLU:

def kaiming_init(m):
    if isinstance(m, nn.Conv2d):
        # He initialization; with mode='fan_out', std = sqrt(2 / fan_out)
        nn.init.kaiming_normal_(m.weight, mode='fan_out', nonlinearity='relu')
    elif isinstance(m, nn.BatchNorm2d):
        nn.init.constant_(m.weight, 1)
        nn.init.constant_(m.bias, 0)

model.apply(kaiming_init)
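
The `mode` argument decides which fan enters the standard deviation. The sketch below computes both variants by hand for a convolution, assuming the usual definitions (fan_in = input channels × kernel area, fan_out = output channels × kernel area) and the ReLU gain of $\sqrt{2}$; `kaiming_std` is a helper written for this example.

```python
import math

def kaiming_std(c_in, c_out, kernel, mode="fan_in"):
    """Std of He-normal init for a kernel x kernel conv followed by ReLU."""
    fan_in = c_in * kernel * kernel
    fan_out = c_out * kernel * kernel
    fan = fan_in if mode == "fan_in" else fan_out
    return math.sqrt(2.0 / fan)

# A 3x3 convolution that doubles the channel count, 64 -> 128
print(f"fan_in : {kaiming_std(64, 128, 3, 'fan_in'):.4f}")   # sqrt(2 / 576)
print(f"fan_out: {kaiming_std(64, 128, 3, 'fan_out'):.4f}")  # sqrt(2 / 1152)
```

The two modes coincide when input and output channel counts are equal and differ only at layers that change width: `fan_in` preserves activation variance in the forward pass, `fan_out` preserves gradient variance in the backward pass.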

7.3 Batch Normalization After Addition

In the original ResNet design, BatchNorm is applied before the skip connection addition:

Wrong (common mistake):
  Conv → BN → ReLU → Conv → +x → BN → ReLU
  
Correct (original ResNet):
  Conv → BN → ReLU → Conv → BN → +x → ReLU

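Placing BN after the addition puts a normalization on the shortcut path, so the signal no longer passes through unchanged. The original design still applies a ReLU after the addition, which also alters the shortcut signal; removing it is what motivated pre-activation ResNet.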

8. Theoretical Properties

8.1 Loss Landscape Smoothing

Residual connections smooth the loss landscape, making optimization easier:

Without skip connections: Loss surface has sharp local minima and chaotic curvature
With skip connections: Loss surface becomes smoother and more convex-like
This explains why ResNets converge faster and to better minima

8.2 Shattered Gradients Problem

In very deep plain networks, gradients with respect to nearby inputs become increasingly decorrelated as depth grows, until they resemble white noise. Residual connections slow this decay in gradient correlation, so gradients stay informative across the full network depth.

8.3 Depth Efficiency

Depth	Plain Network Accuracy	ResNet Accuracy
20 layers	91.25%	91.75%
56 layers	90.10% (degradation!)	93.03%
110 layers	Untrainable	93.57%

This proved that depth is still valuable — it just requires the right architecture to unlock it.

9. Core Formula Cards

[!QUOTE] Residual Block
$y = F(x, \{W_{i}\}) + x$

[!QUOTE] Gradient Flow with Skip Connection
$\frac{\partial L}{\partial x_{l}} = \frac{\partial L}{\partial x_{L}} (1 + \sum_{i = l}^{L - 1} \frac{\partial F (x_{i})}{\partial x_{l}})$

[!QUOTE] Unrolled ResNet (Ensemble View)
$x_{L} = x_{0} + \sum_{i = 0}^{L - 1} F(x_{i}, \{W_{i}\})$

[!QUOTE] ResNet as Discrete ODE (Forward Euler)
$h_{t + 1} = h_{t} + f_{θ} (h_{t})$

[!QUOTE] Continuous Limit → [[Neural ODE]]
$\frac{d h (t)}{d t} = f_{θ} (h (t), t)$

[!QUOTE] Bottleneck Block Dimensions
$C_{in} \overset{1 \times 1}{\to} \frac{C_{in}}{4} \overset{3 \times 3}{\to} \frac{C_{in}}{4} \overset{1 \times 1}{\to} C_{in}$

10. Summary

Aspect	Description
Core idea	Learn residual $F (x)$ rather than full mapping $H (x)$
Key mechanism	Skip connections create gradient highways
Breakthrough	Made networks with 100+ layers (up to 152 on ImageNet) trainable without degradation
Impact	Backbone for U-Net (diffusion), Transformer (LLM), Neural ODE
Variants	Pre-activation, Bottleneck, Wide ResNet, ResNeXt, DenseNet
Role in diffusion	Building block of diffusion U-Net encoders/decoders
Continuous limit	[[Neural ODE]] (forward Euler discretization → continuous ODE)

Related Concepts

[[Neural ODE]]
[[Diffusion Model]]
[[U-Net]]
[[Convolutional Neural Network (CNN)]]
[[Vision Transformer (ViT)]]
[[Batch Normalization]]
[[DenseNet]]
[[Transformer]]
[[Welford]]

Dataview Query

LIST
FROM #resnet OR #skip_connection OR #residual_learning
SORT file.ctime DESC

References

Paper: Deep Residual Learning for Image Recognition (He et al., CVPR 2016 — Best Paper)
Paper: Identity Mappings in Deep Residual Networks (He et al., ECCV 2016 — Pre-activation ResNet)
Paper: Aggregated Residual Transformations for Deep Neural Networks (Xie et al., 2017 — ResNeXt)
Paper: Wide Residual Networks (Zagoruyko & Komodakis, 2016)
Paper: Neural Ordinary Differential Equations (Chen et al., NeurIPS 2018 — Best Paper)
Paper: Denoising Diffusion Probabilistic Models (Ho et al., 2020 — DDPM, U-Net backbone)
Blog: Understanding ResNet and its Variants — Towards Data Science
Blog: The Annotated ResNet — Aman Arora
Course: CS231n Convolutional Neural Networks for Visual Recognition (Stanford)
Course: CS236 Deep Generative Models (Stanford)
Code: https://github.com/pytorch/vision/blob/main/torchvision/models/resnet.py

ChungMG
