ResNet (Residual Network)
ResNet is a deep convolutional neural network architecture that introduces skip connections (residual connections) to enable training of very deep networks. It won the ILSVRC 2015 (ImageNet) classification competition and became a standard backbone for vision models; residual blocks are also the building blocks of the U-Net used in [[Diffusion Model|diffusion models]].
1. Core Concept: Residual Learning
1.1 The Degradation Problem
Before ResNet, deeper networks performed worse than shallower ones — not due to overfitting, but because optimization became harder as depth increased:
| Network Depth | Training Error | Test Error |
|---|---|---|
| 20 layers | Lower | Lower |
| 56 layers | Higher | Higher |
This counterintuitive phenomenon is called the degradation problem: adding more layers increases training error even though the deeper network can theoretically represent the shallower one (by setting extra layers to identity).
[!NOTE] Root Cause
Vanishing/exploding gradients compound across many layers, but normalized initialization and Batch Normalization largely address them, and degradation still appears. The remaining issue is that very deep networks struggle to learn identity mappings, which are surprisingly difficult for stacked nonlinear layers to represent.
1.2 The Residual Solution
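Instead of learning a direct mapping $H(x)$, each block learns a residual function $F(x) = H(x) - x$ and adds the input back:

$$y = F(x, \{W_i\}) + x$$

where $x$ is the block input, $F(x, \{W_i\})$ is the output of the stacked layers with weights $W_i$, and the added $x$ is the identity skip connection.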
[!NOTE] Why This Works
If the optimal mapping is close to identity ($H(x) \approx x$), the residual $F(x) \approx 0$ is much easier to learn than the full mapping. The skip connection provides a gradient highway that allows error signals to flow directly backward without attenuation.
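The near-identity argument can be illustrated with a minimal sketch in plain Python. The two branch functions (`zero_residual`, `small_residual`) are hypothetical stand-ins for the learned layers, not part of any ResNet implementation; the point is only that a branch outputting zero turns the whole block into the identity.

```python
from typing import Callable, List

Vector = List[float]


def residual_block(x: Vector, residual_fn: Callable[[Vector], Vector]) -> Vector:
    """Return x + F(x): the skip path adds the input back onto the branch output."""
    fx = residual_fn(x)
    return [xi + fi for xi, fi in zip(x, fx)]


def zero_residual(x: Vector) -> Vector:
    """Hypothetical branch whose weights are all zero, so F(x) = 0."""
    return [0.0 for _ in x]


def small_residual(x: Vector) -> Vector:
    """Hypothetical branch that learned a small correction, F(x) = 0.01 * x."""
    return [0.01 * xi for xi in x]


x = [1.0, -2.0, 3.0]
identity_out = residual_block(x, zero_residual)   # block reduces to the identity
nudged_out = residual_block(x, small_residual)    # identity plus a small correction
```

A plain stack would have to learn the identity through its nonlinear layers; here it falls out of driving the branch toward zero.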
1.3 Mathematical Intuition
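Standard network (hard to optimize):

$$x_{l+1} = H_l(x_l)$$

Residual network (easier to optimize):

$$x_{l+1} = x_l + F_l(x_l)$$

Gradient flow through skip connection:

$$\frac{\partial \mathcal{L}}{\partial x_l} = \frac{\partial \mathcal{L}}{\partial x_{l+1}} \left(1 + \frac{\partial F_l}{\partial x_l}\right)$$

The additive identity term $1$ passes the upstream gradient through unchanged, so the signal does not vanish even when $\partial F_l / \partial x_l$ is small.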
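A rough numerical illustration of the difference, assuming a scalar network in which every layer has the same small derivative (a deliberate simplification; real Jacobians are matrices and vary by layer). The function names are illustrative only.

```python
import math


def plain_gradient_scale(layer_derivatives):
    """Chain rule for a plain stack: product of dH_l/dx_l over all layers."""
    return math.prod(layer_derivatives)


def residual_gradient_scale(branch_derivatives):
    """Chain rule for a residual stack: product of (1 + dF_l/dx_l) over all blocks."""
    return math.prod(1.0 + d for d in branch_derivatives)


depth = 50
small = [0.01] * depth  # assumed per-layer derivative magnitude

plain = plain_gradient_scale(small)        # 0.01 ** 50, vanishingly small
residual = residual_gradient_scale(small)  # 1.01 ** 50, stays of order 1
```

With the same per-layer derivative, the plain product collapses toward zero while the residual product stays near one, which is the effect the identity term is meant to provide.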
2. Architecture Design
2.1 Basic Residual Block
The fundamental building block (used in ResNet-18/34):
```python
class BasicBlock(nn.Module):
    ...
```
Structure:
```text
Input → [3×3 Conv → BN → ReLU → 3×3 Conv → BN] → + Input → ReLU → Output
```
2.2 Bottleneck Block
Used in deeper variants (ResNet-50/101/152) to reduce computation:
```python
class Bottleneck(nn.Module):
    ...
```
Bottleneck structure:
```text
256-d → [1×1, 64] → ReLU → [3×3, 64] → ReLU → [1×1, 256] → + skip → ReLU
```
| Block Type | Convolutions | Parameter Count | Used In |
|---|---|---|---|
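| Basic | 3×3, 3×3 | $18C^2$ (conv weights, $C$ channels) | ResNet-18, 34 |
| Bottleneck | 1×1, 3×3, 1×1 | $\tfrac{17}{16}C^2$ (conv weights, outer width $C$, inner width $C/4$) | ResNet-50, 101, 152 |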
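The computational saving of the bottleneck can be checked by counting convolution weights directly. This is a small sketch that ignores biases and BatchNorm parameters; the helper names are illustrative and not taken from any library.

```python
def conv_params(kernel: int, in_channels: int, out_channels: int) -> int:
    """Weights in a kernel x kernel convolution (no bias, no BatchNorm)."""
    return kernel * kernel * in_channels * out_channels


def basic_block_params(channels: int) -> int:
    """Two 3x3 convolutions at constant width."""
    return 2 * conv_params(3, channels, channels)


def bottleneck_params(outer: int, inner: int) -> int:
    """1x1 reduce, 3x3 at the reduced width, 1x1 expand."""
    return (
        conv_params(1, outer, inner)
        + conv_params(3, inner, inner)
        + conv_params(1, inner, outer)
    )


basic_64 = basic_block_params(64)            # 73,728
basic_256 = basic_block_params(256)          # 1,179,648
bottleneck_256 = bottleneck_params(256, 64)  # 69,632
```

A 256-channel bottleneck therefore costs about as many weights as a 64-channel basic block, and roughly 17 times fewer than a basic block operating directly on 256 channels.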
2.3 Full Architecture Zoo
| Model | Layers | Blocks per Stage | Block Type | Parameters |
|---|---|---|---|---|
| ResNet-18 | 18 | [2, 2, 2, 2] | Basic | 11.7M |
| ResNet-34 | 34 | [3, 4, 6, 3] | Basic | 21.8M |
| ResNet-50 | 50 | [3, 4, 6, 3] | Bottleneck | 25.6M |
| ResNet-101 | 101 | [3, 4, 23, 3] | Bottleneck | 44.5M |
| ResNet-152 | 152 | [3, 8, 36, 3] | Bottleneck | 60.2M |
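The layer counts in the model names follow from the blocks-per-stage configuration. A short sketch, assuming the usual counting convention (one stem convolution, the convolutions inside each block, and one final fully connected layer; projection shortcuts are not counted):

```python
CONVS_PER_BLOCK = {"basic": 2, "bottleneck": 3}


def count_layers(blocks_per_stage, block_type):
    """Stem conv + convs inside all residual blocks + final fully connected layer."""
    return 1 + sum(blocks_per_stage) * CONVS_PER_BLOCK[block_type] + 1


configs = {
    "ResNet-18": ([2, 2, 2, 2], "basic"),
    "ResNet-34": ([3, 4, 6, 3], "basic"),
    "ResNet-50": ([3, 4, 6, 3], "bottleneck"),
    "ResNet-101": ([3, 4, 23, 3], "bottleneck"),
    "ResNet-152": ([3, 8, 36, 3], "bottleneck"),
}

depths = {name: count_layers(blocks, kind) for name, (blocks, kind) in configs.items()}
```

ResNet-34 and ResNet-50 share the same stage configuration; the extra depth comes entirely from swapping two-convolution basic blocks for three-convolution bottlenecks.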
Overall architecture flow:
```text
Input (224×224×3)
```
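As a hedged sketch of how the spatial resolution evolves from that input, the standard convolution output-size formula can be applied stage by stage. The kernel, stride, and padding values below are assumptions that follow the common torchvision-style layout (7×7 stride-2 stem, 3×3 stride-2 max pool, stride-2 downsampling at the start of stages 2 to 4).

```python
def conv_out_size(size: int, kernel: int, stride: int, padding: int) -> int:
    """Output side length of a convolution or pooling layer."""
    return (size + 2 * padding - kernel) // stride + 1


def trace_spatial_sizes(input_size: int = 224):
    """Return (stage name, feature-map side length) pairs through the network."""
    trace = []
    size = conv_out_size(input_size, kernel=7, stride=2, padding=3)
    trace.append(("stem conv 7x7/2", size))
    size = conv_out_size(size, kernel=3, stride=2, padding=1)
    trace.append(("max pool 3x3/2", size))
    trace.append(("stage 1", size))  # stage 1 keeps the resolution
    for stage in (2, 3, 4):
        size = conv_out_size(size, kernel=3, stride=2, padding=1)
        trace.append((f"stage {stage}", size))
    return trace


sizes = trace_spatial_sizes(224)
```

The final 7×7 feature map is then reduced by global average pooling before the classifier.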
3. Why ResNet Works: Theoretical Analysis
3.1 Gradient Highway
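The skip connection creates an uninterrupted gradient path. Unrolling $x_{l+1} = x_l + F_l(x_l)$ from block $l$ to a deeper block $L$ gives $x_L = x_l + \sum_{i=l}^{L-1} F_i(x_i)$, so backpropagation through the residual blocks yields:

$$\frac{\partial \mathcal{L}}{\partial x_l} = \frac{\partial \mathcal{L}}{\partial x_L} \left(1 + \frac{\partial}{\partial x_l} \sum_{i=l}^{L-1} F_i(x_i)\right)$$

The term $\frac{\partial \mathcal{L}}{\partial x_L}$ reaches block $l$ directly through the identity path, without passing through any weight layer, so the gradient is unlikely to vanish even when the residual-branch term is small.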
3.2 Unrolled View as Ensemble
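A ResNet with $n$ residual blocks can be unrolled into $2^n$ implicit paths from input to output. For $n = 2$:

$$x_2 = x_1 + F_1(x_1) = x_0 + F_0(x_0) + F_1\big(x_0 + F_0(x_0)\big)$$

Each path corresponds to including or skipping each residual block, which behaves like an ensemble of shallower networks that share parameters. Removing any single block during inference causes only a mild performance drop (≈0.3% accuracy loss), which supports this ensemble interpretation.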
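The path count can be made concrete by enumerating every include/skip choice. This sketch only counts paths by length (the number of residual branches traversed); it does not model the networks themselves.

```python
from itertools import product
from math import comb


def path_length_histogram(num_blocks: int):
    """histogram[k] = number of paths that pass through exactly k residual branches."""
    histogram = [0] * (num_blocks + 1)
    for choice in product((0, 1), repeat=num_blocks):  # 1 = take branch, 0 = skip
        histogram[sum(choice)] += 1
    return histogram


hist = path_length_histogram(4)  # [1, 4, 6, 4, 1]
total_paths = sum(hist)          # 2 ** 4 = 16
```

The histogram is binomial, so most paths are of intermediate length and relatively few traverse every block.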
3.3 Identity Mapping Importance
The pre-activation design (He et al., 2016) showed that keeping the skip connection a pure identity mapping (no ReLU after the addition) is critical:
| Design | Residual block layout | CIFAR-10 test error |
|---|---|---|
| Original | Conv → BN → ReLU → Conv → BN → +x → ReLU | 6.43% (ResNet-110) |
| Pre-activation | BN → ReLU → Conv → BN → ReLU → Conv → +x | 4.62% (ResNet-1001) |
[!NOTE] Pre-Activation Insight
Moving BatchNorm and ReLU before the convolution (instead of after) makes the information flow through the skip connection completely clean, enabling even 1000+ layer networks to be trained effectively.
3.4 Pre-activation ResNet Code
```python
class PreActBlock(nn.Module):
    ...
```
4. Connection to [[Neural ODE]]
4.1 ResNet as Discrete ODE
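A ResNet block with step size $\Delta t = 1$ is one forward Euler step of an ODE:

$$h_{t+1} = h_t + \Delta t \, f(h_t, \theta_t)$$

In the limit $\Delta t \to 0$, the discrete update becomes a continuous-depth ODE:

$$\frac{dh(t)}{dt} = f(h(t), t, \theta)$$

| Aspect | ResNet | [[Neural ODE]] |
|---|---|---|
| Layer index | Discrete $t \in \{0, 1, \dots, L\}$ | Continuous $t \in [0, T]$ |
| Depth | Fixed $L$ | ODE solver adapts steps |
| Step size | $\Delta t = 1$ | Adaptive, solver-determined |
| Parameters | Per-layer or shared | Naturally shared across “depth” |
| Memory (backprop) | $O(L)$ | $O(1)$ with the adjoint method |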
4.2 The Continuum Bridge
ResNet provides the discrete scaffold from which [[Neural ODE]] emerged as the continuous limit. This connection means:
- ResNet training insights (initialization, normalization) transfer to Neural ODEs
- Neural ODEs can be discretized back into ResNet-like architectures for deployment
- ResNet’s skip connections are the discrete analog of numerical ODE integrators (specifically, the forward Euler method)
```text
ResNet: h_{t+1} = h_t + f(h_t)   ← Forward Euler (ODE discretization)
```
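The forward Euler reading can be checked on a toy problem. The sketch below treats a stack of identical residual updates as an Euler integrator for $dh/dt = -h$, whose exact solution at $t = 1$ is $e^{-1}$. The dynamics function `decay` and the weight sharing across blocks are assumptions made for illustration.

```python
import math


def residual_stack(h0: float, f, num_blocks: int, total_time: float = 1.0) -> float:
    """Apply h <- h + dt * f(h) num_blocks times, i.e. forward Euler with dt = T / N."""
    dt = total_time / num_blocks
    h = h0
    for _ in range(num_blocks):
        h = h + dt * f(h)  # one residual block = one Euler step
    return h


def decay(h: float) -> float:
    """Toy dynamics dh/dt = -h (hypothetical stand-in for a learned residual branch)."""
    return -h


exact = math.exp(-1.0)
coarse = residual_stack(1.0, decay, num_blocks=4)     # shallow network, large steps
fine = residual_stack(1.0, decay, num_blocks=1000)    # deep network, small steps
```

Increasing the number of blocks shrinks the step size and moves the output toward the continuous solution, which is the sense in which a Neural ODE is the infinite-depth limit.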
5. ResNet in [[Diffusion Model|Diffusion Models]]
5.1 U-Net Backbone
Modern diffusion models (DDPM, Stable Diffusion) use a U-Net architecture whose encoder and decoder are built from ResNet blocks:
```text
U-Net for Diffusion
```
Each ResBlock in the U-Net processes:
- Time embedding: Injects diffusion timestep $t$ via scaling/shifting
- Skip connection: Preserves fine-grained spatial information
- Group Normalization: Replaces BatchNorm (its statistics do not depend on batch size, so it behaves well with small or varying batches)
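How the timestep conditioning and the skip connection combine can be sketched on plain per-channel values. The `(1 + scale)` convention and both function names are assumptions chosen for illustration; actual diffusion codebases differ in where and how the embedding is injected (some add only a shift).

```python
from typing import List


def modulate(features: List[float], scale: List[float], shift: List[float]) -> List[float]:
    """Per-channel scale and shift, as would be projected from the timestep embedding."""
    return [h * (1.0 + s) + b for h, s, b in zip(features, scale, shift)]


def diffusion_res_step(x, branch_out, scale, shift):
    """Skip connection around a time-conditioned branch: x + modulate(F(x))."""
    conditioned = modulate(branch_out, scale, shift)
    return [xi + ci for xi, ci in zip(x, conditioned)]


x = [1.0, 2.0]             # block input (one value per channel)
branch_out = [0.5, -1.0]   # hypothetical output of the conv branch
scale = [0.0, 1.0]         # hypothetical projection of the time embedding
shift = [0.1, 0.0]

out = diffusion_res_step(x, branch_out, scale, shift)
```

With zero scale and shift the block reduces to an ordinary residual update, so the timestep only modulates the correction and never the identity path.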
5.2 Diffusion ResBlock Code
```python
class DiffusionResBlock(nn.Module):
    ...
```
6. Beyond ResNet: Architecture Family Tree
6.1 Direct Descendants
| Architecture | Innovation | Relationship to ResNet |
|---|---|---|
| ResNeXt | Grouped convolutions (“cardinality”) | Multi-branch ResNet |
| DenseNet | Dense skip connections (all previous layers) | Extreme skip connection variant |
| Wide ResNet | Wider layers, reduced depth | Trades depth for width |
| SE-ResNet | Squeeze-and-Excitation attention | Adds channel attention to ResBlocks |
| ResNeSt | Split-attention blocks | Combines ResNeXt + SE-Net |
6.2 Conceptual Influences
| Architecture | Relationship to ResNet |
|---|---|
| U-Net | Skip connections between encoder and decoder (same spirit) |
| [[Neural ODE]] | Continuous limit of discrete ResNet layers |
| Transformer | Residual connections in every attention block |
| Highway Networks | Learned gating on skip connections (predecessor, 2015) |
| FPN (Feature Pyramid) | Lateral skip connections for multi-scale features |
6.3 The Residual Principle in Modern ML
```text
ResNet (2015) ────→ U-Net (2015) ────→ Diffusion U-Net (2020+)
```
7. Practical Training Insights
7.1 Key Hyperparameters
| Parameter | ResNet-50 (ImageNet) | Notes |
|---|---|---|
| Initialization | He (Kaiming) Normal | Designed for ReLU |
| BatchNorm momentum | 0.1 | Standard |
| Weight decay | 1e-4 | L2 regularization |
| Learning rate | 0.1 → divided by 10 at 30, 60, 80 epochs | Step decay |
| Batch size | 256 | 8× V100 GPUs |
| Epochs | 90 | Standard ImageNet schedule |
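The step-decay schedule in the table can be written as a small function. This is a sketch of the schedule only (not a training loop); the milestone semantics assumed here are that the drop takes effect at the start of the listed epoch, with epochs counted from zero.

```python
from bisect import bisect_right


def step_lr(epoch: int, base_lr: float = 0.1, milestones=(30, 60, 80), gamma: float = 0.1) -> float:
    """Learning rate after dividing by 10 at each milestone already reached."""
    drops = bisect_right(milestones, epoch)  # number of milestones <= epoch
    return base_lr * gamma ** drops


schedule = {epoch: step_lr(epoch) for epoch in (0, 29, 30, 60, 80, 89)}
```

Under these assumptions the rate is 0.1 for epochs 0 to 29, 0.01 for 30 to 59, 0.001 for 60 to 79, and 0.0001 for the final ten epochs.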
7.2 Initialization Matters
ResNet uses Kaiming initialization specifically designed for layers followed by ReLU:
```python
def kaiming_init(m):
    ...
```
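The scale that Kaiming normal initialization prescribes can be computed by hand: weights are drawn with standard deviation $\sqrt{2 / \text{fan}}$. The sketch below uses the fan-in of a 3×3 convolution with 64 input channels; whether fan-in or fan-out is used varies between implementations, so that choice is an assumption here (for a 64 → 64 convolution the two coincide).

```python
import math
import random


def kaiming_normal_std(kernel: int, channels: int) -> float:
    """Standard deviation sqrt(2 / fan) for a kernel x kernel conv feeding a ReLU."""
    fan = kernel * kernel * channels
    return math.sqrt(2.0 / fan)


def sample_weights(count: int, std: float, seed: int = 0):
    """Draw zero-mean Gaussian weights with the given standard deviation."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, std) for _ in range(count)]


std = kaiming_normal_std(kernel=3, channels=64)  # sqrt(2 / 576), about 0.0589
weights = sample_weights(20000, std)
empirical_std = math.sqrt(sum(w * w for w in weights) / len(weights))
```

The factor of 2 compensates for ReLU zeroing roughly half of the activations, which keeps the activation variance approximately constant from layer to layer.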
7.3 Batch Normalization After Addition
In the original ResNet design, BatchNorm is applied before the skip connection addition:
```text
Wrong (common mistake):
```
BN after the addition destroys the clean identity path, which is one of the observations behind pre-activation ResNet.
8. Theoretical Properties
8.1 Loss Landscape Smoothing
Residual connections smooth the loss landscape, making optimization easier:
- Without skip connections: Loss surface has sharp local minima and chaotic curvature
- With skip connections: Loss surface becomes smoother and more convex-like
- This helps explain why ResNets are easier to optimize and tend to reach better minima
8.2 Shattered Gradients Problem
In very deep plain networks, gradients resemble white noise: the correlation between gradients at nearby inputs decays exponentially with depth. Residual connections slow this decay substantially, preserving gradient structure that optimization can exploit across the full network depth.
8.3 Depth Efficiency
| Depth | Plain Network Accuracy | ResNet Accuracy |
|---|---|---|
| 20 layers | 91.25% | 91.75% |
| 56 layers | 90.10% (degradation!) | 93.03% |
| 110 layers | Untrainable | 93.57% |
This proved that depth is still valuable — it just requires the right architecture to unlock it.
9. Core Formula Cards
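[!QUOTE] Residual Block
$$y = F(x, \{W_i\}) + x$$
[!QUOTE] Gradient Flow with Skip Connection
$$\frac{\partial \mathcal{L}}{\partial x_l} = \frac{\partial \mathcal{L}}{\partial x_L} \left(1 + \frac{\partial}{\partial x_l} \sum_{i=l}^{L-1} F_i(x_i)\right)$$
[!QUOTE] Unrolled ResNet (Ensemble View)
$$x_L = x_0 + \sum_{i=0}^{L-1} F_i(x_i)$$
[!QUOTE] ResNet as Discrete ODE (Forward Euler)
$$h_{t+1} = h_t + \Delta t \, f(h_t, \theta_t)$$
[!QUOTE] Continuous Limit → [[Neural ODE]]
$$\frac{dh(t)}{dt} = f(h(t), t, \theta)$$
[!QUOTE] Bottleneck Block Dimensions
$$256 \xrightarrow{1 \times 1} 64 \xrightarrow{3 \times 3} 64 \xrightarrow{1 \times 1} 256$$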
10. Summary
| Aspect | Description |
|---|---|
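| Core idea | Learn the residual $F(x) = H(x) - x$ instead of the direct mapping $H(x)$ |
| Key mechanism | Skip connections create gradient highways |
| Breakthrough | Made 100+ layer networks (up to 152 layers on ImageNet) practical to train with state-of-the-art accuracy |
| Impact | Residual connections underpin diffusion U-Nets, Transformers (LLMs), and Neural ODEs |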
| Variants | Pre-activation, Bottleneck, Wide ResNet, ResNeXt, DenseNet |
| Role in diffusion | Building block of diffusion U-Net encoders/decoders |
| Continuous limit | [[Neural ODE]] (forward Euler discretization → continuous ODE) |
Related Concepts
- [[Neural ODE]]
- [[Diffusion Model]]
- [[U-Net]]
- [[Convolutional Neural Network (CNN)]]
- [[Vision Transformer (ViT)]]
- [[Batch Normalization]]
- [[DenseNet]]
- [[Transformer]]
- [[Welford]]
References
- Paper: Deep Residual Learning for Image Recognition (He et al., CVPR 2016 — Best Paper)
- Paper: Identity Mappings in Deep Residual Networks (He et al., ECCV 2016 — Pre-activation ResNet)
- Paper: Aggregated Residual Transformations for Deep Neural Networks (Xie et al., 2017 — ResNeXt)
- Paper: Wide Residual Networks (Zagoruyko & Komodakis, 2016)
- Paper: Neural Ordinary Differential Equations (Chen et al., NeurIPS 2018 — Best Paper)
- Paper: Denoising Diffusion Probabilistic Models (Ho et al., 2020 — DDPM, U-Net backbone)
- Blog: Understanding ResNet and its Variants — Towards Data Science
- Blog: The Annotated ResNet — Aman Arora
- Course: CS231n Convolutional Neural Networks for Visual Recognition (Stanford)
- Course: CS236 Deep Generative Models (Stanford)
- Code: https://github.com/pytorch/vision/blob/main/torchvision/models/resnet.py