U-Net

U-Net is a fully convolutional encoder-decoder architecture with symmetric skip connections, originally designed for biomedical image segmentation. It has since become the de facto backbone for [[Diffusion Model|diffusion models]] (DDPM, Stable Diffusion), where it serves as the noise prediction network $\epsilon_\theta(x_t, t)$.


1. Core Concept

1.1 The U-Shaped Design

U-Net gets its name from its characteristic U-shaped architecture diagram:

U-Net Architecture (Original, 2015)
═══════════════════════════════════════════════════════════════════
Encoder (Contracting Path)                Decoder (Expanding Path)

Input → [Conv×2] ──────────────────────────────→ [Conv×2] → Output
            ↓ MaxPool                              ↑ UpConv
         [Conv×2] ────────────────────────────→ [Conv×2]
             ↓ MaxPool                           ↑ UpConv
          [Conv×2] ──────────────────────────→ [Conv×2]
              ↓ MaxPool                         ↑ UpConv
           [Conv×2] ────────────────────────→ [Conv×2]
               ↓ MaxPool                       ↑ UpConv
               └────────── [Conv×2] (Bottleneck) ──────┘
═══════════════════════════════════════════════════════════════════

Each horizontal arrow (───→) represents a skip connection that concatenates the encoder features onto the decoder features at the same resolution level.

1.2 Key Design Principles

| Principle | Description | Benefit |
|---|---|---|
| Symmetric Encoder-Decoder | Mirror structure: downsampling path + upsampling path | Multi-scale feature extraction |
| Skip Connections | Direct concatenation of encoder features to decoder | Preserve fine spatial details lost during downsampling |
| Fully Convolutional | No fully connected layers | Arbitrary input sizes |
| Multi-scale Processing | Features at 4-5 resolution levels | Capture both local texture and global structure |

1.3 Why “U”?

The architecture compresses spatial resolution while expanding channel depth (encoder), then reverses the process (decoder), with skip connections bridging same-resolution levels — forming a U-shaped information flow:

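$$\underbrace{H \times W \times C}_{\text{Input}} \;\to\; \underbrace{\tfrac{H}{2} \times \tfrac{W}{2} \times 2C}_{\text{Level 2}} \;\to\; \underbrace{\tfrac{H}{4} \times \tfrac{W}{4} \times 4C}_{\text{Level 3}} \;\to\; \underbrace{\tfrac{H}{8} \times \tfrac{W}{8} \times 8C}_{\text{Bottleneck}} \;\to\; \cdots \;\to\; \underbrace{H \times W \times C_{out}}_{\text{Output}}$$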
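
The shape progression can be traced with a few lines of plain Python. This is a minimal sketch that assumes every downsampling step halves the height and width and doubles the channel count; `unet_shapes` is a hypothetical helper written for this note, not a library function.

```python
def unet_shapes(h, w, c, num_down=3):
    """Return the (H, W, C) shape at each encoder level of a U-Net.

    Assumption: each downsampling step halves H and W and doubles C.
    """
    shapes = [(h, w, c)]
    for _ in range(num_down):
        if h % 2 or w % 2:
            raise ValueError(f"{h}x{w} cannot be halved exactly")
        h, w, c = h // 2, w // 2, c * 2
        shapes.append((h, w, c))
    return shapes


encoder = unet_shapes(256, 256, 64)
# The decoder revisits the same spatial sizes in reverse order,
# which is what lets each skip connection line up with its decoder level.
decoder_sizes = [(h, w) for h, w, _ in reversed(encoder)]
```

With a 256×256×64 input the last encoder entry is the bottleneck shape, and the first decoder size matches it.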

2. Original U-Net (Ronneberger et al., 2015)

2.1 Original Design

The original U-Net was proposed for biomedical image segmentation (cell tracking, organ segmentation):

import torch
import torch.nn as nn
import torch.nn.functional as F


class OriginalUNet(nn.Module):
    """U-Net for biomedical segmentation, after Ronneberger et al. (2015).

    The paper used unpadded convolutions and no normalization; the padded
    convolutions and BatchNorm used here follow common reimplementations.
    """

    def __init__(self, in_channels=1, out_channels=2, features=(64, 128, 256, 512)):
        super().__init__()
        features = list(features)  # tuple default avoids a shared mutable argument

        # Encoder (Contracting Path)
        self.encoders = nn.ModuleList()
        for i, feat in enumerate(features):
            in_ch = in_channels if i == 0 else features[i - 1]
            self.encoders.append(self._double_conv(in_ch, feat))

        self.pool = nn.MaxPool2d(kernel_size=2, stride=2)

        # Bottleneck
        self.bottleneck = self._double_conv(features[-1], features[-1] * 2)

        # Decoder (Expanding Path)
        self.decoders = nn.ModuleList()
        self.upconvs = nn.ModuleList()
        for feat in reversed(features):
            self.upconvs.append(
                nn.ConvTranspose2d(feat * 2, feat, kernel_size=2, stride=2)
            )
            # After concatenation: feat (encoder) + feat (upconv) = 2*feat
            self.decoders.append(self._double_conv(feat * 2, feat))

        self.final_conv = nn.Conv2d(features[0], out_channels, kernel_size=1)

    def _double_conv(self, in_ch, out_ch):
        # Two 3x3 convolutions, each followed by BatchNorm and ReLU
        return nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        # Encoder: keep the pre-pooling features of every level for the skips
        skip_connections = []
        for encoder in self.encoders:
            x = encoder(x)
            skip_connections.append(x)
            x = self.pool(x)

        # Bottleneck
        x = self.bottleneck(x)

        # Decoder with skip connections (deepest level first)
        skip_connections = skip_connections[::-1]
        for i, (upconv, decoder) in enumerate(zip(self.upconvs, self.decoders)):
            x = upconv(x)
            skip = skip_connections[i]
            # Odd input sizes leave x slightly smaller than the skip after
            # pooling; resize x to match (the paper cropped the skip instead)
            if x.shape[2:] != skip.shape[2:]:
                x = F.interpolate(x, size=skip.shape[2:])
            # Concatenate the encoder features along the channel axis
            x = torch.cat([skip, x], dim=1)
            x = decoder(x)

        return self.final_conv(x)

2.2 Skip Connection Mechanics

The skip connection concatenates (not adds) encoder features directly to decoder features:

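$$h_{dec}^{(l)} = \text{Conv}\left(\left[h_{enc}^{(l)} \,\Vert\, \text{UpConv}\left(h_{dec}^{(l-1)}\right)\right]\right)$$

where $\Vert$ denotes channel-wise concatenation and $h_{dec}^{(l-1)}$ is the output of the previous (coarser) decoder stage. This is different from [[ResNet]]'s additive skip connection: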

| Aspect | U-Net Skip | ResNet Skip |
|---|---|---|
| Operation | Concatenation | Addition |
| Channel change | Doubles channels (encoder + upconv) | Preserves channels (identity) |
| Purpose | Restore spatial details | Ease gradient flow |
| Structure | Cross-resolution (encoder → decoder) | Same-resolution (input → output) |
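
The channel doubling caused by concatenation is easy to check by hand. The sketch below does the bookkeeping for the decoder of Section 2.1, assuming the same `features` list and that each up-convolution halves the channel count; `decoder_channels` is a hypothetical helper used only for illustration.

```python
def decoder_channels(features):
    """For each decoder level (deepest first), return a tuple
    (channels after upconv, channels after concat, channels after double conv).
    """
    plan = []
    for feat in reversed(features):
        up_out = feat             # transposed conv maps 2*feat -> feat
        concat = feat + up_out    # encoder skip (feat) joined with upsampled features
        plan.append((up_out, concat, feat))
    return plan


plan = decoder_channels([64, 128, 256, 512])
# An additive (ResNet-style) skip would keep `feat` channels at the merge
# point, so the convolution after it would see half as many input channels.
```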

2.3 Training Strategy (Original Paper)

The original U-Net used several key training techniques:

| Technique | Description |
|---|---|
| Overlap-tile strategy | Predict segmentation in tiles with overlap to handle large images |
| Elastic deformations | Data augmentation via random elastic transformations |
| Weighted loss | Higher weight on separation borders between touching objects |
| Weight map | Pre-computed pixel-wise weight map emphasizing boundary pixels |

Loss function (weighted cross-entropy):

$$E = -\sum_{x \in \Omega} w(x) \log\left(p_{\ell(x)}(x)\right)$$

where w(x) is the weight map emphasizing borders between cells:

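$$w(x) = w_c(x) + w_0 \cdot \exp\left(-\frac{\left(d_1(x) + d_2(x)\right)^2}{2\sigma^2}\right)$$

Here $w_c$ is a class-balancing weight, $d_1(x)$ is the distance from pixel $x$ to the border of the nearest cell, and $d_2(x)$ is the distance to the border of the second-nearest cell.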
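
To get a feel for the weight map, the sketch below evaluates it for single pixels and plugs the result into the weighted cross-entropy. It is plain Python rather than a vectorized implementation; the defaults `w0 = 10` and `sigma = 5` are the values reported in the original paper, while the function names and the sample inputs are assumptions made for illustration.

```python
import math


def border_weight(w_c, d1, d2, w0=10.0, sigma=5.0):
    """Weight of one pixel: class weight plus a bonus near cell borders.

    d1, d2: distances (in pixels) to the nearest and second-nearest cell.
    """
    return w_c + w0 * math.exp(-((d1 + d2) ** 2) / (2 * sigma ** 2))


def weighted_cross_entropy(true_class_probs, weights):
    """Negative weighted log-likelihood summed over pixels.

    true_class_probs[i] is the softmax probability the model assigns
    to the correct label of pixel i.
    """
    return -sum(w * math.log(p) for p, w in zip(true_class_probs, weights))


# A pixel squeezed between two touching cells versus one far from any border
w_gap = border_weight(1.0, d1=1.0, d2=1.0)
w_far = border_weight(1.0, d1=50.0, d2=50.0)
loss = weighted_cross_entropy([0.5, 0.5], [w_gap, w_far])
```

The same prediction error costs far more on the gap pixel than on the distant one, which is what pushes the network to learn thin separating borders.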

3. U-Net in Diffusion Models

3.1 Why U-Net for Diffusion?

Diffusion models need a network $\epsilon_\theta(x_t, t)$ that:

  1. Preserves spatial resolution (input and output have same shape)
  2. Captures multi-scale features (noise patterns exist at all scales)
  3. Incorporates time conditioning (different denoising behavior at each $t$)
  4. Handles additional conditioning (text, class labels, images)

U-Net meets the first two requirements by construction. The other two are met by the extensions described below: time embeddings injected into every block, and attention layers for external conditioning.

3.2 Diffusion U-Net Architecture

Modern diffusion U-Nets extend the original design with residual blocks, a timestep embedding, and self-attention:

Diffusion U-Net (DDPM / Stable Diffusion)
═══════════════════════════════════════════════════════
Input (noisy image x_t + timestep t)

  ├── Time Embedding: Sinusoidal → MLP → [emb_dim]

Encoder:
  ResBlock × 2 → Downsample ──────────────────────┐
  ResBlock × 2 → Downsample ──────────────┐       │
  ResBlock × 2 → Downsample ────┐         │       │
  ResBlock × 2                  │         │       │
                             ┌──┘         │       │
Middle:                      │            │       │
  ResBlock + Self-Attention  │            │       │
                             └──┐         │       │
Decoder:                        │         │       │
  ResBlock × 2 ← Concat ←───────┘         │       │
  ResBlock × 2 ← Concat ←─────────────────┘       │
  ResBlock × 2 ← Concat ←─────────────────────────┘
  ResBlock × 2

Output (predicted noise ε_θ)
═══════════════════════════════════════════════════════

3.3 Key Components

Time Embedding

import math

import torch
import torch.nn as nn


class SinusoidalTimeEmbedding(nn.Module):
    """Sinusoidal position encoding for diffusion timesteps."""

    def __init__(self, dim):
        super().__init__()
        self.dim = dim  # expected to be even and at least 4

    def forward(self, t):
        # t: (B,) integer timesteps
        half_dim = self.dim // 2
        # Geometric frequency ladder from 1 down to 1/10000
        emb = math.log(10000) / (half_dim - 1)
        emb = torch.exp(torch.arange(half_dim, device=t.device) * -emb)
        # Outer product: one row of phase angles per timestep
        emb = t[:, None].float() * emb[None, :]
        emb = torch.cat([torch.sin(emb), torch.cos(emb)], dim=-1)
        return emb  # (B, dim)
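
For readers who want to inspect the numbers, here is the same formula for a single timestep in plain Python. It is a sketch that mirrors the module above (sine half first, then cosine half) and assumes an even `dim` of at least 4; `sinusoidal_embedding` is a hypothetical name.

```python
import math


def sinusoidal_embedding(t, dim):
    """Return the dim-length sinusoidal embedding of one timestep t."""
    half = dim // 2
    step = math.log(10000) / (half - 1)
    # Frequencies fall geometrically from 1 to 1/10000
    freqs = [math.exp(-step * i) for i in range(half)]
    return [math.sin(t * f) for f in freqs] + [math.cos(t * f) for f in freqs]


emb_t0 = sinusoidal_embedding(0, 8)    # all sines are 0, all cosines are 1
emb_t1 = sinusoidal_embedding(1, 8)
```

Low-index entries oscillate quickly with `t` and high-index entries change slowly, so nearby timesteps get similar but distinguishable embeddings.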

ResBlock with Time Conditioning

import torch.nn as nn
import torch.nn.functional as F


class DiffusionResBlock(nn.Module):
    """ResNet block with time embedding injection."""

    def __init__(self, channels, emb_channels, out_channels=None,
                 dropout=0.0):
        super().__init__()
        out_channels = out_channels or channels

        # GroupNorm(32, ...) requires channel counts divisible by 32
        self.norm1 = nn.GroupNorm(32, channels)
        self.conv1 = nn.Conv2d(channels, out_channels, 3, padding=1)
        self.norm2 = nn.GroupNorm(32, out_channels)
        self.conv2 = nn.Conv2d(out_channels, out_channels, 3, padding=1)

        # Time embedding injection: predicts a scale and a shift per channel
        self.time_mlp = nn.Sequential(
            nn.SiLU(),
            nn.Linear(emb_channels, out_channels * 2),
        )

        self.dropout = nn.Dropout(dropout)
        # 1x1 conv on the residual path only when the channel count changes
        self.skip = nn.Conv2d(channels, out_channels, 1) \
            if channels != out_channels else nn.Identity()

    def forward(self, x, t_emb):
        # Time conditioning via scale-and-shift
        scale_shift = self.time_mlp(t_emb)[:, :, None, None]
        scale, shift = scale_shift.chunk(2, dim=1)

        h = self.norm1(x)
        h = F.silu(h)
        h = self.conv1(h)

        # Inject time after the second normalization, so that GroupNorm
        # does not renormalize the modulation away
        h = self.norm2(h) * (1 + scale) + shift
        h = F.silu(h)
        h = self.dropout(h)
        h = self.conv2(h)

        return h + self.skip(x)

Self-Attention Block

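import torch.nn as nn
import torch.nn.functional as F


class SelfAttention(nn.Module):
    """Multi-head self-attention for diffusion U-Net."""

    def __init__(self, channels, num_heads=8):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = channels // num_heads  # channels must divide evenly
        self.scale = self.head_dim ** -0.5

        self.qkv = nn.Conv2d(channels, channels * 3, 1, bias=False)
        self.proj = nn.Conv2d(channels, channels, 1)

    def forward(self, x):
        B, C, H, W = x.shape
        qkv = self.qkv(x).reshape(B, 3, self.num_heads,
                                  self.head_dim, H * W)
        # Each of q, k, v: (B, heads, head_dim, H*W)
        q, k, v = qkv[:, 0], qkv[:, 1], qkv[:, 2]

        # Scaled dot-product attention over spatial positions:
        # attn[..., i, j] is how much position i attends to position j
        attn = (q * self.scale).transpose(-2, -1) @ k  # (B, heads, HW, HW)
        attn = F.softmax(attn, dim=-1)

        # Weighted sum of values for every query position
        out = (v @ attn.transpose(-2, -1)).reshape(B, C, H, W)
        # Residual connection, as is usual for attention blocks
        return x + self.proj(out)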

Cross-Attention for Conditioning

import torch.nn as nn
import torch.nn.functional as F


class CrossAttention(nn.Module):
    """Cross-attention for text/image conditioning (Stable Diffusion)."""

    def __init__(self, query_dim, context_dim, num_heads=8):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = query_dim // num_heads
        self.scale = self.head_dim ** -0.5

        self.to_q = nn.Linear(query_dim, query_dim, bias=False)
        self.to_k = nn.Linear(context_dim, query_dim, bias=False)
        self.to_v = nn.Linear(context_dim, query_dim, bias=False)
        self.to_out = nn.Linear(query_dim, query_dim)

    def forward(self, x, context):
        # x: (B, N, C), spatial features flattened to N = H*W tokens
        # context: (B, L, C_ctx), text/image embeddings
        q = self.to_q(x)
        k = self.to_k(context)
        v = self.to_v(context)

        # Reshape for multi-head: (B, heads, tokens, head_dim)
        B, N, C = q.shape
        q = q.reshape(B, N, self.num_heads, self.head_dim).transpose(1, 2)
        k = k.reshape(B, -1, self.num_heads, self.head_dim).transpose(1, 2)
        v = v.reshape(B, -1, self.num_heads, self.head_dim).transpose(1, 2)

        # Every spatial token attends over the L context tokens: (B, heads, N, L)
        attn = (q * self.scale) @ k.transpose(-2, -1)
        attn = F.softmax(attn, dim=-1)

        out = (attn @ v).transpose(1, 2).reshape(B, N, C)
        return self.to_out(out)
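
The core of both attention blocks is the same scaled dot-product. The single-head sketch below spells it out on nested lists so the shapes are visible: `N` query positions attend over `L` context tokens, giving an `N × L` weight matrix whose rows sum to 1. The helper names and the toy inputs are illustrative assumptions, and the learned projections (`to_q`, `to_k`, `to_v`) are left out.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Single-head scaled dot-product attention on nested lists.

    queries: N x d, keys: L x d, values: L x d_v.
    Returns (outputs N x d_v, weights N x L).
    """
    d = len(queries[0])
    outputs, weights = [], []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        w = softmax(scores)
        weights.append(w)
        outputs.append([sum(wj * v[c] for wj, v in zip(w, values))
                        for c in range(len(values[0]))])
    return outputs, weights


# 4 spatial positions (queries) attending over 3 context tokens
q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
k = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]      # identical keys
v = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
out, w = attention(q, k, v)
```

Because the three keys are identical, every query spreads its attention evenly and each output row is the mean of the value vectors.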

3.4 Complete Diffusion U-Net

import torch
import torch.nn as nn


class DiffusionUNet(nn.Module):
    """Full diffusion U-Net with time conditioning and attention."""

    def __init__(self, in_channels=3, out_channels=3, model_channels=128,
                 channel_mult=(1, 2, 4, 8), num_res_blocks=2,
                 attention_resolutions=(8,), dropout=0.0):
        super().__init__()
        # attention_resolutions holds downsampling factors (1, 2, 4, ...),
        # not pixel sizes: attention is added wherever the running factor
        # `ds` matches. Four levels reach a factor of at most 8.

        # Time embedding
        time_emb_dim = model_channels * 4
        self.time_embed = nn.Sequential(
            SinusoidalTimeEmbedding(model_channels),
            nn.Linear(model_channels, time_emb_dim),
            nn.SiLU(),
            nn.Linear(time_emb_dim, time_emb_dim),
        )

        # Input projection
        self.input_blocks = nn.ModuleList([
            nn.Conv2d(in_channels, model_channels, 3, padding=1)
        ])

        # Encoder: record the channel count of every block output,
        # because each one becomes a skip connection
        input_block_channels = [model_channels]
        ch = model_channels
        ds = 1
        for level, mult in enumerate(channel_mult):
            for _ in range(num_res_blocks):
                layers = [
                    DiffusionResBlock(ch, time_emb_dim,
                                      model_channels * mult, dropout)
                ]
                ch = model_channels * mult
                if ds in attention_resolutions:
                    layers.append(SelfAttention(ch))
                self.input_blocks.append(nn.ModuleList(layers))
                input_block_channels.append(ch)
            if level != len(channel_mult) - 1:
                # Strided convolution halves the spatial size
                self.input_blocks.append(
                    nn.ModuleList([nn.Conv2d(ch, ch, 3, stride=2, padding=1)])
                )
                input_block_channels.append(ch)
                ds *= 2

        # Middle block
        self.middle_block = nn.ModuleList([
            DiffusionResBlock(ch, time_emb_dim, ch, dropout),
            SelfAttention(ch),
            DiffusionResBlock(ch, time_emb_dim, ch, dropout),
        ])

        # Decoder: one block per recorded skip, consumed in reverse order
        self.output_blocks = nn.ModuleList([])
        for level, mult in list(enumerate(channel_mult))[::-1]:
            for i in range(num_res_blocks + 1):
                skip_ch = input_block_channels.pop()
                layers = [
                    DiffusionResBlock(ch + skip_ch, time_emb_dim,
                                      model_channels * mult, dropout)
                ]
                ch = model_channels * mult
                if ds in attention_resolutions:
                    layers.append(SelfAttention(ch))
                if level > 0 and i == num_res_blocks:
                    # Exactly doubles the spatial size
                    layers.append(nn.ConvTranspose2d(
                        ch, ch, 3, stride=2, padding=1, output_padding=1))
                    ds //= 2
                self.output_blocks.append(nn.ModuleList(layers))

        # Output
        self.out = nn.Sequential(
            nn.GroupNorm(32, ch),
            nn.SiLU(),
            nn.Conv2d(ch, out_channels, 3, padding=1),
        )

    def forward(self, x, timesteps, context=None):
        # `context` is accepted but unused: this version has no
        # cross-attention layers.
        # Time embedding
        t_emb = self.time_embed(timesteps)

        # Encoder + collect skip connections
        hs = []
        h = x
        for module in self.input_blocks:
            if isinstance(module, nn.ModuleList):
                for layer in module:
                    if isinstance(layer, DiffusionResBlock):
                        h = layer(h, t_emb)
                    else:
                        # SelfAttention or the strided downsampling conv
                        h = layer(h)
            else:
                h = module(h)
            hs.append(h)

        # Middle
        for layer in self.middle_block:
            if isinstance(layer, DiffusionResBlock):
                h = layer(h, t_emb)
            else:
                h = layer(h)

        # Decoder
        for module in self.output_blocks:
            skip = hs.pop()
            h = torch.cat([h, skip], dim=1)
            for layer in module:
                if isinstance(layer, DiffusionResBlock):
                    h = layer(h, t_emb)
                else:
                    # SelfAttention or the transposed upsampling conv
                    h = layer(h)

        return self.out(h)
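
The trickiest part of this constructor is the skip-channel stack: every encoder block pushes its output channel count, and every decoder block pops one. The sketch below replays only that bookkeeping in plain Python, with no layers built, to confirm that pushes and pops balance. `skip_channel_plan` is a hypothetical helper, and its defaults are the constructor defaults shown above.

```python
def skip_channel_plan(model_channels=128, channel_mult=(1, 2, 4, 8),
                      num_res_blocks=2):
    """Replay the encoder pushes and decoder pops of the skip-channel stack.

    Returns (pushed, decoder_inputs, leftover):
      pushed         - channel count of every skip tensor, in encoder order
      decoder_inputs - input channels of every decoder ResBlock (after concat)
      leftover       - whatever remains on the stack (should be empty)
    """
    stack = [model_channels]                 # output of the input projection
    ch = model_channels
    for level, mult in enumerate(channel_mult):
        for _ in range(num_res_blocks):
            ch = model_channels * mult
            stack.append(ch)                 # one skip per ResBlock
        if level != len(channel_mult) - 1:
            stack.append(ch)                 # one skip per downsampling conv
    pushed = list(stack)

    decoder_inputs = []
    for level, mult in reversed(list(enumerate(channel_mult))):
        for _ in range(num_res_blocks + 1):
            decoder_inputs.append(ch + stack.pop())
            ch = model_channels * mult
    return pushed, decoder_inputs, stack


pushed, decoder_inputs, leftover = skip_channel_plan()
```

With the defaults there are 12 skips and 12 decoder blocks, and every concatenated channel count stays divisible by 32, which `GroupNorm(32, ...)` requires.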

3.5 Design Choices in Diffusion U-Nets

| Component | Original U-Net (2015) | Diffusion U-Net (2020+) |
|---|---|---|
| Base block | Double Conv + ReLU | ResBlock + SiLU |
| Normalization | None in the paper (BatchNorm in most reimplementations) | GroupNorm (32 groups) |
| Downsampling | MaxPool (2×2) | Strided Conv (stride=2) |
| Upsampling | Transposed Conv | Transposed Conv or Nearest + Conv |
| Attention | None | Self-Attn at low resolutions |
| Conditioning | None | Time embedding (additive or scale-shift), Cross-Attn (text) |
| Skip connection | Concatenation | Concatenation |
| Activation | ReLU | SiLU (Swish) |

4. U-Net Variants

4.1 Architectural Evolution

| Variant | Year | Innovation | Use Case |
|---|---|---|---|
| U-Net | 2015 | Original encoder-decoder + skip connections | Biomedical segmentation |
| 3D U-Net | 2016 | Extends to 3D volumes | CT/MRI segmentation |
| Attention U-Net | 2018 | Attention gates on skip connections | Improve focus on target structures |
| U-Net++ | 2018 | Nested, dense skip pathways | Better multi-scale feature fusion |
| UNet 3+ | 2020 | Full-scale skip connections | Extreme multi-scale fusion |
| Diffusion U-Net | 2020 | ResBlock + Self-Attn + Time Embedding | Noise prediction in diffusion |
| Stable Diffusion U-Net | 2022 | Cross-attention conditioning + latent space | Text-to-image generation |

4.2 Attention U-Net

Adds attention gates to skip connections, allowing the model to focus on relevant regions:

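import torch
import torch.nn as nn
import torch.nn.functional as F


class AttentionGate(nn.Module):
    """Attention gate for U-Net skip connections."""

    def __init__(self, F_g, F_l, F_int):
        # F_g: decoder feature channels
        # F_l: encoder feature channels (skip)
        # F_int: channels of the intermediate joint representation
        super().__init__()
        self.W_g = nn.Conv2d(F_g, F_int, 1)
        self.W_x = nn.Conv2d(F_l, F_int, 1)
        self.psi = nn.Conv2d(F_int, 1, 1)

    def forward(self, g, x):
        # g: gating signal from decoder
        # x: skip connection from encoder
        # This simplified gate assumes g has already been resized to the
        # spatial size of x
        attn = self.psi(F.relu(self.W_g(g) + self.W_x(x)))
        attn = torch.sigmoid(attn)  # (B, 1, H, W) mask with values in (0, 1)
        return x * attn             # suppress irrelevant skip features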

4.3 U-Net++

Replaces plain skip connections with dense convolutional blocks on skip pathways:

U-Net++ Skip Pathways:
X₀₀ ──────→ X₀₁ ──────→ X₀₂ ──────→ X₀₃
│ ╲ │ ╲ │ ╲
│ ╲ │ ╲ │ ╲
│ X₁₀ ──┼───→ X₁₁ ──┼───→ X₁₂
│ │ ╲ │ ╲
│ │ ╲ │ ╲
│ │ X₂₀ ──┼───→ X₂₁
│ │ │
└───────────┴───────────┴──→ Output

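Each node $X^{i,j}$ (level $i$, position $j$ along the skip pathway) aggregates features from multiple preceding nodes:

$$X^{i,j} = \begin{cases} H\left(D\left(X^{i-1,j}\right)\right), & j = 0 \\ H\left(\left[\left[X^{i,k}\right]_{k=0}^{j-1},\; U\left(X^{i+1,j-1}\right)\right]\right), & j > 0 \end{cases}$$

where $H(\cdot)$ is a convolution block, $D(\cdot)$ is downsampling, $U(\cdot)$ is upsampling, and $[\cdot]$ is channel-wise concatenation.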
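
The recurrence is easier to read as a list of inputs per node. The following sketch enumerates them; the tuple labels (`"same"`, `"up"`, `"down"`, `"input"`) and the name `unetpp_inputs` are assumptions for illustration, and the top-left node is treated as fed by the input image.

```python
def unetpp_inputs(i, j):
    """List the nodes feeding U-Net++ node (i, j).

    Each entry is (kind, level, position):
      "same" - earlier node on the same level (dense skip)
      "up"   - upsampled node from the level below
      "down" - downsampled node from the level above (backbone, j == 0)
    """
    if j == 0:
        return [("down", i - 1, 0)] if i > 0 else [("input",)]
    same_level = [("same", i, k) for k in range(j)]
    return same_level + [("up", i + 1, j - 1)]


top_right = unetpp_inputs(0, 3)
backbone = unetpp_inputs(2, 0)
```

The top-right node receives every earlier node on its level plus one upsampled node, whereas a plain U-Net decoder node would receive only the first node of its level and the upsampled one.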

5. Comparison of U-Net Across Domains

5.1 Segmentation vs. Diffusion

| Aspect | Segmentation U-Net | Diffusion U-Net |
|---|---|---|
| Input | Raw image | Noisy image $x_t$ |
| Output | Segmentation mask | Predicted noise $\epsilon_\theta$ or $\hat{x}_0$ |
| Conditioning | None | Timestep $t$, text, class |
| Attention | Optional (Attention U-Net) | Self-attention + Cross-attention |
| Normalization | BatchNorm (in most reimplementations) | GroupNorm |
| Activation | ReLU | SiLU (Swish) |
| Resolution | Typically fixed tiles (e.g., 572×572) | Flexible (multiples of the total downsampling factor) |
| Key insight | Skip connections recover spatial precision | Skip connections propagate high-freq details through denoising |

5.2 U-Net vs. Other Architectures

| Architecture | Skip Connection Type | Multi-scale | Best For |
|---|---|---|---|
| U-Net | Cross-resolution concat | ✅ Yes | Segmentation, diffusion |
| [[ResNet]] | Same-resolution additive | ❌ No | Classification, feature extraction |
| FPN | Lateral connections | ✅ Yes | Object detection |
| DiT (Transformer) | Residual within blocks | ❌ No (patches) | Scalable diffusion |
| Hourglass | Similar to U-Net | ✅ Yes | Pose estimation |

6. U-Net as Universal Diffusion Backbone

6.1 Why Not Transformer?

The U-Net has long been the default backbone in diffusion for several reasons:

| Reason | Explanation |
|---|---|
| Inductive bias | Convolutional structure naturally handles 2D/3D spatial data |
| Computational efficiency | $O(N)$ in the number of pixels $N$ for convolutions vs. $O(N^2)$ for dense attention |
| Multi-scale native | Encoder-decoder inherently captures multiple resolutions |
| Proven performance | DDPM, Stable Diffusion, Imagen all use U-Net backbones |
| DiT limitations | Reported gains of Transformer backbones (DiT) over U-Net come mainly at large scales (roughly $>$500M params) |
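
The quadratic term in the efficiency row is what restricts attention to low resolutions. As a rough illustration, the sketch below counts the entries of the attention matrix for one image at several feature-map sizes, assuming dense self-attention with 8 heads; `attention_matrix_entries` is a hypothetical helper, and the count ignores implementation tricks such as memory-efficient attention kernels.

```python
def attention_matrix_entries(height, width, num_heads=8):
    """Number of attention weights for one image: heads * N * N, N = H * W."""
    n = height * width
    return num_heads * n * n


entries = {res: attention_matrix_entries(res, res) for res in (64, 32, 16)}
# Halving the side length quarters N and cuts the matrix size by 16x.
ratio_64_to_32 = entries[64] // entries[32]
```

At 64×64 the matrix already holds more than a hundred million weights per image, while at 16×16 it holds about half a million.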

6.2 Diffusion Models Using U-Net

| Model | U-Net Variant | Key Modification |
|---|---|---|
| DDPM | U-Net + ResBlock + Self-Attn | Time embedding added inside each ResBlock (scale-shift conditioning appeared in later models) |
| Stable Diffusion | Latent U-Net + Cross-Attn | Text conditioning, latent space |
| Imagen | Cascaded U-Nets (64→256→1024) | Multi-stage super-resolution |
| ControlNet | Frozen U-Net + Trainable Copy | Zero-convolution control branches |
| SDXL | Larger U-Net (2.6B params) | Dual text encoders, refiner |

7. Practical Implementation Tips

7.1 Architecture Design Choices

| Decision | Recommendation | Rationale |
|---|---|---|
| Depth | 4-5 resolution levels | Balance receptive field and spatial detail |
| Base channels | 64-256 | Trade-off between capacity and memory |
| Channel multipliers | [1, 2, 4, 8] or [1, 2, 4] | Double channels at each level |
| Attention resolution | 32×32 or 16×16 | Attention only at low resolutions (expensive) |
| ResBlocks per level | 2 | Standard; 3 for higher quality |
| GroupNorm groups | 32 | Works well across batch sizes |
| Dropout | 0.1–0.2 | Only in ResBlocks, not attention |

7.2 Training Recommendations

# Key hyperparameters for diffusion U-Net training
config = {
    "model_channels": 128,            # Base channel count
    "channel_mult": [1, 2, 4, 8],     # Multipliers per level
    "num_res_blocks": 2,              # ResBlocks per level
    "attention_resolutions": [16],    # Apply attention at ≤16×16
    "dropout": 0.1,                   # Regularization
    "num_heads": 8,                   # Attention heads
    "use_scale_shift_norm": True,     # Time embedding via scale-shift
    "resblock_updown": True,          # Use ResBlocks for up/down sampling
}

7.3 Common Pitfalls

| Pitfall | Symptom | Fix |
|---|---|---|
| Spatial size mismatch | Concatenation fails in decoder | Ensure input size is divisible by $2^{\text{depth}}$, where depth is the number of downsampling steps |
| Too much attention | OOM, slow training | Only apply attention at 32×32 or 16×16 |
| BatchNorm with small batches | Training instability | Use GroupNorm instead of BatchNorm |
| Missing time conditioning | Poor sample quality | Verify time embedding reaches all ResBlocks |
| Channel mismatch in skip | Shape error | Check encoder/decoder channel alignment |
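
For the first pitfall, a common remedy is to pad the input up to the next valid size before the forward pass and crop the output afterwards. The helper below only computes the target size; `padded_size` is a hypothetical name, and it assumes each of the `num_down` downsampling steps halves the resolution.

```python
def padded_size(size, num_down):
    """Smallest length >= size that is divisible by 2 ** num_down."""
    factor = 2 ** num_down
    return -(-size // factor) * factor   # ceiling division, then scale back up


# A 572-pixel side with 4 downsampling steps must grow to a multiple of 16
target = padded_size(572, 4)
pad_total = target - 572
```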

8. Mathematical Properties

8.1 Receptive Field

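The receptive field of a U-Net with $L$ levels and kernel size $k$ can be roughly estimated as:

$$\text{RF} \approx k^L \cdot \prod_{l=1}^{L-1} s_l$$

where $s_l$ is the stride at level $l$. For a typical 4-level U-Net with $k=3$ and $s=2$ at each level:

$$\text{RF} \approx 3^4 \cdot 2^3 = 81 \cdot 8 = 648 \text{ pixels}$$

This is a coarse estimate rather than an exact count. Skip connections do not widen the receptive field themselves; they let the decoder combine wide-context bottleneck features with high-resolution encoder features, and they give gradients a short path back to the early layers.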
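
For comparison with the closed-form estimate, the theoretical receptive field can also be computed layer by layer with the standard recurrence: each layer adds `(kernel - 1) * jump` pixels, where `jump` is the product of all earlier strides. The sketch assumes the Section 2.1 layout (two 3×3 convolutions per level followed by a 2×2 max pool) and stops at the bottleneck; both function names are hypothetical.

```python
def receptive_field(layers):
    """Theoretical receptive field of a stack of (kernel, stride) layers."""
    rf, jump = 1, 1
    for kernel, stride in layers:
        rf += (kernel - 1) * jump   # growth measured in input pixels
        jump *= stride              # spacing between adjacent output pixels
    return rf


def unet_encoder_layers(levels=4, convs_per_level=2, k=3):
    """(kernel, stride) list for the encoder plus bottleneck."""
    layers = []
    for _ in range(levels):
        layers += [(k, 1)] * convs_per_level
        layers.append((2, 2))                  # 2x2 max pool, stride 2
    layers += [(k, 1)] * convs_per_level       # bottleneck double conv
    return layers


rf_bottleneck = receptive_field(unet_encoder_layers())
```

Under these assumptions the bottleneck sees 140 input pixels per side, which is smaller than the coarse estimate; the decoder convolutions enlarge the field further before the output.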

8.2 Parameter Count

For a U-Net with base channels $C$, $L$ levels, and $R$ ResBlocks per level:

$$\text{Params} \approx \sum_{l=0}^{L-1} 2R\,(m_l C)^2 k^2 + \underbrace{2R\,(m_{L-1} C)^2 k^2}_{\text{bottleneck}} + \sum_{l=L-1}^{0} 2R\,(m_l C \cdot 2 m_l C)\,k^2$$

where $m_l$ is the channel multiplier at level $l$. The dominant cost is at the bottleneck and the first decoder level.
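
Plugging numbers into this estimate shows how strongly the deepest level dominates. The sketch evaluates the three terms for the configuration used elsewhere in this note ($C = 128$, multipliers 1, 2, 4, 8, $R = 2$, $k = 3$). It is an approximation of convolution weights only, ignoring biases, normalization, attention and time-embedding layers; `approx_unet_params` is a hypothetical helper.

```python
def approx_unet_params(C, mults, R=2, k=3):
    """Evaluate the encoder, bottleneck and decoder terms of the estimate."""
    encoder = sum(2 * R * (m * C) ** 2 * k ** 2 for m in mults)
    bottleneck = 2 * R * (mults[-1] * C) ** 2 * k ** 2
    # Decoder convolutions see doubled input channels after concatenation
    decoder = sum(2 * R * (m * C) * (2 * m * C) * k ** 2 for m in mults)
    return encoder, bottleneck, decoder


enc, mid, dec = approx_unet_params(128, (1, 2, 4, 8))
total = enc + mid + dec

# Share contributed by the deepest level (encoder + bottleneck + decoder terms)
deepest = 2 * 2 * (8 * 128) ** 2 * 9
deepest_share = (deepest + mid + 2 * deepest) / total
```

For this configuration the estimate comes to roughly 188 million weights, of which about 80% sit at the deepest resolution.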


9. Connection to Other Concepts

9.1 U-Net → [[ResNet]]

The diffusion U-Net uses ResNet blocks as its fundamental building block. Each ResBlock processes:

  1. Time conditioning via scale-and-shift
  2. Double convolution with residual connection
  3. GroupNorm + SiLU activation

This is the same residual principle that enables training very deep networks — applied inside the U-Net’s multi-scale structure.

9.2 U-Net → [[Diffusion Model]]

U-Net is the universal backbone for diffusion models. The denoising function $\epsilon_\theta(x_t, t)$ is parameterized as a U-Net because:

  • Noise prediction requires pixel-level precision (same input/output resolution)
  • Denoising tasks benefit from multi-scale feature hierarchies
  • Skip connections preserve fine details during denoising

9.3 U-Net → [[Neural ODE]]

While the [[ResNet]] discretely approximates an ODE, U-Net's encoder-decoder structure with skip connections can be viewed as a discrete approximation of a continuous two-point boundary value problem: solving for the clean image given boundary conditions at $t=0$ (clean) and $t=T$ (pure noise).


10. Core Formula Cards

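| # | Formula | Meaning |
|---|---|---|
| 1 | $h_{dec}^{(l)} = \text{Conv}\left(\left[h_{enc}^{(l)} \,\Vert\, \text{Up}\left(h_{dec}^{(l-1)}\right)\right]\right)$ | Skip connection via concatenation |
| 2 | $h_{l+1} = \text{ResBlock}(h_l, t_{emb})$ | Time-conditioned residual block |
| 3 | $\text{Attention}(Q,K,V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$ | Self/cross-attention in bottleneck |
| 4 | $h' = h \odot (1 + \gamma(t)) + \beta(t)$ | Time embedding via adaptive scale-shift |
| 5 | $\text{GN}(x) = \gamma \cdot \frac{x - \mu_g}{\sqrt{\sigma_g^2 + \epsilon}} + \beta$ | GroupNorm (32 groups, independent of batch) |
| 6 | $X^{i,j} = H\left(\left[\{X^{i,k}\}_{k=0}^{j-1},\; U\left(X^{i+1,j-1}\right)\right]\right)$ | U-Net++ dense skip pathway |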

11. Summary

U-Net is the dual-purpose architecture that bridges two eras of deep learning:

  • Segmentation era (2015–2019): Revolutionized biomedical imaging with its encoder-decoder + skip connection design, winning the ISBI cell tracking challenge by a large margin.
  • Generative era (2020–present): Became the backbone of [[Diffusion Model|diffusion models]], powering DDPM, Stable Diffusion, Imagen, and ControlNet.

Its enduring design principle — multi-scale processing with information-preserving skip connections — makes it the natural choice whenever a model must produce high-resolution output with precise spatial structure, whether that output is a segmentation mask or a denoised image.


  • [[Diffusion Model]]
  • [[ResNet]]
  • [[Score Function]]
  • [[Flow Matching]]
  • [[Neural ODE]]
  • [[Convolutional Neural Network (CNN)]]
  • [[Stable Diffusion]]
  • [[ControlNet]]
  • [[DiT]]
  • [[Vision Transformer (ViT)]]