2026-06-30

U-Net

U-Net is a fully convolutional encoder-decoder architecture with symmetric skip connections, originally designed for biomedical image segmentation. It has since become the de facto backbone for [[Diffusion Model|diffusion models]] (DDPM, Stable Diffusion), where it serves as the noise prediction network $ϵ_{θ}(x_{t}, t)$.

1. Core Concept

1.1 The U-Shaped Design

U-Net gets its name from its characteristic U-shaped architecture diagram:

U-Net Architecture (Original, 2015)
═══════════════════════════════════════════════════════
Encoder (Contracting Path)          Decoder (Expanding Path)
                                    
Input → [Conv×2] ──────────────────────────────→ [Conv×2] → Output
           ↓ MaxPool                                  ↑ UpConv
         [Conv×2] ────────────────────────────→ [Conv×2]
           ↓ MaxPool                                  ↑ UpConv
         [Conv×2] ──────────────────────────→ [Conv×2]
           ↓ MaxPool                                  ↑ UpConv
         [Conv×2] ────────────────────────→ [Conv×2]
           ↓ MaxPool                                  ↑ UpConv
              └────────── [Conv×2] (Bottleneck) ──────┘
═══════════════════════════════════════════════════════

Each long horizontal arrow (───→) represents a skip connection that concatenates encoder features with the upsampled decoder features at the same resolution level.

1.2 Key Design Principles

Principle	Description	Benefit
Symmetric Encoder-Decoder	Mirror structure: downsampling path + upsampling path	Multi-scale feature extraction
Skip Connections	Direct concatenation of encoder features to decoder	Preserve fine spatial details lost during downsampling
Fully Convolutional	No fully connected layers	Arbitrary input sizes
Multi-scale Processing	Features at 4-5 resolution levels	Capture both local texture and global structure

1.3 Why “U”?

The architecture compresses spatial resolution while expanding channel depth (encoder), then reverses the process (decoder), with skip connections bridging same-resolution levels — forming a U-shaped information flow:

\underbrace{H \times W \times C}_{\text{Input}} \to \underbrace{\frac{H}{2} \times \frac{W}{2} \times 2 C}_{\text{Level 2}} \to \underbrace{\frac{H}{4} \times \frac{W}{4} \times 4 C}_{\text{Level 3}} \to \underbrace{\frac{H}{8} \times \frac{W}{8} \times 8 C}_{\text{Bottleneck}} \to \dots \to \underbrace{H \times W \times C_{\text{out}}}_{\text{Output}}
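The shape progression above can be traced with a few lines of plain Python. This is a minimal sketch, assuming every downsampling step halves H and W and doubles the channel count, and that the decoder mirrors the encoder; the function name `unet_shapes` and its parameters are illustrative, not taken from any library.

```python
def unet_shapes(h, w, c, levels=4, c_out=1):
    """Return (encoder_shapes, decoder_shapes) as lists of (H, W, C) tuples.

    Assumes each downsampling step halves H and W and doubles C, and that
    the decoder mirrors the encoder. `levels` counts the input level and
    the bottleneck.
    """
    if h % 2 ** (levels - 1) or w % 2 ** (levels - 1):
        raise ValueError("H and W must be divisible by 2**(levels - 1)")
    encoder = [(h >> i, w >> i, c << i) for i in range(levels)]
    # The decoder revisits the encoder resolutions in reverse, skipping the bottleneck.
    decoder = encoder[-2::-1]
    decoder[-1] = (h, w, c_out)  # final 1x1 conv maps to the output channels
    return encoder, decoder


enc, dec = unet_shapes(256, 256, 64, levels=4, c_out=2)
for shape in enc + dec:
    print(shape)
```

With a 256×256 input and 64 base channels, the bottleneck is 32×32×512, and the output returns to 256×256 with `c_out` channels.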

2. Original U-Net (Ronneberger et al., 2015)

2.1 Original Design

The original U-Net was proposed for biomedical image segmentation (neuronal structures in electron-microscopy stacks, and cell tracking in light-microscopy images):

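import math  # used by the time-embedding snippet in section 3.3

import torch
import torch.nn as nn
import torch.nn.functional as F


class OriginalUNet(nn.Module):
    """U-Net for biomedical segmentation, in a common modernized form.

    Unlike the 2015 paper, this version uses padded convolutions and
    BatchNorm, so the output has the same spatial size as the input.
    """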
    
    def __init__(self, in_channels=1, out_channels=2, features=[64, 128, 256, 512]):
        super().__init__()
        
        # Encoder (Contracting Path)
        self.encoders = nn.ModuleList()
        for i, feat in enumerate(features):
            in_ch = in_channels if i == 0 else features[i-1]
            self.encoders.append(self._double_conv(in_ch, feat))
        
        self.pool = nn.MaxPool2d(kernel_size=2, stride=2)
        
        # Bottleneck
        self.bottleneck = self._double_conv(features[-1], features[-1] * 2)
        
        # Decoder (Expanding Path)
        self.decoders = nn.ModuleList()
        self.upconvs = nn.ModuleList()
        for feat in reversed(features):
            self.upconvs.append(
                nn.ConvTranspose2d(feat * 2, feat, kernel_size=2, stride=2)
            )
            # After concatenation: feat (encoder) + feat (upconv) = 2*feat
            self.decoders.append(self._double_conv(feat * 2, feat))
        
        self.final_conv = nn.Conv2d(features[0], out_channels, kernel_size=1)
    
    def _double_conv(self, in_ch, out_ch):
        return nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )
    
    def forward(self, x):
        # Encoder
        skip_connections = []
        for encoder in self.encoders:
            x = encoder(x)
            skip_connections.append(x)
            x = self.pool(x)
        
        # Bottleneck
        x = self.bottleneck(x)
        
        # Decoder with skip connections
        skip_connections = skip_connections[::-1]  # reverse
        for i, (upconv, decoder) in enumerate(zip(self.upconvs, self.decoders)):
            x = upconv(x)
            # Concatenate skip connection from encoder
            skip = skip_connections[i]
            # Handle size mismatch by resizing the upsampled features
            # (the 2015 paper instead center-crops the encoder features)
            if x.shape[2:] != skip.shape[2:]:
                x = F.interpolate(x, size=skip.shape[2:])
            x = torch.cat([skip, x], dim=1)
            x = decoder(x)
        
        return self.final_conv(x)

2.2 Skip Connection Mechanics

The skip connection concatenates (not adds) encoder features directly to decoder features:

h_{dec}^{(l)} = Conv ([h_{enc}^{(l)} ‖ UpConv (h_{dec}^{(l - 1)})])

where $∥$ denotes channel-wise concatenation. This is different from [[ResNet]]'s additive skip connection:

Aspect	U-Net Skip	ResNet Skip
Operation	Concatenation	Addition
Channel change	Doubles channels (encoder + upconv)	Preserves channels (identity)
Purpose	Restore spatial details	Ease gradient flow
Structure	Cross-resolution (encoder → decoder)	Same-resolution (input → output)

2.3 Training Strategy (Original Paper)

The original U-Net used several key training techniques:

Technique	Description
Overlap-tile strategy	Predict segmentation in tiles with overlap to handle large images
Elastic deformations	Data augmentation via random elastic transformations
Weighted loss	Higher weight on separation borders between touching objects
Weight map	Pre-computed pixel-wise weight map emphasizing boundary pixels

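Loss function (weighted cross-entropy). The paper writes the energy as

E = \sum_{x \in Ω} w (x) \log (p_{ℓ (x)} (x))

where $p_{ℓ (x)} (x)$ is the softmax probability of the true label $ℓ (x)$ at pixel $x$; as a quantity to minimize it is used with a leading minus sign. $w (x)$ is the weight map emphasizing borders between cells:

w (x) = w_{c} (x) + w_{0} \cdot \exp (- \frac{(d_{1} (x) + d_{2} (x))^{2}}{2 σ^{2}})

Here $w_{c}$ is a class-balancing weight map, $d_{1}$ and $d_{2}$ are the distances to the border of the nearest and second-nearest cell, and the paper sets $w_{0} = 10$ and $σ \approx 5$ pixels.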
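To get a feel for the weight map, the formula can be evaluated for single pixels. The sketch below is plain Python; `border_weight` is a hypothetical helper name, the defaults `w0=10` and `sigma=5` are the values reported in the original paper, and `d1`, `d2` are taken to be the distances to the nearest and second-nearest cell border.

```python
import math


def border_weight(w_c, d1, d2, w0=10.0, sigma=5.0):
    """Pixel weight w(x) = w_c + w0 * exp(-(d1 + d2)^2 / (2 * sigma^2)).

    d1, d2: distances in pixels to the nearest and second-nearest cell border.
    """
    return w_c + w0 * math.exp(-((d1 + d2) ** 2) / (2.0 * sigma ** 2))


# A background pixel squeezed between two cells vs. one far from a second cell.
print(border_weight(w_c=1.0, d1=1.0, d2=1.0))    # narrow gap: weight near w_c + w0
print(border_weight(w_c=1.0, d1=20.0, d2=30.0))  # far away: weight near w_c
```

Pixels in narrow gaps between touching cells receive roughly ten times the base weight, which is what pushes the network to learn the thin separating borders.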

3. U-Net in Diffusion Models

3.1 Why U-Net for Diffusion?

Diffusion models need a network $ϵ_{θ} (x_{t}, t)$ that:

Preserves spatial resolution (input and output have same shape)
Captures multi-scale features (noise patterns exist at all scales)
Incorporates time conditioning (different denoising behavior at each $t$ )
Handles additional conditioning (text, class labels, images)

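The U-shaped layout satisfies the first two requirements by construction. The last two are met by the extensions described below: a timestep embedding injected into every block, and cross-attention for external conditioning.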

3.2 Diffusion U-Net Architecture

Modern diffusion U-Nets extend the original design with residual blocks, a timestep embedding, and self-attention at low resolutions:

Diffusion U-Net (DDPM / Stable Diffusion)
═══════════════════════════════════════════════════════
Input (noisy image x_t + timestep t)
  │
  ├── Time Embedding: Sinusoidal → MLP → [emb_dim]
  │
  ▼
Encoder:
  ResBlock × 2 → Downsample ──────────────────────┐
  ResBlock × 2 → Downsample ──────────────┐       │
  ResBlock × 2 → Downsample ────┐         │       │
  ResBlock × 2                  │         │       │
                             ┌──┘         │       │
Middle:                      │            │       │
  ResBlock + Self-Attention   │            │       │
                             └──┐         │       │
Decoder:                        │         │       │
  ResBlock × 2 ← Concat ←───────┘         │       │
  ResBlock × 2 ← Concat ←─────────────────┘       │
  ResBlock × 2 ← Concat ←─────────────────────────┘
  ResBlock × 2
  │
  ▼
Output (predicted noise ε_θ)
═══════════════════════════════════════════════════════

3.3 Key Components

Time Embedding

class SinusoidalTimeEmbedding(nn.Module):
    """Sinusoidal position encoding for diffusion timesteps."""
    
    def __init__(self, dim):
        super().__init__()
        self.dim = dim
    
    def forward(self, t):
        # t: (B,) integer timesteps
        half_dim = self.dim // 2
        emb = math.log(10000) / (half_dim - 1)
        emb = torch.exp(torch.arange(half_dim, device=t.device) * -emb)
        emb = t[:, None].float() * emb[None, :]
        emb = torch.cat([torch.sin(emb), torch.cos(emb)], dim=-1)
        return emb  # (B, dim)
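The same arithmetic can be checked without PyTorch. The sketch below is a pure-Python mirror of the `forward` method above for a single timestep; `sinusoidal_embedding` is an illustrative name, and it assumes an even `dim` of at least 4 so that `half_dim - 1` is nonzero.

```python
import math


def sinusoidal_embedding(t, dim):
    """Pure-Python mirror of SinusoidalTimeEmbedding.forward for one timestep."""
    half_dim = dim // 2
    step = math.log(10000) / (half_dim - 1)
    # Frequencies fall geometrically from 1 down to 1/10000.
    freqs = [math.exp(-step * i) for i in range(half_dim)]
    angles = [t * f for f in freqs]
    return [math.sin(a) for a in angles] + [math.cos(a) for a in angles]


print(sinusoidal_embedding(0, 8))  # all sines are 0, all cosines are 1
print([round(v, 3) for v in sinusoidal_embedding(10, 8)])
```

Low-index entries oscillate quickly with `t` and high-index entries slowly, so nearby timesteps get similar embeddings while distant ones remain distinguishable.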

ResBlock with Time Conditioning

class DiffusionResBlock(nn.Module):
    """ResNet block with time embedding injection."""
    
    def __init__(self, channels, emb_channels, out_channels=None,
                 dropout=0.0):
        super().__init__()
        out_channels = out_channels or channels
        
        self.norm1 = nn.GroupNorm(32, channels)
        self.conv1 = nn.Conv2d(channels, out_channels, 3, padding=1)
        self.norm2 = nn.GroupNorm(32, out_channels)
        self.conv2 = nn.Conv2d(out_channels, out_channels, 3, padding=1)
        
        # Time embedding injection
        self.time_mlp = nn.Sequential(
            nn.SiLU(),
            nn.Linear(emb_channels, out_channels * 2),
        )
        
        self.dropout = nn.Dropout(dropout)
        self.skip = nn.Conv2d(channels, out_channels, 1) \
                    if channels != out_channels else nn.Identity()
    
    def forward(self, x, t_emb):
        # Time conditioning via scale-and-shift
        scale_shift = self.time_mlp(t_emb)[:, :, None, None]
        scale, shift = scale_shift.chunk(2, dim=1)
        
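        h = self.norm1(x)
        h = F.silu(h)
        h = self.conv1(h)
        
        # Inject time after the second normalization; applied before norm2,
        # GroupNorm would largely cancel the scale and shift
        h = self.norm2(h) * (1 + scale) + shift
        h = F.silu(h)
        h = self.dropout(h)
        h = self.conv2(h)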
        
        return h + self.skip(x)

Self-Attention Block

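class SelfAttention(nn.Module):
    """Multi-head self-attention over spatial positions for diffusion U-Net."""
    
    def __init__(self, channels, num_heads=8):
        super().__init__()
        assert channels % num_heads == 0, "channels must be divisible by num_heads"
        self.num_heads = num_heads
        self.head_dim = channels // num_heads
        self.scale = self.head_dim ** -0.5
        
        self.norm = nn.GroupNorm(32, channels)
        self.qkv = nn.Conv2d(channels, channels * 3, 1, bias=False)
        self.proj = nn.Conv2d(channels, channels, 1)
    
    def forward(self, x):
        B, C, H, W = x.shape
        qkv = self.qkv(self.norm(x)).reshape(B, 3, self.num_heads,
                                             self.head_dim, H * W)
        # Move the H*W token axis before head_dim: each of q, k, v is
        # (B, heads, H*W, head_dim)
        q, k, v = qkv.transpose(-2, -1).unbind(dim=1)
        
        # Scaled dot-product attention across the H*W spatial positions
        attn = (q * self.scale) @ k.transpose(-2, -1)  # (B, heads, HW, HW)
        attn = F.softmax(attn, dim=-1)
        
        out = (attn @ v).transpose(-2, -1).reshape(B, C, H, W)
        return x + self.proj(out)  # residual connection around the attention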

Cross-Attention for Conditioning

class CrossAttention(nn.Module):
    """Cross-attention for text/image conditioning (Stable Diffusion)."""
    
    def __init__(self, query_dim, context_dim, num_heads=8):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = query_dim // num_heads
        self.scale = self.head_dim ** -0.5
        
        self.to_q = nn.Linear(query_dim, query_dim, bias=False)
        self.to_k = nn.Linear(context_dim, query_dim, bias=False)
        self.to_v = nn.Linear(context_dim, query_dim, bias=False)
        self.to_out = nn.Linear(query_dim, query_dim)
    
    def forward(self, x, context):
        # x: (B, N, C)  — spatial features flattened
        # context: (B, L, C_ctx) — text/image embeddings
        q = self.to_q(x)
        k = self.to_k(context)
        v = self.to_v(context)
        
        # Reshape for multi-head
        B, N, C = q.shape
        q = q.reshape(B, N, self.num_heads, self.head_dim).transpose(1, 2)
        k = k.reshape(B, -1, self.num_heads, self.head_dim).transpose(1, 2)
        v = v.reshape(B, -1, self.num_heads, self.head_dim).transpose(1, 2)
        
        attn = (q * self.scale) @ k.transpose(-2, -1)
        attn = F.softmax(attn, dim=-1)
        
        out = (attn @ v).transpose(1, 2).reshape(B, N, C)
        return self.to_out(out)
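The core of both attention blocks is the softmax-weighted average over key/value tokens. The following single-head sketch on plain Python lists (no projections, no batching; the function name `attention` is illustrative) shows how each query token pulls in the value of the context token it matches best.

```python
import math


def attention(queries, keys, values):
    """Single-head scaled dot-product attention on plain Python lists.

    queries: N vectors of length d; keys: L vectors of length d;
    values: L vectors of any common length. Returns N output vectors.
    """
    d = len(queries[0])
    outputs = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        peak = max(scores)  # subtract the max for numerical stability
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]  # softmax over the L context tokens
        outputs.append([sum(w * v[c] for w, v in zip(weights, values))
                        for c in range(len(values[0]))])
    return outputs


# Two image tokens attend over three context tokens.
out = attention(queries=[[1.0, 0.0], [0.0, 1.0]],
                keys=[[10.0, 0.0], [0.0, 10.0], [0.0, 0.0]],
                values=[[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])
print([[round(x, 3) for x in row] for row in out])
```

In cross-attention the queries come from flattened image features and the keys and values from the text embeddings, so the number of queries N and the context length L generally differ.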

3.4 Complete Diffusion U-Net

class DiffusionUNet(nn.Module):
    """Full diffusion U-Net with time conditioning and attention."""
    
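    def __init__(self, in_channels=3, out_channels=3, model_channels=128,
                 channel_mult=(1, 2, 4, 8), num_res_blocks=2,
                 attention_resolutions=(8,), dropout=0.0):
        # attention_resolutions lists downsampling factors (ds), not feature-map
        # sizes. Four levels give ds in {1, 2, 4, 8}, so a value of 16 would
        # never match and no per-level attention would be created.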
        super().__init__()
        
        # Time embedding
        time_emb_dim = model_channels * 4
        self.time_embed = nn.Sequential(
            SinusoidalTimeEmbedding(model_channels),
            nn.Linear(model_channels, time_emb_dim),
            nn.SiLU(),
            nn.Linear(time_emb_dim, time_emb_dim),
        )
        
        # Input projection
        self.input_blocks = nn.ModuleList([
            nn.Conv2d(in_channels, model_channels, 3, padding=1)
        ])
        
        # Encoder
        input_block_channels = [model_channels]
        ch = model_channels
        ds = 1
        for level, mult in enumerate(channel_mult):
            for _ in range(num_res_blocks):
                layers = [
                    DiffusionResBlock(ch, time_emb_dim, 
                                     model_channels * mult, dropout)
                ]
                ch = model_channels * mult
                if ds in attention_resolutions:
                    layers.append(SelfAttention(ch))
                self.input_blocks.append(nn.ModuleList(layers))
                input_block_channels.append(ch)
            if level != len(channel_mult) - 1:
                self.input_blocks.append(
                    nn.ModuleList([nn.Conv2d(ch, ch, 3, stride=2, padding=1)])
                )
                input_block_channels.append(ch)
                ds *= 2
        
        # Middle block
        self.middle_block = nn.ModuleList([
            DiffusionResBlock(ch, time_emb_dim, ch, dropout),
            SelfAttention(ch),
            DiffusionResBlock(ch, time_emb_dim, ch, dropout),
        ])
        
        # Decoder
        self.output_blocks = nn.ModuleList([])
        for level, mult in list(enumerate(channel_mult))[::-1]:
            for i in range(num_res_blocks + 1):
                skip_ch = input_block_channels.pop()
                layers = [
                    DiffusionResBlock(ch + skip_ch, time_emb_dim,
                                     model_channels * mult, dropout)
                ]
                ch = model_channels * mult
                if ds in attention_resolutions:
                    layers.append(SelfAttention(ch))
                if level > 0 and i == num_res_blocks:
                    layers.append(nn.ConvTranspose2d(ch, ch, 3, stride=2, padding=1, output_padding=1))
                    ds //= 2
                self.output_blocks.append(nn.ModuleList(layers))
        
        # Output
        self.out = nn.Sequential(
            nn.GroupNorm(32, ch),
            nn.SiLU(),
            nn.Conv2d(ch, out_channels, 3, padding=1),
        )
    
    def forward(self, x, timesteps, context=None):
        # context is accepted but unused: this sketch has no cross-attention layers
        # Time embedding
        t_emb = self.time_embed(timesteps)
        
        # Encoder + collect skip connections
        hs = []
        h = x
        for module in self.input_blocks:
            if isinstance(module, nn.ModuleList):
                for layer in module:
                    if isinstance(layer, DiffusionResBlock):
                        h = layer(h, t_emb)
                    else:
                        # SelfAttention and the strided downsampling conv
                        # take only h; without this branch the downsampling
                        # convs would be skipped
                        h = layer(h)
            else:
                h = module(h)
            hs.append(h)
        
        # Middle
        for layer in self.middle_block:
            if isinstance(layer, DiffusionResBlock):
                h = layer(h, t_emb)
            elif isinstance(layer, SelfAttention):
                h = layer(h)
        
        # Decoder
        for module in self.output_blocks:
            skip = hs.pop()
            h = torch.cat([h, skip], dim=1)
            for layer in module:
                if isinstance(layer, DiffusionResBlock):
                    h = layer(h, t_emb)
                elif isinstance(layer, SelfAttention):
                    h = layer(h)
                elif isinstance(layer, nn.ConvTranspose2d):
                    h = layer(h)
        
        return self.out(h)
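The trickiest part of the constructor is keeping the skip stack and the decoder input widths aligned. The sketch below replays only the channel arithmetic of `DiffusionUNet.__init__` above, without building any layers; `skip_channel_plan` is a hypothetical helper written for this note.

```python
def skip_channel_plan(model_channels=128, channel_mult=(1, 2, 4, 8), num_res_blocks=2):
    """Replay the channel arithmetic of DiffusionUNet.__init__ without layers.

    Returns one (decoder_in_channels, decoder_out_channels) pair per decoder
    ResBlock, where decoder_in_channels = current channels + popped skip.
    """
    # Encoder: record the channel count of every tensor pushed onto the skip stack.
    skip_channels = [model_channels]
    ch = model_channels
    for level, mult in enumerate(channel_mult):
        for _ in range(num_res_blocks):
            ch = model_channels * mult
            skip_channels.append(ch)
        if level != len(channel_mult) - 1:
            skip_channels.append(ch)  # output of the downsampling conv

    # Decoder: pop one skip per ResBlock, num_res_blocks + 1 blocks per level.
    plan = []
    for level, mult in reversed(list(enumerate(channel_mult))):
        for _ in range(num_res_blocks + 1):
            plan.append((ch + skip_channels.pop(), model_channels * mult))
            ch = model_channels * mult
    assert not skip_channels, "every encoder output should be consumed exactly once"
    return plan


for in_ch, out_ch in skip_channel_plan():
    print(f"{in_ch:5d} -> {out_ch}")
```

With the default settings the encoder pushes 12 tensors and the decoder has 12 ResBlocks, so every skip is consumed exactly once; the third block of each decoder level receives the narrower skip that was saved just before the matching downsampling step.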

3.5 Design Choices in Diffusion U-Nets

Component	Original U-Net (2015)	Diffusion U-Net (2020+)
Base block	Double Conv + ReLU	ResBlock + SiLU
Normalization	None in the paper (BatchNorm in many reimplementations)	GroupNorm (32 groups)
Downsampling	MaxPool (2×2)	Strided Conv (stride=2)
Upsampling	Transposed Conv	Transposed Conv or Nearest + Conv
Attention	None	Self-Attn at low resolutions
Conditioning	None	Time emb (scale-shift), Cross-Attn (text)
Skip connection	Concatenation	Concatenation
Activation	ReLU	SiLU (Swish)

4. U-Net Variants

4.1 Architectural Evolution

Variant	Year	Innovation	Use Case
U-Net	2015	Original encoder-decoder + skip connections	Biomedical segmentation
3D U-Net	2016	Extends to 3D volumes	CT/MRI segmentation
Attention U-Net	2018	Attention gates on skip connections	Improve focus on target structures
U-Net++	2018	Nested, dense skip pathways	Better multi-scale feature fusion
UNet 3+	2020	Full-scale skip connections	Extreme multi-scale fusion
Diffusion U-Net	2020	ResBlock + Self-Attn + Time Embedding	Noise prediction in diffusion
Stable Diffusion U-Net	2022	Cross-attention conditioning + latent space	Text-to-image generation

4.2 Attention U-Net

Adds attention gates to skip connections, allowing the model to focus on relevant regions:

class AttentionGate(nn.Module):
    """Attention gate for U-Net skip connections."""
    
    def __init__(self, F_g, F_l, F_int):
        # F_g: decoder feature channels
        # F_l: encoder feature channels (skip)
        super().__init__()
        self.W_g = nn.Conv2d(F_g, F_int, 1)
        self.W_x = nn.Conv2d(F_l, F_int, 1)
        self.psi = nn.Conv2d(F_int, 1, 1)
    
    def forward(self, g, x):
        # g: gating signal from decoder, assumed already resized to x's spatial size
        # x: skip connection from encoder
        attn = self.psi(F.relu(self.W_g(g) + self.W_x(x)))
        attn = torch.sigmoid(attn)
        return x * attn

4.3 U-Net++

Replaces plain skip connections with dense convolutional blocks on skip pathways:

U-Net++ Skip Pathways:
X₀₀ ──────→ X₀₁ ──────→ X₀₂ ──────→ X₀₃
  │   ╲       │   ╲       │   ╲       
  │    ╲      │    ╲      │    ╲      
  │     X₁₀ ──┼───→ X₁₁ ──┼───→ X₁₂  
  │           │   ╲       │   ╲       
  │           │    ╲      │    ╲      
  │           │     X₂₀ ──┼───→ X₂₁  
  │           │           │          
  └───────────┴───────────┴──→ Output

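Each node $X^{i, j}$ (level $i$, position $j$ along the skip pathway) aggregates features from multiple preceding nodes:

X^{i, j} = \begin{cases} H (D (X^{i - 1, j})), & j = 0 \\ H ([[X^{i, k}]_{k = 0}^{j - 1}, U (X^{i + 1, j - 1})]), & j > 0 \end{cases}

where $H$ is a convolution block, $D$ is downsampling, $U$ is upsampling, and $[\cdot]$ denotes channel-wise concatenation.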
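The case distinction is easier to read when the inputs of a node are enumerated explicitly. This is a small illustrative sketch (the helper `unetpp_inputs` and its "down"/"skip"/"up" labels are made up for this note) that lists which nodes feed $X^{i, j}$ under the recurrence above.

```python
def unetpp_inputs(i, j):
    """List the nodes feeding X^{i,j} in U-Net++, as (op, (row, col)) pairs.

    op is "down" for a downsampled input, "skip" for a same-level dense
    skip, and "up" for an upsampled input from the level below.
    """
    if j == 0:
        # Backbone (encoder) node: fed by the node one level up, downsampled.
        # X^{0,0} has no predecessor node; it takes the input image.
        return [("down", (i - 1, 0))] if i > 0 else []
    same_level = [("skip", (i, k)) for k in range(j)]
    return same_level + [("up", (i + 1, j - 1))]


print(unetpp_inputs(0, 3))
print(unetpp_inputs(2, 0))
```

For $X^{0, 3}$ this yields the three earlier nodes on the top row plus the upsampled $X^{1, 2}$, which is the dense fusion that replaces the single concatenation of a plain U-Net.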

5. Comparison of U-Net Across Domains

5.1 Segmentation vs. Diffusion

Aspect	Segmentation U-Net	Diffusion U-Net
Input	Raw image	Noisy image $x_{t}$
Output	Segmentation mask	Predicted noise $ϵ_{θ}$ or ${\hat{x}}_{0}$
Conditioning	None	Timestep $t$ , text, class
Attention	Optional (Attention U-Net)	Self-attention + Cross-attention
Normalization	BatchNorm	GroupNorm
Activation	ReLU	SiLU (Swish)
Resolution	Tiled (e.g., 572×572 input tiles in the paper)	Flexible (multiples of $2^{depth}$)
Key insight	Skip connections recover spatial precision	Skip connections propagate high-freq details through denoising

5.2 U-Net vs. Other Architectures

Architecture	Skip Connection Type	Multi-scale	Best For
U-Net	Cross-resolution concat	✅ Yes	Segmentation, diffusion
[[ResNet]]	Same-resolution additive	❌ No	Classification, feature extraction
FPN	Lateral connections	✅ Yes	Object detection
DiT (Transformer)	Residual within blocks	❌ No (patches)	Scalable diffusion
Hourglass	Similar to U-Net	✅ Yes	Pose estimation

6. U-Net as Universal Diffusion Backbone

6.1 Why Not Transformer?

The U-Net became the default diffusion backbone, and remains widely used, for several reasons:

Reason	Explanation
Inductive bias	Convolutional structure naturally handles 2D/3D spatial data
Computational efficiency	$O (N)$ in the number of spatial positions $N$ for convolutions vs. $O (N^{2})$ for dense attention
Multi-scale native	Encoder-decoder inherently captures multiple resolutions
Proven performance	DDPM, Stable Diffusion (through SDXL), and Imagen all use U-Net backbones
DiT limitations	Transformer (DiT) only outperforms U-Net at very large scales ($>$500M params)

6.2 Diffusion Models Using U-Net

Model	U-Net Variant	Key Modification
DDPM	U-Net + ResBlock + Self-Attn	Time embedding added inside each ResBlock (scale-shift came with later ADM-style models)
Stable Diffusion	Latent U-Net + Cross-Attn	Text conditioning, latent space
Imagen	Cascaded U-Nets (64→256→1024)	Multi-stage super-resolution
ControlNet	Frozen U-Net + Trainable Copy	Zero-convolution control branches
SDXL	Larger U-Net (2.6B params)	Dual text encoders, refiner

7. Practical Implementation Tips

7.1 Architecture Design Choices

Decision	Recommendation	Rationale
Depth	4-5 resolution levels	Balance receptive field and spatial detail
Base channels	64-256	Trade-off between capacity and memory
Channel multipliers	[1, 2, 4, 8] or [1, 2, 4]	Double channels at each level
Attention resolution	$\leq 32^{2}$ or $\leq 16^{2}$	Attention only at low resolutions (expensive)
ResBlocks per level	2	Standard, 3 for higher quality
GroupNorm groups	32	Works well across batch sizes
Dropout	0.1–0.2	Only in ResBlocks, not attention
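The attention-resolution recommendation follows from how quickly the attention matrix grows. A rough estimate, assuming a naive implementation that materializes the full (HW × HW) weight matrix per head in 32-bit floats (fused attention kernels avoid storing it), can be sketched as follows; `attention_matrix_bytes` is an illustrative helper.

```python
def attention_matrix_bytes(height, width, num_heads=8, batch=1, bytes_per_value=4):
    """Memory for the (HW x HW) attention weights of one self-attention layer."""
    tokens = height * width
    return batch * num_heads * tokens * tokens * bytes_per_value


for size in (16, 32, 64, 128):
    mib = attention_matrix_bytes(size, size) / 2 ** 20
    print(f"{size:3d}x{size:<3d} -> {mib:10.1f} MiB")
```

Each doubling of the feature-map side multiplies the cost by 16: 2 MiB at 16×16 becomes 512 MiB at 64×64 for a single sample and layer, before counting activations saved for the backward pass.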

7.2 Training Recommendations

# Key hyperparameters for diffusion U-Net training
config = {
    "model_channels": 128,         # Base channel count
    "channel_mult": [1, 2, 4, 8],  # Multipliers per level
    "num_res_blocks": 2,           # ResBlocks per level
    "attention_resolutions": [16], # Feature-map size for attention (16×16); the DiffusionUNet in 3.4 compares against downsampling factors instead
    "dropout": 0.1,                # Regularization
    "num_heads": 8,                # Attention heads
    "use_scale_shift_norm": True,  # Time embedding via scale-shift
    "resblock_updown": True,       # Use ResBlocks for up/down sampling
}
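Because the config states attention levels as feature-map sizes while the `DiffusionUNet` constructor in section 3.4 compares against the running downsampling factor `ds`, a conversion step is needed between the two. The sketch below assumes the convention `ds = image_size // feature_map_size`; the helper name is hypothetical, and it also reports sizes that no encoder level ever reaches.

```python
def attention_downsample_factors(image_size, attention_sizes, channel_mult):
    """Convert feature-map sizes (e.g. 16 for 16x16) to downsampling factors.

    Returns (used, unused): factors some encoder level actually reaches, and
    the requested sizes that no level produces.
    """
    # One factor per level: 1, 2, 4, ... (a downsample between consecutive levels).
    reachable = [2 ** level for level in range(len(channel_mult))]
    wanted = [image_size // size for size in attention_sizes]
    used = [ds for ds in wanted if ds in reachable]
    unused = [size for size, ds in zip(attention_sizes, wanted) if ds not in reachable]
    return used, unused


print(attention_downsample_factors(64, [16], [1, 2, 4, 8]))  # ([4], [])
print(attention_downsample_factors(64, [4], [1, 2, 4, 8]))   # ([], [4])
```

A non-empty `unused` list is worth treating as a configuration error, since the model would silently be built without the requested attention layers.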

7.3 Common Pitfalls

Pitfall	Symptom	Fix
Spatial size mismatch	Concatenation fails in decoder	Pad or crop so height and width are divisible by $2^{d}$, where $d$ is the number of downsampling steps
Too much attention	OOM, slow training	Only apply attention at $\leq 32^{2}$ or $\leq 16^{2}$
BatchNorm with small batches	Training instability	Use GroupNorm instead of BatchNorm
Missing time conditioning	Poor sample quality	Verify time embedding reaches all ResBlocks
Channel mismatch in skip	Shape error	Check encoder/decoder channel alignment
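For the spatial size mismatch in particular, the required padding is simple to compute up front. A minimal sketch, assuming each downsampling step halves the resolution exactly (`padded_size` is an illustrative helper, and how the padding is applied, e.g. reflection or zeros, is left open):

```python
def padded_size(height, width, num_downsamples):
    """Smallest (H, W) >= the input that is divisible by 2**num_downsamples."""
    multiple = 2 ** num_downsamples
    pad_h = (-height) % multiple  # rows to add so H becomes a multiple
    pad_w = (-width) % multiple   # columns to add so W becomes a multiple
    return height + pad_h, width + pad_w


print(padded_size(250, 333, num_downsamples=3))  # (256, 336)
```

After inference the output can be cropped back to the original height and width.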

8. Mathematical Properties

8.1 Receptive Field

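The theoretical receptive field of a stack of $n$ layers with kernel sizes $k_{i}$ and strides $s_{i}$ grows additively, with each layer's contribution scaled by the product of the strides before it:

RF = 1 + \sum_{i = 1}^{n} (k_{i} - 1) \prod_{j = 1}^{i - 1} s_{j}

For the encoder of a typical 4-level U-Net with two $3 \times 3$ convolutions per level and $2 \times 2$ stride-2 pooling between levels, the contributions are $4, 1, 8, 2, 16, 4, 32$ (convolution pairs at cumulative strides 1, 2, 4, 8, interleaved with the three pooling layers):

RF = 1 + 4 + 1 + 8 + 2 + 16 + 4 + 32 = 68 pixels

The decoder convolutions widen this further. Skip connections do not enlarge the receptive field; they let each decoder level combine the wide-context features coming up from deeper levels with high-resolution encoder features. In diffusion U-Nets, self-attention at the low-resolution levels makes the receptive field global.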

8.2 Parameter Count

For a U-Net with base channels $C$, $L$ levels, $R$ ResBlocks per level, and kernel size $k$, counting only convolution weights:

\text{Params} \approx \underbrace{\sum_{l = 0}^{L - 1} 2 R \cdot (m_{l} C)^{2} \cdot k^{2}}_{\text{encoder}} + \underbrace{2 R \cdot (m_{L - 1} C)^{2} \cdot k^{2}}_{\text{bottleneck}} + \underbrace{\sum_{l = 0}^{L - 1} 2 R \cdot (m_{l} C \cdot 2 m_{l} C) \cdot k^{2}}_{\text{decoder}}

where $m_{l}$ is the channel multiplier at level $l$. Because each term grows with $m_{l}^{2}$, the deepest levels dominate: the bottleneck and the lowest-resolution encoder and decoder levels.
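The estimate can be evaluated directly to see where the parameters sit. The sketch below implements the three terms of the formula as written (convolution weights only; attention, time-embedding, and normalization parameters are ignored), with `approx_conv_params` as an illustrative helper name.

```python
def approx_conv_params(base_channels, channel_mult, res_blocks, kernel=3):
    """Evaluate the three-term estimate; returns (encoder, bottleneck, decoder)."""
    k2 = kernel ** 2
    widths = [m * base_channels for m in channel_mult]
    encoder = sum(2 * res_blocks * w * w * k2 for w in widths)
    bottleneck = 2 * res_blocks * widths[-1] ** 2 * k2
    # Decoder convs see concatenated inputs, hence the factor 2 on the input width.
    decoder = sum(2 * res_blocks * (w * 2 * w) * k2 for w in widths)
    return encoder, bottleneck, decoder


enc, mid, dec = approx_conv_params(128, [1, 2, 4, 8], res_blocks=2)
total = enc + mid + dec
print(f"encoder {enc / 1e6:.1f}M, bottleneck {mid / 1e6:.1f}M, "
      f"decoder {dec / 1e6:.1f}M, total {total / 1e6:.1f}M")
```

For $C = 128$ and multipliers [1, 2, 4, 8], the estimate comes to roughly 188M convolution weights, about three quarters of which belong to the deepest level of each path.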

9. Connection to Other Concepts

9.1 U-Net → [[ResNet]]

The diffusion U-Net uses ResNet blocks as its fundamental building block. Each ResBlock processes:

Time conditioning via scale-and-shift
Double convolution with residual connection
GroupNorm + SiLU activation

This is the same residual principle that enables training very deep networks — applied inside the U-Net’s multi-scale structure.

9.2 U-Net → [[Diffusion Model]]

U-Net is the most widely used backbone for diffusion models. The denoising function $ϵ_{θ} (x_{t}, t)$ is parameterized as a U-Net because:

Noise prediction requires pixel-level precision (same input/output resolution)
Denoising tasks benefit from multi-scale feature hierarchies
Skip connections preserve fine details during denoising

9.3 U-Net → [[Neural ODE]]

While the [[ResNet]] discretely approximates an ODE, U-Net's encoder-decoder structure with skip connections can loosely be compared to a discretized two-point boundary value problem, with the two ends of the diffusion process, $t = 0$ (clean) and $t = T$ (pure noise), playing the role of boundary conditions. This is an analogy rather than a formal correspondence.

10. Core Formula Cards

#	Formula	Meaning
1	$h_{dec}^{(l)} = Conv ([h_{enc}^{(l)} \| Up (h_{dec}^{(l - 1)})])$	Skip connection via concatenation
2	$h_{l + 1} = ResBlock (h_{l}, t_{emb})$	Time-conditioned residual block
3	$Attention (Q, K, V) = softmax (\frac{Q K^{T}}{\sqrt{d_{k}}}) V$	Self/cross-attention at low-resolution levels
4	$h = h \cdot (1 + γ (t)) + β (t)$	Time embedding via adaptive scale-shift
5	$GN (x) = γ \cdot \frac{x - μ_{g}}{\sqrt{σ_{g}^{2} + ϵ}} + β$	GroupNorm (32 groups, independent of batch)
6	$X^{i, j} = H ([[X^{i, k}]_{k = 0}^{j - 1}, U (X^{i + 1, j - 1})])$	U-Net++ dense skip pathway ($j > 0$)

11. Summary

U-Net is the dual-purpose architecture that bridges two eras of deep learning:

Segmentation era (2015–2019): Revolutionized biomedical imaging with its encoder-decoder + skip connection design, winning the ISBI cell tracking challenge 2015 by a large margin.
Generative era (2020–present): Became the backbone of [[Diffusion Model|diffusion models]], powering DDPM, Stable Diffusion, Imagen, and ControlNet.

Its enduring design principle — multi-scale processing with information-preserving skip connections — makes it the natural choice whenever a model must produce high-resolution output with precise spatial structure, whether that output is a segmentation mask or a denoised image.

[[Diffusion Model]]
[[ResNet]]
[[Score Function]]
[[Flow Matching]]
[[Neural ODE]]
[[Convolutional Neural Network (CNN)]]
[[Stable Diffusion]]
[[ControlNet]]
[[DiT]]
[[Vision Transformer (ViT)]]