Vision Transformer (ViT)
Vision Transformer (ViT) is a pure Transformer-based architecture for computer vision that treats an image as a sequence of patches — analogous to tokens in NLP. By discarding convolution’s inductive biases entirely, ViT demonstrates that a standard Transformer encoder, when pre-trained on sufficient data, can match or surpass [[ResNet|CNNs]] on image classification while serving as the architectural foundation for modern generative models like [[DiT]].
1. Core Concept
1.1 “An Image is Worth 16×16 Words”
The key insight of Dosovitskiy et al. (2020) is radical in its simplicity:
Split an image into fixed-size patches, linearly embed each patch, add position embeddings, and feed the resulting sequence of vectors to a standard Transformer encoder.
1 | ViT Architecture Overview |
1.2 The Inductive Bias Trade-off
| Property | CNN ([[ResNet]]) | ViT |
|---|---|---|
| Locality | Hard-coded via small kernels | Learned from data |
| Translation equivariance | Built-in (weight sharing) | Learned via position embeddings |
| Hierarchy | Explicit (pooling/strides) | Uniform across all layers |
| Global context | Only in deepest layers | Every layer (self-attention) |
| Data efficiency | High (strong priors) | Low (needs massive data) |
| Scalability ceiling | Plateaus | Continues improving |
The trade-off: CNNs start better with limited data; ViT overtakes them when pre-trained on large-scale datasets (ImageNet-21k, JFT-300M).
1.3 Mathematical Formulation
Given an image
Step 1 — Patchify & Embed:
where
Step 2 — Transformer Encoder (for
Step 3 — Classification:
where
2. Architecture in Detail
2.1 Patch Embedding
1 | import torch |
Why Conv2d instead of manual unfolding? A single Conv2d(kernel_size=P, stride=P) simultaneously splits patches and projects them — mathematically equivalent but far more efficient on GPU.
2.2 Position Embedding
ViT uses learned 1D position embeddings (not 2D, not sinusoidal by default):
1 | # Standard ViT: learned position embedding |
| Position Embedding Type | Pros | Cons | Used In |
|---|---|---|---|
| Learned 1D | Simple, effective | Fixed resolution | Original ViT |
| Sinusoidal 1D | Extrapolates to longer sequences | No 2D structure | Vanilla Transformer style |
| Learned 2D | Encodes spatial structure | More parameters | Some ViT variants |
| Sinusoidal 2D | Resolution-agnostic, no params | Less expressive | MAE, DiT |
| Relative (RoPE) | Captures relative distances | Slightly more compute | Recent vision models |
2.3 Multi-Head Self-Attention (MSA)
1 | class MultiHeadAttention(nn.Module): |
Computational cost:
2.4 Complete ViT Model
1 | class VisionTransformer(nn.Module): |
2.5 ViT Model Variants
| Model | Layers | Hidden Dim | Heads | MLP Size | Params | Patch |
|---|---|---|---|---|---|---|
| ViT-Tiny | 12 | 192 | 3 | 768 | 5.7M | 16 |
| ViT-Small | 12 | 384 | 6 | 1536 | 22M | 16 |
| ViT-Base | 12 | 768 | 12 | 3072 | 86M | 16 |
| ViT-Large | 24 | 1024 | 16 | 4096 | 307M | 16 |
| ViT-Huge | 32 | 1280 | 16 | 5120 | 632M | 14 |
3. Key Design Principles
3.1 Why Self-Attention for Vision?
| Property | How ViT Achieves It |
|---|---|
| Global receptive field | Every patch attends to every other patch — even in layer 1 |
| Dynamic weighting | Attention weights are input-dependent (unlike fixed convolution kernels) |
| Content-based interaction | Similar patches attend to each other regardless of spatial distance |
| Scalable capacity | More data → more meaningful attention patterns learned |
| Multi-modal unification | Same architecture for images, text, audio (foundation models) |
3.2 The [CLS] Token Design
ViT borrows BERT’s [CLS] token convention:
- A learnable embedding prepended to the patch sequence
- After L Transformer layers, the [CLS] token’s output serves as the global image representation
- Alternative: Global Average Pooling (GAP) over all patch tokens — used in some variants
1 | [CLS] Design: |
GAP vs [CLS]: In practice, both work similarly. [CLS] is the original design choice; GAP is simpler and used in DeiT, CAE, and some modern variants.
4. ViT Variants and Evolution
4.1 Family Tree
1 | ViT (Dosovitskiy et al., 2020) |
4.2 DeiT: Data-Efficient ViT
Problem: Original ViT needs 300M+ images (JFT) to beat ResNet.
DeiT solution: Knowledge distillation from a CNN teacher:
- Trained on ImageNet-1K only (1.2M images)
- Distillation token alongside [CLS] token
- Achieves ViT-level performance without massive pre-training
4.3 Swin Transformer: Hierarchical ViT
Swin introduces shifted window attention to achieve linear complexity:
| Aspect | ViT | Swin Transformer |
|---|---|---|
| Token resolution | Constant (coarse) | Hierarchical (fine → coarse) |
| Attention scope | Global (all tokens) | Local windows (shifted) |
| Complexity | (\mathcal{O}(N^2)) | (\mathcal{O}(N)) |
| Dense prediction | Requires adaptation | Native (FPN-compatible) |
| Use case | Classification, generation | Detection, segmentation |
4.4 MAE: Masked Autoencoder
Inspired by BERT’s masked language modeling:
- Mask 75% of patches (not 15% like BERT — images are more redundant)
- Encoder processes only visible patches (efficient)
- Decoder reconstructs masked patches from encoded visible tokens + mask tokens
- Pre-training objective: MSE between predicted and original pixels
1 | # MAE masking (simplified) |
4.5 DINO: Self-Supervised ViT
DINO (self-DIstillation with NO labels) shows that ViT trained with self-supervision automatically learns semantic segmentation:
- Teacher-student framework with momentum encoder
- ViT’s attention maps naturally highlight objects
- DINOv2 extends this to billion-scale data → universal visual features
5. ViT as Foundation for Generative Models
5.1 ViT → [[DiT]]
[[DiT]] directly inherits ViT’s design for diffusion-based generation:
| Component | ViT | [[DiT]] |
|---|---|---|
| Input | Image | Noisy latent (VAE-compressed) |
| Patch embed | Conv2d projection | Same |
| Position embed | Learned 1D | Sinusoidal 2D (resolution-agnostic) |
| Transformer block | Standard (LN → MSA → LN → MLP) | adaLN-conditioned variant |
| Output | Class logits | Predicted noise ε |
| Conditioning | None (class token only) | Time + text via adaLN |
Why ViT works for diffusion:
- Global attention captures long-range structure crucial for image generation
- No architectural bottleneck — full-resolution processing throughout
- Scalability — [[DiT]] shows power-law improvement with model size, which U-Net doesn’t match
- Unified ecosystem — same infrastructure as LLMs (FlashAttention, FSDP, etc.)
5.2 Patch Size in Generative Context
In DiT, the patch size
Latent of
-
: 1024 tokens → heavy but maximum detail -
: 256 tokens → default (best trade-off, DiT-XL/2) -
: 64 tokens → fast, some quality loss -
: 16 tokens → very fast, coarse results
5.3 Beyond DiT: SORA and Video Generation
SORA extends the ViT→DiT lineage to video by using 3D spacetime patches:
This is a direct generalization: ViT treats an image as 2D patches; SORA treats a video as 3D spacetime patches.
6. Training and Transfer Learning
6.1 Pre-training Strategies
| Strategy | Data Required | Performance | Example |
|---|---|---|---|
| Supervised (JFT-300M) | 300M labeled images | Best | Original ViT |
| Supervised (ImageNet-21K) | 14M labeled images | Strong | ViT-B/16 @ 384 |
| Supervised + Distillation (ImageNet-1K) | 1.2M labeled images | Good | DeiT |
| Self-supervised (MAE) | Unlabeled images | Strong | MAE ViT-L |
| Self-supervised (DINOv2) | 142M curated images | SOTA features | DINOv2 ViT-g |
6.2 Fine-tuning at Higher Resolution
ViT can be fine-tuned on resolutions different from pre-training:
1 | def resize_pos_embed(pos_embed, new_num_patches): |
6.3 Key Hyperparameters
| Parameter | ViT-B/16 | ViT-L/16 | Notes |
|---|---|---|---|
| Optimizer | AdamW | AdamW | β₁=0.9, β₂=0.999 |
| Learning rate | 3e-3 | 3e-3 | Cosine schedule |
| Weight decay | 0.3 | 0.3 | Excludes bias & LN |
| Batch size | 4096 | 4096 | Large batches crucial |
| Epochs | 300 | 300 | + warmup (10K steps) |
| Dropout | 0.1 | 0.1 | Reduced for larger data |
| Gradient clipping | 1.0 | 1.0 | Global norm |
7. Comparison: ViT vs CNN vs Hybrid
7.1 Performance Landscape
| Architecture | ImageNet Top-1 | Params | FLOPs | Pre-train Data |
|---|---|---|---|---|
| ResNet-50 | 76.1% | 25M | 4.1G | ImageNet-1K |
| ResNet-152 | 78.3% | 60M | 11.6G | ImageNet-1K |
| EfficientNet-B7 | 84.3% | 66M | 38G | ImageNet-1K |
| ViT-B/16 | 77.9% / 84.0% | 86M | 17.6G | IN-1K / IN-21K |
| ViT-L/16 | 76.5% / 85.2% | 307M | 61.6G | IN-1K / JFT-300M |
| DeiT-B | 81.8% | 86M | 17.6G | ImageNet-1K |
ViT underperforms on ImageNet-1K alone but dominates when pre-trained on larger datasets — reflecting the inductive bias trade-off.
7.2 When to Use What
| Scenario | Recommendation | Reason |
|---|---|---|
| Small dataset (<100K) | CNN or DeiT | ViT overfits without massive data |
| Medium dataset (100K–1M) | ViT + strong augmentation | Or pre-trained ViT + fine-tuning |
| Large dataset (>1M) | ViT | Full potential unlocked |
| Dense prediction | Swin / ConvNeXt | Hierarchical features needed |
| Generative modeling | DiT (ViT-based) | Scalability + global attention |
| Multi-modal | ViT | Unified Transformer ecosystem |
| Edge deployment | MobileNet / EfficientNet | ViT too heavy without optimization |
8. Scaling Properties
8.1 ViT Scaling Behavior
ViT exhibits three-phase scaling:
1 | Performance vs Data Size (ViT): |
- Phase 1 (small data): [[ResNet]] wins — strong inductive biases compensate for data scarcity
- Phase 2 (medium data): ViT catches up — self-attention starts learning useful patterns
- Phase 3 (large data): ViT dominates — global attention unlocks superior representations
8.2 Compute-Optimal ViT
Following the Chinchilla scaling principles, for a given compute budget
where
9. Theoretical Understanding
9.1 Attention as a Learnable Filter
While CNN uses fixed local filters, ViT’s self-attention computes:
The attention matrix
9.2 Why ViT Needs More Data
Theoretical explanation: ViT has weaker inductive biases than CNN.
- CNN’s convolution is equivariant to translation by design:
- ViT must learn translation-equivariant behavior from data alone
- This requires seeing objects at diverse positions — hence more data
The sample complexity of learning translation invariance from scratch is significantly higher than having it built into the architecture.
9.3 Fourier Analysis Perspective
Recent work (Park & Kim, 2022) shows that ViT’s self-attention acts as a low-pass filter in early layers and progressively allows higher frequencies in deeper layers — analogous to CNN’s hierarchical feature learning, but achieved through learned attention patterns rather than architectural constraints.
10. Core Formula Cards
| # | Formula | Meaning |
|---|---|---|
| ① |
|
Patch embedding + position encoding |
| ② |
|
Residual self-attention |
| ③ |
|
Residual MLP |
| ④ |
|
Scaled dot-product attention |
| ⑤ |
|
Multi-head aggregation |
| ⑥ |
|
Classification from [CLS] token |
11. Summary
ViT’s core contribution is a philosophical shift in computer vision:
| Era | Paradigm | Philosophy |
|---|---|---|
| Pre-2020 | CNNs | “Hard-code what we know about vision (locality, translation invariance)” |
| Post-2020 | ViT/Transformers | “Provide capacity + data; let the model learn visual structure” |
This shift enabled:
- Unified architectures across vision, language, and audio
- Scalable pre-training (MAE, DINO, CLIP) rivaling NLP’s BERT/GPT moment
- Generative models at scale — [[DiT]], SORA, Stable Diffusion 3 all inherit ViT’s design
ViT is not just an architecture — it’s the bridge that brought vision into the Transformer era, enabling the same scaling laws and infrastructure that revolutionized NLP to transform computer vision and generative modeling.
Dataview Query
1 | LIST |
Related Concepts
- [[ResNet]]
- [[DiT]]
- [[Diffusion Model]]
- [[U-Net]]
- [[Transformer]]
- [[RoPE]]
- [[Neural ODE]]
- [[Convolutional Neural Network (CNN)]]
References
- Paper: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (Dosovitskiy et al., ICLR 2021 — Oral)
- Paper: Training data-efficient image transformers & distillation through attention (Touvron et al., ICML 2021 — DeiT)
- Paper: Swin Transformer: Hierarchical Vision Transformer using Shifted Windows (Liu et al., ICCV 2021 — Best Paper)
- Paper: Masked Autoencoders Are Scalable Vision Learners (He et al., CVPR 2022)
- Paper: Emerging Properties in Self-Supervised Vision Transformers (Caron et al., ICCV 2021 — DINO)
- Paper: Scalable Diffusion Models with Transformers (Peebles & Xie, ICCV 2023 — [[DiT]])
- Paper: Attention Is All You Need (Vaswani et al., NeurIPS 2017 — Original Transformer)
- Blog: Vision Transformer (ViT) — A Flutter on ImageNet — lucidrains
- Blog: The Illustrated Vision Transformer — Jay Alammar
- Code: https://github.com/huggingface/pytorch-image-models (timm)
- Code: https://github.com/facebookresearch/dino (DINO)
- Code: https://github.com/facebookresearch/mae (MAE)
- Course: CS231n Convolutional Neural Networks for Visual Recognition (Stanford)
- Course: CS25 Transformers United (Stanford)
"