Welford’s Algorithm
Welford’s algorithm is a numerically stable online algorithm for computing the running mean, variance, and covariance of a data stream in a single pass. Unlike the naïve two-pass or textbook one-pass formulas, Welford’s method avoids catastrophic cancellation by maintaining a running sum of squared deviations rather than a sum of squares — making it the standard choice for variance computation in production machine learning systems, including [[ResNet|Batch Normalization]], reinforcement learning, and distributed training.
1. Core Concept
1.1 The Problem: Numerically Unstable Variance
The textbook variance formula is:
Problem: When data values are large but the variance is small (e.g.,
[!NOTE] Catastrophic Cancellation Example
Consider. The true variance is . Using the naïve formula in float32, and , and their difference can yield zero or even negative variance due to floating-point rounding.
1.2 Welford’s Solution
Instead of accumulating
Since deviations
1.3 Key Properties
- Single-pass: Each data point is processed exactly once — no need to store the full dataset
- Online: New data can be incorporated incrementally without recomputing from scratch
- Numerically stable: Avoids catastrophic cancellation in all regimes
- Memory efficient: Only
state variables regardless of data size
2. Mathematical Derivation
2.1 Running Mean Update
Definition: The arithmetic mean of the first
Derivation of the incremental form:
We want to express
Recognize that
Rewrite
Interpretation: The new mean equals the old mean plus a correction term proportional to the prediction error
2.2 Running Variance Update (Core Result)
Definition: Define the sum of squared deviations from the current mean:
Theorem (Welford, 1962): The quantity
Complete Proof:
Step 1: Split the sum into the first
Step 2: Express
Step 3: Expand the sum of squares of the first
Step 4: The cross-term vanishes. By definition of
Therefore:
Step 5: Substitute
Step 6: Handle the
So:
Step 7: Combine all terms:
Factor out
Step 8: Rewrite into the symmetric Welford form. From Step 6,
This is exactly the correction term from Step 7. Hence:
[!NOTE] Why This Is Stable
The productinvolves deviations from the mean, not raw values. These deviations are small even when is large, so no cancellation occurs when accumulating into . In floating-point arithmetic, both factors are computed via subtraction of quantities of similar magnitude, preserving relative precision.
[!NOTE] Non-Negativity Guarantee
Sincealways lies between and , both factors and have the same sign (or one is zero). Therefore the correction term always, ensuring — the computed variance can never become negative, unlike the naïve formula.
2.3 Equivalence of Two Forms
The recurrence has two algebraically equivalent forms, both useful in practice:
Form A (product of deviations — numerically preferred):
Form B (scaled squared deviation — analytically convenient):
Form A requires two subtractions and one multiplication. Form B requires one subtraction, one squaring, and one division. In practice, Form A is preferred because it avoids the explicit division by
2.4 Final Variance and Standard Deviation
From
where
3. Algorithm
3.1 Basic Welford Algorithm
1 | class WelfordOnline: |
3.2 Batched Update (Vectorized)
For processing mini-batches in deep learning:
1 | class WelfordBatched: |
4. Parallel and Distributed Welford
4.1 Chan’s Parallel Algorithm
When data is split across multiple workers (distributed training), Welford statistics can be merged using Chan et al.'s formula (1983).
Setup: Given two disjoint subsets
Goal: Compute
Merged count and mean:
Derivation of merged
Start from the definition:
For the
By the same argument for
Therefore:
Now simplify the correction terms. Define
Substitute:
Final result:
[!NOTE] Interpretation of the Correction Term
The extra termis the between-group sum of squares — it accounts for the variance introduced by the difference in group means. This is the same decomposition used in ANOVA: .
1 | def merge_welford(stats_a, stats_b): |
4.2 Distributed Training Application
In Distributed Data Parallel (DDP) training, each GPU computes local batch statistics, which are then merged via Chan’s algorithm:
1 | GPU 0: (n₀, μ₀, M₂₀) ──┐ |
Advantage over all-reduce of sums:
- All-reduce of
and : suffers from catastrophic cancellation - Chan’s merge: numerically stable, identical results regardless of GPU count
5. Welford for Covariance and Correlation
5.1 Online Covariance — Derivation
Definition: The co-deviation sum for two streams
Derivation of the recurrence: Split the sum as before:
Using the identities from Section 2:
Expanding the product:
For the
Combining:
Substituting
Rewriting in the symmetric Welford form (analogous to Section 2.3):
Note the asymmetry: the first factor uses the old mean
The covariance estimates are:
1 | class WelfordCovariance: |
5.2 Online Covariance Matrix
For
Structure: At each step, this is a rank-1 update — the outer product of two
The correlation matrix is then:
6. Weighted and Exponentially-Weighted Variants
6.1 Weighted Welford
For weighted observations
Derivation of the weighted mean update:
Rewrite as:
Derivation of the weighted
Expanding
For the
Combining and simplifying:
Rewriting in the symmetric form:
This reduces to the standard (unweighted) Welford recurrence when
6.2 Exponentially Weighted Moving Variance (EWMA)
For non-stationary data (e.g., loss tracking, reward monitoring), we use exponential forgetting with decay rate
The exponentially weighted variance update (West, 1979):
Derivation: This follows from the weighted Welford formula with exponentially decaying weights
[!NOTE] Connection to Adam Optimizer
The Adam optimizer maintains an exponentially weighted second moment. While structurally similar, Adam tracks (raw second moment), not (centered variance). The EWMA variance formula tracks the centered second moment, which is a strictly more informative quantity.
7. Applications in Deep Learning
7.1 [[ResNet|Batch Normalization]]
BatchNorm computes per-channel mean and variance during training:
While standard BatchNorm uses the naïve formula (batch sizes are typically small enough), Welford’s algorithm is used in production implementations (PyTorch, TensorFlow) for numerical robustness, especially with:
- Small batch sizes (e.g., batch size 1-4 in detection/segmentation)
- Float16 / BFloat16 training (limited mantissa bits)
- Running statistics accumulation across many batches
1 | # PyTorch internally uses Welford for running stats |
7.2 Layer Normalization & RMSNorm
In [[Vision Transformer (ViT)|Transformers]], LayerNorm requires computing variance over the feature dimension:
For large hidden dimensions (
7.3 Reinforcement Learning: Advantage Estimation
In PPO and other policy gradient methods, advantage normalization requires running statistics of returns:
Since RL agents interact with the environment over millions of timesteps, Welford’s online algorithm is the natural choice — storing all returns would be prohibitively expensive.
7.4 Gradient Monitoring and Clipping
Tracking gradient statistics for adaptive clipping:
Welford tracks
7.5 Loss Logging and Early Stopping
Monitoring training loss statistics:
- Running mean of loss for smoothed curves
- Running variance for detecting training instability (loss spikes)
- Running covariance between loss and learning rate for adaptive scheduling
8. Comparison with Other Methods
8.1 Variance Computation Methods
| Method | Passes | Online | Numerically Stable | Memory | Parallel |
|---|---|---|---|---|---|
| Two-pass (subtract mean, then square) | 2 | ❌ | ✅ Stable |
|
❌ |
| Naïve one-pass (
|
1 | ✅ | ❌ Catastrophic cancellation |
|
✅ |
| Welford | 1 | ✅ | ✅ Stable |
|
✅ (Chan) |
| Youngs & Cramer | 1 | ✅ | ✅ Stable |
|
✅ |
| Kahan summation + naïve | 1 | ✅ | ⚠️ Better but not guaranteed |
|
❌ |
8.2 When to Use What
| Scenario | Recommended Method |
|---|---|
| Small dataset, fits in memory | Two-pass (simplest, stable) |
| Streaming data, single pass | Welford |
| Distributed / multi-GPU | Chan’s parallel Welford |
| Mixed precision (float16) | Welford (essential) |
| High-dimensional covariance | Welford with rank-1 updates |
9. Numerical Analysis
9.1 Error Bounds
For
| Method | Relative Error Bound |
|---|---|
| Naïve |
|
| Welford |
|
Key insight: Welford’s error does not depend on the condition number
9.2 Catastrophic Cancellation Demonstration
1 | import numpy as np |
9.3 Float16 Considerations
In mixed-precision training (AMP), variance computation is especially fragile:
| Precision | Mantissa Bits | Cancellation Threshold |
|---|---|---|
| float32 | 23 bits |
|
| float16 | 10 bits |
|
| bfloat16 | 7 bits |
|
Welford’s algorithm is essential for float16/bfloat16 training — without it, normalization layers produce NaN or Inf values.
10. Core Formula Cards
[!QUOTE] Welford Running Mean
[!QUOTE] Welford Running Variance (Core Recurrence)
[!QUOTE] Variance from
[!QUOTE] Chan’s Parallel Merge
[!QUOTE] Online Covariance
[!QUOTE] Weighted Welford Update
[!QUOTE] Exponentially Weighted Variance
11. Summary
| Aspect | Description |
|---|---|
| Core idea | Accumulate squared deviations from running mean, not raw squares |
| Key advantage | Numerically stable — immune to catastrophic cancellation |
| Complexity |
|
| Parallel extension | Chan’s merge formula for distributed computation |
| Generalization | Covariance, correlation, weighted, exponentially-weighted |
| Role in DL | BatchNorm running stats, LayerNorm, RL advantage normalization, gradient monitoring |
| When essential | Mixed precision (float16/bfloat16), large-scale data, distributed training |
Welford’s algorithm exemplifies a fundamental principle in numerical computing: reformulate to avoid subtracting large, nearly-equal quantities. By tracking deviations rather than raw sums, it achieves stability that the naïve formula cannot — a lesson that recurs throughout machine learning, from [[ResNet|BatchNorm]] to the [[Score Function|score function]] estimation in diffusion models.
Related Concepts
- [[ResNet]]
- [[U-Net]]
- [[Vision Transformer (ViT)]]
- [[Diffusion Model]]
- [[Score Function]]
Dataview Query
1 | LIST |
References
- Paper: Note on a Method for Calculating Corrected Sums of Squares and Products (Welford, Technometrics 1962)
- Paper: Updating Mean and Variance Estimates: An Improved Method (West, Communications of the ACM 1979)
- Paper: Algorithms for Computing the Sample Variance: Analysis and Recommendations (Chan, Golub, LeVeque, The American Statistician 1983)
- Paper: Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift (Ioffe & Szegedy, ICML 2015)
- Blog: Accurately computing running variance — John D. Cook
- Docs: PyTorch
torch.nn.BatchNorm2dimplementation notes - Code:
cpython/Modules/_statisticsmodule.c— Python’sstatistics.varianceuses Welford internally