Welford’s Algorithm

Welford’s algorithm is a numerically stable online algorithm for computing the running mean, variance, and covariance of a data stream in a single pass. Unlike the naïve two-pass or textbook one-pass formulas, Welford’s method avoids catastrophic cancellation by maintaining a running sum of squared deviations rather than a sum of squares — making it the standard choice for variance computation in production machine learning systems, including [[ResNet|Batch Normalization]], reinforcement learning, and distributed training.


1. Core Concept

1.1 The Problem: Numerically Unstable Variance

The textbook variance formula is:

σ2=1ni=1nxi2(1ni=1nxi)2=x2x¯2

Problem: When data values are large but the variance is small (e.g., xi108 with σ21 ), the two terms x2 and x¯2 are nearly equal large numbers. Their subtraction causes catastrophic cancellation, destroying significant digits.

[!NOTE] Catastrophic Cancellation Example
Consider x=[108+1,108+2,108+3] . The true variance is 1.0 . Using the naïve formula in float32, x21016 and x¯21016 , and their difference can yield zero or even negative variance due to floating-point rounding.

1.2 Welford’s Solution

Instead of accumulating xi2 , Welford maintains a running sum of squared deviations from the current mean:

M2,n=i=1n(xix¯n)2

Since deviations (xix¯n) are small even when xi is large, the sum M2,n never suffers from cancellation.

1.3 Key Properties

  • Single-pass: Each data point is processed exactly once — no need to store the full dataset
  • Online: New data can be incorporated incrementally without recomputing from scratch
  • Numerically stable: Avoids catastrophic cancellation in all regimes
  • Memory efficient: Only O(1) state variables regardless of data size

2. Mathematical Derivation

2.1 Running Mean Update

Definition: The arithmetic mean of the first n observations is:

x¯n=1ni=1nxi

Derivation of the incremental form:

We want to express x¯n in terms of x¯n1 without re-accessing previous data points. Start by separating the last observation:

x¯n=1ni=1nxi=1n(i=1n1xi+xn)

Recognize that i=1n1xi=(n1)x¯n1 , so:

x¯n=1n((n1)x¯n1+xn)

Rewrite (n1)x¯n1 as nx¯n1x¯n1 :

x¯n=1n(nx¯n1x¯n1+xn)=x¯n1+xnx¯n1n x¯n=x¯n1+xnx¯n1n

Interpretation: The new mean equals the old mean plus a correction term proportional to the prediction error (xnx¯n1) , scaled by 1/n . As more data is observed, each new point has diminishing influence on the mean.

2.2 Running Variance Update (Core Result)

Definition: Define the sum of squared deviations from the current mean:

M2,n=i=1n(xix¯n)2

Theorem (Welford, 1962): The quantity M2,n satisfies the recurrence:

M2,n=M2,n1+(xnx¯n1)(xnx¯n)

Complete Proof:

Step 1: Split the sum into the first n1 terms and the n -th term:

M2,n=i=1n1(xix¯n)2+(xnx¯n)2

Step 2: Express (xix¯n) in terms of (xix¯n1) . Define the shift δ=x¯nx¯n1=xnx¯n1n , so:

xix¯n=(xix¯n1)δ

Step 3: Expand the sum of squares of the first n1 terms:

i=1n1(xix¯n)2=i=1n1((xix¯n1)δ)2 =i=1n1(xix¯n1)22δi=1n1(xix¯n1)+(n1)δ2

Step 4: The cross-term vanishes. By definition of x¯n1 :

i=1n1(xix¯n1)=i=1n1xi(n1)x¯n1=0

Therefore:

i=1n1(xix¯n)2=M2,n1+(n1)δ2

Step 5: Substitute δ=xnx¯n1n :

(n1)δ2=(n1)(xnx¯n1)2n2=n1n2(xnx¯n1)2

Step 6: Handle the n -th term. Since x¯n=x¯n1+δ :

xnx¯n=xnx¯n1δ=(xnx¯n1)xnx¯n1n=n1n(xnx¯n1)

So:

(xnx¯n)2=(n1)2n2(xnx¯n1)2

Step 7: Combine all terms:

M2,n=M2,n1+n1n2(xnx¯n1)2+(n1)2n2(xnx¯n1)2

Factor out (xnx¯n1)2 :

M2,n=M2,n1+(xnx¯n1)2(n1)+(n1)2n2 =M2,n1+(xnx¯n1)2(n1)(1+n1)n2=M2,n1+(xnx¯n1)2(n1)nn2 =M2,n1+n1n(xnx¯n1)2

Step 8: Rewrite into the symmetric Welford form. From Step 6, (xnx¯n)=n1n(xnx¯n1) , therefore:

(xnx¯n1)(xnx¯n)=(xnx¯n1)n1n(xnx¯n1)=n1n(xnx¯n1)2

This is exactly the correction term from Step 7. Hence:

M2,n=M2,n1+(xnx¯n1)(xnx¯n)

[!NOTE] Why This Is Stable
The product (xnx¯n1)(xnx¯n) involves deviations from the mean, not raw values. These deviations are small even when xi is large, so no cancellation occurs when accumulating into M2,n . In floating-point arithmetic, both factors are computed via subtraction of quantities of similar magnitude, preserving relative precision.

[!NOTE] Non-Negativity Guarantee
Since x¯n always lies between x¯n1 and xn , both factors (xnx¯n1) and (xnx¯n) have the same sign (or one is zero). Therefore the correction term (xnx¯n1)(xnx¯n)0 always, ensuring M2,n0 — the computed variance can never become negative, unlike the naïve formula.

2.3 Equivalence of Two Forms

The recurrence has two algebraically equivalent forms, both useful in practice:

Form A (product of deviations — numerically preferred):

M2,n=M2,n1+(xnx¯n1)(xnx¯n)

Form B (scaled squared deviation — analytically convenient):

M2,n=M2,n1+n1n(xnx¯n1)2

Form A requires two subtractions and one multiplication. Form B requires one subtraction, one squaring, and one division. In practice, Form A is preferred because it avoids the explicit division by n in the correction term, reducing floating-point rounding.

2.4 Final Variance and Standard Deviation

From M2,n , the variance estimates are obtained by a single division:

σpopulation2=M2,nn,ssample2=M2,nn1

where s2 uses Bessel’s correction ( n1 instead of n ) to yield an unbiased estimator of the population variance.

σ=M2,nn,s=M2,nn1

3. Algorithm

3.1 Basic Welford Algorithm

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
class WelfordOnline:
"""Welford's online algorithm for running mean and variance."""

def __init__(self):
self.n = 0
self.mean = 0.0
self.M2 = 0.0 # Sum of squared deviations

def update(self, x):
"""Process a new data point."""
self.n += 1
delta = x - self.mean
self.mean += delta / self.n
delta2 = x - self.mean
self.M2 += delta * delta2

@property
def variance(self):
"""Population variance."""
return self.M2 / self.n if self.n > 0 else 0.0

@property
def sample_variance(self):
"""Sample variance (Bessel's correction)."""
return self.M2 / (self.n - 1) if self.n > 1 else 0.0

@property
def std(self):
"""Population standard deviation."""
return math.sqrt(self.variance)

3.2 Batched Update (Vectorized)

For processing mini-batches in deep learning:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
class WelfordBatched:
"""Welford's algorithm with batched (vectorized) updates."""

def __init__(self, shape):
self.n = 0
self.mean = torch.zeros(shape)
self.M2 = torch.zeros(shape)

def update(self, x_batch):
"""
Update statistics with a batch of data.

Args:
x_batch: Tensor of shape (batch_size, *shape)
"""
batch_size = x_batch.shape[0]

# Batch mean and variance
batch_mean = x_batch.mean(dim=0)
batch_M2 = ((x_batch - batch_mean) ** 2).sum(dim=0)

# Parallel Welford merge
delta = batch_mean - self.mean
total_n = self.n + batch_size

self.mean = self.mean + delta * (batch_size / total_n)
self.M2 = (self.M2 + batch_M2
+ delta ** 2 * (self.n * batch_size / total_n))
self.n = total_n

4. Parallel and Distributed Welford

4.1 Chan’s Parallel Algorithm

When data is split across multiple workers (distributed training), Welford statistics can be merged using Chan et al.'s formula (1983).

Setup: Given two disjoint subsets A={a1,,anA} and B={b1,,bnB} with precomputed statistics (nA,x¯A,M2,A) and (nB,x¯B,M2,B) .

Goal: Compute (nAB,x¯AB,M2,AB) for AB without re-accessing the raw data.

Merged count and mean:

nAB=nA+nB x¯AB=nAx¯A+nBx¯BnA+nB

Derivation of merged M2 :

Start from the definition:

M2,AB=iA(aix¯AB)2+jB(bjx¯AB)2

For the A -part, write aix¯AB=(aix¯A)+(x¯Ax¯AB) and expand:

iA(aix¯AB)2=iA(aix¯A)2+2(x¯Ax¯AB)iA(aix¯A)=0+nA(x¯Ax¯AB)2 =M2,A+nA(x¯Ax¯AB)2

By the same argument for B :

jB(bjx¯AB)2=M2,B+nB(x¯Bx¯AB)2

Therefore:

M2,AB=M2,A+M2,B+nA(x¯Ax¯AB)2+nB(x¯Bx¯AB)2

Now simplify the correction terms. Define Δ=x¯Ax¯B . From the merged mean formula:

x¯Ax¯AB=x¯AnAx¯A+nBx¯BnA+nB=nB(x¯Ax¯B)nA+nB=nBnABΔ x¯Bx¯AB=nAnABΔ

Substitute:

nA(x¯Ax¯AB)2+nB(x¯Bx¯AB)2=nA(nBΔnAB)2+nB(nAΔnAB)2 =nAnB2+nBnA2nAB2Δ2=nAnB(nB+nA)nAB2Δ2=nAnBnABΔ2

Final result:

M2,AB=M2,A+M2,B+nAnBnA+nB(x¯Ax¯B)2

[!NOTE] Interpretation of the Correction Term
The extra term nAnBnA+nBΔ2 is the between-group sum of squares — it accounts for the variance introduced by the difference in group means. This is the same decomposition used in ANOVA: SStotal=SSwithin+SSbetween .

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
def merge_welford(stats_a, stats_b):
"""
Merge two Welford statistics (Chan's parallel algorithm).

Args:
stats_a, stats_b: Tuples of (n, mean, M2)

Returns:
Merged (n, mean, M2)
"""
n_a, mean_a, m2_a = stats_a
n_b, mean_b, m2_b = stats_b

n = n_a + n_b
delta = mean_a - mean_b
mean = (n_a * mean_a + n_b * mean_b) / n
m2 = m2_a + m2_b + delta ** 2 * (n_a * n_b / n)

return n, mean, m2

4.2 Distributed Training Application

In Distributed Data Parallel (DDP) training, each GPU computes local batch statistics, which are then merged via Chan’s algorithm:

1
2
3
4
GPU 0: (n₀, μ₀, M₂₀) ──┐
GPU 1: (n₁, μ₁, M₂₁) ──┤── Merge (Chan) ──→ Global (n, μ, M₂)
GPU 2: (n₂, μ₂, M₂₂) ──┤
GPU 3: (n₃, μ₃, M₂₃) ──┘

Advantage over all-reduce of sums:

  • All-reduce of x and x2 : suffers from catastrophic cancellation
  • Chan’s merge: numerically stable, identical results regardless of GPU count

5. Welford for Covariance and Correlation

5.1 Online Covariance — Derivation

Definition: The co-deviation sum for two streams {xi} and {yi} is:

Cn=i=1n(xix¯n)(yiy¯n)

Derivation of the recurrence: Split the sum as before:

Cn=i=1n1(xix¯n)(yiy¯n)+(xnx¯n)(yny¯n)

Using the identities from Section 2:

xix¯n=(xix¯n1)δx,δx=xnx¯n1n yiy¯n=(yiy¯n1)δy,δy=yny¯n1n

Expanding the product:

i=1n1(xix¯n)(yiy¯n)=Cn1δxi=1n1(yiy¯n1)=0δyi=1n1(xix¯n1)=0+(n1)δxδy =Cn1+(n1)δxδy

For the n -th term, from Section 2.2 Step 6:

xnx¯n=n1n(xnx¯n1),yny¯n=n1n(yny¯n1)

Combining:

Cn=Cn1+(n1)δxδy+(n1)2n2(xnx¯n1)(yny¯n1)

Substituting δxδy=(xnx¯n1)(yny¯n1)n2 :

Cn=Cn1+n1+(n1)2n2(xnx¯n1)(yny¯n1) =Cn1+n1n(xnx¯n1)(yny¯n1)

Rewriting in the symmetric Welford form (analogous to Section 2.3):

Cn=Cn1+(xnx¯n1)(yny¯n)

Note the asymmetry: the first factor uses the old mean x¯n1 while the second uses the new mean y¯n . This is not a typo — it is the algebraically correct form.

The covariance estimates are:

Cov(x,y)=Cnn(population),Cnn1(sample)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
class WelfordCovariance:
"""Online covariance via Welford's algorithm."""

def __init__(self):
self.n = 0
self.mean_x = 0.0
self.mean_y = 0.0
self.C = 0.0 # Co-deviation sum

def update(self, x, y):
self.n += 1
dx = x - self.mean_x
self.mean_x += dx / self.n
dy = y - self.mean_y # Note: uses OLD mean_y
self.mean_y += (y - self.mean_y) / self.n
self.C += dx * (y - self.mean_y) # NEW mean_y

@property
def covariance(self):
return self.C / self.n if self.n > 0 else 0.0

@property
def correlation(self):
# Requires also tracking M2_x and M2_y
return self.C / math.sqrt(self.M2_x * self.M2_y) if self.n > 0 else 0.0

5.2 Online Covariance Matrix

For d -dimensional data xiRd , the full covariance matrix generalizes the scalar case via an outer product:

Cn=Cn1+(xnx¯n1)(xnx¯n)

Structure: At each step, this is a rank-1 update — the outer product of two d -dimensional vectors. The computational cost is O(d2) per observation, and the storage is O(d2) for the full covariance matrix.

The correlation matrix is then:

Rn=diag(Cn)1/2Cndiag(Cn)1/2

6. Weighted and Exponentially-Weighted Variants

6.1 Weighted Welford

For weighted observations (xi,wi) with positive weights, define:

Wn=i=1nwi,x¯n=1Wni=1nwixi,M2,n=i=1nwi(xix¯n)2

Derivation of the weighted mean update:

x¯n=Wn1x¯n1+wnxnWn=Wn1x¯n1+wnxnWn1+wn

Rewrite as:

x¯n=x¯n1+wnWn(xnx¯n1)

Derivation of the weighted M2 update: Following the same algebraic strategy as Section 2.2, define δw=x¯nx¯n1=wnWn(xnx¯n1) :

M2,n=i=1n1wi(xix¯n)2+wn(xnx¯n)2

Expanding (xix¯n)=(xix¯n1)δw and using i=1n1wi(xix¯n1)=0 :

i=1n1wi(xix¯n)2=M2,n1+Wn1δw2

For the n -th term: xnx¯n=(xnx¯n1)δw=(1wnWn)(xnx¯n1)=Wn1Wn(xnx¯n1) , so:

wn(xnx¯n)2=wnWn12Wn2(xnx¯n1)2

Combining and simplifying:

Wn1δw2+wn(xnx¯n)2=Wn1wn2Wn2(xnx¯n1)2+wnWn12Wn2(xnx¯n1)2 =wnWn1(wn+Wn1)Wn2(xnx¯n1)2=wnWn1Wn(xnx¯n1)2

Rewriting in the symmetric form:

M2,n=M2,n1+wn(xnx¯n1)(xnx¯n)

This reduces to the standard (unweighted) Welford recurrence when wi=1 for all i .

6.2 Exponentially Weighted Moving Variance (EWMA)

For non-stationary data (e.g., loss tracking, reward monitoring), we use exponential forgetting with decay rate α(0,1) :

x¯t=(1α)x¯t1+αxt

The exponentially weighted variance update (West, 1979):

vt=(1α)(vt1+α(xtx¯t1)2)

Derivation: This follows from the weighted Welford formula with exponentially decaying weights wt=(1α)Tt , which yields effective Wt1α as T . Substituting into the weighted recurrence and normalizing gives the formula above.

[!NOTE] Connection to Adam Optimizer
The Adam optimizer maintains an exponentially weighted second moment vt=β2vt1+(1β2)gt2 . While structurally similar, Adam tracks E[g2] (raw second moment), not Var(g) (centered variance). The EWMA variance formula tracks the centered second moment, which is a strictly more informative quantity.


7. Applications in Deep Learning

7.1 [[ResNet|Batch Normalization]]

BatchNorm computes per-channel mean and variance during training:

x^=xμBσB2+ϵ

While standard BatchNorm uses the naïve formula (batch sizes are typically small enough), Welford’s algorithm is used in production implementations (PyTorch, TensorFlow) for numerical robustness, especially with:

  • Small batch sizes (e.g., batch size 1-4 in detection/segmentation)
  • Float16 / BFloat16 training (limited mantissa bits)
  • Running statistics accumulation across many batches
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
# PyTorch internally uses Welford for running stats
# Conceptual equivalent:
class RunningBatchNorm:
def __init__(self, num_features, momentum=0.1):
self.mean = torch.zeros(num_features)
self.var = torch.ones(num_features)
self.welford = WelfordBatched(shape=(num_features,))
self.momentum = momentum

def forward(self, x):
if self.training:
# Batch statistics (naïve is fine for single batch)
batch_mean = x.mean(dim=[0, 2, 3])
batch_var = x.var(dim=[0, 2, 3], unbiased=False)

# Update running statistics (Welford-style)
self.mean = (1 - self.momentum) * self.mean + self.momentum * batch_mean
self.var = (1 - self.momentum) * self.var + self.momentum * batch_var

return (x - batch_mean[None, :, None, None]) / \
torch.sqrt(batch_var[None, :, None, None] + 1e-5)
else:
return (x - self.mean[None, :, None, None]) / \
torch.sqrt(self.var[None, :, None, None] + 1e-5)

7.2 Layer Normalization & RMSNorm

In [[Vision Transformer (ViT)|Transformers]], LayerNorm requires computing variance over the feature dimension:

LayerNorm(x)=xμσ2+ϵγ+β

For large hidden dimensions ( d=4096+ ), Welford ensures stable normalization even in mixed precision.

7.3 Reinforcement Learning: Advantage Estimation

In PPO and other policy gradient methods, advantage normalization requires running statistics of returns:

A^t=AtμAσA+ϵ

Since RL agents interact with the environment over millions of timesteps, Welford’s online algorithm is the natural choice — storing all returns would be prohibitively expensive.

7.4 Gradient Monitoring and Clipping

Tracking gradient statistics for adaptive clipping:

grad_norm_clip=clip(gσg,max_val)

Welford tracks σg online without storing gradient history.

7.5 Loss Logging and Early Stopping

Monitoring training loss statistics:

  • Running mean of loss for smoothed curves
  • Running variance for detecting training instability (loss spikes)
  • Running covariance between loss and learning rate for adaptive scheduling

8. Comparison with Other Methods

8.1 Variance Computation Methods

Method Passes Online Numerically Stable Memory Parallel
Two-pass (subtract mean, then square) 2 ✅ Stable O(n)
Naïve one-pass ( x2x¯2 ) 1 ❌ Catastrophic cancellation O(1)
Welford 1 ✅ Stable O(1) ✅ (Chan)
Youngs & Cramer 1 ✅ Stable O(1)
Kahan summation + naïve 1 ⚠️ Better but not guaranteed O(1)

8.2 When to Use What

Scenario Recommended Method
Small dataset, fits in memory Two-pass (simplest, stable)
Streaming data, single pass Welford
Distributed / multi-GPU Chan’s parallel Welford
Mixed precision (float16) Welford (essential)
High-dimensional covariance Welford with rank-1 updates

9. Numerical Analysis

9.1 Error Bounds

For n observations with machine precision ϵmach :

Method Relative Error Bound
Naïve O(nκ2ϵmach) where κ=max|xi|σ
Welford O(nϵmach) — independent of data scale

Key insight: Welford’s error does not depend on the condition number κ , making it robust regardless of data magnitude.

9.2 Catastrophic Cancellation Demonstration

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
import numpy as np

# Data: large mean, small variance
data = np.array([1e8 + 1, 1e8 + 2, 1e8 + 3], dtype=np.float32)

# Naïve formula
mean_sq = np.mean(data ** 2)
sq_mean = np.mean(data) ** 2
naive_var = mean_sq - sq_mean
print(f"Naïve variance: {naive_var:.6f}") # Often 0.0 or negative!

# Welford
welford = WelfordOnline()
for x in data:
welford.update(float(x))
print(f"Welford variance: {welford.variance:.6f}") # Correct: 0.666667

9.3 Float16 Considerations

In mixed-precision training (AMP), variance computation is especially fragile:

Precision Mantissa Bits Cancellation Threshold
float32 23 bits κ>103 problematic
float16 10 bits κ>101 problematic
bfloat16 7 bits κ>100 problematic

Welford’s algorithm is essential for float16/bfloat16 training — without it, normalization layers produce NaN or Inf values.


10. Core Formula Cards

[!QUOTE] Welford Running Mean

x¯n=x¯n1+xnx¯n1n

[!QUOTE] Welford Running Variance (Core Recurrence)

M2,n=M2,n1+(xnx¯n1)(xnx¯n)

[!QUOTE] Variance from M2

σ2=M2,nn,s2=M2,nn1

[!QUOTE] Chan’s Parallel Merge

M2,AB=M2,A+M2,B+nAnBnA+nB(x¯Ax¯B)2

[!QUOTE] Online Covariance

Cn=Cn1+(xnx¯n1)(yny¯n)

[!QUOTE] Weighted Welford Update

M2,n=M2,n1+wn(xnx¯n1)(xnx¯n)

[!QUOTE] Exponentially Weighted Variance

vt=(1α)(vt1+α(xtx¯t1)2)

11. Summary

Aspect Description
Core idea Accumulate squared deviations from running mean, not raw squares
Key advantage Numerically stable — immune to catastrophic cancellation
Complexity O(1) time and space per update
Parallel extension Chan’s merge formula for distributed computation
Generalization Covariance, correlation, weighted, exponentially-weighted
Role in DL BatchNorm running stats, LayerNorm, RL advantage normalization, gradient monitoring
When essential Mixed precision (float16/bfloat16), large-scale data, distributed training

Welford’s algorithm exemplifies a fundamental principle in numerical computing: reformulate to avoid subtracting large, nearly-equal quantities. By tracking deviations rather than raw sums, it achieves stability that the naïve formula cannot — a lesson that recurs throughout machine learning, from [[ResNet|BatchNorm]] to the [[Score Function|score function]] estimation in diffusion models.


  • [[ResNet]]
  • [[U-Net]]
  • [[Vision Transformer (ViT)]]
  • [[Diffusion Model]]
  • [[Score Function]]

Dataview Query

1
2
3
LIST
FROM #welford OR #online_algorithm OR #numerical_stability
SORT file.ctime DESC

References

  • Paper: Note on a Method for Calculating Corrected Sums of Squares and Products (Welford, Technometrics 1962)
  • Paper: Updating Mean and Variance Estimates: An Improved Method (West, Communications of the ACM 1979)
  • Paper: Algorithms for Computing the Sample Variance: Analysis and Recommendations (Chan, Golub, LeVeque, The American Statistician 1983)
  • Paper: Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift (Ioffe & Szegedy, ICML 2015)
  • Blog: Accurately computing running variance — John D. Cook
  • Docs: PyTorch torch.nn.BatchNorm2d implementation notes
  • Code: cpython/Modules/_statisticsmodule.c — Python’s statistics.variance uses Welford internally