2026-06-30

Welford’s Algorithm

Welford’s algorithm is a numerically stable online algorithm for computing the running mean, variance, and covariance of a data stream in a single pass. The two-pass formula is stable but needs to read the data twice, and the textbook one-pass formula suffers from catastrophic cancellation. Welford’s method avoids both problems by maintaining a running sum of squared deviations rather than a sum of squares, which makes it a common choice for variance computation in production machine learning systems, including [[ResNet|Batch Normalization]], reinforcement learning, and distributed training.

1. Core Concept

1.1 The Problem: Numerically Unstable Variance

The textbook variance formula is:

σ^{2} = \frac{1}{n} \sum_{i = 1}^{n} x_{i}^{2} - {(\frac{1}{n} \sum_{i = 1}^{n} x_{i})}^{2} = \overline{x^{2}} - {\bar{x}}^{2}

Problem: When data values are large but the variance is small (e.g., $x_{i} \approx 10^{8}$ with $σ^{2} \approx 1$ ), the two terms $\overline{x^{2}}$ and ${\bar{x}}^{2}$ are nearly equal large numbers. Their subtraction causes catastrophic cancellation, destroying significant digits.

[!NOTE] Catastrophic Cancellation Example
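Consider $x = [10^{8} + 1, 10^{8} + 2, 10^{8} + 3]$ . The true population variance is $2 / 3$ (the sample variance is $1.0$ ). Even in float64, $\overline{x^{2}} \approx 10^{16}$ and ${\bar{x}}^{2} \approx 10^{16}$ exceed $2^{53} \approx 9 \times 10^{15}$ , so the squares are rounded to even integers and the difference of the two terms can come out as zero or even negative. In float32 the situation is worse: the spacing between representable values near $10^{8}$ is 8, so all three inputs round to $10^{8}$ before any arithmetic happens.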

1.2 Welford’s Solution

Instead of accumulating $\sum x_{i}^{2}$ , Welford maintains a running sum of squared deviations from the current mean:

M_{2, n} = \sum_{i = 1}^{n} (x_{i} - {\bar{x}}_{n})^{2}

Since deviations $(x_{i} - {\bar{x}}_{n})$ are small even when $x_{i}$ is large, accumulating $M_{2, n}$ never subtracts two large, nearly equal quantities, which is the step that ruins the naïve formula.

1.3 Key Properties

Single-pass: Each data point is processed exactly once — no need to store the full dataset
Online: New data can be incorporated incrementally without recomputing from scratch
Numerically stable: Avoids catastrophic cancellation in all regimes
Memory efficient: Only $O (1)$ state variables regardless of data size

2. Mathematical Derivation

2.1 Running Mean Update

Definition: The arithmetic mean of the first $n$ observations is:

{\bar{x}}_{n} = \frac{1}{n} \sum_{i = 1}^{n} x_{i}

Derivation of the incremental form:

We want to express ${\bar{x}}_{n}$ in terms of ${\bar{x}}_{n - 1}$ without re-accessing previous data points. Start by separating the last observation:

{\bar{x}}_{n} = \frac{1}{n} \sum_{i = 1}^{n} x_{i} = \frac{1}{n} (\sum_{i = 1}^{n - 1} x_{i} + x_{n})

Recognize that $\sum_{i = 1}^{n - 1} x_{i} = (n - 1) {\bar{x}}_{n - 1}$ , so:

{\bar{x}}_{n} = \frac{1}{n} ((n - 1) {\bar{x}}_{n - 1} + x_{n})

Rewrite $(n - 1) {\bar{x}}_{n - 1}$ as $n {\bar{x}}_{n - 1} - {\bar{x}}_{n - 1}$ :

{\bar{x}}_{n} = \frac{1}{n} (n {\bar{x}}_{n - 1} - {\bar{x}}_{n - 1} + x_{n}) = {\bar{x}}_{n - 1} + \frac{x_{n} - {\bar{x}}_{n - 1}}{n}

{\bar{x}}_{n} = {\bar{x}}_{n - 1} + \frac{x_{n} - {\bar{x}}_{n - 1}}{n}

Interpretation: The new mean equals the old mean plus a correction term proportional to the prediction error $(x_{n} - {\bar{x}}_{n - 1})$ , scaled by $1 / n$ . As more data is observed, each new point has diminishing influence on the mean.
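A minimal Python sketch of this update can serve as a sanity check on the derivation. The function name `running_means` is illustrative and not taken from any library.

```python
def running_means(xs):
    """Return the means of the first 1, 2, ..., n values using the incremental update."""
    mean = 0.0
    out = []
    for n, x in enumerate(xs, start=1):
        # new mean = old mean + (x_n - old mean) / n
        mean += (x - mean) / n
        out.append(mean)
    return out

means = running_means([2.0, 4.0, 6.0, 8.0])
print(means)  # [2.0, 3.0, 4.0, 5.0]
```

Each intermediate value matches the ordinary average of the prefix seen so far, and no earlier data point is revisited.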

2.2 Running Variance Update (Core Result)

Definition: Define the sum of squared deviations from the current mean:

M_{2, n} = \sum_{i = 1}^{n} (x_{i} - {\bar{x}}_{n})^{2}

Theorem (Welford, 1962): The quantity $M_{2, n}$ satisfies the recurrence:

M_{2, n} = M_{2, n - 1} + (x_{n} - {\bar{x}}_{n - 1}) (x_{n} - {\bar{x}}_{n})

Complete Proof:

Step 1: Split the sum into the first $n - 1$ terms and the $n$ -th term:

M_{2, n} = \sum_{i = 1}^{n - 1} (x_{i} - {\bar{x}}_{n})^{2} + (x_{n} - {\bar{x}}_{n})^{2}

Step 2: Express $(x_{i} - {\bar{x}}_{n})$ in terms of $(x_{i} - {\bar{x}}_{n - 1})$ . Define the shift $δ = {\bar{x}}_{n} - {\bar{x}}_{n - 1} = \frac{x_{n} - {\bar{x}}_{n - 1}}{n}$ , so:

x_{i} - {\bar{x}}_{n} = (x_{i} - {\bar{x}}_{n - 1}) - δ

Step 3: Expand the sum of squares of the first $n - 1$ terms:

\sum_{i = 1}^{n - 1} (x_{i} - {\bar{x}}_{n})^{2} = \sum_{i = 1}^{n - 1} ((x_{i} - {\bar{x}}_{n - 1}) - δ)^{2}

= \sum_{i = 1}^{n - 1} (x_{i} - {\bar{x}}_{n - 1})^{2} - 2 δ \sum_{i = 1}^{n - 1} (x_{i} - {\bar{x}}_{n - 1}) + (n - 1) δ^{2}

Step 4: The cross-term vanishes. By definition of ${\bar{x}}_{n - 1}$ :

\sum_{i = 1}^{n - 1} (x_{i} - {\bar{x}}_{n - 1}) = \sum_{i = 1}^{n - 1} x_{i} - (n - 1) {\bar{x}}_{n - 1} = 0

Therefore:

\sum_{i = 1}^{n - 1} (x_{i} - {\bar{x}}_{n})^{2} = M_{2, n - 1} + (n - 1) δ^{2}

Step 5: Substitute $δ = \frac{x_{n} - {\bar{x}}_{n - 1}}{n}$ :

(n - 1) δ^{2} = (n - 1) \cdot \frac{(x_{n} - {\bar{x}}_{n - 1})^{2}}{n^{2}} = \frac{n - 1}{n^{2}} (x_{n} - {\bar{x}}_{n - 1})^{2}

Step 6: Handle the $n$ -th term. Since ${\bar{x}}_{n} = {\bar{x}}_{n - 1} + δ$ :

x_{n} - {\bar{x}}_{n} = x_{n} - {\bar{x}}_{n - 1} - δ = (x_{n} - {\bar{x}}_{n - 1}) - \frac{x_{n} - {\bar{x}}_{n - 1}}{n} = \frac{n - 1}{n} (x_{n} - {\bar{x}}_{n - 1})

So:

(x_{n} - {\bar{x}}_{n})^{2} = \frac{(n - 1)^{2}}{n^{2}} (x_{n} - {\bar{x}}_{n - 1})^{2}

Step 7: Combine all terms:

M_{2, n} = M_{2, n - 1} + \frac{n - 1}{n^{2}} (x_{n} - {\bar{x}}_{n - 1})^{2} + \frac{(n - 1)^{2}}{n^{2}} (x_{n} - {\bar{x}}_{n - 1})^{2}

Factor out $(x_{n} - {\bar{x}}_{n - 1})^{2}$ :

M_{2, n} = M_{2, n - 1} + (x_{n} - {\bar{x}}_{n - 1})^{2} \cdot \frac{(n - 1) + (n - 1)^{2}}{n^{2}}

= M_{2, n - 1} + (x_{n} - {\bar{x}}_{n - 1})^{2} \cdot \frac{(n - 1) (1 + n - 1)}{n^{2}} = M_{2, n - 1} + (x_{n} - {\bar{x}}_{n - 1})^{2} \cdot \frac{(n - 1) n}{n^{2}}

= M_{2, n - 1} + \frac{n - 1}{n} (x_{n} - {\bar{x}}_{n - 1})^{2}

Step 8: Rewrite into the symmetric Welford form. From Step 6, $(x_{n} - {\bar{x}}_{n}) = \frac{n - 1}{n} (x_{n} - {\bar{x}}_{n - 1})$ , therefore:

(x_{n} - {\bar{x}}_{n - 1}) (x_{n} - {\bar{x}}_{n}) = (x_{n} - {\bar{x}}_{n - 1}) \cdot \frac{n - 1}{n} (x_{n} - {\bar{x}}_{n - 1}) = \frac{n - 1}{n} (x_{n} - {\bar{x}}_{n - 1})^{2}

This is exactly the correction term from Step 7. Hence:

M_{2, n} = M_{2, n - 1} + (x_{n} - {\bar{x}}_{n - 1}) (x_{n} - {\bar{x}}_{n}) ◼

[!NOTE] Why This Is Stable
The product $(x_{n} - {\bar{x}}_{n - 1}) (x_{n} - {\bar{x}}_{n})$ involves deviations from the mean, not raw values. These deviations are small even when $x_{i}$ is large, so no cancellation occurs when accumulating into $M_{2, n}$ . In floating-point arithmetic, both factors are computed via subtraction of quantities of similar magnitude, preserving relative precision.

[!NOTE] Non-Negativity Guarantee
Since ${\bar{x}}_{n}$ always lies between ${\bar{x}}_{n - 1}$ and $x_{n}$ , both factors $(x_{n} - {\bar{x}}_{n - 1})$ and $(x_{n} - {\bar{x}}_{n})$ have the same sign (or one is zero). Therefore the correction term $(x_{n} - {\bar{x}}_{n - 1}) (x_{n} - {\bar{x}}_{n}) \geq 0$ always, ensuring $M_{2, n} \geq 0$ — the computed variance can never become negative, unlike the naïve formula.

2.3 Equivalence of Two Forms

The recurrence has two algebraically equivalent forms, both useful in practice:

Form A (product of deviations — numerically preferred):

M_{2, n} = M_{2, n - 1} + (x_{n} - {\bar{x}}_{n - 1}) (x_{n} - {\bar{x}}_{n})

Form B (scaled squared deviation — analytically convenient):

M_{2, n} = M_{2, n - 1} + \frac{n - 1}{n} (x_{n} - {\bar{x}}_{n - 1})^{2}

Form A requires two subtractions and one multiplication. Form B requires one subtraction, one squaring, and one division. In practice, Form A is preferred because it avoids the explicit division by $n$ in the correction term, reducing floating-point rounding.
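The equivalence of the two forms can be checked numerically. The sketch below accumulates $M_{2}$ with Form A and with Form B and compares both against the direct definition. The helper names and the sample data are assumptions made for this illustration.

```python
def m2_direct(xs):
    """Sum of squared deviations from the final mean (two passes)."""
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs)

def m2_form_a(xs):
    """Form A: product of deviations from the old and the new mean."""
    n, mean, m2 = 0, 0.0, 0.0
    for x in xs:
        n += 1
        d_old = x - mean
        mean += d_old / n
        m2 += d_old * (x - mean)
    return m2

def m2_form_b(xs):
    """Form B: squared deviation from the old mean, scaled by (n - 1) / n."""
    n, mean, m2 = 0, 0.0, 0.0
    for x in xs:
        n += 1
        d_old = x - mean
        m2 += (n - 1) / n * d_old ** 2
        mean += d_old / n
    return m2

data = [3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0]
print(m2_direct(data), m2_form_a(data), m2_form_b(data))
```

On well-scaled data like this, all three values agree to within rounding error; differences between the forms only become visible on ill-conditioned inputs.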

2.4 Final Variance and Standard Deviation

From $M_{2, n}$ , the variance estimates are obtained by a single division:

σ_{population}^{2} = \frac{M_{2, n}}{n}, s_{sample}^{2} = \frac{M_{2, n}}{n - 1}

where $s^{2}$ uses Bessel’s correction ( $n - 1$ instead of $n$ ) to yield an unbiased estimator of the population variance.

σ = \sqrt{\frac{M_{2, n}}{n}}, s = \sqrt{\frac{M_{2, n}}{n - 1}}
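As an illustration, the sketch below derives the population and sample estimates from a single Welford pass and compares them against Python's standard `statistics` module. The helper name `welford_stats` and the sample data are assumptions made for this example.

```python
import math
import statistics

def welford_stats(xs):
    """Single Welford pass; returns (n, mean, M2)."""
    n, mean, m2 = 0, 0.0, 0.0
    for x in xs:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
    return n, mean, m2

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
n, mean, m2 = welford_stats(data)

pop_var = m2 / n            # divide by n
sample_var = m2 / (n - 1)   # Bessel's correction: divide by n - 1

print(pop_var, statistics.pvariance(data))
print(sample_var, statistics.variance(data))
print(math.sqrt(pop_var), statistics.pstdev(data))
```

The only difference between the two estimators is the final divisor, so a single accumulator serves both.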

3. Algorithm

3.1 Basic Welford Algorithm

import math

class WelfordOnline:
    """Welford's online algorithm for running mean and variance."""
    
    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.M2 = 0.0  # Sum of squared deviations
    
    def update(self, x):
        """Process a new data point."""
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        delta2 = x - self.mean
        self.M2 += delta * delta2
    
    @property
    def variance(self):
        """Population variance."""
        return self.M2 / self.n if self.n > 0 else 0.0
    
    @property
    def sample_variance(self):
        """Sample variance (Bessel's correction)."""
        return self.M2 / (self.n - 1) if self.n > 1 else 0.0
    
    @property
    def std(self):
        """Population standard deviation."""
        return math.sqrt(self.variance)

3.2 Batched Update (Vectorized)

For processing mini-batches in deep learning:

import torch

class WelfordBatched:
    """Welford's algorithm with batched (vectorized) updates."""
    
    def __init__(self, shape):
        self.n = 0
        self.mean = torch.zeros(shape)
        self.M2 = torch.zeros(shape)
    
    def update(self, x_batch):
        """
        Update statistics with a batch of data.
        
        Args:
            x_batch: Tensor of shape (batch_size, *shape)
        """
        batch_size = x_batch.shape[0]
        
        # Batch mean and variance
        batch_mean = x_batch.mean(dim=0)
        batch_M2 = ((x_batch - batch_mean) ** 2).sum(dim=0)
        
        # Parallel Welford merge
        delta = batch_mean - self.mean
        total_n = self.n + batch_size
        
        self.mean = self.mean + delta * (batch_size / total_n)
        self.M2 = (self.M2 + batch_M2 
                   + delta ** 2 * (self.n * batch_size / total_n))
        self.n = total_n

4. Parallel and Distributed Welford

4.1 Chan’s Parallel Algorithm

When data is split across multiple workers (distributed training), Welford statistics can be merged using Chan et al.'s formula (1983).

Setup: Given two disjoint subsets $A = \{ a_{1}, \dots, a_{n_{A}} \}$ and $B = \{ b_{1}, \dots, b_{n_{B}} \}$ with precomputed statistics $(n_{A}, {\bar{x}}_{A}, M_{2, A})$ and $(n_{B}, {\bar{x}}_{B}, M_{2, B})$ .

Goal: Compute $(n_{A B}, {\bar{x}}_{A B}, M_{2, A B})$ for $A \cup B$ without re-accessing the raw data.

Merged count and mean:

n_{A B} = n_{A} + n_{B}

{\bar{x}}_{A B} = \frac{n_{A} {\bar{x}}_{A} + n_{B} {\bar{x}}_{B}}{n_{A} + n_{B}}

Derivation of merged $M_{2}$ :

Start from the definition:

M_{2, A B} = \sum_{i \in A} (a_{i} - {\bar{x}}_{A B})^{2} + \sum_{j \in B} (b_{j} - {\bar{x}}_{A B})^{2}

For the $A$ -part, write $a_{i} - {\bar{x}}_{A B} = (a_{i} - {\bar{x}}_{A}) + ({\bar{x}}_{A} - {\bar{x}}_{A B})$ and expand:

\sum_{i \in A} (a_{i} - {\bar{x}}_{A B})^{2} = \sum_{i \in A} (a_{i} - {\bar{x}}_{A})^{2} + 2 ({\bar{x}}_{A} - {\bar{x}}_{A B}) \underbrace{\sum_{i \in A} (a_{i} - {\bar{x}}_{A})}_{= 0} + n_{A} ({\bar{x}}_{A} - {\bar{x}}_{A B})^{2}

= M_{2, A} + n_{A} ({\bar{x}}_{A} - {\bar{x}}_{A B})^{2}

By the same argument for $B$ :

\sum_{j \in B} (b_{j} - {\bar{x}}_{A B})^{2} = M_{2, B} + n_{B} ({\bar{x}}_{B} - {\bar{x}}_{A B})^{2}

Therefore:

M_{2, A B} = M_{2, A} + M_{2, B} + n_{A} ({\bar{x}}_{A} - {\bar{x}}_{A B})^{2} + n_{B} ({\bar{x}}_{B} - {\bar{x}}_{A B})^{2}

Now simplify the correction terms. Define $Δ = {\bar{x}}_{A} - {\bar{x}}_{B}$ . From the merged mean formula:

{\bar{x}}_{A} - {\bar{x}}_{A B} = {\bar{x}}_{A} - \frac{n_{A} {\bar{x}}_{A} + n_{B} {\bar{x}}_{B}}{n_{A} + n_{B}} = \frac{n_{B} ({\bar{x}}_{A} - {\bar{x}}_{B})}{n_{A} + n_{B}} = \frac{n_{B}}{n_{A B}} Δ

{\bar{x}}_{B} - {\bar{x}}_{A B} = - \frac{n_{A}}{n_{A B}} Δ

Substitute:

n_{A} ({\bar{x}}_{A} - {\bar{x}}_{A B})^{2} + n_{B} ({\bar{x}}_{B} - {\bar{x}}_{A B})^{2} = n_{A} {(\frac{n_{B} Δ}{n_{A B}})}^{2} + n_{B} {(\frac{n_{A} Δ}{n_{A B}})}^{2}

= \frac{n_{A} n_{B}^{2} + n_{B} n_{A}^{2}}{n_{A B}^{2}} Δ^{2} = \frac{n_{A} n_{B} (n_{B} + n_{A})}{n_{A B}^{2}} Δ^{2} = \frac{n_{A} n_{B}}{n_{A B}} Δ^{2}

Final result:

M_{2, A B} = M_{2, A} + M_{2, B} + \frac{n_{A} \cdot n_{B}}{n_{A} + n_{B}} ({\bar{x}}_{A} - {\bar{x}}_{B})^{2}

[!NOTE] Interpretation of the Correction Term
The extra term $\frac{n_{A} n_{B}}{n_{A} + n_{B}} Δ^{2}$ is the between-group sum of squares — it accounts for the variance introduced by the difference in group means. This is the same decomposition used in ANOVA: ${SS}_{total} = {SS}_{within} + {SS}_{between}$ .

def merge_welford(stats_a, stats_b):
    """
    Merge two Welford statistics (Chan's parallel algorithm).
    
    Args:
        stats_a, stats_b: Tuples of (n, mean, M2)
    
    Returns:
        Merged (n, mean, M2)
    """
    n_a, mean_a, m2_a = stats_a
    n_b, mean_b, m2_b = stats_b
    
    n = n_a + n_b
    delta = mean_a - mean_b
    mean = (n_a * mean_a + n_b * mean_b) / n
    m2 = m2_a + m2_b + delta ** 2 * (n_a * n_b / n)
    
    return n, mean, m2

4.2 Distributed Training Application

In Distributed Data Parallel (DDP) training with synchronized normalization statistics, each GPU computes local batch statistics, which can then be merged via Chan’s algorithm:

GPU 0: (n₀, μ₀, M₂₀) ──┐
GPU 1: (n₁, μ₁, M₂₁) ──┤── Merge (Chan) ──→ Global (n, μ, M₂)
GPU 2: (n₂, μ₂, M₂₂) ──┤
GPU 3: (n₃, μ₃, M₂₃) ──┘

Advantage over all-reduce of sums:

All-reduce of $\sum x$ and $\sum x^{2}$ : suffers from catastrophic cancellation
Chan’s merge: numerically stable, with results that agree up to rounding error regardless of GPU count
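A rough single-process simulation of this reduction is sketched below, assuming each worker reports an `(n, mean, M2)` tuple. The helper names `local_stats` and `merge` and the shard contents are hypothetical; a real system would exchange the tuples through its own collective-communication layer.

```python
from functools import reduce

def local_stats(xs):
    """Welford pass over one worker's data; returns (n, mean, M2)."""
    n, mean, m2 = 0, 0.0, 0.0
    for x in xs:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
    return n, mean, m2

def merge(a, b):
    """Chan-style merge of two (n, mean, M2) tuples; empty inputs pass through."""
    n_a, mean_a, m2_a = a
    n_b, mean_b, m2_b = b
    if n_a == 0:
        return b
    if n_b == 0:
        return a
    n = n_a + n_b
    delta = mean_b - mean_a
    mean = mean_a + delta * n_b / n
    m2 = m2_a + m2_b + delta * delta * n_a * n_b / n
    return n, mean, m2

# Four simulated workers with uneven (and one empty) shards
shards = [[1.0, 2.0, 3.0], [10.0, 11.0], [], [4.0, 5.0, 6.0, 7.0]]
n, mean, m2 = reduce(merge, (local_stats(s) for s in shards), (0, 0.0, 0.0))

# Reference: one Welford pass over all the data
flat = [x for s in shards for x in s]
reference = local_stats(flat)
print((n, mean, m2))
print(reference)
```

The merged tuple and the single-pass reference agree up to rounding. Because the merge is associative in exact arithmetic, the same function works for a sequential fold or a tree reduction.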

5. Welford for Covariance and Correlation

5.1 Online Covariance — Derivation

Definition: The co-deviation sum for two streams $\{ x_{i} \}$ and $\{ y_{i} \}$ is:

C_{n} = \sum_{i = 1}^{n} (x_{i} - {\bar{x}}_{n}) (y_{i} - {\bar{y}}_{n})

Derivation of the recurrence: Split the sum as before:

C_{n} = \sum_{i = 1}^{n - 1} (x_{i} - {\bar{x}}_{n}) (y_{i} - {\bar{y}}_{n}) + (x_{n} - {\bar{x}}_{n}) (y_{n} - {\bar{y}}_{n})

Using the identities from Section 2:

x_{i} - {\bar{x}}_{n} = (x_{i} - {\bar{x}}_{n - 1}) - δ_{x}, δ_{x} = \frac{x_{n} - {\bar{x}}_{n - 1}}{n}

y_{i} - {\bar{y}}_{n} = (y_{i} - {\bar{y}}_{n - 1}) - δ_{y}, δ_{y} = \frac{y_{n} - {\bar{y}}_{n - 1}}{n}

Expanding the product:

\sum_{i = 1}^{n - 1} (x_{i} - {\bar{x}}_{n}) (y_{i} - {\bar{y}}_{n}) = C_{n - 1} - δ_{x} \underbrace{\sum_{i = 1}^{n - 1} (y_{i} - {\bar{y}}_{n - 1})}_{= 0} - δ_{y} \underbrace{\sum_{i = 1}^{n - 1} (x_{i} - {\bar{x}}_{n - 1})}_{= 0} + (n - 1) δ_{x} δ_{y}

= C_{n - 1} + (n - 1) δ_{x} δ_{y}

For the $n$ -th term, from Section 2.2 Step 6:

x_{n} - {\bar{x}}_{n} = \frac{n - 1}{n} (x_{n} - {\bar{x}}_{n - 1}), y_{n} - {\bar{y}}_{n} = \frac{n - 1}{n} (y_{n} - {\bar{y}}_{n - 1})

Combining:

C_{n} = C_{n - 1} + (n - 1) δ_{x} δ_{y} + \frac{(n - 1)^{2}}{n^{2}} (x_{n} - {\bar{x}}_{n - 1}) (y_{n} - {\bar{y}}_{n - 1})

Substituting $δ_{x} δ_{y} = \frac{(x_{n} - {\bar{x}}_{n - 1}) (y_{n} - {\bar{y}}_{n - 1})}{n^{2}}$ :

C_{n} = C_{n - 1} + \frac{n - 1 + (n - 1)^{2}}{n^{2}} (x_{n} - {\bar{x}}_{n - 1}) (y_{n} - {\bar{y}}_{n - 1})

= C_{n - 1} + \frac{n - 1}{n} (x_{n} - {\bar{x}}_{n - 1}) (y_{n} - {\bar{y}}_{n - 1})

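Rewriting in the product form (analogous to Form A in Section 2.3), using $y_{n} - {\bar{y}}_{n} = \frac{n - 1}{n} (y_{n} - {\bar{y}}_{n - 1})$ :

C_{n} = C_{n - 1} + (x_{n} - {\bar{x}}_{n - 1}) (y_{n} - {\bar{y}}_{n})

Note the asymmetry: the first factor uses the old mean ${\bar{x}}_{n - 1}$ while the second uses the new mean ${\bar{y}}_{n}$ . This is not a typo. The mirrored product $(x_{n} - {\bar{x}}_{n}) (y_{n} - {\bar{y}}_{n - 1})$ gives the same value, but using two old means or two new means does not.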

The covariance estimates are:

Cov (x, y) = \frac{C_{n}}{n} (population), \frac{C_{n}}{n - 1} (sample)

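import math

class WelfordCovariance:
    """Online covariance and correlation via Welford's algorithm."""
    
    def __init__(self):
        self.n = 0
        self.mean_x = 0.0
        self.mean_y = 0.0
        self.C = 0.0     # Co-deviation sum
        self.M2_x = 0.0  # Sum of squared deviations of x
        self.M2_y = 0.0  # Sum of squared deviations of y
    
    def update(self, x, y):
        self.n += 1
        dx = x - self.mean_x  # Deviation from OLD mean_x
        dy = y - self.mean_y  # Deviation from OLD mean_y
        self.mean_x += dx / self.n
        self.mean_y += dy / self.n
        self.C += dx * (y - self.mean_y)     # OLD mean_x, NEW mean_y
        self.M2_x += dx * (x - self.mean_x)  # Needed for correlation
        self.M2_y += dy * (y - self.mean_y)
    
    @property
    def covariance(self):
        """Population covariance."""
        return self.C / self.n if self.n > 0 else 0.0
    
    @property
    def correlation(self):
        """Pearson correlation; 0.0 if either stream has zero variance."""
        denom = math.sqrt(self.M2_x * self.M2_y)
        return self.C / denom if denom > 0 else 0.0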

5.2 Online Covariance Matrix

For $d$ -dimensional data $x_{i} \in R^{d}$ , the full covariance matrix generalizes the scalar case via an outer product:

C_{n} = C_{n - 1} + (x_{n} - {\bar{x}}_{n - 1}) (x_{n} - {\bar{x}}_{n})^{⊤}

Structure: At each step, this is a rank-1 update — the outer product of two $d$ -dimensional vectors. The computational cost is $O (d^{2})$ per observation, and the storage is $O (d^{2})$ for the full covariance matrix.

The correlation matrix is then:

R_{n} = diag (C_{n})^{- 1 / 2} C_{n} diag (C_{n})^{- 1 / 2}
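The rank-1 update and the normalization to a correlation matrix can be sketched in dependency-free Python using nested lists. The function names `online_covariance` and `correlation_matrix` are hypothetical, and the sketch assumes every diagonal entry of $C_{n}$ is nonzero; an array library would replace the double loop with an outer product.

```python
import math

def online_covariance(rows, d):
    """Single pass over d-dimensional rows; returns (n, mean, C), C being the co-deviation matrix."""
    n = 0
    mean = [0.0] * d
    C = [[0.0] * d for _ in range(d)]
    for row in rows:
        n += 1
        d_old = [row[k] - mean[k] for k in range(d)]       # deviation from the old mean
        mean = [mean[k] + d_old[k] / n for k in range(d)]
        d_new = [row[k] - mean[k] for k in range(d)]       # deviation from the new mean
        for i in range(d):
            for j in range(d):
                C[i][j] += d_old[i] * d_new[j]             # rank-1 outer-product update
    return n, mean, C

def correlation_matrix(C):
    """Scale C by the square roots of its diagonal (assumes a nonzero diagonal)."""
    d = len(C)
    scale = [math.sqrt(C[k][k]) for k in range(d)]
    return [[C[i][j] / (scale[i] * scale[j]) for j in range(d)] for i in range(d)]

# Second coordinate is exactly twice the first, so the correlation should be 1
rows = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
n, mean, C = online_covariance(rows, d=2)
R = correlation_matrix(C)
print(C)
print(R)
```

Dividing `C` by `n` (or `n - 1`) gives the population (or sample) covariance matrix; the correlation matrix is unaffected by that choice because the divisor cancels.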

6. Weighted and Exponentially-Weighted Variants

6.1 Weighted Welford

For weighted observations $(x_{i}, w_{i})$ with positive weights, define:

W_{n} = \sum_{i = 1}^{n} w_{i}, {\bar{x}}_{n} = \frac{1}{W_{n}} \sum_{i = 1}^{n} w_{i} x_{i}, M_{2, n} = \sum_{i = 1}^{n} w_{i} (x_{i} - {\bar{x}}_{n})^{2}

Derivation of the weighted mean update:

{\bar{x}}_{n} = \frac{W_{n - 1} {\bar{x}}_{n - 1} + w_{n} x_{n}}{W_{n}} = \frac{W_{n - 1} {\bar{x}}_{n - 1} + w_{n} x_{n}}{W_{n - 1} + w_{n}}

Rewrite as:

{\bar{x}}_{n} = {\bar{x}}_{n - 1} + \frac{w_{n}}{W_{n}} (x_{n} - {\bar{x}}_{n - 1})

Derivation of the weighted $M_{2}$ update: Following the same algebraic strategy as Section 2.2, define $δ_{w} = {\bar{x}}_{n} - {\bar{x}}_{n - 1} = \frac{w_{n}}{W_{n}} (x_{n} - {\bar{x}}_{n - 1})$ :

M_{2, n} = \sum_{i = 1}^{n - 1} w_{i} (x_{i} - {\bar{x}}_{n})^{2} + w_{n} (x_{n} - {\bar{x}}_{n})^{2}

Expanding $(x_{i} - {\bar{x}}_{n}) = (x_{i} - {\bar{x}}_{n - 1}) - δ_{w}$ and using $\sum_{i = 1}^{n - 1} w_{i} (x_{i} - {\bar{x}}_{n - 1}) = 0$ :

\sum_{i = 1}^{n - 1} w_{i} (x_{i} - {\bar{x}}_{n})^{2} = M_{2, n - 1} + W_{n - 1} δ_{w}^{2}

For the $n$ -th term: $x_{n} - {\bar{x}}_{n} = (x_{n} - {\bar{x}}_{n - 1}) - δ_{w} = (1 - \frac{w_{n}}{W_{n}}) (x_{n} - {\bar{x}}_{n - 1}) = \frac{W_{n - 1}}{W_{n}} (x_{n} - {\bar{x}}_{n - 1})$ , so:

w_{n} (x_{n} - {\bar{x}}_{n})^{2} = w_{n} \cdot \frac{W_{n - 1}^{2}}{W_{n}^{2}} (x_{n} - {\bar{x}}_{n - 1})^{2}

Combining and simplifying:

W_{n - 1} δ_{w}^{2} + w_{n} (x_{n} - {\bar{x}}_{n})^{2} = \frac{W_{n - 1} w_{n}^{2}}{W_{n}^{2}} (x_{n} - {\bar{x}}_{n - 1})^{2} + \frac{w_{n} W_{n - 1}^{2}}{W_{n}^{2}} (x_{n} - {\bar{x}}_{n - 1})^{2}

= \frac{w_{n} W_{n - 1} (w_{n} + W_{n - 1})}{W_{n}^{2}} (x_{n} - {\bar{x}}_{n - 1})^{2} = \frac{w_{n} W_{n - 1}}{W_{n}} (x_{n} - {\bar{x}}_{n - 1})^{2}

Rewriting in the symmetric form:

M_{2, n} = M_{2, n - 1} + w_{n} (x_{n} - {\bar{x}}_{n - 1}) (x_{n} - {\bar{x}}_{n})

This reduces to the standard (unweighted) Welford recurrence when $w_{i} = 1$ for all $i$ .
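A minimal sketch of the weighted recurrence follows. The class name `WeightedWelford` is an assumption for illustration, and the variance shown is the weight-normalized (population-style) estimate $M_{2, n} / W_{n}$ .

```python
class WeightedWelford:
    """Weighted running mean and variance using the recurrence derived above."""

    def __init__(self):
        self.W = 0.0     # Total weight so far
        self.mean = 0.0
        self.M2 = 0.0    # Weighted sum of squared deviations

    def update(self, x, w):
        if w <= 0:
            raise ValueError("weights must be positive")
        self.W += w
        delta = x - self.mean                   # Deviation from the old mean
        self.mean += (w / self.W) * delta
        self.M2 += w * delta * (x - self.mean)  # Old deviation times new deviation

    @property
    def variance(self):
        return self.M2 / self.W if self.W > 0 else 0.0

acc = WeightedWelford()
for x, w in [(1.0, 1.0), (2.0, 2.0), (4.0, 1.0)]:
    acc.update(x, w)
print(acc.mean, acc.variance)
```

With integer weights the result matches an unweighted pass over the data with each point repeated $w_{i}$ times; here that is the data set [1, 2, 2, 4].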

6.2 Exponentially Weighted Moving Variance (EWMA)

For non-stationary data (e.g., loss tracking, reward monitoring), we use exponential forgetting with decay rate $α \in (0, 1)$ :

{\bar{x}}_{t} = (1 - α) {\bar{x}}_{t - 1} + α x_{t}

The exponentially weighted variance update (West, 1979):

v_{t} = (1 - α) (v_{t - 1} + α (x_{t} - {\bar{x}}_{t - 1})^{2})

Derivation: This follows from the weighted Welford formula with exponentially decaying weights $w_{t} = (1 - α)^{T - t}$ , for which the total weight $W_{T} = \sum_{t \leq T} w_{t} \to \frac{1}{α}$ as $T \to \infty$ . Substituting into the weighted recurrence and normalizing gives the formula above.
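The two update equations translate directly into code. In the sketch below the class name `EWMVariance` is hypothetical, and seeding the mean with the first observation (with zero initial variance) is an assumption; the formulas themselves do not fix an initialization.

```python
class EWMVariance:
    """Exponentially weighted mean and variance with decay rate alpha."""

    def __init__(self, alpha):
        if not 0.0 < alpha < 1.0:
            raise ValueError("alpha must lie in (0, 1)")
        self.alpha = alpha
        self.mean = None   # Set on the first observation (assumed initialization)
        self.var = 0.0

    def update(self, x):
        if self.mean is None:
            self.mean = x
            return
        diff = x - self.mean  # Deviation from the OLD mean
        # v_t = (1 - alpha) * (v_{t-1} + alpha * diff^2)
        self.var = (1.0 - self.alpha) * (self.var + self.alpha * diff * diff)
        # mean_t = (1 - alpha) * mean_{t-1} + alpha * x_t, written incrementally
        self.mean += self.alpha * diff

ewm = EWMVariance(alpha=0.5)
for x in [0.0, 1.0, 1.0]:
    ewm.update(x)
print(ewm.mean, ewm.var)  # 0.75 0.1875
```

The variance must be updated before the mean, because the formula uses the deviation from the previous mean.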

[!NOTE] Connection to Adam Optimizer
The Adam optimizer maintains an exponentially weighted second moment $v_{t} = β_{2} v_{t - 1} + (1 - β_{2}) g_{t}^{2}$ . While structurally similar, Adam tracks $E [g^{2}]$ (raw second moment), not $Var (g)$ (centered variance). The EWMA variance formula tracks the centered second moment instead; the two are related by $Var (g) = E [g^{2}] - E [g]^{2}$ .

7. Applications in Deep Learning

7.1 [[ResNet|Batch Normalization]]

BatchNorm computes per-channel mean and variance during training:

\hat{x} = \frac{x - μ_{B}}{\sqrt{σ_{B}^{2} + ϵ}}

A textbook BatchNorm implementation can use the naïve formula when batches are small and activations are well scaled. Production kernels often use Welford-style reductions instead (PyTorch’s native batch norm kernels are one example) for numerical robustness, especially with:

Small batch sizes (e.g., batch size 1-4 in detection/segmentation)
Float16 / BFloat16 training (limited mantissa bits)
Running statistics accumulation across many batches

import torch

# Conceptual sketch of BatchNorm statistics for NCHW input.
# The running statistics are an exponential moving average (momentum update);
# Welford-style reductions apply to the per-batch mean/variance computation.
class RunningBatchNorm:
    def __init__(self, num_features, momentum=0.1, eps=1e-5):
        self.mean = torch.zeros(num_features)
        self.var = torch.ones(num_features)
        self.momentum = momentum
        self.eps = eps
        self.training = True  # Plain class, so the mode flag is set by hand
    
    def forward(self, x):
        if self.training:
            # Batch statistics over the batch and spatial dimensions
            batch_mean = x.mean(dim=[0, 2, 3])
            batch_var = x.var(dim=[0, 2, 3], unbiased=False)
            
            # Update running statistics (exponential moving average)
            self.mean = (1 - self.momentum) * self.mean + self.momentum * batch_mean.detach()
            self.var = (1 - self.momentum) * self.var + self.momentum * batch_var.detach()
            
            return (x - batch_mean[None, :, None, None]) / \
                   torch.sqrt(batch_var[None, :, None, None] + self.eps)
        else:
            return (x - self.mean[None, :, None, None]) / \
                   torch.sqrt(self.var[None, :, None, None] + self.eps)

7.2 Layer Normalization & RMSNorm

In [[Vision Transformer (ViT)|Transformers]], LayerNorm requires computing variance over the feature dimension:

LayerNorm (x) = \frac{x - μ}{\sqrt{σ^{2} + ϵ}} \cdot γ + β

For large hidden dimensions ( $d \geq 4096$ ), Welford helps keep normalization stable even in mixed precision.

7.3 Reinforcement Learning: Advantage Estimation

In PPO and other policy gradient methods, advantage normalization requires running statistics of returns:

{\hat{A}}_{t} = \frac{A_{t} - μ_{A}}{σ_{A} + ϵ}

Since RL agents interact with the environment over millions of timesteps, Welford’s online algorithm is the natural choice — storing all returns would be prohibitively expensive.
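A running normalizer of this kind can be sketched in a few lines. The class name `RunningNormalizer` and the default `eps` are assumptions for illustration and are not taken from any particular RL library.

```python
import math

class RunningNormalizer:
    """Tracks running mean/std with Welford and normalizes new values."""

    def __init__(self, eps=1e-8):
        self.n = 0
        self.mean = 0.0
        self.M2 = 0.0
        self.eps = eps

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.M2 += delta * (x - self.mean)

    @property
    def std(self):
        """Population standard deviation of everything seen so far."""
        return math.sqrt(self.M2 / self.n) if self.n > 0 else 0.0

    def normalize(self, x):
        return (x - self.mean) / (self.std + self.eps)

norm = RunningNormalizer()
for advantage in [1.0, 2.0, 3.0, 4.0, 5.0]:
    norm.update(advantage)
print(norm.normalize(3.0))  # 0.0
print(norm.normalize(5.0))
```

The state is three scalars regardless of how many timesteps have been observed, which is the property the paragraph above relies on.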

7.4 Gradient Monitoring and Clipping

Tracking gradient statistics for adaptive clipping:

grad_norm_clip = clip (\frac{∥ g ∥}{σ_{g}}, max_val)

Welford tracks $σ_{g}$ online without storing gradient history.

7.5 Loss Logging and Early Stopping

Monitoring training loss statistics:

Running mean of loss for smoothed curves
Running variance for detecting training instability (loss spikes)
Running covariance between loss and learning rate for adaptive scheduling
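As one possible use of the running variance, the sketch below flags a loss value as a spike when it exceeds the running mean by more than a chosen number of standard deviations. The class name `LossMonitor`, the threshold of 3 standard deviations, and the warm-up length are all assumptions chosen for illustration.

```python
import math

class LossMonitor:
    """Flags loss spikes relative to Welford running statistics."""

    def __init__(self, threshold=3.0, warmup=5):
        self.n = 0
        self.mean = 0.0
        self.M2 = 0.0
        self.threshold = threshold
        self.warmup = max(warmup, 2)  # Need at least 2 points for a sample std

    def observe(self, loss):
        """Return True if loss is a spike versus the history so far, then record it."""
        is_spike = False
        if self.n >= self.warmup:
            std = math.sqrt(self.M2 / (self.n - 1))
            is_spike = loss > self.mean + self.threshold * std
        # Welford update (the spike itself is also folded into the statistics)
        self.n += 1
        delta = loss - self.mean
        self.mean += delta / self.n
        self.M2 += delta * (loss - self.mean)
        return is_spike

monitor = LossMonitor()
flags = [monitor.observe(l) for l in [1.0, 1.1, 0.9, 1.0, 1.05, 5.0, 1.0]]
print(flags)  # [False, False, False, False, False, True, False]
```

Folding the spike into the statistics is a design choice: it widens the band afterwards, so a variant that skips flagged values may suit some training setups better.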

8. Comparison with Other Methods

8.1 Variance Computation Methods

Method	Passes	Online	Numerically Stable	Memory	Parallel
Two-pass (subtract mean, then square)	2	❌	✅ Stable	$O (n)$	❌
Naïve one-pass ( $\overline{x^{2}} - {\bar{x}}^{2}$ )	1	✅	❌ Catastrophic cancellation	$O (1)$	✅
Welford	1	✅	✅ Stable	$O (1)$	✅ (Chan)
Youngs & Cramer	1	✅	✅ Stable	$O (1)$	✅
Kahan summation + naïve	1	✅	⚠️ Better but not guaranteed	$O (1)$	❌

8.2 When to Use What

Scenario	Recommended Method
Small dataset, fits in memory	Two-pass (simplest, stable)
Streaming data, single pass	Welford
Distributed / multi-GPU	Chan’s parallel Welford
Mixed precision (float16)	Welford (essential)
High-dimensional covariance	Welford with rank-1 updates

9. Numerical Analysis

9.1 Error Bounds

For $n$ observations with machine precision $ϵ_{mach}$ :

Method	Relative Error Bound
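Naïve	$O (n κ^{2} ϵ_{mach})$ where $κ = \frac{\max_{i} | x_{i} |}{σ}$
Welford	$O (n κ ϵ_{mach})$ : linear rather than quadratic in $κ$

Key insight: Welford’s error grows only linearly with the condition number $κ$ , whereas the naïve formula’s grows quadratically (bounds of this form are given by Chan, Golub, and LeVeque, 1983). For $κ = 10^{8}$ in float64, that is the difference between roughly eight correct digits and none.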

9.2 Catastrophic Cancellation Demonstration

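import numpy as np

# Data: large mean, small variance. float64 is used so that the inputs are
# exactly representable (in float32 all three values round to 1e8).
data = np.array([1e8 + 1, 1e8 + 2, 1e8 + 3], dtype=np.float64)

# Naïve formula: the squares (~1e16) exceed 2**53, so they are rounded
mean_sq = np.mean(data ** 2)
sq_mean = np.mean(data) ** 2
naive_var = mean_sq - sq_mean
print(f"Naïve variance:   {naive_var:.6f}")  # Inaccurate; may be 0.0 or negative

# Welford (WelfordOnline is the class from Section 3.1)
welford = WelfordOnline()
for x in data:
    welford.update(float(x))
print(f"Welford variance: {welford.variance:.6f}")  # Population variance: 0.666667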

9.3 Float16 Considerations

In mixed-precision training (AMP), variance computation is especially fragile:

Precision	Mantissa Bits	Cancellation Threshold
float32	23 bits	$κ > 10^{3}$ problematic
float16	10 bits	$κ > 10^{1}$ problematic
bfloat16	7 bits	$κ > 10^{0}$ problematic

Welford’s algorithm is especially valuable for float16/bfloat16 training: with the naïve formula in reduced precision, normalization layers can produce negative variances and, from them, NaN or Inf values.

10. Core Formula Cards

[!QUOTE] Welford Running Mean
${\bar{x}}_{n} = {\bar{x}}_{n - 1} + \frac{x_{n} - {\bar{x}}_{n - 1}}{n}$

[!QUOTE] Welford Running Variance (Core Recurrence)
$M_{2, n} = M_{2, n - 1} + (x_{n} - {\bar{x}}_{n - 1}) (x_{n} - {\bar{x}}_{n})$

[!QUOTE] Variance from $M_{2}$
$σ^{2} = \frac{M_{2, n}}{n}, s^{2} = \frac{M_{2, n}}{n - 1}$

[!QUOTE] Chan’s Parallel Merge
$M_{2, A \cup B} = M_{2, A} + M_{2, B} + \frac{n_{A} n_{B}}{n_{A} + n_{B}} ({\bar{x}}_{A} - {\bar{x}}_{B})^{2}$

[!QUOTE] Online Covariance
$C_{n} = C_{n - 1} + (x_{n} - {\bar{x}}_{n - 1}) (y_{n} - {\bar{y}}_{n})$

[!QUOTE] Weighted Welford Update
$M_{2, n} = M_{2, n - 1} + w_{n} (x_{n} - {\bar{x}}_{n - 1}) (x_{n} - {\bar{x}}_{n})$

[!QUOTE] Exponentially Weighted Variance
$v_{t} = (1 - α) (v_{t - 1} + α (x_{t} - {\bar{x}}_{t - 1})^{2})$

11. Summary

Aspect	Description
Core idea	Accumulate squared deviations from running mean, not raw squares
Key advantage	Numerically stable — immune to catastrophic cancellation
Complexity	$O (1)$ time and space per update
Parallel extension	Chan’s merge formula for distributed computation
Generalization	Covariance, correlation, weighted, exponentially-weighted
Role in DL	BatchNorm running stats, LayerNorm, RL advantage normalization, gradient monitoring
When essential	Mixed precision (float16/bfloat16), large-scale data, distributed training

Welford’s algorithm exemplifies a fundamental principle in numerical computing: reformulate to avoid subtracting large, nearly-equal quantities. By tracking deviations rather than raw sums, it achieves stability that the naïve formula cannot — a lesson that recurs throughout machine learning, from [[ResNet|BatchNorm]] to the [[Score Function|score function]] estimation in diffusion models.

[[ResNet]]
[[U-Net]]
[[Vision Transformer (ViT)]]
[[Diffusion Model]]
[[Score Function]]

Dataview Query

LIST
FROM #welford OR #online_algorithm OR #numerical_stability
SORT file.ctime DESC

References

Paper: Note on a Method for Calculating Corrected Sums of Squares and Products (Welford, Technometrics 1962)
Paper: Updating Mean and Variance Estimates: An Improved Method (West, Communications of the ACM 1979)
Paper: Algorithms for Computing the Sample Variance: Analysis and Recommendations (Chan, Golub, LeVeque, The American Statistician 1983)
Paper: Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift (Ioffe & Szegedy, ICML 2015)
Blog: Accurately computing running variance — John D. Cook
Docs: PyTorch torch.nn.BatchNorm2d implementation notes
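Code: cpython/Lib/statistics.py, Python’s statistics.variance (it accumulates sums in exact rational arithmetic rather than using Welford’s recurrence, so it is a useful reference for checking results)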