2026-06-30

Score Function

The score function of a probability density $p (x)$ is the gradient of its log-density with respect to the input: $\nabla_{x} \log p (x)$ . It points in the direction of steepest increase in log-density and is fundamental to score-based generative modeling, [[Diffusion Model|diffusion models]], and statistical inference.

1. Core Concept

1.1 Definition

For a probability density function $p (x)$ , the score function is:

s (x) = \nabla_{x} \log p (x) = \frac{\nabla_{x} p (x)}{p (x)}

Key properties:

Points in the direction of steepest increase of the log-density
Magnitude indicates how fast the log-density changes
Invariant to scaling: $\nabla_{x} \log (c \cdot p (x)) = \nabla_{x} \log p (x)$ for $c > 0$
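
The scaling invariance can be checked numerically. The following is a minimal sketch, assuming a hypothetical unnormalized 1-D density and a central-difference derivative (both chosen only for illustration); the names `unnormalized_density`, `scaled_density`, and `numerical_score` are not from any library.

```python
import math

def unnormalized_density(x):
    """Hypothetical unnormalized 1-D density: a Gaussian bump centred at 1."""
    return math.exp(-0.5 * (x - 1.0) ** 2)

def scaled_density(x, c=7.3):
    """The same density multiplied by an arbitrary positive constant c."""
    return c * unnormalized_density(x)

def numerical_score(density, x, h=1e-5):
    """Central-difference estimate of d/dx log density(x)."""
    return (math.log(density(x + h)) - math.log(density(x - h))) / (2.0 * h)

x = 2.5
score_plain = numerical_score(unnormalized_density, x)
score_scaled = numerical_score(scaled_density, x)
print(score_plain, score_scaled)  # both close to -(x - 1) = -1.5
```

The constant `c` only shifts the log-density by `log c`, so it drops out of the derivative.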

[!NOTE] Intuitive Understanding
Imagine standing on a hill in fog. The score function tells you which direction goes uphill most steeply and how steep it is, without knowing your absolute elevation.

1.2 Why Score Function?

Advantages over density estimation:

No normalization constant: $\nabla_{x} \log p (x)$ doesn’t require knowing $\int p (x) d x$
Geometric interpretation: Reveals structure of probability landscape
Sampling: Can generate samples using [[Langevin Dynamics|Langevin dynamics]]
Flexible models: Learn unnormalized densities

Applications:

[[Diffusion Model|Diffusion models]] (core component)
Score-based generative modeling
Statistical inference (Fisher information)
Optimization (natural gradient)
Anomaly detection

2. Mathematical Properties

2.1 Basic Properties

Zero mean:

E_{x \sim p (x)} [s (x)] = \int p (x) \nabla_{x} \log p (x) d x = \int \nabla_{x} p (x) d x = 0

Fisher Information Matrix:

I = E [s (x) s (x)^{⊤}] = Cov [s (x)]

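In classical statistics the score is taken with respect to model parameters $θ$ , and the Fisher information measures how much information observations carry about $θ$ . For the data-space score used here, $I$ plays the same role for a location (shift) parameter of $p$ .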

2.2 Score of Gaussian Distribution

For $x \sim N (μ, Σ)$ :

\log p (x) = - \frac{1}{2} (x - μ)^{⊤} Σ^{- 1} (x - μ) + const

s (x) = \nabla_{x} \log p (x) = - Σ^{- 1} (x - μ)

Special case (standard normal $N (0, I)$ ):

s (x) = - x
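
The Gaussian score also gives an easy way to check the zero-mean property and the Fisher information from Section 2.1. A minimal Monte Carlo sketch, assuming a 1-D Gaussian with illustrative values of `mu` and `sigma`:

```python
import random

def gaussian_score(x, mu, sigma):
    """Score of the 1-D Gaussian N(mu, sigma^2): -(x - mu) / sigma^2."""
    return -(x - mu) / sigma ** 2

random.seed(0)
mu, sigma = 1.5, 2.0
samples = [random.gauss(mu, sigma) for _ in range(200_000)]
scores = [gaussian_score(x, mu, sigma) for x in samples]

mean_score = sum(scores) / len(scores)                   # should be near 0
fisher_info = sum(s * s for s in scores) / len(scores)   # should be near 1 / sigma^2 = 0.25
print(mean_score, fisher_info)
```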

2.3 Score of Mixture Distribution

For mixture $p (x) = \sum_{k} w_{k} p_{k} (x)$ :

s (x) = \nabla_{x} \log \sum_{k} w_{k} p_{k} (x) = \frac{\sum_{k} w_{k} \nabla_{x} p_{k} (x)}{\sum_{k} w_{k} p_{k} (x)}

s (x) = \sum_{k} \underbrace{\frac{w_{k} p_{k} (x)}{\sum_{j} w_{j} p_{j} (x)}}_{γ_{k} (x)} s_{k} (x)

where $γ_{k} (x)$ is the responsibility of component $k$ , and $s_{k} (x)$ is its score.

Interpretation: The score is a weighted average of component scores.
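
As a sketch of the formula above, the responsibility-weighted sum can be compared against a finite-difference derivative of the mixture log-density. The two-component mixture and all parameter values here are assumptions chosen for illustration.

```python
import math

def normal_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def mixture_score(x, weights, mus, sigmas):
    """Score of a 1-D Gaussian mixture as a responsibility-weighted sum of component scores."""
    joint = [w * normal_pdf(x, m, s) for w, m, s in zip(weights, mus, sigmas)]
    total = sum(joint)
    responsibilities = [j / total for j in joint]                         # gamma_k(x)
    component_scores = [-(x - m) / s ** 2 for m, s in zip(mus, sigmas)]   # s_k(x)
    return sum(g * sk for g, sk in zip(responsibilities, component_scores))

def mixture_log_density(x, weights, mus, sigmas):
    return math.log(sum(w * normal_pdf(x, m, s) for w, m, s in zip(weights, mus, sigmas)))

weights, mus, sigmas = [0.3, 0.7], [-2.0, 1.0], [0.5, 1.0]
x, h = 0.2, 1e-5
analytic = mixture_score(x, weights, mus, sigmas)
numeric = (mixture_log_density(x + h, weights, mus, sigmas)
           - mixture_log_density(x - h, weights, mus, sigmas)) / (2.0 * h)
print(analytic, numeric)  # the two values agree up to finite-difference error
```

Note that the weights $γ_{k} (x)$ depend on $x$, so the mixture score is not a fixed linear combination of the component scores.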

2.4 Score and Hessian

Hessian of log-density:

H (x) = \nabla_{x}^{2} \log p (x) = \nabla_{x} s (x)

Relationship:

E [H (x)] = - E [s (x) s (x)^{⊤}] = - I

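This identity follows from $\nabla_{x}^{2} \log p = \nabla_{x}^{2} p / p - s s^{⊤}$ and holds whenever $\nabla_{x} p (x)$ vanishes at the boundary of the domain.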
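
As a quick check of the relationship between the Hessian and the Fisher information, consider a 1-D Gaussian $N (μ, σ^{2})$ (a worked example added for illustration):

```latex
s(x) = -\frac{x - \mu}{\sigma^{2}}, \qquad
H(x) = \frac{d s}{d x} = -\frac{1}{\sigma^{2}}

E[s(x)^{2}] = \frac{E[(x - \mu)^{2}]}{\sigma^{4}} = \frac{1}{\sigma^{2}} = -E[H(x)]
```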

3. Score Estimation

3.1 The Challenge

Problem: We don’t know $p (x)$ , so we can’t compute $\nabla_{x} \log p (x)$ analytically.

Solution: Learn a neural network $s_{θ} (x)$ to approximate the score function.

3.2 Score Matching

Objective: Minimize Fisher divergence between true score $s (x)$ and model score $s_{θ} (x)$ :

L (θ) = \frac{1}{2} E_{p (x)} [∥ s_{θ} (x) - s (x) ∥^{2}]

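Problem: Requires knowing $s (x) = \nabla_{x} \log p (x)$ , which is unavailable when $p$ is known only through samples. Hyvärinen (2005) showed that integration by parts turns this objective, up to a constant, into $E_{p (x)} [tr (\nabla_{x} s_{θ} (x)) + \frac{1}{2} ∥ s_{θ} (x) ∥^{2}]$ , which needs only samples. The trace of the Jacobian is expensive in high dimensions, which motivates the two variants below.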

3.3 Denoising Score Matching

Key insight (Vincent, 2011): Perturb the data with noise and learn the score of the perturbed distribution, whose conditional score is known in closed form.

Algorithm:

Perturb data: $\tilde{x} = x + σ ϵ$ , where $ϵ \sim N (0, I)$
Learn score of perturbed distribution: $s_{θ} (\tilde{x}) \approx \nabla_{\tilde{x}} \log p_{σ} (\tilde{x})$

Loss function:

L (θ) = \frac{1}{2} E_{x, \tilde{x}} [∥ s_{θ} (\tilde{x}) - \nabla_{\tilde{x}} \log p_{σ} (\tilde{x} ∣ x) ∥^{2}]

For Gaussian noise $\tilde{x} ∣ x \sim N (x, σ^{2} I)$ :

\nabla_{\tilde{x}} \log p_{σ} (\tilde{x} ∣ x) = - \frac{\tilde{x} - x}{σ^{2}}

Final loss:

L (θ) = \frac{1}{2 σ^{2}} E_{x, ϵ} [∥ σ s_{θ} (x + σ ϵ) + ϵ ∥^{2}], \quad ϵ \sim N (0, I)
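
A toy sketch of denoising score matching, under assumptions made only for illustration: the data are $x \sim N (0, 1)$, the model is the one-parameter linear score $s_{θ} (\tilde{x}) = θ \tilde{x}$, and the loss is minimized in closed form by least squares against the conditional score $- (\tilde{x} - x) / σ^{2}$. The fitted slope should approach the score of the perturbed density $N (0, 1 + σ^{2})$, not that of the clean data.

```python
import random

random.seed(0)
sigma = 0.5          # noise level (assumed for this toy example)
n = 200_000

num, den = 0.0, 0.0
for _ in range(n):
    x = random.gauss(0.0, 1.0)        # clean sample
    eps = random.gauss(0.0, 1.0)      # standard normal noise
    x_tilde = x + sigma * eps         # perturbed sample
    target = -eps / sigma             # conditional score -(x_tilde - x) / sigma^2
    # Accumulate the least-squares solution for s_theta(x_tilde) = theta * x_tilde
    num += x_tilde * target
    den += x_tilde * x_tilde

theta_dsm = num / den
theta_true = -1.0 / (1.0 + sigma ** 2)   # score slope of N(0, 1 + sigma^2)
print(theta_dsm, theta_true)
```

In practice $s_{θ}$ is a neural network trained by stochastic gradient descent; the closed-form fit is used here only to keep the example short.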

3.4 Sliced Score Matching

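Idea: Project the score onto random directions so that the full Jacobian $\nabla_{x} s_{θ} (x)$ (the Hessian of the model log-density) never has to be formed.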

Loss:

L (θ) = E_{p (x), p (v)} [v^{⊤} \nabla_{x} s_{θ} (x) v + \frac{1}{2} ∥ s_{θ} (x) ∥^{2}]

where $v \sim N (0, I)$ is a random vector.

Advantage: Only requires Jacobian-vector products (efficient via autodiff).
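
The loss above can be sketched without an autodiff library. This example assumes a hypothetical linear model $s_{θ} (x) = a x$ and data from $N (0, I)$, and it approximates the Jacobian-vector product with central differences as a stand-in for autodiff. The true score is $- x$, so the loss should be lowest at $a = - 1$.

```python
import random

def linear_score(x, a):
    """Hypothetical model s_theta(x) = a * x, applied elementwise to a vector."""
    return [a * xi for xi in x]

def sliced_sm_loss(a, data, rng, h=1e-4):
    """Monte Carlo estimate of E[v^T (grad_x s) v + 0.5 * ||s||^2] for the linear model."""
    total = 0.0
    for x in data:
        v = [rng.gauss(0.0, 1.0) for _ in x]
        # Directional derivative J v via central differences (stand-in for autodiff).
        s_plus = linear_score([xi + h * vi for xi, vi in zip(x, v)], a)
        s_minus = linear_score([xi - h * vi for xi, vi in zip(x, v)], a)
        jvp = [(p - m) / (2.0 * h) for p, m in zip(s_plus, s_minus)]
        vt_j_v = sum(vi * ji for vi, ji in zip(v, jvp))
        s = linear_score(x, a)
        total += vt_j_v + 0.5 * sum(si * si for si in s)
    return total / len(data)

rng = random.Random(0)
dim = 3
data = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(20_000)]
# Re-seeding the projection generator reuses the same directions v for every a.
losses = {a: sliced_sm_loss(a, data, random.Random(1)) for a in (-1.5, -1.0, -0.5)}
print(losses)
```

For this model the expected loss is $d (a + a^{2} / 2)$ in $d$ dimensions, which is minimized at $a = - 1$.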

4. Score-Based Generative Modeling

4.1 Langevin Dynamics

Core idea: Use score function to generate samples via stochastic dynamics.

Algorithm:

x_{t + 1} = x_{t} + \frac{η}{2} s_{θ} (x_{t}) + \sqrt{η} z_{t}

where:

$η$ : Step size
$z_{t} \sim N (0, I)$ : Random noise

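Theorem: Under suitable regularity conditions, and with the exact score $s_{θ} = \nabla_{x} \log p$ , the distribution of $x_{T}$ converges to $p (x)$ as $η \to 0$ and $T \to \infty$ .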

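# Langevin Dynamics Sampling
import torch

def langevin_sampling(score_model, x_init, n_steps, step_size):
    """Unadjusted Langevin dynamics; step_size is a Python float (eta in the update above)."""
    x = x_init

    for _ in range(n_steps):
        # Evaluate the (learned) score at the current state
        score = score_model(x)

        # Draw fresh Gaussian noise z_t ~ N(0, I)
        noise = torch.randn_like(x)

        # x_{t+1} = x_t + (eta / 2) * score + sqrt(eta) * z_t
        # step_size ** 0.5 is used because torch.sqrt only accepts tensors
        x = x + (step_size / 2) * score + step_size ** 0.5 * noise

    return x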
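
A scalar, standard-library variant of the same update can be run end to end. It is a sketch under stated assumptions: the exact score of $N (2, 1)$ stands in for a trained `score_model`, and the step size, number of steps, and number of chains are arbitrary illustrative choices.

```python
import math
import random

def langevin_sample_1d(score_fn, x_init, n_steps, step_size, rng):
    """Unadjusted Langevin dynamics for a scalar state."""
    x = x_init
    for _ in range(n_steps):
        x = x + 0.5 * step_size * score_fn(x) + math.sqrt(step_size) * rng.gauss(0.0, 1.0)
    return x

mu, sigma = 2.0, 1.0

def score_fn(x):
    """Exact score of N(mu, sigma^2), standing in for a learned model."""
    return -(x - mu) / sigma ** 2

rng = random.Random(0)
samples = [langevin_sample_1d(score_fn, x_init=-5.0, n_steps=500, step_size=0.05, rng=rng)
           for _ in range(1000)]
mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)
print(mean, var)  # roughly 2.0 and 1.0, up to discretization and Monte Carlo error
```

Even though every chain starts far from the mode, the samples end up distributed close to the target; the small residual bias in the variance comes from the finite step size.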

4.2 Noise Conditional Score Networks (NCSN)

Problem: Single noise level doesn’t work well for complex distributions.

Solution: Train score model with multiple noise levels.

Algorithm:

Noise schedule: $σ_{1} < σ_{2} < \dots < σ_{L}$
Perturb data: $\tilde{x} = x + σ_{i} ϵ$ for random $i$
Learn: $s_{θ} (\tilde{x}, i) \approx \nabla_{\tilde{x}} \log p_{σ_{i}} (\tilde{x})$

Loss:

L (θ) = \sum_{i = 1}^{L} λ_{i} E [∥ s_{θ} (\tilde{x}, i) - \nabla_{\tilde{x}} \log p_{σ_{i}} (\tilde{x} ∣ x) ∥^{2}]

Sampling (annealed Langevin dynamics):

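import torch

def annealed_langevin_sampling(score_model, shape, n_steps_per_scale, noise_schedule, step_scale=1.0):
    # shape: shape of the sample tensor to generate
    # step_scale: assumed constant in the per-level step size eta = step_scale * sigma^2
    x = torch.randn(shape)  # Start from noise

    for sigma in reversed(noise_schedule):  # largest noise level first
        step_size = step_scale * sigma**2
        for _ in range(n_steps_per_scale):
            score = score_model(x, sigma)
            noise = torch.randn_like(x)
            # Same update as plain Langevin dynamics, with a per-level step size
            x = x + (step_size / 2) * score + step_size**0.5 * noise

    return x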
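
The annealing idea can be tried on a toy problem without a trained network. This sketch assumes a two-mode 1-D Gaussian mixture whose noise-perturbed score is available in closed form, uses the Langevin update from Section 4.1 with a step size proportional to $σ^{2}$ , and treats `step_scale`, the schedule, and the step counts as illustrative choices.

```python
import math
import random

WEIGHTS, MUS, STD = (0.7, 0.3), (-4.0, 4.0), 0.5   # toy two-mode target (assumed)

def perturbed_score(x, sigma):
    """Exact score of the target mixture convolved with N(0, sigma^2)."""
    var = STD ** 2 + sigma ** 2
    log_terms = [math.log(w) - 0.5 * (x - m) ** 2 / var for w, m in zip(WEIGHTS, MUS)]
    top = max(log_terms)                                # log-sum-exp for stability
    resp = [math.exp(t - top) for t in log_terms]
    norm = sum(resp)
    return sum(r / norm * (-(x - m) / var) for r, m in zip(resp, MUS))

def annealed_langevin_1d(sigmas, n_steps, step_scale, rng):
    x = rng.gauss(0.0, max(sigmas))                     # start from broad noise
    for sigma in sorted(sigmas, reverse=True):          # largest noise level first
        step = step_scale * sigma ** 2
        for _ in range(n_steps):
            x += 0.5 * step * perturbed_score(x, sigma) + math.sqrt(step) * rng.gauss(0.0, 1.0)
    return x

rng = random.Random(0)
sigmas = [0.1 * (10.0 / 0.1) ** (i / 9) for i in range(10)]   # geometric, 0.1 to 10
samples = [annealed_langevin_1d(sigmas, n_steps=50, step_scale=0.1, rng=rng)
           for _ in range(300)]
frac_left = sum(s < 0 for s in samples) / len(samples)
print(frac_left)  # expected to land near the left-mode weight of 0.7
```

At large noise levels the two modes merge, so chains can move between them freely; as the noise shrinks, each chain settles into one mode with roughly the right probability.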

4.3 Score [[Stochastic Differential Equation (SDE)|SDE]]

Unified framework (Song et al., 2021): Connect score matching with SDEs.

Forward [[Stochastic Differential Equation (SDE)|SDE]]:

d x = f (x, t) d t + g (t) d W_{t}

Reverse-time [[Stochastic Differential Equation (SDE)|SDE]]:

d x = [f (x, t) - g (t)^{2} \nabla_{x} \log p_{t} (x)] d t + g (t) d {\bar{W}}_{t}

Key insight: The reverse [[Stochastic Differential Equation (SDE)|SDE]] requires the score function $\nabla_{x} \log p_{t} (x)$ at all times $t$ .

Training:

L (θ) = E_{t, x_{0}, x_{t}} [∥ s_{θ} (x_{t}, t) - \nabla_{x_{t}} \log p_{0 t} (x_{t} ∣ x_{0}) ∥^{2}]
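
The reverse-time SDE can be simulated with Euler-Maruyama steps. The sketch below assumes a forward SDE with $f (x, t) = - \frac{1}{2} β x$ and $g (t) = \sqrt{β}$ for constant $β$ , and Gaussian data $N (3, 0.5^{2})$ , so that $p_{t}$ and its score are known exactly and no network is needed. All constants are illustrative.

```python
import math
import random

BETA, T = 1.0, 10.0      # constant noise rate and time horizon (assumed)
M0, S0 = 3.0, 0.5        # data distribution N(M0, S0^2) (assumed)

def marginal_mean_var(t):
    """Mean and variance of p_t for dx = -0.5*BETA*x dt + sqrt(BETA) dW with Gaussian data."""
    decay = math.exp(-BETA * t)
    return M0 * math.sqrt(decay), S0 ** 2 * decay + 1.0 - decay

def score(x, t):
    """Exact score of p_t, standing in for a learned s_theta(x, t)."""
    mean, var = marginal_mean_var(t)
    return -(x - mean) / var

def reverse_sde_sample(n_steps, rng):
    dt = T / n_steps
    x = rng.gauss(0.0, 1.0)                             # p_T is close to N(0, 1)
    for k in range(n_steps):
        t = T - k * dt
        drift = -0.5 * BETA * x - BETA * score(x, t)    # f - g^2 * score
        # Step backward in time: subtract the drift, add fresh noise
        x = x - drift * dt + math.sqrt(BETA * dt) * rng.gauss(0.0, 1.0)
    return x

rng = random.Random(0)
samples = [reverse_sde_sample(500, rng) for _ in range(1000)]
mean = sum(samples) / len(samples)
std = math.sqrt(sum((s - mean) ** 2 for s in samples) / len(samples))
print(mean, std)  # roughly M0 = 3.0 and S0 = 0.5
```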

5. Score Function in [[Diffusion Model|Diffusion Models]]

5.1 Connection to [[Diffusion Model|DDPM]]

[[Diffusion Model|DDPM]] objective:

L = E [∥ ϵ - ϵ_{θ} (x_{t}, t) ∥^{2}]

Relationship to score:

For $x_{t} = \sqrt{{\bar{α}}_{t}} x_{0} + \sqrt{1 - {\bar{α}}_{t}} ϵ$ :

\nabla_{x_{t}} \log q (x_{t} ∣ x_{0}) = - \frac{ϵ}{\sqrt{1 - {\bar{α}}_{t}}}

Therefore:

ϵ_{θ} (x_{t}, t) = - \sqrt{1 - {\bar{α}}_{t}} s_{θ} (x_{t}, t)

Key insight: [[Diffusion Model|DDPM]] learns noise prediction, which is equivalent to learning the score function!
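
The conversion between noise prediction and score is a one-line rescaling. A minimal sketch with hypothetical helper names (`eps_to_score`, `score_to_eps`) and illustrative scalar values, checked against the conditional score of $q (x_{t} ∣ x_{0})$ :

```python
import math

def eps_to_score(eps_pred, alpha_bar_t):
    """Convert a noise prediction into a score estimate: s = -eps / sqrt(1 - alpha_bar_t)."""
    return -eps_pred / math.sqrt(1.0 - alpha_bar_t)

def score_to_eps(score, alpha_bar_t):
    """Inverse conversion: eps = -sqrt(1 - alpha_bar_t) * s."""
    return -math.sqrt(1.0 - alpha_bar_t) * score

# q(x_t | x_0) = N(sqrt(alpha_bar_t) * x_0, (1 - alpha_bar_t) * I)
alpha_bar_t, x0, eps = 0.36, 1.2, -0.7
x_t = math.sqrt(alpha_bar_t) * x0 + math.sqrt(1.0 - alpha_bar_t) * eps
conditional_score = -(x_t - math.sqrt(alpha_bar_t) * x0) / (1.0 - alpha_bar_t)
print(conditional_score, eps_to_score(eps, alpha_bar_t))  # the two values agree
```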

5.2 Unified View

| Framework | Learns | Relationship |
| --- | --- | --- |
| Score Matching | $s_{θ} (x) \approx \nabla_{x} \log p (x)$ | Direct score estimation |
| [[Diffusion Model\|DDPM]] | $ϵ_{θ} (x_{t}, t)$ | $ϵ_{θ} = - σ_{t} s_{θ}$ with $σ_{t} = \sqrt{1 - {\bar{α}}_{t}}$ |
| Score [[Stochastic Differential Equation (SDE)\|SDE]] | $s_{θ} (x, t) \approx \nabla_{x} \log p_{t} (x)$ | Time-dependent score |

5.3 [[Probability Flow ODE]]

The [[Probability Flow ODE]] explicitly uses the score function:

d x = [f (x, t) - \frac{1}{2} g (t)^{2} \nabla_{x} \log p_{t} (x)] d t

Interpretation:

$f (x, t)$ : Drift term (deterministic dynamics)
$- \frac{1}{2} g (t)^{2} \nabla_{x} \log p_{t} (x)$ : Score-dependent correction; when the ODE is integrated backward in time, it moves samples toward high-density regions
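
The ODE can be integrated with any standard solver. This sketch assumes $f (x, t) = - \frac{1}{2} β x$ , $g (t) = \sqrt{β}$ with constant $β$ , and Gaussian data, so the exact score is available; it uses explicit Euler steps backward in time and compares the result with the closed-form map that holds for Gaussian marginals. All constants are illustrative.

```python
import math

BETA, T = 1.0, 10.0      # constant noise rate and time horizon (assumed)
M0, S0 = 3.0, 0.5        # data distribution N(M0, S0^2) (assumed)

def marginal_mean_var(t):
    decay = math.exp(-BETA * t)
    return M0 * math.sqrt(decay), S0 ** 2 * decay + 1.0 - decay

def ode_velocity(x, t):
    """f(x, t) - 0.5 * g(t)^2 * score, with f = -0.5*BETA*x and g^2 = BETA."""
    mean, var = marginal_mean_var(t)
    score = -(x - mean) / var
    return -0.5 * BETA * x - 0.5 * BETA * score

def integrate_backward(x_T, n_steps=2000):
    """Explicit Euler from t = T down to t = 0 (deterministic, no noise term)."""
    dt = T / n_steps
    x = x_T
    for k in range(n_steps):
        t = T - k * dt
        x = x - ode_velocity(x, t) * dt
    return x

x_T = 1.0
x_0 = integrate_backward(x_T)
# For Gaussian marginals the flow preserves the standardized coordinate (x - mean_t) / std_t.
mean_T, var_T = marginal_mean_var(T)
x_0_closed_form = M0 + S0 * (x_T - mean_T) / math.sqrt(var_T)
print(x_0, x_0_closed_form)
```

Unlike the reverse-time SDE, the same starting point always maps to the same sample, which is what makes likelihood evaluation and latent-space interpolation possible.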

6. Advanced Topics

6.1 Fisher Divergence

Definition:

D_{F} (p ∥ q) = \frac{1}{2} E_{p (x)} [∥ \nabla_{x} \log p (x) - \nabla_{x} \log q (x) ∥^{2}]

Properties:

Not symmetric: $D_{F} (p ∥ q) \neq D_{F} (q ∥ p)$
Zero iff $p = q$ (up to normalization)
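Related to KL divergence: when both $p$ and $q$ are smoothed with Gaussian noise, the rate at which their KL divergence decreases is governed by the Fisher divergence (a de Bruijn-type identity)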
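
For two 1-D Gaussians the Fisher divergence has a simple closed form, which makes the asymmetry easy to see. A sketch with illustrative parameters; the closed form is derived from the Gaussian score in Section 2.2 and cross-checked by Monte Carlo.

```python
import random

def fisher_divergence_gaussians(mu_p, sigma_p, mu_q, sigma_q):
    """D_F(p || q) for p = N(mu_p, sigma_p^2) and q = N(mu_q, sigma_q^2)."""
    a = 1.0 / sigma_q ** 2 - 1.0 / sigma_p ** 2   # slope of s_p - s_q in (x - mu_p)
    b = (mu_p - mu_q) / sigma_q ** 2              # value of s_p - s_q at x = mu_p
    return 0.5 * (a ** 2 * sigma_p ** 2 + b ** 2)

def fisher_divergence_mc(mu_p, sigma_p, mu_q, sigma_q, n, rng):
    """Monte Carlo estimate of the same quantity using samples from p."""
    total = 0.0
    for _ in range(n):
        x = rng.gauss(mu_p, sigma_p)
        diff = -(x - mu_p) / sigma_p ** 2 + (x - mu_q) / sigma_q ** 2
        total += diff * diff
    return 0.5 * total / n

rng = random.Random(0)
d_pq = fisher_divergence_gaussians(0.0, 1.0, 1.0, 2.0)
d_qp = fisher_divergence_gaussians(1.0, 2.0, 0.0, 1.0)
d_pq_mc = fisher_divergence_mc(0.0, 1.0, 1.0, 2.0, 100_000, rng)
print(d_pq, d_qp, d_pq_mc)  # d_pq and d_qp differ; d_pq_mc is close to d_pq
```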

6.2 Stein’s Identity

Theorem: For any sufficiently smooth function $h (x)$ with $h (x) p (x) \to 0$ at the boundary of the domain:

E_{p (x)} [\nabla_{x} h (x) + h (x) \nabla_{x} \log p (x)] = 0

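Proof (integration by parts; the boundary term vanishes because $h (x) p (x) \to 0$ at the boundary):

\int \nabla_{x} h (x) p (x) d x = - \int h (x) \nabla_{x} p (x) d x = - \int h (x) \nabla_{x} \log p (x) p (x) d x

Moving the right-hand side to the left gives the identity.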

Applications:

Stein variational gradient descent (SVGD)
Goodness-of-fit tests
Variational inference
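
Stein's identity is easy to verify by simulation. A minimal sketch, assuming $p = N (0, 1)$ (so the score is $- x$ ) and the test function $h (x) = \sin x$ , both chosen only for illustration:

```python
import math
import random

rng = random.Random(0)
n = 200_000
total = 0.0
for _ in range(n):
    x = rng.gauss(0.0, 1.0)     # sample from p = N(0, 1); grad log p(x) = -x
    h = math.sin(x)             # smooth, bounded test function
    dh = math.cos(x)            # its derivative
    total += dh + h * (-x)
stein_estimate = total / n
print(stein_estimate)  # close to 0, up to Monte Carlo error
```

Goodness-of-fit tests use this in reverse: if the average is far from zero for some $h$ , the samples are unlikely to come from $p$ .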

6.3 Stein Variational Gradient Descent (SVGD)

Idea: Use score function to transport particles to target distribution.

Update rule for particle $x_{i}$ :

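x_{i} \leftarrow x_{i} + \frac{ϵ}{n} \sum_{j = 1}^{n} [k (x_{j}, x_{i}) \nabla_{x_{j}} \log p (x_{j}) + \nabla_{x_{j}} k (x_{j}, x_{i})]

where $k (\cdot, \cdot)$ is a kernel function, $n$ is the number of particles, and $ϵ$ is the step size. The first term pulls particles toward high-density regions; the second pushes them apart so they do not collapse onto a single mode.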

Advantage: Non-parametric, deterministic particle-based sampling.
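
A compact 1-D sketch of the update rule, averaging over particles and assuming an RBF kernel with a fixed bandwidth (the median heuristic is a common alternative), a Gaussian target, and illustrative values for the step size and particle count:

```python
import math
import random

def svgd_step(particles, score_fn, step_size, bandwidth):
    """One SVGD update with the RBF kernel k(x, y) = exp(-(x - y)^2 / (2 * bandwidth^2))."""
    n = len(particles)
    updated = []
    for x_i in particles:
        phi = 0.0
        for x_j in particles:
            k = math.exp(-(x_j - x_i) ** 2 / (2.0 * bandwidth ** 2))
            grad_k = -(x_j - x_i) / bandwidth ** 2 * k    # derivative of k w.r.t. x_j
            phi += k * score_fn(x_j) + grad_k             # attraction + repulsion
        updated.append(x_i + step_size * phi / n)
    return updated

mu, sigma = 2.0, 1.0

def score_fn(x):
    """Score of the target N(mu, sigma^2)."""
    return -(x - mu) / sigma ** 2

rng = random.Random(0)
particles = [rng.gauss(-3.0, 0.5) for _ in range(30)]     # start far from the target
for _ in range(1000):
    particles = svgd_step(particles, score_fn, step_size=0.1, bandwidth=1.0)

mean = sum(particles) / len(particles)
spread = max(particles) - min(particles)
print(mean, spread)  # mean near mu; the particles stay spread out rather than collapsing
```

The only randomness is in the initialization; the updates themselves are deterministic.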

6.4 Score and Information Geometry

Fisher information metric:

g_{i j} (θ) = E [\frac{\partial \log p (x; θ)}{\partial θ_{i}} \frac{\partial \log p (x; θ)}{\partial θ_{j}}]

Defines a Riemannian geometry on the space of probability distributions.

Natural gradient:

{\tilde{\nabla}}_{θ} L = I (θ)^{- 1} \nabla_{θ} L

Accounts for the geometry of the statistical manifold.

7. Practical Implementation

7.1 Network Architectures

Common choices:

U-Net (for images):
- Multi-scale processing
- Skip connections
- Attention mechanisms
MLP (for low-dimensional data):
- Simple and efficient
- Good for toy examples
Transformer (for sequences):
- Global receptive field
- Self-attention

Time/noise conditioning:

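import torch.nn as nn

# SinusoidalEmbedding and UNet are placeholders for modules defined elsewhere.
class ScoreNetwork(nn.Module):
    def __init__(self):
        super().__init__()
        self.time_embedding = SinusoidalEmbedding(dim=128)
        self.backbone = UNet()

    def forward(self, x, sigma):
        # Embed the noise level (or time step) into a feature vector
        t_emb = self.time_embedding(sigma)

        # Predict the score, conditioned on the embedding
        score = self.backbone(x, t_emb)

        return score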

7.2 Training Best Practices

1. Noise schedule design:

Geometric: $σ_{i} = σ_{min} (σ_{max} / σ_{min})^{(i - 1) / (L - 1)}$ for $i = 1, \dots, L$ , so that $σ_{1} = σ_{min}$ and $σ_{L} = σ_{max}$
Cover wide range: $σ_{min} \approx 0.01$ , $σ_{max} \approx 1.0$
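
A geometric schedule is a few lines of code. This sketch uses a hypothetical helper, `geometric_noise_schedule`, and chooses the exponent so that the first and last levels equal `sigma_min` and `sigma_max` exactly; the endpoint values follow the range suggested above.

```python
def geometric_noise_schedule(sigma_min, sigma_max, num_levels):
    """Return num_levels noise scales spaced geometrically from sigma_min to sigma_max."""
    if num_levels < 2:
        raise ValueError("num_levels must be at least 2")
    ratio = sigma_max / sigma_min
    return [sigma_min * ratio ** (i / (num_levels - 1)) for i in range(num_levels)]

sigmas = geometric_noise_schedule(0.01, 1.0, 10)
print([round(s, 4) for s in sigmas])
```

Consecutive levels differ by a constant factor, so neighboring perturbed distributions overlap to a similar degree at every scale.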

2. Loss weighting:

$λ_{i} = σ_{i}^{2}$ (balances contribution from different scales)
Importance sampling for critical noise levels

3. Regularization:

Gradient clipping
Weight decay
EMA (exponential moving average) of parameters

4. Evaluation:

Sample quality (FID, IS)
Likelihood estimation
Score accuracy (if ground truth available)

7.3 Sampling Tips

Langevin dynamics parameters:

Step size: $η \propto σ^{2}$
Steps per noise level: 10-100
Noise schedule: 10-50 levels

Common issues:

Too large step size: Instability, divergence
Too small step size: Slow mixing, poor samples
Too few steps: Incomplete convergence

8. Applications

8.1 Image Generation

Score-based models:

NCSN (Noise Conditional Score Network)
NCSN++ (improved architecture)
Score [[Stochastic Differential Equation (SDE)|SDE]] (continuous-time)

Advantages:

High-quality samples
Exact likelihood computation (via the probability flow ODE)
Continuous latent space

8.2 [[Diffusion Model|Diffusion Models]]

Role of score:

Forward process: Add noise
Reverse process: Learn score to denoise
Sampling: Use score for [[Langevin Dynamics|Langevin dynamics]] or ODE integration

Key papers:

[[Diffusion Model|DDPM]]: Implicit score learning via noise prediction
Score [[Stochastic Differential Equation (SDE)|SDE]]: Explicit score matching framework
[[DPM-Solver]]: Fast ODE solver using score function

8.3 Audio Synthesis

Score-based audio generation:

Waveform-level modeling
High-fidelity audio
Faster than autoregressive models

8.4 Molecular Generation

Applications:

Drug discovery
Material design
Protein folding

Advantages:

Continuous representation
Exact likelihood (via the probability flow ODE)
Controllable generation

8.5 Anomaly Detection

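Idea: Anomalies lie in low-density regions, far from the training data, where the learned score (especially at larger noise levels) tends to have large magnitude because it points strongly back toward the data. For a Gaussian, for example, $∥ s (x) ∥ = ∥ Σ^{- 1} (x - μ) ∥$ grows with distance from the mean.

Method:

Train score model on normal data
Compute score norm $∥ s_{θ} (x) ∥$ for test points
Flag points with unusually large score norm as anomalies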

8.6 Energy-Based Models

Connection: Score function of energy-based model $p (x) = \frac{1}{Z} e^{- E (x)}$ :

\nabla_{x} \log p (x) = - \nabla_{x} E (x)

Applications:

Image inpainting
Super-resolution
Compositional generation

9. Comparison with Related Methods

9.1 Score Matching vs Variational Inference

| Aspect | Score Matching | Variational Inference |
| --- | --- | --- |
| Objective | Match gradients | Maximize ELBO |
| Normalization | Not required | Required |
| Flexibility | High (unnormalized) | Constrained (bound) |
| Computation | Score estimation | Optimization |

9.2 Score-Based vs Likelihood-Based

| Aspect | Score-Based | Likelihood-Based |
| --- | --- | --- |
| Target | $\nabla \log p (x)$ | $p (x)$ |
| Normalization | Not needed | Required |
| Training | Score matching | Maximum likelihood |
| Sampling | [[Langevin Dynamics]] | Direct (ancestral sampling or an inverse transform) |

9.3 Generative Model Comparison

| Model | Uses Score? | Sampling | Likelihood | Training |
| --- | --- | --- | --- | --- |
| GAN | No | Fast (1 step) | Intractable | Adversarial |
| VAE | No | Fast (1 step) | Lower bound | ELBO |
| Normalizing Flow | No | Parallel | Exact | Maximum likelihood |
| Score-Based | Yes | [[Langevin Dynamics]] | Tractable | Score matching |

10. Core Formula Cards

[!QUOTE] Score Function Definition
$s (x) = \nabla_{x} \log p (x) = \frac{\nabla_{x} p (x)}{p (x)}$

[!QUOTE] Score of Gaussian
$x \sim N (μ, Σ) ⟹ s (x) = - Σ^{- 1} (x - μ)$

[!QUOTE] Denoising Score Matching Loss
$L (θ) = \frac{1}{2 σ^{2}} E [∥ σ s_{θ} (x + σ ϵ) + ϵ ∥^{2}]$

[!QUOTE] Langevin Dynamics
$x_{t + 1} = x_{t} + \frac{η}{2} s_{θ} (x_{t}) + \sqrt{η} z_{t}$

[!QUOTE] Reverse-Time [[Stochastic Differential Equation (SDE)|SDE]]
$d x = [f (x, t) - g (t)^{2} \nabla_{x} \log p_{t} (x)] d t + g (t) d {\bar{W}}_{t}$

[!QUOTE] Score-Noise Relationship ([[Diffusion Model|DDPM]])
$ϵ_{θ} (x_{t}, t) = - \sqrt{1 - {\bar{α}}_{t}} s_{θ} (x_{t}, t)$

[!QUOTE] Fisher Information Matrix
$I = E [s (x) s (x)^{⊤}]$

[!QUOTE] Stein’s Identity
$E_{p (x)} [\nabla_{x} h (x) + h (x) \nabla_{x} \log p (x)] = 0$

Related Concepts

[[Diffusion Model]]
[[Probability Flow ODE]]
[[Stochastic Differential Equation (SDE)]]
[[Fokker-Planck Equation]]
[[Kolmogorov Equations]]
[[DDIM]]
[[DPM-Solver]]
[[Flow Matching]]
[[Wiener Process]]
[[Langevin Dynamics]]
[[Denoising Score Matching]]
[[Score SDE]]
[[U-Net]]
[[Fisher Information]]
[[Energy-Based Model]]
[[Variational Autoencoder (VAE)]]
[[Martingale]]

Dataview Query

LIST
FROM #score_function OR #score_matching OR #diffusion_model
SORT file.ctime DESC

References

Paper: Generative Modeling by Estimating Gradients of the Data Distribution (Song & Ermon, 2019)
Paper: Score-Based Generative Modeling through SDEs (Song et al., 2021)
Paper: Denoising Diffusion Probabilistic Models (Ho et al., 2020)
Paper: A Connection between Score Matching and Denoising Autoencoders (Vincent, 2011)
Paper: Estimation of Non-Normalized Statistical Models by Score Matching (Hyvärinen, 2005)
Blog: Generative Modeling by Estimating Gradients of the Data Distribution - Yang Song
Course: CS236 Deep Generative Models (Stanford)
