Score Function

The score function of a probability density $p(x)$ is the gradient of its log-density with respect to the input: $\nabla_x \log p(x)$. It points in the direction of steepest increase in log-density and is fundamental to score-based generative modeling, [[Diffusion Model|diffusion models]], and statistical inference.


1. Core Concept

1.1 Definition

For a probability density function $p(x)$, the score function is:

$$s(x) = \nabla_x \log p(x) = \frac{\nabla_x p(x)}{p(x)}$$

Key properties:

  • Points in the direction of steepest increase in log-density
  • Magnitude indicates how fast the log-density changes in that direction
  • Invariant to scaling: $\nabla_x \log(c\,p(x)) = \nabla_x \log p(x)$ for $c > 0$
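
The scaling invariance can be checked numerically. The following is a minimal sketch, assuming a 1-D unnormalized standard normal and a central-difference derivative; the helper names and the constant `c = 7.3` are illustrative choices, not part of the note.

```python
import numpy as np

def numerical_score(log_density, x, h=1e-5):
    """Central-difference approximation of d/dx log p(x) for scalar x."""
    return (log_density(x + h) - log_density(x - h)) / (2 * h)

def log_p_unnormalized(x):
    # Unnormalized standard normal: log p~(x) = -x^2 / 2
    return -0.5 * x**2

def log_p_scaled(x, c=7.3):
    # Same density multiplied by an arbitrary constant c > 0
    return np.log(c) + log_p_unnormalized(x)

x = 1.5
score_a = numerical_score(log_p_unnormalized, x)
score_b = numerical_score(log_p_scaled, x)
print(score_a, score_b)  # both approximate -x
```

Both values agree because the constant only shifts the log-density, and a shift vanishes under differentiation.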

[!NOTE] Intuitive Understanding
Imagine standing on a hill in fog. The score function tells you which direction goes uphill most steeply and how steep it is, without knowing your absolute elevation.

1.2 Why Score Function?

Advantages over density estimation:

  1. No normalization constant: $\nabla_x \log p(x)$ does not require the normalizing constant $Z = \int \tilde{p}(x)\,dx$ of an unnormalized density $\tilde{p}(x)$
  2. Geometric interpretation: Reveals structure of probability landscape
  3. Sampling: Can generate samples using [[Langevin Dynamics|Langevin dynamics]]
  4. Flexible models: Learn unnormalized densities

Applications:

  • [[Diffusion Model|Diffusion models]] (core component)
  • Score-based generative modeling
  • Statistical inference (Fisher information)
  • Optimization (natural gradient)
  • Anomaly detection

2. Mathematical Properties

2.1 Basic Properties

Zero mean:

$$\mathbb{E}_{p(x)}[s(x)] = \int p(x)\,\nabla_x \log p(x)\,dx = \int \nabla_x p(x)\,dx = 0$$

Fisher Information Matrix:

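$$I = \mathbb{E}[s(x)\,s(x)^\top] = \mathrm{Cov}[s(x)]$$

The second equality holds because the score has zero mean. In classical statistics the score is taken with respect to model parameters $\theta$, and this matrix measures the amount of information that observations carry about those parameters.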
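
Both properties can be checked by Monte Carlo. The sketch below assumes a 2-D Gaussian, whose score $-\Sigma^{-1}(x-\mu)$ is known in closed form; the specific `mu`, `Sigma`, and sample size are arbitrary illustrative values.

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.5], [0.5, 1.0]])
Sigma_inv = np.linalg.inv(Sigma)

# Draw samples and evaluate the closed-form Gaussian score at each one
x = rng.multivariate_normal(mu, Sigma, size=200_000)
scores = -(x - mu) @ Sigma_inv            # row n holds s(x_n); Sigma_inv is symmetric

score_mean = scores.mean(axis=0)                 # should be close to the zero vector
fisher_mc = scores.T @ scores / len(scores)      # should be close to Sigma_inv
print(score_mean)
print(fisher_mc)
```

For a Gaussian, $\mathbb{E}[s(x)s(x)^\top] = \Sigma^{-1}\Sigma\Sigma^{-1} = \Sigma^{-1}$, which is what the second estimate approximates.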

2.2 Score of Gaussian Distribution

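For $x \sim \mathcal{N}(\mu, \Sigma)$:

$$\log p(x) = -\frac{1}{2}(x-\mu)^\top \Sigma^{-1} (x-\mu) + \text{const}$$

$$s(x) = \nabla_x \log p(x) = -\Sigma^{-1}(x-\mu)$$

Special case (standard normal $\mathcal{N}(0, I)$):

$$s(x) = -x$$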
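
As a sanity check, the closed-form Gaussian score can be compared against a finite-difference gradient of the log-density. This is a minimal sketch; the function names and the test point are hypothetical choices made for illustration.

```python
import numpy as np

def gaussian_score(x, mu, Sigma):
    """Closed-form score of N(mu, Sigma) at a single point x."""
    return -np.linalg.solve(Sigma, x - mu)

def gaussian_log_density(x, mu, Sigma):
    """Log-density up to an additive constant (the constant does not affect the score)."""
    diff = x - mu
    return -0.5 * diff @ np.linalg.solve(Sigma, diff)

def finite_difference_grad(f, x, h=1e-5):
    """Central-difference gradient of a scalar function f at x."""
    grad = np.zeros_like(x)
    for i in range(len(x)):
        step = np.zeros_like(x)
        step[i] = h
        grad[i] = (f(x + step) - f(x - step)) / (2 * h)
    return grad

mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.5], [0.5, 1.0]])
x = np.array([0.3, 0.7])

analytic = gaussian_score(x, mu, Sigma)
numeric = finite_difference_grad(lambda z: gaussian_log_density(z, mu, Sigma), x)
print(analytic, numeric)
```

Using `np.linalg.solve` avoids forming $\Sigma^{-1}$ explicitly, which is usually the more stable choice.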

2.3 Score of Mixture Distribution

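For mixture $p(x) = \sum_k w_k p_k(x)$:

$$s(x) = \nabla_x \log \sum_k w_k p_k(x) = \frac{\sum_k w_k \nabla_x p_k(x)}{\sum_k w_k p_k(x)}$$

$$s(x) = \sum_k \underbrace{\frac{w_k p_k(x)}{\sum_j w_j p_j(x)}}_{\gamma_k(x)} s_k(x)$$

where $\gamma_k(x)$ is the responsibility of component $k$, and $s_k(x) = \nabla_x \log p_k(x)$ is its score.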

Interpretation: The score is a weighted average of component scores.
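
The responsibility-weighted form can be sketched for a 1-D Gaussian mixture and compared with a numerical derivative of the log-density. The weights, means, and standard deviations below are arbitrary illustrative values.

```python
import numpy as np

def normal_pdf(x, mean, std):
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2 * np.pi))

def mixture_score(x, weights, means, stds):
    """Score of a 1-D Gaussian mixture at scalar x via responsibilities."""
    weights, means, stds = map(np.asarray, (weights, means, stds))
    weighted = weights * normal_pdf(x, means, stds)    # w_k p_k(x)
    gamma = weighted / weighted.sum()                   # responsibilities gamma_k(x)
    component_scores = -(x - means) / stds**2           # s_k(x) for each Gaussian
    return np.sum(gamma * component_scores)

weights, means, stds = [0.3, 0.7], [-2.0, 3.0], [1.0, 0.5]

def log_mixture(x):
    return np.log(np.sum(np.asarray(weights) * normal_pdf(x, np.asarray(means), np.asarray(stds))))

x, h = 0.4, 1e-5
analytic = mixture_score(x, weights, means, stds)
numeric = (log_mixture(x + h) - log_mixture(x - h)) / (2 * h)
print(analytic, numeric)
```

Near a component's mean the corresponding responsibility approaches 1, so the mixture score locally behaves like that component's score.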

2.4 Score and Hessian

Hessian of log-density:

$$H(x) = \nabla_x^2 \log p(x) = \nabla_x s(x)$$

Relationship:

$$\mathbb{E}[H(x)] = -\mathbb{E}[s(x)\,s(x)^\top] = -I$$

This is a fundamental identity in statistics.
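
A short derivation of this identity, sketched under the assumption that $p$ is smooth and that boundary terms vanish so that $\int \nabla_x^2 p(x)\,dx = 0$:

$$
\begin{aligned}
H(x) &= \nabla_x\!\left(\frac{\nabla_x p(x)}{p(x)}\right) = \frac{\nabla_x^2 p(x)}{p(x)} - s(x)\,s(x)^\top, \\
\mathbb{E}[H(x)] &= \int \nabla_x^2 p(x)\,dx - \mathbb{E}[s(x)\,s(x)^\top] = -\mathbb{E}[s(x)\,s(x)^\top].
\end{aligned}
$$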


3. Score Estimation

3.1 The Challenge

Problem: We don’t know $p(x)$, so we can’t compute $\nabla_x \log p(x)$ analytically.

Solution: Learn a neural network $s_\theta(x)$ to approximate the score function.

3.2 Score Matching

Objective: Minimize Fisher divergence between true score $s(x)$ and model score $s_\theta(x)$:

$$L(\theta) = \frac{1}{2}\mathbb{E}_{p(x)}\left[\|s_\theta(x) - s(x)\|^2\right]$$

Problem: Requires knowing $s(x) = \nabla_x \log p(x)$, which is intractable.

3.3 Denoising Score Matching

Key insight (Vincent, 2011): Add noise to the data and learn to recover it. The score of the noising kernel is known in closed form, so the true data score is never needed.

Algorithm:

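  1. Perturb data: $\tilde{x} = x + \sigma\epsilon$, where $\epsilon \sim \mathcal{N}(0, I)$
  2. Learn score of perturbed distribution: $s_\theta(\tilde{x}) \approx \nabla_{\tilde{x}} \log p_\sigma(\tilde{x})$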

Loss function:

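$$L(\theta) = \frac{1}{2}\mathbb{E}_{x,\tilde{x}}\left[\|s_\theta(\tilde{x}) - \nabla_{\tilde{x}} \log p_\sigma(\tilde{x} \mid x)\|^2\right]$$

For Gaussian noise $\tilde{x} \mid x \sim \mathcal{N}(x, \sigma^2 I)$:

$$\nabla_{\tilde{x}} \log p_\sigma(\tilde{x} \mid x) = -\frac{\tilde{x} - x}{\sigma^2}$$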

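Final loss (substituting $\tilde{x} = x + \sigma\epsilon$ with $\epsilon \sim \mathcal{N}(0, I)$, so that the target score is $-\epsilon/\sigma$):

$$L(\theta) = \frac{1}{2\sigma^2}\mathbb{E}_{x,\epsilon}\left[\|\sigma\,s_\theta(x + \sigma\epsilon) + \epsilon\|^2\right]$$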
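
The denoising objective can be estimated by Monte Carlo. The sketch below assumes 1-D Gaussian data, for which the score of the perturbed density is known exactly, and compares that score against two deliberately wrong candidates; `dsm_loss` and the numeric settings are illustrative assumptions.

```python
import numpy as np

def dsm_loss(score_fn, x, sigma, rng):
    """Monte Carlo estimate of the denoising score matching loss at noise level sigma."""
    eps = rng.standard_normal(x.shape)
    x_tilde = x + sigma * eps
    target = -eps / sigma                    # equals -(x_tilde - x) / sigma^2
    residual = score_fn(x_tilde) - target
    return 0.5 * np.mean(np.sum(residual**2, axis=-1))

data_std, sigma = 1.0, 1.0
x = data_std * np.random.default_rng(0).standard_normal((100_000, 1))

# For Gaussian data the perturbed density is N(0, data_std^2 + sigma^2),
# so its score is known and should give the lowest loss of the three.
perturbed_var = data_std**2 + sigma**2
loss_true = dsm_loss(lambda z: -z / perturbed_var, x, sigma, np.random.default_rng(1))
loss_clean = dsm_loss(lambda z: -z / data_std**2, x, sigma, np.random.default_rng(1))
loss_zero = dsm_loss(lambda z: np.zeros_like(z), x, sigma, np.random.default_rng(1))
print(loss_true, loss_clean, loss_zero)
```

The minimizer is the score of the noise-perturbed density, not of the clean data density, which is why the clean-data score scores worse here.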

3.4 Sliced Score Matching

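Idea: Use random projections to avoid computing the full Jacobian of the score model (the Hessian of the log-density).

Loss:

$$L(\theta) = \mathbb{E}_{p(x),\,p(v)}\left[v^\top \nabla_x s_\theta(x)\,v + \frac{1}{2}\|s_\theta(x)\|^2\right]$$

where $v \sim \mathcal{N}(0, I)$ is a random vector.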

Advantage: Only requires Jacobian-vector products (efficient via autodiff).
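
A rough NumPy sketch of the sliced objective follows. Since NumPy has no autodiff, the Jacobian-vector product is approximated here by a central difference along $v$; this is a stand-in chosen for illustration, not how a deep-learning implementation would compute it. The data are assumed to come from $\mathcal{N}(0, I)$ so that the true score $-x$ is available for comparison.

```python
import numpy as np

def sliced_score_matching_loss(score_fn, x, rng, h=1e-4):
    """Monte Carlo sliced score matching loss with one projection per sample."""
    v = rng.standard_normal(x.shape)
    # Central difference along v approximates the Jacobian-vector product J v
    jvp = (score_fn(x + h * v) - score_fn(x - h * v)) / (2 * h)
    trace_term = np.sum(v * jvp, axis=-1)                  # v^T J v
    norm_term = 0.5 * np.sum(score_fn(x) ** 2, axis=-1)    # 0.5 * ||s(x)||^2
    return np.mean(trace_term + norm_term)

x = np.random.default_rng(0).standard_normal((50_000, 3))   # data from N(0, I)

loss_true = sliced_score_matching_loss(lambda z: -z, x, np.random.default_rng(1))
loss_wrong = sliced_score_matching_loss(lambda z: -2.0 * z, x, np.random.default_rng(1))
print(loss_true, loss_wrong)
```

For this toy case the expected loss of the true score is $-d + d/2 = -d/2$ with $d = 3$, while the doubled score gives $-2d + 2d = 0$.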


4. Score-Based Generative Modeling

4.1 Langevin Dynamics

Core idea: Use score function to generate samples via stochastic dynamics.

Algorithm:

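$$x_{t+1} = x_t + \frac{\eta}{2}\,s_\theta(x_t) + \sqrt{\eta}\,z_t$$

where:

  • $\eta$: Step size
  • $z_t \sim \mathcal{N}(0, I)$: Random noise

Theorem: As $\eta \to 0$ and $T \to \infty$, $x_T \sim p(x)$ (under regularity conditions and with the exact score).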

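```python
import torch

# Langevin dynamics sampling
def langevin_sampling(score_model, x_init, n_steps, step_size):
    """Run unadjusted Langevin dynamics from x_init; step_size is a Python float."""
    x = x_init

    for t in range(n_steps):
        # Compute score at the current sample
        score = score_model(x)

        # Draw fresh Gaussian noise
        noise = torch.randn_like(x)

        # Update: half-step along the score plus sqrt(step_size)-scaled noise.
        # torch.sqrt expects a tensor, so use ** 0.5 for the float step size.
        x = x + (step_size / 2) * score + (step_size ** 0.5) * noise

    return x
```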
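
To see the update rule at work without a trained model, the sketch below runs the same iteration in NumPy with the analytic score of a 1-D Gaussian target. The function name `langevin_sampling_np`, the target parameters, and the step size are assumptions made for this illustration.

```python
import numpy as np

def langevin_sampling_np(score_fn, x_init, n_steps, step_size, rng):
    """NumPy version of the Langevin update, applied to many chains at once."""
    x = np.array(x_init, dtype=float)
    for _ in range(n_steps):
        noise = rng.standard_normal(x.shape)
        x = x + 0.5 * step_size * score_fn(x) + np.sqrt(step_size) * noise
    return x

rng = np.random.default_rng(0)
mean, std = 3.0, 0.5
score_fn = lambda x: -(x - mean) / std**2      # score of the target N(mean, std^2)

# 5000 independent chains, all started at 0
samples = langevin_sampling_np(score_fn, np.zeros(5000), n_steps=1000, step_size=0.01, rng=rng)
print(samples.mean(), samples.std())
```

With a finite step size the chain is slightly biased (the stationary variance is a little larger than the target's), which is why the theorem requires $\eta \to 0$.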

4.2 Noise Conditional Score Networks (NCSN)

Problem: Single noise level doesn’t work well for complex distributions.

Solution: Train score model with multiple noise levels.

Algorithm:

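  1. Noise schedule: $\sigma_1 < \sigma_2 < \dots < \sigma_L$
  2. Perturb data: $\tilde{x} = x + \sigma_i \epsilon$ for random $i$
  3. Learn: $s_\theta(\tilde{x}, i) \approx \nabla_{\tilde{x}} \log p_{\sigma_i}(\tilde{x})$

Loss:

$$L(\theta) = \sum_{i=1}^{L} \lambda_i\,\mathbb{E}\left[\|s_\theta(\tilde{x}, i) - \nabla_{\tilde{x}} \log p_{\sigma_i}(\tilde{x} \mid x)\|^2\right]$$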

Sampling (annealed Langevin dynamics):

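```python
import torch

def annealed_langevin_sampling(score_model, data_shape, n_steps_per_scale,
                               noise_schedule, step_scale=0.1):
    # data_shape and step_scale are explicit parameters added for runnability;
    # step_scale is an assumed hyperparameter, not a value from this note.
    x = torch.randn(data_shape)  # Start from noise

    # noise_schedule is ascending, so iterate from the largest sigma down
    for sigma in reversed(noise_schedule):
        step_size = step_scale * sigma ** 2  # step size proportional to sigma^2
        for t in range(n_steps_per_scale):
            score = score_model(x, sigma)
            noise = torch.randn_like(x)
            # Same Langevin update as before: eta/2 on the score, sqrt(eta) on the noise
            x = x + (step_size / 2) * score + (step_size ** 0.5) * noise

    return x
```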

4.3 Score [[Stochastic Differential Equation (SDE)|SDE]]

Unified framework (Song et al., 2021): Connect score matching with SDEs.

Forward [[Stochastic Differential Equation (SDE)|SDE]]:

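$$dx = f(x,t)\,dt + g(t)\,dW_t$$

Reverse-time [[Stochastic Differential Equation (SDE)|SDE]]:

$$dx = \left[f(x,t) - g(t)^2\,\nabla_x \log p_t(x)\right]dt + g(t)\,d\bar{W}_t$$

Key insight: The reverse [[Stochastic Differential Equation (SDE)|SDE]] requires the score function $\nabla_x \log p_t(x)$ at all times $t$.

Training:

$$L(\theta) = \mathbb{E}_{t,\,x_0,\,x_t}\left[\|s_\theta(x_t, t) - \nabla_{x_t} \log p_{0t}(x_t \mid x_0)\|^2\right]$$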
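
The reverse-time SDE can be illustrated on a toy problem where the time-dependent score is known exactly. The sketch below assumes $f = 0$, a constant $g$, and 1-D Gaussian data, and uses plain Euler-Maruyama steps; `reverse_sde_sample` and all numeric settings are hypothetical choices for this example, with the analytic score standing in for a trained $s_\theta$.

```python
import numpy as np

def reverse_sde_sample(score_fn, g, x_T, T=1.0, n_steps=1000, rng=None):
    """Euler-Maruyama integration of the reverse-time SDE with f = 0 and constant g."""
    rng = np.random.default_rng() if rng is None else rng
    dt = T / n_steps
    x = np.array(x_T, dtype=float)
    for k in range(n_steps, 0, -1):
        t = k * dt
        z = rng.standard_normal(x.shape)
        # Stepping from t to t - dt flips the sign of the drift -g^2 * score
        x = x + g**2 * score_fn(x, t) * dt + g * np.sqrt(dt) * z
    return x

rng = np.random.default_rng(0)
data_std, g, T = 0.5, 3.0, 1.0

# The forward SDE dx = g dW turns N(0, data_std^2) into N(0, data_std^2 + g^2 t),
# so the score at time t is available in closed form for this toy case.
score_fn = lambda x, t: -x / (data_std**2 + g**2 * t)

x_T = np.sqrt(data_std**2 + g**2 * T) * rng.standard_normal(20_000)
x_0 = reverse_sde_sample(score_fn, g, x_T, T=T, rng=rng)
print(x_0.mean(), x_0.std())   # std should be close to data_std
```

Starting from the wide terminal distribution, the reverse dynamics contract the samples back to approximately the data standard deviation.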

5. Score Function in [[Diffusion Model|Diffusion Models]]

5.1 Connection to [[Diffusion Model|DDPM]]

[[Diffusion Model|DDPM]] objective:

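$$L = \mathbb{E}\left[\|\epsilon - \epsilon_\theta(x_t, t)\|^2\right]$$

Relationship to score:

For $x_t = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon$:

$$\nabla_{x_t} \log q(x_t \mid x_0) = -\frac{\epsilon}{\sqrt{1-\bar{\alpha}_t}}$$

Therefore:

$$\epsilon_\theta(x_t, t) = -\sqrt{1-\bar{\alpha}_t}\;s_\theta(x_t, t)$$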

Key insight: [[Diffusion Model|DDPM]] learns noise prediction, which is equivalent to learning the score function!
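
The conversion between noise prediction and score is a one-line rescaling. The following sketch checks it against the closed-form score of the Gaussian kernel $q(x_t \mid x_0)$; the helper names and the value of `alpha_bar_t` are illustrative assumptions.

```python
import numpy as np

def eps_to_score(eps, alpha_bar_t):
    """Convert a noise prediction into a score estimate."""
    return -eps / np.sqrt(1.0 - alpha_bar_t)

def score_to_eps(score, alpha_bar_t):
    """Convert a score estimate into a noise prediction."""
    return -np.sqrt(1.0 - alpha_bar_t) * score

rng = np.random.default_rng(0)
alpha_bar_t = 0.6
x0 = rng.standard_normal(4)
eps = rng.standard_normal(4)
x_t = np.sqrt(alpha_bar_t) * x0 + np.sqrt(1.0 - alpha_bar_t) * eps

# Score of q(x_t | x0) = N(sqrt(alpha_bar_t) * x0, (1 - alpha_bar_t) * I)
kernel_score = -(x_t - np.sqrt(alpha_bar_t) * x0) / (1.0 - alpha_bar_t)
print(kernel_score, eps_to_score(eps, alpha_bar_t))
```

In practice this means a pretrained noise-prediction network can be plugged into score-based samplers after rescaling by $-1/\sqrt{1-\bar{\alpha}_t}$.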

5.2 Unified View

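| Framework | Learns | Relationship |
| --- | --- | --- |
| Score Matching | $s_\theta(x) \approx \nabla_x \log p(x)$ | Direct score estimation |
| [[Diffusion Model\|DDPM]] | $\epsilon_\theta(x_t, t)$ | $\epsilon_\theta = -\sigma_t s_\theta$, with $\sigma_t = \sqrt{1-\bar{\alpha}_t}$ |
| Score [[Stochastic Differential Equation (SDE)\|SDE]] | $s_\theta(x, t) \approx \nabla_x \log p_t(x)$ | Time-dependent score |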

5.3 [[Probability Flow ODE]]

The [[Probability Flow ODE]] explicitly uses the score function:

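$$dx = \left[f(x,t) - \frac{1}{2}g(t)^2\,\nabla_x \log p_t(x)\right]dt$$

Interpretation:

  • $f(x,t)$: Drift term (deterministic dynamics)
  • $-\frac{1}{2}g(t)^2\,\nabla_x \log p_t(x)$: Score-guided correction, which moves samples toward high-density regions when the ODE is integrated backward in time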
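
A deterministic counterpart of sampling can be sketched by integrating this ODE backward in time with Euler steps. The example assumes $f = 0$, constant $g$, and 1-D Gaussian data so that both the score and the exact ODE solution are known; the function name and settings are hypothetical.

```python
import numpy as np

def probability_flow_ode(x_T, score_fn, g, T=1.0, n_steps=2000):
    """Euler integration of dx = -0.5 * g^2 * score dt from t = T down to t = 0 (f = 0)."""
    dt = T / n_steps
    x = np.array(x_T, dtype=float)
    for k in range(n_steps, 0, -1):
        t = k * dt
        drift = -0.5 * g**2 * score_fn(x, t)
        x = x - drift * dt            # step backward in time
    return x

data_std, g, T = 0.5, 3.0, 1.0
# Score of p_t = N(0, data_std^2 + g^2 t) for Gaussian data under dx = g dW
score_fn = lambda x, t: -x / (data_std**2 + g**2 * t)

x_T = np.array([-4.0, 1.0, 2.5])
x_0 = probability_flow_ode(x_T, score_fn, g, T)

# For this Gaussian case the exact flow rescales x by std(0) / std(T)
expected = x_T * data_std / np.sqrt(data_std**2 + g**2 * T)
print(x_0, expected)
```

Unlike the reverse SDE, the mapping from $x_T$ to $x_0$ is deterministic, so the same starting point always yields the same sample.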

6. Advanced Topics

6.1 Fisher Divergence

Definition:

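$$D_F(p \,\|\, q) = \frac{1}{2}\mathbb{E}_{p(x)}\left[\|\nabla_x \log p(x) - \nabla_x \log q(x)\|^2\right]$$

Properties:

  • Not symmetric: $D_F(p \,\|\, q) \neq D_F(q \,\|\, p)$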
  • Zero iff p=q (up to normalization)
  • Related to KL divergence via Taylor expansion

6.2 Stein’s Identity

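Theorem: For any smooth function $h(x)$ such that $h(x)\,p(x) \to 0$ as $\|x\| \to \infty$:

$$\mathbb{E}_{p(x)}\left[\nabla_x h(x) + h(x)\,\nabla_x \log p(x)\right] = 0$$

Proof (integration by parts, with vanishing boundary terms):

$$\int \nabla_x h(x)\,p(x)\,dx = -\int h(x)\,\nabla_x p(x)\,dx = -\int h(x)\,\nabla_x \log p(x)\,p(x)\,dx$$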
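
The identity is easy to check by Monte Carlo. This sketch assumes $p = \mathcal{N}(0, 1)$, whose score is $-x$, and the test function $h(x) = \sin(x)$; both are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(500_000)     # samples from p = N(0, 1)

score = -x                            # score of N(0, 1)
h = np.sin(x)                         # test function h(x) = sin(x)
grad_h = np.cos(x)                    # its derivative

# Monte Carlo estimate of E[ h'(x) + h(x) * score(x) ], which should be near 0
stein_estimate = np.mean(grad_h + h * score)
print(stein_estimate)
```

The estimate fluctuates around zero at the usual $1/\sqrt{N}$ Monte Carlo rate.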

Applications:

  • Stein variational gradient descent (SVGD)
  • Goodness-of-fit tests
  • Variational inference

6.3 Stein Variational Gradient Descent (SVGD)

Idea: Use score function to transport particles to target distribution.

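Update rule for particle $x_i$:

$$x_i \leftarrow x_i + \frac{\epsilon}{n}\sum_{j=1}^{n}\left[k(x_j, x_i)\,\nabla_{x_j} \log p(x_j) + \nabla_{x_j} k(x_j, x_i)\right]$$

where $k(\cdot,\cdot)$ is a kernel function, $\epsilon$ is the step size, and $n$ is the number of particles.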

Advantage: Non-parametric, deterministic particle-based sampling.
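
A minimal 1-D sketch of the update follows, assuming an RBF kernel with a fixed bandwidth and a Gaussian target $\mathcal{N}(2, 1)$. The function name, bandwidth, step size, and iteration count are illustrative assumptions; practical implementations often choose the bandwidth adaptively.

```python
import numpy as np

def svgd_step(particles, score_fn, step_size=0.1, bandwidth=1.0):
    """One SVGD update for 1-D particles with an RBF kernel of fixed bandwidth."""
    diff = particles[:, None] - particles[None, :]        # diff[j, i] = x_j - x_i
    kernel = np.exp(-diff**2 / (2 * bandwidth**2))         # k(x_j, x_i)
    grad_kernel = -diff / bandwidth**2 * kernel            # d k(x_j, x_i) / d x_j
    scores = score_fn(particles)                           # score at each x_j
    # Average over j: attraction along the score plus kernel repulsion
    phi = (kernel * scores[:, None] + grad_kernel).mean(axis=0)
    return particles + step_size * phi

rng = np.random.default_rng(0)
particles = rng.standard_normal(100)          # initial particles around 0
target_score = lambda x: -(x - 2.0)           # score of the target N(2, 1)

for _ in range(1000):
    particles = svgd_step(particles, target_score)

print(particles.mean(), particles.std())
```

The first term pulls particles toward high-density regions, while the kernel-gradient term pushes them apart so they do not all collapse onto the mode.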

6.4 Score and Information Geometry

Fisher information metric:

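$$g_{ij}(\theta) = \mathbb{E}\left[\frac{\partial \log p(x;\theta)}{\partial \theta_i}\,\frac{\partial \log p(x;\theta)}{\partial \theta_j}\right]$$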

Defines a Riemannian geometry on the space of probability distributions.

Natural gradient:

$$\tilde{\nabla}_\theta L = I(\theta)^{-1}\,\nabla_\theta L$$

Accounts for the geometry of the statistical manifold.


7. Practical Implementation

7.1 Network Architectures

Common choices:

  1. U-Net (for images):

    • Multi-scale processing
    • Skip connections
    • Attention mechanisms
  2. MLP (for low-dimensional data):

    • Simple and efficient
    • Good for toy examples
  3. Transformer (for sequences):

    • Global receptive field
    • Self-attention

Time/noise conditioning:

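```python
import torch.nn as nn

# SinusoidalEmbedding and UNet are not defined in this note; they are assumed
# to be provided elsewhere (an embedding module and a conditional backbone).
class ScoreNetwork(nn.Module):
    def __init__(self):
        super().__init__()
        self.time_embedding = SinusoidalEmbedding(dim=128)
        self.backbone = UNet()

    def forward(self, x, sigma):
        # Embed noise level
        t_emb = self.time_embedding(sigma)

        # Predict score, conditioned on the noise-level embedding
        score = self.backbone(x, t_emb)

        return score
```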

7.2 Training Best Practices

1. Noise schedule design:

  • Geometric: $\sigma_i = \sigma_{\min}\,(\sigma_{\max}/\sigma_{\min})^{i/L}$
  • Cover wide range: $\sigma_{\min} \approx 0.01$, $\sigma_{\max} \approx 1.0$
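
The geometric schedule $\sigma_i = \sigma_{\min}(\sigma_{\max}/\sigma_{\min})^{i/L}$ for $i = 1, \dots, L$ can be generated in a few lines. The function name and the default of 10 levels are assumptions for illustration.

```python
import numpy as np

def geometric_noise_schedule(sigma_min=0.01, sigma_max=1.0, n_levels=10):
    """Noise levels sigma_i = sigma_min * (sigma_max / sigma_min)**(i / L), i = 1..L."""
    i = np.arange(1, n_levels + 1)
    return sigma_min * (sigma_max / sigma_min) ** (i / n_levels)

sigmas = geometric_noise_schedule()
ratios = sigmas[1:] / sigmas[:-1]     # constant for a geometric sequence
print(sigmas)
```

With indices running from 1 to $L$, the largest level equals $\sigma_{\max}$ exactly, while the smallest is one ratio step above $\sigma_{\min}$.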

2. Loss weighting:

  • $\lambda_i = \sigma_i^2$ (balances contribution from different scales)
  • Importance sampling for critical noise levels

3. Regularization:

  • Gradient clipping
  • Weight decay
  • EMA (exponential moving average) of parameters

4. Evaluation:

  • Sample quality (FID, IS)
  • Likelihood estimation
  • Score accuracy (if ground truth available)

7.3 Sampling Tips

Langevin dynamics parameters:

  • Step size: $\eta \propto \sigma^2$
  • Steps per noise level: 10-100
  • Noise schedule: 10-50 levels

Common issues:

  • Too large step size: Instability, divergence
  • Too small step size: Slow mixing, poor samples
  • Too few steps: Incomplete convergence

8. Applications

8.1 Image Generation

Score-based models:

  • NCSN (Noise Conditional Score Network)
  • NCSN++ (improved architecture)
  • Score [[Stochastic Differential Equation (SDE)|SDE]] (continuous-time)

Advantages:

  • High-quality samples
  • Exact likelihood computation
  • Continuous latent space

8.2 [[Diffusion Model|Diffusion Models]]

Role of score:

  • Forward process: Add noise
  • Reverse process: Learn score to denoise
  • Sampling: Use score for [[Langevin Dynamics|Langevin dynamics]] or ODE integration

Key papers:

  • [[Diffusion Model|DDPM]]: Implicit score learning via noise prediction
  • Score [[Stochastic Differential Equation (SDE)|SDE]]: Explicit score matching framework
  • [[DPM-Solver]]: Fast ODE solver using score function

8.3 Audio Synthesis

Score-based audio generation:

  • Waveform-level modeling
  • High-fidelity audio
  • Faster than autoregressive models

8.4 Molecular Generation

Applications:

  • Drug discovery
  • Material design
  • Protein folding

Advantages:

  • Continuous representation
  • Exact likelihood
  • Controllable generation

8.5 Anomaly Detection

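Idea: Anomalies tend to lie in low-density regions, where the score of the (noise-perturbed) data density is typically large in magnitude and points back toward the data. For example, for $\mathcal{N}(0, \sigma^2 I)$ the norm $\|s(x)\| = \|x\|/\sigma^2$ grows with distance from the mean.

Method:

  1. Train score model on normal data
  2. Compute score norm $\|s_\theta(x)\|$ for test points
  3. Flag points whose score norm is unusually large relative to the training data as anomalies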

8.6 Energy-Based Models

Connection: Score function of energy-based model $p(x) = \frac{1}{Z}e^{-E(x)}$:

$$\nabla_x \log p(x) = -\nabla_x E(x)$$

Applications:

  • Image inpainting
  • Super-resolution
  • Compositional generation

9. Comparisons

9.1 Score Matching vs Variational Inference

| Aspect | Score Matching | Variational Inference |
| --- | --- | --- |
| Objective | Match gradients | Maximize ELBO |
| Normalization | Not required | Required |
| Flexibility | High (unnormalized) | Constrained (bound) |
| Computation | Score estimation | Optimization |

9.2 Score-Based vs Likelihood-Based

| Aspect | Score-Based | Likelihood-Based |
| --- | --- | --- |
| Target | $\nabla_x \log p(x)$ | $p(x)$ |
| Normalization | Not needed | Required |
| Training | Score matching | Maximum likelihood |
| Sampling | [[Langevin Dynamics]] | Direct or inverse |

9.3 Generative Model Comparison

| Model | Uses Score? | Sampling | Likelihood | Training |
| --- | --- | --- | --- | --- |
| GAN | No | Fast (1 step) | Intractable | Adversarial |
| VAE | No | Fast (1 step) | Lower bound | ELBO |
| Normalizing Flow | No | Parallel | Exact | Maximum likelihood |
| Score-Based | Yes | [[Langevin Dynamics]] | Tractable | Score matching |

10. Core Formula Cards

[!QUOTE] Score Function Definition

$$s(x) = \nabla_x \log p(x) = \frac{\nabla_x p(x)}{p(x)}$$

[!QUOTE] Score of Gaussian

$$x \sim \mathcal{N}(\mu, \Sigma) \;\Rightarrow\; s(x) = -\Sigma^{-1}(x - \mu)$$

[!QUOTE] Denoising Score Matching Loss

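$$L(\theta) = \frac{1}{2\sigma^2}\mathbb{E}\left[\|\sigma\,s_\theta(x + \sigma\epsilon) + \epsilon\|^2\right], \quad \epsilon \sim \mathcal{N}(0, I)$$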

[!QUOTE] Langevin Dynamics

$$x_{t+1} = x_t + \frac{\eta}{2}\,s_\theta(x_t) + \sqrt{\eta}\,z_t$$

[!QUOTE] Reverse-Time [[Stochastic Differential Equation (SDE)|SDE]]

$$dx = \left[f(x,t) - g(t)^2\,\nabla_x \log p_t(x)\right]dt + g(t)\,d\bar{W}_t$$

[!QUOTE] Score-Noise Relationship ([[Diffusion Model|DDPM]])

$$\epsilon_\theta(x_t, t) = -\sqrt{1-\bar{\alpha}_t}\;s_\theta(x_t, t)$$

[!QUOTE] Fisher Information Matrix

$$I = \mathbb{E}\left[s(x)\,s(x)^\top\right]$$

[!QUOTE] Stein’s Identity

$$\mathbb{E}_{p(x)}\left[\nabla_x h(x) + h(x)\,\nabla_x \log p(x)\right] = 0$$

  • [[Diffusion Model]]
  • [[Probability Flow ODE]]
  • [[Stochastic Differential Equation (SDE)]]
  • [[Fokker-Planck Equation]]
  • [[Kolmogorov Equations]]
  • [[DDIM]]
  • [[DPM-Solver]]
  • [[Flow Matching]]
  • [[Wiener Process]]
  • [[Langevin Dynamics]]
  • [[Denoising Score Matching]]
  • [[Score SDE]]
  • [[U-Net]]
  • [[Fisher Information]]
  • [[Energy-Based Model]]
  • [[Variational Autoencoder (VAE)]]
  • [[Martingale]]

Dataview Query

```dataview
LIST
FROM #score_function OR #score_matching OR #diffusion_model
SORT file.ctime DESC
```

References

  • Paper: Generative Modeling by Estimating Gradients of the Data Distribution (Song & Ermon, 2019)
  • Paper: Score-Based Generative Modeling through SDEs (Song et al., 2021)
  • Paper: Denoising Diffusion Probabilistic Models (Ho et al., 2020)
  • Paper: A Connection between Score Matching and Denoising Autoencoders (Vincent, 2011)
  • Paper: Estimation of Non-Normalized Statistical Models by Score Matching (Hyvärinen, 2005)
  • Blog: Generative Modeling by Estimating Gradients of the Data Distribution (Yang Song)
  • Course: CS236 Deep Generative Models (Stanford)