Score Function
The score function of a probability density $p(x)$ is the gradient of its log-density with respect to the input: $s(x) = \nabla_x \log p(x)$.
1. Core Concept
1.1 Definition
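For a probability density function $p(x)$, the score function is the gradient of the log-density:
$$s(x) = \nabla_x \log p(x) = \frac{\nabla_x p(x)}{p(x)}$$
Key properties:
- Points in the direction of maximum probability increase
- Magnitude indicates how fast the log-probability changes
- Invariant to scaling: $\nabla_x \log\big(c\,p(x)\big) = \nabla_x \log p(x)$ for any constant $c > 0$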
[!NOTE] Intuitive Understanding
Imagine standing on a hill in fog. The score function tells you which direction goes uphill most steeply and how steep it is, without knowing your absolute elevation.
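The scaling invariance can be checked numerically. The sketch below is illustrative only: `numerical_score`, the toy log-densities, and the constant `c = 37.0` are hypothetical choices. It estimates the score by central differences and compares an unnormalized standard normal with a rescaled copy.

```python
import numpy as np

def numerical_score(log_density, x, h=1e-5):
    """Central-difference estimate of grad_x log p(x) for a 1-D array x."""
    x = np.asarray(x, dtype=float)
    grad = np.zeros_like(x)
    for i in range(x.size):
        step = np.zeros_like(x)
        step[i] = h
        grad[i] = (log_density(x + step) - log_density(x - step)) / (2 * h)
    return grad

def log_p_unnormalized(x):
    # Unnormalized log-density of a standard normal: -||x||^2 / 2.
    return -0.5 * np.sum(x ** 2)

def log_p_scaled(x, c=37.0):
    # Same density multiplied by an arbitrary constant c > 0.
    return np.log(c) + log_p_unnormalized(x)

x = np.array([1.0, -2.0])
score_a = numerical_score(log_p_unnormalized, x)
score_b = numerical_score(log_p_scaled, x)
```

Both estimates agree with each other and with the analytic score $-x$, because the constant only shifts $\log p$ and disappears under the gradient.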
1.2 Why Score Function?
Advantages over density estimation:
- No normalization constant: for $p(x) = \tilde{p}(x)/Z$, the score $\nabla_x \log p(x) = \nabla_x \log \tilde{p}(x)$ doesn’t require knowing $Z$
- Geometric interpretation: Reveals structure of probability landscape
- Sampling: Can generate samples using [[Langevin Dynamics|Langevin dynamics]]
- Flexible models: Learn unnormalized densities
Applications:
- [[Diffusion Model|Diffusion models]] (core component)
- Score-based generative modeling
- Statistical inference (Fisher information)
- Optimization (natural gradient)
- Anomaly detection
2. Mathematical Properties
2.1 Basic Properties
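Zero mean (assuming $p(x) \to 0$ at infinity):
$$\mathbb{E}_{x \sim p}\left[\nabla_x \log p(x)\right] = \int \nabla_x p(x)\,dx = 0$$
Fisher Information Matrix (for a parametric family $p_\theta$, using the score with respect to the parameters):
$$I(\theta) = \mathbb{E}_{x \sim p_\theta}\left[\nabla_\theta \log p_\theta(x)\,\nabla_\theta \log p_\theta(x)^\top\right]$$
Measures the amount of information that observations carry about parameters.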
2.2 Score of Gaussian Distribution
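For $x \sim \mathcal{N}(\mu, \Sigma)$:
$$s(x) = \nabla_x \log p(x) = -\Sigma^{-1}(x - \mu)$$
Special case (standard normal $\mathcal{N}(0, I)$):
$$s(x) = -x$$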
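As a small sketch of the closed form $s(x) = -\Sigma^{-1}(x - \mu)$, the function below evaluates the Gaussian score with a linear solve. The function name and the example mean and covariance are assumptions made for illustration.

```python
import numpy as np

def gaussian_score(x, mean, cov):
    """Score of N(mean, cov) at x: -cov^{-1} (x - mean)."""
    # Solve a linear system instead of forming the inverse explicitly.
    return -np.linalg.solve(cov, x - mean)

mean = np.array([1.0, -1.0])
cov = np.array([[2.0, 0.5],
                [0.5, 1.0]])
x = np.array([0.0, 0.0])

score_at_x = gaussian_score(x, mean, cov)
score_at_mean = gaussian_score(mean, mean, cov)            # zero at the mode
score_std_normal = gaussian_score(np.array([0.3, -0.7]),
                                  np.zeros(2), np.eye(2))  # equals -x
```

The score vanishes at the mean and always points back toward it, scaled by the inverse covariance.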
2.3 Score of Mixture Distribution
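For mixture $p(x) = \sum_{k} \pi_k\, p_k(x)$:
$$s(x) = \sum_{k} w_k(x)\, \nabla_x \log p_k(x)$$
where $w_k(x) = \dfrac{\pi_k\, p_k(x)}{\sum_{j} \pi_j\, p_j(x)}$ is the posterior probability that $x$ came from component $k$.
Interpretation: The score is a weighted average of component scores, with weights that depend on $x$.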
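The weighted-average form can be sketched for a one-dimensional Gaussian mixture. The mixture parameters below are hypothetical, and the finite-difference derivative of $\log p$ is included only as a sanity check.

```python
import numpy as np

def normal_pdf(x, mean, std):
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2 * np.pi))

def mixture_score(x, weights, means, stds):
    """Score of a 1-D Gaussian mixture at scalar x."""
    weights, means, stds = map(np.asarray, (weights, means, stds))
    comp_density = weights * normal_pdf(x, means, stds)   # pi_k * p_k(x)
    resp = comp_density / comp_density.sum()              # w_k(x), sums to 1
    comp_score = -(x - means) / stds ** 2                 # score of each component
    return np.sum(resp * comp_score)

weights, means, stds = [0.3, 0.7], [-2.0, 3.0], [1.0, 0.5]

def log_mixture(x):
    dens = np.asarray(weights) * normal_pdf(x, np.asarray(means), np.asarray(stds))
    return np.log(dens.sum())

x, h = 0.4, 1e-5
analytic = mixture_score(x, weights, means, stds)
numeric = (log_mixture(x + h) - log_mixture(x - h)) / (2 * h)
```

Near one component its weight $w_k(x)$ is close to 1, so the mixture score there is essentially that component's score.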
2.4 Score and Hessian
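Hessian of log-density:
$$H(x) = \nabla_x^2 \log p(x) = \nabla_x s(x)$$
Relationship (assuming $p(x)\,s(x) \to 0$ at infinity):
$$\mathbb{E}_{p}\left[s(x)\,s(x)^\top\right] = -\mathbb{E}_{p}\left[H(x)\right]$$
This is a fundamental identity in statistics; with derivatives taken with respect to $\theta$ instead of $x$, it gives the two equivalent forms of the Fisher information.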
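Both the zero-mean property and the identity $\mathbb{E}[s^2] = -\mathbb{E}[H]$ can be checked numerically in one dimension. The sketch below assumes a hypothetical two-component Gaussian mixture and approximates the expectations with a Riemann sum on a grid.

```python
import numpy as np

# Hypothetical 1-D example: equal-weight mixture of N(-1, 0.8^2) and N(2, 0.5^2).
grid = np.linspace(-10.0, 10.0, 200_001)
dx = grid[1] - grid[0]

def normal_pdf(x, mean, std):
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2 * np.pi))

p = 0.5 * normal_pdf(grid, -1.0, 0.8) + 0.5 * normal_pdf(grid, 2.0, 0.5)
log_p = np.log(p)

score = np.gradient(log_p, dx)      # s(x) = d/dx log p(x)
hessian = np.gradient(score, dx)    # H(x) = d^2/dx^2 log p(x)

# Expectations under p, approximated by a Riemann sum on the grid.
mean_score = np.sum(p * score) * dx
fisher = np.sum(p * score ** 2) * dx
neg_mean_hessian = -np.sum(p * hessian) * dx
```

Up to discretization error, `mean_score` is zero and `fisher` matches `neg_mean_hessian`.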
3. Score Estimation
3.1 The Challenge
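Problem: We don’t know $p_{\text{data}}(x)$; we only have samples from it, so $\nabla_x \log p_{\text{data}}(x)$ cannot be computed directly.
Solution: Learn a neural network $s_\theta(x) \approx \nabla_x \log p_{\text{data}}(x)$.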
3.2 Score Matching
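Objective: Minimize Fisher divergence between true score and model score:
$$J(\theta) = \frac{1}{2}\,\mathbb{E}_{x \sim p_{\text{data}}}\left[\left\| s_\theta(x) - \nabla_x \log p_{\text{data}}(x) \right\|^2\right]$$
Problem: Requires knowing $\nabla_x \log p_{\text{data}}(x)$, which is exactly the unknown quantity.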
3.3 Denoising Score Matching
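Key insight (Vincent, 2011): Add noise to data and learn to recover it. The score of the noise-perturbed distribution can be learned from pairs of clean and noisy samples alone.
Algorithm:
- Perturb data: $\tilde{x} = x + \sigma\epsilon$, where $\epsilon \sim \mathcal{N}(0, I)$
- Learn score of perturbed distribution: $s_\theta(\tilde{x}) \approx \nabla_{\tilde{x}} \log q_\sigma(\tilde{x})$
Loss function:
$$\mathcal{L}_{\text{DSM}}(\theta) = \frac{1}{2}\,\mathbb{E}_{x \sim p_{\text{data}}}\,\mathbb{E}_{\tilde{x} \sim q_\sigma(\tilde{x} \mid x)}\left[\left\| s_\theta(\tilde{x}) - \nabla_{\tilde{x}} \log q_\sigma(\tilde{x} \mid x) \right\|^2\right]$$
For Gaussian noise $q_\sigma(\tilde{x} \mid x) = \mathcal{N}(\tilde{x};\, x, \sigma^2 I)$:
$$\nabla_{\tilde{x}} \log q_\sigma(\tilde{x} \mid x) = -\frac{\tilde{x} - x}{\sigma^2} = -\frac{\epsilon}{\sigma}$$
Final loss:
$$\mathcal{L}_{\text{DSM}}(\theta) = \frac{1}{2}\,\mathbb{E}_{x \sim p_{\text{data}},\; \epsilon \sim \mathcal{N}(0, I)}\left[\left\| s_\theta(x + \sigma\epsilon) + \frac{\epsilon}{\sigma} \right\|^2\right]$$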
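A toy illustration of denoising score matching, under assumptions chosen so that everything has a closed form: data $x \sim \mathcal{N}(0, 1)$, noise level $\sigma = 0.5$, and a one-parameter linear model $s_a(\tilde{x}) = a\,\tilde{x}$ in place of a neural network. The perturbed density is $\mathcal{N}(0, 1 + \sigma^2)$, so the fitted slope should approach $-1/(1 + \sigma^2)$.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 0.5
n = 200_000

x = rng.standard_normal(n)             # clean data, x ~ N(0, 1)
eps = rng.standard_normal(n)           # Gaussian noise
x_tilde = x + sigma * eps              # perturbed data

def dsm_loss(a):
    """Denoising score matching loss for the linear model s(x~) = a * x~."""
    target = -eps / sigma              # equals -(x~ - x) / sigma^2
    return 0.5 * np.mean((a * x_tilde - target) ** 2)

# The loss is quadratic in a, so its minimizer is a least-squares fit.
a_hat = np.sum(x_tilde * (-eps / sigma)) / np.sum(x_tilde ** 2)
a_true = -1.0 / (1.0 + sigma ** 2)     # score of N(0, 1 + sigma^2) is a_true * x~
```

The regression target $-\epsilon/\sigma$ is noisy for any single sample, yet its conditional mean given $\tilde{x}$ is the score of the perturbed density, which is why the fit recovers `a_true`.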
3.4 Sliced Score Matching
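Idea: Implicit score matching (Hyvärinen, 2005) removes the unknown data score by integration by parts, but it needs the trace of the Jacobian $\nabla_x s_\theta(x)$ (the Hessian of the model log-density), which is expensive in high dimensions. Sliced score matching uses random projections to avoid computing it.
Loss:
$$\mathcal{L}_{\text{SSM}}(\theta) = \mathbb{E}_{v \sim p_v}\,\mathbb{E}_{x \sim p_{\text{data}}}\left[ v^\top \nabla_x s_\theta(x)\, v + \frac{1}{2}\left(v^\top s_\theta(x)\right)^2 \right]$$
where $v$ is a random projection direction, e.g. $v \sim \mathcal{N}(0, I)$.
Advantage: Only requires Jacobian-vector products (efficient via autodiff).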
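The sliced objective $\mathbb{E}\left[v^\top \nabla_x s(x)\, v + \tfrac{1}{2}(v^\top s(x))^2\right]$ can be sketched without autodiff by approximating the Jacobian-vector product with a central difference. This is a stand-in chosen to keep the example NumPy-only; the Gaussian data and the two candidate score functions are hypothetical.

```python
import numpy as np

def sliced_score_matching_loss(score_fn, x, v, h=1e-4):
    """Monte Carlo sliced score matching loss for x, v of shape (n, d).

    The Jacobian-vector product (ds/dx) v is approximated by a central
    difference, standing in for an autodiff JVP.
    """
    jvp = (score_fn(x + h * v) - score_fn(x - h * v)) / (2 * h)
    trace_term = np.sum(v * jvp, axis=1)                        # v^T (ds/dx) v
    norm_term = 0.5 * np.sum(v * score_fn(x), axis=1) ** 2      # (v^T s)^2 / 2
    return np.mean(trace_term + norm_term)

rng = np.random.default_rng(0)
cov = np.array([[1.0, 0.6], [0.6, 2.0]])
x = rng.multivariate_normal(np.zeros(2), cov, size=50_000)
v = rng.standard_normal(x.shape)
precision = np.linalg.inv(cov)

def true_score(z):
    return -z @ precision      # exact score of N(0, cov)

def wrong_score(z):
    return -z                  # score of N(0, I), mismatched with the data

loss_true = sliced_score_matching_loss(true_score, x, v)
loss_wrong = sliced_score_matching_loss(wrong_score, x, v)
```

The loss is lower for the true score than for the mismatched one, even though the data score never appears in the objective.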
4. Score-Based Generative Modeling
4.1 Langevin Dynamics
Core idea: Use score function to generate samples via stochastic dynamics.
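Algorithm:
$$x_{t+1} = x_t + \frac{\eta}{2}\,\nabla_x \log p(x_t) + \sqrt{\eta}\, z_t$$
where:
- $\eta$: Step size
- $z_t \sim \mathcal{N}(0, I)$: Random noise
Theorem: As $\eta \to 0$ and $T \to \infty$, the distribution of $x_T$ converges to $p(x)$ under suitable regularity conditions.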
```python
# Langevin Dynamics Sampling
```
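A minimal sketch of Langevin sampling, assuming a target whose score is known in closed form ($\mathcal{N}(3, 2^2)$) rather than a learned model. The function name, step size, and chain count are illustrative choices.

```python
import numpy as np

def langevin_sampling(score_fn, x_init, step_size, n_steps, rng):
    """Unadjusted Langevin dynamics: x <- x + (step/2) * score(x) + sqrt(step) * z."""
    x = np.array(x_init, dtype=float)
    for _ in range(n_steps):
        z = rng.standard_normal(x.shape)
        x = x + 0.5 * step_size * score_fn(x) + np.sqrt(step_size) * z
    return x

# Target N(3, 2^2): its score is -(x - mean) / std^2.
mean, std = 3.0, 2.0

def score_fn(x):
    return -(x - mean) / std ** 2

rng = np.random.default_rng(0)
x0 = rng.uniform(-10.0, 10.0, size=5_000)     # 5000 independent chains
samples = langevin_sampling(score_fn, x0, step_size=0.05, n_steps=2_000, rng=rng)
```

After enough steps the empirical mean and standard deviation of `samples` are close to 3 and 2; a finite step size leaves a small bias, which is why the convergence statement takes the step size to zero.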
4.2 Noise Conditional Score Networks (NCSN)
Problem: Single noise level doesn’t work well for complex distributions.
Solution: Train score model with multiple noise levels.
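Algorithm:
- Noise schedule: $\sigma_1 > \sigma_2 > \cdots > \sigma_L$
- Perturb data: $\tilde{x} = x + \sigma_i \epsilon$ for random $i \in \{1, \dots, L\}$, with $\epsilon \sim \mathcal{N}(0, I)$
- Learn: $s_\theta(\tilde{x}, \sigma_i) \approx \nabla_{\tilde{x}} \log q_{\sigma_i}(\tilde{x})$
Loss:
$$\mathcal{L}(\theta) = \frac{1}{L}\sum_{i=1}^{L} \lambda(\sigma_i)\,\frac{1}{2}\,\mathbb{E}_{x,\,\epsilon}\left[\left\| s_\theta(x + \sigma_i\epsilon,\, \sigma_i) + \frac{\epsilon}{\sigma_i} \right\|^2\right]$$
where $\lambda(\sigma_i) > 0$ weights the contribution of each noise level.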
Sampling (annealed Langevin dynamics):
```python
def annealed_langevin_sampling(score_model, n_steps_per_scale, noise_schedule):
    ...
```
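One way the annealed sampler could be filled in is sketched below. The first three parameters follow the signature above; `x_init`, `base_step`, and `rng` are assumptions added for this sketch, as is the step-size rule `base_step * (sigma / sigma_min) ** 2`, which follows the scaling used for annealed Langevin dynamics in Song & Ermon (2019). The toy score model is analytic, not learned.

```python
import numpy as np

def annealed_langevin_sampling(score_model, n_steps_per_scale, noise_schedule,
                               x_init, base_step=1e-4, rng=None):
    """Annealed Langevin dynamics over a decreasing noise schedule.

    score_model(x, sigma) returns the score of the sigma-perturbed density.
    x_init, base_step and rng are assumptions added for this sketch.
    """
    rng = np.random.default_rng() if rng is None else rng
    x = np.array(x_init, dtype=float)
    sigma_min = noise_schedule[-1]
    for sigma in noise_schedule:                        # largest to smallest
        step = base_step * (sigma / sigma_min) ** 2     # step shrinks with sigma
        for _ in range(n_steps_per_scale):
            z = rng.standard_normal(x.shape)
            x = x + 0.5 * step * score_model(x, sigma) + np.sqrt(step) * z
    return x

# Toy check: data ~ N(2, 0.5^2), so the perturbed density is N(2, 0.25 + sigma^2).
def toy_score_model(x, sigma):
    return -(x - 2.0) / (0.25 + sigma ** 2)

rng = np.random.default_rng(0)
noise_schedule = np.geomspace(10.0, 0.01, num=10)
x_init = rng.uniform(-10.0, 10.0, size=5_000)
samples = annealed_langevin_sampling(toy_score_model, 100, noise_schedule,
                                     x_init, base_step=1e-4, rng=rng)
```

Large noise levels move samples quickly across the whole space, and the smaller levels then refine them toward the data distribution.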
4.3 Score [[Stochastic Differential Equation (SDE)|SDE]]
Unified framework (Song et al., 2021): Connect score matching with SDEs.
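Forward [[Stochastic Differential Equation (SDE)|SDE]]:
$$dx = f(x, t)\,dt + g(t)\,dw$$
where $w$ is a standard Wiener process.
Reverse-time [[Stochastic Differential Equation (SDE)|SDE]]:
$$dx = \left[f(x, t) - g(t)^2\,\nabla_x \log p_t(x)\right]dt + g(t)\,d\bar{w}$$
where $\bar{w}$ is a Wiener process running backward in time.
Key insight: The reverse [[Stochastic Differential Equation (SDE)|SDE]] requires the score function $\nabla_x \log p_t(x)$ of the marginal density at every time $t$.
Training:
$$\mathcal{L}(\theta) = \mathbb{E}_{t}\left[\lambda(t)\,\mathbb{E}_{x_0}\,\mathbb{E}_{x_t \mid x_0}\left[\left\| s_\theta(x_t, t) - \nabla_{x_t} \log p_{0t}(x_t \mid x_0) \right\|^2\right]\right]$$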
5. Score Function in [[Diffusion Model|Diffusion Models]]
5.1 Connection to [[Diffusion Model|DDPM]]
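[[Diffusion Model|DDPM]] objective:
$$\mathcal{L}_{\text{simple}}(\theta) = \mathbb{E}_{t,\, x_0,\, \epsilon}\left[\left\| \epsilon - \epsilon_\theta(x_t, t) \right\|^2\right]$$
Relationship to score:
For $x_t = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1 - \bar{\alpha}_t}\,\epsilon$ with $\epsilon \sim \mathcal{N}(0, I)$:
$$\nabla_{x_t} \log q(x_t \mid x_0) = -\frac{x_t - \sqrt{\bar{\alpha}_t}\,x_0}{1 - \bar{\alpha}_t} = -\frac{\epsilon}{\sqrt{1 - \bar{\alpha}_t}}$$
Therefore:
$$s_\theta(x_t, t) = -\frac{\epsilon_\theta(x_t, t)}{\sqrt{1 - \bar{\alpha}_t}}$$
Key insight: [[Diffusion Model|DDPM]] learns noise prediction, which is equivalent to learning the score function up to the time-dependent factor $-1/\sqrt{1 - \bar{\alpha}_t}$.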
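The conversion between a noise prediction and a score estimate, $s = -\epsilon/\sqrt{1 - \bar{\alpha}_t}$, is a one-line rescaling. The sketch below checks it against the conditional Gaussian score, using the true noise in place of a network output; the value of `alpha_bar_t` is hypothetical.

```python
import numpy as np

def score_from_noise(eps_pred, alpha_bar_t):
    """Convert a noise prediction into a score estimate: -eps / sqrt(1 - alpha_bar_t)."""
    return -eps_pred / np.sqrt(1.0 - alpha_bar_t)

rng = np.random.default_rng(0)
alpha_bar_t = 0.6                         # hypothetical cumulative product at step t
x0 = rng.standard_normal(4)
eps = rng.standard_normal(4)
x_t = np.sqrt(alpha_bar_t) * x0 + np.sqrt(1.0 - alpha_bar_t) * eps

# Score of the Gaussian q(x_t | x_0) = N(sqrt(alpha_bar_t) x_0, (1 - alpha_bar_t) I).
conditional_score = -(x_t - np.sqrt(alpha_bar_t) * x0) / (1.0 - alpha_bar_t)

# With a perfect noise prediction (eps itself) the two expressions coincide.
score_via_eps = score_from_noise(eps, alpha_bar_t)
```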
5.2 Unified View
| Framework | Learns | Relationship |
|---|---|---|
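| Score Matching | $s_\theta(x)$ | Direct score estimation |
| [[Diffusion Model\|DDPM]] | $\epsilon_\theta(x_t, t)$ | $s_\theta(x_t, t) = -\epsilon_\theta(x_t, t)/\sqrt{1 - \bar{\alpha}_t}$ |
| Score [[Stochastic Differential Equation (SDE)\|SDE]] | $s_\theta(x, t)$ | Time-dependent score |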
5.3 [[Probability Flow ODE]]
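The [[Probability Flow ODE]] explicitly uses the score function:
$$\frac{dx}{dt} = f(x, t) - \frac{1}{2}\,g(t)^2\,\nabla_x \log p_t(x)$$
Interpretation:
- $f(x, t)$: Drift term (deterministic dynamics)
- $-\frac{1}{2}\,g(t)^2\,\nabla_x \log p_t(x)$: Score-guided correction; when the ODE is integrated backward in time, it moves samples toward high-density regions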
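A hedged sketch of integrating the ODE $dx/dt = f - \tfrac{1}{2} g^2\,\nabla_x \log p_t(x)$ backward in time with Euler steps. It assumes a variance-preserving SDE with constant rate `beta` ($f = -\tfrac{1}{2}\beta x$, $g = \sqrt{\beta}$) and Gaussian data $\mathcal{N}(m, s^2)$, so the marginal score is available in closed form; all names and constants are illustrative.

```python
import numpy as np

beta = 1.0            # constant noise rate of a VP-type SDE (assumption)
m, s = 2.0, 0.5       # data distribution N(m, s^2) (assumption)

def marginal(t):
    """Mean and variance of p_t for dx = -0.5*beta*x dt + sqrt(beta) dw."""
    decay = np.exp(-beta * t)
    return m * np.sqrt(decay), s ** 2 * decay + (1.0 - decay)

def score(x, t):
    mean_t, var_t = marginal(t)
    return -(x - mean_t) / var_t

def probability_flow_ode(x_T, T=5.0, n_steps=2_000):
    """Euler integration of dx/dt = f - 0.5 * g^2 * score from t = T down to 0."""
    x, dt = float(x_T), T / n_steps
    for k in range(n_steps):
        t = T - k * dt
        drift = -0.5 * beta * x - 0.5 * beta * score(x, t)
        x = x - dt * drift              # step backward in time
    return x

T, z = 5.0, 1.3
mean_T, var_T = marginal(T)
x_T = mean_T + z * np.sqrt(var_T)       # z standard deviations from the mean of p_T
x_0 = probability_flow_ode(x_T, T)      # ends z standard deviations from the mean of p_0
```

Because the map is deterministic, a point that starts `z` standard deviations from the mean of $p_T$ ends about `z` standard deviations from the mean of $p_0$ in this one-dimensional Gaussian case.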
6. Advanced Topics
6.1 Fisher Divergence
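Definition:
$$D_F(p \,\|\, q) = \frac{1}{2}\,\mathbb{E}_{x \sim p}\left[\left\| \nabla_x \log p(x) - \nabla_x \log q(x) \right\|^2\right]$$
Properties:
- Not symmetric: $D_F(p \,\|\, q) \neq D_F(q \,\|\, p)$ in general
- Zero iff $\nabla_x \log p = \nabla_x \log q$ on the support of $p$, i.e. $p = q$ (up to normalization)
- Related to KL divergence via Taylor expansion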
6.2 Stein’s Identity
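Theorem: For any smooth function $f$ such that $p(x)\,f(x) \to 0$ as $\|x\| \to \infty$:
$$\mathbb{E}_{x \sim p}\left[f(x)\,\nabla_x \log p(x)^\top + \nabla_x f(x)\right] = 0$$
Proof: Since $p(x)\,\nabla_x \log p(x) = \nabla_x p(x)$, the integrand is a total derivative:
$$\int \left[f(x)\,\nabla_x p(x)^\top + p(x)\,\nabla_x f(x)\right]dx = \int \nabla_x\big(p(x)\,f(x)\big)\,dx = 0$$
where the last equality follows from the boundary condition.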
Applications:
- Stein variational gradient descent (SVGD)
- Goodness-of-fit tests
- Variational inference
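Stein's identity, $\mathbb{E}_p\left[f(x)\,\nabla_x \log p(x) + \nabla_x f(x)\right] = 0$, can be checked numerically for $p = \mathcal{N}(0, 1)$. The sketch below uses Gauss-Hermite quadrature for the expectation; the test function $f(x) = \sin x + x^3$ is an arbitrary choice.

```python
import numpy as np
from numpy.polynomial.hermite_e import hermegauss

# Quadrature nodes/weights for integrals against exp(-x^2 / 2).
nodes, weights = hermegauss(60)
weights = weights / np.sqrt(2.0 * np.pi)     # normalize to expectations under N(0, 1)

def expect(values):
    return np.sum(weights * values)

score = -nodes                               # score of N(0, 1)

# Test function f(x) = sin(x) + x^3 and its derivative.
f = np.sin(nodes) + nodes ** 3
f_prime = np.cos(nodes) + 3.0 * nodes ** 2

stein_residual = expect(f * score + f_prime)   # vanishes up to quadrature error
```

The same residual, computed with a candidate density's score on samples from another distribution, is the basis of Stein-based goodness-of-fit tests.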
6.3 Stein Variational Gradient Descent (SVGD)
Idea: Use score function to transport particles to target distribution.
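Update rule for particle $x_i$:
$$x_i \leftarrow x_i + \eta\,\hat{\phi}(x_i), \qquad \hat{\phi}(x) = \frac{1}{n}\sum_{j=1}^{n}\left[k(x_j, x)\,\nabla_{x_j} \log p(x_j) + \nabla_{x_j} k(x_j, x)\right]$$
where $k(\cdot, \cdot)$ is a positive definite kernel (e.g. RBF) and $\eta$ is the step size. The first term pulls particles toward high-density regions; the second term repels particles from each other.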
Advantage: Non-parametric, deterministic particle-based sampling.
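A minimal sketch of one SVGD step with an RBF kernel $k(x, y) = \exp(-\|x - y\|^2 / 2h^2)$. The fixed bandwidth, step size, particle count, and Gaussian target are assumptions made for illustration; practical implementations usually adapt the bandwidth, for example with a median heuristic.

```python
import numpy as np

def svgd_step(particles, score_fn, step_size=0.1, bandwidth=1.0):
    """One SVGD update for particles of shape (n, d) with an RBF kernel."""
    diff = particles[:, None, :] - particles[None, :, :]     # diff[i, j] = x_i - x_j
    sq_dist = np.sum(diff ** 2, axis=-1)
    kernel = np.exp(-sq_dist / (2.0 * bandwidth ** 2))       # k(x_j, x_i), symmetric
    scores = score_fn(particles)                             # grad log p at each x_j
    attraction = kernel @ scores                             # sum_j k(x_j, x_i) s(x_j)
    # grad_{x_j} k(x_j, x_i) = k * (x_i - x_j) / h^2: pushes x_i away from x_j.
    repulsion = np.sum(kernel[:, :, None] * diff, axis=1) / bandwidth ** 2
    phi = (attraction + repulsion) / particles.shape[0]
    return particles + step_size * phi

# Target N(target_mean, I) in 2-D; particles start in a tight cluster far away.
target_mean = np.array([2.0, -1.0])

def score_fn(x):
    return -(x - target_mean)

rng = np.random.default_rng(0)
particles = rng.standard_normal((100, 2)) * 0.1 - 5.0
for _ in range(1_000):
    particles = svgd_step(particles, score_fn)
```

The attraction term moves the cluster to the target mean, while the repulsion term keeps the particles from collapsing onto the mode.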
6.4 Score and Information Geometry
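Fisher information metric:
$$g_{ij}(\theta) = \mathbb{E}_{x \sim p_\theta}\left[\partial_{\theta_i} \log p_\theta(x)\;\partial_{\theta_j} \log p_\theta(x)\right] = I(\theta)_{ij}$$
Defines a Riemannian geometry on the space of probability distributions.
Natural gradient:
$$\tilde{\nabla}_\theta \mathcal{L}(\theta) = I(\theta)^{-1}\,\nabla_\theta \mathcal{L}(\theta)$$
Accounts for the geometry of the statistical manifold.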
7. Practical Implementation
7.1 Network Architectures
Common choices:
- U-Net (for images):
  - Multi-scale processing
  - Skip connections
  - Attention mechanisms
- MLP (for low-dimensional data):
  - Simple and efficient
  - Good for toy examples
- Transformer (for sequences):
  - Global receptive field
  - Self-attention
Time/noise conditioning:
```python
class ScoreNetwork(nn.Module):
    ...
```
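A common way to condition on time or noise level, assumed here rather than taken from the class above, is to map the scalar to a sinusoidal feature vector and feed it to the network alongside $x$. The NumPy sketch below shows only that embedding; the function name and defaults are hypothetical.

```python
import numpy as np

def sinusoidal_embedding(t, dim=8, max_period=10_000.0):
    """Map scalar times/noise levels of shape (n,) to features of shape (n, dim)."""
    t = np.asarray(t, dtype=float).reshape(-1, 1)
    half = dim // 2
    # Geometrically spaced frequencies, from 1 down to roughly 1 / max_period.
    freqs = np.exp(-np.log(max_period) * np.arange(half) / half)
    angles = t * freqs                                   # shape (n, half)
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=1)

emb = sinusoidal_embedding([0.0, 0.5, 1.0], dim=8)
```

Each row can then be concatenated with, or added to, the hidden features so that one network represents the score at every noise level.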
7.2 Training Best Practices
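1. Noise schedule design:
- Geometric: $\sigma_i = \sigma_1 \left(\sigma_L / \sigma_1\right)^{(i-1)/(L-1)}$, so that the ratio $\sigma_{i+1}/\sigma_i$ is constant
- Cover wide range: $\sigma_1$ large enough to bridge the modes of the data, $\sigma_L$ small enough that perturbed samples are close to clean data
2. Loss weighting:
- $\lambda(\sigma) = \sigma^2$ (balances contribution from different scales)
- Importance sampling for critical noise levels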
3. Regularization:
- Gradient clipping
- Weight decay
- EMA (exponential moving average) of parameters
4. Evaluation:
- Sample quality (FID, IS)
- Likelihood estimation
- Score accuracy (if ground truth available)
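The geometric schedule and the $\sigma^2$ loss weighting from items 1 and 2 can be sketched in a few lines. The endpoints `10.0` and `0.01` and the number of levels are hypothetical values, not recommendations from this note.

```python
import numpy as np

def geometric_noise_schedule(sigma_max, sigma_min, num_levels):
    """Decreasing sigmas with a constant ratio between consecutive levels."""
    return np.geomspace(sigma_max, sigma_min, num=num_levels)

sigmas = geometric_noise_schedule(sigma_max=10.0, sigma_min=0.01, num_levels=10)
ratios = sigmas[1:] / sigmas[:-1]          # all equal for a geometric schedule
loss_weights = sigmas ** 2                 # lambda(sigma) = sigma^2

# The denoising target -eps/sigma has squared norm ||eps||^2 / sigma^2, so
# multiplying by sigma^2 gives a term of the same scale at every noise level.
```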
7.3 Sampling Tips
Langevin dynamics parameters:
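- Step size: scaled with the noise level, e.g. $\eta_i = \eta_0\,\sigma_i^2/\sigma_L^2$ in annealed Langevin dynamics (Song & Ermon, 2019)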
- Steps per noise level: 10-100
- Noise schedule: 10-50 levels
Common issues:
- Too large step size: Instability, divergence
- Too small step size: Slow mixing, poor samples
- Too few steps: Incomplete convergence
8. Applications
8.1 Image Generation
Score-based models:
- NCSN (Noise Conditional Score Network)
- NCSN++ (improved architecture)
- Score [[Stochastic Differential Equation (SDE)|SDE]] (continuous-time)
Advantages:
- High-quality samples
- Exact likelihood computation
- Continuous latent space
8.2 [[Diffusion Model|Diffusion Models]]
Role of score:
- Forward process: Add noise
- Reverse process: Learn score to denoise
- Sampling: Use score for [[Langevin Dynamics|Langevin dynamics]] or ODE integration
Key papers:
- [[Diffusion Model|DDPM]]: Implicit score learning via noise prediction
- Score [[Stochastic Differential Equation (SDE)|SDE]]: Explicit score matching framework
- [[DPM-Solver]]: Fast ODE solver using score function
8.3 Audio Synthesis
Score-based audio generation:
- Waveform-level modeling
- High-fidelity audio
- Faster than autoregressive models
8.4 Molecular Generation
Applications:
- Drug discovery
- Material design
- Protein folding
Advantages:
- Continuous representation
- Exact likelihood
- Controllable generation
8.5 Anomaly Detection
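Idea: The score norm measures how steep the log-density is at a point. A small norm marks a flat region of the probability landscape, but flat regions include the modes themselves, and for light-tailed densities such as a Gaussian the norm grows with distance from the data ($\|s(x)\| = \|x - \mu\|/\sigma^2$), so the direction of the test has to be calibrated.
Method:
- Train score model on normal data
- Compute score norm $\|s_\theta(x)\|$ for test points
- Flag points whose score norm is atypical compared with its distribution on held-out normal data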
8.6 Energy-Based Models
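Connection: For an energy-based model $p_\theta(x) = \exp(-E_\theta(x))/Z_\theta$, the score function is $s_\theta(x) = -\nabla_x E_\theta(x)$, which does not involve $Z_\theta$.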
Applications:
- Image inpainting
- Super-resolution
- Compositional generation
9. Comparison with Related Methods
9.1 Score Matching vs Variational Inference
| Aspect | Score Matching | Variational Inference |
|---|---|---|
| Objective | Match gradients | Maximize ELBO |
| Normalization | Not required | Required |
| Flexibility | High (unnormalized) | Constrained (bound) |
| Computation | Score estimation | Optimization |
9.2 Score-Based vs Likelihood-Based
| Aspect | Score-Based | Likelihood-Based |
|---|---|---|
| Target | $\nabla_x \log p(x)$ | $\log p(x)$ |
| Normalization | Not needed | Required |
| Training | Score matching | Maximum likelihood |
| Sampling | [[Langevin Dynamics]] | Direct or inverse |
9.3 Generative Model Comparison
| Model | Uses Score? | Sampling | Likelihood | Training |
|---|---|---|---|---|
| GAN | No | Fast (1 step) | Intractable | Adversarial |
| VAE | No | Fast (1 step) | Lower bound | ELBO |
| Normalizing Flow | No | Parallel | Exact | Maximum likelihood |
| Score-Based | Yes | [[Langevin Dynamics]] | Tractable | Score matching |
10. Core Formula Cards
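[!QUOTE] Score Function Definition
$$s(x) = \nabla_x \log p(x)$$
[!QUOTE] Score of Gaussian
$$\nabla_x \log \mathcal{N}(x;\, \mu, \Sigma) = -\Sigma^{-1}(x - \mu)$$
[!QUOTE] Denoising Score Matching Loss
$$\mathcal{L}_{\text{DSM}}(\theta) = \frac{1}{2}\,\mathbb{E}_{x,\,\epsilon}\left[\left\| s_\theta(x + \sigma\epsilon) + \frac{\epsilon}{\sigma} \right\|^2\right]$$
[!QUOTE] Langevin Dynamics
$$x_{t+1} = x_t + \frac{\eta}{2}\,\nabla_x \log p(x_t) + \sqrt{\eta}\, z_t, \quad z_t \sim \mathcal{N}(0, I)$$
[!QUOTE] Reverse-Time [[Stochastic Differential Equation (SDE)|SDE]]
$$dx = \left[f(x, t) - g(t)^2\,\nabla_x \log p_t(x)\right]dt + g(t)\,d\bar{w}$$
[!QUOTE] Score-Noise Relationship ([[Diffusion Model|DDPM]])
$$s_\theta(x_t, t) = -\frac{\epsilon_\theta(x_t, t)}{\sqrt{1 - \bar{\alpha}_t}}$$
[!QUOTE] Fisher Information Matrix
$$I(\theta) = \mathbb{E}_{x \sim p_\theta}\left[\nabla_\theta \log p_\theta(x)\,\nabla_\theta \log p_\theta(x)^\top\right]$$
[!QUOTE] Stein’s Identity
$$\mathbb{E}_{x \sim p}\left[f(x)\,\nabla_x \log p(x)^\top + \nabla_x f(x)\right] = 0$$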
Related Concepts
- [[Diffusion Model]]
- [[Probability Flow ODE]]
- [[Stochastic Differential Equation (SDE)]]
- [[Fokker-Planck Equation]]
- [[Kolmogorov Equations]]
- [[DDIM]]
- [[DPM-Solver]]
- [[Flow Matching]]
- [[Wiener Process]]
- [[Langevin Dynamics]]
- [[Denoising Score Matching]]
- [[Score SDE]]
- [[U-Net]]
- [[Fisher Information]]
- [[Energy-Based Model]]
- [[Variational Autoencoder (VAE)]]
- [[Martingale]]
Dataview Query
```dataview
LIST
```
References
- Paper: Generative Modeling by Estimating Gradients of the Data Distribution (Song & Ermon, 2019)
- Paper: Score-Based Generative Modeling through SDEs (Song et al., 2021)
- Paper: Denoising Diffusion Probabilistic Models (Ho et al., 2020)
- Paper: A Connection between Score Matching and Denoising Autoencoders (Vincent, 2011)
- Paper: Estimation of Non-Normalized Statistical Models by Score Matching (Hyvärinen, 2005)
- Blog: Generative Modeling by Estimating Gradients of the Data Distribution (Yang Song)
- Course: CS236 Deep Generative Models (Stanford)