2026-06-30

Neural ODE

Neural ODE is a continuous-depth deep learning paradigm that parameterizes the hidden state dynamics of a neural network as an ordinary differential equation (ODE). Instead of stacking discrete layers, it defines a continuous transformation flow, enabling memory-efficient training via the adjoint sensitivity method and natural handling of irregular time series.

1. Core Concept

1.1 From Residual Networks to ODEs

A ResNet block updates the hidden state $h_{t}$ at layer $t$ :

h_{t + 1} = h_{t} + f_{θ} (h_{t}, t)

Interpreting each update as an Euler step of size $\Delta t = 1$ and taking the limit of infinitely many infinitesimally small steps ($\Delta t \to 0$):

\frac{d h (t)}{d t} = f_{θ} (h (t), t)

This transforms a discrete sequence of layers into a continuous ODE parameterized by $f_{θ}$ .
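
A quick numerical check of this limit, as a minimal sketch: it assumes each residual update is scaled by a step size Δt = 1/N (the discrete formula above corresponds to Δt = 1) and uses the toy dynamics f(h, t) = -h, whose exact solution is h(1) = e^{-1}. The names `resnet_forward` and `decay` are illustrative and do not come from any library.

```python
import math


def resnet_forward(h0, f, num_layers, total_time=1.0):
    """Apply num_layers residual updates h <- h + dt * f(h, t), dt = total_time / num_layers."""
    dt = total_time / num_layers
    h, t = h0, 0.0
    for _ in range(num_layers):
        h = h + dt * f(h, t)
        t += dt
    return h


def decay(h, t):
    """Toy dynamics f(h, t) = -h with exact solution h(t) = h(0) * exp(-t)."""
    return -h


exact = math.exp(-1.0)
for n in (1, 10, 100, 1000):
    approx = resnet_forward(1.0, decay, n)
    # The gap to the ODE solution shrinks as the number of layers grows
    print(n, approx, abs(approx - exact))
```

As the number of layers grows, the residual stack approaches the ODE solution, which is the sense in which depth becomes integration time.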

[!NOTE] Key Insight
A Neural ODE is not a new architecture but a continuous reinterpretation of residual networks. Where ResNets have discrete “depth,” Neural ODEs have continuous “integration time.”

1.2 Definition

For an input $x$ , the Neural ODE defines the output as the solution of an initial value problem:

z (0) = x, \frac{d z (t)}{d t} = f_{θ} (z (t), t)

The final output is $z (T)$ , obtained by integrating the ODE from $t = 0$ to $t = T$ :

z (T) = z (0) + \int_{0}^{T} f_{θ} (z (t), t) d t

2. Mathematical Foundation

2.1 ODE as a Continuous-Depth Model

Discrete (ResNet)	Continuous (Neural ODE)
$h_{t + 1} = h_{t} + f_{θ} (h_{t}, t)$	$\frac{d h (t)}{d t} = f_{θ} (h (t), t)$
Fixed number of layers $L$	ODE solve time interval $[0, T]$
Discrete depth	Continuous depth
Standard backprop	Adjoint sensitivity

2.2 Neural ODE Block

# Conceptual Neural ODE block (uses torchdiffeq's odeint)
import torch
import torch.nn as nn
from torchdiffeq import odeint

class NeuralODEBlock(nn.Module):
    def __init__(self, func):
        super().__init__()
        # Dynamics network f_theta; torchdiffeq calls it as func(t, z)
        self.func = func

    def forward(self, x, integration_time=1.0):
        # Solve the IVP dz/dt = f_theta(z, t), z(0) = x
        t = torch.tensor([0.0, integration_time], device=x.device)
        z = odeint(self.func, x, t)  # shape: (len(t), *x.shape)
        return z[-1]  # z(T)

2.3 ODE Solver Details

Any black-box ODE solver can be used. Common choices:

Solver	Order	Adaptive	Best For
Euler	1st	No	Simple problems
Midpoint	2nd	No	Moderate accuracy
RK4	4th	No	High accuracy, non-stiff
DOPRI5	5th	Yes	General purpose (default)
DOPRI8	8th	Yes	Very high accuracy
BDF	Variable	Yes	Stiff ODEs
[[DPM-Solver]]	Specialized	No	[[Diffusion Model\|Diffusion]] sampling

[!TIP] Solver Selection
For most Neural ODE applications, DOPRI5 (Dormand-Prince, adaptive 5th-order) provides a good balance of accuracy and speed. Use an implicit solver such as BDF for stiff dynamics.
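
To make the Order column concrete, the sketch below compares fixed-step Euler and RK4 on the toy problem dz/dt = -z (an assumption chosen because the exact answer e^{-1} is known). Halving the step size should shrink the Euler error by roughly 2x and the RK4 error by roughly 16x. All function names are illustrative.

```python
import math


def euler_step(f, z, t, h):
    return z + h * f(z, t)


def rk4_step(f, z, t, h):
    k1 = f(z, t)
    k2 = f(z + 0.5 * h * k1, t + 0.5 * h)
    k3 = f(z + 0.5 * h * k2, t + 0.5 * h)
    k4 = f(z + h * k3, t + h)
    return z + (h / 6.0) * (k1 + 2.0 * k2 + 2.0 * k3 + k4)


def integrate(step, f, z0, t0, t1, n_steps):
    """Fixed-step integration from t0 to t1 with the given one-step method."""
    h = (t1 - t0) / n_steps
    z, t = z0, t0
    for _ in range(n_steps):
        z = step(f, z, t, h)
        t += h
    return z


def decay(z, t):
    return -z


def error(step, n_steps):
    """Absolute error at t = 1 against the exact solution exp(-1)."""
    return abs(integrate(step, decay, 1.0, 0.0, 1.0, n_steps) - math.exp(-1.0))


# Error ratio when the step size is halved: about 2 for Euler, about 16 for RK4
euler_ratio = error(euler_step, 10) / error(euler_step, 20)
rk4_ratio = error(rk4_step, 10) / error(rk4_step, 20)
print(euler_ratio, rk4_ratio)
```

Adaptive solvers such as DOPRI5 use the same idea in reverse: they estimate the local error and pick the step size needed to meet the requested tolerance.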

3. Adjoint Sensitivity Method

3.1 The Memory Problem

Training Neural ODEs by backpropagating through all ODE solver steps would require storing every intermediate state — equivalent to storing all hidden states of an infinitely deep network.

3.2 Adjoint Method Solution

The adjoint method computes gradients without storing intermediate states by solving a second ODE backward in time.

Adjoint state $a (t) = \frac{\partial L}{\partial z (t)}$ :

\frac{d a (t)}{d t} = - a (t)^{⊤} \frac{\partial f_{θ} (z (t), t)}{\partial z}

This adjoint ODE is solved backward from $t = T$ to $t = 0$ , with initial condition:

a (T) = \frac{\partial L}{\partial z (T)}

Parameter gradient:

\frac{d L}{d θ} = - \int_{T}^{0} a (t)^{⊤} \frac{\partial f_{θ} (z (t), t)}{\partial θ} d t

3.3 Augmented ODE System

To compute all gradients in a single backward pass, solve an augmented ODE:

\frac{d}{d t} [\begin{matrix} z (t) \\ a (t) \\ \frac{d L}{d θ} \end{matrix}] = [\begin{matrix} f_{θ} (z, t) \\ - a^{⊤} \frac{\partial f}{\partial z} \\ - a^{⊤} \frac{\partial f}{\partial θ} \end{matrix}]

with initial conditions:

$z (T)$ : forward pass result (saved)
$a (T) = \frac{\partial L}{\partial z (T)}$
$\frac{d L}{d θ} (T) = 0$
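
The augmented system can be checked by hand on a scalar toy problem. The sketch below assumes dynamics f_θ(z, t) = θz and loss L = ½ z(T)², for which dL/dθ = T z(T)² in closed form. It integrates (z, a, dL/dθ) backward from t = T to t = 0 with a hand-written RK4 step; `rk4_step` and `adjoint_gradient` are illustrative names, not library functions.

```python
import math


def rk4_step(field, state, t, h):
    """One RK4 step for a tuple-valued state; h may be negative."""
    def shifted(k, c):
        return tuple(s + c * ki for s, ki in zip(state, k))

    k1 = field(state, t)
    k2 = field(shifted(k1, 0.5 * h), t + 0.5 * h)
    k3 = field(shifted(k2, 0.5 * h), t + 0.5 * h)
    k4 = field(shifted(k3, h), t + h)
    return tuple(
        s + (h / 6.0) * (a + 2.0 * b + 2.0 * c + d)
        for s, a, b, c, d in zip(state, k1, k2, k3, k4)
    )


def adjoint_gradient(x, theta, T=1.0, n_steps=200):
    """Return (z(0), a(0), dL/dtheta) for dz/dt = theta * z and L = 0.5 * z(T)**2."""
    def aug_field(state, t):
        z, a, _ = state
        return (
            theta * z,   # dz/dt = f
            -a * theta,  # da/dt = -a * df/dz
            -a * z,      # d/dt (dL/dtheta) = -a * df/dtheta
        )

    z_T = x * math.exp(theta * T)  # forward pass (closed form for this toy problem)
    state = (z_T, z_T, 0.0)        # a(T) = dL/dz(T) = z(T); gradient starts at 0
    h = -T / n_steps               # negative step: integrate from T down to 0
    t = T
    for _ in range(n_steps):
        state = rk4_step(aug_field, state, t, h)
        t += h
    return state


x, theta, T = 1.5, 0.7, 1.0
z0_rec, a0, grad_theta = adjoint_gradient(x, theta, T)
analytic = T * (x * math.exp(theta * T)) ** 2
print(grad_theta, analytic)
```

The backward solve also recovers z(0) = x and a(0) = dL/dx, so no forward trajectory had to be stored.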

[!NOTE] Memory Efficiency
The adjoint method uses $O (1)$ memory with respect to depth — a constant amount regardless of how many ODE solver steps are taken, making it feasible to train effectively “infinitely deep” networks.

3.4 Gradient Computation Algorithm

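# Adjoint method sketch (assumes torchdiffeq's odeint and the func(t, z) call convention)
import torch
from torchdiffeq import odeint

def neural_ode_backward(z_T, func, loss_grad, T=1.0):
    """
    Compute dL/dθ via the adjoint method.
    z_T: final state z(T) from the forward pass
    func: dynamics module f_theta, called as func(t, z)
    loss_grad: dL/dz(T)
    T: end of the integration interval [0, T]
    """
    params = tuple(func.parameters())

    # Augmented dynamics over (z, a, dL/dθ)
    def aug_dynamics(t, aug_state):
        z, a, _ = aug_state
        with torch.enable_grad():
            z = z.detach().requires_grad_(True)
            f = func(t, z)
            # Vector-Jacobian products a^T ∂f/∂z and a^T ∂f/∂θ
            a_df_dz, *a_df_dtheta = torch.autograd.grad(
                f, (z,) + params, a, allow_unused=True
            )
        # Flatten parameter gradients into one vector (zeros where unused)
        a_df_dtheta_flat = torch.cat([
            torch.zeros_like(p).flatten() if g is None else g.flatten()
            for p, g in zip(params, a_df_dtheta)
        ])
        return (
            f.detach(),         # dz/dt = f(z, t)
            -a_df_dz,           # da/dt = -a^T ∂f/∂z
            -a_df_dtheta_flat,  # d/dt (dL/dθ) = -a^T ∂f/∂θ
        )

    # Solve backward in time from t = T to t = 0
    zero_grad = torch.zeros(sum(p.numel() for p in params),
                            dtype=z_T.dtype, device=z_T.device)
    aug_init = (z_T, loss_grad, zero_grad)
    t_span = torch.tensor([T, 0.0], device=z_T.device)  # Reverse time
    solution = odeint(aug_dynamics, aug_init, t_span)

    # solution is a tuple of (z, a, dL/dθ) trajectories; take dL/dθ at t = 0
    return solution[2][-1]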

3.5 Comparison: Standard Backprop vs Adjoint

Aspect	Standard Backprop	Adjoint Method
Memory	$O (L)$ (linear in depth)	$O (1)$ (constant)
Computation	1 backward pass	Solves additional ODE
Accuracy	Exact (machine precision)	Numerical (tolerance-controlled)
Implementation	Simple (autodiff)	Complex (custom backward)
Best for	Shallow networks	Deep/continuous networks

4. Continuous Normalizing Flows

4.1 From Discrete to Continuous Flows

Discrete Normalizing Flow: A sequence of invertible transformations.

x_{k} = f_{k} \circ f_{k - 1} \circ \dots \circ f_{1} (x_{0})

Continuous Normalizing Flow (CNF): A continuous transformation defined by a Neural ODE.

\frac{d z (t)}{d t} = f_{θ} (z (t), t), z (0) \sim p_{0}

The density evolves according to:

\frac{\partial \log p (z (t))}{\partial t} = - tr (\frac{\partial f_{θ}}{\partial z})

4.2 Likelihood Computation

Instantaneous Change of Variables:

\log p (z (T)) = \log p (z (0)) - \int_{0}^{T} tr (\frac{\partial f_{θ} (z (t), t)}{\partial z}) d t
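
A one-dimensional sanity check of this formula, as a sketch under stated assumptions: the dynamics are linear, f_θ(z, t) = θz, so the divergence is the constant θ, and z(0) is drawn from a standard normal. The flow then maps N(0, 1) to N(0, e^{2θT}), so the integrated log-density can be compared with the exact Gaussian density. `log_normal` and `flow_with_logp` are illustrative names.

```python
import math


def log_normal(z, std=1.0):
    """Log-density of a zero-mean Gaussian with the given standard deviation."""
    return -0.5 * (z / std) ** 2 - math.log(std) - 0.5 * math.log(2.0 * math.pi)


def flow_with_logp(z0, theta, T=1.0, n_steps=100):
    """Integrate z and log p(z(t)) jointly for dz/dt = theta * z."""
    def f(z, t):
        return theta * z

    def divergence(z, t):
        return theta  # d f / d z in one dimension

    h = T / n_steps
    z, t = z0, 0.0
    logp = log_normal(z0)  # log p(z(0)) under the standard normal base
    for _ in range(n_steps):
        # d log p / dt = -tr(df/dz); a left-endpoint rule is exact here
        # because the divergence is constant
        logp -= h * divergence(z, t)
        # RK4 step for the state
        k1 = f(z, t)
        k2 = f(z + 0.5 * h * k1, t + 0.5 * h)
        k3 = f(z + 0.5 * h * k2, t + 0.5 * h)
        k4 = f(z + h * k3, t + h)
        z += (h / 6.0) * (k1 + 2.0 * k2 + 2.0 * k3 + k4)
        t += h
    return z, logp


theta = 0.5
z_T, logp_T = flow_with_logp(0.8, theta)
exact_logp = log_normal(z_T, std=math.exp(theta))  # z(T) ~ N(0, exp(2 * theta * T)), T = 1
print(logp_T, exact_logp)
```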

For high-dimensional data, the trace is estimated using Hutchinson’s estimator:

tr (J) = E_{ϵ \sim N (0, I)} [ϵ^{⊤} J ϵ]
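
The identity can be verified numerically on a small fixed matrix. The sketch below uses only the standard library and an arbitrary 3×3 matrix J chosen for illustration; with many Gaussian probes, the average of ε^⊤ J ε should approach tr(J). The function name `hutchinson_trace` is hypothetical.

```python
import random


def hutchinson_trace(J, num_samples, seed=0):
    """Monte Carlo estimate of tr(J) using Gaussian probe vectors."""
    rng = random.Random(seed)
    n = len(J)
    total = 0.0
    for _ in range(num_samples):
        eps = [rng.gauss(0.0, 1.0) for _ in range(n)]
        # Matrix-vector product J @ eps
        J_eps = [sum(J[i][j] * eps[j] for j in range(n)) for i in range(n)]
        # Quadratic form eps^T J eps
        total += sum(e * v for e, v in zip(eps, J_eps))
    return total / num_samples


J = [
    [1.0, 2.0, 0.0],
    [0.0, 3.0, 1.0],
    [4.0, 0.0, -1.0],
]
exact_trace = sum(J[i][i] for i in range(3))
estimate = hutchinson_trace(J, 20000)
print(estimate, exact_trace)
```

In a CNF the product J ε is never formed explicitly; it is obtained as a vector-Jacobian product from automatic differentiation, which costs about one backward pass per probe.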

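# Hutchinson's trace estimator (one noise sample per batch element)
import math
import torch
from torchdiffeq import odeint

def trace_estimate(f, z, t):
    """Estimate the divergence tr(∂f/∂z) per sample; z must require grad."""
    eps = torch.randn_like(z)
    f_z = f(t, z)  # torchdiffeq convention: func(t, z)
    # Vector-Jacobian product ε^T ∂f/∂z
    eps_dfdz = torch.autograd.grad(f_z, z, eps, create_graph=True)[0]
    # ε^T (∂f/∂z) ε, summed over feature dimensions -> shape (batch,)
    return torch.sum(eps_dfdz * eps, dim=1)

4.3 Training a Continuous Normalizing Flow

def cnf_loss(func, x, T=1.0):
    """Negative log-likelihood of data x (shape: batch x dim) under the CNF."""
    batch_size, dim = x.shape

    # Augmented dynamics (state + change in log-density)
    def aug_dynamics(t, state):
        z, _ = state
        with torch.enable_grad():
            z = z.requires_grad_(True)
            f_val = func(t, z)
            divergence = trace_estimate(func, z, t)
        return (f_val, -divergence.unsqueeze(1))

    # Data lives at t=T and the prior at t=0, so integrate backward from x
    t_span = torch.tensor([T, 0.0], device=x.device)
    init = (x, torch.zeros(batch_size, 1, device=x.device))
    z_traj, delta_traj = odeint(aug_dynamics, init, t_span)
    z_0 = z_traj[-1]
    # delta_logp = ∫_0^T tr(∂f/∂z) dt, accumulated along the backward solve
    delta_logp = delta_traj[-1].squeeze(1)

    # Standard normal prior log-density at z(0)
    log_p_0 = -0.5 * torch.sum(z_0**2, dim=1) - 0.5 * dim * math.log(2 * math.pi)

    # log p(x) = log p(z(0)) - ∫_0^T tr(∂f/∂z) dt; loss = -log p(x)
    loss = -(log_p_0 - delta_logp).mean()
    return loss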

5. Comparison with Related Methods

5.1 Neural ODE vs ResNet

Aspect	ResNet	Neural ODE
Depth	Discrete layers	Continuous time
Parameter count	$O (L \times params)$	$O (params)$ (shared)
Memory (training)	$O (L)$	$O (1)$
Adaptive computation	Fixed layers	Adaptive steps
Time series	Fixed intervals	Arbitrary times
Tuning	Choose $L$	Choose tolerance

5.2 Neural ODE vs [[Probability Flow ODE]]

Aspect	[[Probability Flow ODE\|Probability Flow ODE]]	Neural ODE
Origin	From [[Diffusion Model\|diffusion models]]	From continuous-depth networks
Velocity field	$f (t) x - \frac{1}{2} g (t)^{2} \nabla \log p_{t} (x)$	Learned $f_{θ} (x, t)$
Training	Score matching	End-to-end (adjoint backprop)
Purpose	Generative modeling	General continuous dynamics
Likelihood	Exact via divergence	Exact via instantaneous change of variables (Jacobian trace)

5.3 Neural ODE vs [[Stochastic Differential Equation (SDE)|SDE]]

Aspect	Neural ODE	[[Stochastic Differential Equation (SDE)\|SDE]]
Noise term	None (deterministic)	$g (t) d W_{t}$ (stochastic)
Trajectories	Smooth and unique	Random, non-differentiable
Reversibility	Exactly reversible	Requires reverse-time [[Stochastic Differential Equation (SDE)\|SDE]]
Gradient flow	Adjoint method	Score matching / adjoint (Neural [[Stochastic Differential Equation (SDE)\|SDE]])

5.4 Neural ODE vs [[Flow Matching]]

Aspect	Neural ODE (CNF)	[[Flow Matching\|Flow Matching]]
Training	Likelihood-based (trace needed)	Regression-based (simulation-free)
Computational cost	High (trace estimation)	Low (MSE loss)
Scalability	Limited	High
Velocity field	Learned during training	Directly regressed

6. Applications

6.1 Continuous Normalizing Flows

Density estimation: Exact likelihood via instantaneous change of variables
Generative modeling: Sample from data distribution via CNF
Variational inference: Flexible posterior approximations

6.2 Time Series and Irregular Sampling

Neural ODEs naturally handle irregularly-sampled time series:

# Predicting at arbitrary time points
t_obs = torch.tensor([0.0, 0.5, 1.2, 2.0, 3.7])  # Irregular times
z_solution = odeint(func, z0, t_obs)

Applications:

Medical records with irregular check-ups
Financial data with missing observations
Climate data with varying sensor intervals
Latent ODEs for time series modeling

6.3 Generative Modeling with Diffusion

Neural ODEs connect to [[Diffusion Model|diffusion models]] through:

[[Probability Flow ODE]]: The deterministic counterpart to an [[Stochastic Differential Equation (SDE)|SDE]] is a special Neural ODE
Sampling: The same off-the-shelf ODE solvers used for Neural ODEs can integrate the probability flow ODE, which allows deterministic sampling with adaptive or higher-order steps
[[DPM-Solver]]: Exploits the semi-linear structure for fast ODE integration

6.4 Other Applications

Application	Neural ODE Role	Advantage
Image Classification	Replace ResNet blocks	Adaptive computation, fewer params
Physical Simulation	Learn system dynamics	Continuous, physically interpretable
Reinforcement Learning	Model environment dynamics	Handle continuous-time environments
Molecular Dynamics	Simulate particle motion	Energy conservation, reversibility
Computer Graphics	Neural rendering flow	Smooth interpolation

7. Practical Implementation

7.1 Complete Neural ODE Class

import torch
import torch.nn as nn
from torchdiffeq import odeint, odeint_adjoint

class NeuralODE(nn.Module):
    """Generic Neural ODE wrapper."""

    def __init__(self, func, method='dopri5', rtol=1e-3, atol=1e-6, adjoint=True):
        """
        func: Dynamics network (an nn.Module), called as func(t, z)
        method: ODE solver ('dopri5', 'rk4', 'euler', etc.)
        rtol, atol: Solver tolerances
        adjoint: If True, backpropagate with the adjoint method (odeint_adjoint);
                 otherwise backpropagate through the solver steps (odeint)
        """
        super().__init__()
        self.func = func
        self.method = method
        self.rtol = rtol
        self.atol = atol
        self.adjoint = adjoint

    def forward(self, z0, integration_time=1.0):
        """Solve ODE from t=0 to t=integration_time."""
        t = torch.tensor([0.0, integration_time], device=z0.device)
        solver = odeint_adjoint if self.adjoint else odeint
        solution = solver(
            self.func, z0, t,
            method=self.method,
            rtol=self.rtol, atol=self.atol
        )
        return solution[-1]

# Example: ODE function (MLP)
class ODEFunc(nn.Module):
    def __init__(self, dim, hidden_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim + 1, hidden_dim),  # +1 for time
            nn.Tanh(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.Tanh(),
            nn.Linear(hidden_dim, dim)
        )
    
    def forward(self, t, z):
        # Concatenate time to every point in batch
        t_expanded = t.expand(z.shape[0], 1)
        tz = torch.cat([z, t_expanded], dim=1)
        return self.net(tz)

# Assemble model
func = ODEFunc(dim=784, hidden_dim=128)
model = NeuralODE(func, method='dopri5')

7.2 Training with the Adjoint Method

# Training loop with adjoint method
# (num_epochs, dataloader, and loss_fn are assumed to be defined elsewhere)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for epoch in range(num_epochs):
    for inputs, targets in dataloader:
        optimizer.zero_grad()

        # Forward: solve ODE (when the model uses odeint_adjoint,
        # intermediate solver states are not stored)
        z_T = model(inputs)

        # Compute loss at final state
        loss = loss_fn(z_T, targets)

        # Backward: gradients come from the adjoint ODE solve
        loss.backward()

        optimizer.step()

7.3 Numerical Stability Tips

1. Solver Tolerance Tuning:

# Tighter tolerances → more accurate gradients but slower
model = NeuralODE(func, rtol=1e-3, atol=1e-6)  # Good default
model = NeuralODE(func, rtol=1e-4, atol=1e-7)  # High accuracy (slower)

2. Stiff ODE Handling:

Use implicit solvers (BDF)
Add spectral normalization to $f_{θ}$ to limit Lipschitz constant
Gradient clipping to prevent exploding dynamics
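
To illustrate why implicit solvers help with stiff dynamics, the sketch below integrates the linear test problem dz/dt = -λz with λ = 50 and a deliberately large step h = 0.1 (both values are assumptions chosen for illustration). Explicit Euler multiplies the state by (1 - hλ) = -4 at every step and diverges, while backward Euler, the first-order member of the BDF family, multiplies by 1/(1 + hλ) and decays like the true solution.

```python
def explicit_euler(lam, z0, h, n_steps):
    """Forward Euler on dz/dt = -lam * z."""
    z = z0
    for _ in range(n_steps):
        z = z + h * (-lam * z)
    return z


def implicit_euler(lam, z0, h, n_steps):
    """Backward Euler on dz/dt = -lam * z.

    Each step solves z_next = z + h * (-lam * z_next), which for this
    linear problem has the closed form z_next = z / (1 + h * lam).
    """
    z = z0
    for _ in range(n_steps):
        z = z / (1.0 + h * lam)
    return z


lam, h, n_steps = 50.0, 0.1, 20
z_explicit = explicit_euler(lam, 1.0, h, n_steps)
z_implicit = implicit_euler(lam, 1.0, h, n_steps)
print(z_explicit, z_implicit)
```

An adaptive explicit solver would avoid the blow-up by shrinking its step, which is what shows up in practice as a very large number of function evaluations.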

3. Time Regularization:

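# Encourage smooth dynamics with regularization
def time_reg(func, z, t):
    """Penalize large temporal derivatives ∂f/∂t of the dynamics."""
    t = t.detach().clone().requires_grad_(True)  # t must track gradients
    f = func(t, z)  # torchdiffeq convention: func(t, z)
    # Gradient of the summed output, i.e. the sum of ∂f_i/∂t over all entries
    df_dt = torch.autograd.grad(f.sum(), t, create_graph=True)[0]
    return torch.mean(df_dt ** 2)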

7.4 Debugging Checklist

[ ] Check ODE solver convergence (number of function evaluations per solve)
[ ] Monitor forward pass stability (no NaN/Inf in $z (T)$ )
[ ] Verify adjoint gradients match finite differences
[ ] Test with simple dynamics (e.g., $f (z, t) = - z$ ) for sanity check
[ ] Profile memory usage (adjoint should use constant memory)
[ ] Check gradient norm — does it explode/vanish over integration time?
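
The finite-difference item in the checklist can be sketched as follows. The example assumes the toy dynamics dz/dt = -θz with loss L(θ) = ½ z(T)², so the reference gradient is known in closed form; in a real model the closed-form value would be replaced by the gradient returned by the adjoint backward pass. `solve_final_state`, `loss`, and `finite_difference_grad` are illustrative names.

```python
import math


def solve_final_state(theta, z0=1.0, T=1.0, n_steps=200):
    """RK4 solve of dz/dt = -theta * z from t = 0 to t = T."""
    def f(z, t):
        return -theta * z

    h = T / n_steps
    z, t = z0, 0.0
    for _ in range(n_steps):
        k1 = f(z, t)
        k2 = f(z + 0.5 * h * k1, t + 0.5 * h)
        k3 = f(z + 0.5 * h * k2, t + 0.5 * h)
        k4 = f(z + h * k3, t + h)
        z += (h / 6.0) * (k1 + 2.0 * k2 + 2.0 * k3 + k4)
        t += h
    return z


def loss(theta):
    return 0.5 * solve_final_state(theta) ** 2


def finite_difference_grad(fn, theta, eps=1e-5):
    """Central-difference approximation of d fn / d theta."""
    return (fn(theta + eps) - fn(theta - eps)) / (2.0 * eps)


theta = 0.7
fd_grad = finite_difference_grad(loss, theta)
# Closed form: L = 0.5 * exp(-2 * theta * T) with z0 = T = 1, so dL/dtheta = -exp(-2 * theta)
reference_grad = -math.exp(-2.0 * theta)
print(fd_grad, reference_grad)
```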

8. Advantages and Limitations

8.1 Advantages

Advantage	Description
Memory efficiency	$O (1)$ memory via adjoint method
Adaptive computation	ODE solver adapts step size based on dynamics
Natural time series	Handle irregularly-sampled data effortlessly
Parameter efficiency	Parameters shared across continuous depth
Invertibility	Natural for CNFs (just integrate backward)
Theoretical elegance	Connection to differential equations, dynamical systems

8.2 Limitations

Limitation	Description
Training speed	Adjoint method slower than standard backprop for shallow nets
Numerical issues	Stiff ODEs can cause solver to take excessive steps
Depth limitations	Very deep dynamics require tight tolerances
Representation power	Simple linear ODEs have limited expressivity; need more complex $f_{θ}$
Tuning required	Solver tolerance adds hyperparameters

[!WARNING] Not Always Better
For problems where discrete layers suffice (e.g., standard image classification), ResNets typically train faster and often reach equal or better accuracy than Neural ODEs. Neural ODEs shine for problems with continuous dynamics or irregular measurements.

9. Theoretical Analysis

9.1 Expressivity

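Neural ODEs can only represent orientation-preserving diffeomorphisms (smooth invertible maps connected to the identity). This limits expressivity compared to ResNets, which do not need to preserve orientation. For example, in one dimension no Neural ODE can map $x$ to $- x$ , because the trajectories starting at $- 1$ and $1$ would have to cross.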

Guaranteed Properties:

Trajectories never cross (uniqueness of ODE solutions)
Continuous dependence on initial conditions
Deterministic and reversible
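
The non-crossing property has a simple consequence in one dimension: the flow preserves the ordering of initial conditions. The sketch below checks this numerically for a hand-picked Lipschitz vector field f(z, t) = sin(3z) + cos(t), which is an arbitrary choice for illustration and not a trained network.

```python
import math


def f(z, t):
    """Arbitrary smooth, time-dependent 1D vector field."""
    return math.sin(3.0 * z) + math.cos(t)


def flow(z0, T=2.0, n_steps=400):
    """RK4 approximation of the flow map z(0) -> z(T)."""
    h = T / n_steps
    z, t = z0, 0.0
    for _ in range(n_steps):
        k1 = f(z, t)
        k2 = f(z + 0.5 * h * k1, t + 0.5 * h)
        k3 = f(z + 0.5 * h * k2, t + 0.5 * h)
        k4 = f(z + h * k3, t + h)
        z += (h / 6.0) * (k1 + 2.0 * k2 + 2.0 * k3 + k4)
        t += h
    return z


starts = [-2.0, -1.0, 0.0, 1.0, 2.0]
ends = [flow(z0) for z0 in starts]
# Trajectories cannot cross, so the end points keep the order of the start points
order_preserved = all(a < b for a, b in zip(ends, ends[1:]))
print(ends, order_preserved)
```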

9.2 Trajectory Regularity

The regularity of Neural ODE trajectories depends on the dynamics function $f_{θ}$ :

If $f_{θ}$ is $C^{k}$ -smooth, trajectories are $C^{k + 1}$ -smooth
Lipschitz constant of $f_{θ}$ controls how “stiff” the ODE becomes

9.3 Augmented Neural ODEs

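To increase expressivity, append extra dimensions $a (t)$ to the state, initialize them to zero, and let the dynamics act on the joint state (here $a (t)$ denotes the auxiliary dimensions, not the adjoint state of Section 3):

\frac{d}{d t} [\begin{matrix} z (t) \\ a (t) \end{matrix}] = [\begin{matrix} f_{θ} (z, a, t) \\ g_{θ} (z, a, t) \end{matrix}], [\begin{matrix} z (0) \\ a (0) \end{matrix}] = [\begin{matrix} x \\ 0 \end{matrix}]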

This allows Neural ODEs to represent a richer class of functions without losing the continuous-depth benefits.
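
A small illustration of what the extra dimension buys, under stated assumptions: the vector field below is hand-written rather than learned, and it couples the original coordinate z with one auxiliary coordinate a through a rotation. Starting from (z, a) = (x, 0) and integrating to t = 1 rotates the state by π, so the z-component ends at -x, a map that a one-dimensional flow cannot produce.

```python
import math


def aug_field(state, t):
    """Rotation in the (z, a) plane with angular speed pi."""
    z, a = state
    return (-math.pi * a, math.pi * z)


def rk4_step(field, state, t, h):
    def shifted(k, c):
        return tuple(s + c * ki for s, ki in zip(state, k))

    k1 = field(state, t)
    k2 = field(shifted(k1, 0.5 * h), t + 0.5 * h)
    k3 = field(shifted(k2, 0.5 * h), t + 0.5 * h)
    k4 = field(shifted(k3, h), t + h)
    return tuple(
        s + (h / 6.0) * (p + 2.0 * q + 2.0 * r + w)
        for s, p, q, r, w in zip(state, k1, k2, k3, k4)
    )


def augmented_flow(x, T=1.0, n_steps=400):
    """Pad x with a zero auxiliary coordinate and integrate to time T."""
    state = (x, 0.0)
    h = T / n_steps
    t = 0.0
    for _ in range(n_steps):
        state = rk4_step(aug_field, state, t, h)
        t += h
    return state


z_pos, a_pos = augmented_flow(1.0)   # z ends near -1
z_neg, a_neg = augmented_flow(-1.0)  # z ends near +1
print(z_pos, z_neg)
```

The two trajectories pass around each other in the plane instead of crossing on the line, which is the mechanism augmentation relies on.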

10. Core Formula Cards

[!QUOTE] Neural ODE Definition
$\frac{d z (t)}{d t} = f_{θ} (z (t), t), z (0) = x$

[!QUOTE] Forward Pass
$z (T) = x + \int_{0}^{T} f_{θ} (z (t), t) d t$

[!QUOTE] Adjoint State Equation
$\frac{d a (t)}{d t} = - a (t)^{⊤} \frac{\partial f_{θ} (z (t), t)}{\partial z}, a (T) = \frac{\partial L}{\partial z (T)}$

[!QUOTE] Parameter Gradient (Adjoint)
$\frac{d L}{d θ} = - \int_{T}^{0} a (t)^{⊤} \frac{\partial f_{θ} (z (t), t)}{\partial θ} d t$

[!QUOTE] Instantaneous Change of Variables
$\frac{\partial \log p (z (t))}{\partial t} = - tr (\frac{\partial f_{θ} (z (t), t)}{\partial z})$

[!QUOTE] Continuous Normalizing Flow Likelihood
$\log p (z (T)) = \log p (z (0)) - \int_{0}^{T} tr (\frac{\partial f_{θ}}{\partial z}) d t$

[!QUOTE] Hutchinson’s Trace Estimator
$tr (J) = E_{ϵ \sim N (0, I)} [ϵ^{⊤} J ϵ]$

[!QUOTE] Augmented ODE for Training
$\frac{d}{d t} [\begin{matrix} z \\ a \\ \frac{d L}{d θ} \end{matrix}] = [\begin{matrix} f_{θ} (z, t) \\ - a^{⊤} \frac{\partial f}{\partial z} \\ - a^{⊤} \frac{\partial f}{\partial θ} \end{matrix}]$

11. Extensions and Variants

11.1 Neural [[Stochastic Differential Equation (SDE)|SDE]]

Extends Neural ODE to stochastic differential equations:

d z (t) = f_{θ} (z (t), t) d t + g_{θ} (z (t), t) d W_{t}

Provides uncertainty modeling and connects to [[Diffusion Model|diffusion models]].
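
A minimal Euler-Maruyama sketch of this equation, assuming hand-written drift and diffusion functions in place of trained networks: the drift is f(z, t) = -z and the diffusion is the constant g(z, t) = 0.5 (an Ornstein-Uhlenbeck process), so the mean of z(1) should be close to z(0)·e^{-1}. All names are illustrative.

```python
import math
import random


def euler_maruyama(z0, drift, diffusion, T, n_steps, rng):
    """Simulate one path of dz = drift dt + diffusion dW and return z(T)."""
    h = T / n_steps
    z, t = z0, 0.0
    for _ in range(n_steps):
        dW = rng.gauss(0.0, math.sqrt(h))  # Brownian increment ~ N(0, h)
        z = z + drift(z, t) * h + diffusion(z, t) * dW
        t += h
    return z


def drift(z, t):
    return -z


def diffusion(z, t):
    return 0.5


rng = random.Random(0)
finals = [euler_maruyama(1.0, drift, diffusion, 1.0, 100, rng) for _ in range(2000)]
sample_mean = sum(finals) / len(finals)
print(sample_mean, math.exp(-1.0))
```

Setting the diffusion to zero recovers a plain Euler solve of the underlying Neural ODE.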

11.2 Latent ODE

Combines Neural ODE with a variational autoencoder for time series:

Encoder: RNN encodes observations into latent initial state
Decoder: Neural ODE evolves latent state, generates observations at any times

11.3 Neural CDE (Controlled Differential Equation)

Handles time series with irregular observations as input:

d z (t) = f_{θ} (z (t)) d X (t)

where $X (t)$ is a continuous path interpolating the observations.

11.4 Augmented Neural ODE (ANODE)

Adds zero-padding dimensions to increase expressivity, allowing the model to realize maps on the original space that a flow on that space cannot represent, such as orientation-reversing maps.

11.5 Second-Order Neural ODE

Models acceleration rather than velocity:

\frac{d^{2} z (t)}{d t^{2}} = f_{θ} (z (t), \frac{d z (t)}{d t}, t)

Useful for physical systems (Newtonian mechanics) and smoother trajectories.
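
In practice a second-order equation is solved by rewriting it as a first-order system in (z, v) with v = dz/dt. The sketch below does this for the harmonic oscillator, f(z, v, t) = -z, which stands in for a learned acceleration network; with z(0) = 1 and v(0) = 0 the exact solution is z(t) = cos t.

```python
import math


def acceleration(z, v, t):
    """Stand-in for f_theta: harmonic oscillator with unit frequency."""
    return -z


def first_order_field(state, t):
    """Rewrite z'' = f(z, z', t) as the system (z, v)' = (v, f(z, v, t))."""
    z, v = state
    return (v, acceleration(z, v, t))


def rk4_step(field, state, t, h):
    def shifted(k, c):
        return tuple(s + c * ki for s, ki in zip(state, k))

    k1 = field(state, t)
    k2 = field(shifted(k1, 0.5 * h), t + 0.5 * h)
    k3 = field(shifted(k2, 0.5 * h), t + 0.5 * h)
    k4 = field(shifted(k3, h), t + h)
    return tuple(
        s + (h / 6.0) * (p + 2.0 * q + 2.0 * r + w)
        for s, p, q, r, w in zip(state, k1, k2, k3, k4)
    )


def solve(z0, v0, T, n_steps=400):
    state = (z0, v0)
    h = T / n_steps
    t = 0.0
    for _ in range(n_steps):
        state = rk4_step(first_order_field, state, t, h)
        t += h
    return state


# After half a period the oscillator sits at z = -1 with zero velocity
z_half, v_half = solve(1.0, 0.0, math.pi)
print(z_half, v_half)
```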

12. Connection to Diffusion Models

12.1 [[Probability Flow ODE]] as a Neural ODE

The deterministic counterpart to the diffusion [[Stochastic Differential Equation (SDE)|SDE]]:

\frac{d x}{d t} = f (t) x - \frac{1}{2} g (t)^{2} \nabla_{x} \log p_{t} (x)

is a Neural ODE where the vector field is determined by the [[Score Function|score function]] $s_{θ} (x, t) \approx \nabla_{x} \log p_{t} (x)$ .

12.2 [[Score Function]] as the ODE Function

In diffusion models:

$f_{θ} (x, t) = f (t) x - \frac{1}{2} g (t)^{2} s_{θ} (x, t)$ (for the [[Probability Flow ODE]])
The “Neural ODE function” is built from the pre-trained score model
No additional training needed — just plug in the score network
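
The probability flow ODE can be exercised without a trained network in a case where the score is known exactly. The sketch below assumes a variance-exploding setup with f(t) = 0 and g(t) = 1, and Gaussian data N(0, S0²), so that p_t = N(0, S0² + t) and the score is -x / (S0² + t). This closed-form score stands in for s_θ; `S0`, `score`, and `pf_ode_field` are illustrative names.

```python
import math

S0 = 0.5  # assumed standard deviation of the data distribution


def score(x, t):
    """Exact score of p_t = N(0, S0**2 + t); stands in for a score network."""
    return -x / (S0 ** 2 + t)


def pf_ode_field(x, t):
    """dx/dt = f(t) * x - 0.5 * g(t)**2 * score(x, t), with f = 0 and g = 1."""
    f_t, g_sq = 0.0, 1.0
    return f_t * x - 0.5 * g_sq * score(x, t)


def integrate(x0, t0, t1, n_steps=200):
    """RK4 from t0 to t1; t1 < t0 integrates backward in time."""
    h = (t1 - t0) / n_steps
    x, t = x0, t0
    for _ in range(n_steps):
        k1 = pf_ode_field(x, t)
        k2 = pf_ode_field(x + 0.5 * h * k1, t + 0.5 * h)
        k3 = pf_ode_field(x + 0.5 * h * k2, t + 0.5 * h)
        k4 = pf_ode_field(x + h * k3, t + h)
        x += (h / 6.0) * (k1 + 2.0 * k2 + 2.0 * k3 + k4)
        t += h
    return x


x_data = 0.3
x_noisy = integrate(x_data, 0.0, 1.0)      # data -> noise direction
x_back = integrate(x_noisy, 1.0, 0.0)      # noise -> data (sampling direction)
expected = x_data * math.sqrt((S0 ** 2 + 1.0) / S0 ** 2)
print(x_noisy, expected, x_back)
```

The forward solve stretches each point by exactly the factor needed to turn N(0, S0²) into N(0, S0² + 1), and the backward solve undoes it, which is how deterministic sampling works.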

12.3 [[Flow Matching]] and Neural ODEs

[[Flow Matching]] directly learns the Neural ODE vector field without simulation:

L = E [∥ v_{θ} (x_{t}, t) - u_{t} (x_{t} ∣ x_{1}) ∥^{2}]

This is a simulation-free approach to training Neural ODE-based generative models.
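
A sketch of how this loss is estimated from samples, under stated assumptions: the conditional path is the linear interpolation x_t = (1 - t) x_0 + t x_1 with x_0 drawn from a standard normal, so the conditional target is u_t = x_1 - x_0; the data set is a single point c, for which the optimal velocity is (c - x) / (1 - t); and t is restricted to [0, 0.9] to stay away from the singularity at t = 1. The names `cfm_loss`, `v_opt`, and `v_zero` are hypothetical.

```python
import random


def cfm_loss(v, data, num_samples=1000, seed=0, t_max=0.9):
    """Monte Carlo estimate of E || v(x_t, t) - u_t(x_t | x_1) ||^2 in 1D."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_samples):
        x1 = rng.choice(data)          # data sample
        x0 = rng.gauss(0.0, 1.0)       # noise sample
        t = rng.uniform(0.0, t_max)
        x_t = (1.0 - t) * x0 + t * x1  # point on the conditional path
        u = x1 - x0                    # conditional target velocity
        total += (v(x_t, t) - u) ** 2
    return total / num_samples


C = 2.0  # the single data point


def v_opt(x, t):
    """Optimal velocity field when all data mass sits at C."""
    return (C - x) / (1.0 - t)


def v_zero(x, t):
    return 0.0


loss_opt = cfm_loss(v_opt, [C])
loss_zero = cfm_loss(v_zero, [C])
print(loss_opt, loss_zero)
```

No ODE is solved anywhere in the loss, which is what simulation-free means here; an ODE solver is only needed at sampling time.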

[[Continuous Normalizing Flow]]
[[Probability Flow ODE]]
[[Diffusion Model]]
[[Flow Matching]]
[[Stochastic Differential Equation (SDE)]]
[[Score Function]]
[[DPM-Solver]]
[[Optimal Transport]]
[[ResNet]]
[[Recurrent Neural Network (RNN)]]
[[Langevin Dynamics]]
[[Fokker-Planck Equation]]

Dataview Query

LIST
FROM #neural_ode OR #continuous_depth OR #adjoint_method
SORT file.ctime DESC

References

Paper: Neural Ordinary Differential Equations (Chen et al., NeurIPS 2018 — Best Paper)
Paper: Scalable Reversible Generative Models with Free-form Continuous Dynamics (Grathwohl et al., 2019 — FFJORD)
Paper: Augmented Neural ODEs (Dupont et al., NeurIPS 2019)
Paper: Latent ODEs for Irregularly-Sampled Time Series (Rubanova et al., NeurIPS 2019)
Paper: Neural Controlled Differential Equations for Irregular Time Series (Kidger et al., NeurIPS 2020)
Paper: Score-Based Generative Modeling through SDEs (Song et al., 2021)
Paper: [[Flow Matching]] for Generative Modeling (Lipman et al., 2023)
Library: https://github.com/rtqichen/torchdiffeq
Blog: Neural ODEs: Breakdown of the Core Idea — Lilian Weng
Blog: Understanding Neural ODEs — Jonty Sinai
Course: CS236 Deep Generative Models (Stanford)
Talk: Neural ODEs — David Duvenaud (NeurIPS 2018)