EDM offers a general perspective for understanding Diffusion Models through SDEs & ODEs, a framework broad enough to subsume the earlier formulations. In some sense, this is the most important diffusion model theory paper to me, and I highly recommend reading it in depth after reading DDPM, Understanding Diffusion Models: A Unified Perspective, and Score-Based Generative Modeling through Stochastic Differential Equations.
Consider the distribution of a dataset, $p_{\mathrm{data}}(x)$, whose samples are gradually perturbed as time $t$ increases. This change can be described by an SDE:
$$ \mathrm{d}x=f(t)x\mathrm{d}t+g(t)\mathrm{d}w $$
Here, $f$ and $g$ are predefined, chosen so that the forward process transforms the samples into Gaussian noise over time. This SDE has a corresponding reverse-time SDE and a reverse-time ODE (this blog will focus mainly on the ODE, since many off-the-shelf ODE solvers exist). The ODE goes like:
$$ \mathrm{d}x=[f(t)x-\frac{1}{2}g^2(t)\nabla_x\log p_t(x)]\mathrm{d}t $$
By estimating the unknown score function $\nabla_x\log p_t(x)$ using a neural network, we can solve the reverse process to transform noise into samples.
Therefore, Diffusion Models consist of two steps: score function estimation via score matching (training), and sample generation using ODE solvers (inference).
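To make the training half concrete, here is a minimal denoising score matching sketch. The setup is my own toy assumption, not from the paper: 1-D Gaussian data, a single fixed noise level, and a "score network" that is just one linear coefficient $w$ (which happens to be exact here, since the score of Gaussian data is linear in $x$).

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy assumptions: 1-D Gaussian data, one fixed noise level, linear score model.
sigma_d, sigma, n = 1.0, 0.5, 200_000
x0 = rng.standard_normal(n) * sigma_d      # samples from p_data = N(0, sigma_d^2)
eps = rng.standard_normal(n)
xt = x0 + sigma * eps                      # perturbed samples (s_t = 1 here)

# Denoising score matching target at x_t is -eps / sigma;
# fit the linear model score(x) = w * x by least squares.
target = -eps / sigma
w = (xt @ target) / (xt @ xt)

# Analytic score slope of p(x; sigma) = N(0, sigma_d^2 + sigma^2) is -1/(sigma_d^2 + sigma^2)
print(w, -1.0 / (sigma_d**2 + sigma**2))
```

With enough samples, the fitted slope approaches the analytic score slope $-1/(\sigma_d^2+\sigma^2)$; a real model replaces $w\,x$ with a neural network conditioned on the noise level.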
We design $f, g$ to achieve $x_{t}=s_{t}x_{0}+s_{t}\sigma_{t}\epsilon$, where $x_0$ is the sample, $x_t$ is its noisy version at time $t$, and $\sigma_t$ is the standard deviation of the noise. As time increases, the noise should dominate the sample until $x_t$ becomes approximately Gaussian.
To satisfy this requirement, we can derive $f(t)=\dot{s}_{t}/s_{t},\ g(t)=s_{t}\sqrt{2\dot{\sigma}_{t}\sigma_{t}}$ (proof), where $\dot{s}_{t}$ denotes the time derivative of $s_t$. In this way, adjusting $s_t, \sigma_t$ controls how the distribution of $x_t$ evolves, which simplifies both training and inference.
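As a quick numerical sanity check of this relation (my own sketch, not from the paper), take the concrete choice $s_t = 1$, $\sigma_t = t$, which gives $f(t)=0$ and $g(t)=\sqrt{2t}$. Simulating the forward SDE with Euler–Maruyama from a point mass at $0$ should then reproduce the marginal standard deviation $s_t\sigma_t = t$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed schedule for this check: s_t = 1, sigma_t = t  ->  f(t) = 0, g(t) = sqrt(2t)
T, n_steps, n_paths = 2.0, 2000, 100_000
dt = T / n_steps
x = np.zeros(n_paths)          # start from x_0 = 0 (a point "dataset")

t = 0.0
for _ in range(n_steps):
    g = np.sqrt(2.0 * t)
    # Euler-Maruyama step of dx = f*x dt + g dW with f = 0
    x += g * np.sqrt(dt) * rng.standard_normal(n_paths)
    t += dt

# Marginal std at time T should match s_T * sigma_T = T
print(x.std(), T)
```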
So far, there are tons of $t$-dependent variables, which makes things messy. Since we want a given $t$ to clearly indicate a noise level, EDM parameterizes the process by $\sigma$ alone, which directly represents the amount of noise added to the data. Substituting $s_t, \sigma_t$ for $f, g$, the perturbed marginal becomes:
$$ \begin{aligned}p_t(x)&=\int p_{0t}(x|x_0)p_\mathrm{data}(x_0)\mathrm{d}x_0\\&=s_{t}^{-d}p(\frac{x}{s_{t}};\sigma_{t})\end{aligned} $$
then we have $\nabla_{x}\log p_{t}(x)=\nabla_{x}\log p(x/s_{t};\sigma_{t})$, since the $s_{t}^{-d}$ factor (with $d$ the data dimension) is constant in $x$. Substituting this into the ODE:
$$ \mathrm{d}x=[\frac{\dot{s}_t}{s_t}x-s_t^2\dot{\sigma}_t\sigma_t\nabla_x\log p(\frac{x}{s_t};\sigma_t)]\mathrm{d}t $$
If we set $s_t \equiv 1$, this reduces to $\mathrm{d}x=-\dot{\sigma}_t\sigma_t\nabla_x\log p(x;\sigma_t)\mathrm{d}t$.
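Here is a runnable sketch of this simplified ODE on a toy problem where the score is analytic. The assumptions are mine, not the paper's: 1-D data distributed as $\mathcal{N}(0,\sigma_{\mathrm{data}}^2)$ and $\sigma_t=t$, so $p(x;\sigma)=\mathcal{N}(0,\sigma_{\mathrm{data}}^2+\sigma^2)$. Integrating from $t_{\max}$ down to $0$ with Euler should carry Gaussian noise back to samples with standard deviation $\approx\sigma_{\mathrm{data}}$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy assumptions: data ~ N(0, sigma_data^2), s_t = 1, sigma_t = t.
# Then p(x; sigma) = N(0, sigma_data^2 + sigma^2) has analytic score
#   score(x, sigma) = -x / (sigma_data^2 + sigma^2).
sigma_data, t_max, n_steps, n = 1.0, 10.0, 1000, 100_000

# Start from the exact prior at t_max: N(0, sigma_data^2 + t_max^2)
x = rng.standard_normal(n) * np.sqrt(sigma_data**2 + t_max**2)

ts = np.linspace(t_max, 0.0, n_steps + 1)
for t_cur, t_next in zip(ts[:-1], ts[1:]):
    score = -x / (sigma_data**2 + t_cur**2)
    dx_dt = -t_cur * score               # dx/dt = -sigma_dot * sigma * score, sigma_dot = 1
    x += dx_dt * (t_next - t_cur)        # Euler step (dt < 0: integrating backward in time)

print(x.std())                           # should be close to sigma_data
```

Replacing the analytic score with a trained network, and Euler with a higher-order solver, gives the actual sampling procedure.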
Therefore, different choices of $s_t, \sigma_t$ recover the diffusion processes used across different diffusion papers.
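For example, a tiny helper can recover $f, g$ numerically from any schedule. The concrete schedules below ($s_t=1,\ \sigma_t=\sqrt{t}$ for VE and $s_t=1,\ \sigma_t=t$ for EDM) follow my reading of the paper's Table 1 and should be treated as illustrative:

```python
import numpy as np

def drift_diffusion(s, sigma, t, h=1e-5):
    """f(t) = s'/s and g(t) = s * sqrt(2 * sigma' * sigma), via central differences."""
    s_dot = (s(t + h) - s(t - h)) / (2 * h)
    sig_dot = (sigma(t + h) - sigma(t - h)) / (2 * h)
    return s_dot / s(t), s(t) * np.sqrt(2.0 * sig_dot * sigma(t))

# VE-style schedule (assumed): s_t = 1, sigma_t = sqrt(t)  ->  f = 0, g = 1
print(drift_diffusion(lambda t: 1.0, np.sqrt, t=4.0))
# EDM schedule (assumed): s_t = 1, sigma_t = t  ->  f = 0, g = sqrt(2t)
print(drift_diffusion(lambda t: 1.0, lambda t: t, t=4.0))
```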