EDM offers a general perspective for understanding Diffusion Models through SDEs & ODEs, a framework broad enough to subsume the earlier formulations. In some sense, this is the most important diffusion model theory paper to me, and I highly recommend reading it in depth after reading DDPM, Understanding Diffusion Models: A Unified Perspective, and Score-Based Generative Modeling through Stochastic Differential Equations.
Consider the distribution of a dataset, $p_{\mathrm{data}}(x)$, whose samples are gradually perturbed as time $t$ increases. This change can be described by an SDE:
$$ \mathrm{d}x=f(t)x\mathrm{d}t+g(t)\mathrm{d}w $$
Here, $f$ and $g$ are predefined, chosen so that the forward process transforms the samples into Gaussian noise over time. This SDE has a corresponding reverse-time SDE and a reverse-time ODE (this blog will focus mainly on the ODE, since many off-the-shelf ODE solvers exist). The ODE goes like:
$$ \mathrm{d}x=[f(t)x-\frac{1}{2}g^2(t)\nabla_x\log p_t(x)]\mathrm{d}t $$
By estimating the unknown score function $\nabla_x\log p_t(x)$ using a neural network, we can solve the reverse process to transform noise into samples.
Therefore, Diffusion Models consist of two steps: score function estimation via score matching (training), and sample generation using ODE solvers (inference).
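To make the training half concrete, here is a minimal denoising score matching sketch. The setup is my own toy assumption, not from the paper: 1-D Gaussian data, a single fixed noise level, and a "score network" that is just one linear coefficient $w$ (which happens to be exact here, since the score of Gaussian data is linear in $x$).

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy assumptions: 1-D Gaussian data, one fixed noise level, linear score model.
sigma_d, sigma, n = 1.0, 0.5, 200_000
x0 = rng.standard_normal(n) * sigma_d      # samples from p_data = N(0, sigma_d^2)
eps = rng.standard_normal(n)
xt = x0 + sigma * eps                      # perturbed samples (s_t = 1 here)

# Denoising score matching target at x_t is -eps / sigma;
# fit the linear model score(x) = w * x by least squares.
target = -eps / sigma
w = (xt @ target) / (xt @ xt)

# Analytic score slope of p(x; sigma) = N(0, sigma_d^2 + sigma^2) is -1/(sigma_d^2 + sigma^2)
print(w, -1.0 / (sigma_d**2 + sigma**2))
```

With enough samples, the fitted slope approaches the analytic score slope $-1/(\sigma_d^2+\sigma^2)$; a real model replaces $w\,x$ with a neural network conditioned on the noise level.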
We design $f, g$ to achieve $x_{t}=s_{t}x_{0}+s_{t}\sigma_{t}\epsilon$, where $x_0$ is the sample, $x_t$ is its noisy version at time $t$, and $\sigma_t$ is the standard deviation of the noise. As time increases, the noise should dominate the sample until $x_t$ becomes approximately Gaussian.
To satisfy this requirement, we can derive $f(t)=\dot{s}_{t}/s_{t},\ g(t)=s_{t}\sqrt{2\dot{\sigma}_{t}\sigma_{t}}$ (proof), where $\dot{s}_{t}$ denotes the time derivative of $s_t$. In this way, adjusting $s_t, \sigma_t$ controls how the distribution of $x_t$ evolves, which simplifies both training and inference.
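As a quick numerical sanity check of this relation (my own sketch, not from the paper), take the concrete choice $s_t = 1$, $\sigma_t = t$, which gives $f(t)=0$ and $g(t)=\sqrt{2t}$. Simulating the forward SDE with Euler–Maruyama from a point mass at $0$ should then reproduce the marginal standard deviation $s_t\sigma_t = t$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed schedule for this check: s_t = 1, sigma_t = t  ->  f(t) = 0, g(t) = sqrt(2t)
T, n_steps, n_paths = 2.0, 2000, 100_000
dt = T / n_steps
x = np.zeros(n_paths)          # start from x_0 = 0 (a point "dataset")

t = 0.0
for _ in range(n_steps):
    g = np.sqrt(2.0 * t)
    # Euler-Maruyama step of dx = f*x dt + g dW with f = 0
    x += g * np.sqrt(dt) * rng.standard_normal(n_paths)
    t += dt

# Marginal std at time T should match s_T * sigma_T = T
print(x.std(), T)
```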
So far, there are tons of $t$-dependent variables, which makes things messy. Since we want a given $t$ to clearly indicate a noise level, EDM parameterizes the process by $\sigma$ alone, which directly represents the amount of noise added to the data. Substituting $s_t, \sigma_t$ for $f, g$, the perturbed marginal becomes:
$$ \begin{aligned}p_t(x)&=\int p_{0t}(x|x_0)p_\mathrm{data}(x_0)\mathrm{d}x_0\\&=s_{t}^{-d}p(\frac{x}{s_{t}};\sigma_{t})\end{aligned} $$
then we have $\nabla_{x}\log p_{t}(x)=\nabla_{x}\log p(x/s_{t};\sigma_{t})$, since the $s_{t}^{-d}$ factor (with $d$ the data dimension) is constant in $x$. Substituting this into the ODE:
$$ \mathrm{d}x=[\frac{\dot{s}_t}{s_t}x-s_t^2\dot{\sigma}_t\sigma_t\nabla_x\log p(\frac{x}{s_t};\sigma_t)]\mathrm{d}t $$
If we set $s_t \equiv 1$, this reduces to $\mathrm{d}x=-\dot{\sigma}_t\sigma_t\nabla_x\log p(x;\sigma_t)\mathrm{d}t$.
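Here is a runnable sketch of this simplified ODE on a toy problem where the score is analytic. The assumptions are mine, not the paper's: 1-D data distributed as $\mathcal{N}(0,\sigma_{\mathrm{data}}^2)$ and $\sigma_t=t$, so $p(x;\sigma)=\mathcal{N}(0,\sigma_{\mathrm{data}}^2+\sigma^2)$. Integrating from $t_{\max}$ down to $0$ with Euler should carry Gaussian noise back to samples with standard deviation $\approx\sigma_{\mathrm{data}}$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy assumptions: data ~ N(0, sigma_data^2), s_t = 1, sigma_t = t.
# Then p(x; sigma) = N(0, sigma_data^2 + sigma^2) has analytic score
#   score(x, sigma) = -x / (sigma_data^2 + sigma^2).
sigma_data, t_max, n_steps, n = 1.0, 10.0, 1000, 100_000

# Start from the exact prior at t_max: N(0, sigma_data^2 + t_max^2)
x = rng.standard_normal(n) * np.sqrt(sigma_data**2 + t_max**2)

ts = np.linspace(t_max, 0.0, n_steps + 1)
for t_cur, t_next in zip(ts[:-1], ts[1:]):
    score = -x / (sigma_data**2 + t_cur**2)
    dx_dt = -t_cur * score               # dx/dt = -sigma_dot * sigma * score, sigma_dot = 1
    x += dx_dt * (t_next - t_cur)        # Euler step (dt < 0: integrating backward in time)

print(x.std())                           # should be close to sigma_data
```

Replacing the analytic score with a trained network, and Euler with a higher-order solver, gives the actual sampling procedure.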
Therefore, different choices of $s_t, \sigma_t$ recover the diffusion processes used across different diffusion papers.
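For example, a tiny helper can recover $f, g$ numerically from any schedule. The concrete schedules below ($s_t=1,\ \sigma_t=\sqrt{t}$ for VE and $s_t=1,\ \sigma_t=t$ for EDM) follow my reading of the paper's Table 1 and should be treated as illustrative:

```python
import numpy as np

def drift_diffusion(s, sigma, t, h=1e-5):
    """f(t) = s'/s and g(t) = s * sqrt(2 * sigma' * sigma), via central differences."""
    s_dot = (s(t + h) - s(t - h)) / (2 * h)
    sig_dot = (sigma(t + h) - sigma(t - h)) / (2 * h)
    return s_dot / s(t), s(t) * np.sqrt(2.0 * sig_dot * sigma(t))

# VE-style schedule (assumed): s_t = 1, sigma_t = sqrt(t)  ->  f = 0, g = 1
print(drift_diffusion(lambda t: 1.0, np.sqrt, t=4.0))
# EDM schedule (assumed): s_t = 1, sigma_t = t  ->  f = 0, g = sqrt(2t)
print(drift_diffusion(lambda t: 1.0, lambda t: t, t=4.0))
```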