Flow Matching (FM) describes the relationship between a flow $\phi_t$, a vector field $u_t$, and a probability density path $p_t$ (running from a simple Gaussian $p_0$ to the target distribution). It defines the flow matching objective, approximating $u_t$ with a neural network, and gives a practical way to construct such a $u_t$.
Think of a small ball floating in 3D space. It starts at a point we call $x_0$. The ball's movement is governed by a velocity field $u_t(x)$, where $t$ is time and $x$ is the ball's position. If we know where the ball starts, we can find its position $x(t)$ at any time by solving an ODE:
$$ \frac{d\boldsymbol{\phi}_t(\boldsymbol{x})}{dt}=\boldsymbol{u}_t(\boldsymbol{\phi}_t(\boldsymbol{x})),\quad\boldsymbol{\phi}_0(\boldsymbol{x})=\boldsymbol{x} $$
$\boldsymbol{\phi}_t(\boldsymbol{x})$ is the solution to this ODE: it gives the position at time $t$ of a ball that started at $\boldsymbol{x}$ (for our ball, $\boldsymbol{\phi}_t(\boldsymbol{x}_0)$).
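As a quick illustration, here is a minimal sketch of integrating this ODE numerically with a forward-Euler scheme. The velocity field `u_toy`, the function name `integrate_flow`, and the step count are arbitrary choices made for this example only, not part of flow matching itself.

```python
import numpy as np

def integrate_flow(u, x0, n_steps=100, t0=0.0, t1=1.0):
    """Approximate phi_{t1}(x0) by forward-Euler integration of dx/dt = u_t(x)."""
    x = np.asarray(x0, dtype=float)
    dt = (t1 - t0) / n_steps
    t = t0
    for _ in range(n_steps):
        x = x + dt * u(t, x)   # move along the velocity field for one small step
        t += dt
    return x

# Toy velocity field (illustrative only): drift towards the point (1, 1, 1).
u_toy = lambda t, x: np.array([1.0, 1.0, 1.0]) - x
print(integrate_flow(u_toy, x0=np.zeros(3)))   # ~0.63 per coordinate (1 - e^{-1})
```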
Let's expand our view from one ball to many. Imagine many balls spread out in space, with starting positions described by a distribution $p_0(x)$. As the balls move according to the velocity field, their distribution changes over time; we denote the distribution at time $t$ by $p_t(x)$. While we don't need to know exactly what this distribution looks like, we do know it obeys the continuity equation:
$$ \frac{\partial p_t(\boldsymbol{x})}{\partial t}=-\nabla_{\boldsymbol{x}}\cdot\big(\boldsymbol{u}_t(\boldsymbol{x})\,p_t(\boldsymbol{x})\big) $$
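To build intuition, the sketch below numerically checks the continuity equation in one dimension for a hand-picked case: a Gaussian whose mean moves at constant speed $c$, generated by the constant velocity field $u_t(x)=c$. The specific density, field, and constants are illustrative assumptions, not taken from the text.

```python
import numpy as np

# Toy 1-D check: p_t is a Gaussian with mean c*t, u_t(x) = c.  The continuity
# equation says dp_t/dt = -d(u_t * p_t)/dx; verify with central differences.
c, sigma = 2.0, 0.5
gauss = lambda x, mu: np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))
p = lambda t, x: gauss(x, c * t)                 # density at time t
u = lambda t, x: c * np.ones_like(x)             # constant velocity field

t, x, eps = 0.3, np.linspace(-2.0, 3.0, 7), 1e-4
dp_dt = (p(t + eps, x) - p(t - eps, x)) / (2 * eps)        # time derivative
flux = lambda xx: u(t, xx) * p(t, xx)
div_flux = (flux(x + eps) - flux(x - eps)) / (2 * eps)     # spatial derivative of u * p
print(np.max(np.abs(dp_dt + div_flux)))                    # ~1e-7: both sides agree
```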
If we can find such a flow, vector field, and probability path, then we can naturally define a generative process analogous to diffusion.
Let's extend this idea beyond 3D space to any number of dimensions: the density of the balls is a probability distribution. As time passes, the balls move around and their distribution changes from one shape to another. This is the basic idea of flow matching and similar generative methods: we want to transform one distribution $A$ into another distribution $B$.
To make this concrete: we start with an initial distribution $p_0$ and want to reach a final distribution $p_1$. At any time $t$ in between, the distribution is $p_t(x)$, with a corresponding vector field $u_t(x)$, where $u:[0,1]\times\mathbb{R}^n\to\mathbb{R}^n$.
Once we know $u_t$ and have initial samples $x_0 \sim p_0$, we can obtain samples from $p_1$ by solving the ODE above. FM proposes to approximate $u_t$ with a neural network $v_\theta$ by optimizing the following objective:
$$ \mathcal{L}_{\mathrm{FM}}(\theta)=\mathbb{E}_{t,\,\boldsymbol{x}\sim p_t(\boldsymbol{x})}\left[\|\boldsymbol{v}_\theta(\boldsymbol{x},t)-\boldsymbol{u}_t(\boldsymbol{x})\|^2\right] $$
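In code, the objective would look roughly like the sketch below, assuming (hypothetically) that we could sample from $p_t$ and evaluate the true $u_t$; the helpers `sample_p_t` and `u_t` are placeholders, and as explained next, neither is tractable in general. The toy instantiation at the bottom (a Gaussian whose mean moves at unit speed) exists only so the sketch runs end to end.

```python
import torch

def fm_loss(v_theta, sample_p_t, u_t, batch_size=64):
    """One Monte-Carlo estimate of the FM objective above.

    v_theta    : network (x, t) -> predicted velocity      (placeholder name)
    sample_p_t : callable t -> samples x ~ p_t(x)          (hypothetical; intractable in general)
    u_t        : callable (t, x) -> true velocity field    (hypothetical; intractable in general)
    """
    t = torch.rand(batch_size, 1)          # t ~ Uniform[0, 1]
    x = sample_p_t(t)                      # x ~ p_t(x)
    return ((v_theta(x, t) - u_t(t, x)) ** 2).sum(dim=-1).mean()

# Tiny toy instantiation (all choices arbitrary): p_t = N((t, t), I) in 2-D,
# so the constant field u_t(x) = (1, 1) moves the mean at unit speed.
dim = 2
net = torch.nn.Sequential(torch.nn.Linear(dim + 1, 64), torch.nn.ReLU(), torch.nn.Linear(64, dim))
v_theta = lambda x, t: net(torch.cat([x, t], dim=-1))
sample_p_t = lambda t: t + torch.randn(t.shape[0], dim)
u_t = lambda t, x: torch.ones_like(x)
print(fm_loss(v_theta, sample_p_t, u_t).item())
```

A training step would call `fm_loss(...)`, backpropagate, and update $\theta$ with any standard optimizer.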
In general, $p_t(x)$ and $u_t(x)$ have no tractable closed-form solution. However, we can use conditional paths $p_t(x\mid x_1)$ and conditional vector fields $u_t(x\mid x_1)$ (assuming $u_t(x\mid x_1)$ generates the corresponding $p_t(x\mid x_1)$) to recover $p_t(x)$ and $u_t(x)$ while still satisfying the continuity equation, if we define the following (proof):