Föllmer’s drift, Itô’s lemma, and the log-Sobolev inequality

1. Construction of Föllmer’s drift

In a previous post, we saw how an entropy-optimal drift process could be used to prove the Brascamp-Lieb inequalities. Our main tool was a result of Föllmer that we now recall and justify. Afterward, we will use it to prove the Gaussian log-Sobolev inequality.

Consider {f : \mathbb R^n \rightarrow \mathbb R_+} with {\int f \,d\gamma_n = 1} , where {\gamma_n} is the standard Gaussian measure on {\mathbb R^n} . Let {\{B_t\}} denote an {n} -dimensional Brownian motion with {B_0=0} . We consider all processes of the form

\displaystyle  W_t = B_t + \int_0^t v_s\,ds\,, \ \ \ \ \ (1)

where {\{v_s\}} is a progressively measurable drift such that {W_1} has law {f\,d\gamma_n} .

Theorem 1 (Föllmer) It holds that

\displaystyle  D(f d\gamma_n \,\|\, d\gamma_n) = \min D(W_{[0,1]} \,\|\, B_{[0,1]}) = \min \frac12 \int_0^1 \mathop{\mathbb E}\,\|v_t\|^2\,dt\,,

where the minima are over all processes of the form (1).

Proof: In the preceding post (Lemma 2), we have already seen that for any process of the form (1), it holds that

\displaystyle  D(f d\gamma_n \,\|\,d\gamma_n) \leq \frac12 \int_0^1 \mathop{\mathbb E}\,\|v_t\|^2\,dt = D(W_{[0,1]} \,\|\, B_{[0,1]})\,,

thus we need only exhibit a drift {\{v_t\}} achieving equality.

We define

\displaystyle  v_t = \nabla \log P_{1-t} f(W_t) = \frac{\nabla P_{1-t} f(W_t)}{P_{1-t} f(W_t)}\,,

where {\{P_t\}} is the Brownian semigroup defined by

\displaystyle  P_t f(x) = \mathop{\mathbb E}[f(x + B_t)]\,.

As we saw in the previous post (Lemma 2), the chain rule yields

\displaystyle  D(W_{[0,1]} \,\|\, B_{[0,1]}) = \frac12 \int_0^1 \mathop{\mathbb E}\,\|v_t\|^2\,dt\,. \ \ \ \ \ (2)

We are left to show that {W_1} has law {f \,d\gamma_n} and {D(W_{[0,1]} \,\|\, B_{[0,1]}) = D(f d\gamma_n \,\|\,d\gamma_n)} .

We will prove the first fact using Girsanov’s theorem to argue about the change of measure between {\{W_t\}} and {\{B_t\}} . As in the previous post, we will argue somewhat informally using the heuristic that {dB_t} is a Gaussian random variable in {\mathbb R^n} with covariance {dt \cdot I} . Itô’s formula states that this heuristic is justified (see our use of the formula below).

The following lemma says that, given any sample path {\{W_s : s \in [0,t]\}} of our process up to time {t} , the probability that Brownian motion (without drift) would have “done the same thing” is {\frac{1}{M_t}} times the probability that our process does.

Remark 1 I chose to present various steps in the next proof at varying levels of formality. The arguments have the same structure as corresponding formal proofs, but I thought (perhaps naïvely) that this would be instructive.

Lemma 2 Let {\mu_t} denote the law of {\{W_s : s \in [0,t]\}} . If we define

\displaystyle  M_t = \exp\left(-\int_0^t \langle v_s,dB_s\rangle - \frac12 \int_0^t \|v_s\|^2\,ds\right)\,,

then under the measure {\nu_t} given by

\displaystyle  d\nu_t = M_t \,d\mu_t\,,

the process {\{W_s : s \in [0,t]\}} has the same law as {\{B_s : s \in [0,t]\}} .

Proof: We argue by analogy with the discrete proof. First, let us define the infinitesimal “transition kernel” of Brownian motion using our heuristic that {dB_t} has covariance {dt \cdot I} :

\displaystyle  p(x,y) = \frac{e^{-\|x-y\|^2/2dt}}{(2\pi dt)^{n/2}}\,.

We can also compute the (time-inhomogeneous) transition kernel {q_t} of {\{W_t\}} :

\displaystyle  q_t(x,y) = \frac{e^{-\|v_t dt + x - y\|^2/2dt}}{(2\pi dt)^{n/2}} = p(x,y) e^{-\frac12 \|v_t\|^2 dt} e^{-\langle v_t, x-y\rangle}\,.

Here we are using that {dW_t = dB_t + v_t\,dt} and {v_t} is deterministic conditioned on the past, thus the law of {dW_t} is a normal with mean {v_t\,dt} and covariance {dt \cdot I} .

To avoid confusion of derivatives, let’s use {\alpha_t} for the density of {\mu_t} and {\beta_t} for the density of Brownian motion (recall that these are densities on paths). Now let us relate the density {\alpha_{t+dt}} to the density {\alpha_{t}} . We use here the notations {\{\hat W_t, \hat v_t, \hat B_t\}} to denote a (non-random) sample path of {\{W_t\}} :

\displaystyle  \begin{array}{lll}  \alpha_{t+dt}(\hat W_{[0,t+dt]}) &= \alpha_t(\hat W_{[0,t]}) q_t(\hat W_t, \hat W_{t+dt}) \\ &= \alpha_t(\hat W_{[0,t]}) p(\hat W_t, \hat W_{t+dt}) e^{-\frac12 \|\hat v_t\|^2\,dt-\langle \hat v_t,\hat W_t-\hat W_{t+dt}\rangle} \\ &= \alpha_t(\hat W_{[0,t]}) p(\hat W_t, \hat W_{t+dt}) e^{-\frac12 \|\hat v_t\|^2\,dt+\langle \hat v_t,d \hat W_t\rangle} \\ &= \alpha_t(\hat W_{[0,t]}) p(\hat W_t, \hat W_{t+dt}) e^{\frac12 \|\hat v_t\|^2\,dt+\langle \hat v_t, d \hat B_t\rangle}\,, \end{array}

where the last line uses {d\hat W_t = d\hat B_t + \hat v_t\,dt} .

Now by “heuristic” induction, we can assume {\alpha_t(\hat W_{[0,t]})=\frac{1}{M_t} \beta_t(\hat W_{[0,t]})} , yielding

\displaystyle  \begin{array}{lll}  \alpha_{t+dt}(\hat W_{[0,t+dt]}) &= \frac{1}{M_t} \beta_t(\hat W_{[0,t]}) p(\hat W_t, \hat W_{t+dt}) e^{\frac12 \|\hat v_t\|^2\,dt+\langle \hat v_t, d \hat B_t\rangle} \\ &= \frac{1}{M_{t+dt}} \beta_t(\hat W_{[0,t]}) p(\hat W_t, \hat W_{t+dt}) \\ &= \frac{1}{M_{t+dt}} \beta_{t+dt}(\hat W_{[0,t+dt]})\,. \end{array}

In the last line, we used the fact that {p} is the infinitesimal transition kernel for Brownian motion. \Box
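As a sanity check of Lemma 2, here is a minimal numerical sketch in one dimension. The drift {v_t = \sin(W_t)} , the test functions, and all names in the code are illustrative assumptions, not part of the argument above. For an Euler discretization with Gaussian increments the reweighting by {M_1} is exact, so weighted averages of functions of {W_1} should agree with the corresponding averages for {B_1} up to Monte Carlo error.

```python
import numpy as np

def girsanov_check(n_paths=100_000, n_steps=100, seed=0):
    """Simulate dW = dB + v dt with the (assumed) drift v = sin(W) and
    accumulate log M_1 = -sum v dB - (1/2) sum v^2 dt along each path."""
    rng = np.random.default_rng(seed)
    dt = 1.0 / n_steps
    w = np.zeros(n_paths)
    log_m = np.zeros(n_paths)
    for _ in range(n_steps):
        v = np.sin(w)                                   # drift depends only on the past
        db = np.sqrt(dt) * rng.standard_normal(n_paths)
        log_m += -v * db - 0.5 * v**2 * dt
        w += db + v * dt
    return w, np.exp(log_m)

w1, m1 = girsanov_check()
print(np.mean(m1))                # should be close to 1
print(np.mean(m1 * w1**2))        # should be close to E[B_1^2] = 1
print(np.mean(m1 * np.cos(w1)))   # should be close to E[cos(B_1)] = exp(-1/2)
```

The unweighted averages of {W_1^2} and {\cos(W_1)} differ from the Gaussian values; it is only after weighting by {M_1} that they match.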

Now we will show that

\displaystyle  P_{1-t} f(W_t) = \exp\left(\frac12 \int_0^t \|v_s\|^2\,ds + \int_0^t \langle v_s, dB_s\rangle\right) = \frac{1}{M_t}\,. \ \ \ \ \ (3)

From Lemma 2, it will follow that {W_t} has the law {(P_{1-t} f)\cdot d\nu_t} where {d\nu_t} is the law of {B_t} . In particular, {W_1} has the law {f\,d\nu_1 = f\,d\gamma_n} which was our first goal.

Given our preceding less formal arguments, let us use a proper stochastic calculus argument to establish (3). To do that we need a way to calculate

\displaystyle  d \log P_{1-t} f(W_t) \quad \textrm{``}= \log P_{1-t-dt} f(W_{t+dt}) - \log P_{1-t} f(W_t)\textrm{''} \ \ \ \ \ (4)

Notice that this involves both time and space derivatives.

Itô’s lemma. Suppose we have a sufficiently smooth function {F : \mathbb R \times [0,1] \rightarrow \mathbb R} (for the final formula, twice continuously differentiable in space and once in time suffices) that we write as {F(x,t)} where {x} is a space variable and {t} is a time variable. We can expand {d F} via its Taylor series:

\displaystyle  d F = \partial_t F \,dt + \partial_x F\,dx + \frac12 \partial_x^2 F\,dx^2 + \partial_x \partial_t F\,dx\,dt + \cdots\,.

Normally we could eliminate the terms {dx^2, dx\, dt} , etc. since they are lower order as {dx,dt \rightarrow 0} . But recall that for Brownian motion we have the heuristic {\mathop{\mathbb E}[dB_t^2]=dt} . Thus we cannot eliminate the second-order space derivative if we plan to plug in {x=B_t} (or {x=W_t} , a process driven by Brownian motion). Itô’s lemma says that this consideration alone gives us the correct result:

\displaystyle  d F(W_t,t) = \partial_t F(W_t,t)\,dt + \partial_x F(W_t,t)\,dW_t + \frac12 \partial_x^2 F(W_t,t)\,dt\,.

This generalizes in a straightforward way to the higher dimensional setting {F : \mathbb R^n \times [0,1] \rightarrow \mathbb R} .
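The second-order correction can be seen numerically. The following is a small sketch (the choice {F(x)=x^2} and all names are assumptions made for illustration): classical calculus would predict {B_1^2 = \int_0^1 2 B_t\,dB_t} , whereas Itô’s lemma predicts an extra {\int_0^1 dt = 1} .

```python
import numpy as np

def ito_residual(n_paths=1000, n_steps=1000, seed=0):
    """For F(x) = x^2, return B_1^2 - sum 2 B_t dB_t along each simulated path.
    Classical calculus predicts 0; Ito's lemma predicts int_0^1 dt = 1."""
    rng = np.random.default_rng(seed)
    dt = 1.0 / n_steps
    db = np.sqrt(dt) * rng.standard_normal((n_paths, n_steps))
    b = np.cumsum(db, axis=1)
    # value of B at the left endpoint of each step (non-anticipating integrand)
    b_left = np.hstack([np.zeros((n_paths, 1)), b[:, :-1]])
    stochastic_integral = np.sum(2.0 * b_left * db, axis=1)
    return b[:, -1]**2 - stochastic_integral

res = ito_residual()
print(res.mean(), res.std())   # mean near 1, with a small spread across paths
```

In this discretization the residual is exactly the sum of the squared increments, which is the quantity the heuristic {dB_t^2 = dt} replaces by its mean.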

With Itô’s lemma in hand, let us continue to calculate the derivative

\displaystyle  \begin{array}{lll}  d P_{1-t} f(W_t) &= - \Delta P_{1-t} f(W_t)\,dt + \langle \nabla P_{1-t} f(W_t), dW_t\rangle + \Delta P_{1-t} f(W_t) \,dt \\ &= \langle \nabla P_{1-t} f(W_t), dW_t\rangle \\ &= P_{1-t} f(W_t) \,\langle v_t, dW_t\rangle\,. \end{array}

For the time derivative (the first term), we have employed the heat equation

\displaystyle  \partial_t P_{1-t} f = - \Delta P_{1-t} f\,,

where {\Delta = \frac12 \sum_{i=1}^n \partial_{x_i}^2} is the Laplacian on {\mathbb R^n} , normalized with the factor {\frac12} so that it generates Brownian motion.

Note that the heat equation was already contained in our “infinitesimal density” {p} in the proof of Lemma 2, or in the representation {P_t = e^{t \Delta}} , and Itô’s lemma was also contained in our heuristic that {dB_t} has covariance {dt \cdot I} .

Using Itô’s formula again yields

\displaystyle  d \log P_{1-t} f(W_t) = \langle v_t, dW_t\rangle - \frac12 \|v_t\|^2\,dt = \frac12 \|v_t\|^2\,dt + \langle v_t,dB_t\rangle\,,

giving our desired conclusion (3) after integrating from {0} to {t} , since {P_1 f(W_0) = P_1 f(0) = \int f\,d\gamma_n = 1} .
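In more detail, here is a worked version of this second application of Itô’s formula, writing {X_t = P_{1-t} f(W_t)} as a shorthand introduced only for this computation. We computed {dX_t = X_t \langle v_t, dW_t\rangle} , so the heuristic that {dW_t} has covariance {dt \cdot I} gives {(dX_t)^2 = X_t^2 \|v_t\|^2\,dt} . Applying Itô’s lemma to {\log} then yields

\displaystyle  d \log X_t = \frac{dX_t}{X_t} - \frac12 \frac{(dX_t)^2}{X_t^2} = \langle v_t, dW_t\rangle - \frac12 \|v_t\|^2\,dt\,,

and substituting {dW_t = dB_t + v_t\,dt} turns {\langle v_t, dW_t\rangle} into {\langle v_t, dB_t\rangle + \|v_t\|^2\,dt} .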

Our final task is to establish optimality: {D\left(W_{[0,1]} \,\|\, B_{[0,1]}\right) = D(W_1\,\|\,B_1)} . We apply the formula (3):

\displaystyle  D(W_1\,\|\,B_1) = \mathop{\mathbb E}[\log f(W_1)] = \mathop{\mathbb E}\left[\frac12 \int_0^1 \|v_t\|^2\,dt\right],

where we used {\mathop{\mathbb E}[\langle v_t,dB_t\rangle]=0} . Combined with (2), this completes the proof of the theorem. \Box
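To see Theorem 1 in action, here is a minimal simulation sketch under an assumed example: take {n=1} and {f(x) = e^{-a^2/2}\cosh(ax)} , so that {f\,d\gamma_1} is an equal mixture of {N(a,1)} and {N(-a,1)} . For this {f} one computes {P_{1-t} f(x) = e^{-a^2 t/2} \cosh(ax)} , hence the drift is {v_t = a \tanh(a W_t)} . The parameter {a} , the Euler discretization, and all function names are choices made here for illustration.

```python
import numpy as np

def simulate_follmer(a=1.0, n_paths=20_000, n_steps=200, seed=0):
    """Euler scheme for dW = dB + v dt with v = a tanh(a W), the Follmer
    drift for the assumed density f(x) = exp(-a^2/2) cosh(a x)."""
    rng = np.random.default_rng(seed)
    dt = 1.0 / n_steps
    w = np.zeros(n_paths)
    energy = np.zeros(n_paths)       # accumulates (1/2) int_0^1 v_t^2 dt per path
    for _ in range(n_steps):
        v = a * np.tanh(a * w)
        energy += 0.5 * v**2 * dt
        w += v * dt + np.sqrt(dt) * rng.standard_normal(n_paths)
    return w, energy

def log_f(x, a):
    """log f(x) for f(x) = exp(-a^2/2) cosh(a x), computed stably."""
    return np.logaddexp(a * x, -a * x) - np.log(2.0) - 0.5 * a**2

a = 1.0
w1, energy = simulate_follmer(a)
print(w1.var(), 1.0 + a**2)                  # variance of the target mixture
print(energy.mean(), log_f(w1, a).mean())    # both estimate the relative entropy
```

The two numbers on the last line estimate {\frac12 \int_0^1 \mathop{\mathbb E}\,\|v_t\|^2\,dt} and {\mathop{\mathbb E}[\log f(W_1)]} respectively; they should agree up to discretization and sampling error.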

2. The Gaussian log-Sobolev inequality

Consider again a measurable {f : \mathbb R^n \rightarrow \mathbb R_+} with {\int f\,d\gamma_n=1} . Let us define {\mathrm{Ent}_{\gamma_n}(f) = D(f\,d\gamma_n \,\|\,d\gamma_n)} . Then the classical log-Sobolev inequality in Gaussian space asserts that

\displaystyle  \mathrm{Ent}_{\gamma_n}(f) \leq \frac12 \int \frac{\|\nabla f\|^2}{f}\,d\gamma_n\,. \ \ \ \ \ (5)

First, we discuss the correct way to interpret this. Define the Ornstein-Uhlenbeck semi-group {\{U_t\}} by its action

\displaystyle  U_t f(x) = \mathop{\mathbb E}[f(e^{-t} x + \sqrt{1-e^{-2t}} B_1)]\,.

This is the natural stationary diffusion process on Gaussian space. For every measurable {f} , we have

\displaystyle  U_t f \rightarrow \int f d\gamma_n \quad \textrm{ as } t \to \infty

or equivalently

\displaystyle  \mathrm{Ent}_{\gamma_n}(U_t f) \rightarrow 0 \quad \textrm{ as } t \to \infty

The log-Sobolev inequality yields quantitative convergence in the relative entropy distance as follows: Define the Fisher information

\displaystyle  I(f) = \int \frac{\|\nabla f\|^2}{f} \,d\gamma_n\,.

One can check that

\displaystyle  \frac{d}{dt} \mathrm{Ent}_{\gamma_n} (U_t f)\Big|_{t=0} = - I(f)\,,

thus the Fisher information describes the instantaneous decay of the relative entropy of {f} under diffusion.

So we can rewrite the log-Sobolev inequality as:

\displaystyle  - \frac{d}{dt} \mathrm{Ent}_{\gamma_n}(U_t f)\Big|_{t=0} \geq 2 \mathrm{Ent}_{\gamma_n}(f)\,.

This expresses the intuitive fact that when the relative entropy is large, its rate of decay toward equilibrium is faster.
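These relations can be checked numerically in a simple assumed example. Take {n=1} and {f(x) = e^{-a^2/2}\cosh(ax)} ; a direct computation from the definition of {U_t} shows that {U_t f} has the same form with {a} replaced by {a e^{-t}} . The sketch below (names and parameters are illustrative) evaluates the entropy and Fisher information by Gauss-Hermite quadrature and compares a finite-difference derivative of the entropy with {-I(f)} .

```python
import numpy as np
from numpy.polynomial.hermite_e import hermegauss

NODES, WEIGHTS = hermegauss(80)              # quadrature for the weight exp(-x^2 / 2)
WEIGHTS = WEIGHTS / np.sqrt(2.0 * np.pi)     # rescale to integrate against gamma_1

def log_f(x, a):
    """log of f(x) = exp(-a^2 / 2) cosh(a x), computed stably."""
    return np.logaddexp(a * x, -a * x) - np.log(2.0) - 0.5 * a**2

def entropy(a):
    """Ent(f) = int f log f d(gamma_1)."""
    lf = log_f(NODES, a)
    return float(np.sum(WEIGHTS * np.exp(lf) * lf))

def fisher(a):
    """I(f) = int |f'|^2 / f d(gamma_1) = int (log f)'^2 f d(gamma_1)."""
    score = a * np.tanh(a * NODES)
    return float(np.sum(WEIGHTS * np.exp(log_f(NODES, a)) * score**2))

a, h = 1.0, 1e-4
# central difference in t of Ent(U_t f), using U_t f <-> a * exp(-t)
decay = (entropy(a * np.exp(-h)) - entropy(a * np.exp(h))) / (2.0 * h)
print(entropy(a), 0.5 * fisher(a))   # log-Sobolev: first value at most the second
print(decay, -fisher(a))             # entropy decay rate versus -I(f)
```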

Martingale property of the optimal drift. Now for the proof of (5). Let {dW_t = dB_t + v_t\,dt} be the entropy-optimal process with {W_1 \sim f \,d\gamma_n} . We need one more fact about {\{v_t\}} : The optimal drift is a martingale, i.e. {\mathop{\mathbb E}[v_t \mid v_s] = v_s} for {s < t} .

Let’s give two arguments to support this.

Argument one: Brownian bridges. First, note that by the chain rule for relative entropy, we have:

\displaystyle  D(W_{[0,1]} \,\|\, B_{[0,1]}) = D(W_1 \,\|\, B_1) + \int D(W_{[0,1]} \,\|\, B_{[0,1]} \mid W_1=B_1=x) f(x) d\gamma_n(x)\,.

But from optimality, we know that the latter expectation is zero. Therefore {f \,d\gamma_n} -almost surely, we have

\displaystyle  D(W_{[0,1]} \,\| B_{[0,1]} \mid W_1=B_1=x) = 0\,.

This implies that if we condition on the endpoint {x} , then {W_{[0,1]}} is a Brownian bridge (i.e., a Brownian motion conditioned to start at {0} and end at {x} ).

This implies that {\mathop{\mathbb E}[v_t \mid v_s, W_1=x] = v_s} , as one can check that a Brownian bridge {\{\hat B_t\}} with endpoint {x} is described by the drift process {d\hat B_t = dB_t + \frac{x-\hat B_t}{1-t}\,dt} , and

\displaystyle  \mathop{\mathbb E}\left[\frac{x-\hat B_t}{1-t} \,\Big|\, B_{[0,s]}\right] = \frac{x-\hat B_s}{1-s}\,.

That seemed complicated. There is a simpler way to see this: Given {\hat B_s} and any bridge {\gamma} from {\hat B_s} to {x} , every “permutation” of the infinitesimal steps in {\gamma} has the same law (by commutativity, they all land at {x} ). Thus the marginal law of {dB_t + v_t\,dt} at every point {t \geq s} should be the same. In particular,

\displaystyle  \mathop{\mathbb E}[v_t\,dt \mid v_s] = \mathop{\mathbb E}[dB_t + v_t\,dt \mid v_s] = \mathop{\mathbb E}[dB_s + v_s \,ds \mid v_s] = v_s\,ds\,.

Argument two: Change of measure. There is a more succinct (though perhaps more opaque) way to see that {\{v_t\}} is a martingale. Note that the process {\nabla P_{1-t} f(B_t) = P_{1-t} \nabla f(B_t)} is a Doob martingale. But we have {v_t = \frac{\nabla P_{1-t} f(W_t)}{P_{1-t} f(W_t)}} and we also know from (3) that {\frac{1}{P_{1-t} f(W_t)} = M_t} is precisely the change of measure that makes {\{W_t\}} into Brownian motion.

Proof of the log-Sobolev inequality. In any case, now we are ready for the proof of (5). It also comes straight from Lehec’s paper. Since {\{v_t\}} is a martingale, we have {\mathop{\mathbb E}\,\|v_t\|^2 \leq \mathop{\mathbb E}\,\|v_1\|^2} . So by Theorem 1:

\displaystyle  \mathrm{Ent}_{\gamma_n}(f) = \frac12 \int_0^1 \mathop{\mathbb E}\,\|v_t\|^2\,dt \leq \frac12 \mathop{\mathbb E}\,\|v_1\|^2 = \frac12 \mathop{\mathbb E}\, \frac{\|\nabla f(W_1)\|^2}{f(W_1)^2} = \frac12 \mathop{\mathbb E}\, \frac{\|\nabla f(B_1)\|^2}{f(B_1)}\,.

The latter quantity is {\frac12 I(f)} . In the last equality, we used the fact that {\frac{1}{f(W_1)}} is precisely the change of measure that turns {\{W_t\}} into Brownian motion.
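As a worked example (with a density chosen here for illustration), fix {a \in \mathbb R^n} and let {f(x) = \exp\left(\langle a,x\rangle - \frac12 \|a\|^2\right)} , so that {f\,d\gamma_n} is the law of a Gaussian with mean {a} and identity covariance. Then

\displaystyle  P_{1-t} f(x) = \exp\left(\langle a,x\rangle - \frac{t}{2} \|a\|^2\right)\,, \qquad v_t = \nabla \log P_{1-t} f(W_t) = a\,,

so the optimal process is simply {W_t = B_t + ta} , and

\displaystyle  \mathrm{Ent}_{\gamma_n}(f) = \frac12 \int_0^1 \|a\|^2\,dt = \frac12 \|a\|^2 = \frac12 \int \frac{\|a f\|^2}{f}\,d\gamma_n = \frac12 I(f)\,.

Here the drift is a constant martingale, the step {\mathop{\mathbb E}\,\|v_t\|^2 \leq \mathop{\mathbb E}\,\|v_1\|^2} holds with equality, and (5) is tight.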

Entropy optimality on path space

After Boaz posted on the mother of all inequalities, it seemed about the right time to get around to the next series of posts on entropy optimality. The approach is the same as before, but now we consider entropy optimality on a path space. After finding an appropriate entropy-maximizer, the Brascamp-Lieb inequality will admit a gorgeous one-line proof. Our argument is taken from the beautiful paper of Lehec.

For simplicity, we start first with an entropy optimization on a discrete path space. Then we move on to Brownian motion.

1.1. Entropy optimality on discrete path spaces

Consider a finite state space {\Omega} and a transition kernel {p : \Omega \times \Omega \rightarrow [0,1]} . Also fix some time {T \geq 0} .

Let {\mathcal P_T} denote the space of all paths {\gamma : \{0,1,\ldots,T\} \rightarrow \Omega} . There is a natural measure {\mu_{\mathcal P}} on {\mathcal P_T} coming from the transition kernel:

\displaystyle  \mu_{\mathcal P}(\gamma) = \prod_{t=0}^{T-1} p\left(\gamma(t), \gamma(t+1)\right)\,.

Now suppose we are given a starting point {x_0 \in \Omega} , and a target distribution specified by a function {f : \Omega \rightarrow {\mathbb R}_+} scaled so that {\mathop{\mathbb E}[f(X_T) \mid X_0 = x_0]=1} . If we let {\nu_T} denote the law of {X_T \mid X_0 = x_0} , then this simply says that {f} is a density with respect to {\nu_T} . One should think about {\nu_T} as the natural law at time {T} (given {X_0=x_0} ), and {f \nu_T} describes a perturbation of this law.

Let us finally define the set {\mathcal M_T(f; x_0)} of all measures {\mu} on {\mathcal P_T} that start at {x_0} and end at {f \nu_T} , i.e. those measures satisfying

\displaystyle  \mu\left(\{\gamma : \gamma(0)=x_0\}\right) = 1\,,

and for every {x \in \Omega} ,

\displaystyle  f(x) \nu_T(x) = \sum_{\gamma \in \mathcal P_T : \gamma(T)=x} \mu(\gamma)\,.

Now we can consider the entropy optimization problem:

\displaystyle  \min \left\{ D(\mu \,\|\, \mu_{\mathcal P}) : \mu \in \mathcal M_T(f;x_0) \right\}\,. \ \ \ \ \ (1)

One should verify that, like many times before, we are minimizing the relative entropy over a polytope.

One can think of the optimization as simply computing the most likely way for a mass of particles sitting at {x_0} to end up in the distribution {f \nu_T} at time {T} .

The optimal solution {\mu^*} exists and is unique. Moreover, we can describe it explicitly: {\mu^*} is given by a time-inhomogeneous Markov chain. For {0 \leq t \leq T-1} , this chain has transition kernel

\displaystyle  q_t(x,y) = p(x,y) \frac{H_{T-t-1} f(y)}{H_{T-t} f(x)}\,, \ \ \ \ \ (2)

where {H_t} is the heat semigroup of our chain {\{X_t\}} , i.e.

\displaystyle  H_t f(x) = \mathop{\mathbb E}[f(X_t) \mid X_0 = x]\,.

Let {\{W_t\}} denote the time-inhomogeneous chain with transition kernels {\{q_t\}} and {W_0=x_0} and let {\mu} denote the law of the random path {\{W_0, \ldots, W_T\}} . We will now verify that {\mu} is the optimal solution to (1).

We first need to confirm that {\mu \in \mathcal M_T(f;x_0)} , i.e. that {W_T} has law {f \nu_T} . To this end, we will verify inductively that {W_t} has law {(H_{T-t} f)\cdot \nu_t} . For {t=0} , this follows by definition. For the inductive step:

\displaystyle  \begin{array}{lll}  \displaystyle\mathop{\mathbb P}[W_{t+1}=y] &= \sum_{x \in \Omega} \mathop{\mathbb P}[W_t=x] \cdot p(x,y) \frac{H_{T-t-1} f(y)}{H_{T-t} f(x)} \\ \displaystyle&= \sum_{x \in \Omega} H_{T-t} f(x) \nu_t(x) p(x,y) \frac{H_{T-t-1} f(y)}{H_{T-t} f(x)} \\ \displaystyle&= \sum_{x \in \Omega} \nu_t(x) p(x,y) H_{T-t-1}f(y) \\ \displaystyle & = H_{T-t-1} f(y) \nu_{t+1}(y)\,. \end{array}

We have confirmed that {\mu \in \mathcal M_T(f;x_0)} . Let us now verify its optimality by writing

\displaystyle  D(f \nu_T \,\|\,\nu_T) = \mathop{\mathbb E}_{\nu_T} [f \log f] = \mathop{\mathbb E}[\log f(W_T)]\,,

where the final equality uses the fact we just proved: {W_T} has law {f \nu_T} . Continuing, we have

\displaystyle  \mathop{\mathbb E}[\log f(W_T)] = \sum_{t=0}^{T-1} \mathop{\mathbb E}\left[\log \frac{H_{T-t-1} f(W_{t+1})}{H_{T-t} f(W_t)}\right] = \sum_{t=0}^{T-1} \mathop{\mathbb E} \left[D(q_t(W_t, \cdot) \,\|\, p(W_t,\cdot))\right]\,,

where the final equality uses the definition of {q_t} in (2). The latter quantity is precisely {D(\mu \,\|\, \mu_{\mathcal P})} by the chain rule for relative entropy.

Exercise: One should check that if {\{A_t\}} and {\{B_t\}} are two time-inhomogeneous Markov chains on {\Omega} with respective transition kernels {a_t} and {b_t} then indeed the chain rule for relative entropy yields

\displaystyle  D(\{A_0, \ldots, A_T\} \,\|\, \{B_0, \ldots, B_T\}) = \sum_{t=0}^{T-1} \mathop{\mathbb E}\left[D\left(a_t(A_t, \cdot)\,\|\,b_t(A_t,\cdot)\right)\right]\,. \ \ \ \ \ (3)

We conclude that

\displaystyle  D(f \nu_T \,\|\, \nu_T) = D(\mu \,\|\,\mu_{\mathcal P})\,,

and from this one immediately concludes that {\mu=\mu^*} . Indeed, for any measure {\mu' \in \mathcal M_T(f;x_0)} , we must have {D(\mu' \,\|\,\mu_{\mathcal P}) \geq D(f \nu_T \,\|\,\nu_T)} . This follows because {f \nu_T} is the law of the endpoint of a path drawn from {\mu'} and {\nu_T} is the law of the endpoint of a path drawn from {\mu_{\mathcal P}} (started at {x_0} ). The relative entropy between the endpoints is at most the relative entropy along the entire path. (This intuitive fact can again be proved via the chain rule for relative entropy by conditioning on the endpoint of the path.)
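The construction of this section can be checked on a small example. The sketch below builds the kernels (2) for a randomly generated chain and verifies both that {W_T} has law {f \nu_T} and that the path entropy computed via (3) equals {D(f\nu_T \,\|\, \nu_T)} . The chain, the function {f} , and all names are hypothetical; for simplicity it assumes a strictly positive kernel {p} and strictly positive {f} .

```python
import numpy as np

def optimal_kernels(p, f, T):
    """Kernels q_t(x, y) = p(x, y) H_{T-t-1} f(y) / H_{T-t} f(x), t = 0..T-1."""
    H = [np.asarray(f, dtype=float)]          # H[s] holds the vector H_s f = p^s f
    for _ in range(T):
        H.append(p @ H[-1])
    return [p * H[T - t - 1][None, :] / H[T - t][:, None] for t in range(T)]

def path_entropy_and_endpoint(p, kernels, x0):
    """Chain-rule value of D(mu || mu_P) and the law of W_T, starting from x0."""
    law = np.zeros(p.shape[0])
    law[x0] = 1.0
    total = 0.0
    for q in kernels:
        kl_rows = np.sum(q * np.log(q / p), axis=1)   # D(q_t(x, .) || p(x, .)) for each x
        total += law @ kl_rows                        # expectation over the law of W_t
        law = law @ q                                 # advance to the law of W_{t+1}
    return total, law

rng = np.random.default_rng(0)
n_states, T, x0 = 4, 5, 0
p = rng.random((n_states, n_states)) + 0.1
p /= p.sum(axis=1, keepdims=True)             # strictly positive transition kernel
nu_T = np.linalg.matrix_power(p, T)[x0]       # law of X_T given X_0 = x0
f = rng.random(n_states) + 0.1
f /= nu_T @ f                                 # normalize so that E[f(X_T) | X_0 = x0] = 1

total, law = path_entropy_and_endpoint(p, optimal_kernels(p, f, T), x0)
endpoint_entropy = np.sum(f * nu_T * np.log(f))
print(np.allclose(law, f * nu_T), np.isclose(total, endpoint_entropy))
```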

1.2. The Brownian version

Let us now do the same thing for processes driven by Brownian motion in {\mathbb R^n} . Let {\{B_t : t \in [0,1]\}} be a Brownian motion with {B_0=0} . Let {\gamma_n} be the standard Gaussian measure and recall that {B_1} has law {\gamma_n} .

We recall that if we have two measures {\mu} and {\nu} on {\mathbb R^n} such that {\nu} is absolutely continuous with respect to {\mu} , we define the relative entropy

\displaystyle D(\nu\,\|\,\mu) = \int d\nu \log \frac{d\nu}{d\mu}

Our “path space” will consist of drift processes {\{W_t : t \in [0,1]\}} of the form

\displaystyle  W_t = B_t + \int_0^t u_s\,ds\,, \ \ \ \ \ (4)

where {\{u_s\}} denotes the drift. We require that {\{u_s\}} is progressively measurable, i.e. that the law of {u_s} is determined by the past up to time {s} , and that {\mathop{\mathbb E} \int_0^1 \|u_s\|^2 \,ds < \infty} . Note that we can write such a process in differential notation as

\displaystyle  dW_t = dB_t + u_t\,dt\,,

with {W_0=0} .

Fix a smooth density {f : \mathbb R^n \rightarrow {\mathbb R}_+} with {\int f \,d\gamma_n =1} . In analogy with the discrete setting, let us use {\mathcal M(f)} to denote the set of processes {\{W_t\}} that can be realized in the form (4) and such that {W_0 = 0} and {W_1} has law {f d\gamma_n} .

Let us also use the shorthand {W_{[0,1]} = \{W_t : t\in [0,1]\}} to represent the entire path of the process. Again, we will consider the entropy optimization problem:

\displaystyle  \min \left\{ \vphantom{\bigoplus} D\left(W_{[0,1]} \,\|\, B_{[0,1]}\right) : W_{[0,1]} \in \mathcal M(f) \right\}\,. \ \ \ \ \ (5)

As in the discrete setting, this problem has a unique optimal solution (in the sense of stochastic processes). Here is the main result.

Theorem 1 (Föllmer) If {\{ W_t = B_t + \int_0^t u_s\,ds : t \in [0,1]\}} is the optimal solution to (5), then

\displaystyle  D\left(W_{[0,1]}\,\|\,B_{[0,1]}\right) = D(W_1 \,\|\, B_1) = \frac12 \int_0^1 \mathop{\mathbb E}\,\|u_t\|^2\,dt\,.

Just as for the discrete case, one should think of this as asserting that the optimal process only uses as much entropy as is needed for the difference in laws at the endpoint. The RHS should be thought of as an integral over the expected relative entropy generated at time {t} (just as in the chain rule expression (3)).

The reason for the quadratic term is the usual relative entropy approximation for infinitesimal perturbations. For instance, consider the relative entropy between a binary random variable with expected value {\tfrac12 (1-\varepsilon)} and a binary random variable with expected value {\tfrac12} :

\displaystyle  \frac12(1-\varepsilon) \log (1-\varepsilon) + \frac12 (1+\varepsilon) \log (1+\varepsilon) \approx \frac12 \varepsilon^2\,.
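To see where the approximation comes from, one can expand the logarithm (a short worked computation added for convenience):

\displaystyle  (1\pm\varepsilon)\log(1\pm\varepsilon) = \pm\varepsilon + \frac{\varepsilon^2}{2} \mp \frac{\varepsilon^3}{6} + \frac{\varepsilon^4}{12} + O(\varepsilon^5)\,,

so the odd-order terms cancel when the two contributions are averaged, leaving

\displaystyle  \frac12(1-\varepsilon) \log (1-\varepsilon) + \frac12 (1+\varepsilon) \log (1+\varepsilon) = \frac12 \varepsilon^2 + \frac{1}{12} \varepsilon^4 + O(\varepsilon^6)\,.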

I am going to delay the proof of Theorem 1 to the next post because doing it in an elementary way will require some discussion of Itô calculus. For now, let us prove the following.

Lemma 2 For any process {W_{[0,1]} \in \mathcal M(f)} given by a drift {\{u_t : t\in[0,1]\}} , it holds that

\displaystyle  D(W_1 \,\|\, B_1) \leq D(W_{[0,1]} \,\|\, B_{[0,1]}) =\frac12 \int_0^1 \mathop{\mathbb E}\,\|u_t\|^2\,dt\,.

Proof: The proof will be somewhat informal. It can be done easily using Girsanov’s theorem, but we try to keep the presentation here elementary and in correspondence with the discrete version above.

Let us first use the chain rule for relative entropy to calculate

\displaystyle  D\left(W_{[0,1]} \,\|\,B_{[0,1]}\right) = \int_0^1 \mathop{\mathbb E}\left[D( dW_t \,\|\, dB_t)\right] = \int_0^1 \mathop{\mathbb E}\left[D(dB_t + u_t\,dt \,\|\,dB_t)\right]\,. \ \ \ \ \ (6)

Note that {dB_t} has the law of an {n} -dimensional Gaussian with covariance {dt \cdot I} .

If {Z} is an {n} -dimensional Gaussian with covariance {\sigma^2 \cdot I} and {u \in \mathbb R^n} , then

\displaystyle  \begin{array}{lll}  D(Z + u \,\|\, Z) &= \mathop{\mathbb E}\left[\log \frac{e^{-\|Z\|^2/2\sigma^2}}{e^{-\|u+Z\|^2/2\sigma^2}}\right] \\ &= \mathop{\mathbb E}\left[\frac{\|u\|^2}{2\sigma^2} + \frac{\langle u,Z\rangle}{\sigma^2}\right] \\ &= \frac{\|u\|^2}{2\sigma^2}\,. \end{array}

Therefore:

\displaystyle  D(dB_t + u_t\,dt \,\|\,dB_t) = \mathop{\mathbb E} \left[\frac{\|u_t\|^2 dt^2}{2 dt}\mid \mathcal F_t\right] =\frac12 \mathop{\mathbb E}\left[\|u_t\|^2\,dt \mid \mathcal F_t\right]\,,

where the latter expectation is understood to be conditioned on the past {\mathcal F_t} up to time {t} .

In particular, plugging this into (6), we have

\displaystyle  D\left(W_{[0,1]} \,\|\,B_{[0,1]}\right) = \frac12 \int_0^1 \mathop{\mathbb E}\,\|u_t\|^2\,dt\,. \ \ \ \ \ (7)

\Box
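The Gaussian computation used in the proof, {D(Z+u \,\|\, Z) = \frac{\|u\|^2}{2\sigma^2}} , is easy to confirm by simulation. The following sketch (with an arbitrarily chosen {u} and {\sigma} , and hypothetical names) averages the log-density ratio over samples of {Z+u} .

```python
import numpy as np

def gaussian_shift_kl_mc(u, sigma, n_samples=200_000, seed=0):
    """Monte Carlo estimate of D(Z + u || Z) for Z ~ N(0, sigma^2 I)."""
    rng = np.random.default_rng(seed)
    u = np.asarray(u, dtype=float)
    x = u + sigma * rng.standard_normal((n_samples, u.size))   # samples of Z + u
    # log density of Z + u minus log density of Z, evaluated at x
    log_ratio = (np.sum(x**2, axis=1) - np.sum((x - u)**2, axis=1)) / (2.0 * sigma**2)
    return float(log_ratio.mean())

u = np.array([1.0, -2.0, 0.5])
sigma = 2.0
estimate = gaussian_shift_kl_mc(u, sigma)
exact = float(u @ u) / (2.0 * sigma**2)
print(estimate, exact)   # the estimate should be close to the exact value
```

Setting {u = u_t\,dt} and {\sigma^2 = dt} recovers the infinitesimal relative entropy {\frac12 \|u_t\|^2\,dt} used above.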

1.3. Brascamp-Lieb

The proof is taken directly from Lehec. We will use the entropic formulation of Brascamp-Lieb due to Carlen and Cordero-Erausquin.

Let {E} be a Euclidean space with subspaces {E_1, E_2, \ldots, E_m} . Let {P_i} denote the orthogonal projection onto {E_i} . Now suppose that for positive numbers {c_1, c_2, \ldots, c_m > 0} , we have

\displaystyle  \sum_{i=1}^m c_i P_i = \mathrm{id}_E\,. \ \ \ \ \ (8)

By (8), we have for all {x \in E} :

\displaystyle \|x\|^2 = \left\langle x,\sum_{i=1}^m c_i P_i x\right\rangle = \sum_{i=1}^m c_i\|P_i x\|^2\,.

The latter equality uses the fact that each {P_i} is an orthogonal projection.

Let {Z} denote a standard Gaussian on {E} , and let {Z_i} denote a standard Gaussian on {E_i} for each {i=1,2,\ldots, m} .

Theorem 3 (Carlen & Cordero-Erausquin version of Brascamp-Lieb) For any random vector {X \in E} , it holds that

\displaystyle  D(X \,\|\, Z) \geq \sum_{i=1}^m c_i D(P_i X \,\|\, Z_i)\,.

Proof: Let {\{W_t : t \in [0,1]\}} with {dW_t = dB_t + v_t\,dt} denote the entropy-optimal drift process such that {W_1} has the law of {X} . Then by Theorem 1,

\displaystyle  D(X\,\|\,Z) = \frac12 \int_0^1 \mathop{\mathbb E}\,\|v_t\|^2\,dt = \frac12 \int_0^1 \sum_{i=1}^m c_i \mathop{\mathbb E}\,\|P_i v_t\|^2\,dt \geq \sum_{i=1}^m c_i D(P_i X \,\|\, Z_i)\,,

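where the latter inequality applies Lemma 2 to each projected process {P_i W_t = P_i B_t + \int_0^t P_i v_s\,ds} , which is a Brownian motion on {E_i} with drift {P_i v_t} , and uses the fact that {P_i W_1} has law {P_i X} . \Box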
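For a concrete instance, here is a small numerical sketch under assumed data: {E = \mathbb R^2} , three lines spanned by unit vectors at angles {0, 2\pi/3, 4\pi/3} , and {c_i = \frac23} , which satisfy (8). For a centered Gaussian {X} the relative entropies have closed forms, so the inequality of Theorem 3 can be checked directly. The covariance of {X} and all names below are illustrative choices.

```python
import numpy as np

angles = np.array([0.0, 2.0 * np.pi / 3.0, 4.0 * np.pi / 3.0])
units = np.stack([np.cos(angles), np.sin(angles)], axis=1)   # unit vectors spanning E_1, E_2, E_3
c = np.full(3, 2.0 / 3.0)
projections = [np.outer(u, u) for u in units]                # P_i = u_i u_i^T

# condition (8): sum_i c_i P_i should be the identity on E
decomposition = sum(ci * P for ci, P in zip(c, projections))

def kl_centered_gaussian(cov):
    """D(N(0, cov) || N(0, I)) = (tr cov - n - log det cov) / 2."""
    cov = np.atleast_2d(cov)
    n = cov.shape[0]
    return 0.5 * (np.trace(cov) - n - np.linalg.slogdet(cov)[1])

cov_x = np.diag([4.0, 0.25])                                 # X ~ N(0, cov_x), an assumed example
lhs = kl_centered_gaussian(cov_x)
# P_i X is a one-dimensional Gaussian on E_i with variance u_i^T cov_x u_i
rhs = sum(ci * kl_centered_gaussian(u @ cov_x @ u) for ci, u in zip(c, units))
print(np.allclose(decomposition, np.eye(2)), lhs, rhs)
```

The printed values compare {D(X \,\|\, Z)} with {\sum_i c_i D(P_i X \,\|\, Z_i)} ; the first should be at least the second.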