# Follmer’s drift, Ito’s lemma, and the log-Sobolev inequality

1. Construction of Föllmer’s drift

In a previous post, we saw how an entropy-optimal drift process could be used to prove the Brascamp-Lieb inequalities. Our main tool was a result of Föllmer that we now recall and justify. Afterward, we will use it to prove the Gaussian log-Sobolev inequality.

Consider ${f : \mathbb R^n \rightarrow \mathbb R_+}$ with ${\int f \,d\gamma_n = 1}$, where ${\gamma_n}$ is the standard Gaussian measure on ${\mathbb R^n}$. Let ${\{B_t\}}$ denote an ${n}$-dimensional Brownian motion with ${B_0=0}$. We consider all processes of the form

$\displaystyle W_t = B_t + \int_0^t v_s\,ds\,, \ \ \ \ \ (1)$

where ${\{v_s\}}$ is a progressively measurable drift and such that ${W_1}$ has law ${f\,d\gamma_n}$.

Theorem 1 (Föllmer) It holds that

$\displaystyle D(f d\gamma_n \,\|\, d\gamma_n) = \min D(W_{[0,1]} \,\|\, B_{[0,1]}) = \min \frac12 \int_0^1 \mathop{\mathbb E}\,\|v_t\|^2\,dt\,,$

where the minima are over all processes of the form (1).

Proof: In the preceding post (Lemma 2), we have already seen that for any drift of the form (1), it holds that

$\displaystyle D(f d\gamma_n \,\|\,d\gamma_n) \leq \frac12 \int_0^1 \mathop{\mathbb E}\,\|v_t\|^2\,dt = D(W_{[0,1]} \,\|\, B_{[0,1]})\,,$

thus we need only exhibit a drift ${\{v_t\}}$ achieving equality.

We define

$\displaystyle v_t = \nabla \log P_{1-t} f(W_t) = \frac{\nabla P_{1-t} f(W_t)}{P_{1-t} f(W_t)}\,,$

where ${\{P_t\}}$ is the Brownian semigroup defined by

$\displaystyle P_t f(x) = \mathop{\mathbb E}[f(x + B_t)]\,.$

As we saw in the previous post (Lemma 2), the chain rule yields

$\displaystyle D(W_{[0,1]} \,\|\, B_{[0,1]}) = \frac12 \int_0^1 \mathop{\mathbb E}\,\|v_t\|^2\,dt\,. \ \ \ \ \ (2)$

We are left to show that ${W_1}$ has law ${f \,d\gamma_n}$ and ${D(W_{[0,1]} \,\|\, B_{[0,1]}) = D(f d\gamma_n \,\|\,d\gamma_n)}$.

We will prove the first fact using Girsanov’s theorem to argue about the change of measure between ${\{W_t\}}$ and ${\{B_t\}}$. As in the previous post, we will argue somewhat informally using the heuristic that the law of ${dB_t}$ is a Gaussian random variable in ${\mathbb R^n}$ with covariance ${dt \cdot I}$. Itô’s formula states that this heuristic is justified (see our use of the formula below).

The following lemma says that, given any sample path ${\{W_s : s \in [0,t]\}}$ of our process up to time ${s}$, the probability that Brownian motion (without drift) would have “done the same thing” is ${\frac{1}{M_t}}$.

Remark 1 I chose to present various steps in the next proof at varying levels of formality. The arguments have the same structure as corresponding formal proofs, but I thought (perhaps naïvely) that this would be instructive.

Lemma 2 Let ${\mu_t}$ denote the law of ${\{W_s : s \in [0,t]\}}$. If we define

$\displaystyle M_t = \exp\left(-\int_0^t \langle v_s,dB_s\rangle - \frac12 \int_0^t \|v_s\|^2\,ds\right)\,,$

then under the measure ${\nu_t}$ given by

$\displaystyle d\nu_t = M_t \,d\mu_t\,,$

the process ${\{W_s : s \in [0,t]\}}$ has the same law as ${\{B_s : s \in [0,t]\}}$.

Proof: We argue by analogy with the discrete proof. First, let us define the infinitesimal “transition kernel” of Brownian motion using our heuristic that ${dB_t}$ has covariance ${dt \cdot I}$:

$\displaystyle p(x,y) = \frac{e^{-\|x-y\|^2/2dt}}{(2\pi dt)^{n/2}}\,.$

We can also compute the (time-inhomogeneous) transition kernel ${q_t}$ of ${\{W_t\}}$:

$\displaystyle q_t(x,y) = \frac{e^{-\|v_t dt + x - y\|^2/2dt}}{(2\pi dt)^{n/2}} = p(x,y) e^{-\frac12 \|v_t\|^2 dt} e^{-\langle v_t, x-y\rangle}\,.$

Here we are using that ${dW_t = dB_t + v_t\,dt}$ and ${v_t}$ is deterministic conditioned on the past, thus the law of ${dW_t}$ is a normal with mean ${v_t\,dt}$ and covariance ${dt \cdot I}$.

To avoid confusion of derivatives, let’s use ${\alpha_t}$ for the density of ${\mu_t}$ and ${\beta_t}$ for the density of Brownian motion (recall that these are densities on paths). Now let us relate the density ${\alpha_{t+dt}}$ to the density ${\alpha_{t}}$. We use here the notations ${\{\hat W_t, \hat v_t, \hat B_t\}}$ to denote a (non-random) sample path of ${\{W_t\}}$:

$\displaystyle \begin{array}{lll} \alpha_{t+dt}(\hat W_{[0,t+dt]}) &= \alpha_t(\hat W_{[0,t]}) q_t(\hat W_t, \hat W_{t+dt}) \\ &= \alpha_t(\hat W_{[0,t]}) p(\hat W_t, \hat W_{t+dt}) e^{-\frac12 \|\hat v_t\|^2\,dt-\langle \hat v_t,\hat W_t-\hat W_{t+dt}\rangle} \\ &= \alpha_t(\hat W_{[0,t]}) p(\hat W_t, \hat W_{t+dt}) e^{-\frac12 \|\hat v_t\|^2\,dt+\langle \hat v_t,d \hat W_t\rangle} \\ &= \alpha_t(\hat W_{[0,t]}) p(\hat W_t, \hat W_{t+dt}) e^{\frac12 \|\hat v_t\|^2\,dt+\langle \hat v_t, d \hat B_t\rangle}\,, \end{array}$

where the last line uses ${d\hat W_t = d\hat B_t + \hat v_t\,dt}$.

Now by “heuristic” induction, we can assume ${\alpha_t(\hat W_{[0,t]})=\frac{1}{M_t} \beta_t(\hat W_{[0,t]})}$, yielding

$\displaystyle \begin{array}{lll} \alpha_{t+dt}(\hat W_{[0,t+dt]}) &= \frac{1}{M_t} \beta_t(\hat W_{[0,t]}) p(\hat W_t, \hat W_{t+dt}) e^{\frac12 \|\hat v_t\|^2\,dt+\langle \hat v_t, d \hat B_t\rangle} \\ &= \frac{1}{M_{t+dt}} \beta_t(\hat W_{[0,t]}) p(\hat W_t, \hat W_{t+dt}) \\ &= \frac{1}{M_{t+dt}} \beta_{t+dt}(\hat W_{[0,t+dt]})\,. \end{array}$

In the last line, we used the fact that ${p}$ is the infinitesimal transition kernel for Brownian motion. $\Box$

Now we will show that

$\displaystyle P_{1-t} f(W_t) = \exp\left(\frac12 \int_0^t \|v_s\|^2\,ds + \int_0^t \langle v_s, dB_s\rangle\right) = \frac{1}{M_t}\,. \ \ \ \ \ (3)$

From Lemma 2, it will follow that ${W_t}$ has the law ${(P_{1-t} f)\cdot d\nu_t}$ where ${d\nu_t}$ is the law of ${B_t}$. In particular, ${W_1}$ has the law ${f\,d\nu_1 = f\,d\gamma_n}$ which was our first goal.

Given our preceding less formal arguments, let us use a proper stochastic calculus argument to establish (3). To do that we need a way to calculate

$\displaystyle d \log P_{1-t} f(W_t) \quad \textrm{}= \log P_{1-t-dt} f(W_{t+dt}) - \log P_{1-t} f(W_t)\textrm{''} \ \ \ \ \ (4)$

Notice that this involves both time and space derivatives.

Itô’s lemma. Suppose we have a continuously differentiable function ${F : \mathbb R \times [0,1] \rightarrow \mathbb R}$ that we write as ${F(x,t)}$ where ${x}$ is a space variable and ${t}$ is a time variable. We can expand ${d F}$ via its Taylor series:

$\displaystyle d F = \partial_t F \,dt + \partial_x F\,dx + \frac12 \partial_x^2 F\,dx^2 + \frac12 \partial_x \partial_t F\,dx\,dt + \cdots\,.$

Normally we could eliminate the terms ${dx^2, dx\, dt}$, etc. since they are lower order as ${dx,dt \rightarrow 0}$. But recall that for Brownian motion we have the heuristic ${\mathop{\mathbb E}[dB_t^2]=dt}$. Thus we cannot eliminate the second-order space derivative if we plan to plug in ${x=B_t}$ (or ${x=W_t}$, a process driven by Brownian motion). Itô’s lemma says that this consideration alone gives us the correct result:

$\displaystyle d F(W_t,t) = \partial_t F(W_t,t)\,dt + \partial_x F(W_t,t)\,dW_t + \frac12 \partial_x^2 F(W_t,t)\,dt\,.$

This generalizes in a straightforward way to the higher dimensional setting ${F : \mathbb R^n \times [0,1] \rightarrow \mathbb R}$.

With Itô’s lemma in hand, let us continue to calculate the derivative

$\displaystyle \begin{array}{lll} d P_{1-t} f(W_t) &= - \Delta P_{1-t} f(W_t)\,dt + \langle \nabla P_{1-t} f(W_t), dW_t\rangle + \Delta P_{1-t} f(W_t) \,dt \\ &= \langle \nabla P_{1-t} f(W_t), dW_t\rangle \\ &= P_{1-t} f(W_t) \,\langle v_t, dW_t\rangle\,. \end{array}$

For the time derivative (the first term), we have employed the heat equation

$\displaystyle \partial_t P_{1-t} f = - \Delta P_{1-t} f\,,$

where ${\Delta = \frac12 \sum_{i=1}^n \partial_{x_i}^2}$ is the Laplacian on ${\mathbb R^n}$.

Note that the heat equation was already contained in our “infinitesimal density” ${p}$ in the proof of Lemma 2, or in the representation ${P_t = e^{t \Delta}}$, and Itô’s lemma was also contained in our heuristic that ${dB_t}$ has covariance ${dt \cdot I}$.

Using Itô’s formula again yields

$\displaystyle d \log P_{1-t} f(W_t) = \langle v_t, dW_t\rangle - \frac12 \|v_t\|^2\,dt = \frac12 \|v_t\|^2\,dt + \langle v_t,dB_t\rangle\,.$

giving our desired conclusion (3).

Our final task is to establish optimality: ${D\left(W_{[0,1]} \,\|\, B_{[0,1]}\right) = D(W_1\,\|\,B_1)}$. We apply the formula (3):

$\displaystyle D(W_1\,\|\,B_1) = \mathop{\mathbb E}[\log f(W_1)] = \mathop{\mathbb E}\left[\frac12 \int_0^1 \|v_t\|^2\,dt\right],$

where we used ${\mathop{\mathbb E}[\langle v_t,dB_t\rangle]=0}$. Combined with (2), this completes the proof of the theorem. $\Box$

2. The Gaussian log-Sobolev inequality

Consider again a measurable ${f : \mathbb R^n \rightarrow \mathbb R_+}$ with ${\int f\,d\gamma_n=1}$. Let us define ${\mathrm{Ent}_{\gamma_n}(f) = D(f\,d\gamma_n \,\|\,d\gamma_n)}$. Then the classical log-Sobolev inequality in Gaussian space asserts that

$\displaystyle \mathrm{Ent}_{\gamma_n}(f) \leq \frac12 \int \frac{\|\nabla f\|^2}{f}\,d\gamma_n\,. \ \ \ \ \ (5)$

First, we discuss the correct way to interpret this. Define the Ornstein-Uhlenbeck semi-group ${\{U_t\}}$ by its action

$\displaystyle U_t f(x) = \mathop{\mathbb E}[f(e^{-t} x + \sqrt{1-e^{-2t}} B_1)]\,.$

This is the natural stationary diffusion process on Gaussian space. For every measurable ${f}$, we have

$\displaystyle U_t f \rightarrow \int f d\gamma_n \quad \textrm{ as } t \to \infty$

or equivalently

$\displaystyle \mathrm{Ent}_{\gamma_n}(U_t f) \rightarrow 0 \quad \textrm{ as } t \to \infty$

The log-Sobolev inequality yields quantitative convergence in the relative entropy distance as follows: Define the Fisher information

$\displaystyle I(f) = \int \frac{\|\nabla f\|^2}{f} \,d\gamma_n\,.$

One can check that

$\displaystyle \frac{d}{dt} \mathrm{Ent}_{\gamma_n} (U_t f)\Big|_{t=0} = - I(f)\,,$

thus the Fisher information describes the instantaneous decay of the relative entropy of ${f}$ under diffusion.

So we can rewrite the log-Sobolev inequality as:

$\displaystyle - \frac{d}{dt} \mathrm{Ent}_{\gamma_n}(U_t f)\Big|_{t=0} \geq 2 \mathrm{Ent}_{\gamma_n}(f)\,.$

This expresses the intuitive fact that when the relative entropy is large, its rate of decay toward equilibrium is faster.

Martingale property of the optimal drift. Now for the proof of (5). Let ${dW_t = dB_t + v_t\,dt}$ be the entropy-optimal process with ${W_1 \sim f \,d\gamma_n}$. We need one more fact about ${\{v_t\}}$: The optimal drift is a martingale, i.e. ${\mathop{\mathbb E}[v_t \mid v_s] = v_s}$ for ${s < t}$.

Let’s give two arguments to support this.

Argument one: Brownian bridges. First, note that by the chain rule for relative entropy, we have:

$\displaystyle D(W_{[0,1]} \,\|\, B_{[0,1]}) = D(W_1 \,\|\, B_1) + \int D(W_{[0,1]} \,\|\, B_{[0,1]} \mid W_1=B_1=x) f(x) d\gamma_n(x)\,.$

But from optimality, we know that the latter expectation is zero. Therefore ${f \,d\gamma_n}$-almost surely, we have

$\displaystyle D(W_{[0,1]} \,\| B_{[0,1]} \mid W_1=B_1=x) = 0\,.$

This implies that if we condition on the endpoint ${x}$, then ${W_{[0,1]}}$ is a Brownian bridge (i.e., a Brownian motion conditioned to start at ${0}$ and end at ${x}$).

This implies that ${\mathop{\mathbb E}[v_t \mid v_s, W_1=x] = v_s}$, as one can check that a Brownian bridge ${\{\hat B_t\}}$ with endpoint ${x}$ is described by the drift process ${d\hat B_t = dB_t + \frac{x-\hat B_t}{1-t}\,dt}$, and

$\displaystyle \mathop{\mathbb E}\left[\frac{x-\hat B_t}{1-t} \,\Big|\, B_{[0,s]}\right] = \frac{x-\hat B_s}{1-s}\,.$

That seemed complicated. There is a simpler way to see this: Given ${\hat B_s}$ and any bridge ${\gamma}$ from ${\hat B_s}$ to ${x}$, every “permutation” of the infinitesimal steps in ${\gamma}$ has the same law (by commutativity, they all land at ${x}$). Thus the marginal law of ${dB_t + v_t\,dt}$ at every point ${t \geq s}$ should be the same. In particular,

$\displaystyle \mathop{\mathbb E}[v_t\,dt \mid v_s] = \mathop{\mathbb E}[dB_t + v_t\,dt \mid v_s] = \mathop{\mathbb E}[dB_s + v_s \,ds \mid v_s] = v_s\,ds\,.$

Argument two: Change of measure. There is a more succinct (though perhaps more opaque) way to see that ${\{v_t\}}$ is a martingale. Note that the process ${\nabla P_{1-t} f(B_t) = P_{1-t} \nabla f(B_t)}$ is a Doob martingale. But we have ${v_t = \frac{\nabla P_{1-t} f(W_t)}{P_{1-t} f(W_t)}}$ and we also know that ${\frac{1}{P_{1-t} f(W_t)} = \frac{1}{M_t}}$ is precisely the change of measure that makes ${\{W_t\}}$ into Brownian motion.

Proof of the log-Sobolev inequality. In any case, now we are ready for the proof of (5). It also comes straight from Lehec’s paper. Since ${\{v_t\}}$ is a martingale, we have ${\mathop{\mathbb E}\,\|v_t\|^2 \leq \mathop{\mathbb E}\,\|v_1\|^2}$. So by Theorem 1:

$\displaystyle \mathrm{Ent}_{\gamma_n}(f) = \frac12 \int_0^1 \mathop{\mathbb E}\,\|v_t\|^2\,dt \leq \frac12 \mathop{\mathbb E}\,\|v_1\|^2 = \frac12 \mathop{\mathbb E}\, \frac{\|\nabla f(W_1)\|^2}{f(W_1)^2} = \frac12 \mathop{\mathbb E}\, \frac{\|\nabla f(B_1)\|^2}{f(B_1)}\,.$

The latter quantity is $\frac12 I(f)$. In the last equality, we used the fact that ${\frac{1}{f(W_1)}}$ is precisely the change of measure that turns ${\{W_t\}}$ into Brownian motion.

# Entropy optimality on path space

After Boaz posted on the mother of all inequalities, it seemed about the right time to get around to the next series of posts on entropy optimality. The approach is the same as before, but now we consider entropy optimality on a path space. After finding an appropriate entropy-maximizer, the Brascamp-Lieb inequality will admit a gorgeous one-line proof. Our argument is taken from the beautiful paper of Lehec.

For simplicity, we start first with an entropy optimization on a discrete path space. Then we move on to Brownian motion.

1.1 Entropy optimality on discrete path spaces

Consider a finite state space ${\Omega}$ and a transition kernel ${p : \Omega \times \Omega \rightarrow [0,1]}$. Also fix some time ${T \geq 0}$.

Let ${\mathcal P_T}$ denote the space of all paths ${\gamma : \{0,1,\ldots,T\} \rightarrow \Omega}$. There is a natural measure ${\mu_{\mathcal P}}$ on ${\mathcal P_T}$ coming from the transition kernel:

$\displaystyle \mu_{\mathcal P}(\gamma) = \prod_{t=0}^{T-1} p\left(\gamma(t), \gamma(t+1)\right)\,.$

Now suppose we are given a starting point ${x_0 \in \Omega}$, and a target distribution specified by a function ${f : \Omega \rightarrow {\mathbb R}_+}$ scaled so that ${\mathop{\mathbb E}[f(X_T) \mid X_0 = x_0]=1}$. If we let ${\nu_T}$ denote the law of ${X_T \mid X_0 = x_0}$, then this simply says that ${f}$ is a density with respect to ${\nu_T}$. One should think about ${\nu_T}$ as the natural law at time ${T}$ (given ${X_0=x_0}$), and ${f \nu_T}$ describes a perturbation of this law.

Let us finally define the set ${\mathcal M_T(f; x_0)}$ of all measures ${\mu}$ on ${\mathcal P_T}$ that start at ${x_0}$ and end at ${f \nu_T}$, i.e. those measures satisfying

$\displaystyle \mu\left(\{\gamma : \gamma(0)=x_0\}\right) = 1\,,$

and for every ${x \in \Omega}$,

$\displaystyle f(x) \nu_T(x) = \sum_{\gamma \in \mathcal P : \gamma(T)=x} \mu(\gamma)\,.$

Now we can consider the entropy optimization problem:

$\displaystyle \min \left\{ D(\mu \,\|\, \mu_{\mathcal P}) : \mu \in \mathcal M_T(f;x_0) \right\}\,. \ \ \ \ \ (1)$

One should verify that, like many times before, we are minimizing the relative entropy over a polytope.

One can think of the optimization as simply computing the most likely way for a mass of particles sitting at ${x_0}$ to end up in the distribution ${f \nu_T}$ at time ${T}$.

The optimal solution ${\mu^*}$ exists and is unique. Moreover, we can describe it explicitly: ${\mu^*}$ is given by a time-inhomogeneous Markov chain. For ${0 \leq t \leq T-1}$, this chain has transition kernel

$\displaystyle q_t(x,y) = p(x,y) \frac{H_{T-t-1} f(y)}{H_{T-t} f(x)}\,, \ \ \ \ \ (2)$

where ${H_t}$ is the heat semigroup of our chain ${\{X_t\}}$, i.e.

$\displaystyle H_t f(x) = \mathop{\mathbb E}[f(X_t) \mid X_0 = x]\,.$

Let ${\{W_t\}}$ denote the time-inhomogeneous chain with transition kernels ${\{q_t\}}$ and ${W_0=x_0}$ and let ${\mu}$ denote the law of the random path ${\{W_0, \ldots, W_T\}}$. We will now verify that ${\mu}$ is the optimal solution to (1).

We first need to confirm that ${\mu \in \mathcal M_T(f;x_0)}$, i.e. that ${W_T}$ has law ${f \nu_T}$. To this end, we will verify inductively that ${W_t}$ has law ${(H_{T-t} f)\cdot \nu_t}$. For ${t=0}$, this follows by definition. For the inductive step:

$\displaystyle \begin{array}{lll} \displaystyle\mathop{\mathbb P}[W_{t+1}=y] &= \sum_{x \in \Omega} \Pr[W_t=x] \cdot p(x,y) \frac{H_{T-t-1} f(y)}{H_{T-t} f(x)} \\ \displaystyle&= \sum_{x \in \Omega} H_{T-t} f(x) \nu_t(x) p(x,y) \frac{H_{T-t-1} f(y)}{H_{T-t} f(x)} \\ \displaystyle&= \sum_{x \in \Omega} \nu_t(x) p(x,y) H_{T-t-1}f(y) \\ \displaystyle & = H_{T-t-1} f(y) \nu_{t+1}(y)\,. \end{array}$

We have confirmed that ${\mu \in \mathcal M_T(f;x_0)}$. Let us now verify its optimality by writing

$\displaystyle D(f \nu_T \,\|\,\nu_T) = \mathop{\mathbb E}_{\nu_T} [f \log f] = \mathop{\mathbb E}[\log f(W_T)]\,,$

where the final equality uses the fact we just proved: ${W_T}$ has law ${f \nu_T}$. Continuing, we have

$\displaystyle \mathop{\mathbb E}[\log f(W_T)] = \sum_{t=0}^{T-1} \mathop{\mathbb E}\left[\log \frac{H_{T-t-1} f(W_{t+1})}{H_{T-t} f(W_t)}\right] = \sum_{t=0}^{T-1} \mathop{\mathbb E} \left[D(q_t(W_t, \cdot) \,\|\, p(W_t,\cdot))\right]\,,$

where the final inequality uses the definition of ${q_t}$ in (2). The latter quantity is precisely ${D(\mu \,\|\, \mu_{\mathcal P})}$ by the chain rule for relative entropy.

Exercise: One should check that if ${\{A_t\}}$ and ${\{B_t\}}$ are two time-inhomogeneous Markov chains on ${\Omega}$ with respective transition kernels ${a_t}$ and ${b_t}$ then indeed the chain rule for relative entropy yields

$\displaystyle D(\{A_0, \ldots, A_T\} \,\|\, \{B_0, \ldots, B_T\}) = \sum_{t=0}^{T-1} \mathop{\mathbb E}\left[D\left(a_t(A_t, \cdot)\,\|\,b_t(A_t,\cdot)\right)\right]\,. \ \ \ \ \ (3)$

We conclude that

$\displaystyle D(f \nu_T \,\|\, \nu_T) = D(\mu \,\|\,\mu_{\mathcal P})\,,$

and from this one immediately concludes that ${\mu=\mu^*}$. Indeed, for any measure ${\mu' \in \mathcal M_T(f;x_0)}$, we must have ${D(\mu' \,\|\,\mu_{\mathcal P}) \geq D(f \nu_T \,\|\,\nu_T)}$. This follows because ${f \nu_T}$ is the law of the endpoint of a path drawn from ${\mu'}$ and ${\nu_T}$ is the law of the endpoint of a path drawn from ${\mu}$. The relative entropy between the endpoints is certainly less than along the entire path. (This intuitive fact can again be proved via the chain rule for relative entropy by conditioning on the endpoint of the path.)

1.2. The Brownian version

Let us now do the same thing for processes driven by Brownian motion in ${\mathbb R^n}$. Let ${\{B_t : t \in [0,1]\}}$ be a Brownian motion with ${B_0=0}$. Let ${\gamma_n}$ be the standard Gaussian measure and recall that ${B_1}$ has law ${\gamma_n}$.

We recall that if we have two measures ${\mu}$ and ${\nu}$ on ${\mathbb R^n}$ such that ${\nu}$ is absolutely continuous with respect to ${\mu}$, we define the relative entropy

$\displaystyle D(\nu\,\|\,\mu) = \int d\nu \log \frac{d\nu}{d\mu}$

Our “path space” will consist of drift processes ${\{W_t : t \in [0,1]\}}$ of the form

$\displaystyle W_t = B_t + \int_0^t u_s\,ds\,, \ \ \ \ \ (4)$

where ${\{u_s\}}$ denotes the drift. We require that ${\{u_s\}}$ is progressively measurable, i.e. that the law of ${u_s}$ is determined by the past up to time ${s}$, and that ${\mathop{\mathbb E} \int_0^1 \|u_s\|^2 \,ds < \infty}$. Note that we can write such a process in differential notation as

$\displaystyle dW_t = dB_t + u_t\,dt\,,$

with ${W_0=0}$.

Fix a smooth density ${f : \mathbb R^n \rightarrow {\mathbb R}_+}$ with ${\int f \,d\gamma_n =1}$. In analogy with the discrete setting, let us use ${\mathcal M(f)}$ to denote the set of processes ${\{W_t\}}$ that can be realized in the form (4) and such that ${W_0 = 0}$ and ${W_1}$ has law ${f d\gamma_n}$.

Let us also use the shorthand ${W_{[0,1]} = \{W_t : t\in [0,1]\}}$ to represent the entire path of the process. Again, we will consider the entropy optimization problem:

$\displaystyle \min \left\{ \vphantom{\bigoplus} D\left(W_{[0,1]} \,\|\, B_{[0,1]}\right) : W_{[0,1]} \in \mathcal M(f) \right\}\,. \ \ \ \ \ (5)$

As in the discrete setting, this problem has a unique optimal solution (in the sense of stochastic processes). Here is the main result.

Theorem 1 (Föllmer) If ${\{ W_t = B_t + \int_0^t u_s\,ds : t \in [0,1]\}}$ is the optimal solution to (5), then

$\displaystyle D\left(W_{[0,1]}\,\|\,B_{[0,1]}\right) = D(W_1 \,\|\, B_1) = \frac12 \int_0^1 \mathop{\mathbb E}\,\|u_t\|^2\,dt\,.$

Just as for the discrete case, one should think of this as asserting that the optimal process only uses as much entropy as is needed for the difference in laws at the endpoint. The RHS should be thought of as an integral over the expected relative entropy generated at time ${t}$ (just as in the chain rule expression (3)).

The reason for the quadratic term is the usual relative entropy approximation for infinitesimal perturbations. For instance, consider the relative entropy between a binary random variable with expected value ${\tfrac12 (1-\varepsilon)}$ and a binary random variable with expected value ${\tfrac12}$:

$\displaystyle \frac12(1-\varepsilon) \log (1-\varepsilon) + \frac12 (1+\varepsilon) \log (1+\varepsilon) \approx \frac12 \varepsilon^2\,.$

I am going to delay the proof of Theorem 1 to the next post because doing it in an elementary way will require some discussion of Ito calculus. For now, let us prove the following.

Lemma 2 For any process ${W_{[0,1]} \in \mathcal M(f)}$ given by a drift ${\{u_t : t\in[0,1]\}}$, it holds that

$\displaystyle D(W_1 \,\|\, B_1) \leq D(W_{[0,1]} \,\|\, B_{[0,1]}) =\frac12 \int_0^1 \mathop{\mathbb E}\,\|u_t\|^2\,dt\,.$

Proof: The proof will be somewhat informal. It can be done easily using Girsanov’s theorem, but we try to keep the presentation here elementary and in correspondence with the discrete version above.

Let us first use the chain rule for relative entropy to calculate

$\displaystyle D\left(W_{[0,1]} \,\|\,B_{[0,1]}\right) = \int_0^1 \mathop{\mathbb E}\left[D( dW_t \,\|\, dB_t)\right] = \int_0^1 \mathop{\mathbb E}\left[D(dB_t + u_t\,dt \,\|\,dB_t)\right]\,. \ \ \ \ \ (6)$

Note that ${dB_t}$ has the law of a standard ${n}$-dimensional of covariance ${dt \cdot I}$.

If ${Z}$ is an ${n}$-dimensional Gaussian with covariance ${\sigma^2 \cdot I}$ and ${u \in \mathbb R^n}$, then

$\displaystyle \begin{array}{lll} D(Z + u \,\|\, Z) &= \mathop{\mathbb E}\left[\log \frac{e^{-\|Z\|^2/2\sigma^2}}{e^{-\|u-Z\|^2/2\sigma^2}}\right] \\ &= \mathop{\mathbb E}\left[\frac{\|u\|^2}{2\sigma^2} + \frac{\langle u,Z\rangle}{\sigma^2}\right] \\ &= \frac{\|u\|^2}{2\sigma^2}\,. \end{array}$

Therefore:

$\displaystyle D(dB_t + u_t\,dt \,\|\,dB_t) = \mathop{\mathbb E} \left[\frac{\|u_t\|^2 dt^2}{2 dt}\mid \mathcal F_t\right] =\frac12 \mathop{\mathbb E}\left[\|u_t\|^2\,dt \mid \mathcal F_t\right]\,,$

where the latter expectation is understood to be conditioned on the past $\mathcal F_t$ up to time ${t}$.

In particular, plugging this into (6), we have

$\displaystyle D\left(W_{[0,1]} \,\|\,B_{[0,1]}\right) = \frac12 \int_0^1 \mathop{\mathbb E}\,\|u_t\|^2\,dt\,. \ \ \ \ \ (7)$

$\Box$

1.3. Brascamp-Lieb

The proof is taken directly from Lehec. We will use the entropic formulation of Brascamp-Lieb due to Carlen and Cordero-Erausquin.

Let ${E}$ be a Euclidean space with subspaces ${E_1, E_2, \ldots, E_m}$. Let ${P_i}$ denote the orthogonal projection onto ${E_i}$. Now suppose that for positive numbers ${c_1, c_2, \ldots, c_m > 0}$, we have

$\displaystyle \sum_{i=1}^m c_i P_i = \mathrm{id}_E\,. \ \ \ \ \ (8)$

By (8), we have for all ${x \in E}$:

$\displaystyle \|x\|^2 = \left\langle x,\sum_{i=1}^m c_i P_i x\right\rangle = \sum_{i=1}^m c_i\|P_i x\|^2\,.$

The latter equality uses the fact that each ${P_i}$ is an orthogonal projection.

Let ${Z}$ denote a standard Gaussian on ${E}$, and let ${Z_i}$ denote a standard Gaussian on ${E_i}$ for each ${i=1,2,\ldots, m}$.

Theorem 3 (Carlen & Cordero-Erausquin version of Brascamp-Lieb) For any random vector ${X \in E}$, it holds that

$\displaystyle D(X \,\|\, Z) \geq \sum_{i=1}^m c_i D(P_i X \,\|\, Z_i)\,.$

Proof: Let ${\{W_t : t \in [0,1]\}}$ with ${dW_t = dB_t + v_t\,dt}$ denote the entropy-optimal drift process such that ${W_1}$ has the law of ${X}$. Then by Theorem 1,

$\displaystyle D(X\,\|\,Z) = \frac12 \int_0^1 \mathop{\mathbb E}\,\|v_t\|^2\,dt = \frac12 \int_0^1 \sum_{i=1}^m c_i \mathop{\mathbb E}\,\|P_i v_t\|^2\,dt \geq \sum_{i=1}^m c_i D(P_i X \,\|\, Z_i)\,,$

where the latter inequality uses Lemma 2 and the fact that ${P_i W_1}$ has law ${P_i X}$. $\Box$

# Lecture notes for the Summer School on Combinatorial Optimization at HIM

As part of a summer school associated to the Hausdorff Institute for Mathematics Program on Combinatorial Optimization, I will be giving some lectures on “Semi-definite extended formulations and sums of squares.”

I wanted to post here a draft of the lecture notes. These extend and complete the series of posts here on non-negative and psd rank and lifts of polytopes. They also incorporate many corrections, and have exercises of varying levels of difficulty. The bibliographic references are sparse at the moment because I am posting them from somewhere in the Adriatic (where wifi is also sparse).

# Entropy optimality: Analytic psd rank and John’s theorem

Recall that our goal is to sketch a proof of the following theorem, where the notation is from the last post. I will assume a knowledge of the three posts on polyhedral lifts and non-negative rank, as our argument will proceed by analogy.

Theorem 1 For every ${m \geq 1}$ and ${g : \{0,1\}^m \rightarrow \mathbb R_+}$, there exists a constant ${C(g)}$ such that the following holds. For every ${n \geq 2m}$,

$\displaystyle 1+n^{d/2} \geq \mathrm{rank}_{\mathsf{psd}}(M_n^g) \geq C \left(\frac{n}{\log n}\right)^{(d-1)/2}\,. \ \ \ \ \ (1)$

where ${d = \deg_{\mathsf{sos}}(g).}$

In this post, we will see how John’s theorem can be used to transform a psd factorization into one of a nicer analytic form. Using this, we will be able to construct a convex body that contains an approximation to every non-negative matrix of small psd rank.

1.1. Finite-dimensional operator norms

Let ${H}$ denote a finite-dimensional Euclidean space over ${\mathbb R}$ equipped with inner product ${\langle \cdot,\cdot\rangle}$ and norm ${|\cdot|}$. For a linear operator ${A : H \rightarrow H}$, we define the operator, trace, and Frobenius norms by

$\displaystyle \|A\| = \max_{x \neq 0} \frac{|Ax|}{x},\quad \|A\|_* = \mathrm{Tr}(\sqrt{A^T A}),\quad \|A\|_F = \sqrt{\mathrm{Tr}(A^T A)}\,.$

Let ${\mathcal M(H)}$ denote the set of self-adjoint linear operators on ${H}$. Note that for ${A \in \mathcal M(H)}$, the preceding three norms are precisely the ${\ell_{\infty}}$, ${\ell_1}$, and ${\ell_2}$ norms of the eigenvalues of ${A}$. For ${A,B \in \mathcal M(H)}$, we use ${A \succeq 0}$ to denote that ${A}$ is positive semi-definite and ${A \succeq B}$ for ${A-B \succeq 0}$. We use ${\mathcal D(H) \subseteq \mathcal M(H)}$ for the set of density operators: Those ${A \in \mathcal M(H)}$ with ${A \succeq 0}$ and ${\mathrm{Tr}(A)=1}$.

One should recall that ${\mathrm{Tr}(A^T B)}$ is an inner product on the space of linear operators, and we have the operator analogs of the Hölder inequalities: ${\mathrm{Tr}(A^T B) \leq \|A\| \cdot \|B\|_*}$ and ${\mathrm{Tr}(A^T B) \leq \|A\|_F \|B\|_F}$.

1.2. Rescaling the psd factorization

As in the case of non-negative rank, consider finite sets ${X}$ and ${Y}$ and a matrix ${M : X \times Y \rightarrow \mathbb R_+}$. For the purposes of proving a lower bound on the psd rank of some matrix, we would like to have a nice analytic description.

To that end, suppose we have a rank-${r}$ psd factorization

$\displaystyle M(x,y) = \mathrm{Tr}(A(x) B(y))$

where ${A : X \rightarrow \mathcal S_+^r}$ and ${B : Y \rightarrow \mathcal S_+^r}$. The following result of Briët, Dadush and Pokutta (2013) gives us a way to “scale” the factorization so that it becomes nicer analytically. (The improved bound stated here is from an article of Fawzi, Gouveia, Parrilo, Robinson, and Thomas, and we follow their proof.)

Lemma 2 Every ${M}$ with ${\mathrm{rank}_{\mathsf{psd}}(M) \leq r}$ admits a factorization ${M(x,y)=\mathrm{Tr}(P(x) Q(y))}$ where ${P : X \rightarrow \mathcal S_+^r}$ and ${Q : Y \rightarrow \mathcal S_+^r}$ and, moreover,

$\displaystyle \max \{ \|P(x)\| \cdot \|Q(y)\| : x \in X, y \in Y \} \leq r \|M\|_{\infty}\,,$

where ${\|M\|_{\infty} = \max_{x \in X, y \in Y} M(x,y)}$.

Proof: Start with a rank-${r}$ psd factorization ${M(x,y) = \mathrm{Tr}(A(x) B(y))}$. Observe that there is a degree of freedom here, because for any invertible operator ${J}$, we get another psd factorization ${M(x,y) = \mathrm{Tr}\left(\left(J A(x) J^T\right) \cdot \left((J^{-1})^T B(y) J^{-1}\right)\right)}$.

Let ${U = \{ u \in \mathbb R^r : \exists x \in X\,\, A(x) \succeq uu^T \}}$ and ${V = \{ v \in \mathbb R^r : \exists y \in X\,\, B(y) \succeq vv^T \}}$. Set ${\Delta = \|M\|_{\infty}}$. We may assume that ${U}$ and ${V}$ both span ${\mathbb R^r}$ (else we can obtain a lower-rank psd factorization). Both sets are bounded by finiteness of ${X}$ and ${Y}$.

Let ${C=\mathrm{conv}(U)}$ and note that ${C}$ is centrally symmetric and contains the origin. Now John’s theorem tells us there exists a linear operator ${J : \mathbb R^r \rightarrow \mathbb R^r}$ such that

$\displaystyle B_{\ell_2} \subseteq J C \subseteq \sqrt{r} B_{\ell_2}\,, \ \ \ \ \ (2)$

where ${B_{\ell_2}}$ denotes the unit ball in the Euclidean norm. Let us now set ${P(x) = J A(x) J^T}$ and ${Q(y) = (J^{-1})^T B(y) J^{-1}}$.

Eigenvalues of ${P(x)}$: Let ${w}$ be an eigenvector of ${P(x)}$ normalized so the corresponding eigenvalue is ${\|w\|_2^2}$. Then ${P(x) \succeq w w^T}$, implying that ${J^{-1} w \in U}$ (here we use that ${A \succeq 0 \implies S A S^T \succeq 0}$ for any ${S}$). Since ${w = J(J^{-1} w)}$, (2) implies that ${\|w\|_2 \leq \sqrt{r}}$. We conclude that every eigenvalue of ${P(x)}$ is at most ${r}$.

Eigenvalues of ${Q(y)}$: Let ${w}$ be an eigenvector of ${Q(y)}$ normalized so that the corresponding eigenvalue is ${\|w\|_2^2}$. Then as before, we have ${Q(y) \succeq ww^T}$ and this implies ${J^T w \in V}$. Now, on the one hand we have

$\displaystyle \max_{z \in JC}\, \langle z,w\rangle \geq \|w\|_2 \ \ \ \ \ (3)$

since ${JC \supseteq B_{\ell_2}}$.

On the other hand:

$\displaystyle \max_{z \in JC}\, \langle z,w\rangle^2 = \max_{z \in C}\, \langle Jz, w\rangle^2 = \max_{z \in C}\, \langle z, J^T w\rangle^2\,. \ \ \ \ \ (4)$

Finally, observe that for any ${u \in U}$ and ${v \in V}$, we have

$\displaystyle \langle u,v\rangle^2 =\langle uu^T, vv^T\rangle \leq \max_{x \in X, y \in Y} \langle A(x), B(y)\rangle \leq \Delta\,.$

By convexity, this implies that ${\max_{z \in C}\, \langle z,v\rangle^2 \leq \Delta}$ for all ${v \in V}$, bounding the right-hand side of (4) by ${\Delta}$. Combining this with (3) yields ${\|w\|_2^2 \leq \Delta}$. We conclude that all the eigenvalues of ${Q(y)}$ are at most ${\Delta}$. $\Box$

1.3. Convex proxy for psd rank

Again, in analogy with the non-negative rank setting, we can define an “analytic psd rank” parameter for matrices ${N : X \times Y \rightarrow \mathbb R_+}$:

$\displaystyle \alpha_{\mathsf{psd}}(N) = \min \Big\{ \alpha \mid \exists A : X \rightarrow \mathcal S_+^k, B : Y \rightarrow \mathcal S_+^k\,,$

$\displaystyle \hphantom{xx} \mathop{\mathbb E}_{x \in X}[A(x)]=I,$

$\displaystyle \hphantom{xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx} \|B(y)\| \leq \frac{\alpha}{k}\, \mathop{\mathbb E}_{y \in Y}[\mathrm{Tr}(B(y))] \quad \forall y \in Y$

$\displaystyle \hphantom{\qquad\qquad} \|A(x)\| \leq \alpha \quad \forall x \in X\Big\}\,.$

Note that we have implicit equipped ${X}$ and ${Y}$ with the uniform measure. The main point here is that ${k}$ can be arbitrary. One can verify that ${\alpha_{\mathsf{psd}}}$ is convex.

And there is a corresponding approximation lemma. We use ${\|N\|_{\infty}=\max_{x,y} |N(x,y)|}$ and ${\|N\|_1 = \mathop{\mathbb E}_{x,y} |N(x,y)|}$.

Lemma 3 For every non-negative matrix ${M : X \times Y \rightarrow \mathbb R_+}$ and every ${\eta \in (0,1]}$, there is a matrix ${N}$ such that ${\|M-N\|_{\infty} \leq \eta \|M\|_{\infty}}$ and

$\displaystyle \alpha_{\mathsf{psd}}(N) \leq O(\mathrm{rank}_{\mathsf{psd}}(M)) \frac{1}{\eta} \frac{\|M\|_{\infty}}{\|M\|_1}\,.$

Using Lemma 2 in a straightforward way, it is not particularly difficult to construct the approximator ${N}$. The condition ${\mathop{\mathbb E}_x [A(x)] = I}$ poses a slight difficulty that requires adding a small multiple of the identity to the LHS of the factorization (to avoid a poor condition number), but this has a correspondingly small effect on the approximation quality. Putting “Alice” into “isotropic position” is not essential, but it makes the next part of the approach (quantum entropy optimization) somewhat simpler because one is always measuring relative entropy to the maximally mixed state.

# Entropy optimality: Quantum lifts of polytopes

In these previous posts, we explored whether the cut polytope can be expressed as the linear projection of a polytope with a small number of facets (i.e., whether it has a small linear programming extended formulation).

For many cut problems, semi-definite programs (SDPs) are able to achieve better approximation ratios than LPs. The most famous example is the Goemans-Williamson ${0.878}$-approximation for MAX-CUT. The techniques of the previous posts (see the full paper for details) are able to show that no polynomial-size LP can achieve better than factor ${1/2}$.

1.1. Spectrahedral lifts

The feasible regions of LPs are polyhedra. Up to linear isomorphism, every polyhedron ${P}$ can be represented as ${P = \mathbb R_+^n \cap V}$ where ${\mathbb R_+^n}$ is the positive orthant and ${V \subseteq \mathbb R^n}$ is an affine subspace.

In this context, it makes sense to study any cones that can be optimized over efficiently. A prominent example is the positive semi-definite cone. Let us define ${\mathcal S_+^n \subseteq \mathbb R^{n^2}}$ as the set of ${n \times n}$ real, symmetric matrices with non-negative eigenvalues. A spectrahedron is the intersection ${\mathcal S_+^n \cap V}$ with an affine subspace ${V}$. The value ${n}$ is referred to as the dimension of the spectrahedron.

In analogy with the ${\gamma}$ parameter we defined for polyhedral lifts, let us define ${\bar \gamma_{\mathsf{sdp}}(P)}$ for a polytope ${P}$ to be the minimal dimension of a spectrahedron that linearly projects to ${P}$. It is an exercise to show that ${\bar \gamma_{\mathsf{sdp}}(P) \leq \bar \gamma(P)}$ for every polytope ${P}$. In other words, spectahedral lifts are at least as powerful as polyhedral lifts in this model.

In fact, they are strictly more powerful. Certainly there are many examples of this in the setting of approximation (like the Goemans-Williamson SDP mentioned earlier), but there are also recent gaps between ${\bar \gamma}$ and ${\bar \gamma_{\mathrm{sdp}}}$ for polytopes; see the work of Fawzi, Saunderson, and Parrilo.

Nevertheless, we are recently capable of proving strong lower bounds on the dimension of such lifts. Let us consider the cut polytope ${\mathrm{CUT}_n}$ as in previous posts.

Theorem 1 (L-Raghavendra-Steurer 2015) There is a constant ${c > 0}$ such that for every ${n \geq 1}$, one has ${\bar \gamma_{\mathsf{sdp}}(\mathrm{CUT}_n) \geq e^{c n^{2/13}}}$.

Our goal in this post and the next is to explain the proof of this theorem and how quantum entropy maximization plays a key role.

1.2. PSD rank and factorizations

Just as in the setting of polyhedra, there is a notion of “factorization through a cone” that characterizes the parameter ${\bar \gamma_{\mathsf{sdp}}(P)}$. Let ${M \in \mathbb R^{m \times n}_+}$ be a non-negative matrix. One defines the psd rank of ${M}$ as the quantity

$\displaystyle \mathrm{rank}_{\mathsf{psd}}(M) = \min \left\{ r : M_{ij} = \mathrm{Tr}(A_i B_j) \textrm{ for some } A_1, \ldots, A_m, B_1, \ldots, B_n \in \mathcal S_+^r\right\}\,.$

The following theorem was independently proved by Fiorini, Massar, Pokutta, Tiwary, and de Wolf and Gouveia, Parrilo, and Thomas. The proof is a direct analog of Yannakakis’ proof for non-negative rank.

Theorem 2 For every polytope ${P}$, it holds that ${\bar \gamma_{\mathsf{sdp}}(P) = \mathrm{rank}_{\mathsf{psd}}(M)}$ for any slack matrix ${M}$ of ${P}$.

Recall the class ${\mathrm{QML}_n^+}$ of non-negative quadratic multi-linear functions that are positive on ${\{0,1\}^n}$ and the matrix ${\mathcal M_n : \mathrm{QML}_n^+ \times \{0,1\}^n \rightarrow \mathbb R_+}$ given by

$\displaystyle \mathcal M_n(f,x) = f(x)\,.$

We saw previously that ${\mathcal M_n}$ is a submatrix of some slack matrix of ${\mathrm{CUT}_n}$. Thus our goal is to prove a lower bound on ${\mathrm{rank}_{\mathsf{psd}}(\mathcal M_n)}$.

1.3. Sum-of-squares certificates

Just as in the setting of non-negative matrix factorization, we can think of a low psd rank factorization of ${\mathcal M_n}$ as a small set of “axioms” that can prove the non-negativity of every function in ${\mathrm{QML}_n^+}$. But now our proof system is considerably more powerful.

For a subspace of functions ${\mathcal U \subseteq L^2(\{0,1\}^n)}$, let us define the cone

$\displaystyle \mathsf{sos}(\mathcal U) = \mathrm{cone}\left(q^2 : q \in \mathcal U\right)\,.$

This is the cone of squares of functions in ${\mathcal U}$. We will think of ${\mathcal U}$ as a set of axioms of size ${\mathrm{dim}(\mathcal U)}$ that is able to assert non-negativity of every ${f \in \mathrm{sos}(\mathcal U)}$ by writing

$\displaystyle f = \sum_{i=1}^k q_i^2$

for some ${q_1, \ldots, q_k \in \mathsf{sos}(\mathcal U)}$.

Fix a subspace ${\mathcal U}$ and let ${r = \dim(\mathcal U)}$. Fix also a basis ${q_1, \ldots, q_r : \{0,1\}^n \rightarrow \mathbb R}$ for ${\mathcal U}$.

Define ${B : \{0,1\}^n \rightarrow \mathcal S_+^r}$ by setting ${B(x)_{ij} = q_i(x) q_j(x)}$. Note that ${B(x)}$ is PSD for every ${x}$ because ${B(x) = \vec q(x) \vec q(x)^T}$ where ${\vec q(x)=(q_1(x), \ldots, q_r(x))}$.

We can write every ${p \in \mathcal U}$ as ${p = \sum_{i=1}^r \lambda_i q_i}$. Defining ${A(p^2) \in \mathcal S_+^r}$ by ${A(p^2)_{ij} = \lambda_i \lambda_j}$, we see that

$\displaystyle \mathrm{Tr}(A(p^2) Q(x)) = \sum_{i,j} \lambda_i \lambda_j q_i(x) q_j(x) = p(x)^2\,.$

Now every ${f \in \mathsf{sos}(\mathcal U)}$ can be written as ${\sum_{i=1}^k c_i p_i^2}$ for some ${k \geq 0}$ and ${\{c_i \geq 0\}}$. Therefore if we define ${A(f) = \sum_{i=1}^k c_i \Lambda(p_i^2)}$ (which is a positive sum of PSD matrices), we arrive at the representation

$\displaystyle f(x) = \mathrm{Tr}(A(f) B(x))\,.$

In conclusion, if ${\mathrm{QML}_+^n \subseteq \mathsf{sos}(\mathcal U)}$, then ${\mathrm{rank}_{\mathsf{psd}}(\mathcal M_n) \leq \dim(\mathsf{sos}(\mathcal U))}$.

By a “purification” argument, one can also conclude ${\dim(\mathsf{sos}(\mathcal U)) \leq \mathrm{rank}_{\mathsf{psd}}(\mathcal M_n)^2}$.

1.4. The canonical axioms

And just as ${d}$-juntas were the canonical axioms for our NMF proof system, there is a similar canonical family in the SDP setting: Let ${\mathcal Q_d}$ be the subspace of all degree-${d}$ multi-linear polynomials on ${\mathbb R^n}$. We have

$\displaystyle \dim(\mathcal Q_d) \leq \sum_{k=0}^d {n \choose k} \leq 1+n^d\,. \ \ \ \ \ (1)$

For a function ${f : \{0,1\}^n \rightarrow \mathbb R_+}$, one defines

$\displaystyle \deg_{\mathsf{sos}}(f) = \min \{d : f \in \mathsf{sos}(\mathcal Q_{d}) \}\,.$

(One could debate whether the definition of sum-of-squares degree should have ${d/2}$ or ${d}$. The most convincing arguments suggest that we should use membership in the cone of squares over ${\mathcal Q_{\lfloor d/2\rfloor}}$ so that the sos-degree will be at least the real-degree of the function.)

On the other hand, our choice has the following nice property.

Lemma 3 For every ${f : \{0,1\}^n \rightarrow \mathbb R_+}$, we have ${\deg_{\mathrm{sos}}(f) \leq \deg_J(f)}$.

Proof: If ${q}$ is a non-negative ${d}$-junta, then ${\sqrt{q}}$ is also a non-negative ${d}$-junta. It is elementary to see that every ${d}$-junta is polynomial of degree at most ${d}$, thus ${q}$ is the square of a polynomial of degree at most ${d}$. $\Box$

1.5. The canonical tests

As with junta-degree, there is a simple characterization of sos-degree in terms of separating functionals. Say that a functional ${\varphi : \{0,1\}^n \rightarrow \mathbb R}$ is degree-${d}$ pseudo-positive if

$\displaystyle \langle \varphi, q^2 \rangle = \mathop{\mathbb E}_{x \in \{0,1\}^n} \varphi(x) q(x)^2 \geq 0$

whenever ${q : \{0,1\}^n \rightarrow \mathbb R}$ satisfies ${\deg(q) \leq d}$ (and by ${\deg}$ here, we mean degree as a multi-linear polynomial on ${\{0,1\}^n}$).

Again, since ${\mathsf{sos}(\mathcal Q_d)}$ is a closed convex set, there is precisely one way to show non-membership there. The following characterization is elementary.

Lemma 4 For every ${f : \{0,1\}^n \rightarrow \mathbb R_+}$, it holds that ${\deg_{\mathsf{sos}}(f) > d}$ if and only if there is a degree-${d}$ pseudo-positive functional ${\varphi : \{0,1\}^n \rightarrow \mathbb R}$ such that ${\langle \varphi,f\rangle < 0}$.

1.6. The connection to psd rank

Following the analogy with non-negative rank, we have two objectives left: (1) to exhibit a function ${f \in \mathrm{QML}_n^+}$ with ${\deg_{\mathsf{sos}}(f)}$ large, and (ii) to give a connection between the sum-of-squares degree of ${f}$ and the psd rank of an associated matrix.

Notice that the function ${f(x)=(1-\sum_{i=1}^n x_i)^2}$ we used for junta-degree has ${\deg_{\mathsf{sos}}(f)=1}$, making it a poor candidate. Fortunately, Grigoriev has shown that the knapsack polynomial has large sos-degree.

Theorem 5 For every odd ${m \geq 1}$, the function

$\displaystyle f(x) = \left(\frac{m}{2} - \sum_{i=1}^n x_i\right)^2 - \frac14$

has ${\deg_{\mathsf{sos}}(f) \geq \lceil m/2\rceil}$.

Observe that this ${f}$ is non-negative over ${\{0,1\}^m}$ (because ${m}$ is odd), but it is manifestly not non-negative on ${\mathbb R^m}$.

Finally, we recall the submatrices of ${\mathcal M_n}$ defined as follows. Fix some integer ${m \geq 1}$ and a function ${g : \{0,1\}^m \rightarrow \mathbb R_+}$. Then ${M_n^g : {[n] \choose m} \times \{0,1\}^n \rightarrow \mathbb R_+}$ is given by

$\displaystyle M_n^g(S,x) = g(x|_S)\,.$

In the next post, we discuss the proof of the following theorem.

Theorem 6 (L-Raghavendra-Steurer 2015) For every ${m \geq 1}$ and ${g : \{0,1\}^m \rightarrow \mathbb R_+}$, there exists a constant ${C(g)}$ such that the following holds. For every ${n \geq 2m}$,

$\displaystyle 1+n^{d} \geq \mathrm{rank}_{\mathsf{psd}}(M_n^g) \geq C(g) \left(\frac{n}{\log n}\right)^{(d-1)/2}\,,$

where $d=\deg_{\mathsf{sos}}(g)$.

Note that the upper bound is from (1) and the non-trivial content is contained in the lower bound. As before, in conjunction with Theorem 5, this shows that $\mathrm{rank}_{\mathsf{psd}}(\mathcal M_n)$ cannot be bounded by any polynomial in $n$ and thus the same holds for $\bar \gamma_{\mathsf{sdp}}(\mathrm{CUT}_n)$.

# Entropy optimality: Non-negative rank lower bounds

Using the notation from the last post, our goal is now to prove the following theorem.

Theorem 1 For every ${m \geq 1}$ and ${g : \{0,1\}^m \rightarrow \mathbb R_+}$, there is a constant ${C=C(g)}$ such that for all ${n \geq 2m}$,

$\displaystyle \mathrm{rank}_+(M_n^g) \geq C \left(\frac{n}{\log n}\right)^{\mathrm{deg}_J(g)-1}\,.$

1.1. Convex relaxations of non-negative rank

Before getting to the proof, let us discuss the situation in somewhat more generality. Consider finite sets ${X}$ and ${Y}$ and a matrix ${M : X \times Y \rightarrow \mathbb R_+}$ with ${r=\mathrm{rank}_+(M)}$.

In order to use entropy-maximization, we would like to define a convex set of low non-negative rank factorizations (so that maximizing entropy over this set will give us a “simple” factorization). But the convex hull of ${\{ N \in \mathbb R_+^{X \times Y} : \mathrm{rank}_+(N) = 1 \}}$ is precisely the set of all non-negative matrices.

Instead, let us proceed analytically. For simplicity, let us equip both ${X}$ and ${Y}$ with the uniform measure. Let ${\mathcal Q = \{ b : Y \rightarrow \mathbb R_+ \mid \|b\|_1 = 1\}}$ denote the set of probability densities on ${Y}$. Now define

$\displaystyle \alpha_+(N) = \min \Big\{ \alpha : \exists A \in \mathbb R_+^{X \times k}, B \in \mathbb R_+^{k \times Y} \textrm{ with } N=AB, \textrm{ and}$

$\displaystyle \qquad\qquad\qquad\qquad\qquad\{B_1, \ldots, B_k\} \subseteq \mathcal Q, \textrm{ and }$

$\displaystyle \hphantom{\{B_1, \ldots, B_k\} \subseteq \mathcal Q,} \max_{i \in [k]} \|B_i\|_{\infty} \leq \alpha, \sum_{i=1}^k \|A^{(i)}\|_{\infty} \leq \alpha\Big\}.$

Here ${\{A^{(i)}\}}$ are the columns of ${A}$ and ${\{B_i\}}$ are the rows of ${B}$. Note that now ${k}$ is unconstrained.

Observe that ${\alpha_+}$ is a convex function. To see this, given a pair ${N=AB}$ and ${N'=A'B'}$, write

$\displaystyle \frac{N+N'}{2} = \left(\begin{array}{cc} \frac12 A & \frac12 A' \end{array}\right) \left(\begin{array}{l} \vphantom{\bigoplus} B \\ \vphantom{\bigoplus} B' \end{array}\right)\,,$

witnessing the fact that ${\alpha_+(\frac12(N+N')) \leq \frac12 \alpha_+(N) + \frac12 \alpha_+(N')}$.

1.2. A truncation argument

So the set ${\{ N : \alpha_+(N) \leq c \}}$ is convex, but it’s not yet clear how this relates to ${\mathrm{rank}_+}$. We will see now that low non-negative rank matrices are close to matrices with ${\alpha_+}$ small. In standard communication complexity/discrepancy arguments, this corresponds to discarding “small rectangles.”

In the following lemma, we use the norms ${\|M\|_1 = \mathbb E_{x,y} |M_{x,y}|}$ and ${\|M\|_{\infty} = \max_{x,y} |M_{x,y}|}$.

Lemma 2 For every non-negative ${M \in \mathbb R_+^{X \times Y}}$ with ${\mathrm{rank}_+(M) \leq r}$ and every ${\delta \in (0,1)}$, there is a matrix ${\tilde M \in \mathbb R_+^{X \times Y}}$ such that

$\displaystyle \|M-\tilde M\|_1 \leq \delta$

and

$\displaystyle \alpha_+(\tilde M) \leq \frac{r}{\delta} \|M\|_{\infty}\,.$

Proof: Suppose that ${M=AB}$ with ${A \in \mathbb R_+^{X \times r}, B \in \mathbb R_+^{r \times Y}}$, and let us interpret this factorization in the form

$\displaystyle M(x,y) = \sum_{i=1}^r A_i(x) B_i(y) \ \ \ \ \ (1)$

(where ${\{A_i\}}$ are the columns of ${A}$ and ${\{B_i\}}$ are the rows of ${B}$). By rescaling the columns of ${A}$ and the rows of ${B}$, respectively, we may assume that ${\mathop{\mathbb E}_{\mu}[B_i]=1}$ for every ${i \in [r]}$.

Let ${\Lambda = \{ i : \|B_i\|_{\infty} > \tau \}}$ denote the “bad set” of indices (we will choose ${\tau}$ momentarily). Observe that if ${i \in \Lambda}$, then

$\displaystyle \|A_i\|_{\infty} \leq \frac{\|M\|_{\infty}}{\tau}\,,$

from the representation (1) and the fact that all summands are positive.

Define the matrix ${\tilde M(x,y) = \sum_{i \notin \Lambda} A_i(x) B_i(y)}$. It follows that

$\displaystyle \|M-\tilde M\|_1 = \mathop{\mathbb E}_{x,y} \left[|M(x,y)-\tilde M(x,y)|\right] = \sum_{i \in \Lambda} \mathop{\mathbb E}_{x,y} [A_i(x) B_i(y)]\,.$

Each of the latter terms is at most ${\|A_i\|_{\infty} \|B_i\|_1 \leq \frac{\|M\|_{\infty}}{\tau}}$ and ${|\Lambda| \leq r}$, thus

$\displaystyle \|M-\tilde M\|_1 \leq r \frac{\|M\|_{\infty}}{\tau}\,.$

Next, observe that

$\displaystyle \mathop{\mathbb E}_y [M(x,y)] = \sum_{i=1}^r A_i(x) \|B_i\|_1 = \sum_{i=1}^r A_i(x)\,,$

implying that ${\|A_i\|_{\infty} \leq \|M\|_{\infty}}$ and thus ${\sum_{i=1}^r \|A_i\|_{\infty} \leq r \|M\|_{\infty}}$.

Setting ${\tau = r \|M\|_{\infty}/\delta}$ yields the statement of the lemma. $\Box$

Generally, the ratio ${\frac{\|M\|_{\infty}}{\|M\|_1}}$ will be small compared to ${r}$ (e.g., polynomial in ${n}$ vs. super-polynomial in ${n}$). Thus from now on, we will actually prove a lower bound on ${\alpha_+(M)}$. One has to verify that the proof is robust enough to allow for the level of error inherent in Lemma 2.

1.3. The test functionals

Now we have a convex body of low “analytic non-negative rank” matrices. Returning to Theorem 1 and the matrix ${M_n^g}$, we will now assume that ${\alpha_+(M_n^g) \leq \alpha}$. Next we identify the proper family of test functionals that highlight the difficulty of factoring the matrix ${M_n^g}$. We will consider the uniform measures on ${{[n] \choose m}}$ and ${\{0,1\}^n}$. We use ${\mathop{\mathbb E}_S}$ and ${\mathop{\mathbb E}_x}$ to denote averaging with respect to these measures.

Let ${d=\deg_J(g)-1}$. From the last post, we know there exists a ${d}$-locally positive functional ${\varphi : \{0,1\}^n \rightarrow \mathbb R}$ such that ${\beta := \mathop{\mathbb E}_x \varphi(x) g(x) < 0}$, and ${\mathop{\mathbb E}_x \varphi(x) q(x) \geq 0}$ for every ${d}$-junta ${q}$.

For ${S \subseteq [n]}$ with ${|S|=m}$, let us denote ${\varphi_S(x) = \varphi(x|_S)}$. These functionals prove lower bounds on the junta-degree of ${g}$ restricted to various subsets of the coordinates. If we expect that junta-factorizations are the “best” of a given rank, then we have some confidence in choosing the family ${\{\varphi_S\}}$ as our test functions.

1.4. Entropy maximization

Use ${\alpha_+(M_n^g) \leq \alpha}$ to write

$\displaystyle M_n^g(S,x) = \sum_{i=1}^k A_i(S) B_i(x)\,,$

where ${A_i, B_i \geq 0}$ and we have ${\|B_i\|_1=1}$ and ${\|B_i\|_{\infty} \leq \alpha}$ for all ${i \in [k]}$, and finally ${\sum_{i=1}^k \|A_i\|_{\infty} \leq \alpha}$.

First, as we observed last time, if each ${B_i}$ were a ${d}$-junta, we would have a contradiction:

$\displaystyle \mathop{\mathbb E}_{S,x} \left[\varphi_S(x) M_n^g(S,x)\right] = \mathop{\mathbb E}_{y \in \{0,1\}^m} \varphi(y) g(y) = \beta < 0\,, \ \ \ \ \ (2)$

and yet

$\displaystyle \mathop{\mathbb E}_{S,x} \left[\varphi_S(x) \sum_{i=1}^k A_i(S) B_i(x)]\right] = \sum_{i=1}^k \mathop{\mathbb E}_S A_i(S) \mathop{\mathbb E}_x \left[\varphi_S(x) B_i(x)\right] \geq 0 \ \ \ \ \ (3)$

because ${\mathop{\mathbb E}_x \left[\varphi_S(x) B_i(x)\right] = \mathop{\mathbb E}_{y \in \{0,1\}^S} \varphi(y) \mathop{\mathbb E}_{x} \left[B_i(x) \,\big|\, x|_S =y\right] \geq 0}$ since ${\varphi}$ is ${d}$-locally positive and the function ${y \mapsto \mathop{\mathbb E}_{x} \left[B_i(x) \,\big|\, x|_S =y\right]}$ is a ${d}$-junta.

So now the key step: Use entropy maximization to approximate ${B_i}$ by a junta! In future posts, we will need to consider the entire package ${\{B_1, \ldots, B_k\}}$ of functions simultaneously. But for the present lower bound, it suffices to consider each ${B_i}$ separately.

Consider the following optimization over variables ${\{\tilde B_i(x) : x \in \{0,1\}^n\}}$:

$\displaystyle \mathrm{minimize }\qquad \mathrm{Ent}(\tilde B_i) = \mathbb E [\tilde B_i \log \tilde B_i]$

$\displaystyle \qquad\textrm{ subject to } \qquad \qquad\mathbb E[\tilde B_i]=1, \quad \tilde B_i(x) \geq 0 \quad \forall x \in \{0,1\}^n$

$\displaystyle \hphantom{\bigoplus} \qquad \qquad \qquad \qquad \mathop{\mathbb E}_x \left[\varphi_S(x) \tilde B_i(x)\right] \leq \mathop{\mathbb E}_x \left[\varphi_S(x) B_i(x)\right] + \varepsilon \qquad \forall |S|=m\,. \ \ \ \ \ (4)$

The next claim follows immediately from Theorem 1 in this post (solving the max-entropy optimization by sub-gradient descent).

Claim 1 There exists a function ${\tilde B_i : \{0,1\}^n \rightarrow \mathbb R_+^n}$ satisfying all the preceding constraints and of the form

$\displaystyle \tilde B_i = \frac{\exp\left(\sum_{j=1}^{M} \lambda_j \varphi_{S_j}\right)}{\mathop{\mathbb E} \exp\left(\sum_{i=1}^{M} \lambda_j \varphi_{S_j}\right)}$

such that

$\displaystyle M \leq C(g) \frac{\log \alpha}{\varepsilon^2}\,,$

where ${C(g)}$ is some constant depending only on ${g}$.

Note that ${\varphi}$ depends only on ${g}$, and thus ${\|\varphi_S\|_{\infty}}$ only depends on ${g}$ as well. Now each ${\varphi_{S_j}}$ only depends on ${m}$ variables (those in ${S_j}$ and ${|S_j|=m}$), meaning that our approximator ${\tilde B_i}$ is an ${h}$-junta for

$\displaystyle h \leq m \cdot C(g) \frac{\log \alpha}{\varepsilon^2}\,. \ \ \ \ \ (5)$

Oops. That doesn’t seem very good. The calculation in (3) needs that ${\tilde B_i}$ is a ${d}$-junta, and certainly ${d < m}$ (since ${g}$ is a function on ${\{0,1\}^m}$). Nevertheless, note that the approximator is a non-trivial junta. For instance, if ${\alpha \leq 2^{\sqrt{n}}}$, then it is an $O(\sqrt{n})$-junta, recalling that $m$ is a constant (depending on ${g}$).

1.5. Random restriction saves the day

Let’s try to apply the logic of (3) to the ${\tilde B_i}$ approximators anyway. Fix some ${i \in [k]}$ and let ${J_i}$ be the set of coordinates on which ${\tilde B_i}$ depends. Then:

$\displaystyle \mathop{\mathbb E}_{S,x} \left[\varphi_S(x) A_i(S) \tilde B_i(x)\right] = \mathop{\mathbb E}_S\,A_i(S) \mathop{\mathbb E}_{y \in \{0,1\}^S} \varphi(y) \mathop{\mathbb E}_{x \in \{0,1\}^n} \left[B_i(x) \,\big|\, x|_S=y\right]$

Note that the map ${y \mapsto \mathop{\mathbb E}_{x \in \{0,1\}^n} \left[B_i(x) \,\big|\, x|_S=y\right]}$ is a junta on ${J_i \cap S}$. Thus if ${|J_i \cap S| \leq d}$, then the contribution from this term is non-negative since ${\varphi}$ is ${d}$-locally positive. But ${|S|=m}$ is fixed and ${n}$ is growing, thus ${|J_i \cap S| > d}$ is quite rare! Formally,

$\displaystyle \mathop{\mathbb E}_{S,x} \left[\varphi_S(x) A_i(S) \tilde B_i(x)\right] \geq - \|A_i\|_{\infty}\, \mathbb P_S[|J_i \cap S| > d] \geq - \|A_i\|_{\infty} \frac{h^d (2m)^d}{n^d} \,.$

In the last estimate, we have used a simple union bound and ${n \geq 2m}$.

Putting everything together and summing over ${i \in [k]}$, we conclude that

$\displaystyle \sum_{i=1}^k \mathop{\mathbb E}_{S,x} \left[\varphi_S(x) A_i(S) \tilde B_i(x)\right] \geq -\alpha \frac{h^d (2m)^d}{n^d}\,.$

Note that by choosing ${n}$ only moderately large, we will make this error term very small.

Moreover, since each ${\tilde B_i}$ is a feasible point of the optimization (4), we have

$\displaystyle \sum_{i=1}^k \mathop{\mathbb E}_{S,x} \left[\varphi_S(x) A_i(S) B_i(x)\right] = \sum_{i=1}^k \mathop{\mathbb E}_S A_i(S) \mathop{\mathbb E}_x \left[\varphi_S(x) B_i(x)\right]$

$\displaystyle \hphantom{\sum_{i=1}^k \mathop{\mathbb E}_{S,x} \left[\varphi_S(x) A_i(S) B_i(x)\right]} \geq \sum_{i=1}^k \mathop{\mathbb E}_S A_i(S) \left(-\varepsilon + \mathop{\mathbb E}_x \left[\varphi_S(x) \tilde B_i(x)\right]\right)$

$\displaystyle \hphantom{\sum_{i=1}^k \mathop{\mathbb E}_{S,x} \left[\varphi_S(x) A_i(S) B_i(x)\right]} \geq -\varepsilon \sum_{i=1}^k \|A_i\|_1 - \alpha \frac{h^d (2m)^d}{n^d}\,.$

Almost there: Now observe that

$\displaystyle \|g\|_1 = \mathop{\mathbb E}_{S,x} [M_n^g(S,x)] = \sum_{i=1}^k \|A_i\|_1 \|B_i\|_1 = \sum_{i=1}^k \|A_i\|_1\,.$

Plugging this into the preceding line yields

$\displaystyle \sum_{i=1}^k \mathop{\mathbb E}_{S,x} \left[\varphi_S(x) A_i(S) B_i(x)\right] \geq -\varepsilon \|g\|_1 - \alpha \frac{h^d (2m)^d}{n^d}\,.$

Recalling now (2), we have derived a contradiction to ${\alpha_+(M) \leq \alpha}$ if we can choose the right-hand side to be bigger than ${\beta}$ (which is a negative constant depending only on ${g}$). Setting ${\varepsilon = -\beta/(2 \|g\|_1)}$, we consult (5) to see that

$\displaystyle h \leq C'(g) m \log \alpha$

for some other constant ${C'(g)}$ depending only on ${g}$. We thus arrive at a contradiction if ${\alpha = o((n/\log n)^d)}$, recalling that ${m, d}$ depend only on ${g}$. This completes the argument.

# Entropy optimality: Lifts of polytopes

Recall from the previous post that ${\mathrm{CUT}_n}$ denotes the cut polytope on the ${n}$-vertex complete graph, and ${\bar \gamma(\mathrm{CUT}_n)}$ is the smallest number of facets in any polytope that linearly projects to ${\mathrm{CUT}_n}$. Our goal is to prove a lower bound on this quantity, but first we should mention that a nearly tight lower bound is known.

Theorem 1 (Fiorini, Massar, Pokutta, Tiwari, de Wolf 2012) There is a constant ${c > 0}$ such that for every ${n \geq 1}$, ${\bar \gamma(\mathrm{CUT}_n) \geq e^{cn}}$.

We will present a somewhat weaker lower bound using entropy maximization, following our joint works with Chan, Raghavendra, and Steurer (2013) and with Raghavendra and Steurer (2015). This method is only currently capable of proving that ${\bar \gamma(\mathrm{CUT}_n) \geq e^{cn^{1/3}}}$, but it has the advantage of being generalizable—it extends well to the setting of approximate lifts and spectrahedral lifts (we’ll come to the latter in a few posts).

1.1. The entropy maximization framework

To use entropy optimality, we proceed as follows. For the sake of contradiction, we assume that ${\bar \gamma(\mathrm{CUT}_n)}$ is small.

First, we will identify the space of potential lifts of small ${\gamma}$-value with the elements of a convex set of probability measures. (This is where the connection to non-negative matrix factorization (NMF) will come into play.) Then we will choose a family of “tests” intended to capture the difficult aspects of being a valid lift of ${\mathrm{CUT}_n}$. This step is important as having more freedom (corresponding to weaker tests) will allow the entropy maximization to do more “simplification.” The idea is that the family of tests should still be sufficiently powerful to prove a lower bound on the entropy-optimal hypothesis.

Finally, we will maximize the entropy of our lift over all elements of our convex set, subject to performing well on the tests. Our hope is that the resulting lift is simple enough that we can prove it couldn’t possibly pass all the tests, leading to a contradiction.

In order to find the right set of tests, we will identify a family of canonical (approximate) lifts. This is family ${\{P_k\}}$ of polytopes so that ${\gamma(P_k) \leq O(n^k)}$ and which we expect to give the “best approximation” to ${\mathrm{CUT}_n}$ among all lifts with similar ${\gamma}$-value. We can identify this family precisely because these will be lifts that obey the natural symmetries of the cut polytope (observe that the symmetric group ${S_n}$ acts on ${\mathrm{CUT}_n}$ in the natural way).

1.2. NMF and positivity certificates

Recall the matrix ${\mathcal M_n : \mathrm{QML}^+_n \times \{0,1\}^n \rightarrow \mathbb R_+}$ given by ${\mathcal M_n(f,x)=f(x)}$, where ${\mathrm{QML}^+_n}$ is the set of all quadratic multi-linear functions that are non-negative on ${\{0,1\}^n}$. In the previous post, we argued that ${\bar \gamma(\mathrm{CUT}_{n+1}) \geq \mathrm{rank}_+(\mathcal M_n)}$.

If ${r=\mathrm{rank}_+(\mathcal M_n)}$, it means we can write

$\displaystyle f(x) = \mathcal M_n(f,x) = \sum_{i=1}^r A_i(f) B_i(x) \ \ \ \ \ (1)$

for some functions ${A_i : \mathrm{QML}^+_n \rightarrow \mathbb R_+}$ and ${B_i : \{0,1\}^n \rightarrow \mathbb R_+}$. (Here we are using a factorization ${\mathcal M_n = AB}$ where ${A_{f,i} = A_i(f)}$ and ${B_{x,i}=B_i(x)}$.)

Thus the low-rank factorization gives us a “proof system” for ${\mathrm{QML}^+_n}$. Every ${f \in \mathrm{QML}^+_n}$ can be written as a conic combination of the functions ${B_1, B_2, \ldots, B_r}$, thereby certifying its positivity (since the ${B_i}$‘s are positive functions).

If we think about natural families ${\mathcal B = \{B_i\}}$ of “axioms,” then since ${\mathrm{QML}^+_n}$ is invariant under the natural action of ${S_n}$, we might expect that our family ${\mathcal B}$ should share this invariance. Once we entertain this expectation, there are natural small families of axioms to consider: The space of non-negative ${k}$-juntas for some ${k \ll n}$.

A ${k}$-junta ${b : \{0,1\}^n \rightarrow \mathbb R}$ is a function whose value only depends on ${k}$ of its input coordinates. For a subset ${S \subseteq \{1,\ldots,n\}}$ with ${|S|=k}$ and an element ${z \in \{0,1\}^k}$, let ${q_{S,z} : \{0,1\}^n \rightarrow \{0,1\}}$ denote the function given by ${q_{S,z}(x)=1}$ if and only if ${x|_S=z}$.

We let ${\mathcal J_k = \{ q_{S,z} : |S| \leq k, z \in \{0,1\}^{|S|} \}}$. Observe that ${|\mathcal J_k| \leq O(n^k)}$. Let us also define ${\mathrm{cone}(\mathcal J_k)}$ as the set of all conic combinations of functions in ${\mathcal J_k}$. It is easy to see that ${\mathrm{cone}(\mathcal J_k)}$ contains precisely the conic combinations of non-negative ${k}$-juntas.

If it were true that ${\mathrm{QML}^+_n \subseteq \mathcal J_k}$ for some ${k}$, we could immediately conclude that ${\mathrm{rank}_+(\mathcal M_n) \leq |\mathcal J_k| \leq O(n^k)}$ by writing ${\mathcal M_n}$ in the form (1) where now ${\{B_i\}}$ ranges over the elements of ${\mathcal J_k}$ and ${\{A_i(f)\}}$ gives the corresponding non-negative coefficients that follow from ${f \in \mathcal J_k}$.

1.3. No ${k}$-junta factorization for ${k \leq n/2}$

We will now see that juntas cannot provide a small set of axioms for ${\mathrm{QML}^+_n}$.

Theorem 2 Consider the function ${f : \{0,1\}^n \rightarrow \mathbb R_+}$ given by ${f(x) = \left(1-\sum_{i=1}^n x_i\right)^2}$. Then ${f \notin \mathcal J_{\lceil n/2\rceil}}$.

Toward the proof, let’s introduce a few definitions. First, for ${f : \{0,1\}^n \rightarrow \mathbb R_+}$, define the junta degree of ${f}$ to be

$\displaystyle \deg_J(f) = \min \{ k : f \in \mathrm{cone}(\mathcal J_k) \}\,.$

Since every ${f}$ is an ${n}$-junta, we have ${\deg_J(f) \leq n}$.

Now because ${\{ f : \deg_J(f) \leq k \}}$ is a cone, there is a universal way of proving that ${\deg_J(f) > k}$. Say that a functional ${\varphi : \{0,1\}^n \rightarrow \mathbb R}$ is ${k}$-locally positive if for all ${|S| \leq k}$ and ${z \in \{0,1\}^{|S|}}$, we have

$\displaystyle \sum_{x \in \{0,1\}^n} \varphi(x) q_{S,z}(x) \geq 0\,.$

These are precisely the linear functionals separating a function ${f}$ from ${\mathrm{cone}(\mathcal J_k)}$: We have ${\mathrm{deg}_J(f) > k}$ if and only if there is a ${k}$-locally positive functional ${\varphi}$ such that ${\sum_{x \in \{0,1\}^n} \varphi(x) f(x) < 0}$. Now we are ready to prove Theorem 2.

Proof: We will prove this using an appropriate ${k}$-locally positive functional. Define

$\displaystyle \varphi(x) = \begin{cases} -1 & |x| = 0 \\ \frac{2}{n} & |x|=1 \\ 0 & |x| > 1\,, \end{cases}$

where ${|x|}$ denotes the hamming weight of ${x \in \{0,1\}^n}$.

First, observe that

$\displaystyle \sum_{x \in \{0,1\}^n} \varphi(x) = -1 + n \cdot \frac{2}{n} = 1\,.$

Now recall the the function ${f}$ from the statement of the theorem and observe that by opening up the square, we have

$\displaystyle \sum_{x \in \{0,1\}^n} \varphi(x) f(x) = \sum_{x \in \{0,1\}^n} \varphi(x) \left(1-2 \sum_i x_i + \sum_i x_i^2 + 2\sum_{i \neq j} x_i x_j\right)\ \ \ \ \ (2)$

$\displaystyle = \sum_{x \in \{0,1\}^n} \varphi(x) \left(1-\sum_i x_i\right) = -1\,.$

Finally, consider some ${S \subseteq \{1,\ldots,n\}}$ with ${|S|=k \leq n/2}$ and ${z \in \{0,1\}^k}$. If ${z=\mathbf{0}}$, then

$\displaystyle \sum_{x \in \{0,1\}^n} \varphi(x) q_{S,z}(x) = -1 + \frac{2}{n} \left(n-k\right) \geq 0\,.$

If ${|z| > 1}$, then the sum is 0. If ${|z|=1}$, then the sum is non-negative because in that case ${q_{S,z}}$ is only supported on non-negative values of ${\varphi}$. We conclude that ${\varphi}$ is ${k}$-locally positive for ${k \leq n/2}$. Combined with (2), this yields the statement of the theorem. $\Box$

1.4. From juntas to general factorizations

So far we have seen that we cannot achieve a low non-negative rank factorization of ${\mathcal M_n}$ using ${k}$-juntas for ${k \leq n/2}$.

Brief aside: If one translates this back into the setting of lifts, it says that the ${k}$-round Sherali-Adams lift of the polytope

$\displaystyle P = \left\{ x \in [0,1]^{n^2} : x_{ij}=x_{ji},\, x_{ij} \leq x_{jk} + x_{ki} \quad \forall i,j,k \in \{1,\ldots,n\}\right\}$

does not capture ${\mathrm{CUT}_n}$ for ${k \leq n/2}$.

In the next post, we will use entropy maximization to show that a non-negative factorization of ${\mathcal M_n}$ would lead to a ${k}$-junta factorization with ${k}$ small (which we just saw is impossible).

For now, let us state the theorem we will prove. We first define a submatrix of ${\mathcal M_n}$. Fix some integer ${m \geq 1}$ and a function ${g : \{0,1\}^m \rightarrow \mathbb R_+}$. Now define the matrix ${M_n^g : {[n] \choose m} \times \{0,1\}^n \rightarrow \mathbb R_+}$ given by

$\displaystyle M_n^g(S,x) = g(x|_S)\,.$

The matrix is indexed by subsets ${S \subseteq [n]}$ with ${|S|=m}$ and elements ${x \in \{0,1\}^n}$. Here, ${x|_S}$ represents the (ordered) restriction of ${x}$ to the coordinates in ${S}$.

Theorem 3 (Chan-L-Raghavendra-Steurer 2013) For every ${m \geq 1}$ and ${g : \{0,1\}^m \rightarrow \mathbb R_+}$, there is a constant ${C=C(g)}$ such that for all ${n \geq 2m}$,

$\displaystyle \mathrm{rank}_+(M_n^g) \geq C \left(\frac{n}{\log n}\right)^{\mathrm{deg}_J(g)-1}\,.$

Note that if ${g \in \mathrm{QML}^+_m}$ then ${M_n^g}$ is a submatrix of ${\mathcal M_n}$. Since Theorem 2 furnishes a sequence of quadratic multi-linear functions ${\{g_j\}}$ with ${\mathrm{deg}_J(g_j) \rightarrow \infty}$, the preceding theorem tells us that ${\mathrm{rank}_+(\mathcal M_n)}$ cannot be bounded by any polynomial in ${n}$. A more technical version of the theorem is able to achieve a lower bound of ${e^{c n^{1/3}}}$ (see Section 7 here).