A note on the large spectrum and generalized Riesz products

I wrote a short note entitled Covering the large spectrum and generalized Riesz products that simplifies and generalizes the approach of the first few posts on Chang’s Lemma and Bloom’s variant.

The approximation statement is made in the context of general probability measures on a finite set (though it should extend at least to the compact case with no issues). The algebraic structure only comes into play when the spectral covering statements are deduced (easily) from the general approximation theorem. These covering statements are also proved in the general setting of finite abelian groups.

Comments are encouraged, especially about references I may have missed.

Entropy optimality: Quantum lifts of polytopes

In these previous posts, we explored whether the cut polytope can be expressed as the linear projection of a polytope with a small number of facets (i.e., whether it has a small linear programming extended formulation).

For many cut problems, semi-definite programs (SDPs) are able to achieve better approximation ratios than LPs. The most famous example is the Goemans-Williamson {0.878} -approximation for MAX-CUT. The techniques of the previous posts (see the full paper for details) are able to show that no polynomial-size LP can achieve better than factor {1/2} .

1.1. Spectrahedral lifts

The feasible regions of LPs are polyhedra. Up to linear isomorphism, every polyhedron {P} can be represented as {P = \mathbb R_+^n \cap V} where {\mathbb R_+^n} is the positive orthant and {V \subseteq \mathbb R^n} is an affine subspace.

In this context, it makes sense to study any cones that can be optimized over efficiently. A prominent example is the positive semi-definite cone. Let us define {\mathcal S_+^n \subseteq \mathbb R^{n^2}} as the set of {n \times n} real, symmetric matrices with non-negative eigenvalues. A spectrahedron is the intersection {\mathcal S_+^n \cap V} of this cone with an affine subspace {V} . The value {n} is referred to as the dimension of the spectrahedron.

In analogy with the {\gamma} parameter we defined for polyhedral lifts, let us define {\bar \gamma_{\mathsf{sdp}}(P)} for a polytope {P} to be the minimal dimension of a spectrahedron that linearly projects to {P} . It is an exercise to show that {\bar \gamma_{\mathsf{sdp}}(P) \leq \bar \gamma(P)} for every polytope {P} . In other words, spectrahedral lifts are at least as powerful as polyhedral lifts in this model.

In fact, they are strictly more powerful. Certainly there are many examples of this in the setting of approximation (like the Goemans-Williamson SDP mentioned earlier), but there are also recent results exhibiting gaps between {\bar \gamma} and {\bar \gamma_{\mathsf{sdp}}} for explicit polytopes; see the work of Fawzi, Saunderson, and Parrilo.

Nevertheless, we are now able to prove strong lower bounds on the dimension of such lifts. Let us consider the cut polytope {\mathrm{CUT}_n} as in previous posts.

Theorem 1 (L-Raghavendra-Steurer 2015) There is a constant {c > 0} such that for every {n \geq 1} , one has {\bar \gamma_{\mathsf{sdp}}(\mathrm{CUT}_n) \geq e^{c n^{2/13}}} .

Our goal in this post and the next is to explain the proof of this theorem and how quantum entropy maximization plays a key role.

1.2. PSD rank and factorizations

Just as in the setting of polyhedra, there is a notion of “factorization through a cone” that characterizes the parameter {\bar \gamma_{\mathsf{sdp}}(P)} . Let {M \in \mathbb R^{m \times n}_+} be a non-negative matrix. One defines the psd rank of {M} as the quantity

\displaystyle  \mathrm{rank}_{\mathsf{psd}}(M) = \min \left\{ r : M_{ij} = \mathrm{Tr}(A_i B_j) \textrm{ for some } A_1, \ldots, A_m, B_1, \ldots, B_n \in \mathcal S_+^r\right\}\,.
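For concreteness, here is a small NumPy sketch of the definition (the sizes and the random matrices are arbitrary illustrative choices): any collection of PSD matrices assembles into a non-negative matrix via {M_{ij} = \mathrm{Tr}(A_i B_j)} , which is exactly the kind of certificate a rank- {r} psd factorization provides.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, r = 4, 5, 3                       # illustrative sizes

def random_psd(dim):
    G = rng.standard_normal((dim, dim))
    return G @ G.T                      # G G^T is positive semi-definite

A = [random_psd(r) for _ in range(m)]   # A_1, ..., A_m in S_+^r
B = [random_psd(r) for _ in range(n)]   # B_1, ..., B_n in S_+^r

M = np.array([[np.trace(Ai @ Bj) for Bj in B] for Ai in A])
print((M >= 0).all())                   # True: Tr(A_i B_j) >= 0 for PSD matrices,
                                        # so this M has rank_psd(M) <= r by construction
```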

The following theorem was independently proved by Fiorini, Massar, Pokutta, Tiwary, and de Wolf and Gouveia, Parrilo, and Thomas. The proof is a direct analog of Yannakakis’ proof for non-negative rank.

Theorem 2 For every polytope {P} , it holds that {\bar \gamma_{\mathsf{sdp}}(P) = \mathrm{rank}_{\mathsf{psd}}(M)} for any slack matrix {M} of {P} .

Recall the class {\mathrm{QML}_n^+} of quadratic multi-linear functions that are non-negative on {\{0,1\}^n} and the matrix {\mathcal M_n : \mathrm{QML}_n^+ \times \{0,1\}^n \rightarrow \mathbb R_+} given by

\displaystyle  \mathcal M_n(f,x) = f(x)\,.

We saw previously that {\mathcal M_n} is a submatrix of some slack matrix of {\mathrm{CUT}_n} . Thus our goal is to prove a lower bound on {\mathrm{rank}_{\mathsf{psd}}(\mathcal M_n)} .

1.3. Sum-of-squares certificates

Just as in the setting of non-negative matrix factorization, we can think of a low psd rank factorization of {\mathcal M_n} as a small set of “axioms” that can prove the non-negativity of every function in {\mathrm{QML}_n^+} . But now our proof system is considerably more powerful.

For a subspace of functions {\mathcal U \subseteq L^2(\{0,1\}^n)} , let us define the cone

\displaystyle  \mathsf{sos}(\mathcal U) = \mathrm{cone}\left(q^2 : q \in \mathcal U\right)\,.

This is the cone of squares of functions in {\mathcal U} . We will think of {\mathcal U} as a set of axioms of size {\mathrm{dim}(\mathcal U)} that is able to assert non-negativity of every {f \in \mathrm{sos}(\mathcal U)} by writing

\displaystyle  f = \sum_{i=1}^k q_i^2

for some {q_1, \ldots, q_k \in \mathcal U} .

Fix a subspace {\mathcal U} and let {r = \dim(\mathcal U)} . Fix also a basis {q_1, \ldots, q_r : \{0,1\}^n \rightarrow \mathbb R} for {\mathcal U} .

Define {B : \{0,1\}^n \rightarrow \mathcal S_+^r} by setting {B(x)_{ij} = q_i(x) q_j(x)} . Note that {B(x)} is PSD for every {x} because {B(x) = \vec q(x) \vec q(x)^T} where {\vec q(x)=(q_1(x), \ldots, q_r(x))} .

We can write every {p \in \mathcal U} as {p = \sum_{i=1}^r \lambda_i q_i} . Defining {A(p^2) \in \mathcal S_+^r} by {A(p^2)_{ij} = \lambda_i \lambda_j} , we see that

\displaystyle  \mathrm{Tr}(A(p^2) B(x)) = \sum_{i,j} \lambda_i \lambda_j q_i(x) q_j(x) = p(x)^2\,.

Now every {f \in \mathsf{sos}(\mathcal U)} can be written as {\sum_{i=1}^k c_i p_i^2} for some {k \geq 0} and {\{c_i \geq 0\}} . Therefore if we define {A(f) = \sum_{i=1}^k c_i A(p_i^2)} (which is a non-negative combination of PSD matrices), we arrive at the representation

\displaystyle  f(x) = \mathrm{Tr}(A(f) B(x))\,.
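Here is a small NumPy sanity check of this construction (a sketch; the subspace, the basis, and the particular function {f} are arbitrary illustrative choices): with {\mathcal U} the degree-1 multi-linear polynomials on {\{0,1\}^3} and {f} an explicit sum of squares from {\mathcal U} , the identity {f(x) = \mathrm{Tr}(A(f) B(x))} holds pointwise.

```python
import numpy as np
from itertools import product

# Basis q_1, ..., q_4 of U = (multi-linear polynomials of degree <= 1 on {0,1}^3):
# namely 1, x1, x2, x3.
def qvec(x):
    return np.array([1.0, x[0], x[1], x[2]])

# f = (1 - x1 - x2)^2 + (x2 - x3)^2 is a sum of squares of elements of U,
# with coefficient vectors lam1, lam2 in the basis above.
lam1 = np.array([1.0, -1.0, -1.0, 0.0])
lam2 = np.array([0.0, 0.0, 1.0, -1.0])
A_f = np.outer(lam1, lam1) + np.outer(lam2, lam2)   # A(f) = sum_i lam_i lam_i^T (PSD)

for x in product([0, 1], repeat=3):
    B_x = np.outer(qvec(x), qvec(x))                # B(x) = q(x) q(x)^T (PSD)
    f_x = (1 - x[0] - x[1]) ** 2 + (x[1] - x[2]) ** 2
    assert np.isclose(np.trace(A_f @ B_x), f_x)     # f(x) = Tr(A(f) B(x))
print("f(x) = Tr(A(f) B(x)) holds at all 8 points")
```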

In conclusion, if {\mathrm{QML}_n^+ \subseteq \mathsf{sos}(\mathcal U)} , then {\mathrm{rank}_{\mathsf{psd}}(\mathcal M_n) \leq \dim(\mathcal U)} .

By a “purification” argument, one can also prove a converse: there is a subspace {\mathcal U} with {\mathrm{QML}_n^+ \subseteq \mathsf{sos}(\mathcal U)} and {\dim(\mathcal U) \leq \mathrm{rank}_{\mathsf{psd}}(\mathcal M_n)^2} .

1.4. The canonical axioms

And just as {d} -juntas were the canonical axioms for our NMF proof system, there is a similar canonical family in the SDP setting: Let {\mathcal Q_d} be the subspace of all degree-{d} multi-linear polynomials on {\mathbb R^n} . We have

\displaystyle  \dim(\mathcal Q_d) \leq \sum_{k=0}^d {n \choose k} \leq 1+n^d\,. \ \ \ \ \ (1)

For a function {f : \{0,1\}^n \rightarrow \mathbb R_+} , one defines

\displaystyle  \deg_{\mathsf{sos}}(f) = \min \{d : f \in \mathsf{sos}(\mathcal Q_{d}) \}\,.

(One could debate whether the definition of sum-of-squares degree should have {d/2} or {d} . The most convincing arguments suggest that we should use membership in the cone of squares over {\mathcal Q_{\lfloor d/2\rfloor}} so that the sos-degree will be at least the real-degree of the function.)

On the other hand, our choice has the following nice property.

Lemma 3 For every {f : \{0,1\}^n \rightarrow \mathbb R_+} , we have {\deg_{\mathrm{sos}}(f) \leq \deg_J(f)} .

Proof: If {q} is a non-negative {d} -junta, then {\sqrt{q}} is also a non-negative {d} -junta. Every {d} -junta is a multi-linear polynomial of degree at most {d} , thus {q = (\sqrt{q})^2} is the square of a polynomial of degree at most {d} , i.e., {q \in \mathsf{sos}(\mathcal Q_d)} . Since any {f} with {\deg_J(f) \leq d} is a conic combination of non-negative {d} -juntas, it follows that {f \in \mathsf{sos}(\mathcal Q_d)} . \Box

1.5. The canonical tests

As with junta-degree, there is a simple characterization of sos-degree in terms of separating functionals. Say that a functional {\varphi : \{0,1\}^n \rightarrow \mathbb R} is degree-{d} pseudo-positive if

\displaystyle  \langle \varphi, q^2 \rangle = \mathop{\mathbb E}_{x \in \{0,1\}^n} \varphi(x) q(x)^2 \geq 0

whenever {q : \{0,1\}^n \rightarrow \mathbb R} satisfies {\deg(q) \leq d} (and by {\deg} here, we mean degree as a multi-linear polynomial on {\{0,1\}^n} ).

Again, since {\mathsf{sos}(\mathcal Q_d)} is a closed convex cone, non-membership can always be certified by a separating linear functional. The following characterization is elementary.

Lemma 4 For every {f : \{0,1\}^n \rightarrow \mathbb R_+} , it holds that {\deg_{\mathsf{sos}}(f) > d} if and only if there is a degree-{d} pseudo-positive functional {\varphi : \{0,1\}^n \rightarrow \mathbb R} such that {\langle \varphi,f\rangle < 0} .

1.6. The connection to psd rank

Following the analogy with non-negative rank, we have two objectives left: (i) to exhibit a function {f \in \mathrm{QML}_n^+} with {\deg_{\mathsf{sos}}(f)} large, and (ii) to give a connection between the sum-of-squares degree of {f} and the psd rank of an associated matrix.

Notice that the function {f(x)=(1-\sum_{i=1}^n x_i)^2} we used for junta-degree has {\deg_{\mathsf{sos}}(f)=1} , making it a poor candidate. Fortunately, Grigoriev has shown that the knapsack polynomial has large sos-degree.

Theorem 5 For every odd {m \geq 1} , the function

\displaystyle  f(x) = \left(\frac{m}{2} - \sum_{i=1}^m x_i\right)^2 - \frac14

has {\deg_{\mathsf{sos}}(f) \geq \lceil m/2\rceil} .

Observe that this {f} is non-negative over {\{0,1\}^m} (because {m} is odd), but it is manifestly not non-negative on {\mathbb R^m} .
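For intuition, the smallest interesting case {m=3} can be checked by brute force (a sketch; the explicit functional below is one standard choice of a degree-2 knapsack pseudo-expectation written as a function on {\{0,1\}^3} , not taken from Grigoriev's proof). By Lemma 4, exhibiting a degree-1 pseudo-positive {\varphi} with {\langle \varphi,f\rangle < 0} certifies {\deg_{\mathsf{sos}}(f) \geq 2 = \lceil 3/2\rceil} .

```python
import numpy as np
from itertools import product

m = 3
pts = list(product([0, 1], repeat=m))
f = lambda x: (m / 2 - sum(x)) ** 2 - 0.25           # the knapsack function for m = 3

# A candidate functional depending only on the Hamming weight |x| (illustrative choice).
weight_vals = {0: -1.0, 1: 2.0, 2: 1.0, 3: 0.0}
phi = lambda x: weight_vals[sum(x)]

print(np.mean([phi(x) * f(x) for x in pts]))         # <phi, f> = -0.25 < 0

# phi is degree-1 pseudo-positive iff E_x phi(x) q(x)^2 >= 0 for all affine q,
# i.e. iff the moment matrix E_x phi(x) v(x) v(x)^T is PSD, where v(x) = (1, x1, x2, x3).
v = lambda x: np.array([1.0, *x])
Mom = np.mean([phi(x) * np.outer(v(x), v(x)) for x in pts], axis=0)
print(np.linalg.eigvalsh(Mom).min())                 # ~0 up to roundoff, so Mom is PSD
```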

Finally, we recall the submatrices of {\mathcal M_n} defined as follows. Fix some integer {m \geq 1} and a function {g : \{0,1\}^m \rightarrow \mathbb R_+} . Then {M_n^g : {[n] \choose m} \times \{0,1\}^n \rightarrow \mathbb R_+} is given by

\displaystyle  M_n^g(S,x) = g(x|_S)\,.

In the next post, we discuss the proof of the following theorem.

Theorem 6 (L-Raghavendra-Steurer 2015) For every {m \geq 1} and {g : \{0,1\}^m \rightarrow \mathbb R_+} , there exists a constant {C(g)} such that the following holds. For every {n \geq 2m} ,

\displaystyle  1+n^{d} \geq \mathrm{rank}_{\mathsf{psd}}(M_n^g) \geq C(g) \left(\frac{n}{\log n}\right)^{(d-1)/2}\,,

where {d=\deg_{\mathsf{sos}}(g)} .

Note that the upper bound is from (1) and the non-trivial content is contained in the lower bound. As before, in conjunction with Theorem 5, this shows that {\mathrm{rank}_{\mathsf{psd}}(\mathcal M_n)} cannot be bounded by any polynomial in {n} and thus the same holds for {\bar \gamma_{\mathsf{sdp}}(\mathrm{CUT}_n)} .

Entropy optimality: Non-negative rank lower bounds

Using the notation from the last post, our goal is now to prove the following theorem.

Theorem 1 For every {m \geq 1} and {g : \{0,1\}^m \rightarrow \mathbb R_+} , there is a constant {C=C(g)} such that for all {n \geq 2m} ,

\displaystyle  \mathrm{rank}_+(M_n^g) \geq C \left(\frac{n}{\log n}\right)^{\mathrm{deg}_J(g)-1}\,.

1.1. Convex relaxations of non-negative rank

Before getting to the proof, let us discuss the situation in somewhat more generality. Consider finite sets {X} and {Y} and a matrix {M : X \times Y \rightarrow \mathbb R_+} with {r=\mathrm{rank}_+(M)} .

In order to use entropy-maximization, we would like to define a convex set of low non-negative rank factorizations (so that maximizing entropy over this set will give us a “simple” factorization). But the convex hull of {\{ N \in \mathbb R_+^{X \times Y} : \mathrm{rank}_+(N) = 1 \}} is precisely the set of all non-negative matrices.

Instead, let us proceed analytically. For simplicity, let us equip both {X} and {Y} with the uniform measure. Let {\mathcal Q = \{ b : Y \rightarrow \mathbb R_+ \mid \|b\|_1 = 1\}} denote the set of probability densities on {Y} . Now define

\displaystyle  \alpha_+(N) = \min \Big\{ \alpha : \exists A \in \mathbb R_+^{X \times k}, B \in \mathbb R_+^{k \times Y} \textrm{ with } N=AB, \textrm{ and}

\displaystyle  \qquad\qquad\qquad\qquad\qquad\{B_1, \ldots, B_k\} \subseteq \mathcal Q, \textrm{ and }

\displaystyle  \hphantom{\{B_1, \ldots, B_k\} \subseteq \mathcal Q,} \max_{i \in [k]} \|B_i\|_{\infty} \leq \alpha, \sum_{i=1}^k \|A^{(i)}\|_{\infty} \leq \alpha\Big\}.

Here {\{A^{(i)}\}} are the columns of {A} and {\{B_i\}} are the rows of {B} . Note that now {k} is unconstrained.

Observe that {\alpha_+} is a convex function. To see this, given a pair {N=AB} and {N'=A'B'} , write

\displaystyle  \frac{N+N'}{2} = \left(\begin{array}{cc} \frac12 A & \frac12 A' \end{array}\right) \left(\begin{array}{l} \vphantom{\bigoplus} B \\ \vphantom{\bigoplus} B' \end{array}\right)\,,

witnessing the fact that {\alpha_+(\frac12(N+N')) \leq \frac12 \alpha_+(N) + \frac12 \alpha_+(N')} .

1.2. A truncation argument

So the set {\{ N : \alpha_+(N) \leq c \}} is convex, but it’s not yet clear how this relates to {\mathrm{rank}_+} . We will see now that low non-negative rank matrices are close to matrices with {\alpha_+} small. In standard communication complexity/discrepancy arguments, this corresponds to discarding “small rectangles.”

In the following lemma, we use the norms {\|M\|_1 = \mathbb E_{x,y} |M_{x,y}|} and {\|M\|_{\infty} = \max_{x,y} |M_{x,y}|} .

Lemma 2 For every non-negative {M \in \mathbb R_+^{X \times Y}} with {\mathrm{rank}_+(M) \leq r} and every {\delta \in (0,1)} , there is a matrix {\tilde M \in \mathbb R_+^{X \times Y}} such that

\displaystyle  \|M-\tilde M\|_1 \leq \delta

and

\displaystyle  \alpha_+(\tilde M) \leq \frac{r}{\delta} \|M\|_{\infty}\,.

Proof: Suppose that {M=AB} with {A \in \mathbb R_+^{X \times r}, B \in \mathbb R_+^{r \times Y}} , and let us interpret this factorization in the form

\displaystyle  M(x,y) = \sum_{i=1}^r A_i(x) B_i(y) \ \ \ \ \ (1)

(where {\{A_i\}} are the columns of {A} and {\{B_i\}} are the rows of {B} ). By rescaling the columns of {A} and the rows of {B} , respectively, we may assume that {\|B_i\|_1=1} for every {i \in [r]} , i.e., that each {B_i} is a density.

Let {\Lambda = \{ i : \|B_i\|_{\infty} > \tau \}} denote the “bad set” of indices (we will choose {\tau} momentarily). Observe that if {i \in \Lambda} , then

\displaystyle  \|A_i\|_{\infty} \leq \frac{\|M\|_{\infty}}{\tau}\,,

from the representation (1) and the fact that all summands are non-negative: if {B_i(y) > \tau} for some {y} , then {A_i(x) \leq M(x,y)/\tau \leq \|M\|_{\infty}/\tau} for every {x} .

Define the matrix {\tilde M(x,y) = \sum_{i \notin \Lambda} A_i(x) B_i(y)} . It follows that

\displaystyle  \|M-\tilde M\|_1 = \mathop{\mathbb E}_{x,y} \left[|M(x,y)-\tilde M(x,y)|\right] = \sum_{i \in \Lambda} \mathop{\mathbb E}_{x,y} [A_i(x) B_i(y)]\,.

Each of the latter terms is at most {\|A_i\|_{\infty} \|B_i\|_1 \leq \frac{\|M\|_{\infty}}{\tau}} and {|\Lambda| \leq r} , thus

\displaystyle  \|M-\tilde M\|_1 \leq r \frac{\|M\|_{\infty}}{\tau}\,.

Next, observe that

\displaystyle  \mathop{\mathbb E}_y [M(x,y)] = \sum_{i=1}^r A_i(x) \|B_i\|_1 = \sum_{i=1}^r A_i(x)\,,

implying that {\sum_{i=1}^r A_i(x) \leq \|M\|_{\infty}} for every {x} ; in particular, {\|A_i\|_{\infty} \leq \|M\|_{\infty}} and thus {\sum_{i=1}^r \|A_i\|_{\infty} \leq r \|M\|_{\infty}} .

Setting {\tau = r \|M\|_{\infty}/\delta} yields the statement of the lemma. \Box
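The proof is constructive and easy to simulate (a NumPy sketch; the random factorization and the parameter values are only illustrative): rescale the rows of {B} to densities, discard the indices in the bad set, and check the two bounds of Lemma 2.

```python
import numpy as np

rng = np.random.default_rng(2)
X, Y, r, delta = 50, 60, 8, 0.1
A = rng.random((X, r))
B = rng.random((r, Y))
M = A @ B

# Rescale so that each row of B is a density with respect to the uniform measure on Y.
scale = B.mean(axis=1)
B = B / scale[:, None]
A = A * scale[None, :]

tau = r * M.max() / delta
good = B.max(axis=1) <= tau                          # complement of the bad set Lambda
M_tilde = A[:, good] @ B[good, :]

print(np.abs(M - M_tilde).mean() <= delta)           # ||M - M_tilde||_1 <= delta
print(B[good].max() <= tau)                          # surviving rows obey ||B_i||_inf <= tau
print(A[:, good].max(axis=0).sum() <= r * M.max())   # sum of column sup-norms <= r ||M||_inf
```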

Generally, the ratio {\frac{\|M\|_{\infty}}{\|M\|_1}} will be small compared to {r} (e.g., polynomial in {n} vs. super-polynomial in {n} ). Thus from now on, we will actually prove a lower bound on {\alpha_+(M)} . One has to verify that the proof is robust enough to allow for the level of error inherent in Lemma 2.

1.3. The test functionals

Now we have a convex body of low “analytic non-negative rank” matrices. Returning to Theorem 1 and the matrix {M_n^g} , we will now assume that {\alpha_+(M_n^g) \leq \alpha} . Next we identify the proper family of test functionals that highlight the difficulty of factoring the matrix {M_n^g} . We will consider the uniform measures on {{[n] \choose m}} and {\{0,1\}^n} . We use {\mathop{\mathbb E}_S} and {\mathop{\mathbb E}_x} to denote averaging with respect to these measures.

Let {d=\deg_J(g)-1} . From the last post, we know there exists a {d} -locally positive functional {\varphi : \{0,1\}^m \rightarrow \mathbb R} such that {\beta := \mathop{\mathbb E}_y\, \varphi(y) g(y) < 0} , and {\mathop{\mathbb E}_y\, \varphi(y) q(y) \geq 0} for every non-negative {d} -junta {q : \{0,1\}^m \rightarrow \mathbb R_+} .

For {S \subseteq [n]} with {|S|=m} , let us denote {\varphi_S(x) = \varphi(x|_S)} . These functionals prove lower bounds on the junta-degree of {g} restricted to various subsets of the coordinates. If we expect that junta-factorizations are the “best” of a given rank, then we have some confidence in choosing the family {\{\varphi_S\}} as our test functions.

1.4. Entropy maximization

Use {\alpha_+(M_n^g) \leq \alpha} to write

\displaystyle  M_n^g(S,x) = \sum_{i=1}^k A_i(S) B_i(x)\,,

where {A_i, B_i \geq 0} and we have {\|B_i\|_1=1} and {\|B_i\|_{\infty} \leq \alpha} for all {i \in [k]} , and finally {\sum_{i=1}^k \|A_i\|_{\infty} \leq \alpha} .

First, as we observed last time, if each {B_i} were a {d} -junta, we would have a contradiction:

\displaystyle  \mathop{\mathbb E}_{S,x} \left[\varphi_S(x) M_n^g(S,x)\right] = \mathop{\mathbb E}_{y \in \{0,1\}^m} \varphi(y) g(y) = \beta < 0\,, \ \ \ \ \ (2)

and yet

\displaystyle  \mathop{\mathbb E}_{S,x} \left[\varphi_S(x) \sum_{i=1}^k A_i(S) B_i(x)\right] = \sum_{i=1}^k \mathop{\mathbb E}_S A_i(S) \mathop{\mathbb E}_x \left[\varphi_S(x) B_i(x)\right] \geq 0 \ \ \ \ \ (3)

because {\mathop{\mathbb E}_x \left[\varphi_S(x) B_i(x)\right] = \mathop{\mathbb E}_{y \in \{0,1\}^S} \varphi(y) \mathop{\mathbb E}_{x} \left[B_i(x) \,\big|\, x|_S =y\right] \geq 0} since {\varphi} is {d} -locally positive and the function {y \mapsto \mathop{\mathbb E}_{x} \left[B_i(x) \,\big|\, x|_S =y\right]} is a {d} -junta.

So now the key step: Use entropy maximization to approximate {B_i} by a junta! In future posts, we will need to consider the entire package {\{B_1, \ldots, B_k\}} of functions simultaneously. But for the present lower bound, it suffices to consider each {B_i} separately.

Consider the following optimization over variables {\{\tilde B_i(x) : x \in \{0,1\}^n\}} :

\displaystyle  \mathrm{minimize }\qquad \mathrm{Ent}(\tilde B_i) = \mathbb E [\tilde B_i \log \tilde B_i]

\displaystyle  \qquad\textrm{ subject to } \qquad \qquad\mathbb E[\tilde B_i]=1, \quad \tilde B_i(x) \geq 0 \quad \forall x \in \{0,1\}^n

\displaystyle  \hphantom{\bigoplus} \qquad \qquad \qquad \qquad \mathop{\mathbb E}_x \left[\varphi_S(x) \tilde B_i(x)\right] \leq \mathop{\mathbb E}_x \left[\varphi_S(x) B_i(x)\right] + \varepsilon \qquad \forall |S|=m\,. \ \ \ \ \ (4)

The next claim follows immediately from Theorem 1 in this post (solving the max-entropy optimization by sub-gradient descent).

Claim 1 There exists a function {\tilde B_i : \{0,1\}^n \rightarrow \mathbb R_+} satisfying all the preceding constraints and of the form

\displaystyle  \tilde B_i = \frac{\exp\left(\sum_{j=1}^{M} \lambda_j \varphi_{S_j}\right)}{\mathop{\mathbb E} \exp\left(\sum_{j=1}^{M} \lambda_j \varphi_{S_j}\right)}

such that

\displaystyle  M \leq C(g) \frac{\log \alpha}{\varepsilon^2}\,,

where {C(g)} is some constant depending only on {g} .
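Since Claim 1 is about an algorithm, here is a minimal sketch of the kind of procedure it refers to (not the exact statement from the earlier post): a greedy sub-gradient / multiplicative-weights loop that starts from the uniform density and repeatedly reweights by the most violated test functional until all constraints of (4) hold. The test functionals and the target density below are arbitrary stand-ins chosen only so that the code runs.

```python
import numpy as np

rng = np.random.default_rng(3)
N, num_tests, eps = 2 ** 10, 40, 0.05

# Stand-ins: bounded test functionals (playing the role of the varphi_S) and a
# target density B (playing the role of B_i) correlated with a few of them.
phis = rng.choice([-1.0, 1.0], size=(num_tests, N))
B = np.exp(-0.8 * (phis[0] + phis[1]))
B /= B.mean()

B_tilde = np.ones(N)                             # start from the uniform density
for step in range(10_000):
    viol = phis @ (B_tilde - B) / N              # E[phi_j B_tilde] - E[phi_j B]
    j = viol.argmax()
    if viol[j] <= eps:                           # every constraint of (4) is satisfied
        break
    B_tilde *= np.exp(-(eps / 2) * phis[j])      # reweight by the violated functional
    B_tilde /= B_tilde.mean()                    # renormalize to a density

print("iterations:", step, " max violation:", round(float(viol.max()), 4))
# B_tilde has the form exp(sum_j lambda_j phi_j) / normalization with few terms.
```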

Note that {\varphi} depends only on {g} , and thus {\|\varphi_S\|_{\infty}} only depends on {g} as well. Now each {\varphi_{S_j}} only depends on {m} variables (those in {S_j} and {|S_j|=m} ), meaning that our approximator {\tilde B_i} is an {h} -junta for

\displaystyle  h \leq m \cdot C(g) \frac{\log \alpha}{\varepsilon^2}\,. \ \ \ \ \ (5)

Oops. That doesn’t seem very good. The calculation in (3) needs that {\tilde B_i} is a {d} -junta, and certainly {d < m} (since {g} is a function on {\{0,1\}^m} ). Nevertheless, note that the approximator is a non-trivial junta. For instance, if {\alpha \leq 2^{\sqrt{n}}} , then it is an {O(\sqrt{n})} -junta, recalling that {m} is a constant (depending on {g} ).

1.5. Random restriction saves the day

Let’s try to apply the logic of (3) to the {\tilde B_i} approximators anyway. Fix some {i \in [k]} and let {J_i} be the set of coordinates on which {\tilde B_i} depends. Then:

\displaystyle  \mathop{\mathbb E}_{S,x} \left[\varphi_S(x) A_i(S) \tilde B_i(x)\right] = \mathop{\mathbb E}_S\,A_i(S) \mathop{\mathbb E}_{y \in \{0,1\}^S} \varphi(y) \mathop{\mathbb E}_{x \in \{0,1\}^n} \left[\tilde B_i(x) \,\big|\, x|_S=y\right]\,.

Note that the map {y \mapsto \mathop{\mathbb E}_{x \in \{0,1\}^n} \left[\tilde B_i(x) \,\big|\, x|_S=y\right]} is a junta on the coordinates in {J_i \cap S} . Thus if {|J_i \cap S| \leq d} , then the contribution from this term is non-negative since {\varphi} is {d} -locally positive. But {|S|=m} is fixed and {n} is growing, thus {|J_i \cap S| > d} is quite rare! Formally,

\displaystyle  \mathop{\mathbb E}_{S,x} \left[\varphi_S(x) A_i(S) \tilde B_i(x)\right] \geq - \|A_i\|_{\infty}\, \mathbb P_S[|J_i \cap S| > d] \geq - \|A_i\|_{\infty} \frac{h^d (2m)^d}{n^d} \,.

In the last estimate, we have used a simple union bound and {n \geq 2m} (we have also suppressed a multiplicative factor of {\|\varphi\|_{\infty}} , which is a constant depending only on {g} ).

Putting everything together and summing over {i \in [k]} , we conclude that

\displaystyle  \sum_{i=1}^k \mathop{\mathbb E}_{S,x} \left[\varphi_S(x) A_i(S) \tilde B_i(x)\right] \geq -\alpha \frac{h^d (2m)^d}{n^d}\,.

Note that by choosing {n} only moderately large, we will make this error term very small.

Moreover, since each {\tilde B_i} is a feasible point of the optimization (4), we have

\displaystyle  \sum_{i=1}^k \mathop{\mathbb E}_{S,x} \left[\varphi_S(x) A_i(S) B_i(x)\right] = \sum_{i=1}^k \mathop{\mathbb E}_S A_i(S) \mathop{\mathbb E}_x \left[\varphi_S(x) B_i(x)\right]

\displaystyle  \hphantom{\sum_{i=1}^k \mathop{\mathbb E}_{S,x} \left[\varphi_S(x) A_i(S) B_i(x)\right]} \geq \sum_{i=1}^k \mathop{\mathbb E}_S A_i(S) \left(-\varepsilon + \mathop{\mathbb E}_x \left[\varphi_S(x) \tilde B_i(x)\right]\right)

\displaystyle  \hphantom{\sum_{i=1}^k \mathop{\mathbb E}_{S,x} \left[\varphi_S(x) A_i(S) B_i(x)\right]} \geq -\varepsilon \sum_{i=1}^k \|A_i\|_1 - \alpha \frac{h^d (2m)^d}{n^d}\,.

Almost there: Now observe that

\displaystyle  \|g\|_1 = \mathop{\mathbb E}_{S,x} [M_n^g(S,x)] = \sum_{i=1}^k \|A_i\|_1 \|B_i\|_1 = \sum_{i=1}^k \|A_i\|_1\,.

Plugging this into the preceding line yields

\displaystyle  \sum_{i=1}^k \mathop{\mathbb E}_{S,x} \left[\varphi_S(x) A_i(S) B_i(x)\right] \geq -\varepsilon \|g\|_1 - \alpha \frac{h^d (2m)^d}{n^d}\,.

Recalling now (2), we have derived a contradiction to {\alpha_+(M_n^g) \leq \alpha} whenever the right-hand side above is larger than {\beta} (which is a negative constant depending only on {g} ). Setting {\varepsilon = -\beta/(2 \|g\|_1)} , we consult (5) to see that

\displaystyle  h \leq C'(g) m \log \alpha

for some other constant {C'(g)} depending only on {g} . We thus arrive at a contradiction if {\alpha = o((n/\log n)^d)} , recalling that {m, d} depend only on {g} . This completes the argument.

Entropy optimality: Lifts of polytopes

Recall from the previous post that {\mathrm{CUT}_n} denotes the cut polytope on the {n} -vertex complete graph, and {\bar \gamma(\mathrm{CUT}_n)} is the smallest number of facets in any polytope that linearly projects to {\mathrm{CUT}_n} . Our goal is to prove a lower bound on this quantity, but first we should mention that a nearly tight lower bound is known.

Theorem 1 (Fiorini, Massar, Pokutta, Tiwary, de Wolf 2012) There is a constant {c > 0} such that for every {n \geq 1} , {\bar \gamma(\mathrm{CUT}_n) \geq e^{cn}} .

We will present a somewhat weaker lower bound using entropy maximization, following our joint works with Chan, Raghavendra, and Steurer (2013) and with Raghavendra and Steurer (2015). This method is only currently capable of proving that {\bar \gamma(\mathrm{CUT}_n) \geq e^{cn^{1/3}}} , but it has the advantage of being generalizable—it extends well to the setting of approximate lifts and spectrahedral lifts (we’ll come to the latter in a few posts).

1.1. The entropy maximization framework

To use entropy optimality, we proceed as follows. For the sake of contradiction, we assume that {\bar \gamma(\mathrm{CUT}_n)} is small.

First, we will identify the space of potential lifts of small {\gamma} -value with the elements of a convex set of probability measures. (This is where the connection to non-negative matrix factorization (NMF) will come into play.) Then we will choose a family of “tests” intended to capture the difficult aspects of being a valid lift of {\mathrm{CUT}_n} . This step is important as having more freedom (corresponding to weaker tests) will allow the entropy maximization to do more “simplification.” The idea is that the family of tests should nevertheless remain powerful enough that we can rule out the entropy-optimal hypothesis, and with it the assumed small lift.

Finally, we will maximize the entropy of our lift over all elements of our convex set, subject to performing well on the tests. Our hope is that the resulting lift is simple enough that we can prove it couldn’t possibly pass all the tests, leading to a contradiction.

In order to find the right set of tests, we will identify a family of canonical (approximate) lifts. This is a family {\{P_k\}} of polytopes with {\gamma(P_k) \leq O(n^k)} which we expect to give the “best approximation” to {\mathrm{CUT}_n} among all lifts of similar {\gamma} -value. We can identify this family precisely because these will be lifts that obey the natural symmetries of the cut polytope (observe that the symmetric group {S_n} acts on {\mathrm{CUT}_n} in the natural way).

1.2. NMF and positivity certificates

Recall the matrix {\mathcal M_n : \mathrm{QML}^+_n \times \{0,1\}^n \rightarrow \mathbb R_+} given by {\mathcal M_n(f,x)=f(x)} , where {\mathrm{QML}^+_n} is the set of all quadratic multi-linear functions that are non-negative on {\{0,1\}^n} . In the previous post, we argued that {\bar \gamma(\mathrm{CUT}_{n+1}) \geq \mathrm{rank}_+(\mathcal M_n)} .

If {r=\mathrm{rank}_+(\mathcal M_n)} , it means we can write

\displaystyle  f(x) = \mathcal M_n(f,x) = \sum_{i=1}^r A_i(f) B_i(x) \ \ \ \ \ (1)

for some functions {A_i : \mathrm{QML}^+_n \rightarrow \mathbb R_+} and {B_i : \{0,1\}^n \rightarrow \mathbb R_+} . (Here we are using a factorization {\mathcal M_n = AB} where {A_{f,i} = A_i(f)} and {B_{i,x}=B_i(x)} .)

Thus the low-rank factorization gives us a “proof system” for {\mathrm{QML}^+_n} . Every {f \in \mathrm{QML}^+_n} can be written as a conic combination of the functions {B_1, B_2, \ldots, B_r} , thereby certifying its positivity (since the {B_i} ‘s are positive functions).

If we think about natural families {\mathcal B = \{B_i\}} of “axioms,” then since {\mathrm{QML}^+_n} is invariant under the natural action of {S_n} , we might expect that our family {\mathcal B} should share this invariance. Once we entertain this expectation, there are natural small families of axioms to consider: The space of non-negative {k} -juntas for some {k \ll n} .

A {k} -junta {b : \{0,1\}^n \rightarrow \mathbb R} is a function whose value only depends on {k} of its input coordinates. For a subset {S \subseteq \{1,\ldots,n\}} with {|S|=k} and an element {z \in \{0,1\}^k} , let {q_{S,z} : \{0,1\}^n \rightarrow \{0,1\}} denote the function given by {q_{S,z}(x)=1} if and only if {x|_S=z} .

We let {\mathcal J_k = \{ q_{S,z} : |S| \leq k, z \in \{0,1\}^{|S|} \}} . Observe that {|\mathcal J_k| \leq O(n^k)} . Let us also define {\mathrm{cone}(\mathcal J_k)} as the set of all conic combinations of functions in {\mathcal J_k} . It is easy to see that {\mathrm{cone}(\mathcal J_k)} contains precisely the conic combinations of non-negative {k} -juntas.

If it were true that {\mathrm{QML}^+_n \subseteq \mathrm{cone}(\mathcal J_k)} for some {k} , we could immediately conclude that {\mathrm{rank}_+(\mathcal M_n) \leq |\mathcal J_k| \leq O(n^k)} by writing {\mathcal M_n} in the form (1) where now {\{B_i\}} ranges over the elements of {\mathcal J_k} and {\{A_i(f)\}} gives the corresponding non-negative coefficients witnessing {f \in \mathrm{cone}(\mathcal J_k)} .

1.3. No {k} -junta factorization for {k \leq n/2}

We will now see that juntas cannot provide a small set of axioms for {\mathrm{QML}^+_n} .

Theorem 2 Consider the function {f : \{0,1\}^n \rightarrow \mathbb R_+} given by {f(x) = \left(1-\sum_{i=1}^n x_i\right)^2} . Then {f \notin \mathrm{cone}(\mathcal J_{\lfloor n/2\rfloor})} .

Toward the proof, let’s introduce a few definitions. First, for {f : \{0,1\}^n \rightarrow \mathbb R_+} , define the junta degree of {f} to be

\displaystyle  \deg_J(f) = \min \{ k : f \in \mathrm{cone}(\mathcal J_k) \}\,.

Since every {f} is an {n} -junta, we have {\deg_J(f) \leq n} .

Now because {\{ f : \deg_J(f) \leq k \}} is a cone, there is a universal way of proving that {\deg_J(f) > k} . Say that a functional {\varphi : \{0,1\}^n \rightarrow \mathbb R} is {k} -locally positive if for all {|S| \leq k} and {z \in \{0,1\}^{|S|}} , we have

\displaystyle \sum_{x \in \{0,1\}^n} \varphi(x) q_{S,z}(x) \geq 0\,.

These are precisely the linear functionals separating a function {f} from {\mathrm{cone}(\mathcal J_k)} : We have {\mathrm{deg}_J(f) > k} if and only if there is a {k} -locally positive functional {\varphi} such that {\sum_{x \in \{0,1\}^n} \varphi(x) f(x) < 0} . Now we are ready to prove Theorem 2.

Proof: We will prove this using an appropriate {k} -locally positive functional. Define

\displaystyle  \varphi(x) = \begin{cases} -1 & |x| = 0 \\ \frac{2}{n} & |x|=1 \\ 0 & |x| > 1\,, \end{cases}

where {|x|} denotes the hamming weight of {x \in \{0,1\}^n} .

First, observe that

\displaystyle \sum_{x \in \{0,1\}^n} \varphi(x) = -1 + n \cdot \frac{2}{n} = 1\,.

Now recall the function {f} from the statement of the theorem and observe that by opening up the square, we have

\displaystyle  \sum_{x \in \{0,1\}^n} \varphi(x) f(x) = \sum_{x \in \{0,1\}^n} \varphi(x) \left(1-2 \sum_i x_i + \sum_i x_i^2 + 2\sum_{i < j} x_i x_j\right)\ \ \ \ \ (2)

\displaystyle  = \sum_{x \in \{0,1\}^n} \varphi(x) \left(1-\sum_i x_i\right) = -1\,.

Here, the second equality uses {x_i^2 = x_i} on {\{0,1\}^n} together with the fact that {\varphi} vanishes whenever {|x| \geq 2} , so the terms {x_i x_j} do not contribute.

Finally, consider some {S \subseteq \{1,\ldots,n\}} with {|S|=k \leq n/2} and {z \in \{0,1\}^k} . If {z=\mathbf{0}} , then

\displaystyle  \sum_{x \in \{0,1\}^n} \varphi(x) q_{S,z}(x) = -1 + \frac{2}{n} \left(n-k\right) \geq 0\,.

If {|z| > 1} , then the sum is 0. If {|z|=1} , then the sum is non-negative because in that case {q_{S,z}} is only supported on non-negative values of {\varphi} . We conclude that {\varphi} is {k} -locally positive for {k \leq n/2} . Combined with (2), this yields the statement of the theorem. \Box
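Since everything here is finite, the proof can also be checked by brute force for small {n} (a sketch; {n=6} is an arbitrary choice): verify that {\sum_x \varphi(x) f(x) = -1} and that {\varphi} is {k} -locally positive for every {k \leq n/2} .

```python
import numpy as np
from itertools import product, combinations

n = 6
pts = [np.array(x) for x in product([0, 1], repeat=n)]

def phi(x):
    w = x.sum()
    return -1.0 if w == 0 else (2.0 / n if w == 1 else 0.0)

f = lambda x: (1 - x.sum()) ** 2

print(sum(phi(x) * f(x) for x in pts))                 # -1.0, i.e. <phi, f> < 0

ok = True                                              # check k-local positivity, k <= n/2
for k in range(1, n // 2 + 1):
    for S in combinations(range(n), k):
        for z in product([0, 1], repeat=k):
            s = sum(phi(x) for x in pts if tuple(x[list(S)]) == z)  # sum_x phi(x) q_{S,z}(x)
            ok = ok and s >= -1e-9
print(ok)                                              # True
```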

1.4. From juntas to general factorizations

So far we have seen that we cannot achieve a low non-negative rank factorization of {\mathcal M_n} using {k} -juntas for {k \leq n/2} .

Brief aside: If one translates this back into the setting of lifts, it says that the {k} -round Sherali-Adams lift of the polytope

\displaystyle P = \left\{ x \in [0,1]^{n^2} : x_{ij}=x_{ji},\, x_{ij} \leq x_{jk} + x_{ki} \quad \forall i,j,k \in \{1,\ldots,n\}\right\}

does not capture {\mathrm{CUT}_n} for {k \leq n/2} .

In the next post, we will use entropy maximization to show that a non-negative factorization of {\mathcal M_n} would lead to a {k} -junta factorization with {k} small (which we just saw is impossible).

For now, let us state the theorem we will prove. We first define a submatrix of {\mathcal M_n} . Fix some integer {m \geq 1} and a function {g : \{0,1\}^m \rightarrow \mathbb R_+} . Now define the matrix {M_n^g : {[n] \choose m} \times \{0,1\}^n \rightarrow \mathbb R_+} given by

\displaystyle  M_n^g(S,x) = g(x|_S)\,.

The matrix is indexed by subsets {S \subseteq [n]} with {|S|=m} and elements {x \in \{0,1\}^n} . Here, {x|_S} represents the (ordered) restriction of {x} to the coordinates in {S} .

Theorem 3 (Chan-L-Raghavendra-Steurer 2013) For every {m \geq 1} and {g : \{0,1\}^m \rightarrow \mathbb R_+} , there is a constant {C=C(g)} such that for all {n \geq 2m} ,

\displaystyle  \mathrm{rank}_+(M_n^g) \geq C \left(\frac{n}{\log n}\right)^{\mathrm{deg}_J(g)-1}\,.

Note that if {g \in \mathrm{QML}^+_m} then {M_n^g} is a submatrix of {\mathcal M_n} . Since Theorem 2 furnishes a sequence of quadratic multi-linear functions {\{g_j\}} with {\mathrm{deg}_J(g_j) \rightarrow \infty} , the preceding theorem tells us that {\mathrm{rank}_+(\mathcal M_n)} cannot be bounded by any polynomial in {n} . A more technical version of the theorem is able to achieve a lower bound of {e^{c n^{1/3}}} (see Section 7 here).

Primer: Lifts of polytopes and non-negative matrix factorization

In preparation for the next post on entropy optimality, we need a little background on lifts of polytopes and non-negative matrix factorization.

1.1. Polytopes and inequalities

A {d} -dimensional convex polytope {P \subseteq \mathbb R^d} is the convex hull of a finite set of points in {\mathbb R^d} . Equivalently, it is a compact set defined by a family of linear inequalities

\displaystyle  P = \{ x \in \mathbb R^d : Ax \leq b \}

for some matrix {A \in \mathbb R^{m \times d}} and vector {b \in \mathbb R^m} .

Let us give a measure of complexity for {P} : Define {\gamma(P)} to be the smallest number {m} such that for some {C \in \mathbb R^{s \times d}, y \in \mathbb R^s, A \in \mathbb R^{m \times d}, b \in \mathbb R^m} , we have

\displaystyle  P = \{ x \in \mathbb R^d : Cx=y \textrm{ and } Ax \leq b\}\,.

In other words, this is the minimum number of inequalities needed to describe {P} . If {P} is full-dimensional, then this is precisely the number of facets of {P} (a facet is a maximal proper face of {P} ).

Thinking of {\gamma(P)} as a measure of complexity makes sense from the point of view of optimization: Interior point methods can efficiently optimize linear functions over {P} (to arbitrary accuracy) in time that is polynomial in {\gamma(P)} .

1.2. Lifts of polytopes

Many simple polytopes require a large number of inequalities to describe. For instance, the cross-polytope

\displaystyle  C_d = \{ x \in \mathbb R^d : \|x\|_1 \leq 1 \} = \{ x \in \mathbb R^d : \pm x_1 \pm x_2 \cdots \pm x_d \leq 1 \}

has {\gamma(C_d)=2^d} . On the other hand, {C_d} is the projection of the polytope

\displaystyle  Q_d = \left\{ (x,y) \in \mathbb R^{2d} : \sum_{i=1}^d y_i = 1, \,\, - y_i \leq x_i \leq y_i\,\forall i\right\}

onto the {x} coordinates, and manifestly, {\gamma(Q_d) \leq 2d} . Thus {C_d} is the (linear) shadow of a much simpler polytope in a higher dimension.

[image credit: Fiorini, Rothvoss, and Tiwary]
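A quick numerical sanity check of this projection (a sketch; the sampling scheme is ad hoc): every point of {Q_d} projects into {C_d} , and every vertex {\pm e_i} of {C_d} has a preimage in {Q_d} .

```python
import numpy as np

rng = np.random.default_rng(4)
d = 5

# Points of Q_d: y on the simplex, x with |x_i| <= y_i; their projections lie in C_d.
for _ in range(1000):
    y = rng.dirichlet(np.ones(d))                # y >= 0 and sum_i y_i = 1
    x = rng.uniform(-1, 1, size=d) * y           # -y_i <= x_i <= y_i
    assert np.abs(x).sum() <= 1 + 1e-12

# Conversely, each vertex +-e_i of C_d lifts to the point (x, y) = (+-e_i, e_i) of Q_d.
for i in range(d):
    for s in (+1, -1):
        x, y = s * np.eye(d)[i], np.eye(d)[i]
        assert np.isclose(y.sum(), 1) and (np.abs(x) <= y).all()
print("all checks passed")
```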

A polytope {Q} is called a lift of the polytope {P} if {P} is the image of {Q} under a linear projection. Again, from an optimization stand point, lifts are important: If we can optimize linear functionals over {Q} , then we can optimize linear functionals over {P} . Define now {\bar \gamma(P)} to be the minimal value of {\gamma(Q)} over all lifts {Q} of {P} . (The value {\bar \gamma(P)} is sometimes called the (linear) extension complexity of {P} .)

1.3. The permutahedron

Here is a somewhat more interesting family of examples where lifts reduce complexity. The permutahedron {\Pi_n \subseteq \mathbb R^n} is the convex hull of the vectors {(i_1, i_2, \ldots, i_n)} where {\{i_1, \ldots, i_n\} = \{1,\ldots,n \}} . It is known that {\gamma(\Pi_n)=2^n-2} .

Let {B_n \subseteq \mathbb R^{n^2}} denote the convex hull of the {n \times n} permutation matrices. The Birkhoff-von Neumann theorem tells us that {B_n} is precisely the set of doubly stochastic matrices, thus {\gamma(B_n)=n^2} (corresponding to the non-negativity constraints on each entry).

Observe that {\Pi_n} is the linear image of {B_n} under the map {A \mapsto A (1, 2, \ldots, n)^T} , i.e. we multiply a matrix {A \in B_n} on the right by the column vector {(1, 2, \ldots, n)} . Thus {B_n} is a lift of {\Pi_n} , and we conclude that {\bar \gamma(\Pi_n) \leq n^2 \ll \gamma(\Pi_n)} .
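To see the projection concretely, here is a small sketch for {n=4} : each permutation matrix maps to the corresponding permutation of {(1, 2, \ldots, n)} , and (by Birkhoff-von Neumann) any doubly stochastic matrix maps to a convex combination of those vertices, i.e., into {\Pi_n} .

```python
import numpy as np
from itertools import permutations

n = 4
v = np.arange(1, n + 1)                              # the column vector (1, 2, ..., n)

perm_mats = [np.eye(n)[list(p)] for p in permutations(range(n))]
images = {tuple(P @ v) for P in perm_mats}
print(images == {tuple(map(float, p)) for p in permutations(range(1, n + 1))})   # True

# A doubly stochastic matrix is an average of permutation matrices, so it projects
# to a convex combination of the vertices of the permutahedron.
rng = np.random.default_rng(5)
w = rng.dirichlet(np.ones(len(perm_mats)))
D = sum(wi * P for wi, P in zip(w, perm_mats))       # doubly stochastic by construction
print(np.allclose(D.sum(axis=0), 1), np.allclose(D.sum(axis=1), 1), D @ v)
```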

1.4. The cut polytope

If {P \neq NP} , there are certain combinatorial polytopes we should not be able to optimize over efficiently. A central example is the cut polytope: {\mathrm{CUT}_n \subseteq \mathbb R^{n^2}} is the convex hull of all matrices of the form {(A_S)_{ij} = |\mathbf{1}_S(i)-\mathbf{1}_S(j)|} for some subset {S \subseteq \{1,\ldots,n\}} . Here, {\mathbf{1}_S} denotes the characteristic function of {S} .

Note that the MAX-CUT problem on a graph {G=(V,E)} can be encoded in the following way: Let {W_{ij} = 1} if {\{i,j\} \in E} and {W_{ij}=0} otherwise. Then the value of the maximum cut in {G} is precisely the maximum of {\langle W, A\rangle} for {A \in \mathrm{CUT}_n} . Accordingly, we should expect that {\bar \gamma(\mathrm{CUT}_n)} cannot be bounded by any polynomial in {n} (lest we violate a basic tenet of complexity theory).

1.5. Non-negative matrix factorization

The key to understanding {\bar \gamma(\mathrm{CUT}_n)} comes from Yannakakis’ factorization theorem.

Consider a polytope {P \subseteq \mathbb R^d} and let us write it in two ways: As a convex hull of vertices

\displaystyle  P = \mathrm{conv}\left(x_1, x_2, \ldots, x_n\right)\,,

and as an intersection of half-spaces: For some {A \in \mathbb R^{m \times d}} and {b \in \mathbb R^m} ,

\displaystyle  P = \left\{ x \in \mathbb R^d : Ax \leq b \right\}\,.

Given this pair of representations, we can define the corresponding slack matrix of {P} by

\displaystyle  S_{ij} = b_i - \langle A_i, x_j \rangle \qquad i \in \{1,2,\ldots,m\}, j \in \{1,2,\ldots,n\}\,.

Here, {A_1, \ldots, A_m} denote the rows of {A} .

We need one more definition. In what follows, we will use {\mathbb R_+ = [0,\infty)} . If we have a non-negative matrix {M \in \mathbb R_+^{m \times n}} , then a rank-{r} non-negative factorization of {M} is a factorization {M = AB} where {A \in \mathbb R_+^{m \times r}} and {B \in \mathbb R_+^{r \times n}} . We then define the non-negative rank of {M} , written {\mathrm{rank}_+(M)} , to be the smallest {r} such that {M} admits a rank-{r} non-negative factorization.

Theorem 1 (Yannakakis) For every polytope {P} , it holds that {\bar \gamma(P) = \mathrm{rank}_+(S)} for any slack matrix {S} of {P} .

The key fact underlying this theorem is Farkas’ Lemma. It asserts that if {P = \{ x \in \mathbb R^d : Ax \leq b \}} , then every valid linear inequality over {P} can be written as a non-negative combination of the defining inequalities {\langle A_i, x \rangle \leq b_i} .

There is an interesting connection here to proof systems. The theorem says that we can interpret {\bar \gamma(P)} as the minimum number of axioms so that every valid linear inequality for {P} can be proved using a conic (i.e., non-negative) combination of the axioms.

1.6. Slack matrices and the correlation polytope

Thus to prove a lower bound on {\bar \gamma(\mathrm{CUT}_n)} , it suffices to find a valid set of linear inequalities for {\mathrm{CUT}_n} and prove a lower bound on the non-negative rank of the corresponding slack matrix.

Toward this end, consider the correlation polytope {\mathrm{CORR}_n \subseteq \mathbb R^{n^2}} given by

\displaystyle  \mathrm{CORR}_n = \mathrm{conv}\left(\left\{x x^T : x \in \{0,1\}^n \right\}\right)\,.

It is an exercise to see that {\mathrm{CUT}_{n+1}} and {\mathrm{CORR}_n} are linearly isomorphic.

Now we identify a particularly interesting family of valid linear inequalities for {\mathrm{CORR}_n} . (In fact, it turns out that this will also be an exhaustive list.) A quadratic multi-linear function on {\mathbb R^n} is a function {f : \mathbb R^n \rightarrow \mathbb R} of the form

\displaystyle  f(x) = a_0 + \sum_i a_{ii} x_i + \sum_{i < j} a_{ij} x_i x_j\,,

for some real numbers {a_0} and {\{a_{ij}\}} .

Suppose {f} is a quadratic multi-linear function that is also non-negative on {\{0,1\}^n} . Then “{f(x) \geq 0 \,\,\forall x \in \{0,1\}^n} ” can be encoded as a valid linear inequality on {\mathrm{CORR}_n} . To see this, write {f(x)=\langle A,xx^T\rangle + a_0} where {A=(a_{ij})} . (Note that {\langle \cdot,\cdot\rangle} is intended to be the standard inner product on {\mathbb R^{n^2}} .)

Now let {\textrm{QML}^+_n} be the set of all quadratic multi-linear functions that are non-negative on {\{0,1\}^n} , and consider the matrix (represented here as a function) {\mathcal M_n : \textrm{QML}^+_n \times \{0,1\}^n \rightarrow \mathbb R_+} given by

\displaystyle  \mathcal M_n(f,x) = f(x)\,.

Then from the above discussion, {\mathcal M_n} is a valid sub-matrix of some slack matrix of {\mathrm{CORR}_n} . To summarize, we have the following theorem.

Theorem 2 For all {n \geq 1} , it holds that {\bar \gamma(\mathrm{CUT}_{n+1}) \geq \mathrm{rank}_+(\mathcal M_n)} .

It is actually the case that {\bar \gamma(\mathrm{CUT}_{n+1}) = \mathrm{rank}_+(\mathcal M_n)} . The next post will focus on providing a lower bound on {\mathrm{rank}_+(\mathcal M_n)} .

Entropy optimality: Forster’s isotropy

In this post, we will see an example of entropy optimality applied to a determinantal measure (see, for instance, Terry Tao’s post on determinantal processes and Russ Lyons’ ICM survey). I think this is an especially fertile setting for entropy maximization, but this will be the only post in this vein for now; I hope to return to the topic later.

Our goal is to prove the following theorem of Forster.

Theorem 1 (Forster) Suppose that {x_1, x_2, \ldots, x_n \in \mathbb R^k} are unit vectors such that every subset of {k} vectors is linearly independent. Then there exists a linear mapping {A : \mathbb R^k \rightarrow \mathbb R^k} such that

\displaystyle  \sum_{i=1}^n \frac{(A x_i) (A x_i)^T}{\|A x_i\|^2} = \frac{n}{k} I\,. \ \ \ \ \ (1)

This result is surprising at first glance. If we simply wanted to map the vectors {\{x_1, \dots x_n\}} to isotropic position, we could use the matrix {A = (\sum_{i=1}^n x_i x_i^T)^{-1/2}} . But Forster’s theorem asks that the unit vectors

\displaystyle  \left\{ \frac{Ax_1}{\|A x_1\|}, \ldots, \frac{A x_n}{\|A x_n\|} \right\}

are in isotropic position. This seems to be a much trickier task.

Forster used this as a step in proving lower bounds on the sign rank of certain matrices. Forster’s proof is based on an iterative argument combined with a compactness assertion.

There is another approach based on convex programming arising in the work of Barthe on a reverse Brascamp-Lieb inequality. The relation to Forster’s theorem was observed in work of Hardt and Moitra; it is essentially the dual program to the one we construct below.

1.1. Some facts about determinants

We first recall a few preliminary facts about determinants. For any {x_1, \ldots, x_n \in \mathbb R^k} , we have the Cauchy-Binet formula

\displaystyle  \det\left(\sum_{i=1}^n x_i x_i^T\right) = \sum_{|S|=k} \det\left(\sum_{i \in S} x_i x_i^T\right)\,.

We also have a rank-one update formula for the determinant: If a matrix {A} is invertible, then

\displaystyle  \det\left(A+u u^T\right) = \det(A) \left(1+u^T A^{-1} u\right)\,.

Finally, for {k} vectors {x_1, x_2, \ldots, x_k \in \mathbb R^k} and nonnegative coefficients {c_1, c_2, \ldots, c_k \geq 0} , we have

\displaystyle  \det\left(\sum_{i=1}^k c_i \,x_i x_i^T\right) = \left(\prod_{i=1}^k c_i\right) \det\left(\sum_{i=1}^k x_i x_i^T\right)\,.

This follows because replacing {x_i} by {\sqrt{c_i}\, x_i} multiplies the {i} th column of {X} by {\sqrt{c_i}} , and hence multiplies {\det\left(XX^T\right) = \det(X)^2} by {c_i} , where {X} is the {k \times k} matrix that has the vectors {x_1, \ldots, x_k} as columns.
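These identities are easy to spot-check numerically (a sketch with random vectors; the sizes are arbitrary):

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(6)
n, k = 6, 3
X = rng.standard_normal((n, k))                      # rows are x_1, ..., x_n in R^k

full = np.linalg.det(X.T @ X)                        # det(sum_i x_i x_i^T)
cb = sum(np.linalg.det(X[list(S)].T @ X[list(S)]) for S in combinations(range(n), k))
print(np.isclose(full, cb))                          # Cauchy-Binet

A, u = X.T @ X, rng.standard_normal(k)               # rank-one update formula
print(np.isclose(np.linalg.det(A + np.outer(u, u)),
                 np.linalg.det(A) * (1 + u @ np.linalg.inv(A) @ u)))

c, Y = rng.random(k), X[:k]                          # scaling identity for k vectors
print(np.isclose(np.linalg.det((c[:, None] * Y).T @ Y),
                 np.prod(c) * np.linalg.det(Y.T @ Y)))
```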

1.2. A determinantal measure

To prove Theorem 1, we will first define a probability measure on {{[n] \choose k}} , i.e., on the {k} -subsets of {\{1,2,\ldots,n\}} by setting:

\displaystyle  D_S = \frac{\det\left(\sum_{i \in S} x_i x_i^T\right)}{\det\left(\sum_{i=1}^n x_i x_i^T\right)}\,.

The Cauchy-Binet formula is precisely the statement that {\sum_{|S|=k} D_S=1} , i.e. the collection {\{D_S\}} forms a probability distribution on {{[n] \choose k}} . How can we capture the fact that some vectors {x_1, \ldots, x_n} satisfy {\sum_{i=1}^n x_i x_i^T = \frac{n}{k} I} using only the values {D_S} ?

Using the rank-one update formula, for an invertible {k \times k} matrix {B} , we have {\frac{\partial}{\partial t}\big|_{t=0} \log \det(B+t u u^T) = \langle u, B^{-1} u\rangle} . Thus {B} is the {k \times k} identity matrix if and only if for every {u \in \mathbb R^k} ,

\displaystyle \frac{\partial}{\partial t}\Big|_{t=0} \log \det(B+t uu^T)=\|u\|^2\,.

Note also that using Cauchy-Binet,

\displaystyle  \frac{\partial}{\partial t}\Big|_{t=0} \log \det\left(\sum_{j=1}^n x_j x_j^T + t\, x_i x_i^T\right) = \frac{\sum_{S : i \in S} \frac{\partial}{\partial t}\big|_{t=0} \det\left(\sum_{j \in S} x_j x_j^T + t\, x_i x_i^T\right)} {\det\left(\sum_{j=1}^n x_j x_j^T\right)}

\displaystyle  = \frac{\sum_{S : i \in S} \frac{\partial}{\partial t}\big|_{t=0} (1+t)\det\left(\sum_{j \in S} x_j x_j^T\right)} {\det\left(\sum_{j=1}^n x_j x_j^T\right)} = \sum_{S : i \in S} D_S\,.

In particular, if {\sum_{i=1}^n x_i x_i^T = \frac{n}{k} I} , then for every {i=1,2,\ldots,n} , we have

\displaystyle  \frac{k}{n} = \frac{\partial}{\partial t}\Big|_{t=0} \log \det\left(\sum_{j=1}^n x_j x_j^T +t\, x_i x_i^T\right) = \sum_{S : i \in S} D_S\,. \ \ \ \ \ (2)
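Both facts, {\sum_{|S|=k} D_S = 1} and the marginal identity (2) under isotropy, can be verified numerically (a sketch; the random vectors are first mapped so that {\sum_i x_i x_i^T = \frac{n}{k} I} , which is all that (2) requires):

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(7)
n, k = 6, 3
X = rng.standard_normal((n, k))
w, V = np.linalg.eigh(X.T @ X)
X = X @ (V @ np.diag(w ** -0.5) @ V.T) * np.sqrt(n / k)   # now sum_i x_i x_i^T = (n/k) I

subsets = list(combinations(range(n), k))
D = np.array([np.linalg.det(X[list(S)].T @ X[list(S)]) for S in subsets])
D /= np.linalg.det(X.T @ X)

print(np.isclose(D.sum(), 1.0))                            # Cauchy-Binet: a distribution
marginals = [sum(D[j] for j, S in enumerate(subsets) if i in S) for i in range(n)]
print(np.allclose(marginals, k / n))                       # identity (2)
```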

Of course, our vectors {x_1, \ldots, x_n} likely don’t satisfy this condition (otherwise we would be done). So we will use the max-entropy philosophy to find the “simplest” perturbation of the {D_S} values that does satisfy it. The optimal solution will yield a matrix {A} satisfying (1).

1.3. Entropy maximization

Consider the following convex program with variables {\{p_S : S \subseteq [n], |S|=k\}} .

\displaystyle  \textrm{minimize} \qquad \sum_{|S|=k} p_S \log \frac{p_S}{D_S}

\displaystyle  \textrm{subject to} \qquad \sum_{S : i \in S} p_S = \frac{k}{n} \qquad \forall i=1,2,\ldots,n

\displaystyle  \sum_{|S|=k} p_S = 1

\displaystyle  p_S \geq 0 \qquad \forall |S|=k\,.

In other words, we look for a distribution on {[n] \choose k} that has minimum relative entropy with respect to {\{D_S\}} , and such that all the “one-dimensional marginals” are equal (recall (2)). Remarkably, the optimum {p^*_S} will be a determinantal measure as well.

Note that the uniform distribution on subsets of size {k} is a feasible point and the objective is finite precisely because {D_S \neq 0} for every {S} . The latter fact follows from our assumption that every subset of {k} vectors is linearly independent.

1.4. Analyzing the optimizer

By setting the gradient of the Lagrangian to zero, we see that the optimal solution has the form

\displaystyle  p_S(\lambda_1, \ldots, \lambda_n) = \frac{\exp\left(\sum_{i \in S} \lambda_i \right) D_S}{\sum_{|S'|=k} D_{S'} \exp\left(\sum_{i \in S'} \lambda_i\right)} = \frac{\det\left(\sum_{i \in S} e^{\lambda_i} x_i x_i^T\right)}{\det\left(\sum_{i=1}^n e^{\lambda_i} x_i x_i^T\right)} \ \ \ \ \ (3)

for some dual variables {(\lambda_1, \ldots, \lambda_n)} . Note that the {\{\lambda_i\}} dual variables are unconstrained because they come from equality constraints.

Let us write {U = \sum_{i=1}^n e^{\lambda_i} x_i x_i^T} . We use {p_S^*} , {U_*} , and {\{\lambda_i^*\}} to denote the values at the optimal solution. Using again the rank-one update formula for the determinant,

\displaystyle  \frac{\partial}{\partial \lambda_i} \log \det(U) = e^{\lambda_i} \langle x_i, U^{-1} x_i\rangle\,.

But just as in (2), we can also use Cauchy-Binet to calculate the derivative (from the second expression in (3)):

\displaystyle  \frac{\partial}{\partial \lambda_i} \log \det(U) = \sum_{S : i \in S} p_S\,,

where we have used the fact that {\frac{\partial}{\partial \lambda_i} p_S = p_S} if {i \in S} (and otherwise equals {0} ). We conclude that

\displaystyle  \langle x_i, U_*^{-1} x_i\rangle = e^{-\lambda^*_i} \sum_{S : i \in S} p^*_S = e^{-\lambda^*_i} \frac{k}{n}\,.

Now we can finish the proof: Let {A = U_*^{-1/2}} . Then:

\displaystyle  \sum_{i=1}^n \frac{(A x_i) (A x_i)^T}{\|A x_i\|^2} = U_*^{-1/2} \sum_{i=1}^n \frac{x_i x_i^T}{\langle x_i, U_*^{-1} x_i\rangle} U_*^{-1/2}

\displaystyle = U_*^{-1/2} \sum_{i=1}^n \frac{n}{k} e^{\lambda_i^*} x_i x_i^T U_*^{-1/2} = \frac{n}{k} I\,.
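Here is a small numerical sketch of the transformation. It follows the dual/Barthe-style program mentioned above rather than the entropy-maximization argument itself, and the step size and iteration count are ad hoc: minimize {F(\lambda) = \log \det\left(\sum_i e^{\lambda_i} x_i x_i^T\right) - \frac{k}{n}\sum_i \lambda_i} by gradient descent. Its partial derivatives are {e^{\lambda_i} \langle x_i, U^{-1} x_i\rangle - \frac{k}{n}} (the identity computed above), so at a stationary point {A = U_*^{-1/2}} satisfies (1).

```python
import numpy as np

rng = np.random.default_rng(8)
n, k = 12, 3
X = rng.standard_normal((n, k))
X /= np.linalg.norm(X, axis=1, keepdims=True)     # unit vectors x_1, ..., x_n; generically
                                                  # every k of them are linearly independent
lam = np.zeros(n)
for _ in range(20000):
    U = X.T @ (np.exp(lam)[:, None] * X)          # U = sum_i e^{lam_i} x_i x_i^T
    grad = np.exp(lam) * np.einsum('ij,jk,ik->i', X, np.linalg.inv(U), X) - k / n
    lam -= 0.2 * grad                             # gradient step on the convex objective

w, V = np.linalg.eigh(U)
A = V @ np.diag(w ** -0.5) @ V.T                  # A = U_*^{-1/2}
Y = X @ A
Y /= np.linalg.norm(Y, axis=1, keepdims=True)     # the unit vectors A x_i / ||A x_i||
print(np.round(Y.T @ Y, 3))                       # approximately (n/k) * I = 4 * I
```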

Entropy optimality: Bloom’s Chang’s Lemma

[Added in post: One might consult this post for a simpler/more general (and slightly more correct) version of the results presented here.]

In our last post regarding Chang’s Lemma, let us visit a version due to Thomas Bloom. We will offer a new proof using entropy maximization. In particular, we will again use only boundedness of the Fourier characters.

There are two new (and elementary) techniques here: (1) using a trade-off between entropy maximization and accuracy and (2) truncating the Taylor expansion of {e^x} .

We use the notation from our previous post: {G=\mathbb F_p^n} for some prime {p} and {\mu} is the uniform measure on {G} . For {\eta > 0}  and {f \in L^2(G)} , we define {\Delta_{\eta}(f) = \{ \alpha \in G : |\hat f(\alpha)| \geq \eta \|f\|_1 \}} . We also use {Q_{\mu} \subseteq L^2(G)} to denote the set of all densities with respect to {\mu} .

Theorem 1 (Bloom) There is a constant {c > 0} such that for every {\eta > 0} and every density {f \in Q_{\mu}} , there is a subset {\Delta \subseteq \Delta_{\eta}(f) \subseteq G} such that {|\Delta| \geq c \eta |\Delta_{\eta}(f)|} and {\Delta} is contained in some {\mathbb F_p} subspace of dimension at most

\displaystyle  \frac{1+\mathrm{Ent}_{\mu}(f)}{c\eta}\,.

Note that we only bound the dimension of a subset of the large spectrum, but the bound on the dimension improves by a factor of {1/\eta} . Bloom uses this as the key step in his proof of what (at the time of writing) constitutes the best asymptotic bounds in Roth’s theorem on three-term arithmetic progressions:

Theorem 2 If a subset {A \subseteq \{1,2,\ldots,N\}} contains no non-trivial three-term arithmetic progressions, then

\displaystyle  |A| \leq O(1) \frac{\left(\log \log N\right)^4}{\log N} N\,.

This represents a modest improvement over the breakthrough of Sanders achieving {\frac{(\log \log N)^{O(1)}}{\log N} N} , but the proof is somewhat different.

1.1. A stronger version

In fact, we will prove a stronger theorem.

Theorem 3 For every {\eta > 0} and every density {f \in Q_{\mu}} , there is a random subset {\Delta \subseteq G} such that almost surely

\displaystyle  \dim_{\mathbb F_p}(\mathrm{span}\, \Delta) \leq 12 \frac{\mathrm{Ent}_{\mu}(f)}{\eta} + O(\log (1/\eta))\,,

and for every {\alpha \in \Delta_{\eta}(f)} , it holds that

\displaystyle  \mathbb P[\alpha \in \Delta] \geq \frac{\eta}{4}\,.

This clearly yields Theorem 1 by averaging.

1.2. The same polytope

To prove Theorem 3, we use the same polytope we saw before. Recall the class of test functionals {\mathcal F = \{ \pm \mathrm{Re}\,\chi_\alpha,  \pm \mathrm{Im}\,\chi_\alpha : \alpha \in G\} \subseteq L^2(G) }

We defined {P(f,\eta) \subseteq L^2(G)} by

\displaystyle  P(f,\eta) = \left\{ g \in L^2(G) : \langle g, \varphi \rangle \geq \langle f,\varphi\rangle - \eta\quad\forall \varphi \in \mathcal F\right\}\,.

Let us consider a slightly different convex optimization:

\displaystyle  \textrm{minimize } \mathrm{Ent}_{\mu}(g)+K \mathbf{\varepsilon} \qquad \textrm{ subject to } g \in P(f,\mathbf{\varepsilon}) \cap Q_{\mu}\,. \ \ \ \ \ (1)

Here, {K} is a constant that we will set soon. On the other hand, {\varepsilon \geq 0} is now intended as an additional variable over which to optimize. We allow the optimization to trade off the entropy term and the accuracy {\varepsilon} . The constant {K > 0} represents how much we value one vs. the other.

Notice that, since {f \in P(f,0) \cap Q_{\mu}} , this convex program satisfies Slater’s condition (there is a feasible point in the relative interior), meaning that strong duality holds (see Section 5.2.3).

1.3. The optimal solution

As in our first post on this topic, we can set the gradient of the Lagrangian equal to zero to obtain the form of the optimal solution: For some dual variables {\{\lambda^*_{\varphi} \geq 0: \varphi \in \mathcal F\} \subseteq \mathbb R} ,

\displaystyle  g^*(x) = \frac{\exp\left(\sum_{\varphi \in \mathcal F} \lambda^*_{\varphi} \varphi(x)\right)}{\mathop{\mathbb E}_{\mu} \left[\exp\left(\sum_{\varphi \in \mathcal F} \lambda^*_{\varphi} \varphi\right)\right]} \ \ \ \ \ (2)

Furthermore, corresponding to our new variable {\varepsilon} , there is a new constraint on the dual variables:

\displaystyle  \sum_{\varphi \in \mathcal F} \lambda^*_{\varphi} \leq K\,.

Observe now that if we put {K = 2 \frac{\mathrm{Ent}_{\mu}(f)}{\eta}} then we can bound {\varepsilon^*} (the error in the optimal solution): Since {f} is a feasible solution with {\varepsilon=0} , we have

\displaystyle  \mathrm{Ent}_{\mu}(g^*) + K \varepsilon^* \leq \mathrm{Ent}_{\mu}(f)\,,

which implies that {\varepsilon^* \leq \frac{\mathrm{Ent}_{\mu}(f)}{K} = \frac{\eta}{2}} since {\mathrm{Ent}_{\mu}(g^*) \geq 0} .

To summarize: By setting {K} appropriately, we obtain {g^* \in P(f,\eta/2) \cap Q_{\mu}} of the form (2) and such that

\displaystyle  \sum_{\varphi \in \mathcal F} \lambda_{\varphi}^* \leq 2\frac{\mathrm{Ent}_{\mu}(f)}{\eta}\,. \ \ \ \ \ (3)

Note that one can arrive at the same conclusion using the algorithm from our previous post: The version unconcerned with sparsity finds a feasible point after time {T \leq \frac{\mathrm{Ent}_{\mu}(f)}{\varepsilon}} . Setting {\varepsilon = \eta/2} yields the same result without using duality.

1.4. A Taylor expansion

Let us slightly rewrite {g^*} by multiplying the numerator and denominator by {\exp\left({\sum_{\varphi \in \mathcal F} \lambda^*_{\varphi}}\right)} . This yields:

\displaystyle  g^* = \frac{\exp\left(\sum_{\varphi \in \mathcal F} \lambda^*_{\varphi} (1+\varphi)\right)}{\mathop{\mathbb E}_{\mu} \exp\left(\sum_{\varphi \in \mathcal F} \lambda^*_{\varphi} (1+\varphi)\right)}

The point of this transformation is that now the exponent is a sum of positive terms (using {\|\varphi\|_{\infty} \leq 1} ), and furthermore by (3), the exponent is always bounded by

\displaystyle  B \stackrel{\mathrm{def}}{=} 4 \frac{\mathrm{Ent}_{\mu}(f)}{\eta}\,. \ \ \ \ \ (4)

Let us now Taylor expand {e^x = \sum_{j \geq 0} \frac{x^j}{j!}} . Applying this to the numerator, we arrive at an expression

\displaystyle  g^* = \sum_{\vec \alpha} y_{\vec \alpha} T_{\vec \alpha}

where {y_{\vec \alpha} \geq 0} , {\sum_{\vec \alpha} y_{\vec \alpha} = 1} , and each {T_{\vec \alpha} \in Q_{\mu}} is a density. Here, {\vec \alpha} ranges over all finite sequences of elements from {\mathcal F} and

\displaystyle  T_{\vec \alpha} = \frac{\prod_{i=1}^{|\vec \alpha|} (1+\varphi_{\vec \alpha_i})}{\mathop{\mathbb E}_{\mu} \prod_{i=1}^{|\vec \alpha|} (1+\varphi_{\vec \alpha_i})}=\prod_{i=1}^{|\vec \alpha|} (1+\varphi_{\vec \alpha_i})\,,

where we use {|\vec \alpha|} to denote the length of the sequence {\vec \alpha} .

1.5. The random subset

We now define a random function {\mathbf{T} \in L^2(G)} by taking {\mathbf{T}=T_{\vec \alpha}} with probability {y_{\vec \alpha}} .

Consider some {\gamma \in \Delta_{\eta}(f)} . Since {g^* \in P(f,\eta/2)} , we know that {\gamma \in \Delta_{\eta/2}(g^*)} . Thus

\displaystyle  \frac{\eta}{2} < |\langle g^*, \chi_{\gamma}\rangle| \leq \sum_{\vec \alpha} y_{\vec \alpha} |\langle T_{\vec \alpha}, \chi_{\gamma}\rangle| = \mathop{\mathbb E}\,|\langle \mathbf{T},\chi_{\gamma}\rangle|\,.

But we also have {|\langle T_{\vec \alpha}, \chi_{\gamma}\rangle| \leq \|T_{\vec \alpha}\|_1 \cdot \|\chi_{\gamma}\|_{\infty} \leq 1} . This implies that {\mathbb P[|\langle \mathbf{T}, \chi_{\gamma}\rangle| > 0] > \eta/2} .

Equivalently, for any {\gamma \in \Delta_{\eta}(f)} , it holds that {\mathbb P[\gamma \in \Delta_0(\mathbf{T})] > \eta/2} . We would be done with the proof of Theorem 3 if we also knew that {\mathbf{T}} were supported on functions {T_{\vec \alpha}} for which {|\vec \alpha| \leq O(B)} because {\dim_{\mathbb F_p}(\mathrm{span}(\Delta_0(T_{\vec \alpha}))) \leq |\vec \alpha|} . This is not necessarily true, but we can simply truncate the Taylor expansion to ensure it.

1.6. Truncation

Let {p_k(x) = \sum_{j \leq k} \frac{x^j}{j!}} denote the Taylor expansion of {e^x} to degree {k} . Since the exponent in {g^*} is always bounded by {B} (recall (4)), we have

\displaystyle  \sum_{|\vec \alpha| > k} y_{\vec \alpha} \leq \sup_{x \in [0,B]} \frac{|e^x - p_k(x)|}{e^x} \leq \frac{B^{k+1}}{(k+1)!}\,.

By standard estimates, we can choose {k \leq 3 B + O(\log(1/\eta))} to make the latter quantity at most {\eta/4} .
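The “standard estimates” are easy to make concrete (a sketch; the values of {B} and {\eta} are arbitrary samples, and the constant 10 in the final check is an illustrative stand-in for the {O(\cdot)} ): compute the tail bound {B^{k+1}/(k+1)!} and find the smallest {k} that pushes it below {\eta/4} .

```python
import math

def min_truncation_degree(B, eta):
    """Smallest k with B^(k+1)/(k+1)! <= eta/4 (the bound on the dropped Taylor mass)."""
    k, tail = 0, B                       # tail = B^(k+1)/(k+1)!
    while tail > eta / 4:
        k += 1
        tail *= B / (k + 1)
    return k

for B, eta in [(5.0, 0.1), (20.0, 0.01), (100.0, 0.001)]:
    k = min_truncation_degree(B, eta)
    print(B, eta, k, k <= 3 * B + 10 * math.log(1 / eta))   # consistent with k <= 3B + O(log(1/eta))
```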

Since {\sum_{|\vec \alpha| > k} y_{\vec \alpha} \leq \eta/4} , a union bound combined with our previous argument immediately implies that for {\gamma \in \Delta_{\eta}(f)} , we have

\displaystyle  \mathbb P\left[|\langle \mathbf{T}, \chi_{\gamma}\rangle| > 0 \textrm{ and } \dim_{\mathbb F_p}(\mathrm{span}(\Delta_0(\mathbf{T}))) \leq k \right] \geq \frac{\eta}{4}\,.

This completes the proof of Theorem 3.

1.7. Prologue: A structure theorem

Generalizing the preceding argument a bit, one can prove the following.

Let {G} be a finite abelian group and use {\hat G} to denote the dual group. Let {\mu} denote the uniform measure on {G} . For every {\gamma \in \hat G} , let {\chi_{\gamma}} denote the corresponding character. Let us define a degree-{k} Riesz product to be a function of the form

\displaystyle  R(x) = \prod_{i=1}^k (1+\varepsilon_{i} \Lambda_i \chi_{\gamma_i}(x))

for some {\gamma_1, \ldots, \gamma_k \in \hat G} and {\varepsilon_1, \ldots, \varepsilon_k \in \{-1,1\}} and {\Lambda_1, \ldots, \Lambda_k \in \{\mathrm{Re},\mathrm{Im}\}} .

Theorem 4 For every {\eta > 0} , the following holds. For every {f : G \rightarrow \mathbb R_+} with {\mathbb E_{\mu} [f]=1} , there exists a {g : G \rightarrow \mathbb R_+} with {\mathbb E_{\mu} [g] = 1} such that {\|\hat f - \hat g\|_{\infty} \leq \eta} and {g} is a convex combination of degree-{k} Riesz products where

\displaystyle  k \leq O(1) \frac{\mathrm{Ent}_{\mu}(f)}{\eta} + O(\log (1/\eta))\,.
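To make the notion of a degree-{k} Riesz product concrete, here is a small sketch (Python with numpy, purely illustrative) that tabulates such a product on {G = \mathbb Z_p^n} and checks that it is a non-negative function. The character convention and the specific frequencies below are assumptions made only for this example:

```python
import itertools
import numpy as np

def riesz_product(p, n, gammas, signs, parts):
    """Tabulate R(x) = prod_i (1 + eps_i * Lambda_i chi_{gamma_i}(x)) on G = (Z_p)^n.

    gammas: frequency vectors in (Z_p)^n; signs: values in {-1,+1}; parts: 'Re' or 'Im'.
    Character convention (an assumption): chi_gamma(x) = exp(2*pi*i*<gamma,x>/p)."""
    G = list(itertools.product(range(p), repeat=n))
    R = np.ones(len(G))
    for gamma, eps, part in zip(gammas, signs, parts):
        phase = np.array([sum(a * x for a, x in zip(gamma, pt)) % p for pt in G])
        chi = np.exp(2j * np.pi * phase / p)
        factor = chi.real if part == 'Re' else chi.imag
        R *= 1.0 + eps * factor        # each factor lies in [0, 2]
    return R

R = riesz_product(3, 2, gammas=[(1, 0), (1, 2)], signs=[+1, -1], parts=['Re', 'Im'])
assert (R >= -1e-12).all()             # Riesz products are non-negative
print(R.mean())                        # E_mu[R] under the uniform measure on G
```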

1.8. A prologue’s prologue

To indicate the lack of algebraic structure required for the preceding statement, we can set things up in somewhat greater generality.

For simplicity, let {X} be a finite set equipped with a probability measure {\mu} . Recall that {L^2(X)} is the Hilbert space of real-valued functions on {X} equipped with the inner product {\langle f,g\rangle = \mathop{\mathbb E}_{\mu}[fg]} . Let {\mathcal F \subseteq L^2(X)} be a set of functionals with the property that {\|\varphi\|_{\infty} \leq 1} for {\varphi \in \mathcal F} .

Define a degree-{k} {\mathcal F} -Riesz product as a function of the form

\displaystyle  R(x) = \prod_{i=1}^k (1+\varphi_i(x))

for some functions {\varphi_1, \ldots, \varphi_k \in \mathcal F} . Define also the (semi-) norm {\|f\|_{\mathcal F} = \sup_{\varphi \in \mathcal F} |\langle f,\varphi\rangle_{L^2(X)}|} .

Theorem 5 For every {\eta > 0} , the following holds. For every {f : X \rightarrow \mathbb R_+} with {\mathbb E_{\mu} [f]=1} , there exists a {g : X \rightarrow \mathbb R_+} with {\mathbb E_{\mu} [g] = 1} such that {\|f - g\|_{\mathcal F} \leq \eta} and {g} is a convex combination of degree-{k} {\mathcal F} -Riesz products where

\displaystyle  k \leq O(1) \frac{\mathrm{Ent}_{\mu}(f)}{\eta} + O(\log (1/\eta))\,.

Entropy optimality: A potential function

Our goal now is to prove Lemma 3 from the last post, thereby completing the entropy-maximization proof of Chang’s Lemma. We will prove it now in greater generality because it shows how relative entropy provides a powerful potential function for measuring progress.

Let {X} be a finite set equipped with a measure {\mu} . We use {L^2(X,\mu)} to denote the space of real-valued functions {f : X \rightarrow \mathbb R} equipped with the inner product {\langle f,g\rangle = \sum_{x \in X} \mu(x) f(x) g(x)} . Let {Q_{\mu} \subseteq L^2(X,\mu)} denote the convex polytope of probability densities with respect to {\mu} :

\displaystyle  Q_{\mu} = \left\{ g \in L^2(X,\mu) : g \geq 0, \mathop{\mathbb E}_{\mu}[g]=1 \right\}\,.

1.1. A ground truth and a family of tests

Fix now a density {f \in Q_{\mu}} and a family {\mathcal T \subseteq L^2(X,\mu)} that one can think of as “tests” (or properties of a function we might care about). Given an error parameter {\varepsilon > 0} , we define a convex polytope {P_{\varepsilon}(f, \mathcal T) \subseteq L^2(X,\mu)} as follows:

\displaystyle  P_{\varepsilon}(f,\mathcal T) = \left\{ g : \langle \varphi, g \rangle \geq \langle \varphi,f\rangle - \varepsilon \quad \forall \varphi \in \mathcal T\right\}\,.

This is the set of functions that have “performance” similar to that of {f} on all the tests in {\mathcal T} . Note that the tests are one-sided; if we wanted two-sided bounds for some {\varphi} , we could just add {-\varphi} to {\mathcal T} .

1.2. (Projected) coordinate descent in the dual

We now describe a simple algorithm to find a point {g \in P_{\varepsilon}(f,\mathcal T) \cap Q_{\mu}} . It will be precisely analogous to the algorithm we saw in the previous post for the natural polytope coming from Chang’s Lemma. We define a family of functions {\{g_t : t \geq 0\} \subseteq Q_{\mu}} indexed by a continuous time parameter:

\displaystyle  g_t = \frac{\exp\left(\int_0^t \varphi_s\,ds\right)}{\mathop{\mathbb E}_{\mu}\left[\exp\left(\int_0^t \varphi_s\,ds\right)\right]}\,.

Here {\varphi_t \in \mathcal T} is some test we have yet to specify. Intuitively, it will be a test that {g_t} is not performing well on. The idea is that at time {t} , we exponentially average some of {\varphi_t} into {g_t} . The exponential ensures that {g_t} is non-negative, and our normalization ensures that {g_t \in Q_{\mu}} . Observe that {g_0 = \mathbf{1}} is the constant {1} function.


1.3. The potential function

For two densities {h,g \in Q_{\mu}} , we define the relative entropy

\displaystyle  D_{\mu}(h\,\|\,g) = \sum_{x \in X} \mu(x) h(x) \log \frac{h(x)}{g(x)}\,.

Our goal is to analyze the potential function {D_{\mu}(f \,\|\,g_t)} . This functional measures how far the “ground truth” {f} is from our hypothesis {g_t} .

Note that {D_{\mu}(f \,\|\,g_0) = D_{\mu}(f \,\|\, \mathbf{1}) = \mathrm{Ent}_{\mu}(f)} , using the notation from the previous post. Furthermore, since the relative entropy between two probability measures is always non-negative, {D_{\mu}(f\,\|\,g_t) \geq 0} for all {t \geq 0} .

Now a simple calculation gives

\displaystyle  \frac{d}{dt} D_{\mu}(f\,\|\,g_t) = \langle \varphi_t, g_t-f\rangle\,. \ \ \ \ \ (1)

In other words, as long as the constraint {\langle \varphi_t, g_t\rangle \geq \langle \varphi_t, f\rangle - \varepsilon} is violated by {g_t} , the potential is decreasing at rate at least {\varepsilon} !

Since the potential starts at {\mathrm{Ent}_{\mu}(f)} and must always be non-negative, if we choose at every time {t} a violated test {\varphi_t} , then after time {T \leq \frac{\mathrm{Ent}_{\mu}(f)}{\varepsilon}} , it must be that {g_t \in P_{\varepsilon}(f,\mathcal T)} for some {t \in [0,T]} .
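As a quick sanity check of (1), one can fix a single test {\varphi} , form {g_t = e^{t\varphi}/\mathop{\mathbb E}_{\mu}[e^{t\varphi}]} , and compare a finite difference of {D_{\mu}(f\,\|\,g_t)} against {\langle \varphi, g_t - f\rangle} . A minimal sketch (Python with numpy; the random {f} , {\varphi} , and uniform {\mu} are illustrative choices, not part of the argument):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 50
mu = np.full(N, 1.0 / N)                      # a probability measure on X
f = 0.1 + rng.random(N)
f /= (mu * f).sum()                           # make f a density: E_mu[f] = 1
phi = rng.uniform(-1.0, 1.0, N)               # one bounded test, |phi| <= 1

def g(t):                                     # g_t = exp(t*phi) / E_mu[exp(t*phi)]
    w = np.exp(t * phi)
    return w / (mu * w).sum()

def D(a, b):                                  # relative entropy D_mu(a || b)
    return (mu * a * np.log(a / b)).sum()

t, h = 0.7, 1e-6
lhs = (D(f, g(t + h)) - D(f, g(t - h))) / (2 * h)   # finite difference of D_mu(f || g_t)
rhs = (mu * phi * (g(t) - f)).sum()                 # <phi, g_t - f>
print(lhs, rhs)                                     # the two agree up to numerical error
```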

1.4. A sparse solution

But our goal in Chang’s Lemma was to find a “sparse” {g \in P_{\varepsilon}(f,\mathcal T)} . In other words, we want {g} to be built out of only a few constraints. To accomplish this, we should have {\varphi_t} switch between different tests as little as possible.

So when we find a violated test {\varphi \in \mathcal T} , let’s keep {\varphi_t = \varphi} until {\langle \varphi, g_t\rangle = \langle \varphi,f\rangle} . How fast can the quantity {\langle \varphi, f - g_t\rangle} drop from larger than {\varepsilon} to zero?

Another simple calculation (using {\varphi_t = \varphi} ) yields

\displaystyle  \frac{d}{dt} \langle \varphi, f - g_t\rangle = - \left\langle \vphantom{\bigoplus} \varphi, g_t (\varphi - \langle \varphi,g_t\rangle)\right\rangle = - \langle \varphi^2, g_t \rangle + \langle \varphi, g_t\rangle^2\,.

This quantity is at least {- \|g_t\|_1 \cdot \|\varphi\|_{\infty}^2 = -\|\varphi\|_{\infty}^2} .

In order to calculate how much the potential drops while focused on a single constraint, we can make the pessimistic assumption that {\frac{d}{dt} \langle \varphi, f - g_t\rangle = - \|\varphi\|_{\infty}^2} . (This is formally justified by Grönwall’s inequality.) In this case (recalling (1)), the potential drop is at least

\displaystyle  \frac{1}{\|\varphi\|_{\infty}^2} \int_0^{\varepsilon} x \,dx = \frac12 \frac{\varepsilon^2}{\|\varphi\|_{\infty}^2}\,.

Since the potential can drop by at most {\mathrm{Ent}_{\mu}(f)} overall, this bounds the number of steps of our algorithm, yielding the following result.

Theorem 1 For every {\varepsilon > 0} , there exists a function {g \in P_{\varepsilon}(f,\mathcal T) \cap Q_{\mu}} such that for some {\{\lambda_i \geq 0\}} and {\{\varphi_i\} \subseteq \mathcal T}

\displaystyle  g = \frac{\exp\left(\sum_{i=1}^k \lambda_i \varphi_i\right)}{\mathop{\mathbb E}_{\mu}\left[\exp\left(\sum_{i=1}^k \lambda_i \varphi_i\right)\right]}\,,

for some value

\displaystyle  k \leq 2 \frac{\mathrm{Ent}_{\mu}(f)}{\varepsilon^2} \sup_{\varphi \in \mathcal T} \|\varphi\|_{\infty}^2\,.

In order to prove Lemma 3 from the preceding post, simply use the fact that our tests were the real and imaginary parts of characters (and their negations), and these satisfy {\|\varphi\|_{\infty} \leq \|\chi_{\alpha}\|_{\infty} \leq 1} .

1.5. Bibliographic notes

Mohit Singh has pointed out to me the similarity of this approach to the Frank-Wolfe algorithm. The use of the potential {D(f \,\|\,g_t)} is a common feature of algorithms that go by the names “boosting” or “multiplicative weights update” (see, e.g., this survey).

Entropy optimality: Chang’s Lemma

In the last post, we saw how minimizing the relative entropy allowed us to obtain a “simple” feasible point of a particular polytope. We continue exploring this theme by giving a related proof of Chang’s Lemma, a Fourier-analytic result that arose in additive combinatorics. Our proof will reveal another aspect of entropy optimality: “Projected sub-gradient descent” and a corresponding convergence analysis.

1.1. Discrete Fourier analysis and Chang’s Lemma

Fix {n \geq 1} and a prime {p} . Let {\mathbb F_p} denote the finite field with {p} elements. We will work over {\mathbb F_p^n} . For convenience, denote {G=\mathbb F_p^n} . Let {\mu} denote the uniform measure on {G} , and for a non-negative function {f : G \rightarrow [0,\infty)} , define the relative entropy

\displaystyle  \mathrm{Ent}_{\mu}(f) = \mathbb E_{\mu} \left[f \log \frac{f}{\mathbb E_{\mu} f}\right]\,,

where we use the shorthand {\mathop{\mathbb E}_{\mu} [f] = \sum_{x \in G} \mu(x) f(x)} . If {\mathop{\mathbb E}_{\mu} [f] = 1} , we say that {f} is a density (with respect to {\mu} ). In that case, {\mathrm{Ent}_{\mu}(f) = D(f \mu \,\|\,\mu)} where {D} denotes the relative entropy (introduced in the last post).

We briefly review some notions from discrete Fourier analysis. One could skip this without much loss; our proof will only use boundedness of the Fourier characters and no other features. For functions {f,g : G \rightarrow \mathbb C} , we will use the inner product {\langle f,g\rangle = \mathop{\mathbb E}_{\mu}[f\bar g]} and the norms {\|f\|_q = (\mathop{\mathbb E}_{\mu}[|f|^q])^{1/q}} for {q \geq 1} .

For an element {\alpha \in G} , we use

\displaystyle  \chi_{\alpha}(x) = \exp\left(\frac{-2\pi i}{p} \sum_{j=1}^n \alpha_j x_j\right)

to denote the corresponding Fourier character. Now every {f : G \rightarrow \mathbb C} can be written in the Fourier basis as

\displaystyle  f = \sum_{\alpha \in G} \hat f(\alpha) \chi_{\alpha}\,,

where {\hat f(\alpha) = \langle f,\chi_{\alpha}\rangle} . (For simplicity of notation, we have implicitly identified {G} and {\hat G} .)

We will be interested in the large Fourier coefficients of a function. To that end, for {\eta > 0} , we define

\displaystyle  \Delta_{\eta}(f) = \left\{ \alpha \in G : |\hat f(\alpha)| > \eta \|f\|_1 \right\}\,.
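For concreteness, here is a toy computation of {\Delta_{\eta}(f)} (a minimal sketch in Python; the specific {p} , {n} , and the random density {f} are illustrative assumptions, not part of the argument):

```python
import itertools
import numpy as np

p, n, eta = 3, 2, 0.1
G = list(itertools.product(range(p), repeat=n))    # the group F_p^n
mu = 1.0 / len(G)                                  # uniform measure

def chi(alpha):
    """The character chi_alpha, with the same convention as in the text."""
    return np.array([np.exp(-2j * np.pi * sum(a * x for a, x in zip(alpha, pt)) / p)
                     for pt in G])

rng = np.random.default_rng(1)
f = 0.1 + rng.random(len(G))
f /= mu * f.sum()                                  # make f a density: E_mu[f] = 1

fhat = {alpha: mu * (f * np.conj(chi(alpha))).sum() for alpha in G}   # <f, chi_alpha>
norm1 = mu * np.abs(f).sum()                       # ||f||_1 (equals 1 for a density)
Delta = [alpha for alpha in G if abs(fhat[alpha]) > eta * norm1]
print(Delta)                                       # always contains alpha = 0
```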

Now we are ready to state Chang’s Lemma; it asserts that if {\mathrm{Ent}_{\mu}(f)} is small, then all of {f} ‘s large Fourier coefficients are contained in a low-dimensional subspace.

Lemma 1 If {f : G \rightarrow [0,\infty)} is a density, then for every {\eta > 0} the set {\Delta_{\eta}(f)} is contained in an {\mathbb F_p} -subspace of dimension at most {d} , where

\displaystyle  d \leq 2 \sqrt{2} \frac{\mathrm{Ent}_{\mu}(f)}{\eta^2}\,.

In the special case {p=2} , the proof will actually give constant {2} instead of {2 \sqrt{2}} .

1.2. A slightly stronger statement

In fact, we will prove the following slightly stronger statement.

Theorem 2 If {f : G \rightarrow [0,\infty)} is a density, then for every {\eta > 0} there exists a density {g : G \rightarrow [0,\infty)} such that {\Delta_{\eta}(f) \subseteq \Delta_0(g)} and moreover {g(x) = \Psi(\chi_{\alpha_1}(x), \ldots, \chi_{\alpha_d}(x))} for some {d \leq 1+2\sqrt{2} \frac{\mathrm{Ent}_{\mu}(f)}{\eta^2}} and {\{\alpha_i\} \subseteq G} and some function {\Psi : \mathbb C^d \rightarrow \mathbb R} . In other words, {g} is a function of at most {d} characters.

It is an exercise in linear algebra to see that for such a density {g} , the set of non-zero Fourier coefficients {\Delta_0(g)} is contained in the span of {\alpha_1, \ldots, \alpha_d} . Since {\Delta_{\eta}(f) \subseteq \Delta_0(g)} , this yields Lemma 1.

1.3. Entropy optimality

Let {L^2(G)} denote the Hilbert space of real-valued functions on {G} equipped with the inner product {\langle \cdot, \cdot\rangle} . Assume now that {f : G \rightarrow [0,\infty)} is a density and fix {\eta > 0} .

Denote by {\mathcal F} the family of functions of the form {\pm \mathrm{Re}\,\chi_\alpha} or {\pm \mathrm{Im}\, \chi_\alpha} for some {\alpha \in G} .

Let us associate to {f} the convex polytope {P(f, \eta) \subseteq L^2(G)} of functions {g : G \rightarrow [0,\infty)} satisfying

\displaystyle  \langle g,\varphi\rangle \geq \langle f,\varphi\rangle - \eta

for all {\varphi \in \mathcal F} .

Just as in the matrix scaling setting, {P(f,\eta)} encapsulates the family of “tests” that we care about (which Fourier coefficient magnitudes are large). Note that if {g \in P(f,\eta/\sqrt{2})} , then {\Delta_{\eta}(f) \subseteq \Delta_0(g)} . We also let {Q} denote the probability (density) simplex: The set of all {g : G \rightarrow [0,\infty)} such that {\mathop{\mathbb E}_{\mu} g = 1} .

Of course, {P(f,\eta) \cap Q} is non-empty because {f \in P(f,\eta) \cap Q} . As before, we will try to find a “simple” representative from {P(f,\eta) \cap Q} by doing entropy maximization (which is the same thing as relative entropy minimization):

\displaystyle  \textrm{minimize } \mathrm{Ent}_{\mu}(g) = \sum_{x \in G} \mu(x) g(x) \log g(x) \qquad \textrm{ subject to } g \in P(f,\eta) \cap Q\,.

1.4. Sub-gradient descent

Soon we will explore this optimization using convex duality. But in this post, let us give an algorithm that tries to locate a feasible point of {P(f,\eta) \cap Q} by a rather naive form of (sub-)gradient descent.

We start with the initial point {g_0 = \mathbf{1}} which is the function that is identically {1} on all of {G} . Note that by strict convexity, this is the unique minimizer of {\mathrm{Ent}_{\mu}(g)} among all {g \in Q} (i.e., the maximum-entropy density). But clearly it may be that {g_0 \notin P(f,\eta)} .

So suppose we have a function {g_t \in Q} for some {t \geq 0} . If {g_t \in P(f,\eta)} , we are done. Otherwise, there exists a violated constraint: some {\varphi_t \in \mathcal F} such that {\langle g_t, \varphi_t\rangle < \langle f,\varphi_t\rangle - \eta} . Our goal is now to update {g_t} to a function {g_{t+1} \in Q} that doesn't violate this constraint quite as much.

Notice the dichotomy being drawn here between the “soft” constraints of {P(f,\eta)} (which we try to satisfy slowly) and the “hard” constraints of {Q} that we are enforcing at every step. (This procedure and the analysis that follows are a special case of the mirror descent framework; see, for example, Section 4 of Bubeck’s monograph. In particular, there is a more principled way to design the algorithm that follows.)

We might try to write {g_{t+1} = g_t + \varepsilon \varphi_t} for some small {\varepsilon > 0} , but then {g_{t+1}} could take negative values! Instead, we will follow the approximation {e^x \approx 1+x} (for {x} small) and make the update

\displaystyle  g_{t+1} = \frac{g_t \exp\left(\varepsilon \varphi_t\right)}{\mathbb E_{\mu}[g_t \exp\left(\varepsilon \varphi_t\right)]}\,.

Observe that the exponential weight update keeps {g_{t+1}} positive, and our explicit renormalization ensures that {\mathop{\mathbb E}_{\mu} [g_{t+1}]=1} . Thus {g_{t+1} \in Q} .

Of course, one might find this whole procedure fishy. Sure, we have improved the constraint corresponding to {\varphi_t} , but maybe we messed up some earlier constraints that we had satisfied. While this may happen, it cannot go on for too long. In the next post, we will prove the following.

Lemma 3 For an appropriate choice of step size {\varepsilon > 0} , it holds that {g_T \in P(f,\eta)} for some {T \leq 2 \frac{\mathrm{Ent}_{\mu}(f)}{\eta^2}} .

Notice that this lemma proves Theorem 2 and thus finishes our proof of Chang’s Lemma. That’s because, applying the lemma with {\eta/\sqrt{2}} in place of {\eta} , we get {g_T \in P(f,\eta/\sqrt{2})} and hence {\Delta_{\eta}(f) \subseteq \Delta_0(g_T)} . Moreover, {g_T} is built out of at most {T+1} characters {\alpha_0, \alpha_1, \ldots, \alpha_{T}} , and thus it satisfies the conclusion of Theorem 2.

For the case {p=2} , we do not have to approximate the real and imaginary parts separately, so one saves a factor of {\sqrt{2}} .
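To make the procedure concrete, here is a minimal sketch of the descent for {p=2} (where the characters are real-valued, so no splitting into real and imaginary parts is needed). The particular {n} , the density {f} , and the step size {\varepsilon = \eta} are illustrative assumptions; the precise choice of step size is the subject of the analysis in the next post.

```python
import itertools
import numpy as np

n, eta = 4, 0.25
G = np.array(list(itertools.product((0, 1), repeat=n)))   # the group F_2^n
N = len(G)

def chi(alpha):
    """chi_alpha(x) = (-1)^<alpha,x>, which is real-valued when p = 2."""
    return (-1.0) ** (G @ np.asarray(alpha) % 2)

tests = [s * chi(alpha)                                    # the family F = {+-chi_alpha}
         for alpha in itertools.product((0, 1), repeat=n) for s in (+1, -1)]

f = np.zeros(N)
f[:4] = N / 4.0                 # a density concentrated on 4 points, so Ent_mu(f) = log 4

def ip(u, v):                   # <u, v> = E_mu[u v] under the uniform measure
    return (u * v).mean()

g, eps, used = np.ones(N), eta, []                         # g_0 = 1
for _ in range(1000):
    violated = next((phi for phi in tests if ip(g, phi) < ip(f, phi) - eta), None)
    if violated is None:
        break                                              # g is in P(f, eta)
    g = g * np.exp(eps * violated)
    g = g / g.mean()                                       # renormalize: E_mu[g] = 1
    used.append(violated)

print(len(used), "updates; g is a function of at most", len(used), "characters")
```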

1.5. Bibliographic notes

This argument bears some similarity to the proof of Impagliazzo, Moore, and Russell. It was discovered in a joint work with Chan, Raghavendra, and Steurer on extended formulations. In both those arguments, the “descent” is implicit and is done by hand—the characters are carefully chosen not to interact. It turns out that this level of care is unnecessary. The analysis in the next post will allow the descent to choose any violated constraint at any time, and moreover will only use the fact that the characters are bounded.

Entropy optimality: Matrix scaling

This is the first in a series of posts on the surprising power of “entropy optimality.” I have recently encountered many incarnations of this principle in a number of different settings (functional analysis, stochastic processes, additive number theory, communication complexity, etc.). It is also connected to many well-studied phenomena in machine learning, optimization, and statistical physics. In the spirit of blogs, my goal is for each post to highlight a single interesting aspect of the theory.

For the first post, we’ll look at a very simple but compelling example: matrix scaling. This is a problem that arises in statistics, numerical analysis, and a number of other areas (see Section 1.2 here for references).

Suppose we are given an {n \times n} target matrix {T=(t_{ij})} with nonnegative entries. We also have an {n \times n} input matrix {X=(x_{ij})} and our goal is to multiply the rows and columns of {X} by positive weights so that the resulting matrix has the same row and column sums as {T} .

Stated differently, we look for diagonal matrices {D_1} and {D_2} such that {D_1 X D_2} has the same row and column sums as {T} . A typical choice of target might be {t_{ij} = 1/n} for all {1 \leq i,j \leq n} , meaning that we want to rescale {X} to be doubly stochastic.

Theorem 1 If {x_{ij} = 0 \iff t_{ij} = 0} for all {1 \leq i,j \leq n} , then such a weighting exists.

To prove the theorem, we will search for a matrix {Y = (y_{ij})} with the same row and column sums as {T} : for all {i,j \in [n]} ,

\displaystyle \sum_{i} y_{ij}=\sum_i t_{ij} \quad \textrm{ and } \quad \sum_j y_{ij} = \sum_j t_{ij}\,. \ \ \ \ \ (1)

Clearly such a matrix exists; we could just take {Y=T} ! But the idea is that we want {Y}  to be a “simple perturbation” of {X} . We will follow a philosophy analogous to Jaynes’ principle of maximum entropy. What results is precisely the argument of Franklin and Lorenz.

1.1. The relative entropy

Given two probability distributions {\mu} and {\nu} on a finite set {S} , recall that the relative entropy between them is given by

\displaystyle D(\mu\,\|\,\nu) = \sum_{x \in S} \mu(x) \log \frac{\mu(x)}{\nu(x)}\,.

We take this quantity to be infinite unless {\nu(x) = 0 \implies \mu(x)=0} . A fundamental property of {D} is that it is jointly convex in its arguments. For the following, we will only need that {D} is convex in {\mu} which is a simple consequence of the fact that {y \mapsto y \log y} is a convex function on {\mathbb R_+} .
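As a tiny illustrative helper (not from the text), here is the definition with the conventions {0 \log 0 = 0} and the {+\infty} case made explicit:

```python
import numpy as np

def rel_entropy(mu, nu):
    """D(mu || nu) with 0*log(0/q) = 0, and +infinity unless nu(x)=0 implies mu(x)=0."""
    mu, nu = np.asarray(mu, float), np.asarray(nu, float)
    if np.any((nu == 0) & (mu > 0)):
        return np.inf
    mask = mu > 0
    return float(np.sum(mu[mask] * np.log(mu[mask] / nu[mask])))

print(rel_entropy([0.5, 0.5], [0.9, 0.1]))   # about 0.51 (in nats)
print(rel_entropy([0.5, 0.5], [1.0, 0.0]))   # inf
```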

In proving Theorem 1, we may clearly assume that {\sum_{ij} x_{ij} = \sum_{ij} t_{ij} = 1} . Now let {P(T) \subseteq \mathbb R^{n^2}} denote the affine subspace of all {Y} satisfying the constraints (1). We will consider the optimization problem:

\displaystyle \textrm{minimize } D(Y \,\|\, X) = \sum_{ij} y_{ij} \log \frac{y_{ij}}{x_{ij}} \quad \textrm{ subject to } Y \in P(T)\,. \ \ \ \ \ (2)

The idea is that the objective function prefers {Y} to be as close to {X} as possible in terms of relative entropy while still satisfying the constraints. Philosophically, this is supposed to encourage {Y} to be “simple” with respect to {X} .

Observe that, by our assumption on {X} , it follows that {T \in P(T)} is a point satisfying {D(T \,\|\,X) < \infty} . Note that since {P(T)} is a convex, compact set, we know that there is an optimizer, and since {Y \mapsto D(Y \,\|\,X)} is strictly convex, the optimal solution {Y^*} is unique. It will turn out that {Y^*} is of the form {D_1 X D_2} , and this fact will prove our theorem. The last step is to understand the structure of {Y^*} in terms of the Lagrangian.

1.2. The Lagrangian

Let us introduce {2n} dual variables {\{ \alpha_i, \beta_j \in \mathbb R : i,j \in [n] \}} and consider the Lagrangian (see the Boyd-Vandenberghe book) corresponding to the optimization (2):

\displaystyle \sum_{ij} y_{ij} \log \frac{y_{ij}}{x_{ij}} + \sum_i \alpha_i \sum_j (y_{ij}-t_{ij}) + \sum_j \beta_j \sum_i (y_{ij}-t_{ij})\,.

At optimality, the gradient should be equal to zero. If we differentiate with respect to some {y_{ij}} , we see that

\displaystyle 1+\log y_{ij} + \alpha_i + \beta_j - \log x_{ij} = 0\,,

which implies that, at optimality,

\displaystyle y^*_{ij} = x_{ij}\, e^{-\alpha^*_i-\beta^*_j-1}\,.

In particular, we see that {Y^*} is obtained from {X} by multiplying the rows by {e^{-\alpha_i^*-1}} and multiplying the columns by {e^{-\beta_j^*}} , completing the proof of Theorem 1.
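In practice, the weights can also be computed by the classical alternating normalization (Sinkhorn) iteration, which repeatedly rescales rows and then columns to match the target sums. The following is a minimal sketch of that algorithm, included only as a complement to the existence argument above; the tolerance, iteration cap, and example matrices are arbitrary illustrative choices.

```python
import numpy as np

def scale_to_target(X, T, iters=10000, tol=1e-10):
    """Find positive row/column weights r, c so that diag(r) @ X @ diag(c)
    has the same row and column sums as T (assumes x_ij = 0 iff t_ij = 0)."""
    X, T = np.asarray(X, float), np.asarray(T, float)
    row_t, col_t = T.sum(axis=1), T.sum(axis=0)
    r, c = np.ones(X.shape[0]), np.ones(X.shape[1])
    for _ in range(iters):
        Y = r[:, None] * X * c[None, :]
        r *= row_t / Y.sum(axis=1)             # match the row sums
        Y = r[:, None] * X * c[None, :]
        c *= col_t / Y.sum(axis=0)             # match the column sums
        Y = r[:, None] * X * c[None, :]
        if np.abs(Y.sum(axis=1) - row_t).max() < tol:
            break
    return r, c

X = np.array([[2.0, 1.0], [1.0, 3.0]])
T = np.full((2, 2), 0.25)                      # target: doubly stochastic, total mass 1
r, c = scale_to_target(X, T)
Y = r[:, None] * X * c[None, :]
print(Y.sum(axis=1), Y.sum(axis=0))            # both close to (0.5, 0.5)
```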

1.3. Being careful about optimality

An astute reader (or, in this case, commenter) might point out that we did not use the condition {t_{ij} = 0 \implies x_{ij}=0} . But clearly it is necessary, as one can see from examining the pair

\displaystyle  T = \left(\begin{array}{cc} 0 & 1 \\ 1 & 0 \end{array}\right) \quad X = \left(\begin{array}{cc} 1 & 1 \\ 1 & 0 \end{array}\right)\,.

See these lecture notes for a detailed account.