Using the notation from the last post, our goal is now to prove the following theorem.

Theorem 1. For every and , there is a constant such that for all ,

** 1.1. Convex relaxations of non-negative rank **

Before getting to the proof, let us discuss the situation in somewhat more generality. Consider finite sets and and a matrix with .

In order to use entropy maximization, we would like to define a convex set of low non-negative rank factorizations (so that maximizing entropy over this set will give us a “simple” factorization). But the convex hull of is precisely the set of all non-negative matrices.

Instead, let us proceed analytically. For simplicity, let us equip both and with the uniform measure. Let denote the set of probability densities on . Now define

Here are the columns of and are the rows of . Note that now is unconstrained.
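Since the displayed definition did not survive extraction, here is a guess at its general shape, in placeholder notation of my own ($a_i$, $b_i$, $\alpha$, $\mathcal{D}(Y)$); the post's actual symbols and the precise normalization may differ:

```latex
\[
  \mathcal{F}_\alpha \;=\; \Bigl\{\, M : X \times Y \to \mathbb{R}_+ \;\Bigm|\;
    M(x,y) = \sum_{i} a_i(x)\, b_i(y),\quad
    b_i \in \mathcal{D}(Y),\ a_i \ge 0,\ \
    \textstyle\sum_i \|a_i\|_\infty \le \alpha \,\Bigr\}.
\]
```

The point of a definition of this shape is that the number of summands is not bounded; only an analytic quantity about the factors is.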

Observe that is a convex function. To see this, given a pair and , write

witnessing the fact that .
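The stripped display above is presumably the standard convexity witness; in placeholder symbols of my own ($a_i \otimes b_i$ for the rank-one term $a_i(x) b_i(y)$), it would read:

```latex
% If M = \sum_i a_i \otimes b_i and M' = \sum_i a_i' \otimes b_i'
% with each b_i, b_i' a probability density, then for 0 <= \lambda <= 1:
\[
  \lambda M + (1-\lambda) M'
    \;=\; \sum_{i} (\lambda a_i) \otimes b_i
      \;+\; \sum_{i} \bigl((1-\lambda)\, a_i'\bigr) \otimes b_i'.
\]
% The right-hand side is again a factorization of the same form,
% and the analytic quantity of the combined factorization is the
% corresponding convex combination of the two original ones.
```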

** 1.2. A truncation argument **

So the set is convex, but it’s not yet clear how it relates to . We will now see that matrices of low non-negative rank are close to matrices with small. In standard communication complexity and discrepancy arguments, this corresponds to discarding “small rectangles.”

In the following lemma, we use the norms and .

Lemma 2. For every non-negative with and every , there is a matrix such that and

*Proof:* Suppose that with , and let us interpret this factorization in the form

(where are the columns of and are the rows of ). By rescaling the columns of and the rows of , respectively, we may assume that for every .

Let denote the “bad set” of indices (we will choose momentarily). Observe that if , then

from the representation (1) and the fact that all summands are positive.

Define the matrix . It follows that

Each of the latter terms is at most and , thus

Next, observe that

implying that and thus .

Setting yields the statement of the lemma.
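Since the formulas in the lemma were lost, a toy numerical sketch of the truncation may help; all names (`A`, `B`, `tau`) and the specific threshold are my own choices, and the uniform measures are assumed as in the text:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, r = 50, 60, 8

# A random non-negative rank-r factorization M = A @ B.
A = rng.random((n, r))
B = rng.random((r, m))

# Rescale so each row of B averages to 1 (a density w.r.t. the
# uniform measure on the columns), absorbing the scaling into A.
scale = B.mean(axis=1)
B = B / scale[:, None]
A = A * scale[None, :]
M = A @ B

# Truncation: drop the "bad" terms whose column of A has large
# sup-norm (tau is an arbitrary illustrative threshold).
tau = 0.5
good = A.max(axis=0) <= tau
M_trunc = A[:, good] @ B[good, :]

# Dropping non-negative summands only removes mass, so the
# truncated matrix sits below M entrywise, and the discarded
# mass is controlled in the (averaged) L1 norm.
l1_err = np.abs(M - M_trunc).mean()
print(l1_err)
```

This mirrors the proof's structure: rescale so the row factors are densities, discard the summands with large sup-norm, and bound the L1 damage.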

Generally, the ratio will be small compared to (e.g., polynomial in vs. super-polynomial in ). Thus from now on, we will actually prove a lower bound on . One has to verify that the proof is robust enough to allow for the level of error inherent in Lemma 2.

** 1.3. The test functionals **

Now we have a convex body of low “analytic non-negative rank” matrices. Returning to Theorem 1 and the matrix , we will now assume that . Next we identify the proper family of test functionals that highlight the difficulty of factoring the matrix . We will consider the uniform measures on and . We use and to denote averaging with respect to these measures.

Let . From the last post, we know there exists a -locally positive functional such that , and for every -junta .

For with , let us denote . These functionals prove lower bounds on the junta-degree of restricted to various subsets of the coordinates. If we expect that junta-factorizations are the “best” of a given rank, then we have some confidence in choosing the family as our test functions.

** 1.4. Entropy maximization **

Use to write

where and we have and for all , and finally .

First, as we observed last time, if each were a -junta, we would have a contradiction:

because since is -locally positive and the function is a -junta.

So now the key step: Use entropy maximization to approximate by a junta! In future posts, we will need to consider the entire package of functions simultaneously. But for the present lower bound, it suffices to consider each separately.

Consider the following optimization over variables :

The next claim follows immediately from Theorem 1 in this post (solving the max-entropy optimization by sub-gradient descent).

Claim 1. There exists a function satisfying all the preceding constraints and of the form such that

where is some constant depending only on .
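As a hedged illustration of the max-entropy step behind Claim 1, here is a toy dual gradient ascent: maximize entropy subject to matching the expectations of a few test functionals, so the optimizer takes a Gibbs/exponential form in those functionals. Everything here (the functionals `F`, the step size, the iteration count) is my own choice, not the post's:

```python
import numpy as np

rng = np.random.default_rng(1)
N, k = 200, 3                 # sample-space size, number of test functionals

F = rng.random((k, N))        # toy test functionals f_1, ..., f_k
p = rng.random(N)
p /= p.sum()                  # target density whose expectations we match
targets = F @ p

# Maximize entropy subject to E_q[f_j] = targets[j] by gradient
# ascent on the concave dual; the optimizer has the Gibbs form
# q ∝ exp(sum_j lam[j] * f_j), i.e., it depends on the data only
# through the k test functionals.
lam = np.zeros(k)
for _ in range(2000):
    q = np.exp(F.T @ lam)
    q /= q.sum()
    lam += 0.5 * (targets - F @ q)

err = np.abs(F @ q - targets).max()
print(err)
```

The structural point is the one the post needs: the approximator is a function of only `k` functionals, so if each functional depends on few coordinates, the approximator is a junta of bounded (if large) arity.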

Note that depends only on , and thus only depends on as well. Now each only depends on variables (those in and ), meaning that our approximator is an -junta for

Oops. That doesn’t seem very good. The calculation in (3) needs that is a -junta, and certainly (since is a function on ). Nevertheless, note that the approximator is a *non-trivial junta*. For instance, if , then it is an -junta, recalling that is a constant (depending on ).

** 1.5. Random restriction saves the day **

Let’s try to apply the logic of (3) to the approximators anyway. Fix some and let be the set of coordinates on which depends. Then:

Note that the map is a junta on . Thus if , then the contribution from this term is non-negative since is -locally positive. But is fixed and is growing, thus is quite rare! Formally,

In the last estimate, we have used a simple union bound and .
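The union-bound estimate can be sanity-checked exactly; the parameters below (`n` for the ground-set size, `s` for the junta's coordinate set, `k` for the restriction size) are arbitrary stand-ins of mine:

```python
from math import comb

n, s, k = 100, 5, 10   # ground set, fixed junta coordinates, restriction size

# Exact probability that a uniformly random k-subset S of [n]
# intersects a fixed s-subset A.
p_hit = 1 - comb(n - s, k) / comb(n, k)

# Union bound: each of the s coordinates of A lands in S with
# probability k / n.
union_bound = s * k / n

print(p_hit, union_bound)
```

So when the restriction is small relative to the ground set, hitting the junta's coordinates is indeed rare, which is what makes the error term above manageable.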

Putting everything together and summing over , we conclude that

Note that by choosing only moderately large, we will make this error term very small.

Moreover, since each is a feasible point of the optimization (4), we have

Almost there: Now observe that

Plugging this into the preceding line yields

Recalling now (2), we have derived a contradiction to if we can choose the right-hand side to be bigger than (which is a negative constant depending only on ). Setting , we consult (5) to see that

for some other constant depending only on . We thus arrive at a contradiction if , recalling that depend only on . This completes the argument.