Using the notation from the last post, our goal is now to prove the following theorem.
1.1. Convex relaxations of non-negative rank
Before getting to the proof, let us discuss the situation in somewhat more generality. Consider finite sets and and a matrix with .
In order to use entropy-maximization, we would like to define a convex set of low non-negative rank factorizations (so that maximizing entropy over this set will give us a “simple” factorization). But the convex hull of is precisely the set of all non-negative matrices.
Instead, let us proceed analytically. For simplicity, let us equip both and with the uniform measure. Let denote the set of probability densities on . Now define
Here are the columns of and are the rows of . Note that now is unconstrained.
Observe that is a convex function. To see this, given a pair and , write
witnessing the fact that .
1.2. A truncation argument
So the set is convex, but it’s not yet clear how this relates to . We will see now that low non-negative rank matrices are close to matrices with small. In standard communication complexity/discrepancy arguments, this corresponds to discarding “small rectangles.”
In the following lemma, we use the norms and .
(where are the columns of and are the rows of ). By rescaling the columns of and the rows of , respectively, we may assume that for every .
Let denote the “bad set” of indices (we will choose momentarily). Observe that if , then
from the representation (1) and the fact that all summands are positive.
Define the matrix . It follows that
Each of the latter terms is at most and , thus
Next, observe that
implying that and thus .
Setting yields the statement of the lemma.
Generally, the ratio will be small compared to (e.g., polynomial in vs. super-polynomial in ). Thus from now on, we will actually prove a lower bound on . One has to verify that the proof is robust enough to allow for the level of error inherent in Lemma 2.
1.3. The test functionals
Now we have a convex body of low “analytic non-negative rank” matrices. Returning to Theorem 1 and the matrix , we will now assume that . Next we identify the proper family of test functionals that highlight the difficulty of factoring the matrix . We will consider the uniform measures on and . We use and to denote averaging with respect to these measures.
Let . From the last post, we know there exists a -locally positive functional such that , and for every -junta .
For with , let us denote . These functionals prove lower bounds on the junta-degree of restricted to various subsets of the coordinates. If we expect that junta-factorizations are the “best” of a given rank, then we have some confidence in choosing the family as our test functions.
1.4. Entropy maximization
Use to write
where and we have and for all , and finally .
because since is -locally positive and the function is a -junta.
So now the key step: Use entropy maximization to approximate by a junta! In future posts, we will need to consider the entire package of functions simultaneously. But for the present lower bound, it suffices to consider each separately.
Consider the following optimization over variables :
The next claim follows immediately from Theorem 1 in this post (solving the max-entropy optimization by sub-gradient descent).
Claim 1 There exists a function satisfying all the preceding constraints and of the form
where is some constant depending only on .
Oops. That doesn’t seem very good. The calculation in (3) needs that is a -junta, and certainly (since is a function on ). Nevertheless, note that the approximator is a non-trivial junta. For instance, if , then it is an -junta, recalling that is a constant (depending on ).
1.5. Random restriction saves the day
Let’s try to apply the logic of (3) to the approximators anyway. Fix some and let be the set of coordinates on which depends. Then:
Note that the map is a junta on . Thus if , then the contribution from this term is non-negative since is -locally positive. But is fixed and is growing, thus is quite rare! Formally,
In the last estimate, we have used a simple union bound and .
Putting everything together and summing over , we conclude that
Note that by choosing only moderately large, we will make this error term very small.
Moreover, since each is a feasible point of the optimization (4), we have
Almost there: Now observe that
Plugging this into the preceding line yields
for some other constant depending only on . We thus arrive at a contradiction if , recalling that depend only on . This completes the argument.