Using the notation from the last post, our goal is now to prove the following theorem.
Theorem 1. For every $m \geq 1$ and every $d \geq 1$ smaller than the junta degree of $f$, there is a constant $c_0 = c_0(m,d) > 0$ such that for all $n \geq m$,

$$\mathrm{rank}_+(M_n) \;\geq\; c_0 \left(\frac{n}{\log n}\right)^{d}.$$
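Throughout, we take the objects from the last post in the following form (the conventions assumed here are chosen to match the argument below and may differ cosmetically from the ones used there):

$$f : \{0,1\}^m \to \mathbb{R}_+, \qquad M_n(S,x) := f(x_S) \quad \text{for } S \subseteq [n],\ |S| = m,\ x \in \{0,1\}^n,$$

so that $M_n$ is a non-negative matrix whose rows are indexed by the $m$-subsets $S$ and whose columns are indexed by the points $x$, and the junta degree of $f$ is the smallest $d$ such that $f$ can be written as a non-negative combination of non-negative $d$-juntas.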
1.1. Convex relaxations of non-negative rank
Before getting to the proof, let us discuss the situation in somewhat more generality. Consider finite sets $X$ and $Y$ and a matrix $M \in \mathbb{R}_+^{X \times Y}$ with $\mathrm{rank}_+(M) \leq r$.
In order to use entropy-maximization, we would like to define a convex set of low non-negative rank factorizations (so that maximizing entropy over this set will give us a “simple” factorization). But the convex hull of the set of matrices with non-negative rank at most $r$ is precisely the set of all non-negative matrices.
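To see this concretely (a standard observation): writing $T = \sum_{x,y} N(x,y)$, any non-zero $N \in \mathbb{R}_+^{X \times Y}$ satisfies

$$N \;=\; \sum_{x \in X,\, y \in Y} \frac{N(x,y)}{T}\,\Big( T\, e_x \otimes e_y \Big),$$

a convex combination of matrices of non-negative rank one.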
Instead, let us proceed analytically. For simplicity, let us equip both $X$ and $Y$ with the uniform measure. Let $\mathcal{D}_Y$ denote the set of probability densities on $Y$, i.e., functions $B : Y \to \mathbb{R}_+$ with $\mathbb{E}_y[B(y)] = 1$. Now define, for $N \in \mathbb{R}_+^{X \times Y}$,

$$\alpha(N) \;=\; \min \Big\{ \sum_{i=1}^k \|A_i\|_\infty \;:\; N(x,y) = \sum_{i=1}^k A_i(x)\, B_i(y),\ \ A_i : X \to \mathbb{R}_+,\ \ B_i \in \mathcal{D}_Y \Big\}\,.$$

Here one should think of $A_1, \ldots, A_k$ as the columns of $A$ and $B_1, \ldots, B_k$ as the rows of $B$ in a non-negative factorization $N = AB$. Note that now the number of summands $k$ is unconstrained.
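As a toy illustration of the normalization just introduced, take $X = Y = \{1,2\}$ with uniform measures and $N$ the $2 \times 2$ identity matrix. Then

$$N \;=\; \Big(\tfrac12, 0\Big) \otimes (2,0) \;+\; \Big(0, \tfrac12\Big) \otimes (0,2),$$

where $(2,0)$ and $(0,2)$ are densities (each has average $1$), so $\alpha(N) \leq \tfrac12 + \tfrac12 = 1$.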
Observe that $\alpha$ is a convex function. To see this, given a pair of factorizations $N = \sum_i A_i \otimes B_i$ and $N' = \sum_j A'_j \otimes B'_j$ of the above form and a number $\lambda \in [0,1]$, write

$$\lambda N + (1-\lambda) N' \;=\; \sum_i (\lambda A_i) \otimes B_i \;+\; \sum_j \big((1-\lambda) A'_j\big) \otimes B'_j\,,$$

witnessing the fact that $\alpha\big(\lambda N + (1-\lambda) N'\big) \leq \lambda\, \alpha(N) + (1-\lambda)\, \alpha(N')$.
1.2. A truncation argument
So the sub-level sets of $\alpha$ are convex, but it’s not yet clear how this quantity relates to $\mathrm{rank}_+$. We will see now that low non-negative rank matrices are close to matrices with $\alpha$ small. In standard communication complexity/discrepancy arguments, this corresponds to discarding “small rectangles.”
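The analogy here is the following (stated in the present normalization): a combinatorial rectangle $U \times V \subseteq X \times Y$ corresponds to the rank-one term $\mathbf{1}_U \otimes \tfrac{|Y|}{|V|}\mathbf{1}_V$, whose second factor is a density with

$$\Big\|\tfrac{|Y|}{|V|}\mathbf{1}_V\Big\|_\infty \;=\; \frac{|Y|}{|V|}\,,$$

which is large precisely when the rectangle is small in the $Y$-direction. The truncation below discards exactly the terms whose density factor has large sup-norm.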
In the following lemma, we use the norms $\|N\|_\infty = \max_{x,y} |N(x,y)|$ and $\|N\|_1 = \mathbb{E}_{x,y} |N(x,y)|$, where the expectation is with respect to the uniform measures on $X$ and $Y$.
Lemma 2. For every non-negative $M \in \mathbb{R}_+^{X \times Y}$ with $\mathrm{rank}_+(M) \leq r$ and every $\varepsilon > 0$, there is a matrix $\tilde M \in \mathbb{R}_+^{X \times Y}$ such that

$$\|M - \tilde M\|_1 \;\leq\; \varepsilon\, \|M\|_1$$

and $\tilde M$ admits a factorization $\tilde M = \sum_{i=1}^{r'} A_i \otimes B_i$ with $r' \leq r$, each $B_i \in \mathcal{D}_Y$,

$$\sum_{i=1}^{r'} \|A_i\|_\infty \;\leq\; r\, \|M\|_\infty \qquad \text{and} \qquad \max_i \|B_i\|_\infty \;\leq\; \frac{r\, \|M\|_\infty}{\varepsilon\, \|M\|_1}\,.$$

In particular, $\alpha(\tilde M) \leq r\, \|M\|_\infty$.
Proof: Suppose that $M = AB$ with $A \in \mathbb{R}_+^{X \times [r]}$ and $B \in \mathbb{R}_+^{[r] \times Y}$, and let us interpret this factorization in the form

$$M(x,y) \;=\; \sum_{i=1}^{r} A_i(x)\, B_i(y) \qquad\qquad (1)$$

(where $A_1, \ldots, A_r$ are the columns of $A$ and $B_1, \ldots, B_r$ are the rows of $B$). By rescaling the columns of $A$ and the rows of $B$, respectively, we may assume that $B_i \in \mathcal{D}_Y$ for every $i$ (terms with $B_i \equiv 0$ can simply be dropped).
Let $\Lambda = \{\, i : \|B_i\|_\infty > \tau \,\}$ denote the “bad set” of indices (we will choose $\tau$ momentarily). Observe that if $i \in \Lambda$, then

$$\|A_i\|_\infty \;\leq\; \frac{\|M\|_\infty}{\|B_i\|_\infty} \;\leq\; \frac{\|M\|_\infty}{\tau}$$

from the representation (1) and the fact that all summands are positive: evaluating (1) at a point $y$ where $B_i(y) = \|B_i\|_\infty$ gives $A_i(x)\, \|B_i\|_\infty \leq M(x,y) \leq \|M\|_\infty$ for every $x$.
Define the matrix $\tilde M = \sum_{i \notin \Lambda} A_i \otimes B_i$. It follows that

$$\|M - \tilde M\|_1 \;=\; \Big\| \sum_{i \in \Lambda} A_i \otimes B_i \Big\|_1 \;=\; \sum_{i \in \Lambda} \mathbb{E}_x[A_i]\; \mathbb{E}_y[B_i] \;=\; \sum_{i \in \Lambda} \mathbb{E}_x[A_i]\,.$$

Each of the latter terms is at most $\|A_i\|_\infty \leq \|M\|_\infty/\tau$ and $|\Lambda| \leq r$, thus

$$\|M - \tilde M\|_1 \;\leq\; \frac{r\, \|M\|_\infty}{\tau}\,.$$
Next, observe that for every $i$ and every $x \in X$,

$$A_i(x) \;=\; A_i(x)\, \mathbb{E}_y[B_i(y)] \;=\; \mathbb{E}_y\big[A_i(x)\, B_i(y)\big] \;\leq\; \mathbb{E}_y\big[M(x,y)\big] \;\leq\; \|M\|_\infty\,,$$

implying that $\|A_i\|_\infty \leq \|M\|_\infty$, and thus $\sum_{i \notin \Lambda} \|A_i\|_\infty \leq r\, \|M\|_\infty$.
Setting $\tau = \frac{r\, \|M\|_\infty}{\varepsilon\, \|M\|_1}$ yields the statement of the lemma.
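Here is a minimal numpy sketch of the truncation bookkeeping, assuming the conventions above (uniform measures, averaged $L_1$ norm, rows of $B$ rescaled to densities); it only checks the lemma’s two inequalities on a random factorization.

    import numpy as np

    rng = np.random.default_rng(0)
    nX, nY, r, eps = 50, 64, 8, 0.25

    # A random non-negative factorization M = A @ B, so rank_+(M) <= r.
    A = rng.random((nX, r))
    B = rng.random((r, nY))

    # Rescale each row of B to a density (average 1 under the uniform measure on Y),
    # moving the scale into the corresponding column of A.
    scale = B.mean(axis=1)
    B /= scale[:, None]
    A *= scale[None, :]
    M = A @ B

    sup = lambda N: np.abs(N).max()    # ||N||_inf
    l1  = lambda N: np.abs(N).mean()   # ||N||_1, averaged over X x Y

    # Discard the "bad" indices: terms whose density factor has large sup-norm
    # (the analytic analogue of small rectangles).  For a generic random
    # factorization nothing is discarded; spiky rows of B would be.
    tau  = r * sup(M) / (eps * l1(M))
    good = B.max(axis=1) <= tau
    M_t  = A[:, good] @ B[good, :]

    # The two guarantees of Lemma 2.
    assert l1(M - M_t) <= eps * l1(M) + 1e-12
    assert A[:, good].max(axis=0).sum() <= r * sup(M) + 1e-12
    print("kept", int(good.sum()), "of", r, "terms")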
Generally, the ratio $\|M\|_\infty/\|M\|_1$ will be small compared to the lower bound we are seeking on $\mathrm{rank}_+(M)$ (e.g., polynomial in $n$ vs. super-polynomial in $n$). Thus from now on, we will actually prove a lower bound on the quantity $\alpha$ (for factorizations whose densities additionally satisfy the sup-norm bound of Lemma 2) rather than on $\mathrm{rank}_+$ directly. One has to verify that the proof is robust enough to allow for the level of error inherent in Lemma 2.
1.3. The test functionals
Now we have a convex body of low “analytic non-negative rank” matrices. Returning to Theorem 1 and the matrix $M_n$, we will now assume that $M_n$ itself (ignoring the small error allowed by Lemma 2) admits a factorization $M_n = \sum_i A_i \otimes B_i$ with each $B_i \in \mathcal{D}_Y$, $\max_i \|B_i\|_\infty \leq \tau$, and $\sum_i \|A_i\|_\infty \leq c$. Next we identify the proper family of test functionals that highlight the difficulty of factoring the matrix $M_n$. We will consider the uniform measures on the rows and the columns: on the $m$-subsets $S \subseteq [n]$ and on the points $x \in \{0,1\}^n$. We use $\mathbb{E}_S$ and $\mathbb{E}_x$ to denote averaging with respect to these measures.
Let $d \geq 1$ be smaller than the junta degree of $f$. From the last post, we know there exists a $d$-locally positive functional $\varphi$ such that $\varphi(f) < 0$, and $\varphi(q) \geq 0$ for every non-negative $d$-junta $q : \{0,1\}^m \to \mathbb{R}_+$.
For $S \subseteq [n]$ with $|S| = m$, let us denote by $\varphi_S$ the functional $\varphi$ applied to the coordinates in $S$, i.e., $\varphi_S(g) = \varphi\big(z \mapsto \mathbb{E}_x[g(x) \mid x_S = z]\big)$ for $g : \{0,1\}^n \to \mathbb{R}$. These functionals prove lower bounds on the junta-degree of $f$ applied to various subsets of the coordinates. If we expect that junta-factorizations are the “best” of a given rank, then we have some confidence in choosing the family $\{\varphi_S\}$ as our test functions.
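Concretely (in the convention assumed here), if $\phi : \{0,1\}^m \to \mathbb{R}$ is the function representing $\varphi$, so that $\varphi(q) = \mathbb{E}_z[q(z)\,\phi(z)]$, then

$$\varphi_S(g) \;=\; \mathbb{E}_x\big[g(x)\, \phi(x_S)\big] \qquad \text{for } g : \{0,1\}^n \to \mathbb{R}\,,$$

i.e., each $\varphi_S$ is given by integration against the $m$-junta $\phi_S(x) := \phi(x_S)$. In particular $|\varphi_S(B)| \leq \|\phi\|_\infty$ whenever $B \in \mathcal{D}_Y$.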
1.4. Entropy maximization
Use our assumption on $M_n$ to write

$$M_n(S,x) \;=\; \sum_{i=1}^{k} A_i(S)\, B_i(x)\,, \qquad\qquad (2)$$

where $k$ is some finite number of terms and we have $A_i \geq 0$ and $B_i \in \mathcal{D}_Y$ with $\|B_i\|_\infty \leq \tau$ for all $i$, and finally $\sum_{i=1}^k \|A_i\|_\infty \leq c$.
First, as we observed last time, if each $B_i$ were a $d$-junta, we would have a contradiction:

$$0 \;>\; \varphi(f) \;=\; \mathbb{E}_S\big[\varphi_S\big(M_n(S,\cdot)\big)\big] \;=\; \sum_{i=1}^k \mathbb{E}_S\big[A_i(S)\, \varphi_S(B_i)\big] \;\geq\; 0\,, \qquad\qquad (3)$$

because $\varphi_S(B_i) \geq 0$ for every $S$, since $\varphi$ is $d$-locally positive and the function $z \mapsto \mathbb{E}_x[B_i(x) \mid x_S = z]$ is a $d$-junta.
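Spelling out the middle step (under the conventions above): if $B_i$ depends only on the coordinates in a set $T$ with $|T| \leq d$, then

$$z \;\mapsto\; \mathbb{E}_x\big[B_i(x) \mid x_S = z\big]$$

depends only on $z_{S \cap T}$, hence is a non-negative $d$-junta on $\{0,1\}^S$, and $d$-local positivity of $\varphi$ gives $\varphi_S(B_i) \geq 0$.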
So now the key step: Use entropy maximization to approximate each $B_i$ by a junta! In future posts, we will need to consider the entire package of functions $\{B_i\}$ simultaneously. But for the present lower bound, it suffices to consider each $B_i$ separately.
Consider the following optimization over variables $\tilde B : \{0,1\}^n \to \mathbb{R}_+$, for a parameter $\delta > 0$ to be chosen later:

$$\begin{array}{ll} \text{minimize} & \mathbb{E}_x\big[\tilde B(x) \log \tilde B(x)\big] \\[2pt] \text{subject to} & \varphi_S(\tilde B) \;\leq\; \varphi_S(B_i) + \delta \quad \text{for all } S \subseteq [n],\ |S| = m, \\[2pt] & \mathbb{E}_x[\tilde B(x)] = 1\,. \end{array} \qquad\qquad (4)$$
The next claim follows immediately from Theorem 1 in this post (solving the max-entropy optimization by sub-gradient descent).
Claim 1. There exists a function $\tilde B_i$ satisfying all the preceding constraints and of the form

$$\tilde B_i \;=\; \frac{1}{Z_i} \exp\Big( - \sum_{j=1}^{k_i} \lambda_j\, \phi_{S_j} \Big)$$

for some $\lambda_1, \ldots, \lambda_{k_i} \geq 0$ and $m$-subsets $S_1, \ldots, S_{k_i} \subseteq [n]$, such that

$$k_i \;\leq\; \frac{C \log \|B_i\|_\infty}{\delta^2}\,,$$

where $C$ is some constant depending only on $\|\phi\|_\infty$ (hence only on $m$ and $d$).
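The exponential form should not be surprising; this is just the standard Lagrangian picture, stated in the present notation: maximizing entropy subject to finitely many linear constraints $\mathbb{E}_x[\tilde B\, \phi_{S_j}] \leq b_j$ yields an optimizer of Gibbs form,

$$\tilde B \;\propto\; \exp\Big(-\sum_j \lambda_j\, \phi_{S_j}\Big), \qquad \lambda_j \geq 0\,,$$

and the content of the claim is that only a small number $k_i$ of multipliers need to be non-zero in order to satisfy the constraints approximately.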
Note that $\|\phi\|_\infty$ depends only on $m$ and $d$, and thus $C$ only depends on $m$ and $d$ as well. Now each $\phi_{S_j}$ only depends on $m$ variables (those in $S_j$), and hence $\tilde B_i$ depends only on the variables in $S_1 \cup \cdots \cup S_{k_i}$, meaning that our approximator $\tilde B_i$ is an $\ell_i$-junta for

$$\ell_i \;\leq\; m\, k_i \;\leq\; \frac{C\, m \log \|B_i\|_\infty}{\delta^2} \;\leq\; \frac{C\, m \log \tau}{\delta^2}\,. \qquad\qquad (5)$$
Oops. That doesn’t seem very good. The calculation in (3) needs that $\tilde B_i$ is a $d$-junta, and certainly $\ell_i$ will typically be much larger than $d$ (since $B_i$ is a function on $\{0,1\}^n$, the quantity $\log \|B_i\|_\infty$ appearing in (5) could a priori be as large as $n$). Nevertheless, note that the approximator is a non-trivial junta. For instance, if $\tau \leq \mathrm{poly}(n)$, then it is an $O(\log n)$-junta, recalling that $\delta$ is a constant (depending on $m$ and $d$).
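To make the shape of Claim 1 concrete, here is a small, self-contained numpy sketch (only an illustration; it is not the sub-gradient scheme from the earlier post): it greedily adds one test function $\phi_S$ at a time to an exponential-form density until the one-sided constraints of (4) are nearly satisfied, and then reports how many coordinates the approximator actually touches.

    import itertools
    import numpy as np

    rng = np.random.default_rng(1)
    n, m, delta, eta, iters = 10, 3, 0.05, 0.2, 300

    X = np.array(list(itertools.product([0, 1], repeat=n)), dtype=float)
    subsets = list(itertools.combinations(range(n), m))
    phi = lambda z: (1.0 - z.sum(axis=1)) ** 2                 # a stand-in test function on m bits
    tests = np.stack([phi(X[:, list(S)]) for S in subsets])    # tests[j] = phi_{S_j} on {0,1}^n

    q = rng.random(len(X)) ** 4                                 # an arbitrary target density B_i
    q /= q.mean()

    lam = np.zeros(len(subsets))
    for _ in range(iters):
        p = np.exp(-tests.T @ lam)
        p /= p.mean()                                           # density of the form exp(-sum_j lam_j * phi_{S_j})
        viol = (tests * p).mean(axis=1) - (tests * q).mean(axis=1)
        j = int(np.argmax(viol))
        if viol[j] <= delta:                                    # all constraints phi_S(p) <= phi_S(q) + delta hold
            break
        lam[j] += eta                                           # push down the most violated constraint

    used = [S for S, l in zip(subsets, lam) if l > 0]
    coords = set().union(*used) if used else set()
    print("test functions used:", len(used), "| junta size:", len(coords),
          "| max violation:", float(viol.max()))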
1.5. Random restriction saves the day
Let’s try to apply the logic of (3) to the approximators $\{\tilde B_i\}$ anyway. Fix some index $i$ and let $J_i \subseteq [n]$ be the set of coordinates on which $\tilde B_i$ depends. Then:

$$\mathbb{E}_S\big[A_i(S)\, \varphi_S(\tilde B_i)\big] \;=\; \mathbb{E}_S\Big[A_i(S)\, \varphi_S(\tilde B_i)\, \mathbf{1}_{\{|S \cap J_i| \leq d\}}\Big] \;+\; \mathbb{E}_S\Big[A_i(S)\, \varphi_S(\tilde B_i)\, \mathbf{1}_{\{|S \cap J_i| > d\}}\Big]\,.$$
Note that the map $z \mapsto \mathbb{E}_x[\tilde B_i(x) \mid x_S = z]$ is a junta on the coordinates $S \cap J_i$. Thus if $|S \cap J_i| \leq d$, then the contribution from this term is non-negative since $\varphi$ is $d$-locally positive. But $|J_i| \leq \ell_i$ is fixed and $n$ is growing, thus the event $|S \cap J_i| > d$ is quite rare! Formally,

$$\Pr_S\big[\,|S \cap J_i| > d\,\big] \;\leq\; \binom{\ell_i}{d+1} \Big(\frac{m}{n}\Big)^{d+1} \;\leq\; \Big(\frac{m\, \ell_i}{n}\Big)^{d+1}\,.$$

In the last estimate, we have used a simple union bound and $\binom{\ell_i}{d+1} \leq \ell_i^{\,d+1}$.
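To spell out the union bound: for any fixed set of $d+1$ coordinates, the probability that a uniformly random $m$-subset $S \subseteq [n]$ contains all of them is

$$\frac{\binom{n-d-1}{m-d-1}}{\binom{n}{m}} \;=\; \frac{m (m-1) \cdots (m-d)}{n (n-1) \cdots (n-d)} \;\leq\; \Big(\frac{m}{n}\Big)^{d+1}\,,$$

and there are at most $\binom{\ell_i}{d+1}$ ways to choose $d+1$ coordinates of $J_i$ that $S$ could contain.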
Putting everything together and summing over $i$, we conclude that

$$\sum_{i=1}^k \mathbb{E}_S\big[A_i(S)\, \varphi_S(\tilde B_i)\big] \;\geq\; -\, \|\phi\|_\infty \Big(\frac{m\, \ell}{n}\Big)^{d+1} \sum_{i=1}^k \|A_i\|_\infty \;\geq\; -\, \|\phi\|_\infty\, c\, \Big(\frac{m\, \ell}{n}\Big)^{d+1}\,,$$

where $\ell := \max_i \ell_i$ and we have used $|\varphi_S(\tilde B_i)| \leq \|\phi\|_\infty$. Note that by choosing $n$ only moderately large (compared to $m\,\ell$), we will make this error term very small.
Moreover, since each $B_i$ is a feasible point of the optimization (4) and $\tilde B_i$ satisfies its constraints, we have

$$\varphi_S(B_i) \;\geq\; \varphi_S(\tilde B_i) - \delta \qquad \text{for every admissible } S\,.$$

Almost there: Now observe that

$$\sum_{i=1}^k \mathbb{E}_S\big[A_i(S)\big] \;=\; \mathbb{E}_S\, \mathbb{E}_x\big[M_n(S,x)\big] \;=\; \|M_n\|_1\,.$$

Plugging this into the preceding line yields

$$\varphi(f) \;=\; \sum_{i=1}^k \mathbb{E}_S\big[A_i(S)\, \varphi_S(B_i)\big] \;\geq\; \sum_{i=1}^k \mathbb{E}_S\big[A_i(S)\, \varphi_S(\tilde B_i)\big] \;-\; \delta\, \|M_n\|_1 \;\geq\; -\, \|\phi\|_\infty\, c\, \Big(\frac{m\, \ell}{n}\Big)^{d+1} \;-\; \delta\, \|M_n\|_1\,.$$
Recalling now (2), we have derived a contradiction to the assumed factorization if we can choose the right-hand side to be bigger than $\varphi(f)$ (which is a negative constant depending only on $m$ and $d$). Setting $\delta = \frac{|\varphi(f)|}{2\, \|M_n\|_1}$, we consult (5) to see that $\ell \leq C' \log \tau$ for some other constant $C'$ depending only on $m$ and $d$. We thus arrive at a contradiction if

$$\|\phi\|_\infty\, c\, \Big(\frac{m\, C' \log \tau}{n}\Big)^{d+1} \;<\; \frac{|\varphi(f)|}{2}\,,$$

recalling that $\|\phi\|_\infty$, $\varphi(f)$, and $\|M_n\|_1$ depend only on $m$ and $d$. This completes the argument.
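For the record, here is how the contradiction translates back into Theorem 1 under the conventions assumed above (the constants are not optimized). Lemma 2 applied with a fixed small $\varepsilon$ gives $c \leq r\, \|M_n\|_\infty$ and $\log \tau \leq \log r + O_{m,d}(1)$, where $r = \mathrm{rank}_+(M_n)$, so the displayed condition can fail only if

$$r \;\geq\; \Omega_{m,d}\!\left( \Big(\frac{n}{\log \tau}\Big)^{d+1} \right),$$

and since $\log \tau = \log r + O_{m,d}(1)$, a short case analysis (either $r$ is already enormous, or $\log \tau = O_{m,d}(\log n)$) gives $r \geq c_0 \big(\tfrac{n}{\log n}\big)^{d}$ for a suitable constant $c_0 = c_0(m,d) > 0$.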