Using the notation from the last post, our goal is now to prove the following theorem.
Theorem 1 For every
and
, there is a constant
such that for all
,
1.1. Convex relaxations of non-negative rank
Before getting to the proof, let us discuss the situation in somewhat more generality. Consider finite sets and
and a matrix
with
.
In order to use entropy-maximization, we would like to define a convex set of low non-negative rank factorizations (so that maximizing entropy over this set will give us a “simple” factorization). But the convex hull of is precisely the set of all non-negative matrices.
Instead, let us proceed analytically. For simplicity, let us equip both and
with the uniform measure. Let
denote the set of probability densities on
. Now define
Here are the columns of
and
are the rows of
. Note that now
is unconstrained.
Observe that is a convex function. To see this, given a pair
and
, write
witnessing the fact that .
1.2. A truncation argument
So the set is convex, but it’s not yet clear how this relates to
. We will see now that low non-negative rank matrices are close to matrices with
small. In standard communication complexity/discrepancy arguments, this corresponds to discarding “small rectangles.”
In the following lemma, we use the norms and
.
Lemma 2 For every non-negative
with
and every
, there is a matrix
such that
and
Proof: Suppose that with
, and let us interpret this factorization in the form
(where are the columns of
and
are the rows of
). By rescaling the columns of
and the rows of
, respectively, we may assume that
for every
.
Let denote the “bad set” of indices (we will choose
momentarily). Observe that if
, then
from the representation (1) and the fact that all summands are positive.
Define the matrix . It follows that
Each of the latter terms is at most and
, thus
Next, observe that
implying that and thus
.
Setting yields the statement of the lemma.
Generally, the ratio will be small compared to
(e.g., polynomial in
vs. super-polynomial in
). Thus from now on, we will actually prove a lower bound on
. One has to verify that the proof is robust enough to allow for the level of error inherent in Lemma 2.
1.3. The test functionals
Now we have a convex body of low “analytic non-negative rank” matrices. Returning to Theorem 1 and the matrix , we will now assume that
. Next we identify the proper family of test functionals that highlight the difficulty of factoring the matrix
. We will consider the uniform measures on
and
. We use
and
to denote averaging with respect to these measures.
Let . From the last post, we know there exists a
-locally positive functional
such that
, and
for every
-junta
.
For with
, let us denote
. These functionals prove lower bounds on the junta-degree of
restricted to various subsets of the coordinates. If we expect that junta-factorizations are the “best” of a given rank, then we have some confidence in choosing the family
as our test functions.
1.4. Entropy maximization
Use to write
where and we have
and
for all
, and finally
.
First, as we observed last time, if each were a
-junta, we would have a contradiction:
because since
is
-locally positive and the function
is a
-junta.
So now the key step: Use entropy maximization to approximate by a junta! In future posts, we will need to consider the entire package
of functions simultaneously. But for the present lower bound, it suffices to consider each
separately.
Consider the following optimization over variables :
The next claim follows immediately from Theorem 1 in this post (solving the max-entropy optimization by sub-gradient descent).
Claim 1 There exists a function
satisfying all the preceding constraints and of the form
such that
where
is some constant depending only on
.
Note that depends only on
, and thus
only depends on
as well. Now each
only depends on
variables (those in
and
), meaning that our approximator
is an
-junta for
Oops. That doesn’t seem very good. The calculation in (3) needs that is a
-junta, and certainly
(since
is a function on
). Nevertheless, note that the approximator is a non-trivial junta. For instance, if
, then it is an
-junta, recalling that
is a constant (depending on
).
1.5. Random restriction saves the day
Let’s try to apply the logic of (3) to the approximators anyway. Fix some
and let
be the set of coordinates on which
depends. Then:
Note that the map is a junta on
. Thus if
, then the contribution from this term is non-negative since
is
-locally positive. But
is fixed and
is growing, thus
is quite rare! Formally,
In the last estimate, we have used a simple union bound and .
Putting everything together and summing over , we conclude that
Note that by choosing only moderately large, we will make this error term very small.
Moreover, since each is a feasible point of the optimization (4), we have
Almost there: Now observe that
Plugging this into the preceding line yields
Recalling now (2), we have derived a contradiction to if we can choose the right-hand side to be bigger than
(which is a negative constant depending only on
). Setting
, we consult (5) to see that
for some other constant depending only on
. We thus arrive at a contradiction if
, recalling that
depend only on
. This completes the argument.
I’m still digesting this post, but it’s really nice! The entropy minimization argument seems to be used just to make a claim of the form that if the only thing you care about is to preserve correlations between a bounded function B and all (bounded) m-juntas you may as well assume the function B itself only depends on “few” coordinates – not much more than m times log(||B||_\infty). This is somewhat intuitive since you only want to preserve correlations up to an additive error, and a bounded function could possibly correlate highly with many juntas on disjoint coordinates.
A couple comments:
– In terms of the definition of \alpha_+, I can see loosely the connection with a factorization norm but I couldn’t formulate it precisely as one either. In that respect it might be more natural to have \alpha be defined as the infimum, over all decompositions, of the product of the two quantities that it is currently assumed to bound (the largest row norm of B and the sum of the column norms of A). In fact in the proof of Lemma 2 you end up with a bound that is not uniform, you have a much better bond on the A part (rM) than on the B part (rM/delta).
– A couple lines before 1.2, “write write”.
– In lemma 2 it might be helpful to clarify that the 1- and infinity-norms over M are element-wise (rather than the Schatten norms), as this confused me at first. Also, this is the first time so many norms show up, and it’s not clear at first why e.g. the distance between M and \tilde{M} are measured in 1-norm, but the bound on \alpha_+ involves the infinity-norm. Is there some good intuition for this choice of normalizations? The different norms show up repeatedly in the argument and I had trouble tracking down which bounds were critical and which were more loose. (Feel free to ignore this question, it’s a bit unfair…)
– Right after the proof of Lemma 2, did you mean the inverse ratio, ||M||_1/||M||_\infty? The one you wrote is <<1, no?
Hi Thomas, thanks for the comments!
I fixed the typos and added an explanation of the norms. It should also address your last comment; all the norms are supposed to be averages, so |M|_\infty/|M|_1 makes sense.
Your two more substantial comments are related. I agree that
should involve a product of the
norm of B and the
norm of A. I went back and forth on the correct way to write this. Writing as a product introduced yet another rescaling in the max-entropy argument. As you say, it creates an unnatural imbalance in the truncation argument.
On the other hand, the imbalance would still persist: The eventual dependence on
is linear, while the dependence on
only comes in only through the entropy and is logarithmic.
I am going to pause on writing it correctly until I write the quantum version. There things are more subtle (e.g., there is no canonical decomposition into B_i’s so they cannot be individually normalized in the L^1 norm), so hopefully the correct normalization will be more apparent.
As for the way error is measured: We are trying to separate
from the convex set of matrices with small
value. So we are going to use a linear functional which, in this case is
The functional is bounded in
norm (simply because
is fixed), so we need our approximator
to be good in the
norm. The idea is that the error from approximation should be smaller than the “fatness” of the separating functional (which in the post is related to the value
).
Since we care about
error, the correct way to write the bound in Lemma 2 is
, i.e. with relative error. If you do that, you see that actually the
bound on
should be written as
. This is great, as now the statement of the lemma is scale invariant.
Finally, the reason that the ratio
has to arise somewhere is in the heart of the truncation argument. Specifically, if B_i is big somewhere, then A_i must be small everywhere as long as M is uniformly bounded. This allows us to get rid of the large B_i’s and is an essential use of the non-negativity of the factorization.
In any case, you make good points, and I struggled to address them without complicating the (very simple) truncation argument any further. I wonder even if introducing
is a good idea. Maybe Lemma 2 should just say what kind of approximator exists and dispense with
. I wanted to have it there for analogy since In the quantum setting, the analogous “factorization (not a) norm” is more important and subtle.
Yes, this is exactly right.
And, in fact, when you really have this explicit structure, there are other ways at arriving at this approximation. We could build a junta as follows. Suppose B is a distribution on {0,1}^n with very high entropy, e.g. n-t (so t = relative entropy). Find a junta approximation as follows: Initially, the junta set J is empty. Now if B has a large low-degree Fourier coefficient S that is not completely contained in J, then add all the coordinates in S to J. Repeat until no such coefficients remain.
Since B has n-t bits of entropy, this cannot go on for too long. Every time we find a new set S on which B is biased, our estimate of its entropy decreases. At the end, we have constructed an approximation B’ that is a J-junta. It’s an approximation in the sense that B-B’ has no large low-degree Fourier coefficients.
Now do the random restriction to a subset R. As in the current argument, B’|_R will likely have only few junta coordinates. At the same time, the error B-B’|_R will shrink dramatically because a high-degree Fourier coefficient will likely not survive the restriction (all its coordinates will have to fall in R).
This is the argument from our FOCS 2013 paper.
The disadvantage is that (a) it’s more complicated, and (b) it requires looking very carefully at the kind of structure you want the approximator to have. (Notice, for instance, that the entropy-minimization argument does not need to refer to Fourier coefficients, and can potentially work just as well over other domains like the set of traveling salesman tours in the complete graph without getting into representation theory of the symmetric group!)
David and Prasad and I spent well over a year trying to figure out the analogous structural characterization for the SDP setting. In some sense, we were forced to discover the (quantum) entropy-minimization approach because we couldn’t get a handle on the right structure. We had very complicated arguments for special cases.
In the LP setting, one has r functions B_1, …, B_r. In the SDP setting, one has an r-dimensional subspace V of functions and you have to control p^2 for every
simultaneously. How do you approximate a subspace by a simpler one? and how does random restriction make the subspace nicer? (It’s a tricky question, and choosing a basis for V is clearly a horrible idea–since it breaks invariance–and it doesn’t seem to work.)
The entropy-minimization philosophy is: Pick your tests. Now if:
(i) The tests are simple.
(ii) The tests are smooth (the L^\infty norm is small)
(iii) You allow some error \eps
Then the entropy-optimal approximator will also be “simple” (as it is a simple combination of simple things).
It’s nice because the tests (which often come up front for a natural reason) dictate the structure. As we’ll see soon, this also works in the SDP/quantum setting.
Thanks for the clarifications, they’re helpful! I’m wondering if indeed it wouldn’t be easier to present things in a different order, with Lemma 2 coming at the end:
– You start with any low-rank factorization of M, with the normalization ||B_i||_1=1 wlog. (From which the bound ||A_i||_\infty\leq ||M||_\infty follows immediately and can be used as needed).
– Then you observe that the entropy minimization argument, together with the random restriction step, gives you an upper bound on the degree of g that depends on the entropy Ent(B_i). It is clear why this comes in: it is the basis for the entropy argument.
– Finally you need to deal with the fact that Ent(B_i) could be large. For this it is natural to treat high-entropy rows as being rare and try to argue that, for the purposes of the overall argument, they cannot do too much harm. Your explanations on getting a L1-approximant then make it more or less clear how Lemma 2 should come in, and you end up removing large-L_infty columns and using duality to bound the effect on M in L1.
Ok, I’m not suggesting this is how it should be done; it’s mostly for my own benefit. You’re probably right that the “quantum” setting will help clarify, by seeing the argument once more in a setting where perhaps fewer types of approximations are possible. It’s like trying to understand duality by working with L2, it’s easy to get it the wrong way round :-)
(Minor typo: you might be missing an A_i(S) on the rhs of the first centered equation in section 1.5)