Entropy optimality: Non-negative rank lower bounds

Using the notation from the last post, our goal is now to prove the following theorem.

Theorem 1 For every ${m \geq 1}$ and ${g : \{0,1\}^m \rightarrow \mathbb R_+}$, there is a constant ${C=C(g)}$ such that for all ${n \geq 2m}$,

$\displaystyle \mathrm{rank}_+(M_n^g) \geq C \left(\frac{n}{\log n}\right)^{\mathrm{deg}_J(g)-1}\,.$

1.1. Convex relaxations of non-negative rank

Before getting to the proof, let us discuss the situation in somewhat more generality. Consider finite sets ${X}$ and ${Y}$ and a matrix ${M : X \times Y \rightarrow \mathbb R_+}$ with ${r=\mathrm{rank}_+(M)}$.

In order to use entropy-maximization, we would like to define a convex set of low non-negative rank factorizations (so that maximizing entropy over this set will give us a “simple” factorization). But the convex hull of ${\{ N \in \mathbb R_+^{X \times Y} : \mathrm{rank}_+(N) = 1 \}}$ is precisely the set of all non-negative matrices.

Instead, let us proceed analytically. For simplicity, let us equip both ${X}$ and ${Y}$ with the uniform measure. Let ${\mathcal Q = \{ b : Y \rightarrow \mathbb R_+ \mid \|b\|_1 = 1\}}$ denote the set of probability densities on ${Y}$. Now define

$\displaystyle \alpha_+(N) = \min \Big\{ \alpha : \exists A \in \mathbb R_+^{X \times k}, B \in \mathbb R_+^{k \times Y} \textrm{ with } N=AB, \textrm{ and}$

$\displaystyle \qquad\qquad\qquad\qquad\qquad\{B_1, \ldots, B_k\} \subseteq \mathcal Q, \textrm{ and }$

$\displaystyle \hphantom{\{B_1, \ldots, B_k\} \subseteq \mathcal Q,} \max_{i \in [k]} \|B_i\|_{\infty} \leq \alpha, \sum_{i=1}^k \|A^{(i)}\|_{\infty} \leq \alpha\Big\}.$

Here ${\{A^{(i)}\}}$ are the columns of ${A}$ and ${\{B_i\}}$ are the rows of ${B}$. Note that now ${k}$ is unconstrained.

Observe that ${\alpha_+}$ is a convex function. To see this, given a pair ${N=AB}$ and ${N'=A'B'}$, write

$\displaystyle \frac{N+N'}{2} = \left(\begin{array}{cc} \frac12 A & \frac12 A' \end{array}\right) \left(\begin{array}{l} \vphantom{\bigoplus} B \\ \vphantom{\bigoplus} B' \end{array}\right)\,,$

witnessing the fact that ${\alpha_+(\frac12(N+N')) \leq \frac12 \alpha_+(N) + \frac12 \alpha_+(N')}$.

1.2. A truncation argument

So the set ${\{ N : \alpha_+(N) \leq c \}}$ is convex, but it’s not yet clear how this relates to ${\mathrm{rank}_+}$. We will see now that low non-negative rank matrices are close to matrices with ${\alpha_+}$ small. In standard communication complexity/discrepancy arguments, this corresponds to discarding “small rectangles.”

In the following lemma, we use the norms ${\|M\|_1 = \mathbb E_{x,y} |M_{x,y}|}$ and ${\|M\|_{\infty} = \max_{x,y} |M_{x,y}|}$.

Lemma 2 For every non-negative ${M \in \mathbb R_+^{X \times Y}}$ with ${\mathrm{rank}_+(M) \leq r}$ and every ${\delta \in (0,1)}$, there is a matrix ${\tilde M \in \mathbb R_+^{X \times Y}}$ such that

$\displaystyle \|M-\tilde M\|_1 \leq \delta$

and

$\displaystyle \alpha_+(\tilde M) \leq \frac{r}{\delta} \|M\|_{\infty}\,.$

Proof: Suppose that ${M=AB}$ with ${A \in \mathbb R_+^{X \times r}, B \in \mathbb R_+^{r \times Y}}$, and let us interpret this factorization in the form

$\displaystyle M(x,y) = \sum_{i=1}^r A_i(x) B_i(y) \ \ \ \ \ (1)$

(where ${\{A_i\}}$ are the columns of ${A}$ and ${\{B_i\}}$ are the rows of ${B}$). By rescaling the columns of ${A}$ and the rows of ${B}$, respectively, we may assume that ${\mathop{\mathbb E}_{\mu}[B_i]=1}$ for every ${i \in [r]}$.

Let ${\Lambda = \{ i : \|B_i\|_{\infty} > \tau \}}$ denote the “bad set” of indices (we will choose ${\tau}$ momentarily). Observe that if ${i \in \Lambda}$, then

$\displaystyle \|A_i\|_{\infty} \leq \frac{\|M\|_{\infty}}{\tau}\,,$

from the representation (1) and the fact that all summands are positive.

Define the matrix ${\tilde M(x,y) = \sum_{i \notin \Lambda} A_i(x) B_i(y)}$. It follows that

$\displaystyle \|M-\tilde M\|_1 = \mathop{\mathbb E}_{x,y} \left[|M(x,y)-\tilde M(x,y)|\right] = \sum_{i \in \Lambda} \mathop{\mathbb E}_{x,y} [A_i(x) B_i(y)]\,.$

Each of the latter terms is at most ${\|A_i\|_{\infty} \|B_i\|_1 \leq \frac{\|M\|_{\infty}}{\tau}}$ and ${|\Lambda| \leq r}$, thus

$\displaystyle \|M-\tilde M\|_1 \leq r \frac{\|M\|_{\infty}}{\tau}\,.$

Next, observe that

$\displaystyle \mathop{\mathbb E}_y [M(x,y)] = \sum_{i=1}^r A_i(x) \|B_i\|_1 = \sum_{i=1}^r A_i(x)\,,$

implying that ${\|A_i\|_{\infty} \leq \|M\|_{\infty}}$ and thus ${\sum_{i=1}^r \|A_i\|_{\infty} \leq r \|M\|_{\infty}}$.

Setting ${\tau = r \|M\|_{\infty}/\delta}$ yields the statement of the lemma. $\Box$

Generally, the ratio ${\frac{\|M\|_{\infty}}{\|M\|_1}}$ will be small compared to ${r}$ (e.g., polynomial in ${n}$ vs. super-polynomial in ${n}$). Thus from now on, we will actually prove a lower bound on ${\alpha_+(M)}$. One has to verify that the proof is robust enough to allow for the level of error inherent in Lemma 2.

1.3. The test functionals

Now we have a convex body of low “analytic non-negative rank” matrices. Returning to Theorem 1 and the matrix ${M_n^g}$, we will now assume that ${\alpha_+(M_n^g) \leq \alpha}$. Next we identify the proper family of test functionals that highlight the difficulty of factoring the matrix ${M_n^g}$. We will consider the uniform measures on ${{[n] \choose m}}$ and ${\{0,1\}^n}$. We use ${\mathop{\mathbb E}_S}$ and ${\mathop{\mathbb E}_x}$ to denote averaging with respect to these measures.

Let ${d=\deg_J(g)-1}$. From the last post, we know there exists a ${d}$-locally positive functional ${\varphi : \{0,1\}^n \rightarrow \mathbb R}$ such that ${\beta := \mathop{\mathbb E}_x \varphi(x) g(x) < 0}$, and ${\mathop{\mathbb E}_x \varphi(x) q(x) \geq 0}$ for every ${d}$-junta ${q}$.

For ${S \subseteq [n]}$ with ${|S|=m}$, let us denote ${\varphi_S(x) = \varphi(x|_S)}$. These functionals prove lower bounds on the junta-degree of ${g}$ restricted to various subsets of the coordinates. If we expect that junta-factorizations are the “best” of a given rank, then we have some confidence in choosing the family ${\{\varphi_S\}}$ as our test functions.

1.4. Entropy maximization

Use ${\alpha_+(M_n^g) \leq \alpha}$ to write

$\displaystyle M_n^g(S,x) = \sum_{i=1}^k A_i(S) B_i(x)\,,$

where ${A_i, B_i \geq 0}$ and we have ${\|B_i\|_1=1}$ and ${\|B_i\|_{\infty} \leq \alpha}$ for all ${i \in [k]}$, and finally ${\sum_{i=1}^k \|A_i\|_{\infty} \leq \alpha}$.

First, as we observed last time, if each ${B_i}$ were a ${d}$-junta, we would have a contradiction:

$\displaystyle \mathop{\mathbb E}_{S,x} \left[\varphi_S(x) M_n^g(S,x)\right] = \mathop{\mathbb E}_{y \in \{0,1\}^m} \varphi(y) g(y) = \beta < 0\,, \ \ \ \ \ (2)$

and yet

$\displaystyle \mathop{\mathbb E}_{S,x} \left[\varphi_S(x) \sum_{i=1}^k A_i(S) B_i(x)]\right] = \sum_{i=1}^k \mathop{\mathbb E}_S A_i(S) \mathop{\mathbb E}_x \left[\varphi_S(x) B_i(x)\right] \geq 0 \ \ \ \ \ (3)$

because ${\mathop{\mathbb E}_x \left[\varphi_S(x) B_i(x)\right] = \mathop{\mathbb E}_{y \in \{0,1\}^S} \varphi(y) \mathop{\mathbb E}_{x} \left[B_i(x) \,\big|\, x|_S =y\right] \geq 0}$ since ${\varphi}$ is ${d}$-locally positive and the function ${y \mapsto \mathop{\mathbb E}_{x} \left[B_i(x) \,\big|\, x|_S =y\right]}$ is a ${d}$-junta.

So now the key step: Use entropy maximization to approximate ${B_i}$ by a junta! In future posts, we will need to consider the entire package ${\{B_1, \ldots, B_k\}}$ of functions simultaneously. But for the present lower bound, it suffices to consider each ${B_i}$ separately.

Consider the following optimization over variables ${\{\tilde B_i(x) : x \in \{0,1\}^n\}}$:

$\displaystyle \mathrm{minimize }\qquad \mathrm{Ent}(\tilde B_i) = \mathbb E [\tilde B_i \log \tilde B_i]$

$\displaystyle \qquad\textrm{ subject to } \qquad \qquad\mathbb E[\tilde B_i]=1, \quad \tilde B_i(x) \geq 0 \quad \forall x \in \{0,1\}^n$

$\displaystyle \hphantom{\bigoplus} \qquad \qquad \qquad \qquad \mathop{\mathbb E}_x \left[\varphi_S(x) \tilde B_i(x)\right] \leq \mathop{\mathbb E}_x \left[\varphi_S(x) B_i(x)\right] + \varepsilon \qquad \forall |S|=m\,. \ \ \ \ \ (4)$

The next claim follows immediately from Theorem 1 in this post (solving the max-entropy optimization by sub-gradient descent).

Claim 1 There exists a function ${\tilde B_i : \{0,1\}^n \rightarrow \mathbb R_+^n}$ satisfying all the preceding constraints and of the form

$\displaystyle \tilde B_i = \frac{\exp\left(\sum_{j=1}^{M} \lambda_j \varphi_{S_j}\right)}{\mathop{\mathbb E} \exp\left(\sum_{i=1}^{M} \lambda_j \varphi_{S_j}\right)}$

such that

$\displaystyle M \leq C(g) \frac{\log \alpha}{\varepsilon^2}\,,$

where ${C(g)}$ is some constant depending only on ${g}$.

Note that ${\varphi}$ depends only on ${g}$, and thus ${\|\varphi_S\|_{\infty}}$ only depends on ${g}$ as well. Now each ${\varphi_{S_j}}$ only depends on ${m}$ variables (those in ${S_j}$ and ${|S_j|=m}$), meaning that our approximator ${\tilde B_i}$ is an ${h}$-junta for

$\displaystyle h \leq m \cdot C(g) \frac{\log \alpha}{\varepsilon^2}\,. \ \ \ \ \ (5)$

Oops. That doesn’t seem very good. The calculation in (3) needs that ${\tilde B_i}$ is a ${d}$-junta, and certainly ${d < m}$ (since ${g}$ is a function on ${\{0,1\}^m}$). Nevertheless, note that the approximator is a non-trivial junta. For instance, if ${\alpha \leq 2^{\sqrt{n}}}$, then it is an $O(\sqrt{n})$-junta, recalling that $m$ is a constant (depending on ${g}$).

1.5. Random restriction saves the day

Let’s try to apply the logic of (3) to the ${\tilde B_i}$ approximators anyway. Fix some ${i \in [k]}$ and let ${J_i}$ be the set of coordinates on which ${\tilde B_i}$ depends. Then:

$\displaystyle \mathop{\mathbb E}_{S,x} \left[\varphi_S(x) A_i(S) \tilde B_i(x)\right] = \mathop{\mathbb E}_S\,A_i(S) \mathop{\mathbb E}_{y \in \{0,1\}^S} \varphi(y) \mathop{\mathbb E}_{x \in \{0,1\}^n} \left[B_i(x) \,\big|\, x|_S=y\right]$

Note that the map ${y \mapsto \mathop{\mathbb E}_{x \in \{0,1\}^n} \left[B_i(x) \,\big|\, x|_S=y\right]}$ is a junta on ${J_i \cap S}$. Thus if ${|J_i \cap S| \leq d}$, then the contribution from this term is non-negative since ${\varphi}$ is ${d}$-locally positive. But ${|S|=m}$ is fixed and ${n}$ is growing, thus ${|J_i \cap S| > d}$ is quite rare! Formally,

$\displaystyle \mathop{\mathbb E}_{S,x} \left[\varphi_S(x) A_i(S) \tilde B_i(x)\right] \geq - \|A_i\|_{\infty}\, \mathbb P_S[|J_i \cap S| > d] \geq - \|A_i\|_{\infty} \frac{h^d (2m)^d}{n^d} \,.$

In the last estimate, we have used a simple union bound and ${n \geq 2m}$.

Putting everything together and summing over ${i \in [k]}$, we conclude that

$\displaystyle \sum_{i=1}^k \mathop{\mathbb E}_{S,x} \left[\varphi_S(x) A_i(S) \tilde B_i(x)\right] \geq -\alpha \frac{h^d (2m)^d}{n^d}\,.$

Note that by choosing ${n}$ only moderately large, we will make this error term very small.

Moreover, since each ${\tilde B_i}$ is a feasible point of the optimization (4), we have

$\displaystyle \sum_{i=1}^k \mathop{\mathbb E}_{S,x} \left[\varphi_S(x) A_i(S) B_i(x)\right] = \sum_{i=1}^k \mathop{\mathbb E}_S A_i(S) \mathop{\mathbb E}_x \left[\varphi_S(x) B_i(x)\right]$

$\displaystyle \hphantom{\sum_{i=1}^k \mathop{\mathbb E}_{S,x} \left[\varphi_S(x) A_i(S) B_i(x)\right]} \geq \sum_{i=1}^k \mathop{\mathbb E}_S A_i(S) \left(-\varepsilon + \mathop{\mathbb E}_x \left[\varphi_S(x) \tilde B_i(x)\right]\right)$

$\displaystyle \hphantom{\sum_{i=1}^k \mathop{\mathbb E}_{S,x} \left[\varphi_S(x) A_i(S) B_i(x)\right]} \geq -\varepsilon \sum_{i=1}^k \|A_i\|_1 - \alpha \frac{h^d (2m)^d}{n^d}\,.$

Almost there: Now observe that

$\displaystyle \|g\|_1 = \mathop{\mathbb E}_{S,x} [M_n^g(S,x)] = \sum_{i=1}^k \|A_i\|_1 \|B_i\|_1 = \sum_{i=1}^k \|A_i\|_1\,.$

Plugging this into the preceding line yields

$\displaystyle \sum_{i=1}^k \mathop{\mathbb E}_{S,x} \left[\varphi_S(x) A_i(S) B_i(x)\right] \geq -\varepsilon \|g\|_1 - \alpha \frac{h^d (2m)^d}{n^d}\,.$

Recalling now (2), we have derived a contradiction to ${\alpha_+(M) \leq \alpha}$ if we can choose the right-hand side to be bigger than ${\beta}$ (which is a negative constant depending only on ${g}$). Setting ${\varepsilon = -\beta/(2 \|g\|_1)}$, we consult (5) to see that

$\displaystyle h \leq C'(g) m \log \alpha$

for some other constant ${C'(g)}$ depending only on ${g}$. We thus arrive at a contradiction if ${\alpha = o((n/\log n)^d)}$, recalling that ${m, d}$ depend only on ${g}$. This completes the argument.