In this post, we will see an example of entropy optimality applied to a *determinantal measure* (see, for instance, Terry Tao’s post on determinantal processes and Russ Lyons’ ICM survey). I think this is an especially fertile setting for entropy maximization, but this will be the only post in this vein for now; I hope to return to the topic later.

Our goal is to prove the following theorem of Forster.

Theorem 1 (Forster)Suppose that are unit vectors such that every subset of vectors is linearly independent. Then there exists a linear mapping such that

This result is surprising at first glance. If we simply wanted to map the vectors to isotropic position, we could use the matrix . But Forster’s theorem asks that the unit vectors

are in isotropic position. This seems to be a much trickier task.

Forster used this as a step in proving lower bounds on the *sign rank* of certain matrices. Forster’s proof is based on an iterative argument combined with a compactness assertion.

There is another approach based on convex programming arising in the work of Barthe on a reverse Brascamp-Lieb inequality. The relation to Forster’s theorem was observed in work of Hardt and Moitra; it is essentially the dual program to the one we construct below.

** 1.1. Some facts about determinants **

We first recall a few preliminary facts about determinants. For any , we have the Cauchy-Binet formula

We also have a rank-one update formula for the determinant: If a matrix is invertible, then

Finally, for vectors and nonnegative coefficients , we have

This follows because replacing by corresponds to multiplying the th row and column of by , where is the matrix that has the vectors as columns.

** 1.2. A determinantal measure **

To prove Theorem 1, we will first define a probability measure on , i.e., on the -subsets of by setting:

The Cauchy-Binet formula is precisely the statement that , i.e. the collection forms a probability distribution on . How can we capture the fact that some vectors satisfy using only the values ?

Using the rank-one update formula, for an invertible matrix , we have . Thus is the identity matrix if and only if for every ,

Note also that using Cauchy-Binet,

In particular, if , then for every , we have

Of course, our vectors likely don’t satisfy this condition (otherwise we would be done). So we will use the max-entropy philosophy to find the “simplest” perturbation of the values that does satisfy it. The optimal solution will yield a matrix satisfying (1).

** 1.3. Entropy maximization **

Consider the following convex program with variables .

In other words, we look for a distributon on that has minimum entropy relative to , and such that all the “one-dimensional marginals” are equal (recall (2)). Remarkably, the optimum will be a determinantal measure as well.

Note that the uniform distribution on subsets of size is a feasible point and the objective is finite precisely because for every . The latter fact follows from our assumption that every subset of vectors is linearly independent.

** 1.4. Analyzing the optimizer **

By setting the gradient of the Lagrangian to zero, we see that the optimal solution has the form

for some dual variables . Note that the dual variables are unconstrained because they come from equality constraints.

Let us write . We use , , and to denote the values at the optimal solution. Using again the rank-one update formula for the determinant,

But just as in (2), we can also use Cauchy-Binet to calculate the derivative (from the second expression in (3)):

where we have used the fact that if (and otherwise equals ). We conclude that

Now we can finish the proof: Let . Then:

May I ask what’s the major application of this theorem?

I would look at the papers linked in the post–Forster’s paper gives an application to lower bounding the sign rank of a matrix (though this has been surpassed by Linial, et. al. using factorization norms). The Hardt-Moitra paper gives applications in machine learning. And Barthe’s paper (which is actually the first and the theorem should really be credited to Barthe) uses it as a tool for some functional inequalities.

Thanks a lot.