In this post, we will see an example of entropy optimality applied to a determinantal measure (see, for instance, Terry Tao’s post on determinantal processes and Russ Lyons’ ICM survey). I think this is an especially fertile setting for entropy maximization, but this will be the only post in this vein for now; I hope to return to the topic later.
Our goal is to prove the following theorem of Forster.
Theorem 1 (Forster) Suppose that are unit vectors such that every subset of vectors is linearly independent. Then there exists a linear mapping such that
This result is surprising at first glance. If we simply wanted to map the vectors to isotropic position, we could use the matrix . But Forster’s theorem asks that the unit vectors
are in isotropic position. This seems to be a much trickier task.
Forster used this as a step in proving lower bounds on the sign rank of certain matrices. Forster’s proof is based on an iterative argument combined with a compactness assertion.
There is another approach based on convex programming arising in the work of Barthe on a reverse Brascamp-Lieb inequality. The relation to Forster’s theorem was observed in work of Hardt and Moitra; it is essentially the dual program to the one we construct below.
1.1. Some facts about determinants
We first recall a few preliminary facts about determinants. For any , we have the Cauchy-Binet formula
We also have a rank-one update formula for the determinant: If a matrix is invertible, then
Finally, for vectors and nonnegative coefficients , we have
This follows because replacing by corresponds to multiplying the th row and column of by , where is the matrix that has the vectors as columns.
1.2. A determinantal measure
To prove Theorem 1, we will first define a probability measure on , i.e., on the -subsets of by setting:
The Cauchy-Binet formula is precisely the statement that , i.e. the collection forms a probability distribution on . How can we capture the fact that some vectors satisfy using only the values ?
Using the rank-one update formula, for an invertible matrix , we have . Thus is the identity matrix if and only if for every ,
Note also that using Cauchy-Binet,
In particular, if , then for every , we have
Of course, our vectors likely don’t satisfy this condition (otherwise we would be done). So we will use the max-entropy philosophy to find the “simplest” perturbation of the values that does satisfy it. The optimal solution will yield a matrix satisfying (1).
1.3. Entropy maximization
Consider the following convex program with variables .
In other words, we look for a distributon on that has minimum entropy relative to , and such that all the “one-dimensional marginals” are equal (recall (2)). Remarkably, the optimum will be a determinantal measure as well.
Note that the uniform distribution on subsets of size is a feasible point and the objective is finite precisely because for every . The latter fact follows from our assumption that every subset of vectors is linearly independent.
1.4. Analyzing the optimizer
By setting the gradient of the Lagrangian to zero, we see that the optimal solution has the form
for some dual variables . Note that the dual variables are unconstrained because they come from equality constraints.
Let us write . We use , , and to denote the values at the optimal solution. Using again the rank-one update formula for the determinant,
But just as in (2), we can also use Cauchy-Binet to calculate the derivative (from the second expression in (3)):
where we have used the fact that if (and otherwise equals ). We conclude that
Now we can finish the proof: Let . Then:
4 thoughts on “Entropy optimality: Forster’s isotropy”
May I ask what’s the major application of this theorem?
I would look at the papers linked in the post–Forster’s paper gives an application to lower bounding the sign rank of a matrix (though this has been surpassed by Linial, et. al. using factorization norms). The Hardt-Moitra paper gives applications in machine learning. And Barthe’s paper (which is actually the first and the theorem should really be credited to Barthe) uses it as a tool for some functional inequalities.
Thanks a lot.