This is the first in a series of posts on the surprising power of “entropy optimality.” I have recently encountered many incarnations of this principle in a number of different settings (functional analysis, stochastic processes, additive number theory, communication complexity, etc.). It is also connected to many well-studied phenomena in machine learning, optimization, and statistical physics. In the spirit of blogs, my goal is for each post to highlight a single interesting aspect of the theory.
For the first post, we’ll look at a very simple but compelling example: matrix scaling. This is a problem that arises in statistics, numerical analysis, and a number of other areas (see Section 1.2 here for references).
Suppose we are given a target matrix $T \in \mathbb{R}_+^{n \times n}$ with nonnegative entries. We also have an input matrix $X \in \mathbb{R}_+^{n \times n}$, and our goal is to multiply the rows and columns of $X$ by positive weights so that the resulting matrix has the same row and column sums as $T$.

Stated differently, we look for diagonal matrices $D_1$ and $D_2$ with positive diagonal entries such that $D_1 X D_2$ has the same row and column sums as $T$. A typical choice of target might be $T_{ij} = \frac{1}{n}$ for all $i,j \in [n]$, meaning that we want to rescale $X$ to be doubly stochastic.

Theorem 1: Suppose that $X$ and $T$ have the same support, i.e., $X_{ij} > 0 \iff T_{ij} > 0$ for all $i,j \in [n]$. Then there exist diagonal matrices $D_1$ and $D_2$ with positive diagonal entries such that $D_1 X D_2$ has the same row and column sums as $T$.
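In practice, such a scaling can be computed by the classical Sinkhorn–Knopp iteration: alternately rescale the rows and then the columns to match the target sums. Here is a minimal sketch in Python (the function name `sinkhorn_scale` and the iteration count are my own choices, not from the post):

```python
import numpy as np

def sinkhorn_scale(X, T, iters=500):
    """Alternately rescale rows and columns of X so that
    d1[:, None] * X * d2[None, :] approaches the row/column sums of T."""
    row_target = T.sum(axis=1)
    col_target = T.sum(axis=0)
    d1 = np.ones(X.shape[0])   # diagonal of D1 (row weights)
    d2 = np.ones(X.shape[1])   # diagonal of D2 (column weights)
    for _ in range(iters):
        d1 *= row_target / (d1[:, None] * X * d2[None, :]).sum(axis=1)
        d2 *= col_target / (d1[:, None] * X * d2[None, :]).sum(axis=0)
    return d1, d2

# Example: scale a positive 2x2 matrix to be doubly stochastic
# (target T_ij = 1/n, so every row and column sum equals 1).
X = np.array([[2.0, 1.0], [1.0, 3.0]])
T = np.full((2, 2), 0.5)
d1, d2 = sinkhorn_scale(X, T)
Y = d1[:, None] * X * d2[None, :]
```

For strictly positive $X$ the iteration converges quickly; the support issues discussed at the end of the post are exactly what can make it fail for matrices with zeros.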
To prove the theorem, we will search for a matrix $Y$ with the same row and column sums as $T$: For all $i,j \in [n]$:

$\displaystyle \sum_{k=1}^n Y_{ik} = \sum_{k=1}^n T_{ik} \quad \textrm{and} \quad \sum_{k=1}^n Y_{kj} = \sum_{k=1}^n T_{kj}\,. \qquad (1)$

Clearly such a matrix exists; we could just take $Y = T$! But the idea is that we want $Y$ to be a “simple perturbation” of $X$. We will follow a philosophy analogous to Jaynes’ principle of maximum entropy. What results is precisely the argument of Franklin and Lorenz.
1.1. The relative entropy
Given two probability distributions $\mu$ and $\nu$ on a finite set $S$, we recall that the relative entropy between $\mu$ and $\nu$ is given by

$\displaystyle D(\mu \,\|\, \nu) = \sum_{x \in S} \mu(x) \log \frac{\mu(x)}{\nu(x)}\,.$

We take this quantity to be infinite unless $\mathrm{supp}(\mu) \subseteq \mathrm{supp}(\nu)$. A fundamental property of $D(\mu \,\|\, \nu)$ is that it is jointly convex in its arguments. For the following, we will only need that it is convex in $\mu$, which is a simple consequence of the fact that $t \mapsto t \log t$ is a convex function on $[0,\infty)$.
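For concreteness, here is the definition in code, including the support convention (a small sketch; the helper name `rel_entropy` is my own):

```python
import numpy as np

def rel_entropy(mu, nu):
    """D(mu || nu) = sum_x mu(x) log(mu(x) / nu(x)), with the conventions
    0 log 0 = 0 and D = +infinity unless supp(mu) is contained in supp(nu)."""
    mu = np.asarray(mu, dtype=float)
    nu = np.asarray(nu, dtype=float)
    if np.any((mu > 0) & (nu == 0)):   # support condition violated
        return np.inf
    mask = mu > 0                      # 0 log 0 terms contribute nothing
    return float(np.sum(mu[mask] * np.log(mu[mask] / nu[mask])))

p = np.array([0.5, 0.5, 0.0])
q = np.array([0.25, 0.25, 0.5])
```

Note that `rel_entropy(p, p)` is exactly zero, while `rel_entropy(q, p)` is infinite since $\mathrm{supp}(q) \not\subseteq \mathrm{supp}(p)$.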
In proving Theorem 1, we may clearly assume that $\sum_{i,j} X_{ij} = \sum_{i,j} T_{ij} = 1$, so that $X$ and $T$ are probability distributions on $[n] \times [n]$. Now let $\mathcal{P}$ denote the affine subspace of all $Y \in \mathbb{R}^{n \times n}$ satisfying the constraints (1). We will consider the optimization problem:

$\displaystyle \textrm{minimize } D(Y \,\|\, X) \quad \textrm{subject to } Y \in \mathcal{P}\,. \qquad (2)$

The idea is that the objective function prefers $Y$ to be as close to $X$ as possible in terms of relative entropy while still satisfying the constraints. Philosophically, this is supposed to encourage $Y$ to be “simple” with respect to $X$.

Observe that, by our assumption on the supports of $X$ and $T$, it follows that $Y = T$ is a point satisfying $D(T \,\|\, X) < \infty$. Note that since $\mathcal{P} \cap \mathbb{R}_+^{n \times n}$ is a convex, compact set, we know that there is an optimizer, and since $Y \mapsto D(Y \,\|\, X)$ is strictly convex, the optimal solution $Y^*$ is unique. It will turn out that $Y^*$ is obtained from $X$ by multiplying its rows and columns by positive weights, and this fact will prove our theorem. The last step is to understand the structure of $Y^*$ in terms of the Lagrangian.
1.2. The Lagrangian
Let us introduce dual variables $\alpha = (\alpha_1, \ldots, \alpha_n)$ and $\beta = (\beta_1, \ldots, \beta_n)$ for the row and column constraints, respectively, and consider the Lagrangian (see the Boyd-Vandenberghe book) corresponding to the optimization (2):

$\displaystyle \Lambda(Y; \alpha, \beta) = D(Y \,\|\, X) + \sum_{i=1}^n \alpha_i \Big( \sum_k T_{ik} - \sum_k Y_{ik} \Big) + \sum_{j=1}^n \beta_j \Big( \sum_k T_{kj} - \sum_k Y_{kj} \Big)\,.$

At optimality, the gradient should be equal to zero. If we differentiate with respect to some entry $Y_{ij}$, we see that

$\displaystyle \frac{\partial \Lambda}{\partial Y_{ij}} = 1 + \log \frac{Y_{ij}}{X_{ij}} - \alpha_i - \beta_j\,,$

which implies that, at optimality,

$\displaystyle Y^*_{ij} = X_{ij}\, e^{\alpha_i + \beta_j - 1}\,.$

In particular, we see that $Y^*$ is obtained from $X$ by multiplying the rows by the positive weights $e^{\alpha_i - 1}$ and multiplying the columns by the positive weights $e^{\beta_j}$, completing the proof of Theorem 1.
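One can check this conclusion numerically: compute weights of the form $e^{\alpha_i}$, $e^{\beta_j}$ by alternately enforcing the row and column constraints, and verify that the resulting $Y^*$ really does minimize $D(\cdot \,\|\, X)$ along feasible directions. A sketch on a small instance of my own choosing:

```python
import numpy as np

X = np.array([[2.0, 1.0], [1.0, 3.0]])
X /= X.sum()                      # normalize X to a probability distribution
T = np.full((2, 2), 0.25)         # uniform target, also a distribution

# Row weights u_i = e^{alpha_i} and column weights v_j = e^{beta_j},
# obtained by alternately enforcing the row and column constraints.
u, v = np.ones(2), np.ones(2)
for _ in range(1000):
    u *= T.sum(axis=1) / (u[:, None] * X * v[None, :]).sum(axis=1)
    v *= T.sum(axis=0) / (u[:, None] * X * v[None, :]).sum(axis=0)
Y = u[:, None] * X * v[None, :]   # candidate optimizer Y*_ij = X_ij e^{alpha_i + beta_j}

def D(mu, nu):
    """Relative entropy of two strictly positive arrays."""
    return float(np.sum(mu * np.log(mu / nu)))

# For 2x2 matrices, every direction preserving all row and column sums
# is a multiple of K, so feasible perturbations of Y look like Y + t*K.
K = np.array([[1.0, -1.0], [-1.0, 1.0]])
vals = [D(Y + t * K, X) for t in (-0.05, 0.0, 0.05)]
```

The middle value (the unperturbed $Y^*$) comes out smallest, as the strict convexity of $D(\cdot \,\|\, X)$ predicts.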
1.3. Being careful about optimality
An astute reader (or, in this case, commenter) might point out that we did not use the condition $\mathrm{supp}(X) \subseteq \mathrm{supp}(T)$; it is what guarantees that the optimizer $Y^*$ has full support, so that setting the gradient of the Lagrangian to zero is valid. But clearly the condition is necessary, as one can see from examining the pair

$\displaystyle X = \begin{pmatrix} 1 & 1 \\ 1 & 0 \end{pmatrix}, \qquad T = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}.$

See these lecture notes for a detailed account.
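The failure is easy to watch numerically: running the alternating normalization on this pair drives the entry at position $(1,1)$ toward zero without ever reaching a doubly stochastic matrix (a small sketch; variable names are mine):

```python
import numpy as np

# The problematic pair: supp(X) is not contained in supp(T) because of X[0,0].
X = np.array([[1.0, 1.0], [1.0, 0.0]])

d1, d2 = np.ones(2), np.ones(2)
for _ in range(2000):
    d1 *= 1.0 / (d1[:, None] * X * d2[None, :]).sum(axis=1)  # target row sums: 1
    d2 *= 1.0 / (d1[:, None] * X * d2[None, :]).sum(axis=0)  # target col sums: 1
Y = d1[:, None] * X * d2[None, :]
```

The iterates approach the doubly stochastic matrix $\left(\begin{smallmatrix} 0 & 1 \\ 1 & 0 \end{smallmatrix}\right)$ only in the limit; no single choice of positive weights attains it, since positive scalings cannot create a zero entry.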
Hi James. This is an amazing series of posts!
Theorem 1 seems to imply that the matrix X = [1 1; 1 0] has a doubly stochastic scaling since we can take T = [0 1; 1 0], but this is not true. With a scaling of X, we can only reach arbitrarily close to being doubly stochastic.
Hi Ankit,
Of course your comment is spot on! We actually need the condition $\mathrm{supp}(X) \subseteq \mathrm{supp}(T)$. I added an explanation.
This issue also arises (and is fixed in the same way) in this post.