This is the first in a series of posts on the surprising power of “entropy optimality.” I have recently encountered many incarnations of this principle in a number of d\ifferent settings (functional analysis, stochastic processes, additive number theory, communication complexity, etc.). It is also connected to many well-studied phenomena in machine learning, optimization, and statistical physics. In the spirit of blogs, my goal is for each post to highlight a single interesting aspect of the theory.

For the first post, we’ll look at a very simple but compelling example: **matrix scaling**. This is a problem that arises in statistics, numerical analysis, and a number of other areas (see Section 1.2 here for references).

Suppose we are given an target matrix with nonnegative entries. We also have an input matrix and our goal is to multiply the rows and columns of by positive weights so that the resulting matrix has the same row and column sums as .

Stated differently, we look for diagonal matrices and such that has the same row and column sums as . A typical choice of target might be for all , meaning that we want to rescale to be doubly stochastic.

To prove the theorem, we will search for a matrix with the same row and column sums as : For all :

Clearly such a matrix exists; we could just take ! But the idea is that we want to be a “simple perturbation” of . We will follow a philosophy analogous to Jaynes’ principle of maximum entropy. What results is precisely the argument of Franklin and Lorenz.

** 1.1. The relative entropy **

Given two probability distributions and on a finite set , we recall that the *relative entropy* between two measures and on is given by

We take this quantity to be infinite unless . A fundamental property of is that it is *jointly convex* in its arguments. For the following, we will only need that is convex in which is a simple consequence of the fact that is a convex function on .

In proving Theorem 1, we may clearly assume that . Now let denote the affine subspace of all satisfying the constraints (1). We will consider the optimization problem:

The idea is that the objective function prefers to be as close to as possible in terms of relative entropy while still satisfying the constraints. Philosophically, this is supposed to encourage to be “simple” with respect to .

Observe that, by our assumption on , it follows that is a point satisfying . Note that since is a convex, compact set, we know that there is an optimizer, and since is strictly convex, the optimal solution is unique. It will turn out that is of the form , and this fact will prove our theorem. The last step is to understand the structure of in terms of the Lagrangian.

** 1.2. The Lagrangian **

Let us introduce dual variables and consider the Lagrangian (see the Boyd-Vandenberghe book) corresponding to the optimization (2):

At optimality, the gradient should be equal to zero. If we differentiate with respect to some , we see that

which implies that, at optimality,

In particular, we see that is obtained from by multiplying the rows by and multiplying the columns by , completing the proof of Theorem 1.

** 1.3. Being careful about optimality **

An astute reader (or, in this case, commenter) might point out that we did not use the condition . But clearly it is necessary, as one can see from examining the pair

See these lecture notes for a detailed account.

Hi James. These are an amazing series of posts!

Theorem 1 seems to imply that the matrix X = [1 1; 1 0] has a doubly stochastic scaling since we can take T = [0 1; 1 0], but this is not true. With a scaling of X, we can only reach arbitrarily close to being doubly stochastic.

Hi Ankit,

Of course your comment is spot on! We actually need the condition . I added an explanation.

This issue also arises (and is fixed in the same way) in this post.