Entropy optimality: Matrix scaling

This is the first in a series of posts on the surprising power of “entropy optimality.” I have recently encountered many incarnations of this principle in a number of d\ifferent settings (functional analysis, stochastic processes, additive number theory, communication complexity, etc.). It is also connected to many well-studied phenomena in machine learning, optimization, and statistical physics. In the spirit of blogs, my goal is for each post to highlight a single interesting aspect of the theory.

For the first post, we’ll look at a very simple but compelling example: matrix scaling. This is a problem that arises in statistics, numerical analysis, and a number of other areas (see Section 1.2 here for references).

Suppose we are given an {n \times n} target matrix {T=(t_{ij})} with nonnegative entries. We also have an {n \times n} input matrix {X=(x_{ij})} and our goal is to multiply the rows and columns of {X} by positive weights so that the resulting matrix has the same row and column sums as {T} .

Stated differently, we look for diagonal matrices {D_1} and {D_2} such that {D_1 X D_2} has the same row and column sums as {T} . A typical choice of target might be {t_{ij} = 1/n} for all {1 \leq i,j \leq n} , meaning that we want to rescale {X} to be doubly stochastic.

Theorem 1 If {x_{ij} = 0 \iff t_{ij} = 0} for all {1 \leq i,j \leq n} , then such a weighting exists.

To prove the theorem, we will search for a matrix {Y = (y_{ij})} with the same row and column sums as {T} : For all {i,j \in [n]} :

\displaystyle \sum_{i} y_{ij}=\sum_i t_{ij} \quad \textrm{ and } \quad \sum_j y_{ij} = \sum_j t_{ij}\,.      (1)

Clearly such a matrix exists; we could just take {Y=T} ! But the idea is that we want {Y}  to be a “simple perturbation” of {X} . We will follow a philosophy analogous to Jaynes’ principle of maximum entropy. What results is precisely the argument of Franklin and Lorenz.

1.1. The relative entropy

Given two probability distributions {\mu} and {\nu} on a finite set {S} , we recall that the relative entropy between two measures {\mu} and {\nu} on {X} is given by

\displaystyle D(\mu\,\|\,\nu) = \sum_{x \in S} \mu(x) \log \frac{\mu(x)}{\nu(x)}\,.

We take this quantity to be infinite unless {\nu(x) = 0 \implies \mu(x)=0} . A fundamental property of {D} is that it is jointly convex in its arguments. For the following, we will only need that {D} is convex in {\mu} which is a simple consequence of the fact that {y \mapsto y \log y} is a convex function on {\mathbb R_+} .

In proving Theorem 1, we may clearly assume that {\sum_{ij} x_{ij} = \sum_{ij} t_{ij} = 1} . Now let {P(T) \subseteq \mathbb R^{n^2}} denote the affine subspace of all {Y} satisfying the constraints (1). We will consider the optimization problem:

\displaystyle \textrm{minimize } D(Y \,\|\, X) = \sum_{ij} y_{ij} \log \frac{y_{ij}}{x_{ij}} \quad \textrm{ subject to } Y \in P(T)\quad (2)

The idea is that the objective function prefers {Y} to be as close to {X} as possible in terms of relative entropy while still satisfying the constraints. Philosophically, this is supposed to encourage {Y} to be “simple” with respect to {X} .

Observe that, by our assumption on {X} , it follows that {T \in P(T)} is a point satisfying {D(T \,\|\,X) < \infty} . Note that since {P(T)} is a convex, compact set, we know that there is an optimizer, and since {Y \mapsto D(Y \,\|\,X)} is strictly convex, the optimal solution {Y^*} is unique. It will turn out that {Y^*} is of the form {D_1 X D_2} , and this fact will prove our theorem. The last step is to understand the structure of {Y^*} in terms of the Lagrangian.

1.2. The Lagrangian

Let us introduce {2n} dual variables {{ \alpha_i, \beta_j \in \mathbb R : i,j \in [n] }} and consider the Lagrangian (see the Boyd-Vandenberghe book) corresponding to the optimization (2):

\displaystyle \sum_{ij} y_{ij} \log \frac{y_{ij}}{x_{ij}} + \sum_i \alpha_i \sum_j (t_{ij}-y_{ij}) + \sum_j \beta_j \sum_i (t_{ij}-y_{ij})\,.

At optimality, the gradient should be equal to zero. If we differentiate with respect to some {y_{ij}} , we see that

\displaystyle 1+\log y_{ij} + \alpha_i + \beta_j - \log x_{ij} = 0\,,

which implies that, at optimality,

\displaystyle y^*_{ij} = x_{ij}, e^{-\alpha^*_i-\beta^*_j-1}\,.

In particular, we see that {Y^*} is obtained from {X} by multiplying the rows by {{e^{-\alpha_i^*-1}}} and multiplying the columns by {{e^{-\beta_j^*}}} , completing the proof of Theorem 1.

1.3. Being careful about optimality

An astute reader (or, in this case, commenter) might point out that we did not use the condition {t_{ij} = 0 \implies x_{ij}=0} . But clearly it is necessary, as one can see from examining the pair

\displaystyle  T = \left(\begin{array}{cc} 0 & 1 \\ 1 & 0 \end{array}\right) \quad X = \left(\begin{array}{cc} 1 & 1 \\ 1 & 0 \end{array}\right)\,.

See these lecture notes for a detailed account.

2 thoughts on “Entropy optimality: Matrix scaling

  1. Hi James. These are an amazing series of posts!

    Theorem 1 seems to imply that the matrix X = [1 1; 1 0] has a doubly stochastic scaling since we can take T = [0 1; 1 0], but this is not true. With a scaling of X, we can only reach arbitrarily close to being doubly stochastic.

  2. Hi Ankit,

    Of course your comment is spot on! We actually need the condition x_{ij}=0 \iff t_{ij}=0 . I added an explanation.

    This issue also arises (and is fixed in the same way) in this post.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s