# Entropy optimality: Matrix scaling

This is the first in a series of posts on the surprising power of “entropy optimality.” I have recently encountered many incarnations of this principle in a number of different settings (functional analysis, stochastic processes, additive number theory, communication complexity, etc.). It is also connected to many well-studied phenomena in machine learning, optimization, and statistical physics. In the spirit of blogs, my goal is for each post to highlight a single interesting aspect of the theory.

For the first post, we’ll look at a very simple but compelling example: matrix scaling. This is a problem that arises in statistics, numerical analysis, and a number of other areas (see Section 1.2 here for references).

Suppose we are given an ${n \times n}$ target matrix ${T=(t_{ij})}$ with nonnegative entries. We also have an ${n \times n}$ input matrix ${X=(x_{ij})}$ and our goal is to multiply the rows and columns of ${X}$ by positive weights so that the resulting matrix has the same row and column sums as ${T}$.

Stated differently, we look for diagonal matrices ${D_1}$ and ${D_2}$ such that ${D_1 X D_2}$ has the same row and column sums as ${T}$. A typical choice of target might be ${t_{ij} = 1/n}$ for all ${1 \leq i,j \leq n}$, meaning that we want to rescale ${X}$ to be doubly stochastic.
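The classical iterative approach to finding such a scaling is Sinkhorn iteration: alternately rescale the rows of ${X}$ to match the row sums of ${T}$, then the columns to match its column sums. Here is a minimal sketch in Python (the function name and NumPy implementation are my own illustration; the post itself does not discuss algorithms):

```python
import numpy as np

def sinkhorn_scale(X, T, iters=500):
    """Alternately rescale rows/columns of X toward the row/column sums
    of T. Returns the scaled matrix and the diagonals of D1 and D2."""
    X = np.asarray(X, dtype=float)
    r = T.sum(axis=1)                 # target row sums
    c = T.sum(axis=0)                 # target column sums
    d1 = np.ones(X.shape[0])
    d2 = np.ones(X.shape[1])
    for _ in range(iters):
        d1 = r / (X @ d2)             # make row sums of D1 X D2 equal r
        d2 = c / (X.T @ d1)           # make column sums equal c
    return d1[:, None] * X * d2[None, :], d1, d2

# Rescale a positive 2x2 matrix to be doubly stochastic (t_ij = 1/n):
Y, d1, d2 = sinkhorn_scale(np.array([[2., 1.], [1., 3.]]),
                           np.full((2, 2), 0.5))
```

When every entry of ${X}$ is strictly positive the iteration converges geometrically; under the weaker hypothesis of Theorem 1 it still converges, though possibly more slowly.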

Theorem 1 If ${x_{ij} = 0 \iff t_{ij} = 0}$ for all ${1 \leq i,j \leq n}$, then such a weighting exists.

To prove the theorem, we will search for a matrix ${Y = (y_{ij})}$ with the same row and column sums as ${T}$, i.e., such that for all ${i,j \in [n]}$:

$\displaystyle \sum_{i} y_{ij}=\sum_i t_{ij} \quad \textrm{ and } \quad \sum_j y_{ij} = \sum_j t_{ij}\,. (1)$

Clearly such a matrix exists; we could just take ${Y=T}$! But the idea is that we want ${Y}$ to be a “simple perturbation” of ${X}$. We will follow a philosophy analogous to Jaynes’ principle of maximum entropy. What results is precisely the argument of Franklin and Lorenz.

1.1. The relative entropy

Given two probability distributions ${\mu}$ and ${\nu}$ on a finite set ${S}$, recall that the relative entropy of ${\mu}$ with respect to ${\nu}$ is given by

$\displaystyle D(\mu\,\|\,\nu) = \sum_{x \in S} \mu(x) \log \frac{\mu(x)}{\nu(x)}\,.$

We take this quantity to be infinite unless ${\nu(x) = 0 \implies \mu(x)=0}$ for every ${x \in S}$. A fundamental property of ${D}$ is that it is jointly convex in its arguments. For the following, we will only need that ${D}$ is convex in ${\mu}$, which is a simple consequence of the fact that ${y \mapsto y \log y}$ is a convex function on ${\mathbb R_+}$.
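As a quick illustration (my own, not from the post), one can compute ${D(\mu \,\|\, \nu)}$ directly and spot-check convexity in the first argument numerically:

```python
import numpy as np

def rel_entropy(mu, nu):
    """D(mu || nu) = sum_x mu(x) log(mu(x)/nu(x)); infinite unless
    nu(x) = 0 implies mu(x) = 0. Terms with mu(x) = 0 contribute 0."""
    mu, nu = np.asarray(mu, float), np.asarray(nu, float)
    if np.any((nu == 0) & (mu > 0)):
        return np.inf
    pos = mu > 0
    return float(np.sum(mu[pos] * np.log(mu[pos] / nu[pos])))

mu1, mu2, nu = [0.5, 0.5], [0.9, 0.1], [0.6, 0.4]
lam = 0.3
mix = lam * np.array(mu1) + (1 - lam) * np.array(mu2)
# Convexity in the first argument:
# D(lam*mu1 + (1-lam)*mu2 || nu) <= lam*D(mu1 || nu) + (1-lam)*D(mu2 || nu)
```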

In proving Theorem 1, we may clearly assume that ${\sum_{ij} x_{ij} = \sum_{ij} t_{ij} = 1}$. Now let ${P(T) \subseteq \mathbb R^{n^2}}$ denote the affine subspace of all ${Y}$ satisfying the constraints (1). We will consider the optimization problem:

$\displaystyle \textrm{minimize } D(Y \,\|\, X) = \sum_{ij} y_{ij} \log \frac{y_{ij}}{x_{ij}} \quad \textrm{ subject to } Y \in P(T)\quad (2)$

The idea is that the objective function prefers ${Y}$ to be as close to ${X}$ as possible in terms of relative entropy while still satisfying the constraints. Philosophically, this is supposed to encourage ${Y}$ to be “simple” with respect to ${X}$.

Observe that, by our assumption on ${X}$, it follows that ${T \in P(T)}$ is a point satisfying ${D(T \,\|\,X) < \infty}$. Note that since ${P(T)}$ is a convex, compact set, we know that there is an optimizer, and since ${Y \mapsto D(Y \,\|\,X)}$ is strictly convex, the optimal solution ${Y^*}$ is unique. It will turn out that ${Y^*}$ is of the form ${D_1 X D_2}$, and this fact will prove our theorem. The last step is to understand the structure of ${Y^*}$ in terms of the Lagrangian.

1.2. The Lagrangian

Let us introduce ${2n}$ dual variables ${\{ \alpha_i, \beta_j \in \mathbb R : i,j \in [n] \}}$ and consider the Lagrangian (see the Boyd-Vandenberghe book) corresponding to the optimization (2):

$\displaystyle \sum_{ij} y_{ij} \log \frac{y_{ij}}{x_{ij}} + \sum_i \alpha_i \sum_j (y_{ij}-t_{ij}) + \sum_j \beta_j \sum_i (y_{ij}-t_{ij})\,.$

At optimality, the gradient should be equal to zero. If we differentiate with respect to some ${y_{ij}}$, we see that

$\displaystyle 1+\log y_{ij} + \alpha_i + \beta_j - \log x_{ij} = 0\,,$

which implies that, at optimality,

$\displaystyle y^*_{ij} = x_{ij}\, e^{-\alpha^*_i-\beta^*_j-1}\,.$

In particular, we see that ${Y^*}$ is obtained from ${X}$ by multiplying the rows by ${e^{-\alpha_i^*-1}}$ and multiplying the columns by ${e^{-\beta_j^*}}$, completing the proof of Theorem 1.
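One can see this structure concretely: if ${d_1, d_2}$ are the diagonal weights produced by row/column scaling (computed below by a Sinkhorn-style iteration; this is my own illustration, not from the post), then setting ${\alpha_i = -\log (d_1)_i - 1}$ and ${\beta_j = -\log (d_2)_j}$ makes the stationarity condition hold identically on the support of ${X}$:

```python
import numpy as np

def sinkhorn_scale(X, T, iters=500):
    # alternate row/column rescaling toward the row/column sums of T
    r, c = T.sum(axis=1), T.sum(axis=0)
    d1, d2 = np.ones(X.shape[0]), np.ones(X.shape[1])
    for _ in range(iters):
        d1 = r / (X @ d2)
        d2 = c / (X.T @ d1)
    return d1, d2

X = np.array([[2., 1.], [1., 3.]])
T = np.full((2, 2), 0.5)              # doubly stochastic target
d1, d2 = sinkhorn_scale(X, T)
Y = d1[:, None] * X * d2[None, :]

alpha = -np.log(d1) - 1.0             # rows are scaled by e^{-alpha_i - 1}
beta = -np.log(d2)                    # columns are scaled by e^{-beta_j}

# Stationarity: 1 + log y_ij + alpha_i + beta_j - log x_ij = 0
grad = 1 + np.log(Y) + alpha[:, None] + beta[None, :] - np.log(X)
```

Note that the vanishing of `grad` reflects exactly the fact that any matrix of the form ${D_1 X D_2}$ satisfies the first-order condition; Sinkhorn iteration is then responsible for hitting the constraints (1).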

An astute reader (or, in this case, commenter) might point out that we did not use the condition ${t_{ij} = 0 \implies x_{ij}=0}$. But clearly it is necessary, as one can see from examining the pair

$\displaystyle T = \left(\begin{array}{cc} 0 & 1 \\ 1 & 0 \end{array}\right) \quad X = \left(\begin{array}{cc} 1 & 1 \\ 1 & 0 \end{array}\right)\,.$
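A short numerical sketch (mine, with illustrative variable names) of what goes wrong for this pair: alternating row/column scaling drives the ${(1,1)}$ entry to zero only in the limit, and the relative size of the row weights blows up, so no finite scaling is exact:

```python
import numpy as np

X = np.array([[1., 1.], [1., 0.]])
T = np.array([[0., 1.], [1., 0.]])
r, c = T.sum(axis=1), T.sum(axis=0)   # all target sums equal 1

d1, d2 = np.ones(2), np.ones(2)
for _ in range(1000):
    d1 = r / (X @ d2)                 # fix row sums
    d2 = c / (X.T @ d1)               # fix column sums
Y = d1[:, None] * X * d2[None, :]

# Column sums are exact by construction, but Y[0, 0] only decays like
# 1/k, and the ratio d1[1]/d1[0] diverges -- the scalings never land
# exactly on a doubly stochastic matrix.
```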

See these lecture notes for a detailed account.

## 2 thoughts on “Entropy optimality: Matrix scaling”

1. Ankit Garg says:

Hi James. This is an amazing series of posts!

Theorem 1 seems to imply that the matrix X = [1 1; 1 0] has a doubly stochastic scaling since we can take T = [0 1; 1 0], but this is not true. With a scaling of X, we can only reach arbitrarily close to being doubly stochastic.

2. Hi Ankit,

Of course your comment is spot on! We actually need the condition $x_{ij}=0 \iff t_{ij}=0$. I added an explanation.

This issue also arises (and is fixed in the same way) in this post.