# Entropy optimality: A potential function

Our goal now is to prove Lemma 3 from the last post, thereby completing the entropy-maximization proof of Chang’s Lemma. We prove it here in greater generality, since the argument shows how relative entropy provides a powerful potential function for measuring progress.

Let ${X}$ be a finite set equipped with a measure ${\mu}$. We use ${L^2(X,\mu)}$ to denote the space of real-valued functions ${f : X \rightarrow \mathbb R}$ equipped with the inner product ${\langle f,g\rangle = \sum_{x \in X} \mu(x) f(x) g(x)}$. Let ${Q_{\mu} \subseteq L^2(X,\mu)}$ denote the convex polytope of probability densities with respect to ${\mu}$:

$\displaystyle Q_{\mu} = \left\{ g \in L^2(X,\mu) : g \geq 0, \mathop{\mathbb E}_{\mu}[g]=1 \right\}\,.$
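To make the setup concrete, here is a minimal sketch in Python; the representation of ${X}$, ${\mu}$, and functions as arrays is our choice of illustration, not part of the argument.

```python
import numpy as np

# Sketch: represent X as {0, ..., n-1}, the measure mu as an array of
# nonnegative weights, and a function f in L^2(X, mu) by its values.

def inner(mu, f, g):
    # <f, g> = sum_x mu(x) f(x) g(x)
    return float(np.sum(mu * f * g))

def is_density(mu, g, tol=1e-9):
    # Membership in the polytope Q_mu: g >= 0 and E_mu[g] = 1.
    return bool(np.all(g >= -tol)) and abs(np.sum(mu * g) - 1.0) < tol

mu = np.array([0.2, 0.3, 0.5])   # a probability measure on a 3-point set
f = np.array([2.0, 1.0, 0.6])    # E_mu[f] = 0.4 + 0.3 + 0.3 = 1
```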

1.1. A ground truth and a family of tests

Fix now a density ${f \in Q_{\mu}}$ and a family ${\mathcal T \subseteq L^2(X,\mu)}$ that one can think of as “tests” (or properties of a function we might care about). Given an error parameter ${\varepsilon > 0}$, we define a convex polytope ${P_{\varepsilon}(f, \mathcal T) \subseteq L^2(X,\mu)}$ as follows:

$\displaystyle P_{\varepsilon}(f,\mathcal T) = \left\{ g : \langle \varphi, g \rangle \geq \langle \varphi,f\rangle - \varepsilon \quad \forall \varphi \in \mathcal T\right\}\,.$

This is the set of functions that have “performance” similar to that of ${f}$ on all the tests in ${\mathcal T}$. Note that the tests are one-sided; if we wanted two-sided bounds for some ${\varphi}$, we could just add ${-\varphi}$ to ${\mathcal T}$.
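In the finite-dimensional sketch, membership in ${P_{\varepsilon}(f,\mathcal T)}$ is just a finite list of linear checks (the array representation is our illustrative choice):

```python
import numpy as np

# Sketch (our representation): mu, f, g, and each test phi are arrays of
# function values on a finite set X = {0, ..., n-1}.

def in_P_eps(mu, f, tests, g, eps):
    # g lies in P_eps(f, T) iff <phi, g> >= <phi, f> - eps for every phi in T.
    return all(np.sum(mu * phi * g) >= np.sum(mu * phi * f) - eps
               for phi in tests)

mu = np.array([0.5, 0.5])
f = np.array([1.6, 0.4])          # a density: E_mu[f] = 1
phi = np.array([1.0, -1.0])
tests = [phi, -phi]               # include -phi for a two-sided constraint
```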

1.2. (Projected) coordinate descent in the dual

We now describe a simple algorithm to find a point ${g \in P_{\varepsilon}(f,\mathcal T) \cap Q_{\mu}}$. It will be precisely analogous to the algorithm we saw in the previous post for the natural polytope coming from Chang’s Lemma. We define a family of functions ${\{g_t : t \geq 0\} \subseteq Q_{\mu}}$ indexed by a continuous time parameter:

$\displaystyle g_t = \frac{\exp\left(\int_0^t \varphi_s\,ds\right)}{\mathop{\mathbb E}_{\mu}\left[\exp\left(\int_0^t \varphi_s\,ds\right)\right]}\,.$

Here ${\varphi_t \in \mathcal T}$ is some test we have yet to specify. Intuitively, it will be a test that ${g_t}$ is not performing well on. The idea is that at time ${t}$, we exponentially average some of ${\varphi_t}$ into ${g_t}$. The exponential ensures that ${g_t}$ is non-negative, and our normalization ensures that ${g_t \in Q_{\mu}}$. Observe that ${g_0 = \mathbf{1}}$ is the constant ${1}$ function.
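In discrete time, this exponential averaging is a multiplicative-weights-style update; a sketch (the step size `eta` is our discretization parameter, not part of the continuous-time argument):

```python
import numpy as np

# Sketch: maintain S_t, a stand-in for the integral of phi_s ds, and set
# g_t = exp(S_t) / E_mu[exp(S_t)], so g_t is automatically a density.

def gibbs(mu, S):
    w = np.exp(S - S.max())       # subtract the max for numerical stability
    return w / np.sum(mu * w)     # normalize so that E_mu[g] = 1

def step(S, phi, eta):
    return S + eta * phi          # average a bit of the test phi into g

mu = np.full(4, 0.25)
S = np.zeros(4)
g0 = gibbs(mu, S)                 # g_0 is the constant 1 function
```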

1.3. The potential function

For two densities ${h,g \in Q_{\mu}}$, we define the relative entropy

$\displaystyle D_{\mu}(h\,\|\,g) = \sum_{x \in X} \mu(x) h(x) \log \frac{h(x)}{g(x)}\,.$

Our goal is to analyze the potential function ${D_{\mu}(f \,\|\,g_t)}$. This functional measures how far the “ground truth” ${f}$ is from our hypothesis ${g_t}$.

Note that ${D_{\mu}(f \,\|\,g_0) = D_{\mu}(f \,\|\, \mathbf{1}) = \mathrm{Ent}_{\mu}(f)}$, using the notation from the previous post. Furthermore, since the relative entropy between two probability measures is always non-negative, ${D_{\mu}(f\,\|\,g_t) \geq 0}$ for all ${t \geq 0}$.
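A sketch of the potential in the same array representation (the convention ${0 \log 0 = 0}$ handles points where ${h}$ vanishes):

```python
import numpy as np

# Sketch: relative entropy D_mu(h || g) between two densities w.r.t. mu,
# with the convention 0 * log 0 = 0 at points where h vanishes.

def rel_entropy(mu, h, g):
    mask = (mu * h) > 0
    return float(np.sum(mu[mask] * h[mask] * np.log(h[mask] / g[mask])))

mu = np.array([0.25, 0.25, 0.5])
f = np.array([2.0, 1.0, 0.5])            # E_mu[f] = 0.5 + 0.25 + 0.25 = 1
ent_f = rel_entropy(mu, f, np.ones(3))   # D_mu(f || 1) = Ent_mu(f)
```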

Now a simple calculation (using ${\frac{d}{dt} \log g_t = \varphi_t - \langle \varphi_t, g_t\rangle}$ and ${\mathop{\mathbb E}_{\mu}[f]=1}$) gives

$\displaystyle \frac{d}{dt} D_{\mu}(f\,\|\,g_t) = \langle \varphi_t, g_t-f\rangle\,. \ \ \ \ \ (1)$

In other words, as long as the constraint ${\langle \varphi_t, g_t\rangle \geq \langle \varphi_t, f\rangle - \varepsilon}$ is violated by ${g_t}$, the potential is decreasing at rate at least ${\varepsilon}$!

Since the potential starts at ${\mathrm{Ent}_{\mu}(f)}$ and must always be non-negative, if we choose at every time ${t}$ a violated test ${\varphi_t}$, then after time ${T \leq \frac{\mathrm{Ent}_{\mu}(f)}{\varepsilon}}$, it must be that ${g_t \in P_{\varepsilon}(f,\mathcal T)}$ for some ${t \in [0,T]}$.

1.4. A sparse solution

But our goal in Chang’s Lemma was to find a “sparse” ${g \in P_{\varepsilon}(f,\mathcal T)}$. In other words, we want ${g}$ to be built out of only a few constraints. To accomplish this, we should have ${\varphi_t}$ switch between different tests as little as possible.

So when we find a violated test ${\varphi \in \mathcal T}$, let’s keep ${\varphi_t = \varphi}$ until ${\langle \varphi, g_t\rangle = \langle \varphi,f\rangle}$. How fast can the quantity ${\langle \varphi, f - g_t\rangle}$ drop from larger than ${\varepsilon}$ to zero?

Another simple calculation (using ${\varphi_t = \varphi}$) yields

$\displaystyle \frac{d}{dt} \langle \varphi, f - g_t\rangle = - \left\langle \varphi, g_t (\varphi - \langle \varphi,g_t\rangle)\right\rangle = - \langle \varphi^2, g_t \rangle + \langle \varphi, g_t\rangle^2\,.$

This quantity is at least ${- \|g_t\|_1 \cdot \|\varphi\|_{\infty}^2 = -\|\varphi\|_{\infty}^2}$, since ${\|g_t\|_1 = \mathop{\mathbb E}_{\mu}[g_t] = 1}$.

In order to calculate how much the potential drops while focused on a single constraint, we can make the pessimistic assumption that ${\frac{d}{dt} \langle \varphi, f - g_t\rangle = - \|\varphi\|_{\infty}^2}$. (This is formally justified by Grönwall’s inequality.) In this case (recalling (1)), the potential drop is at least

$\displaystyle \frac{1}{\|\varphi\|_{\infty}^2} \int_0^{\varepsilon} x \,dx = \frac12 \frac{\varepsilon^2}{\|\varphi\|_{\infty}^2}\,.$

Since the potential can drop by at most ${\mathrm{Ent}_{\mu}(f)}$ overall, this bounds the number of steps of our algorithm, yielding the following result.

Theorem 1 For every ${\varepsilon > 0}$, there exists a function ${g \in P_{\varepsilon}(f,\mathcal T) \cap Q_{\mu}}$ such that for some ${\{\lambda_i \geq 0\}}$ and ${\{\varphi_i\} \subseteq \mathcal T}$

$\displaystyle g = \frac{\exp\left(\sum_{i=1}^k \lambda_i \varphi_i\right)}{\mathop{\mathbb E}_{\mu}\left[\exp\left(\sum_{i=1}^k \lambda_i \varphi_i\right)\right]}\,,$

for some value

$\displaystyle k \leq 2 \frac{\mathrm{Ent}_{\mu}(f)}{\varepsilon^2} \sup_{\varphi \in \mathcal T} \|\varphi\|_{\infty}^2\,.$

In order to prove Lemma 3 from the preceding post, simply use the fact that our tests were characters (and their negations), and these satisfy ${\|\chi_{\alpha}\|_{\infty} \leq 1}$.
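Putting the pieces together, here is a discrete-time sketch of the whole procedure. The step size `eta`, the scan order over the tests, and the array representation are all our choices for illustration; the argument above is in continuous time.

```python
import numpy as np

# Sketch of the dual coordinate descent: while some test phi is violated
# by more than eps, keep averaging that same phi into g until <phi, g>
# catches up to <phi, f>, then rescan for another violated test.

def gibbs(mu, S):
    w = np.exp(S - S.max())
    return w / np.sum(mu * w)

def coordinate_descent(mu, f, tests, eps, eta=0.01, max_steps=100000):
    S = np.zeros_like(mu)
    used = []                           # tests we switched to, in order
    gap = lambda g, phi: np.sum(mu * phi * (f - g))
    for _ in range(max_steps):
        g = gibbs(mu, S)
        viol = [i for i, phi in enumerate(tests) if gap(g, phi) > eps]
        if not viol:
            return g, used              # g is in P_eps(f, T) and in Q_mu
        i = viol[0]
        used.append(i)
        while gap(gibbs(mu, S), tests[i]) > 0 and max_steps > 0:
            S = S + eta * tests[i]      # stay on phi_i until its gap hits 0
            max_steps -= 1
    raise RuntimeError("did not converge")

mu = np.array([0.5, 0.5])
f = np.array([1.6, 0.4])                # a density: E_mu[f] = 1
phi = np.array([1.0, -1.0])             # a {-1,+1}-valued test, as for characters
g, used = coordinate_descent(mu, f, [phi, -phi], eps=0.1)
```

On this toy instance a single test suffices, consistent with the bound on ${k}$ in Theorem 1, and the output ${g}$ has the Gibbs form of the theorem by construction.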

1.5. Bibliographic notes

Mohit Singh has pointed out to me the similarity of this approach to the Frank-Wolfe algorithm. The use of the potential ${D(f \,\|\,g_t)}$ is a common feature of algorithms that go by the names “boosting” or “multiplicative weights update” (see, e.g., this survey).