Entropy optimality: A potential function

Our goal now is to prove Lemma 3 from the last post, thereby completing the entropy-maximization proof of Chang’s Lemma. We will prove it in greater generality, since the argument shows how relative entropy provides a powerful potential function for measuring progress.

Let {X} be a finite set equipped with a measure {\mu} . We use {L^2(X,\mu)} to denote the space of real-valued functions {f : X \rightarrow \mathbb R} equipped with the inner product {\langle f,g\rangle = \sum_{x \in X} \mu(x) f(x) g(x)} . Let {Q_{\mu} \subseteq L^2(X,\mu)} denote the convex polytope of probability densities with respect to {\mu} :

\displaystyle  Q_{\mu} = \left\{ g \in L^2(X,\mu) : g \geq 0, \mathop{\mathbb E}_{\mu}[g]=1 \right\}\,.

1.1. A ground truth and a family of tests

Fix now a density {f \in Q_{\mu}} and a family {\mathcal T \subseteq L^2(X,\mu)} that one can think of as “tests” (or properties of a function we might care about). Given an error parameter {\varepsilon > 0} , we define a convex polytope {P_{\varepsilon}(f, \mathcal T) \subseteq L^2(X,\mu)} as follows:

\displaystyle  P_{\varepsilon}(f,\mathcal T) = \left\{ g : \langle \varphi, g \rangle \geq \langle \varphi,f\rangle - \varepsilon \quad \forall \varphi \in \mathcal T\right\}\,.

This is the set of functions that have “performance” similar to that of {f} on all the tests in {\mathcal T} . Note that the tests are one-sided; if we wanted two-sided bounds for some {\varphi} , we could just add {-\varphi} to {\mathcal T} .

1.2. (Projected) coordinate descent in the dual

We now describe a simple algorithm to find a point {g \in P_{\varepsilon}(f,\mathcal T) \cap Q_{\mu}} . It will be precisely analogous to the algorithm we saw in the previous post for the natural polytope coming from Chang’s Lemma. We define a family of functions {\{g_t : t \geq 0\} \subseteq Q_{\mu}} indexed by a continuous time parameter:

\displaystyle  g_t = \frac{\exp\left(\int_0^t \varphi_s\,ds\right)}{\mathop{\mathbb E}_{\mu}\left[\exp\left(\int_0^t \varphi_s\,ds\right)\right]}\,.

Here {\varphi_t \in \mathcal T} is some test we have yet to specify. Intuitively, it will be a test that {g_t} is not performing well on. The idea is that at time {t} , we exponentially average some of {\varphi_t} into {g_t} . The exponential ensures that {g_t} is non-negative, and our normalization ensures that {g_t \in Q_{\mu}} . Observe that {g_0 = \mathbf{1}} is the constant {1} function.
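To make the dynamics concrete, here is a minimal discrete-time sketch in Python (my own, not from the post): a small step size dt stands in for the continuous time parameter, and choose_test is a placeholder oracle returning a test on which the current density performs poorly.

    import numpy as np

    def density(mu, log_weight):
        """Return g proportional to exp(log_weight), normalized so E_mu[g] = 1."""
        w = np.exp(log_weight)
        return w / np.dot(mu, w)

    def evolve(mu, choose_test, T, dt=1e-3):
        """Euler discretization of g_t: accumulate dt * phi_t in the exponent."""
        log_weight = np.zeros(len(mu))   # g_0 is the constant 1 function
        for _ in range(int(T / dt)):
            phi = choose_test(density(mu, log_weight))
            log_weight += dt * phi
        return density(mu, log_weight)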


1.3. The potential function

For two densities {h,g \in Q_{\mu}} , we define the relative entropy

\displaystyle  D_{\mu}(h\,\|\,g) = \sum_{x \in X} \mu(x) h(x) \log \frac{h(x)}{g(x)}\,.

Our goal is to analyze the potential function {D_{\mu}(f \,\|\,g_t)} . This functional measures how far the “ground truth” {f} is from our hypothesis {g_t} .

Note that {D_{\mu}(f \,\|\,g_0) = D_{\mu}(f \,\|\, \mathbf{1}) = \mathrm{Ent}_{\mu}(f)} , using the notation from the previous post. Furthermore, since the relative entropy between two probability measures is always non-negative, {D_{\mu}(f\,\|\,g_t) \geq 0} for all {t \geq 0} .

Now a simple calculation gives

\displaystyle  \frac{d}{dt} D_{\mu}(f\,\|\,g_t) = \langle \varphi_t, g_t-f\rangle\,. \ \ \ \ \ (1)
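(To verify (1), note that {\log g_t = \int_0^t \varphi_s\,ds - \log \mathop{\mathbb E}_{\mu}\left[\exp\left(\int_0^t \varphi_s\,ds\right)\right]} , and the derivative of the normalization term is exactly {\langle \varphi_t, g_t\rangle} . Hence

\displaystyle  \frac{d}{dt} D_{\mu}(f\,\|\,g_t) = -\frac{d}{dt} \mathop{\mathbb E}_{\mu}\left[f \log g_t\right] = -\langle \varphi_t, f\rangle + \langle \varphi_t, g_t\rangle = \langle \varphi_t, g_t-f\rangle\,.)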

In other words, as long as the constraint {\langle \varphi_t, g_t\rangle \geq \langle \varphi_t, f\rangle - \varepsilon} is violated by {g_t} , the potential is decreasing at rate at least {\varepsilon} !

Since the potential starts at {\mathrm{Ent}_{\mu}(f)} and must always remain non-negative, if we choose at every time {t} a violated test {\varphi_t} , then by time {T = \frac{\mathrm{Ent}_{\mu}(f)}{\varepsilon}} it must be that {g_t \in P_{\varepsilon}(f,\mathcal T)} for some {t \in [0,T]} .

1.4. A sparse solution

But our goal in Chang’s Lemma was to find a “sparse” {g \in P_{\varepsilon}(f,\mathcal T)} . In other words, we want {g} to be built out of only a few constraints. To accomplish this, we should have {\varphi_t} switch between different tests as little as possible.

So when we find a violated test {\varphi \in \mathcal T} , let’s keep {\varphi_t = \varphi} until {\langle \varphi, g_t\rangle = \langle \varphi,f\rangle} . How fast can the quantity {\langle \varphi, f-g_t\rangle} drop from larger than {\varepsilon} to zero?

Another simple calculation (using {\varphi_t = \varphi} ) yields

\displaystyle  \frac{d}{dt} \langle \varphi, f-g_t\rangle = - \left\langle \vphantom{\bigoplus} \varphi, g_t (\varphi - \langle \varphi,g_t\rangle)\right\rangle = - \langle \varphi^2, g_t \rangle + \langle \varphi, g_t\rangle^2\,.

This quantity is at least {- \|g_t\|_1 \cdot \|\varphi\|_{\infty}^2 = -\|\varphi\|_{\infty}^2} .

In order to calculate how much the potential drops while focused on a single constraint, we can make the pessimistic assumption that {\frac{d}{dt} \langle \varphi, f-g_t\rangle = - \|\varphi\|_{\infty}^2} . (This is formally justified by Grönwall’s inequality.) While the gap {\langle \varphi, f-g_t\rangle} equals {x} , (1) says the potential is decreasing at rate {x} ; integrating {x} as the gap falls from {\varepsilon} to {0} shows that the potential drop is at least

\displaystyle  \frac{1}{\|\varphi\|_{\infty}^2} \int_0^{\varepsilon} x \,dx = \frac12 \frac{\varepsilon^2}{\|\varphi\|_{\infty}^2}\,.

Since the potential can drop by at most {\mathrm{Ent}_{\mu}(f)} overall, this bounds the number of steps of our algorithm, yielding the following result.

Theorem 1 For every {\varepsilon > 0} , there exists a function {g \in P_{\varepsilon}(f,\mathcal T) \cap Q_{\mu}} such that for some {\{\lambda_i \geq 0\}} and {\{\varphi_i\} \subseteq \mathcal T}

\displaystyle  g = \frac{\exp\left(\sum_{i=1}^k \lambda_i \varphi_i\right)}{\mathop{\mathbb E}_{\mu}\left[\exp\left(\sum_{i=1}^k \lambda_i \varphi_i\right)\right]}\,,

for some value

\displaystyle  k \leq 2 \frac{\mathrm{Ent}_{\mu}(f)}{\varepsilon^2} \sup_{\varphi \in \mathcal T} \|\varphi\|_{\infty}^2\,.

In order to prove Lemma 3 from the preceding post, simply use the fact that our tests were the real and imaginary parts of characters (and their negations), and these all satisfy {\|\varphi\|_{\infty} \leq \|\chi_{\alpha}\|_{\infty} \leq 1} .

1.5. Bibliographic notes

Mohit Singh has pointed out to me the similarity of this approach to the Frank-Wolfe algorithm. The use of the potential {D(f \,\|\,g_t)} is a common feature of algorithms that go by the names “boosting” or “multiplicative weights update” (see, e.g., this survey).

Entropy optimality: Chang’s Lemma

In the last post, we saw how minimizing the relative entropy allowed us to obtain a “simple” feasible point of a particular polytope. We continue exploring this theme by giving a related proof of Chang’s Lemma, a Fourier-analytic result that arose in additive combinatorics. Our proof will reveal another aspect of entropy optimality: “Projected sub-gradient descent” and a corresponding convergence analysis.

1.1. Discrete Fourier analysis and Chang’s Lemma

Fix {n \geq 1} and a prime {p} . Let {\mathbb F_p} denote the finite field with {p} elements. We will work over {\mathbb F_p^n} . For convenience, denote {G=\mathbb F_p^n} . Let {\mu} denote the uniform measure on {G} , and for a non-negative function {f : G \rightarrow [0,\infty)} , define the relative entropy

\displaystyle  \mathrm{Ent}_{\mu}(f) = \mathbb E_{\mu} \left[f \log \frac{f}{\mathbb E_{\mu} f}\right]\,,

where we use the shorthand {\mathop{\mathbb E}_{\mu} [f] = \sum_{x \in G} \mu(x) f(x)} . If {\mathop{\mathbb E}_{\mu} [f] = 1} , we say that {f} is a density (with respect to {\mu} ). In that case, {\mathrm{Ent}_{\mu}(f) = D(f \mu \,\|\,\mu)} where {D} denotes the relative entropy (introduced in the last post).

We briefly review some notions from discrete Fourier analysis. One could skip this without much loss; our proof will only use boundedness of the Fourier characters and no other features. For functions {f,g : G \rightarrow \mathbb C} , we will use the inner product {\langle f,g\rangle = \mathop{\mathbb E}_{\mu}[f\bar g]} and the norms {\|f\|_p = (\mathop{\mathbb E}_{\mu}[|f|^p])^{1/p}} for {p \geq 1} .

For an element {\alpha \in G} , we use

\displaystyle  \chi_{\alpha}(x) = \exp\left(\frac{-2\pi i}{p} \sum_{j=1}^n \alpha_j x_j\right)

to denote the corresponding Fourier character. Now every {f : G \rightarrow \mathbb C} can be written in the Fourier basis as

\displaystyle  f = \sum_{\alpha \in G} \hat f(\alpha) \chi_{\alpha}\,,

where {\hat f(\alpha) = \langle f,\chi_{\alpha}\rangle} . (For simplicity of notation, we have implicitly identified {G} and {\hat G} .)
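To make the definitions concrete, here is a brute-force computation of the Fourier expansion in Python (a toy sketch of my own; it is exponential in {n} , so only for tiny {p} and {n} ):

    import numpy as np
    from itertools import product

    p, n = 3, 2
    G = [np.array(x) for x in product(range(p), repeat=n)]
    N = len(G)  # p ** n

    def chi(alpha, x):
        return np.exp(-2j * np.pi * np.dot(alpha, x) / p)

    f = np.random.rand(N)  # an arbitrary real-valued function on G

    # fhat(alpha) = <f, chi_alpha> = E_mu[f * conj(chi_alpha)] with mu uniform
    fhat = [np.mean([f[i] * np.conj(chi(a, G[i])) for i in range(N)]) for a in G]

    # sanity check: f = sum_alpha fhat(alpha) * chi_alpha
    recon = [sum(fhat[j] * chi(G[j], x) for j in range(N)) for x in G]
    assert np.allclose(recon, f)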

We will be interested in the large Fourier coefficients of a function. To that end, for {\eta > 0} , we define

\displaystyle  \Delta_{\eta}(f) = \left\{ \alpha \in G : |\hat f(\alpha)| > \eta \|f\|_1 \right\}\,.

Now we are ready to state Chang’s Lemma; it asserts that if {\mathrm{Ent}_{\mu}(f)} is small, then all of the large Fourier coefficients of {f} are contained in a low-dimensional subspace.

Lemma 1 If {f : G \rightarrow [0,\infty)} is a density, then for every {\eta > 0} the set {\Delta_{\eta}(f)} is contained in an {\mathbb F_p} -subspace of dimension at most {d} , where

\displaystyle  d \leq 2 \sqrt{2} \frac{\mathrm{Ent}_{\mu}(f)}{\eta^2}\,.

In the special case {p=2} , the proof will actually give constant {2} instead of {2 \sqrt{2}} .

1.2. A slightly stronger statement

In fact, we will prove the following slightly stronger statement.

Theorem 2 If {f : G \rightarrow [0,\infty)} is a density, then for every {\eta > 0} there exists a density {g : G \rightarrow [0,\infty)} such that {\Delta_{\eta}(f) \subseteq \Delta_0(g)} and moreover {g(x) = \Psi(\chi_{\alpha_1}(x), \ldots, \chi_{\alpha_d}(x))} for some {d \leq 1+2\sqrt{2} \frac{\mathrm{Ent}_{\mu}(f)}{\eta^2}} , some {\{\alpha_i\} \subseteq G} , and some function {\Psi : \mathbb C^d \rightarrow \mathbb R} . In other words, {g} is a function of at most {d} characters.

It is an exercise in linear algebra to see that for such a density {g} , the set of non-zero Fourier coefficients {\Delta_0(g)} is contained in the span of {\alpha_1, \ldots, \alpha_d} . (Indeed, each {\chi_{\alpha_i}} takes values in the {p} -th roots of unity, so on the range of {(\chi_{\alpha_1}, \ldots, \chi_{\alpha_d})} the function {\Psi} agrees with a polynomial, and every monomial {\chi_{\alpha_1}^{c_1} \cdots \chi_{\alpha_d}^{c_d} = \chi_{c_1 \alpha_1 + \cdots + c_d \alpha_d}} is a character indexed by an element of the span.) Since {\Delta_{\eta}(f) \subseteq \Delta_0(g)} , this yields Lemma 1.

1.3. Entropy optimality

Let {L^2(G)} denote the Hilbert space of real-valued functions on {G} equipped with the inner product {\langle \cdot, \cdot\rangle} . Assume now that {f : G \rightarrow [0,\infty)} is a density and fix {\eta > 0} .

Denote by {\mathcal F} the family of functions of the form {\pm \mathrm{Re}\,\chi_\alpha} or {\pm \mathrm{Im}\, \chi_\alpha} for some {\alpha \in G} .

Let us associate to {f} the convex polytope {P(f, \eta) \subseteq L^2(G)} of functions {g : G \rightarrow [0,\infty)} satisfying

\displaystyle  \langle g,\varphi\rangle \geq \langle f,\varphi\rangle - \eta

for all {\varphi \in \mathcal F} .

Just as in the matrix scaling setting, {P(f,\eta)} encapsulates the family of “tests” that we care about (which Fourier coefficient magnitudes are large). Note that if {g \in P(f,\eta/\sqrt{2})} , then {\Delta_{\eta}(f) \subseteq \Delta_0(g)} . We also let {Q} denote the probability (density) simplex: The set of all {g : G \rightarrow [0,\infty)} such that {\mathop{\mathbb E}_{\mu} g = 1} .

Of course, {P(f,\eta) \cap Q} is non-empty because {f \in P(f,\eta) \cap Q} . As before, we will try to find a “simple” representative from {P(f,\eta) \cap Q} by doing entropy maximization (which is the same thing as relative entropy minimization):

\displaystyle  \textrm{minimize } \mathrm{Ent}_{\mu}(g) = \sum_{x \in G} \mu(x) g(x) \log g(x) \qquad \textrm{ subject to } g \in P(f,\eta) \cap Q\,.

1.4. Sub-gradient descent

Soon we will explore this optimization using convex duality. But in this post, let us give an algorithm that tries to locate a feasible point of {P(f,\eta) \cap Q} by a rather naive form of (sub-)gradient descent.

We start with the initial point {g_0 = \mathbf{1}} which is the function that is identically {1} on all of {G} . Note that by strict convexity, this is the unique maximizer of {\mathrm{Ent}_{\mu}(g)} among all {g \in Q} . But clearly it may be that {g_0 \notin P(f,\eta)} .

So suppose we have a function {g_t \in Q} for some {t \geq 0} . If {g_t \in P(f,\eta)} , we are done. Otherwise, there exists a violated constraint: Some {\varphi_t \in \mathcal F} such that {\langle g_t, \varphi_t\rangle < \langle f,\varphi_t\rangle - \eta} . Our goal is now to update {g_t} to {g_{t+1} \in Q} that doesn't violate this constraint quite as much.

Notice the dichotomy being drawn here between the “soft” constraints of {P(f,\eta)} (which we try to satisfy slowly) and the “hard” constraints of {Q} that we are enforcing at every step. (This procedure and the analysis that follows are a special case of the mirror descent framework; see, for example, Section 4 of Bubeck’s monograph. In particular, there is a more principled way to design the algorithm that follows.)

We might try to write {g_{t+1} = g_t + \varepsilon \varphi_t} for some small {\varepsilon > 0} , but then {g_{t+1}} could take negative values! Instead, we will follow the approximation {e^x \approx 1+x} (for {x} small) and make the update

\displaystyle  g_{t+1} = \frac{g_t \exp\left(\varepsilon \varphi_t\right)}{\mathbb E_{\mu}[g_t \exp\left(\varepsilon \varphi_t\right)]}\,.

Observe that the exponential weight update keeps {g_{t+1}} positive, and our explicit renormalization ensures that {\mathop{\mathbb E}_{\mu} [g_{t+1}]=1} . Thus {g_{t+1} \in Q} .
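Here is a minimal simulation of the whole loop for {p=2} , where the characters {\chi_{\alpha}(x) = (-1)^{\sum_i \alpha_i x_i}} are real-valued and no splitting into real and imaginary parts is needed. This is my own sketch, not code from the post; the step size eps is a placeholder, since Lemma 3 below is what dictates an appropriate choice.

    import numpy as np
    from itertools import product

    def chang_descent(f, n, eta, eps=0.05, max_iter=100_000):
        """Search for g in P(f, eta) by multiplicative updates (p = 2 case).
        f is a density on F_2^n: a vector of length 2**n with mean 1."""
        G = np.array(list(product([0, 1], repeat=n)))
        N = len(G)
        chars = (-1.0) ** (G @ G.T)          # row alpha is chi_alpha evaluated on G
        tests = np.vstack([chars, -chars])   # the one-sided tests +/- chi_alpha
        targets = tests @ f / N              # <f, phi> for every test phi
        g, used = np.ones(N), []
        for _ in range(max_iter):
            gaps = targets - tests @ g / N   # how violated each constraint is
            t = int(np.argmax(gaps))
            if gaps[t] <= eta:
                return g, used               # every test satisfied: g is in P(f, eta)
            used.append(t)
            g = g * np.exp(eps * tests[t])   # exponential weight update
            g = g / g.mean()                 # renormalize so that E_mu[g] = 1
        raise RuntimeError("no convergence; the step size likely needs tuning")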

Of course, one might find this whole procedure fishy. Sure, we have improved the constraint corresponding to {\varphi_t} , but maybe we messed up some earlier constraints that we had satisfied. While this may happen, it cannot go on for too long. In the next post, we will prove the following.

Lemma 3 For an appropriate choice of step size {\varepsilon > 0} , it holds that {g_T \in P(f,\eta)} for some {T \leq 2 \frac{\mathrm{Ent}_{\mu}(f)}{\eta^2}} .

Notice that this lemma proves Theorem 2 and thus finishes our proof of Chang’s Lemma. That’s because, applying the lemma with {\eta/\sqrt{2}} in place of {\eta} , we get {g_T \in P(f,\eta/\sqrt{2})} , and hence {\Delta_{\eta}(f) \subseteq \Delta_0(g_T)} . Moreover, {g_T} is built out of at most {T+1} characters {\chi_{\alpha_0}, \chi_{\alpha_1}, \ldots, \chi_{\alpha_T}} , and thus it satisfies the conclusion of Theorem 2.

For the case {p=2} , the characters are real-valued, so we do not have to approximate the real and imaginary parts separately; this saves a factor of {\sqrt{2}} .

1.5. Bibliographic notes

This argument bears some similarity to the proof of Impagliazzo, Moore, and Russell. It was discovered in a joint work with Chan, Raghavendra, and Steurer on extended formulations. In both those arguments, the “descent” is implicit and is done by hand—the characters are carefully chosen not to interact. It turns out that this level of care is unnecessary. The analysis in the next post will allow the descent to choose any violated constraint at any time, and moreover will only use the fact that the characters are bounded.

Entropy optimality: Matrix scaling

This is the first in a series of posts on the surprising power of “entropy optimality.” I have recently encountered many incarnations of this principle in a number of different settings (functional analysis, stochastic processes, additive number theory, communication complexity, etc.). It is also connected to many well-studied phenomena in machine learning, optimization, and statistical physics. In the spirit of blogs, my goal is for each post to highlight a single interesting aspect of the theory.

For the first post, we’ll look at a very simple but compelling example: matrix scaling. This is a problem that arises in statistics, numerical analysis, and a number of other areas (see Section 1.2 here for references).

Suppose we are given an {n \times n} target matrix {T=(t_{ij})} with nonnegative entries. We also have an {n \times n} input matrix {X=(x_{ij})} and our goal is to multiply the rows and columns of {X} by positive weights so that the resulting matrix has the same row and column sums as {T} .

Stated differently, we look for diagonal matrices {D_1} and {D_2} such that {D_1 X D_2} has the same row and column sums as {T} . A typical choice of target might be {t_{ij} = 1/n} for all {1 \leq i,j \leq n} , meaning that we want to rescale {X} to be doubly stochastic.

Theorem 1 If {x_{ij} = 0 \iff t_{ij} = 0} for all {1 \leq i,j \leq n} , then such a weighting exists.

To prove the theorem, we will search for a matrix {Y = (y_{ij})} with the same row and column sums as {T} , i.e. such that for all {i,j \in [n]} :

\displaystyle \sum_{i} y_{ij}=\sum_i t_{ij} \quad \textrm{ and } \quad \sum_j y_{ij} = \sum_j t_{ij}\,. \ \ \ \ \ (1)

Clearly such a matrix exists; we could just take {Y=T} ! But the idea is that we want {Y}  to be a “simple perturbation” of {X} . We will follow a philosophy analogous to Jaynes’ principle of maximum entropy. What results is precisely the argument of Franklin and Lorenz.

1.1. The relative entropy

Given two probability measures {\mu} and {\nu} on a finite set {S} , recall that the relative entropy between them is given by

\displaystyle D(\mu\,\|\,\nu) = \sum_{x \in S} \mu(x) \log \frac{\mu(x)}{\nu(x)}\,.

We take this quantity to be infinite unless {\nu(x) = 0 \implies \mu(x)=0} . A fundamental property of {D} is that it is jointly convex in its arguments. For the following, we will only need that {D} is convex in {\mu} which is a simple consequence of the fact that {y \mapsto y \log y} is a convex function on {\mathbb R_+} .

In proving Theorem 1, we may clearly assume that {\sum_{ij} x_{ij} = \sum_{ij} t_{ij} = 1} . Now let {P(T) \subseteq \mathbb R^{n^2}} denote the affine subspace of all {Y} satisfying the constraints (1). We will consider the optimization problem:

\displaystyle \textrm{minimize } D(Y \,\|\, X) = \sum_{ij} y_{ij} \log \frac{y_{ij}}{x_{ij}} \quad \textrm{ subject to } Y \in P(T)\quad (2)

The idea is that the objective function prefers {Y} to be as close to {X} as possible in terms of relative entropy while still satisfying the constraints. Philosophically, this is supposed to encourage {Y} to be “simple” with respect to {X} .

Observe that, by our assumption on {X} , it follows that {T \in P(T)} is a point satisfying {D(T \,\|\,X) < \infty} . Note that since {P(T)} is a convex, compact set, we know that there is an optimizer, and since {Y \mapsto D(Y \,\|\,X)} is strictly convex, the optimal solution {Y^*} is unique. It will turn out that {Y^*} is of the form {D_1 X D_2} , and this fact will prove our theorem. The last step is to understand the structure of {Y^*} in terms of the Lagrangian.

1.2. The Lagrangian

Let us introduce {2n} dual variables {\{ \alpha_i, \beta_j \in \mathbb R : i,j \in [n] \}} and consider the Lagrangian (see the Boyd-Vandenberghe book) corresponding to the optimization (2):

\displaystyle \sum_{ij} y_{ij} \log \frac{y_{ij}}{x_{ij}} + \sum_i \alpha_i \sum_j (t_{ij}-y_{ij}) + \sum_j \beta_j \sum_i (t_{ij}-y_{ij})\,.

At optimality, the gradient should be equal to zero. If we differentiate with respect to some {y_{ij}} , we see that

\displaystyle 1+\log y_{ij} + \alpha_i + \beta_j - \log x_{ij} = 0\,,

which implies that, at optimality,

\displaystyle y^*_{ij} = x_{ij}\, e^{-\alpha^*_i-\beta^*_j-1}\,.

In particular, we see that {Y^*} is obtained from {X} by multiplying the rows by {{e^{-\alpha_i^*-1}}} and multiplying the columns by {{e^{-\beta_j^*}}} , completing the proof of Theorem 1.
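The proof above establishes existence but does not itself give an algorithm. A classical way to actually compute the scaling is the alternating (Sinkhorn) iteration, which repeatedly fixes the row sums and then the column sums. A minimal sketch of my own (tolerance and iteration cap are arbitrary choices, and convergence is only expected under the support condition of Theorem 1):

    import numpy as np

    def sinkhorn(X, T, tol=1e-10, max_iter=100_000):
        """Alternately rescale rows and columns of X until it has the
        row and column sums of T (assumes X and T have the same support)."""
        r, c = T.sum(axis=1), T.sum(axis=0)
        Y = np.array(X, dtype=float)
        for _ in range(max_iter):
            Y *= (r / Y.sum(axis=1))[:, None]   # fix the row sums
            Y *= (c / Y.sum(axis=0))[None, :]   # fix the column sums
            if np.abs(Y.sum(axis=1) - r).max() < tol:
                break
        return Y

    X = np.array([[1.0, 2.0], [3.0, 4.0]])
    Y = sinkhorn(X, np.full((2, 2), 0.5))   # rescale X to be doubly stochastic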

1.3. Being careful about optimality

An astute reader (or, in this case, commenter) might point out that we did not use the condition {t_{ij} = 0 \implies x_{ij}=0} . But clearly it is necessary, as one can see from examining the pair

\displaystyle  T = \left(\begin{array}{cc} 0 & 1 \\ 1 & 0 \end{array}\right) \quad X = \left(\begin{array}{cc} 1 & 1 \\ 1 & 0 \end{array}\right)\,.

See these lecture notes for a detailed account.

What does Kadison-Singer have to do with Quantum Mechanics?

Most accounts of the Kadison-Singer problem mention that it arose from considerations in the foundations of quantum mechanics. Specifically, a question about “extensions of pure states.” See, for instance, Gil’s blog. But trying to reconcile this question with my (limited) knowledge of QM proved frustrating. There is also an intuitive explanation posted at soulphysics, along with mathematical notes. But I couldn’t figure that one out either, because it only deals with “normal” quantum states (those corresponding to countably additive measures; see below). I think the KS problem for normal states is trivial. For the meat of KS, one has to think not only about infinite-dimensional spaces, but about “pathological” (non-constructible) measures. In particular, one has to think about “quantum states” that don’t have density matrices.

What follows are my notes on the matter, all of which are either well-known or wrong. Skip to the end for the very simple formulation with no mention of quantum states. Even better advice is to just read Nick Harvey’s survey.

How I understand quantum measurement

A (mixed) quantum state {\rho} on the space {\mathbb C^n} is an {n \times n} Hermitian matrix satisfying {\rho \succeq 0} ({\rho} is positive semi-definite) and {\mathrm{Tr}(\rho)=1} . Suppose we also have a resolution of the identity: A collection {\mathcal M = \{M_1, M_2, \ldots, M_k\}} of PSD matrices with {\sum_{j=1}^k M_j = I} . Then when we perform the measurement {\mathcal M} , the probability of obtaining outcome {j} is {\mathrm{Tr}(\rho M_j)} for {j=1,2,\ldots,k} .
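As a toy numerical illustration of the measurement rule (my own, with arbitrary numbers):

    import numpy as np

    psi = np.array([1.0, 1.0j]) / np.sqrt(2)        # a unit vector in C^2
    rho = np.outer(psi, psi.conj())                 # the pure state psi psi*

    # a two-outcome resolution of the identity (measurement in the standard basis)
    M = [np.diag([1.0, 0.0]), np.diag([0.0, 1.0])]
    probs = [np.trace(rho @ Mj).real for Mj in M]   # Tr(rho M_j)
    assert np.isclose(sum(probs), 1.0)              # since sum_j M_j = I and Tr(rho) = 1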

A pure state is given by a density matrix of the form {\rho = \psi \psi^*} for some unit vector {\psi \in \mathbb C^n} . Equivalently, a state is pure if and only if it cannot be written as a non-trivial convex combination of other states, i.e. if {\rho = \lambda \rho_1 + (1-\lambda) \rho_2} for {\lambda \in (0,1)} , then {\rho=\rho_1=\rho_2} . The pure states are the extreme points of the convex set of all states.

Quantum logic

There is another way to think about quantum measurement. Let {\mathcal H} be a (separable) Hilbert space and consider a projection {P} onto some closed subspace {S \subseteq \mathcal H} . A measure {\mu} on projections assigns to every such projection a value in {[0,1]} . We say that {\mu} is {\sigma} -additive if whenever {Q = \sum_{j=1}^{\infty} P_j} and the projections {\{P_j\}} are mutually orthogonal, we have {\mu(Q) = \sum_{j=1}^{\infty} \mu(P_j)} . If this only holds for finite sums of projections, we say that {\mu} is finitely additive. We should also have {\mu(I)=1} .

Note that we have allowed here the possibility that {\mathcal H} is infinite-dimensional. In the case where the measure {\mu} is {\sigma} -additive, it is completely specified by its value at one-dimensional projections, i.e. the values {\mu(P_x)} where {P_x} denotes projection onto {x \in \mathcal H} . Now we can state Gleason’s theorem: Our two notions of quantum measurement coincide.

Theorem 1 Let {\mathcal H} be a Hilbert space of dimension at least 3 and let {\mu} be a {\sigma} -additive measure on projections. Then there exists a unique linear operator {\rho} such that for any projection {P} , we have {\mu(P) = \mathrm{Tr}(\rho P)} .

In other words, every {\sigma} -additive measure on projections arises from a mixed state.

Restricted purity

Note that a measure on projections corresponds to a pure state if and only if it cannot be written as a non-trivial convex combination of measures on projections. In this way, we can also define purity with respect to a restricted family of projections.

Consider, for example, the Hilbert space {\ell^2(\mathbb N)} of complex sequences, and let {\{e_j : j \in \mathbb N\}} denote the standard basis. Suppose we consider only the projections {\mathcal P = \{P_j\}} onto each of the basis elements. Say that a {\mathcal P} -measure is a measure on all projections of the form {\sum_{i \in A} P_i} for {A \subseteq \mathbb N} . Such a measure only gives us probabilities for measurements performed in the standard basis. (The algebra generated by {\mathcal P} is a canonical example of a “maximal abelian subalgebra,” i.e. MASA.) We can say that a {\mathcal P} -measure is pure if it cannot be written as a non-trivial convex combination of other {\mathcal P} -measures. What are the pure {\mathcal P} -measures? Well, the only {\sigma} -additive pure {\mathcal P} -measures are the point measures {\mu(P_j)=1} for some {j} . If {\mu(P_j), \mu(P_{j'}) > 0} for {j \neq j'} , then by {\sigma} -additivity, we can split {\mu} into a convex combination of two measures (one supported only on {P_j} and the other supported off {P_j} ).

But when {\sigma} -additivity is dropped, some tricky things can happen (see the next section). It turns out that, in this case, there are more pure states than just {e_j e_j^T} (in fact, the cardinality of the finitely-additive pure states is gigantic, {2^{2^{\aleph_0}}} ). Now we are in position to state the Kadison-Singer problem.

Kadison-Singer Problem: Is it the case that every pure, finitely-additive {\mathcal P} -measure has a unique extension to a finitely-additive measure on all the projections of {\mathcal H} ?

The answer is clearly positive for {\sigma} -additive measures (in which case the unique extension corresponds to {e_j e_j^T} ). The answer is also positive for finitely-additive {\mathcal P} -measures, as recently proved by Marcus, Spielman, and Srivastava.

In their original work, Kadison and Singer showed that the answer is negative if one constructs the MASA using the continuous Hilbert space {L^2([0,1])} instead of {\ell^2(\mathbb N)} . One can take the subalgebra {\mathcal C} of multiplication operators {T_g (f)=gf} for {\|g\|_{\infty} < \infty} . Here the corresponding family of projections are the maps {P_S f = \mathbf{1}_S f} for {S \subseteq [0,1]} . Unlike in the case above, this subalgebra is not “discrete.” It turns out that every maximal abelian subalgebra of bounded operators on a separable Hilbert space is either finite-dimensional, the discrete one, the continuous one, or a direct sum of algebras isomorphic to these.

Note that states corresponding to finitely additive (but not {\sigma} -additive) measures do not have density matrices and are often considered to be “unphysical” because they would take an infinite amount of energy to prepare.

A {\beta} -world teeming with pure states

In the above setting, we said there are finitely-additive pure {\mathcal P} -measures {\mu} other than point measures. To see what one of these must look like, consider the following construction. Let {P_{\mathrm{even}}} denote the projection onto {\{e_j : j \textrm{ even}\}} and define {P_{\mathrm{odd}}} similarly. Define the measures {\mu_{\mathrm{even}}(P) = \mu(P P_{\mathrm{even}})} and {\mu_{\mathrm{odd}}(P) = \mu(P P_{\mathrm{odd}})} . Then we can write {\mu = \mu(P_{\mathrm{even}}) \mu_{\mathrm{even}} + \mu(P_{\mathrm{odd}}) \mu_{\mathrm{odd}}} . This is a non-trivial convex combination unless either {\mu(P_{\mathrm{even}})=0} or {\mu(P_{\mathrm{odd}})=0} .

Indeed, for every partition {\mathbb N = S \cup \bar S} , a pure {\mathcal P} -measure {\mu} can only put non-zero weight on projections involving coordinates of exactly one of {S} or {\bar S} . Furthermore, if {\mu(\sum_{i \in {\bar S}} P_i) = 0} then {\mu(\sum_{i \in S} P_i)=1} . In other words, on every such projection the measure {\mu} takes only the values 0 or 1. Such finitely-additive measures are exactly given by ultrafilters on {\mathbb N} . The finitely-additive pure {\mathcal P} -measures are in one-to-one correspondence with such ultrafilters, the set of which can be identified with {\beta \mathbb N} , the Stone-Cech compactification of the naturals. See Terry Tao’s notes on ultrafilters and the Stone-Cech compactification.


A simple formulation

Take the Hilbert space {\ell^2 = \ell^2(\mathbb N)} with standard basis {\{e_j : j \in \mathbb N\}} , and let {\mathcal S} denote the set of all its closed subspaces. Consider a map {\mu : \mathcal S \rightarrow [0,1]} . We say that {\mu} is a finitely-additive measure if {\mu(\ell^2)=1} and whenever {S_1, S_2, \ldots, S_k \in \mathcal S} is a finite collection of mutually orthogonal subspaces, we have {\mu(\mathrm{span}(S_1, \ldots, S_k)) = \sum_{i=1}^k \mu(S_i)} . Say that such a measure {\mu} is pure if it cannot be written as a non-trivial convex combination of two finitely-additive measures. A canonical example of a pure measure involves fixing a unit vector {x \in \ell^2} and setting {\mu(S) = \langle x, P_S x\rangle} , where {P_S} denotes the orthogonal projection onto {S} .

Now let {\mathcal S_{\mathrm{diag}}} be the set of all the subspaces of {\ell^2} of the form {\mathrm{span}(e_i : i \in A)} for some subset {A \subseteq \mathbb N} , i.e. all the coordinate subspaces. We can define “finitely-additive” and “pure” for measures on {\mathcal S_{\mathrm{diag}}} in the analogous way.

Kadison-Singer problem: Can every finitely-additive pure measure on {\mathcal S_{\mathrm{diag}}} be extended uniquely to a finitely-additive measure on {\mathcal S} ?

Note that the real question is about uniqueness of the extension. There always exists a pure extension; see the comments.

No frills proof of higher-order Cheeger inequality

Following some recent applications by Mamoru Tanaka and Laurent Miclo, I was asked whether there is a short, no-frills, self-contained, and (possibly) quantitatively non-optimal proof of the higher-order Cheeger inequalities from our paper with Shayan Oveis-Gharan and Luca Trevisan. I thought I would post it here. (If you’re hungering for something new, see this recently posted preprint of my coauthors relating higher eigenvalues to graph expansion.)

[Update: The application of Miclo can also be done using the higher-order Cheeger inequalities of Louis, Raghavendra, Tetali, and Vempala.]

The main simplification comes from doing the random partitioning non-optimally with axis-parallel cubes. For ease of notation, we will deal only with regular graphs, but there will be no quantitative dependence on the degree and this assumption can be removed (see the full paper).

Suppose that {G=(V,E)} is a connected, {n} -vertex, {d} -regular graph. Define the Laplacian by {\mathcal L = I - (1/d)A} , where {A} is the adjacency matrix of {G} . We will think of {\mathcal L} as acting on {\ell^2(V)} , the space of functions {f : V \rightarrow \mathbb R} equipped with the {\ell^2} norm. {\mathcal L} is positive semi-definite with spectrum

\displaystyle  0 = \lambda_1 \leq \lambda_2 \leq \cdots \leq \lambda_{|V|} \leq 2\,.

We define the Rayleigh quotient of a function {f \in \ell^2(V)} by

\displaystyle  \mathcal R(f) = \frac{\sum_{u \sim v} (f(u)-f(v))^2}{d \sum_{u \in V} f(u)^2}\,,

where the numerator is summed over edges of {G} . By the variational principle for eigenvalues, we have

\displaystyle  \lambda_k = \min_{\stackrel{W \subseteq \ell^2(V)}{\dim(W)=k}} \max_{0 \neq f \in W} \mathcal R(f)\,. \ \ \ \ \ (1)
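As a quick sanity check on these definitions, here is a toy computation in Python (my own sketch; the 4-cycle is an arbitrary example):

    import numpy as np

    A = np.array([[0, 1, 0, 1],    # adjacency matrix of the 4-cycle
                  [1, 0, 1, 0],
                  [0, 1, 0, 1],
                  [1, 0, 1, 0]], dtype=float)
    d = int(A[0].sum())
    L = np.eye(len(A)) - A / d

    eigvals, eigvecs = np.linalg.eigh(L)   # 0 = lambda_1 <= lambda_2 <= ...

    def rayleigh(f):
        num = sum((f[u] - f[v]) ** 2
                  for u in range(len(A)) for v in range(u) if A[u, v])
        return num / (d * np.dot(f, f))

    # eigenfunctions achieve their eigenvalues in the variational principle
    assert np.isclose(rayleigh(eigvecs[:, 1]), eigvals[1])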

For a subset {S \subseteq V} , we define the expansion of {S} by {\phi(S) = \mathcal R(\mathbf{1}_S)} , where {\mathbf{1}_S} is the indicator function of {S} .

Finally, for {k \in \{1,2,\ldots,n\}} , we define the {k} -way expansion constant of {G} by

\displaystyle  \rho_G(k) = \min \left\{ \max_i \phi(S_i) : S_1, S_2, \ldots, S_k \subseteq V \right\}\,,

where the minimum is over collections of {k} disjoint, non-empty subsets of {V} .

The classical discrete Cheeger inequality asserts that

\displaystyle  \frac{\lambda_2}{2} \leq \rho_G(2) \leq \sqrt{2\lambda_2}\,.

We will now prove the following generalization. See the full paper for a discussion of the surrounding issues and better quantitative bounds.

Theorem 1 For every {k \in \{1,2,\ldots,n\}} ,

\displaystyle  \frac{\lambda_k}{2} \leq \rho_G(k) \leq 30 k^{3.5} \sqrt{\lambda_k}\,. \ \ \ \ \ (2)

First, let’s prove the (easy) LHS of (2). Suppose we have {S_1, S_2, \ldots, S_k \subseteq V} which are disjoint and non-empty and which satisfy {\phi(S_i) = \mathcal R(\mathbf{1}_{S_i}) \leq \rho} . Then certainly {W = \mathrm{span}(\mathbf{1}_{S_1}, \ldots, \mathbf{1}_{S_k})} is a {k} -dimensional subspace of {\ell^2(V)} . On the other hand, every {f \in W} satisfies {\mathcal R(f) \leq 2 \rho} because if {f = \alpha_1 \mathbf{1}_{S_1} + \cdots + \alpha_k \mathbf{1}_{S_k}} , then

\displaystyle  \sum_{u \sim v} (f(u)-f(v))^2 \leq 2 \sum_{i=1}^k \alpha_i^2 \sum_{u \sim v} |\mathbf{1}_{S_i}(u)-\mathbf{1}_{S_i}(v)|^2\,,

where we have used the fact that if {u \in S_i} and {v \in S_j} , then

\begin{array}{rcl} |\alpha_i \mathbf{1}_{S_i}(u) - \alpha_j \mathbf{1}_{S_j}(v)|^2 &\leq & 2(\alpha_i^2 + \alpha_j^2) \\ &=& 2\alpha_i^2 |\mathbf{1}_{S_i}(u) - \mathbf{1}_{S_i}(v)|^2 + 2\alpha_j^2 |\mathbf{1}_{S_j}(u)-\mathbf{1}_{S_j}(v)|^2\,.\end{array}

But now using (1), the subspace {W} witnesses the fact that {\lambda_k \leq 2\rho} .

To prove the more difficult RHS of (2), we will use the following discrete Cheeger inequality with boundary conditions.

Lemma 2 For any {f : V\rightarrow \mathbb R} , there is a subset {U \subseteq \{ v \in V : f(v) \neq 0\}} such that

\displaystyle  \phi(U) \leq \sqrt{2 \mathcal R(f)}\,.

Proof: Let {U_t = \{ v \in V : f(v)^2 \geq t \}} . Observe that for each {t > 0} , one has {U_t \subseteq \{ v \in V : f(v) \neq 0\}} . For {S \subseteq V} , let {E(S,\bar S)} denote the edges of {G} with exactly one endpoint in {S} . Then we have

\displaystyle  \begin{array}{rcl}  \int_0^{\infty} |E(U_t, \bar U_t)|\,dt &=& \sum_{\{u,v\} \in E} \left|f(u)^2 - f(v)^2\right| \\ &= & \sum_{\{u,v\} \in E} |f(u) + f(v)| |f(u)-f(v)| \\ &\leq & \sqrt{\sum_{\{u,v\} \in E} (|f(u)|+|f(v)|)^2} \sqrt{\sum_{\{u,v\} \in E} |f(u)-f(v)|^2} \\ &\leq & \sqrt{2 d \sum_{u \in V} f(u)^2}\sqrt{\sum_{\{u,v\} \in E} |f(u)-f(v)|^2}. \end{array}

On the other hand, {\int_0^{\infty} d\,|U_t|\,dt = d \sum_{u \in V} |f(u)|^2} , thus

\displaystyle  \int_0^{\infty} |E(U_t, \bar U_t)|\,dt \leq \sqrt{2 \mathcal R(f)} \int_0^{\infty} d|U_t|\,dt\,,

implying there exists a {t > 0} such that {\phi(U_t) = \frac{|E(U_t,\bar U_t)|}{d |U_t|} \leq \sqrt{2 \mathcal R(f)}} . \Box
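The proof is constructive: it is the standard “sweep cut” over the level sets of {f^2} . A minimal sketch of my own in Python (assuming the adjacency matrix {A} of a {d} -regular graph as a numpy array):

    import numpy as np

    def sweep_cut(A, d, f):
        """Return the level set U_t of f^2 with the smallest expansion,
        staying inside the support of f."""
        order = np.argsort(-f ** 2)           # vertices by decreasing f(v)^2
        best_phi, best_set = np.inf, None
        mask = np.zeros(len(f), dtype=bool)
        for v in order:
            if f[v] == 0:
                break                         # U_t must avoid the zeros of f
            mask[v] = True
            cut = A[mask][:, ~mask].sum()     # |E(U_t, complement of U_t)|
            phi = cut / (d * mask.sum())
            if phi < best_phi:
                best_phi, best_set = phi, np.flatnonzero(mask)
        return best_set, best_phi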

In light of the preceding lemma, to prove the RHS of (2), it suffices to find {k} disjointly supported functions {\psi_1, \psi_2, \ldots, \psi_k : V \rightarrow \mathbb R} such that {\mathcal R(\psi_i) \leq (30)^2 k^6 \lambda_k} for each {i=1,2,\ldots,k} . Then Lemma 2 guarantees the existence of disjoint subsets of vertices satisfying our desired conclusion.

Toward this end, let {f_1, f_2, \ldots, f_k : V \rightarrow \mathbb R} denote {k} orthonormal functions satisfying {\mathcal L f_i = \lambda_i f_i} for each {i=1,2,\ldots,k} , i.e. consider the first {k} eigenfunctions of the Laplacian. Define the map {F : V \rightarrow \mathbb R^k} via

\displaystyle  F(v) = \left(f_1(v), f_2(v), \ldots, f_k(v)\right)\,.

We also put {\hat F(v) = F(v)/\|F(v)\|} . (Since {\lambda_1=0} and {G} is connected, the function {f_1} can be taken to equal {1/\sqrt{n}} everywhere, hence {F} is never zero and this is well-defined.)

The next lemma shows that, in order to find disjointly supported functions, it suffices to find “separated sets.”

Lemma 3 Suppose that for some {\varepsilon > 0} and {0 < \delta \leq 1} , there exist {k} subsets {U_1, U_2, \ldots, U_k \subseteq V} satisfying the conditions:

  1. For every {i=1,2,\ldots,k} , we have {\sum_{v \in U_i} \|F(v)\|^2 \geq \varepsilon} , and
  2. For every {i \neq j} , if {u \in U_i} and {v \in U_j} , then

    \displaystyle  \|\hat F(u)-\hat F(v)\| \geq \delta

Then there are disjointly supported functions {\psi_1, \psi_2, \ldots, \psi_k : V \rightarrow \mathbb R} such that {\mathcal R(\psi_i) \leq 25k\lambda_k/(\varepsilon\delta^2)} .

Proof: For each {i=1,2,\ldots,k} , we define maps {\theta_i : V \rightarrow \mathbb R} by

\displaystyle  \theta_i(v) = \max\left(0, 1 - \frac{2}{\delta} \min_{u \in U_i} \|\hat F(u)-\hat F(v)\|\right)\,.

Observe that, by the triangle inequality, for {u,v \in V} , we have

\displaystyle  |\theta_i(u)-\theta_i(v)| \leq \frac{2}{\delta} \|\hat F(u)-\hat F(v)\|\,. \ \ \ \ \ (3)

Next, we define

\displaystyle  \psi_i(v) = \theta_i(v) \|F(v)\|\,.

Observe that since {\theta_i} is identically {1} on {U_i} , we have

\displaystyle  \sum_{v \in V} \psi_i(v)^2 \geq \sum_{v \in U_i} \|F(v)\|^2 \geq \varepsilon \ \ \ \ \ (4)

by condition (i).

Next, we argue that the functions {\{\theta_i\}} are disjointly supported. This immediately implies that the {\{\psi_i\}} are disjointly supported. If {u \in \mathrm{supp}(\theta_i) \cap \mathrm{supp}(\theta_j)} for some {i \neq j} , then there are points {v \in U_i} and {v' \in U_j} such that {\|\hat F(u)-\hat F(v)\| < \delta/2} and {\|\hat F(u)-\hat F(v')\| < \delta/2} . But then by the triangle inequality, {\|\hat F(v)-\hat F(v')\| < \delta} , violating condition (ii).

Finally, we bound {\mathcal R(\psi_i)} . For any {u,v \in V} and {i=1,2,\ldots,k} , we have

\displaystyle  |\psi_i(u)-\psi_i(v)| \leq \theta_i(u) \left|\vphantom{\bigoplus}\|F(u)\|-\|F(v)\|\right| + \|F(v)\| \cdot |\theta_i(u)-\theta_i(v)|\,. \ \ \ \ \ (5)

Now using (3),

\displaystyle  \|F(v)\| \cdot |\theta_i(u)-\theta_i(v)| \leq \frac{2}{\delta} \|F(v)\| \cdot \|\hat F(u)-\hat F(v)\| \leq \frac{4}{\delta} \|F(u)-F(v)\|\,, \ \ \ \ \ (6)

where we have used the fact that for any non-zero vectors {x,y \in \mathbb R^k} , we have

\displaystyle  \|x\| \left\|\frac{x}{\|x\|} - \frac{y}{\|y\|}\right\| = \left\|x - \frac{\|x\|}{\|y\|} y\right\| \leq \|x-y\| + \left\|y - \frac{\|x\|}{\|y\|} y\right\| \leq 2 \,\|x-y\|\,.

Using (6) and the fact that {\theta_i(u) \leq 1} in (5) yields

\displaystyle  |\psi_i(u)-\psi_i(v)| \leq \left(1+\frac{4}{\delta}\right) \|F(u)-F(v)\|\,. \ \ \ \ \ (7)

Finally, observe that

\displaystyle  \sum_{u \sim v} \|F(u)-F(v)\|^2 = \sum_{i=1}^k \sum_{u \sim v} |f_i(u)-f_i(v)|^2 = d (\lambda_1 + \cdots + \lambda_k) \leq dk\lambda_k\,.

Combining this with (7) and (4) shows that {\mathcal R(\psi_i) \leq \lambda_k \frac{k}{\varepsilon}(1+\frac{4}{\delta})^2 \leq 25k\lambda_k/(\varepsilon \delta^2).} \Box

We will first make the task of finding separated sets slightly easier.

Lemma 4 Suppose that for some {0 < \delta \leq 1} and {m \geq 1} , there are subsets {T_1, T_2, \ldots, T_m \subseteq V} which satisfy:

  1. {\displaystyle \sum_{i=1}^m \sum_{v \in T_i} \|F(v)\|^2 \geq k - \frac14} .
  2. For every {i \neq j} , if {u \in T_i} and {v \in T_j} , then { \|\hat F(u)-\hat F(v)\| \geq \delta }
  3. For every {i=1,2,\ldots, m} ,

    \displaystyle  \sum_{v \in T_i} \|F(v)\|^2 \leq 1 + \frac{1}{2k}\,.

Then there are {k} sets {U_1, U_2, \ldots, U_k \subseteq V} satisfying the assumption of Lemma 3 with {\varepsilon = \frac{1}{4}} .

Proof: We can form the desired sets {\{U_i\}} by taking disjoint unions of the sets {\{T_i\}} . Begin with the family {\{T_1, T_2, \ldots, T_m\}} and repeatedly replace two sets {T} and {T'} by their union if they each satisfy {\sum_{v \in T} \|F(v)\|^2 < \frac12} .

When we finish, we are left with a collection of sets each member of which satisfies {\sum_{v \in T} \|F(v)\|^2 \in [\frac12, 1+\frac{1}{2k}]} , and possibly one set for which {\sum_{v \in T} \|F(v)\|^2 < 1/2} . By (i), we must end with at least {k} sets which each satisfy {\sum_{v \in T} \|F(v)\|^2 \geq 1/4} . \Box
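In code, the merging procedure is just a few lines (a sketch of my own; mass[v] stands in for {\|F(v)\|^2} ):

    def merge_small(parts, mass, k):
        """Greedy merging from the proof: while two parts each have mass
        below 1/2, merge them; then keep k parts of mass at least 1/4."""
        parts = [list(P) for P in parts]
        weight = lambda P: sum(mass[v] for v in P)
        while True:
            small = [i for i, P in enumerate(parts) if weight(P) < 0.5]
            if len(small) < 2:
                break
            i, j = small[0], small[1]
            parts[i] = parts[i] + parts[j]
            del parts[j]
        big = [P for P in parts if weight(P) >= 0.25]
        return big[:k]   # Lemma 4 guarantees at least k such parts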

Our final lemma, which finishes the proof of Theorem 1, simply asserts that such sets exist. We will no longer need the fact that {F} comes from eigenfunctions.

Lemma 5 Suppose that {g_1, g_2, \ldots, g_k : V \rightarrow \mathbb R} are orthonormal in {\ell^2(V)} . Let

\displaystyle G(v) = (g_1(v), g_2(v), \ldots, g_k(v))

and {\hat G(v) = G(v)/\|G(v)\|} . Then there is an {m \geq 1} and subsets {T_1, T_2, \ldots, T_m \subseteq V} such that

  1. {\displaystyle\sum_{i=1}^m \sum_{v \in T_i} \|G(v)\|^2 \geq k - \frac14} .
  2. For every {i \neq j} , if {u \in T_i} and {v \in T_j} , then

    \displaystyle \|\hat G(u)-\hat G(v)\| \geq \frac{1}{2\sqrt{2} k^{3}}\,.

  3. For every {i = 1,2,\ldots,m} ,

    \displaystyle  \sum_{v \in T_i} \|G(v)\|^2 \leq 1 + \frac{1}{2k}\,.

Proof: Consider the {n \times k} matrix {A} which has columns {g_1, g_2, \ldots, g_k} . For any {x \in \mathbb R^k} , we have

\displaystyle  \sum_{v \in V} \langle x, G(v)\rangle^2 = \|Ax\|^2 = x^T A^T A x = x^T x = \|x\|^2\,, \ \ \ \ \ (8)

since the columns of {A} are orthonormal.

For a subset {X \subseteq S^{k-1}} of the unit sphere in {\mathbb R^k} , we put {V(X) = \{ v \in V : \hat G(v) \in X \}} . Fix some {x \in X} and let {\mathrm{diam}(X)} denote the diameter of {X} . Then by (8), we have

\displaystyle  1 \geq \sum_{v \in V(X)} \langle x, G(v)\rangle^2 = \sum_{v \in V(X)} \|G(v)\|^2 \left(1-\frac{\|\hat G(v)-x\|^2}{2}\right)^2 \geq \sum_{v \in V(X)} \|G(v)\|^2 \left(1-\frac{\mathrm{diam}(X)^2}{2}\right)^2\,.

We conclude that if {\mathrm{diam}(X) \leq 1/\sqrt{2k}} , then

\displaystyle  \sum_{v \in V(X)} \|G(v)\|^2 \leq 1 + \frac{1}{2k}\,. \ \ \ \ \ (9)

Now, let {\mathcal{P}} be a partition of {\mathbb R^k} into axis-parallel cubes of side length {L = \frac{1}{k \sqrt{2}}} . For any such cube {Q \in \mathcal P} , we let {\tilde Q \subseteq Q} denote the set of points which are at Euclidean distance at least {\frac{L}{4k^2}} from every side of {Q} . Observe that

\displaystyle  \mathrm{vol}(\tilde Q) \geq \left(1-\frac{1}{4k^2}\right)^k \mathrm{vol}(Q) \geq \left(1-\frac{1}{4k}\right) \mathrm{vol}(Q)\,. \ \ \ \ \ (10)

Consider now the collection of sets {\{V(\tilde Q) : Q \in \mathcal P\}} . Since {\mathrm{diam}(\tilde Q) \leq L \sqrt{k} = \frac{1}{\sqrt{2k}}} , (9) implies that {\sum_{v \in V(\tilde Q)} \|G(v)\|^2 \leq 1 + \frac{1}{2k}} . Furthermore, by construction, for any {u \in V(\tilde Q)} and {v \in V(\tilde Q')} with {Q \neq Q'} , we have

\displaystyle  \|\hat G(u)-\hat G(v)\| \geq 2 \frac{L}{4k^2} \geq \frac{1}{2\sqrt{2} k^{3}}\,.

Thus the collection of sets {\{V(\tilde Q) : Q \in \mathcal P\}} satisfy both conditions (ii) and (iii) of the lemma. We are left to verify (i).

Note that {\sum_{v \in V} \|G(v)\|^2 = \sum_{i=1}^k \sum_{v \in V} g_i(v)^2 = k} . If we choose a uniformly random axis-parallel translation {\mathcal P'} of the partition {\mathcal P} , then (10) implies

\displaystyle  \mathbb{E} \sum_{Q \in \mathcal P'} \sum_{v \in V(\tilde Q)} \|G(v)\|^2 \geq \left(1-\frac{1}{4k}\right) \sum_{v \in V} \|G(v)\|^2 = k - \frac14\,.

In particular, there exists some fixed translate that achieves this bound. For this translate {\mathcal P'} , the sets {\{V(\tilde Q) : Q \in \mathcal P'\}} satisfy all three conditions of the lemma, completing the proof. \Box
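Finally, the randomly shifted grid is also easy to implement. A sketch of my own (rows of G_hat are the unit vectors {\hat G(v)} ; the constants match the proof):

    import numpy as np

    def grid_partition(G_hat, k):
        """The randomly shifted grid: return the sets V(Q~) of vertices
        whose unit vector lands well inside its cube."""
        L = 1.0 / (k * np.sqrt(2))                  # side length of the cubes
        margin = L / (4 * k ** 2)                   # distance kept from each side
        y = G_hat + np.random.uniform(0.0, L, size=k)
        cells = np.floor(y / L).astype(int)         # which cube each vertex is in
        offset = y - L * np.floor(y / L)            # position within that cube
        inside = np.all((offset >= margin) & (offset <= L - margin), axis=1)
        parts = {}
        for v in np.flatnonzero(inside):
            parts.setdefault(tuple(cells[v]), []).append(int(v))
        return list(parts.values())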