tcs math – some mathematics of theoretical computer science

January 11, 2012

A simpler proof of the KPR theorem

1. Introduction

The Klein-Plotkin-Rao (KPR) Theorem is a powerful statement about the geometry of planar graphs and their generalizations. Here, I’ll present a new, very simple proof of the theorem that was discovered in joint work with Cyrus Rashtchian. (This will appear in a preprint soon, together with some new results.) In the next post, I’ll give some applications in geometry and algorithms.

Recall that a graph is planar if it can be drawn in the plane without any edge crossings. Wagner’s theorem gives an intrinsic characterization of planar graphs in terms of excluded minors. Recall that a graph {H} is a minor of a graph {G} if {H} can be obtained from {G} by a sequence of (i) edge and vertex deletions and (ii) contraction of edges. A graph {G} excludes {H} as a minor if {H} is not a minor of {G}. Kuratowki’s theorem states that planar graphs are precisely those which exclude both {K_5} and {K_{3,3}} as minors, where we use {K_h} and {K_{h,h}} to denote the complete graph on {h} vertices and the complete {h}-by-{h} bipartite graph, respectively. In this post, we are particularly concerned with {K_h}-minor-free graphs, i.e. those which exclude {K_h} as a minor for some {h \geq 2}.

I’ll first state and prove a simpler version of the KPR theorem. In the next post, I’ll discuss a stronger statement (in the language of random partitions) that follows directly from the proof. Then using these partitions, we will show that the observable diameter of every {K_h}-minor-free graph is “large,” and use that fact to prove an upper bound on the uniform multi-commodity flow gap in such graphs.

2. Low-diameter graph partitioning

Consider a finite graph {G=(V,E)} equipped with its shortest-path metric {d_G} (much of what we say here extends to infinite graphs). For now, all the edges of {G} will have length one, although we will generalize to arbitrary weighted graphs for some applications in the next post.

Given a subset {S \subseteq V}, we write {\mathsf{diam}(S) = \max_{x,y \in S} d_G(x,y)\,.} A weak but simpler version of the KPR Theorem can be stated as follows.

Theorem 1 Let {G=(V,E)} be a graph that excludes {K_h} as a minor. Then for every number {\Delta \geq 1}, there exists a partition {V = S_1 \cup S_2 \cup \cdots \cup S_m} such that {\mathsf{diam}(S_i) \leq \Delta} for every {i=1,2,\ldots,m} and at most an {O(h^2/\Delta)}-fraction of edges of {G} go between different sets in the partition.

The theorem was originally proved with a dependence of {O(h^3)}, but this was improved to {O(h^2)} by Fakcharoenphol and Talwar. Today I will prove the {O(h^2)} bound. The partitioning will be accomplished via an iterative operation which we will call chopping.

Consider any connected graph {H} and a number {\tau \geq 1}. We will describe an operation which we call a {\tau}-chop of {H}. Let {x_0 \in V(H)} be any node of {H}, which we will call the “root node” of the chop, and let {r_0 \in [1,\tau]} (the “initial offset”).

Figure 1

The chopping operation is as follows: We partition {V(H)=\bigcup_{j=0}^{\infty} A_j}, where {A_0 = \{ v \in V(H) : d_H(x_0, v) < r_0 \}}, and the rest of the sets are the disjoint annuli,

\displaystyle  A_j =\left\{ v \in V(H) : r_0 + (j-1)\tau \leq d_H(x_0, v) < r_0 + j\tau \right\}\,,

for {j=1,2,\ldots}. See Figure 1 for an example of these cuts (the red and blue alternating regions) on a grid graph. Of course, since {H} is finite, eventually the annuli are empty.

We define a {\tau}-chop on a possibly disconnected graph as the partition arising from doing a {\tau}-chop on each of its connected components. Finally, we define a {\tau}-chop on a sequence of disjoint sets {S_1, S_2, \ldots, S_k \subseteq V} as the result of doing a {\tau}-chop on each of the induced graphs {G[S_i]} for {i=1,2,\ldots,m}. Thus if we have an initial partition {P} of {V}, then a {\tau}-chop of {P} produces a refined partition {P'} of {V}. See Figure 2 for the result of two iterated \tau-chops applied to the grid graph. The yellow circles represent the root nodes in the second chop.

Figure 2

We can now state the main technical result needed to prove Theorem 1.

Lemma 2 If {G=(V,E)} excludes {K_h} as a minor, then for any {\tau \geq 1}, any sequence of {h-1} iterated {\tau}-chops on {V} results in a partition {V = S_1 \cup S_2 \cup \cdots \cup S_m} such that {\mathsf{diam}(S_i) \leq O(h \tau)} for each {i=1,2,\ldots,m}.

Observe that {\mathsf{diam}(\cdot)} refers to the diameter in {d_G}, the shortest-path metric on {G}. Also, note that we do not constrain the root node or the initial offset of the chops. Klein, Plotkin, and Rao prove this lemma with a dependence of {O(h^2 \tau)} on the diameter. FT use a more complicated approach.

To see how Lemma 2 implies Theorem 1, one proceeds as follows. First, let {C} be large enough so that setting {\tau = \Delta/(Ch)} in Lemma 2 yields a partition into sets of diameter at most {\Delta}. After fixing the root node for a {\tau}-chop of {G}, one can consider the initial offsets {r_0=1,2,\ldots,\tau}. An edge {\{x,y\}} can be cut (i.e. {x} and {y} end up in distinct sets of the partition) in at most one of these offsets. Thus there exists an offset that cuts only a {1/\tau = O(h/\Delta)}-fraction of edges. Since we perform {h-1} iterated {\tau}-chops, there exists a choice of initial offsets that cuts at most {(h-1)/\tau = O(h^2/\Delta)}-fraction of edges. That completes the reduction.

3. A sketch

Before moving onto the formal argument, I’ll present a simple sketch that contains the main ideas. The proof is by contradiction; if we perform a sequence of {h} {\tau}-chops and the diameter of any remaining piece fails to be {O(h \tau)}, then we will construct a {K_{h+1}} minor in {G}.

First, we give an equivalent characterization of when a graph {G=(V,E)} has a graph {H} as a minor: There exist disjoint connected subsets {S_1, S_2, \ldots, S_{|V(H)|} \subseteq V}, one corresponding to each vertex of {H}. We call these supernodes. Furthermore, there should be an edge between supernodes whenever there is an edge between the corresponding vertices in {H}.

Figure 3

Now, the proof is by induction. Note that the base case {h=0} is trivial since a {K_1} minor is a single vertex. By induction, we can assume that if a sequence of {h} chops fails, then there must be a {K_h} minor contained in some offending annulus. See Figure 3. If we could ensure that every supernode of the minor touched the upper boundary of the annulus as in Figure 3, we could easily construct a {K_{h+1}} minor and be done, by simplying choosing the {(h+1)}-st supernode to be a ball around {x_0}.

Thus we need to enforce this extra property of our {K_h} minor. The (very simple) idea is contained in Figure 4.

Figure 4

After finding a {K_h} minor that intersects the annuli, we extend the supernodes to touch the upper boundary of the annulus from the preceding chop (which is represented by the purple line in the picture). The point is that we can choose these paths to be contained above the red boundary (and thus disjoint from the {K_h} supernodes), and also each of length at most {2\tau+1} since the width of the “purple” annulus is only {\tau}. The same can be done for {x_0}. (If we didn’t care that the paths have to be above the red boundary, we could choose them of length only {\tau}.)

The only issue is that we need these new paths to be disjoint. Since the paths are always short (length at most {O(\tau)}), we can enforce this by making sure that each supernode contains a representative and these representatives are pairwise far apart; then we grow the paths from the representatives. Initially, the representatives will be {\Omega(h \tau)} apart, and then as we go up the inductive chain, they will get closer by at most {O(\tau)} at every step. By choosing the initial separation large enough, they will remain disjoint. That’s the sketch; it should be possible to reproduce the proof from the sketch alone, but we now present a more formal proof.

4. The proof

We need a couple definitions. First, given a subset of vertices {V_0 \subseteq V} and a number {\tau \geq 0}, we say that a set {S} is {\tau}-dense in {V_0} if every element of {V_0} can reach an element of {S} by a path of length {\tau} that is contained completely in {G[V_0]} (the induced graph on {V_0}). Second, we say that an {H}-minor is {R}-represented by {S} if every supernode of {H} contains a representative from {S} and these representatives are pairwise distance more than {R} apart in the metric {d_G} (the global shortest-path metric on {G}). We now state a lemma that we can prove by induction and implies Lemma 2.

Lemma 3 Let {h \geq j \geq 0} and {\tau \geq 1} be any numbers. Suppose {G=(V,E)} is a graph and there is any sequence of {j} iterated {\tau}-chops on {G} that leaves a component of diameter more than {16 h\tau}. Then for any set {S} that is {\tau}-dense in {V}, one can find a {K_{j+1}}-minor that is {6(h-j)\tau}-represented by {S}.

One can recover Lemma 2 by setting {h=j} and {S=V}.

Proof: We proceed by induction on {j}. The case {j=0} is trivial since a {K_1} minor is simply a single vertex. Thus we may assume that {h \geq j \geq 1}.

The following figure will be a useful reference.

Figure 5

In general, we argue as follows. Let {S} be any set satisfying the assumptions of the lemma. Assume there is a sequence of {j} iterated {\tau}-chops that leaves a component of diameter more than {16 h \tau}. Then there must be some annulus {A_k} of the first {\tau}-chop such that {j-1} iterated {\tau}-chops on {A_k} leaves a component of diameter more than {16 h \tau}.

Suppose the first {\tau}-chop has root node {x_0 \in V} and initial offset {r_0}. Let {S' \subseteq A_k} be the set of nodes at distance exactly {r_0 + (k-1) \tau}, i.e. the upper boundary of {A_k}. Observe that {S'} is {\tau}-dense in {A_k} by construction. Thus by induction, there is a {K_{j}}-minor in {G[A_k]} that is {6(h-j+1)\tau}-represented by {S'}. Let {V_1, V_2, \ldots, V_j} be the supernodes of this minor, and let {\{ v_i \in S' \cap V_i : i=1,2,\ldots,j \}} be the representatives which are further than {6(h-j+1)\tau} apart.

We now extend this to a {K_{j+1}}-minor which is {6(h-j)\tau}-represented by {S}. First, we may assume that {k \geq 6h+1}. Otherwise, all the points of {A_k} lie in a ball of radius at most {r_0 + k \tau \leq (6h+2)\tau}, and hence {A_k} has diameter at most {(12h+4)\tau \leq 16h \tau}. In particular, we know that {d_G(x_0, v_i) \geq 6h\tau} for every {i=1,2,\ldots, j}.

Now for each {i=1,2,\ldots,j}, we choose a point {v_i' \in S \setminus A_i} such that {v_i'} is connected to {v_i} by a path of length at most {3\tau} in {G}. This can be done by first going up a shortest-path from {v_i} to {x_0} of length {\tau+1} to reach a point {v_i''}, and then choosing any point of {S} within distance {\tau} of {v_i''} (which can always be done since {S} is {\tau}-dense in {V}). We add this path to {V_i} to get a new supernode {V_i'}. Observe that the sets {\{V_i' : i=1,2,\ldots,j\}} are all connected and pairwise disjoint since the new paths are outside the annulus {A_k} and the paths themselves are pairwise disjoint because they are each of length at most {3\tau}, but they emanate from representatives that are more than {6(h-j+1)\tau \geq 6\tau} apart. In fact, this also shows that the representatives {\{ v_i' \in S : i=1,2,\ldots,j \}} are further than {6(h-j)\tau} apart, as required.

Finally, we construct a new supernode {V_0} as follows. For each {i=1,2,\ldots, j}, let {u_i \in V_i'} be the closest node to {x_0}, and let {s_0 \in S} be the closest node in {S} to {x_0}. For each {i}, let {P_i} be a shortest-path from {x_0} to {u_i} without its endpoint {u_i}, and let {P} be a shortest-path to {s_0}, including {s_0}. We now set {V_0 = P \cup P_1 \cup P_2 \cup \cdots \cup P_j}. We claim that {\{ V_0, V_1', \ldots, V_j' \}} forms a {K_{j+1}}-minor which is {6(h-j)\tau}-represented by {S}. First, it is clear that {d_G(s_0, v_i) > 6(h-j)\tau} since {d_G(s_0, x_0) \leq \tau} (again, because {S} is {\tau}-dense in {V}). For the same reason, the path {P} is disjoint from all the sets {\{V_i'\}}.

Thus the only possible obstruction to having a valid {K_{j+1}}-minor is if some path {P_i} intersects a set {V'_{\ell}} for {i \neq \ell}. We now show that this cannot happen. We know that if {P_i} intersects {V'_{\ell}}, then it must have already traveled distance at least {(k-1)\tau - 3\tau} away from {x_0}. But {P_i} contains a node adjacent to {V_i} (by construction), which means it continues an additional distance of {3\tau} (the distance between {V'_{\ell} \setminus V_{\ell}} and {V'_{i} \setminus V_i)}. This additional distance is also moving away from {x_0}, implying that {P_i} intersects {A_k}, which is impossible. This completes the proof.

5. The dependence on {h}

The best-known lower bound requires the conclusion of Theorem 1 to cut at least an {\Omega((\log h)/\Delta)}-fraction of edges. (One can take {G=(V,E)} to be an {n}-vertex 3-regular expander graph, which obviously excludes {K_n} as a minor. Now it is easy to see that for some constant {c}, partitioning into pieces of diameter at most {c \log n} must cut at least an {\Omega(1)}-fraction of edges.) In some special cases, e.g. graphs of genus {g} (which exclude {K_{c \lceil \sqrt{g}\rceil}} as a minor for some constant {c > 0}), one can reduce the bound to {O(\log g)} (see this joint work with Sidiropoulos). This leads to the following open problem.

Open problem: Show that under the assumptions of Theorem 1, one can find a partition that cuts only an {O((\log h)/\Delta)}-fraction of edges.

A positive resolution would yield an optimal unifom multi-commodity flow/cut gap for {K_h}-minor-free graphs. (See the next post for details.)

February 23, 2011

PSD lifting and Unique Games integrality gaps

Filed under: Math, Open question — Tags: , , — James Lee @ 9:51 am

By now, it is known that integrality gaps for the standard Unique Games SDP (see the paper of Khot and Vishnoi or Section 5.2 of this post) can be used to obtain integrality gaps for many other optimization problems, and often for very strong SDPs coming from various methods of SDP tightening; see, for instance, the paper of Raghavendra and Steurer.

Problematically, the Khot-Vishnoi gap is rather inefficient: To achieve the optimal gap for Unique Games with alphabet size {L}, one needs an instance of size {\exp(\Omega(L))}. As far as I know, there is no obstacle to achieving a gap instance where the number of variables is only {\mathrm{poly}(L)}.

The Walsh-Hadamard code

The Khot-Vishnoi construction is based on the Hadamard code.
(See Section 5.2 here for a complete description.) If we use {L^2(\{-1,1\}^k)} to denote the Hilbert space of real-valued functions {f : \{-1,1\}^k \rightarrow \mathbb R}, then the Walsh-Hadamard basis of {L^2(\{-1,1\}^k))} is the set of functions of the form

\displaystyle  u_S(x) = \prod_{i \in S} x_i,

where {S \subseteq \{1,2,\ldots,k\}}.

Of course, for two such sets {S \neq T}, we have the orthogonality relations,

\displaystyle  \langle u_S, u_T \rangle = 0.

In their construction, the variables are essentially all functions of the form {f : \{-1,1\}^k \rightarrow \{-1,1\}}, of which there are {2^{2^k}}, while there are only {2^k} basis elements {\{u_S\}_{S \subseteq [k]}} which act as the alphabet for the underlying Unique Games instance. This is what leads to the exponential relationship between the number of variables and the label size.

A PSD lifting question

In an effort to improve this dependence, one could start with a much larger set of nearly orthogonal vectors, and then somehow lift them to a higher-dimensional space where they would become orthogonal. In order for the value of the SDP not to blow up, it would be necessary that this map has some kind of Lipschitz property. We are thus led to the following (possibly naïve) question.

Let {C(d,\varepsilon)} be the smallest number such that the following holds. (Here, {S^{d-1} \subseteq \mathbb R^d} denotes the {(d-1)}-dimensional unit sphere and S(L^2) denotes the unit-sphere of L^2.)

There exists a map {F : S^{d-1} \rightarrow S(L^2)} such that {\|F\|_{\mathrm{Lip}} \leq C(d,\varepsilon)} and whenever {u,v \in \mathbb R^d} satisfy {|\langle u,v\rangle| \leq \varepsilon}, we have {\langle F(u), F(v)\rangle = 0}.

(Recall that \|F\|_{\mathrm{Lip}} = \sup_{x \neq y \in S^{d-1}} \|F(x)-F(y)\|/\|x-y\|.)

One can show that

\displaystyle C(d,\varepsilon) \lesssim \frac{\sqrt{d}}{1-\varepsilon}

by randomly partitioning {S^{d-1}} so that all vectors satisfying {|\langle u,v\rangle| \leq \varepsilon} end up in different sets of the partition, and then mapping all the points in a set to a different orthogonal vector.

My question is simply: Is a better dependence on {d} possible? Can one rule out that {C(d,\varepsilon)} could be independent of {d}? Note that any solution which randomly maps points to orthogonal vectors must incur such a blowup (this is essentially rounding the SDP to an integral solution).

January 7, 2011

Reading for a rainy day

Filed under: Math, Reading — James Lee @ 4:16 am

Today it’s raining in both Seattle and Paris. Here are a few things I think are worth reading while the weather clears up.

  • Stop collecting coupons. Recently, Batson, Spielman, and Srivastava gave a beautiful sparsification result for graphs: Every graph has a linear-sized spectral sparsifier. Here is one statement of their result (taken from Srivastava’s thesis):

    Theorem [BSS]: For every \varepsilon > 0 and m,n \in \mathbb N, the following holds. For every x_1, x_2, \ldots, x_m \in \mathbb R^n, there are non-negative numbers s_1, s_2, \ldots, s_m of which at most O(n/\varepsilon^2) are non-zero, and such that for all y \in \mathbb R^n,

    \displaystyle (1-\varepsilon)^2 \sum_{i=1}^m \langle x_i, y \rangle^2 \leq \sum_{i=1}^m s_i \langle x_i,y\rangle^2 \leq (1+\varepsilon)^2 \sum_{i=1}^m \langle x_i, y\rangle^2.

    Assaf Naor has given a very nice account of some recent geometric applications of this idea. These involves breaking the n \log n “coupon collecting” barrier from a number of results which were previously based on random sampling. There is still an outstanding open problem left open here: Embedding n-dimensional subspaces of L_1 into \ell_1^{O(n)}.

  • Lecture notes that are lecture notes. Some people write lecture notes like they are preliminary book drafts. These lecture notes by Ryan O’Donnell for an undergraduate course on “Probability and Computing” are like transcribed lectures. They’re conversational, with philosophical asides–a great example of the style.

  • Metaphors in systolic geometry. Larry Guth and Nets Katz recently made awesome progress on the Erdos distance problem in the plane. On a different note, here is a great informal survey of Guth on Gromov’s systolic inequality (which itself answers the question—when would an isometric embedding into L_{\infty} ever be employed for the usefulness of the ambient space?!)

  • An important epsilon. Finally, there is the recent result of Gharan, Saberi, and Singh giving a 3/2 - \varepsilon approximation for the Traveling Salesman Problem in unweighted graphs. I haven’t gotten a chance to digest the paper yet. Here’s hoping someone else will write an overview of the key ideas.

January 4, 2011

Metric 2011: Metric geometry, groups, and algorithms

Filed under: Announcement, Math — Tags: , , , — James Lee @ 3:29 pm

There will be a program at the Institut Henri Poincaré from Jan 5 to Mar 31 on “Metric geometry, algorithms, and groups.” Visit the web site, or that of the sister program on Discrete Analysis at the Newton Institute.

Notes from the program and related courses and talks will be posted at the Metric 2011 blog.

Metric 2011

December 9, 2010

Open question: Cover times and the Gaussian free field

Filed under: Math, Open question — Tags: , , , — James Lee @ 7:32 pm

Here are a few fascinating open questions coming from my work with Jian Ding and Yuval Peres on cover times of graphs and the Gaussian free field. (Also, here are my slides for the corresponding talk.)

1. Cover times and the Gaussian free field

Consider a finite, connected graph {G=(V,E)} and the simple random walk on {G} (which, at every step, moves from a vertex to a uniformly random neighbor). If we let {\tau_{\mathrm{cov}}} denote the first (random) time at which every vertex of {G} has been visited, and we use {\mathop{\mathbb E}_v} to denote expectation over the random walk started at {v \in V}, then the cover time of {G} is defined by

\displaystyle  t_{\mathrm{cov}}(G) = \max_{v \in V} \mathop{\mathbb E}_v \tau_{\mathrm{cov}}.

On the other hand, consider the following Gaussian process {\{g_v\}_{v \in V}}, called the (discrete) Gaussian free field (GFF) on {G}. Such a process is specified uniquely by its covariance structure. Fix some {v_0 \in V}, and put {g_{v_0}=0}. Then the rest of the structure is specified by the relation

\displaystyle  \mathop{\mathbb E}(g_u-g_v)^2 = R_{\mathrm{eff}}(u,v)

for all {u,v \in V}, where {R_{\mathrm{eff}}(u,v)} the effective resistance between {u} and {v} when {G} is thought of as an electrical network. Equivalently, the density of {\{g_v\}_{v \in V}} is proportional to

\displaystyle  \exp\left(-\frac12 \sum_{u \sim v} (g_u-g_v)^2\right),

where the sum is over edges of G.
One of our main theorems can be stated as follows.

Theorem 1 (Ding-L-Peres) For every graph {G=(V,E)}, one has

\displaystyle  t_{\mathrm{cov}}(G) \asymp |E| \left(\mathop{\mathbb E} \max_{v \in V} g_v\right)^2,

where {\{g_v\}_{v \in V}} denotes the Gaussian free field on {G}.

GFF on the 2D lattice; courtesy of Scott Sheffield.

Here {A \asymp B} is the assertion that {A} and {B} are within universal constant factors. In particular, we use this theorem to give a deterministic {O(1)}-approximation to the cover time, answering a question of Aldous and Fill, and to resolve the Winkler-Zuckerman blanket time conjectures.
This follows some partial progress on these questions by Kahn, Kim, Lovasz, and Vu (1999).

2. Derandomizing the cover time

The cover time is one of the few basic graph parameters which can easily be computed by a randomized polynomial-time algorithm, but for which we don’t know of a deterministic counterpart. More precisely, for every {\varepsilon > 0}, we can compute a {(1+\varepsilon)}-approximation to the cover time by simulating the random walk enough times and taking the median estimate, but even given the above results, the best we can do in deterministic polynomial-time is an {O(1)}-approximation.

We now describe one conjectural path to a better derandomization. Let {t_{\mathrm{hit}}(G) = \max_{u,v \in V} H(u,v)} denote the maximal hitting time in {G}, where {H(u,v)} is the expected number of steps needed to hit {v} from a random walk started at {u}. We prove the following more precise estimate.

Theorem 2 There is a constant {C > 0} such that for every graph {G=(V,E)},

\displaystyle  t_{\mathrm{cov}}(G) \leq \left(1+C\sqrt{\frac{t_{\mathrm{hit}}(G)}{t_{\mathrm{cov}}(G)}}\right) |E| \left(\mathop{\mathbb E} \max_{v \in V} g_v\right)^2.

This prompts the following conjecture, which describes a potentially deeper connection between cover times and the GFF.

Conjecture 1 For a sequence of graphs {\{G_n=(V_n,E_n)\}} with {t_{\mathrm{hit}}(G_n) = o(t_{\mathrm{cov}}(G_n))},

\displaystyle  t_{\mathrm{cov}}(G_n) \sim |E_n| \left(\mathop{\mathbb E} \max_{v\in V_n} g^{(n)}_v\right)^2,

where {\{g^{(n)}_v\}} denotes the GFF on {G_n}.

Here, we use {a_n \sim b_n} to denote {\lim a_n/b_n = 1}. The conjecture holds in some interesting cases, including the complete graph, the discrete {2D} torus, and complete {d}-ary trees. (See the full paper for details; except for the complete graph, these exact estimates are entire papers in themselves.)

Since the proof of Theorem 1 makes heavy use of the Fernique-Talagrand majorizing measures theory for estimating {\mathop{\mathbb E} \max_{v \in V} g_v}, and this theory is bound to lose a large multiplicative constant, it seems that new techniques will be needed in order to resolve the conjecture. In particular, using the isomorphism theory discussed below, it seems that understanding the structure of the near-maxima, i.e. those points {g_v} which achieve {g_v \geq (1-\varepsilon) \left(\mathop{\mathbb E} \max_{v \in V} g_v\right)}, will be an essential part of any study.

The second part of such a derandomization strategy is the ability to compute a deterministic {(1+\varepsilon)}-approximation to {\mathop{\mathbb E} \max_{v \in V} g_v}.

Question 1 For every {\varepsilon > 0}, is there a deterministic polynomial-time algorithm that, given a finite set of points {X \subseteq \mathbb R^n}, computes a {(1+\varepsilon)}-approximation to the value

\displaystyle  \mathop{\mathbb E} \max_{x \in X} \langle g,x\rangle,

where {g} is a standard {n}-dimensional Gaussian? Is this possible if we know that {X} has the covariance structure of a Gaussian free field?

3. Understanding the Dynkin Isomorphism Theory

Besides majorizing measures, another major tool used in our work is the theory of isomorphisms between Markov processes and Gaussian processes. We now switch to considering the continuous-time random walk on a graph {G}. This makes the same transitions as the simple discrete-time walk, but now spends an exponential (with mean one) amount of time at every vertex. We define the local time at {v} at time {t} by

\displaystyle  L_t^v = \frac{\textrm{amount of time spent at } v}{\mathrm{deg}(v)}

when we have run the random walk for time {t}.

Work of Ray and Knight in the 1960’s characterized the local times of Brownian motion, and then in 1980, Dynkin described a general connection between the local times of Markov processes and associated Gaussian processes. The version we use is due to Eisenbaum, Kaspi, Marcus, Rosen, and Shi (2000).

Theorem 3 Let {G=(V,E)} and fix some {v_0 \in V}, which is the origin of the random walk. Let {\ell > 0} be given, and define the (random) time {T=T(\ell)=\inf \{ t: L_t^{v_0} \geq \ell \}}. If {\{g_v\}_{v \in V}} is the GFF with {g_{v_0}=0}, then

\displaystyle  \left\{ L_T^x + \frac12 g_x^2 : x \in V \right\} \stackrel{\mathrm{law}}{=} \left\{ \frac12 (g_x - \sqrt{2\ell})^2 : x \in V\right\}.

Note that on the left hand side, the local times {\{L_T^x\}} and the GFF {\{g_x\}} are independent. This remarkable theorem (and many others like it) are proved in the book of Marcus and Rosen. In the introduction, the authors describe the “wonderful, mysterious isomorphism theorems.” They continue,

Another confession we must make is that we do not really understand the actual relationship between local times… and their associated Gaussian processes. If one asks us, as is often the case during lectures, to give an intuitive description… we must answer that we cannot. We leave this extremely interesting question to you.

So I will now pass the question along. The proof of the isomorphism theorems proceeds by taking Laplace transforms and then doing some involved combinatorics. It’s analogous to the situation in enumerative combinatorics where we have a generating function proof of some equality, but not a bijective proof where you can really get your hands on what’s happening.

What is the simplest isomorphism-esque statement which has no intuitive proof? The following lemma is used in the proof of the original Ray-Knight theorem on Brownian motion (see Lemma 6.32 in the Morters-Peres book). It can be proved in a few lines using Laplace transforms, yet one suspects there should be an explicit coupling.

Lemma 4 For any {\ell > 0}, the following holds. Suppose that {X} is a standard normal, {Z_1, Z_2, \ldots} are i.i.d. standard exponentials, and {N} is Poisson with parameter {\ell^2/2}. If all these random variables are independent, then

\displaystyle  (X+\ell)^2 \stackrel{\mathrm{law}}{=} X^2 + 2 \sum_{j=1}^N Z_j.

Amazing! The last open question is to explain this equality of distributions in a satisfactory manner, as a first step to understanding what’s really going on in the isomorphism theory.

July 18, 2010

The majorizing measures theorem

Filed under: lecture, Math — Tags: , , , — James Lee @ 4:05 am

We will now prove Talagrand’s majorizing measures theorem, showing that the generic chaining bound is tight for Gaussian processes. The proof here will be a bit more long-winded than the proof from Talagrand’s book, but also (I think), a bit more accessible as well. Most importantly, we will highlight the key idea with a simple combinatorial argument.

First, let’s recall the bound we proved earlier.

Theorem 1 Let {\{X_t\}_{t \in T}} be a Gaussian process, and let {T_0 \subseteq T_1 \subseteq \cdots \subseteq T} be a sequence of subsets such that {|T_0|=1} and {|T_n| \leq 2^{2^{n}}} for {n \geq 1}. Then,

\displaystyle   \mathop{\mathbb E}\sup_{t \in T} X_t \leq O(1) \sup_{t \in T} \sum_{n \geq 0} 2^{n/2} \, d(t, T_n). \ \ \ \ \ (1)

In order to make things slightly easier to work with, we look at an essentially equivalent way to state (1). Consider a Gaussian process {\{X_t\}_{t \in T}} and a sequence of increasing partitions {\{\mathcal A_n\}_{n \geq 0}} of {T}, where increasing means that {\mathcal A_{n+1}} is a refinement of {\mathcal A_n} for {n \geq 0}. Say that such a sequence {\{\mathcal A_n\}} is admissible if {\mathcal A_0 = \{T\}} and {|\mathcal A_n| \leq 2^{2^n}} for all {n \geq 1}. Also, for a partition {P} and a point {t \in T}, we will use the notation {P(t)} for the unique set in {P} which contains {t}.

By choosing {T_n} to be any set of points with one element in each piece of the partition {\mathcal A_n}, (1) yields,

\displaystyle  \mathop{\mathbb E}\sup_{t \in T} X_t \leq O(1) \sup_{t \in T} \sum_{n \geq 0} 2^{n/2} \,\mathrm{diam}(\mathcal A_n(t)). \ \ \ \ \ (2)

We can now state our main theorem, which shows that this is essentially the only way to bound {\mathop{\mathbb E} \sup_{t \in T} X_t}.

Theorem 2 There is a constant {L > 0} such that for any Gaussian process {\{X_t\}_{t \in T}}, there exists an admissible sequence {\{\mathcal A_n\}} which satisfies,

\displaystyle  \mathop{\mathbb E} \sup_{t \in T} X_t \geq L \sup_{t \in T} \sum_{n \geq 0} 2^{n/2} \,\mathrm{diam}(\mathcal A_n(t)). \ \ \ \ \ (3)

Recall that for a subset {A \subseteq T}, we defined {g(A) = \mathop{\mathbb E} \sup_{t \in A} X_t}, and in the last post, we proved the following “Sudakov inequality.”

Theorem 3 For some constants {\kappa > 0} and {r \geq 4}, the following holds. Suppose {\{X_t\}_{t \in T}} is a Gaussian process, and let {t_1, t_2, \ldots, t_m \in T} be such that {d(t_i,t_j) \geq \alpha} for {i \neq j}. Then,

\displaystyle   g(T) \geq \kappa \alpha \sqrt{\log_2 m} + \min_{i=1,2,\ldots,m} g(B(t_i, \alpha/r)). \ \ \ \ \ (4)

We will use only Theorem 3 and the fact that {g(A) \leq g(B)} whenever {A \subseteq B} to prove Theorem 2 (so, in fact, Theorem 2 holds with {\mathop{\mathbb E} \sup_{t \in T} X_t} replaced by more general functionals satisfying an inequality like (4)).

The partitioning scheme

First, we will specify the partitioning scheme to form an admissible sequence {\{\mathcal A_n\}}, and then we will move on to its analysis. As discussed in earlier posts, we may assume that {T} is finite. Every set {C \in \mathcal A_n} will have a value {\mathrm{rad}(C)} associated with it, such that {\mathrm{rad}(C)} is always an upper bound on the radius of the set {C}, i.e. there exists a point {x \in C} such that {C \subseteq B(x, \mathrm{rad}(C))}.

Initially, we set {\mathcal A_0 = \{T\}} and {\mathrm{rad}(T) = \mathrm{diam}(T)}. Now, we assume that we have constructed {\mathcal A_n}, and show how to form the partition {\mathcal A_{n+1}}. To do this, we will break every set {C \in \mathcal A_n} into at most {2^{2^n}} pieces. This will ensure that

\displaystyle |\mathcal A_{n+1}| \leq 2^{2^n} \cdot |\mathcal A_n| \leq 2^{2^n} \cdot 2^{2^n} = 2^{2^{n+1}}.

Let {r} be the constant from Theorem 3. Put {m = 2^{2^n}}, and let {\Delta = \mathrm{rad}(C)}. We partition {C} into {m} pieces as follows. First, choose {t_1 \in C} which maximizes the value

\displaystyle  g(B(t_1, \Delta/r^2) \cap C).

Then, set {C_1 = B(t_1, \Delta/r) \cap C}. We put {\mathrm{rad}(C_1) = \Delta/r}.

A remark: The whole idea here is that we have chosen the “largest possible piece,” (in terms of {g}-value), but we have done this with respect to the {\Delta/r^2} ball, while we cut out the {\Delta/r} ball. The reason for this will not become completely clear until the analysis, but we can offer a short explanation here. Looking at the lower bound (4), observe that the balls {B(t_i, \alpha/3)} are disjoint under the assumptions, but we only get “credit” for the {B(t_i, \alpha/r)} balls. When we apply this lower bound, it seems that we are throwing a lot of the space away. At some point, we will have to make sure that this thrown away part doesn’t have all the interesting stuff! The reason for our choice of {\Delta/r} vs. {\Delta/r^2} is essentially this: We want to guarantee that if we miss the interesting stuff at this level, then the previous level took care of it. To have this be the case, we will have to look forward (a level down), which (sort of) explains our choice of optimizing for the {\Delta/r^2} ball.

Now we continue in this fashion. Let {D_{\ell} = C \setminus \bigcup_{i=1}^{\ell-1} C_i} be the remaining space after we have cut out {\ell-1} pieces. For {\ell \leq m}, choose {t_{\ell} \in C} to maximize the value

\displaystyle  g(B(t_{\ell}, \Delta/r^2) \cap D_{\ell}).

For {\ell < m}, set {C_{\ell} = B(t_{\ell}, \Delta/r) \cap D_{\ell}}, and put {\mathrm{rad}(C_{\ell}) = \Delta/r}.

So far, we have been chopping the space into smaller pieces. If {D_{\ell} = \emptyset} for some {\ell \leq m}, we have finished our construction of {\mathcal A_{n+1}}. But maybe we have already chopped out {m-1} pieces, and still some remains. In that case, we put {C_m = D_m}, i.e. we throw everything else into {C_m}. Since we cannot reduce our estimate on the radius, we also put {\mathrm{rad}(C_m) = \Delta}.

We continue this process until {T} is exhausted, i.e. eventually for some {n} large enough, {\mathcal A_n} only contains singletons. This completes our description of the partitioning.

The tree

For the analysis, it will help to consider our partitioning process as having constructed a tree (in the most natural way). The root of the the tree is the set {T}, and its children are the sets of {\mathcal A_1}, and so on. Let’s call this tree {\mathcal W}. It will help to draw and describe {\mathcal W} in a specific way. First, we will assign values to the edges of the tree. If {C \in \mathcal A_n} and {C_j} is a child of {C} (i.e., {C_j \in \mathcal A_{n+1}} and {C_j \subseteq C}), then the edge {(C,C_j)} is given value:

\displaystyle  \kappa \cdot \frac{\mathrm{rad}(C)}{r} \cdot 2^{n/2}, \ \ \ \ \ (5)

where {\kappa} and {r} are the constants from Theorem 3.

If we define the value of a root-leaf path in {\mathcal W} as the sum of the edge lengths on that path, then for any {t \in T},

\displaystyle  \sum_{n \geq 0} 2^{n/2}\,\mathrm{diam}(\mathcal A_n(t)) \leq 2 \frac{r}{\kappa} \left(\textrm{value of the path from the root to } t\right),

simply using {\mathrm{diam}(\mathcal A_n(t)) \leq 2\, \mathrm{rad}(\mathcal A_n(t))}.

Thus in order to prove Theorem 2, which states that for some {L > 0},

\displaystyle  g(T) \geq L \sup_{t \in T} \sum_{n \geq 0} 2^{n/2}\,\mathrm{diam}(\mathcal A_n(t)),

it will suffice to show that for some (other) constant {L > 0}, for any root-leaf path {P} in {\mathcal W}, we have

\displaystyle  g(T) \geq L \cdot \mathrm{value}(P). \ \ \ \ \ (6)

Before doing this, we will fix a convention for drawing parts of {\mathcal W}. If a node {C \in \mathcal A_n} has children {C_1, C_2, \ldots, C_m}, we will draw them from left to right. We will call an edge {(C,C_m)} a right turn and every other edge will be referred to as a left turn. Note that some node {C} may not have any right turn coming out of it (if the partitioning finished before the last step). Also, observe that along a left turn, the radius always drops by a factor of {r}, while along a right turn, it remains the same.

We now make two observations about computing the value {\mathrm{value}(P)} up to a universal constant.

Observation (1): In computing the value of a root-leaf path {P}, we only need to consider right turns.

To see this, suppose that we have a right turn followed by a consecutive sequence of left turns. If the value of the right turn is {\frac{\kappa}{r} \Delta 2^{n/2}}, then the value of the following sequence of left turns is, in total, at most

\displaystyle  \frac{\kappa}{r} \sum_{j=1}^{\infty} 2^{(n+j)/2} \frac{\Delta}{r^j} \leq O(1) \frac{\kappa}{r} \Delta 2^{n/2}.

In other words, because the radius decreases by a factor of {r} along every left turn, their values decrease geometrically, making the whole sum comparable to the preceding right turn. (Recall that {r \geq 4}, so indeed the sum is geometric.)

If the problem of possibly of having no right turn in the path {P} bothers you, note that we could artificially add an initial right turn into the root with value {\mathrm{diam}(T)}. This is justified since {g(T) \geq \frac12 \mathrm{diam}(T)} always holds. A different way of saying this is that if the path really contained no right turn, then its value is {O(\mathrm{diam}(T))}, and we can easily prove (6).

Observation (2): In computing the value of a root-leaf path {P}, we need only consider the last right turn in any consecutive sequence of right turns.

Consider a sequence of consecutive right turns, and the fact that the radius does not decrease. The values (taking away the {\kappa/r} factor) look like {\Delta 2^{n/2}, \Delta 2^{(n+1)/2}, \Delta 2^{(n+2)/2}, \ldots}. In other words, they are geometrically increasing, and thus using only the last right turn in every sequence, we only lose a constant factor.

We will abbreviate last right turn to LRT, and write {\mathrm{value}_{\mathrm{LRT}}(P)} to denote the value of {P}, just counting last right turns. By the two observations, to show (6) (and hence finish the proof), it suffices to show that, for every root-leaf path {P} in {\mathcal W},

\displaystyle  2 \cdot g(T) \geq \mathrm{value}_{\mathrm{LRT}}(P). \ \ \ \ \ (7)

The analysis

Recall that our tree {\mathcal W} has values on the edges, defined in (5). We will also put some natural values on the nodes. For a node {C} (which, recall, is just a subset {C \subseteq T}), we put {\mathrm{value}(C) = g(C)}. So the edges have values and the nodes have values. Thus given any subset of nodes and edges in {\mathcal W}, we can talk about the value of the subset, which will be the sum of the values of the objects it contains. We will prove (7) by a sequence of inequalities on subsets.

Fix a root-leaf path {P}, for which we will prove (7). Let’s prove the fundamental inequality now. We will consider two consecutive LRTs along {P}. (If there is only one LRT in {P}, then we are done by the preceding remarks.) See the figure below. The dashed lines represent a (possibly empty) sequence of left turns and then right turns. The two LRTs are marked.

We will prove the following inequality, which is the heart of the proof. One should understand that the inequality is on the values of the subsets marked in red. The first subset contains two nodes, and the second contains two nodes and an edge.

Figure A.

Figure A.

With this inequality proved, the proof is complete. Let’s see why. We start with the first LRT. Since g(T) \geq g(C) for any node C in \mathcal W, we have the inequality:

This gets us started. Now we apply the inequality of Figure A repeatedly to each pair of consecutive LRTs in the path {P}. What do we have when we’ve exhausted the path P? Well, precisely all the LRTs in {P} are marked, yielding {2 \cdot g(T) \geq \mathrm{value}_{\mathrm{LRT}}(P)}, as desired.

The LRT inequality

Now we are left to prove the inequality in Figure A. First, let’s label some of the nodes. Let {\Delta = \mathrm{rad}(C)}, and suppose that {C \in \mathcal A_n}. The purple values are not the radii of the corresponding nodes, but they are upper bounds on the radii, recalling that along every left turn, the radius decreases by a factor of {r}. Since there are at least two left turns in the picture, we get a {\Delta/r^2} upper bound on the radius of {J}.

Part of the inequality is easy: We have {g(A) \geq g(B)} since {B \subseteq A}. So we can transfer the red mark from {A} to {B}. We are thus left to prove that

\displaystyle  g(C) \geq \frac{\kappa}{r} \Delta 2^{n/2} + g(J). \ \ \ \ \ (8)

This will allow us to transfer the red mark from {C} to the LRT coming out of {C} and to {J}.

When {C} was partitioned into {m = 2^{2^n}} pieces, this was by our greedy partitioning algorithm using centers {t_1, t_2, \ldots, t_m}. Since we cut out the {\Delta/r} ball around each center, we have {d(t_i, t_j) \geq \Delta/r} for all {i \neq j}. Applying the Sudakov inequality (Theorem 3), we have

\displaystyle  \begin{array}{rl} g(C) &\displaystyle \geq \kappa \frac{\Delta}{r} \sqrt{\log_2 m} + \min_{i=1,\ldots,m} g(B(t_i, \Delta/r^2)) \\ \\ &\displaystyle = \frac{\kappa}{r} \Delta 2^{n/2} + \min_{i=1,\ldots,m} g(B(t_i, \Delta/r^2)) \\ \\ &\displaystyle \geq \frac{\kappa}{r} \Delta 2^{n/2} + \min_{i=1,\ldots,m} g(B(t_i, \Delta/r^2) \cap D_i) \\ \\ &\displaystyle = \frac{\kappa}{r} \Delta 2^{n/2} + g(B(t_m, \Delta/r^2) \cap D_m), \\ \end{array}

where the last line follows from the greedy manner in which the {t_i}‘s were chosen.

But now we claim that

\displaystyle  g(B(t_m, \Delta/r^2) \cap D_m) \geq g(J). \ \ \ \ \ (9)

This follows from two facts. First, {J \subseteq D_m} (since {D_m=C_m} actually). Secondly, the radius of {J} is at most {\Delta/r^2}! But {t_m} was chosen to maximize the value of {g(B(t_m, \Delta/r^2) \cap D_m)} over all balls of radius {\Delta/r^2}, so in particular its {g}-value is at least that of the {\Delta/r^2} ball containing {J}.

Combining (9) and the preceding inequality, we prove (8), and thus that the inequality of Figure A is valid. This completes the proof.

July 8, 2010

Majorizing measures: Gaussian tools

Filed under: lecture, Math — Tags: , , , — James Lee @ 2:33 am

In order to prove that the chaining argument is tight, we will need some additional properties of Gaussian processes. For the chaining upper bound, we used a series of union bounds specified by a tree structure. As a first step in producing a good lower bound, we will look at a way in which the union bound is tight.

Theorem 1 (Sudakov inequality) For some constant C > 0, the following holds. Let {\{X_t\}_{t \in T}} be a Gaussian process such that for every distinct {s,t \in T}, we have {d(s,t) \geq \alpha}. Then,

\displaystyle  \mathop{\mathbb E} \sup_{t \in T} X_t \geq C \alpha \sqrt{\log |T|}.

The claim is an elementary calculation for a sequence of i.i.d. {N(0,1)} random variables {g_1, g_2, \ldots, g_n} (i.e. {\mathop{\mathbb E} \sup_i g_i \geq C\sqrt{\log n}}). We will reduce the general case to this one using Slepian’s comparison lemma.

Lemma 2 (Slepian’s Lemma) Let {\{X_t\}_{t \in T}} and {\{Y_t\}_{t \in T}} be two Gaussian processes such that for all {s,t \in T},

\displaystyle   \mathop{\mathbb E} \,|X_s - X_t|^2 \geq \mathop{\mathbb E} \,|Y_s - Y_t|^2. \ \ \ \ \ (1)

Then {\mathop{\mathbb E} \sup_{t \in T} X_t \geq \mathop{\mathbb E} \sup_{t \in T} Y_t}.

There is a fairly elementary proof of Slepian’s Lemma (see, e.g. the Ledoux-Talagrand book), if one is satisfied with the weaker conclusion {2\, \mathop{\mathbb E}\,|X_s-X_t|^2 \geq \mathop{\mathbb E}\,|Y_s-Y_t|^2}, which suffices for our purposes.

To see that Lemma 2 yields Theorem 1, take a family {\{X_t\}_{t \in T}} with {d(s,t) \geq \alpha} for all {s \neq t \in T} and consider the associated variables {Y_t = \frac{\alpha}{\sqrt{2}} g_t} where {\{g_t\}_{t \in T}} is a family of i.i.d. {N(0,1)} random variables. It is straightforward to verify that (1) holds, hence by the lemma, {\mathop{\mathbb E} \sup_{t \in T} X_t \geq \frac{\alpha}{\sqrt{2}} \mathop{\mathbb E} \sup_{t \in T} g_t}, and the result follows from the i.i.d. case.

The Sudakov inequality gives us “one level” of a lower bound; the following strengthening will allow us to use it recursively. If we have a Gaussian process {\{X_t\}_{t \in T}} and {A \subseteq T}, we will use the notation

\displaystyle  g(A) = \mathop{\mathbb E} \sup_{t \in A} X_t.

For {t \in T} and {R \geq 0}, we also use the notation

\displaystyle  B(t,R) = \{ s \in T : d(s,t) \leq R \}.

Here is the main theorem of this post; its statement is all we will require for our proof of the majorizing measures theorem:

Theorem 3 For some constants C > 0 and {r > 1}, the following holds. Suppose {\{X_t\}_{t \in T}} is a Gaussian process, and let {t_1, t_2, \ldots, t_m \in T} be such that {d(t_i,t_j) \geq \alpha} for {i \neq j}. Then,

\displaystyle  g(T) \geq C \alpha \sqrt{\log m} + \min_{i=1,2,\ldots,m} g(B(t_i, \alpha/r)).

The proof of the preceding theorem relies on the a strong concentration property for Gaussian processes. First, we recall the classical isoperimetric inequality for Gaussian space (see, for instance, (2.9) here).
We remind the reader that for a function {F : \mathbb R^n \rightarrow \mathbb R},

\displaystyle \|F\|_{\mathrm{Lip}} = \sup_{x \neq y \in \mathbb R^n} \frac{|F(x)-F(y)|}{\|x-y\|}.

Theorem 4 Let {F : \mathbb R^n \rightarrow \mathbb R}, and let {\mu = \int F \,d\gamma_n}, where {\gamma_n} is the standard {n}-dimensional Gaussian measure. Then,

\displaystyle   \gamma_n\left(x \in \mathbb R^n : |F(x)-\mu| > \lambda\right) \leq 2\,\exp\left(\frac{-\lambda^2}{2 \|F\|_{\mathrm{Lip}}}\right). \ \ \ \ \ (2)

Using this, we can prove the following remarkable fact.

Theorem 5 Let {\{X_t\}_{t \in T}} be a Gaussian process, then

\displaystyle   \mathbb P\left(\left|\sup_{t \in T} X_t - \mathop{\mathbb E} \sup_{t \in T} X_t\right| > \lambda\right) \leq 2\,\exp\left(\frac{-\lambda^2}{2 \sup_{t \in T} \mathop{\mathbb E}(X_t^2)}\right). \ \ \ \ \ (3)

A notable aspect of this statement is that only the maximum variance affects the concentration, not the number of random variables. We now prove Theorem 5 using Theorem 4.

Proof: We will prove it in the case {|T|=n}, but of course our bound is independent of {n}. The idea is that given a Gaussian process {\{X_1, X_2, \ldots, X_n\}}, we can write

\displaystyle  X_i = a_{i1} \,g_1 + a_{i2}\, g_2 + \cdots + a_{in}\, g_n,

for {i=1,2,\ldots, n}, where {\{g_i\}_{i=1}^n} are standard i.i.d. normals, and the matrix {A=(a_{i,j})} is a matrix of real coefficients. In this case, if {g = (g_1, g_2, \ldots, g_n)} is a standard {n}-dimensional Gaussian, then the vector {Ag} is distributed as {(X_1, X_2, \ldots, X_n)}.

If we put {F(x)=\max \{ (Ax)_i : i=1,\ldots,n\}}, then Theorem 4 yields (3) as long as {\|F\|_{\mathrm{Lip}} \leq \max_i \sqrt{\mathop{\mathbb E}(X_i^2)}}. It is easy to see that

\displaystyle  \|F\|_{\mathrm{Lip}} = \|A\|_{2 \rightarrow \infty} = \sup_{\|x\|_2 = 1} \|A x\|_{\infty}.

But {\|A\|_{2 \rightarrow \infty}} is just the maximum {\ell_2} norm of any row of {A}, and the {\ell_2} norm of row {i} is

\displaystyle \sqrt{\sum_{j=1}^n a_{ij}^2} = \sqrt{\mathop{\mathbb E}(X_i^2)}.


Using this theorem, we are ready to prove Theorem 3. I will only give a sketch here, but filling in the details is not too difficult.

Assume that the conditions of Theorem 3 hold. Pick an arbitrary {t_0 \in T}, and recall that we can write

\displaystyle g(T) = \mathop{\mathbb E} \sup_{t \in T} X_t = \mathop{\mathbb E} \sup_{t \in T} (X_t - X_{t_0})

since our gaussians are centered.

Now, by Theorem 1,

\displaystyle \mathop{\mathbb E} \max_{i=1,\ldots,m} \left(X_{t_i} - X_{t_0}\right) \geq C \alpha \sqrt{\log m}.

Suppose that {t_1} achieves this. By definition,

\displaystyle g(B(t_1, \alpha/r)) = \mathop{\mathbb E} \sup_{t \in B(t_1, \alpha/r)} \left(X_t - X_{t_1}\right),

so we could hope that for some {t \in B(t_1,\alpha/r)}, we simultaneously have {X_t - X_{t_1} \geq g(B(t_1,\alpha/r))}, yielding

\displaystyle  X_t - X_{t_0} = (X_t - X_{t_1}) + (X_{t_1} - X_{t_0}) \geq C\alpha \sqrt{\log m} + g(B(t_1, \alpha/r)). \ \ \ \ \ (4)

The problem, of course, is that the events we are discussing are not independent.

This is where Theorem 5 comes in. For any {i}, all the variances of the variables {\{X_t - X_{t_i} : t \in B(t_i,\alpha/r)\}} are bounded by {d(t,t_i)^2 \leq (\alpha/r)^2}. This implies that we can choose a constant {c_0 > 0} such that

\displaystyle  \mathbb P\left(\left|\sup_{t \in B(t_i,\alpha/r)} X_t - g(B(t_i, \alpha/r))\right| > c_0 (\alpha/r) \sqrt{\log m}\right) \leq m^{-2}.

So, in fact, we can expect that none of the {m} random variables {\sup_{t \in B(t_i,\alpha/r)} X_t} will deviate from its expected value by more than {c_0 (\alpha/r) \sqrt{\log m}}. Which means we can (morally) replace (4) by

\displaystyle  \begin{array}{rl} X_t - X_{t_0} &= (X_t - X_{t_1}) + (X_{t_1} - X_{t_0}) \\ &\geq C\alpha \sqrt{\log m} + g(B(t_1, \alpha/r)) - c_0 (\alpha/r) \sqrt{\log m}. \end{array}

But now by choosing {r = 2 C c_0}, the error term is absorbed.

June 29, 2010

Aram Harrow, Quantum Information, and Dvoretzky’s theorem

Filed under: Math — Tags: , , , — James Lee @ 1:42 pm

I’m delighted to announce that Aram Harrow will be joining us at UW as a visiting professor for the coming academic year. Aram is one of the young leaders in quantum information and computation. During a recent visit, Aram started to explain to me how ideas from asymptotic geometric analysis have started to become important in quantum information.

The capacity of quantum channels

A mixed quantum state on {\mathbb C^d} is represented by a density matrix {\rho}, which is {d \times d} Hermitian, positive semi-definite matrix with {\mathrm{tr}(\rho) = 1}. The von Neumann entropy of {\rho} is given by

\displaystyle  S(\rho) = - \mathrm{tr}(\rho \log \rho).

If {\mathcal M_d} is the space of {d \times d} matrices with complex entries, then a quantum channel is a mapping

\displaystyle  \Phi : \mathcal M_d \rightarrow \mathcal M_k

for some {d,k \geq 1}, which sends PSD matrices to PSD matrices and preserves the trace. Finally, the minimal output entropy of a quantum channel is

\displaystyle  S_{\min}(\Phi) = \min_{\rho} S(\Phi(\rho)),

where {\rho} ranges over all mixed states on {\mathbb C^d}. Since the entropy is a concave function, this minimum is achieved at a pure state.

Work of Shor showed that the following conjecture is equivalent to a number of long-standing problems in quantum information theory.

Additivity conjecture for minimal output von Neumann entropy: Is is true that, for every two quantum channels {\Phi_1} and {\Phi_2}, we have:

\displaystyle  S_{\min}(\Phi_1 \otimes \Phi_2) = S_{\min}(\Phi_1) + S_{\min}(\Phi_2).

Observe that we always have {\leq} above, by considering the product state {\rho_1 \otimes \rho_2} where {\rho_1} and {\rho_2} are minimizers for {\Phi_1} and {\Phi_2}, respectively. The conjecture is about whether we can get a smaller output entropy by entangling the the inputs to {\Phi_1 \otimes \Phi_2}.

Hastings’ counterexample and Dvoretzky’s theorem

Hastings proved that the conjecture is false: By appropriately choosing randomly constructed quantum channels, it is possible to achieve

\displaystyle  S_{\min}(\Phi_1 \otimes \Phi_2) < S_{\min}(\Phi_1) + S_{\min}(\Phi_2).

Remarkably, it has recently been shown by Aubrun, Szarek, and Werner that this counterexample follows from an appropriate version of Dvoretzky’s theorem on almost-Euclidean subspaces of normed spaces (for a refresher, see this earlier post on pseudorandom subspaces).

Let {\mathcal M_{k,d}} be the space of {k \times d} matrices with complex entries. We will consider two norms on matrices {M \in \mathcal M_{k,d}}: The Hilbert-Schmidt norm {\|M\|_2 = \sqrt{\mathrm{tr} |M|^2} = \sqrt{\sum_{i,j} M_{ij}^2}}, and the Schatten {4}-norm {\|M\|_4 = \left(\mathrm{tr} |M|^4\right)^{1/4}}, where {|M| = \sqrt{M^* M}}. This latter quantity is precisely the {\ell^4} norm of the singular values of {M}. Dvoretzky’s theorem tells us (qualitatively) that an appropriate random subspace of the normed space {(\mathcal M_{k,d}, \|\cdot\|_4)} will be nearly Euclidean.

More precisely, for every {k \geq 1}, a uniformly random subspace {\mathcal W \subseteq \mathcal M_{k,O(k^2)}} of dimension {\Omega(k^2)} will, with high probability, satisfy: For every {M \in \mathcal W},

\displaystyle  k^{-1/4} \|M\|_2 \leq \|M\|_4 \leq \left(1+\frac{O(1)}{k}\right) k^{-1/4} \|M\|_2,

i.e. it will be {\left(1+O(1/k)\right)}-Euclidean. This strong quantitative version follows along the lines of enhancements to Milman’s proof of Dvoretzky’s theorem.

Oscillations and chaining

The proof of the preceding quantitative bound is based on a version of Dvoretzky’s theorem for Lipschitz functions, which I will state here in the real case for simplicity. We will use {S^{n-1} \subseteq \mathbb R^n} for the {(n-1)}-dimensional unit sphere, and then for a function {f : S^{n-1} \rightarrow \mathbb R} and a subset {A \subseteq S^{n-1}}, we write

\displaystyle  \mathrm{osc}(f,A) = \sup_{x \in A} |f(x)-\mu(f)|,

where {\mu(f)} is the average value of {f} on {S^{n-1}}.

Theorem 1 If {f : S^{n-1} \rightarrow \mathbb R} is a 1-Lipschitz function, then for every {\varepsilon > 0}, if {E \subseteq \mathbb R^n} is a random subspace of dimension {\lfloor \varepsilon^2 n\rfloor}, then with high probability,

\displaystyle  \mathrm{osc}(f,E \cap S^{n-1}) \leq O(\varepsilon). \ \ \ \ \ (1)

A standard way to prove this lemma would be to use the fact that every {1}-Lipschitz function is highly concentrated on {S^{n-1}} (Levy’s lemma) and then to take a union bound over an {\varepsilon}-net on {S^{n-1}} whose size is bounded by {(1+2/\varepsilon)^n}. Unfortunately, this leads to an extra {\log(1/\varepsilon)} term on the right-hand side of (1) which, remarkably, would lead to a version of Dvoretzky’s theorem that is too weak to imply Hastings’ counterexample to the additivity conjecture.

Gordon originally proved Theorem 1 using comparison results for Gaussian processes. (In fact, we will see a main technical step called Slepian’s Lemma in our next post on majorizing measures.) Alternately, Schechtman showed that one can use concentration of measure and a more sophisticated chaining argument, of the type discussed in our previous post.

June 24, 2010

The Gödel Prize, TSP, and volume growth

Filed under: Math, Open question — Tags: , , , , — James Lee @ 3:57 pm

Recently, Sanjeev Arora and Joe Mitchell won the Gödel prize for their work on Euclidean TSP. They show that given n points in \mathbb R^2, and a parameter \varepsilon > 0, it is possible to compute, in polynomial time, a traveling salesman tour of the input whose length is at most a factor 1+\varepsilon more than the length of the optimum tour. (This is called a polynomial-time approximation scheme, or PTAS.)

Later, Arora extended this to work in \mathbb R^k for every fixed k \geq 2. What properties of \mathbb R^k are really needed to get such an algorithm?

Certainly a key property is that the volume of a ball of radius r in \mathbb R^k only grows like O(r^k). This ensures that one can choose an \epsilon-net of size at most O(1/\epsilon)^k in a ball of radius {O(1)}, which is essential for using dynamic programming. In my opinion, this leads to the most fascinating problem left open in this area:

Is bounded volume growth the only property needed to get a PTAS?

This would imply that the use of Euclidean geometry in Arora’s algorithm is non-essential. We can state the question formally as follows. Let (X,d) be a metric space, and let \lambda(X,d) be the smallest number \lambda such that for every r > 0,every ball of radius r in X can be covered by \lambda balls of radius r/2. Is there a PTAS for TSP in X? (In other words, the running time should be bounded by a polynomial in n =|X| whose degree depends only on \lambda(X,d).)

The problem is open, though there are some partial results by Talwar.

June 22, 2010

Random tree embeddings

Filed under: lecture, Math — Tags: , , , — James Lee @ 10:28 am

Random embeddings

When it is impossible to achieve a low-distortion embedding of some space {X} into another space {Y}, we can consider more lenient kinds of mappings which are still suitable for many applications. For example, consider the unweighted {n}-cycle {C_n}. It is known that any embedding of {C_n} into a tree metric incurs distortion {\Omega(n)}. On the other hand, if we delete a uniformly random edge of {C_n}, this leaves us with a random tree (actually a path) {C'_n} such that for any {x,y \in C_n}, we have

\displaystyle  d_{C'_n}(x,y) \geq d_{C_n}(x,y) \geq \frac12\, \mathop{\mathbb E}[d_{C'_n}(x,y)].

In other words, in expectation the distortion is only 2.

Let {(X,d)} be a finite metric space, and let {\mathcal F} be a family of finite metric spaces. A stochastic embedding from {X} into {\mathcal F} is a random pair {(F,Y)} where {Y \in \mathcal F} and {F : X \rightarrow Y} is a non-contractive mapping, i.e. such that {d_Y(F(x),F(y)) \geq d_X(x,y)} for all {x,y \in X}. The distortion of {F} is defined by

\displaystyle  \max_{x,y \in X} \frac{\mathop{\mathbb E} \left[d_Y(F(x),F(y))\right]}{d_X(x,y)}\,.

We will now argue that every finite metric space admits a surprisingly good stochastic embedding into tree metrics. The next result is from Fakcharoenphol, Rao, and Talwar, following work by Bartal and Alon, Karp, Peleg, and West.

Theorem 1 Every {n}-point metric space {(X,d)} admits a stochastic embedding into the family of tree metrics, with distortion {O(\log n)}.

We will need the random partitioning theorem we proved last time:

Theorem 2 For every {\Delta > 0}, there is a {\Delta}-bounded random partition {\mathcal P} of {X} which satisfies, for every {x \in X} and {0 < r \leq \Delta/8},

\displaystyle   \mathop{\mathbb P}\left(B(x,r) \nsubseteq \mathcal P(x)\right) \leq \frac{8r}{\Delta} H\left(|B(x,\Delta/8)|,|B(x,\Delta)|\right). \ \ \ \ \ (1)

From partitions to trees. Before proving the theorem, we discuss a general way of constructing a tree metric from a sequence of partitions. Assume (by perhaps scaling the metric first) that {1 < d(x,y) \leq 2^M} for all {x,y \in X} with {x \neq y}. Let {P_0, P_1, \ldots, P_M} be partitions of {X}, where {P_k} is {2^k}-bounded. We will assume that {P_M = \{X\}} and {P_0} is a partition of {X} into singleton sets.



Now we inductively construct a tree metric {T = T(P_0, P_1, \ldots, P_M)} as follows. The nodes of the tree will be of the form {(k,S)} for {k \in \{0,1,\ldots,M\}} and {S \subseteq X}. The root is {(M,X)}. In general, if the tree has a node of the form {(j,S)} for {j > 0}, then {(j,S)} will have children

\displaystyle \{ (j-1, S \cap T) : T \in P_{j-1} \textrm{ and } S \cap T \neq \emptyset\}.

The length of an edge of the form {\{(j,S), (j-1,S')\}} is {2^{j}}. This specifies the entire tree {T}. We also specify a map {F : X \rightarrow T} by {F(x) = (0,\{x\})}. We leave the following claim to the reader.

Claim 1 For every {x,y \in X},

\displaystyle  d_{T}(F(x),F(y)) = 2 \sum_{k=1}^{j+1} 2^k = 2^{j+3}-4.

where {j \in \{0,1,\ldots,M\}} is the largest index with {P_j(x) \neq P_j(y)}.

Note, in particular, that {F : X \rightarrow T} is non-contracting because if {2^j < d(x,y) \leq 2^{j+1}} for some {j \geq 0}, then {P_j(x) \neq P_j(y)} since {P_j} is {2^{j}}-bounded, implying that {d_{T}(F(x),F(y)) \geq 2^{j+3}-4 \geq 2^{j+1}}.

Now we are ready to prove Theorem 1.

Proof: Again, assume that {1 < d(x,y) \leq 2^M} for all {x,y \in X}. For {0 < k < M}, let {\mathcal P_k} be the {2^k}-bounded random partition guaranteed by Theorem 2. Let {\mathcal P_0} be a partition into singletons, and let {\mathcal P_M = \{X\}}. Finally, let {\mathcal T=\mathcal T(\mathcal P_0, \ldots, \mathcal P_M)} be the tree constructed above, and let {F : X \rightarrow \mathcal T} be the corresponding (random) non-contractive mapping.

Now, fix {x,y \in X} and an integer {h} such that {2^h < d(x,y) \leq 2^{h+1}}. Using Claim 1 and (1), we have,

\displaystyle  \mathop{\mathbb E}\left[d_{\mathcal T}(F(x),F(y))\right] \leq \sum_{j=0}^M \mathop{\mathbb P}[\mathcal P_j(x) \neq \mathcal P_j(y)] \cdot 2^{j+3}

    \displaystyle  \leq O(d(x,y)) + \sum_{j=h+3}^M \frac{8\,d(x,y)}{2^j} 2^{j+3} H\left(|B(x, 2^{j-3})|,|B(x,2^{j})|\right)

    \displaystyle  \leq O(d(x,y)) + \sum_{j=h+3}^M \frac{8\,d(x,y)}{2^j} 2^{j+3} H\left(|B(x, 2^{j-3})|,|B(x,2^{j})|\right)

   \displaystyle  \leq O(d(x,y)) + O(1) \,d(x,y) \sum_{j=h+3}^M H\left(|B(x, 2^{j-3})|,|B(x,2^{j})|\right)

   \displaystyle  \leq O(H(1,n))\,d(x,y)

   \displaystyle  \leq O(\log n)\,d(x,y),

where in the penultimate line, we have evaluated three disjoint telescoping sums: For any numbers {1 \leq a_1 \leq a_2 \leq \cdots \leq a_k},

\displaystyle H(a_1,a_2)+H(a_2,a_3)+\cdots+H(a_{k-1},a_k)=H(a_1,a_k).


Since every tree embeds isometrically into {\ell_1}, this offers an alternate proof of Bourgain’s theorem when the target space is {\ell_1}. Since we know that expander graphs require {\Omega(\log n)} distortion into {{\ell}_1}, this also shows that Theorem 1 is asymptotically tight.

« Newer PostsOlder Posts »

The Shocking Blue Green Theme. Blog at


Get every new post delivered to your Inbox.

Join 78 other followers