tcs math – some mathematics of theoretical computer science

July 18, 2010

The majorizing measures theorem

Filed under: lecture, Math — Tags: , , , — James Lee @ 4:05 am

We will now prove Talagrand’s majorizing measures theorem, showing that the generic chaining bound is tight for Gaussian processes. The proof here will be a bit more long-winded than the proof from Talagrand’s book, but also (I think), a bit more accessible as well. Most importantly, we will highlight the key idea with a simple combinatorial argument.

First, let’s recall the bound we proved earlier.

Theorem 1 Let {\{X_t\}_{t \in T}} be a Gaussian process, and let {T_0 \subseteq T_1 \subseteq \cdots \subseteq T} be a sequence of subsets such that {|T_0|=1} and {|T_n| \leq 2^{2^{n}}} for {n \geq 1}. Then,

\displaystyle   \mathop{\mathbb E}\sup_{t \in T} X_t \leq O(1) \sup_{t \in T} \sum_{n \geq 0} 2^{n/2} \, d(t, T_n). \ \ \ \ \ (1)

In order to make things slightly easier to work with, we look at an essentially equivalent way to state (1). Consider a Gaussian process {\{X_t\}_{t \in T}} and a sequence of increasing partitions {\{\mathcal A_n\}_{n \geq 0}} of {T}, where increasing means that {\mathcal A_{n+1}} is a refinement of {\mathcal A_n} for {n \geq 0}. Say that such a sequence {\{\mathcal A_n\}} is admissible if {\mathcal A_0 = \{T\}} and {|\mathcal A_n| \leq 2^{2^n}} for all {n \geq 1}. Also, for a partition {P} and a point {t \in T}, we will use the notation {P(t)} for the unique set in {P} which contains {t}.

By choosing {T_n} to be any set of points with one element in each piece of the partition {\mathcal A_n}, (1) yields,

\displaystyle  \mathop{\mathbb E}\sup_{t \in T} X_t \leq O(1) \sup_{t \in T} \sum_{n \geq 0} 2^{n/2} \,\mathrm{diam}(\mathcal A_n(t)). \ \ \ \ \ (2)

We can now state our main theorem, which shows that this is essentially the only way to bound {\mathop{\mathbb E} \sup_{t \in T} X_t}.

Theorem 2 There is a constant {L > 0} such that for any Gaussian process {\{X_t\}_{t \in T}}, there exists an admissible sequence {\{\mathcal A_n\}} which satisfies,

\displaystyle  \mathop{\mathbb E} \sup_{t \in T} X_t \geq L \sup_{t \in T} \sum_{n \geq 0} 2^{n/2} \,\mathrm{diam}(\mathcal A_n(t)). \ \ \ \ \ (3)

Recall that for a subset {A \subseteq T}, we defined {g(A) = \mathop{\mathbb E} \sup_{t \in A} X_t}, and in the last post, we proved the following “Sudakov inequality.”

Theorem 3 For some constants {\kappa > 0} and {r \geq 4}, the following holds. Suppose {\{X_t\}_{t \in T}} is a Gaussian process, and let {t_1, t_2, \ldots, t_m \in T} be such that {d(t_i,t_j) \geq \alpha} for {i \neq j}. Then,

\displaystyle   g(T) \geq \kappa \alpha \sqrt{\log_2 m} + \min_{i=1,2,\ldots,m} g(B(t_i, \alpha/r)). \ \ \ \ \ (4)

We will use only Theorem 3 and the fact that {g(A) \leq g(B)} whenever {A \subseteq B} to prove Theorem 2 (so, in fact, Theorem 2 holds with {\mathop{\mathbb E} \sup_{t \in T} X_t} replaced by more general functionals satisfying an inequality like (4)).

The partitioning scheme

First, we will specify the partitioning scheme to form an admissible sequence {\{\mathcal A_n\}}, and then we will move on to its analysis. As discussed in earlier posts, we may assume that {T} is finite. Every set {C \in \mathcal A_n} will have a value {\mathrm{rad}(C)} associated with it, such that {\mathrm{rad}(C)} is always an upper bound on the radius of the set {C}, i.e. there exists a point {x \in C} such that {C \subseteq B(x, \mathrm{rad}(C))}.

Initially, we set {\mathcal A_0 = \{T\}} and {\mathrm{rad}(T) = \mathrm{diam}(T)}. Now, we assume that we have constructed {\mathcal A_n}, and show how to form the partition {\mathcal A_{n+1}}. To do this, we will break every set {C \in \mathcal A_n} into at most {2^{2^n}} pieces. This will ensure that

\displaystyle |\mathcal A_{n+1}| \leq 2^{2^n} \cdot |\mathcal A_n| \leq 2^{2^n} \cdot 2^{2^n} = 2^{2^{n+1}}.

Let {r} be the constant from Theorem 3. Put {m = 2^{2^n}}, and let {\Delta = \mathrm{rad}(C)}. We partition {C} into {m} pieces as follows. First, choose {t_1 \in C} which maximizes the value

\displaystyle  g(B(t_1, \Delta/r^2) \cap C).

Then, set {C_1 = B(t_1, \Delta/r) \cap C}. We put {\mathrm{rad}(C_1) = \Delta/r}.

A remark: The whole idea here is that we have chosen the “largest possible piece,” (in terms of {g}-value), but we have done this with respect to the {\Delta/r^2} ball, while we cut out the {\Delta/r} ball. The reason for this will not become completely clear until the analysis, but we can offer a short explanation here. Looking at the lower bound (4), observe that the balls {B(t_i, \alpha/3)} are disjoint under the assumptions, but we only get “credit” for the {B(t_i, \alpha/r)} balls. When we apply this lower bound, it seems that we are throwing a lot of the space away. At some point, we will have to make sure that this thrown away part doesn’t have all the interesting stuff! The reason for our choice of {\Delta/r} vs. {\Delta/r^2} is essentially this: We want to guarantee that if we miss the interesting stuff at this level, then the previous level took care of it. To have this be the case, we will have to look forward (a level down), which (sort of) explains our choice of optimizing for the {\Delta/r^2} ball.

Now we continue in this fashion. Let {D_{\ell} = C \setminus \bigcup_{i=1}^{\ell-1} C_i} be the remaining space after we have cut out {\ell-1} pieces. For {\ell \leq m}, choose {t_{\ell} \in C} to maximize the value

\displaystyle  g(B(t_{\ell}, \Delta/r^2) \cap D_{\ell}).

For {\ell < m}, set {C_{\ell} = B(t_{\ell}, \Delta/r) \cap D_{\ell}}, and put {\mathrm{rad}(C_{\ell}) = \Delta/r}.

So far, we have been chopping the space into smaller pieces. If {D_{\ell} = \emptyset} for some {\ell \leq m}, we have finished our construction of {\mathcal A_{n+1}}. But maybe we have already chopped out {m-1} pieces, and still some remains. In that case, we put {C_m = D_m}, i.e. we throw everything else into {C_m}. Since we cannot reduce our estimate on the radius, we also put {\mathrm{rad}(C_m) = \Delta}.

We continue this process until {T} is exhausted, i.e. eventually for some {n} large enough, {\mathcal A_n} only contains singletons. This completes our description of the partitioning.

The tree

For the analysis, it will help to consider our partitioning process as having constructed a tree (in the most natural way). The root of the the tree is the set {T}, and its children are the sets of {\mathcal A_1}, and so on. Let’s call this tree {\mathcal W}. It will help to draw and describe {\mathcal W} in a specific way. First, we will assign values to the edges of the tree. If {C \in \mathcal A_n} and {C_j} is a child of {C} (i.e., {C_j \in \mathcal A_{n+1}} and {C_j \subseteq C}), then the edge {(C,C_j)} is given value:

\displaystyle  \kappa \cdot \frac{\mathrm{rad}(C)}{r} \cdot 2^{n/2}, \ \ \ \ \ (5)

where {\kappa} and {r} are the constants from Theorem 3.

If we define the value of a root-leaf path in {\mathcal W} as the sum of the edge lengths on that path, then for any {t \in T},

\displaystyle  \sum_{n \geq 0} 2^{n/2}\,\mathrm{diam}(\mathcal A_n(t)) \leq 2 \frac{r}{\kappa} \left(\textrm{value of the path from the root to } t\right),

simply using {\mathrm{diam}(\mathcal A_n(t)) \leq 2\, \mathrm{rad}(\mathcal A_n(t))}.

Thus in order to prove Theorem 2, which states that for some {L > 0},

\displaystyle  g(T) \geq L \sup_{t \in T} \sum_{n \geq 0} 2^{n/2}\,\mathrm{diam}(\mathcal A_n(t)),

it will suffice to show that for some (other) constant {L > 0}, for any root-leaf path {P} in {\mathcal W}, we have

\displaystyle  g(T) \geq L \cdot \mathrm{value}(P). \ \ \ \ \ (6)

Before doing this, we will fix a convention for drawing parts of {\mathcal W}. If a node {C \in \mathcal A_n} has children {C_1, C_2, \ldots, C_m}, we will draw them from left to right. We will call an edge {(C,C_m)} a right turn and every other edge will be referred to as a left turn. Note that some node {C} may not have any right turn coming out of it (if the partitioning finished before the last step). Also, observe that along a left turn, the radius always drops by a factor of {r}, while along a right turn, it remains the same.

We now make two observations about computing the value {\mathrm{value}(P)} up to a universal constant.

Observation (1): In computing the value of a root-leaf path {P}, we only need to consider right turns.

To see this, suppose that we have a right turn followed by a consecutive sequence of left turns. If the value of the right turn is {\frac{\kappa}{r} \Delta 2^{n/2}}, then the value of the following sequence of left turns is, in total, at most

\displaystyle  \frac{\kappa}{r} \sum_{j=1}^{\infty} 2^{(n+j)/2} \frac{\Delta}{r^j} \leq O(1) \frac{\kappa}{r} \Delta 2^{n/2}.

In other words, because the radius decreases by a factor of {r} along every left turn, their values decrease geometrically, making the whole sum comparable to the preceding right turn. (Recall that {r \geq 4}, so indeed the sum is geometric.)

If the problem of possibly of having no right turn in the path {P} bothers you, note that we could artificially add an initial right turn into the root with value {\mathrm{diam}(T)}. This is justified since {g(T) \geq \frac12 \mathrm{diam}(T)} always holds. A different way of saying this is that if the path really contained no right turn, then its value is {O(\mathrm{diam}(T))}, and we can easily prove (6).

Observation (2): In computing the value of a root-leaf path {P}, we need only consider the last right turn in any consecutive sequence of right turns.

Consider a sequence of consecutive right turns, and the fact that the radius does not decrease. The values (taking away the {\kappa/r} factor) look like {\Delta 2^{n/2}, \Delta 2^{(n+1)/2}, \Delta 2^{(n+2)/2}, \ldots}. In other words, they are geometrically increasing, and thus using only the last right turn in every sequence, we only lose a constant factor.

We will abbreviate last right turn to LRT, and write {\mathrm{value}_{\mathrm{LRT}}(P)} to denote the value of {P}, just counting last right turns. By the two observations, to show (6) (and hence finish the proof), it suffices to show that, for every root-leaf path {P} in {\mathcal W},

\displaystyle  2 \cdot g(T) \geq \mathrm{value}_{\mathrm{LRT}}(P). \ \ \ \ \ (7)

The analysis

Recall that our tree {\mathcal W} has values on the edges, defined in (5). We will also put some natural values on the nodes. For a node {C} (which, recall, is just a subset {C \subseteq T}), we put {\mathrm{value}(C) = g(C)}. So the edges have values and the nodes have values. Thus given any subset of nodes and edges in {\mathcal W}, we can talk about the value of the subset, which will be the sum of the values of the objects it contains. We will prove (7) by a sequence of inequalities on subsets.

Fix a root-leaf path {P}, for which we will prove (7). Let’s prove the fundamental inequality now. We will consider two consecutive LRTs along {P}. (If there is only one LRT in {P}, then we are done by the preceding remarks.) See the figure below. The dashed lines represent a (possibly empty) sequence of left turns and then right turns. The two LRTs are marked.

We will prove the following inequality, which is the heart of the proof. One should understand that the inequality is on the values of the subsets marked in red. The first subset contains two nodes, and the second contains two nodes and an edge.

Figure A.

Figure A.

With this inequality proved, the proof is complete. Let’s see why. We start with the first LRT. Since g(T) \geq g(C) for any node C in \mathcal W, we have the inequality:

This gets us started. Now we apply the inequality of Figure A repeatedly to each pair of consecutive LRTs in the path {P}. What do we have when we’ve exhausted the path P? Well, precisely all the LRTs in {P} are marked, yielding {2 \cdot g(T) \geq \mathrm{value}_{\mathrm{LRT}}(P)}, as desired.

The LRT inequality

Now we are left to prove the inequality in Figure A. First, let’s label some of the nodes. Let {\Delta = \mathrm{rad}(C)}, and suppose that {C \in \mathcal A_n}. The purple values are not the radii of the corresponding nodes, but they are upper bounds on the radii, recalling that along every left turn, the radius decreases by a factor of {r}. Since there are at least two left turns in the picture, we get a {\Delta/r^2} upper bound on the radius of {J}.

Part of the inequality is easy: We have {g(A) \geq g(B)} since {B \subseteq A}. So we can transfer the red mark from {A} to {B}. We are thus left to prove that

\displaystyle  g(C) \geq \frac{\kappa}{r} \Delta 2^{n/2} + g(J). \ \ \ \ \ (8)

This will allow us to transfer the red mark from {C} to the LRT coming out of {C} and to {J}.

When {C} was partitioned into {m = 2^{2^n}} pieces, this was by our greedy partitioning algorithm using centers {t_1, t_2, \ldots, t_m}. Since we cut out the {\Delta/r} ball around each center, we have {d(t_i, t_j) \geq \Delta/r} for all {i \neq j}. Applying the Sudakov inequality (Theorem 3), we have

\displaystyle  \begin{array}{rl} g(C) &\displaystyle \geq \kappa \frac{\Delta}{r} \sqrt{\log_2 m} + \min_{i=1,\ldots,m} g(B(t_i, \Delta/r^2)) \\ \\ &\displaystyle = \frac{\kappa}{r} \Delta 2^{n/2} + \min_{i=1,\ldots,m} g(B(t_i, \Delta/r^2)) \\ \\ &\displaystyle \geq \frac{\kappa}{r} \Delta 2^{n/2} + \min_{i=1,\ldots,m} g(B(t_i, \Delta/r^2) \cap D_i) \\ \\ &\displaystyle = \frac{\kappa}{r} \Delta 2^{n/2} + g(B(t_m, \Delta/r^2) \cap D_m), \\ \end{array}

where the last line follows from the greedy manner in which the {t_i}‘s were chosen.

But now we claim that

\displaystyle  g(B(t_m, \Delta/r^2) \cap D_m) \geq g(J). \ \ \ \ \ (9)

This follows from two facts. First, {J \subseteq D_m} (since {D_m=C_m} actually). Secondly, the radius of {J} is at most {\Delta/r^2}! But {t_m} was chosen to maximize the value of {g(B(t_m, \Delta/r^2) \cap D_m)} over all balls of radius {\Delta/r^2}, so in particular its {g}-value is at least that of the {\Delta/r^2} ball containing {J}.

Combining (9) and the preceding inequality, we prove (8), and thus that the inequality of Figure A is valid. This completes the proof.

July 8, 2010

Majorizing measures: Gaussian tools

Filed under: lecture, Math — Tags: , , , — James Lee @ 2:33 am

In order to prove that the chaining argument is tight, we will need some additional properties of Gaussian processes. For the chaining upper bound, we used a series of union bounds specified by a tree structure. As a first step in producing a good lower bound, we will look at a way in which the union bound is tight.

Theorem 1 (Sudakov inequality) For some constant C > 0, the following holds. Let {\{X_t\}_{t \in T}} be a Gaussian process such that for every distinct {s,t \in T}, we have {d(s,t) \geq \alpha}. Then,

\displaystyle  \mathop{\mathbb E} \sup_{t \in T} X_t \geq C \alpha \sqrt{\log |T|}.

The claim is an elementary calculation for a sequence of i.i.d. {N(0,1)} random variables {g_1, g_2, \ldots, g_n} (i.e. {\mathop{\mathbb E} \sup_i g_i \geq C\sqrt{\log n}}). We will reduce the general case to this one using Slepian’s comparison lemma.

Lemma 2 (Slepian’s Lemma) Let {\{X_t\}_{t \in T}} and {\{Y_t\}_{t \in T}} be two Gaussian processes such that for all {s,t \in T},

\displaystyle   \mathop{\mathbb E} \,|X_s - X_t|^2 \geq \mathop{\mathbb E} \,|Y_s - Y_t|^2. \ \ \ \ \ (1)

Then {\mathop{\mathbb E} \sup_{t \in T} X_t \geq \mathop{\mathbb E} \sup_{t \in T} Y_t}.

There is a fairly elementary proof of Slepian’s Lemma (see, e.g. the Ledoux-Talagrand book), if one is satisfied with the weaker conclusion {2\, \mathop{\mathbb E}\,|X_s-X_t|^2 \geq \mathop{\mathbb E}\,|Y_s-Y_t|^2}, which suffices for our purposes.

To see that Lemma 2 yields Theorem 1, take a family {\{X_t\}_{t \in T}} with {d(s,t) \geq \alpha} for all {s \neq t \in T} and consider the associated variables {Y_t = \frac{\alpha}{\sqrt{2}} g_t} where {\{g_t\}_{t \in T}} is a family of i.i.d. {N(0,1)} random variables. It is straightforward to verify that (1) holds, hence by the lemma, {\mathop{\mathbb E} \sup_{t \in T} X_t \geq \frac{\alpha}{\sqrt{2}} \mathop{\mathbb E} \sup_{t \in T} g_t}, and the result follows from the i.i.d. case.

The Sudakov inequality gives us “one level” of a lower bound; the following strengthening will allow us to use it recursively. If we have a Gaussian process {\{X_t\}_{t \in T}} and {A \subseteq T}, we will use the notation

\displaystyle  g(A) = \mathop{\mathbb E} \sup_{t \in A} X_t.

For {t \in T} and {R \geq 0}, we also use the notation

\displaystyle  B(t,R) = \{ s \in T : d(s,t) \leq R \}.

Here is the main theorem of this post; its statement is all we will require for our proof of the majorizing measures theorem:

Theorem 3 For some constants C > 0 and {r > 1}, the following holds. Suppose {\{X_t\}_{t \in T}} is a Gaussian process, and let {t_1, t_2, \ldots, t_m \in T} be such that {d(t_i,t_j) \geq \alpha} for {i \neq j}. Then,

\displaystyle  g(T) \geq C \alpha \sqrt{\log m} + \min_{i=1,2,\ldots,m} g(B(t_i, \alpha/r)).

The proof of the preceding theorem relies on the a strong concentration property for Gaussian processes. First, we recall the classical isoperimetric inequality for Gaussian space (see, for instance, (2.9) here).
We remind the reader that for a function {F : \mathbb R^n \rightarrow \mathbb R},

\displaystyle \|F\|_{\mathrm{Lip}} = \sup_{x \neq y \in \mathbb R^n} \frac{|F(x)-F(y)|}{\|x-y\|}.

Theorem 4 Let {F : \mathbb R^n \rightarrow \mathbb R}, and let {\mu = \int F \,d\gamma_n}, where {\gamma_n} is the standard {n}-dimensional Gaussian measure. Then,

\displaystyle   \gamma_n\left(x \in \mathbb R^n : |F(x)-\mu| > \lambda\right) \leq 2\,\exp\left(\frac{-\lambda^2}{2 \|F\|_{\mathrm{Lip}}}\right). \ \ \ \ \ (2)

Using this, we can prove the following remarkable fact.

Theorem 5 Let {\{X_t\}_{t \in T}} be a Gaussian process, then

\displaystyle   \mathbb P\left(\left|\sup_{t \in T} X_t - \mathop{\mathbb E} \sup_{t \in T} X_t\right| > \lambda\right) \leq 2\,\exp\left(\frac{-\lambda^2}{2 \sup_{t \in T} \mathop{\mathbb E}(X_t^2)}\right). \ \ \ \ \ (3)

A notable aspect of this statement is that only the maximum variance affects the concentration, not the number of random variables. We now prove Theorem 5 using Theorem 4.

Proof: We will prove it in the case {|T|=n}, but of course our bound is independent of {n}. The idea is that given a Gaussian process {\{X_1, X_2, \ldots, X_n\}}, we can write

\displaystyle  X_i = a_{i1} \,g_1 + a_{i2}\, g_2 + \cdots + a_{in}\, g_n,

for {i=1,2,\ldots, n}, where {\{g_i\}_{i=1}^n} are standard i.i.d. normals, and the matrix {A=(a_{i,j})} is a matrix of real coefficients. In this case, if {g = (g_1, g_2, \ldots, g_n)} is a standard {n}-dimensional Gaussian, then the vector {Ag} is distributed as {(X_1, X_2, \ldots, X_n)}.

If we put {F(x)=\max \{ (Ax)_i : i=1,\ldots,n\}}, then Theorem 4 yields (3) as long as {\|F\|_{\mathrm{Lip}} \leq \max_i \sqrt{\mathop{\mathbb E}(X_i^2)}}. It is easy to see that

\displaystyle  \|F\|_{\mathrm{Lip}} = \|A\|_{2 \rightarrow \infty} = \sup_{\|x\|_2 = 1} \|A x\|_{\infty}.

But {\|A\|_{2 \rightarrow \infty}} is just the maximum {\ell_2} norm of any row of {A}, and the {\ell_2} norm of row {i} is

\displaystyle \sqrt{\sum_{j=1}^n a_{ij}^2} = \sqrt{\mathop{\mathbb E}(X_i^2)}.


Using this theorem, we are ready to prove Theorem 3. I will only give a sketch here, but filling in the details is not too difficult.

Assume that the conditions of Theorem 3 hold. Pick an arbitrary {t_0 \in T}, and recall that we can write

\displaystyle g(T) = \mathop{\mathbb E} \sup_{t \in T} X_t = \mathop{\mathbb E} \sup_{t \in T} (X_t - X_{t_0})

since our gaussians are centered.

Now, by Theorem 1,

\displaystyle \mathop{\mathbb E} \max_{i=1,\ldots,m} \left(X_{t_i} - X_{t_0}\right) \geq C \alpha \sqrt{\log m}.

Suppose that {t_1} achieves this. By definition,

\displaystyle g(B(t_1, \alpha/r)) = \mathop{\mathbb E} \sup_{t \in B(t_1, \alpha/r)} \left(X_t - X_{t_1}\right),

so we could hope that for some {t \in B(t_1,\alpha/r)}, we simultaneously have {X_t - X_{t_1} \geq g(B(t_1,\alpha/r))}, yielding

\displaystyle  X_t - X_{t_0} = (X_t - X_{t_1}) + (X_{t_1} - X_{t_0}) \geq C\alpha \sqrt{\log m} + g(B(t_1, \alpha/r)). \ \ \ \ \ (4)

The problem, of course, is that the events we are discussing are not independent.

This is where Theorem 5 comes in. For any {i}, all the variances of the variables {\{X_t - X_{t_i} : t \in B(t_i,\alpha/r)\}} are bounded by {d(t,t_i)^2 \leq (\alpha/r)^2}. This implies that we can choose a constant {c_0 > 0} such that

\displaystyle  \mathbb P\left(\left|\sup_{t \in B(t_i,\alpha/r)} X_t - g(B(t_i, \alpha/r))\right| > c_0 (\alpha/r) \sqrt{\log m}\right) \leq m^{-2}.

So, in fact, we can expect that none of the {m} random variables {\sup_{t \in B(t_i,\alpha/r)} X_t} will deviate from its expected value by more than {c_0 (\alpha/r) \sqrt{\log m}}. Which means we can (morally) replace (4) by

\displaystyle  \begin{array}{rl} X_t - X_{t_0} &= (X_t - X_{t_1}) + (X_{t_1} - X_{t_0}) \\ &\geq C\alpha \sqrt{\log m} + g(B(t_1, \alpha/r)) - c_0 (\alpha/r) \sqrt{\log m}. \end{array}

But now by choosing {r = 2 C c_0}, the error term is absorbed.

June 22, 2010

Random tree embeddings

Filed under: lecture, Math — Tags: , , , — James Lee @ 10:28 am

Random embeddings

When it is impossible to achieve a low-distortion embedding of some space {X} into another space {Y}, we can consider more lenient kinds of mappings which are still suitable for many applications. For example, consider the unweighted {n}-cycle {C_n}. It is known that any embedding of {C_n} into a tree metric incurs distortion {\Omega(n)}. On the other hand, if we delete a uniformly random edge of {C_n}, this leaves us with a random tree (actually a path) {C'_n} such that for any {x,y \in C_n}, we have

\displaystyle  d_{C'_n}(x,y) \geq d_{C_n}(x,y) \geq \frac12\, \mathop{\mathbb E}[d_{C'_n}(x,y)].

In other words, in expectation the distortion is only 2.

Let {(X,d)} be a finite metric space, and let {\mathcal F} be a family of finite metric spaces. A stochastic embedding from {X} into {\mathcal F} is a random pair {(F,Y)} where {Y \in \mathcal F} and {F : X \rightarrow Y} is a non-contractive mapping, i.e. such that {d_Y(F(x),F(y)) \geq d_X(x,y)} for all {x,y \in X}. The distortion of {F} is defined by

\displaystyle  \max_{x,y \in X} \frac{\mathop{\mathbb E} \left[d_Y(F(x),F(y))\right]}{d_X(x,y)}\,.

We will now argue that every finite metric space admits a surprisingly good stochastic embedding into tree metrics. The next result is from Fakcharoenphol, Rao, and Talwar, following work by Bartal and Alon, Karp, Peleg, and West.

Theorem 1 Every {n}-point metric space {(X,d)} admits a stochastic embedding into the family of tree metrics, with distortion {O(\log n)}.

We will need the random partitioning theorem we proved last time:

Theorem 2 For every {\Delta > 0}, there is a {\Delta}-bounded random partition {\mathcal P} of {X} which satisfies, for every {x \in X} and {0 < r \leq \Delta/8},

\displaystyle   \mathop{\mathbb P}\left(B(x,r) \nsubseteq \mathcal P(x)\right) \leq \frac{8r}{\Delta} H\left(|B(x,\Delta/8)|,|B(x,\Delta)|\right). \ \ \ \ \ (1)

From partitions to trees. Before proving the theorem, we discuss a general way of constructing a tree metric from a sequence of partitions. Assume (by perhaps scaling the metric first) that {1 < d(x,y) \leq 2^M} for all {x,y \in X} with {x \neq y}. Let {P_0, P_1, \ldots, P_M} be partitions of {X}, where {P_k} is {2^k}-bounded. We will assume that {P_M = \{X\}} and {P_0} is a partition of {X} into singleton sets.



Now we inductively construct a tree metric {T = T(P_0, P_1, \ldots, P_M)} as follows. The nodes of the tree will be of the form {(k,S)} for {k \in \{0,1,\ldots,M\}} and {S \subseteq X}. The root is {(M,X)}. In general, if the tree has a node of the form {(j,S)} for {j > 0}, then {(j,S)} will have children

\displaystyle \{ (j-1, S \cap T) : T \in P_{j-1} \textrm{ and } S \cap T \neq \emptyset\}.

The length of an edge of the form {\{(j,S), (j-1,S')\}} is {2^{j}}. This specifies the entire tree {T}. We also specify a map {F : X \rightarrow T} by {F(x) = (0,\{x\})}. We leave the following claim to the reader.

Claim 1 For every {x,y \in X},

\displaystyle  d_{T}(F(x),F(y)) = 2 \sum_{k=1}^{j+1} 2^k = 2^{j+3}-4.

where {j \in \{0,1,\ldots,M\}} is the largest index with {P_j(x) \neq P_j(y)}.

Note, in particular, that {F : X \rightarrow T} is non-contracting because if {2^j < d(x,y) \leq 2^{j+1}} for some {j \geq 0}, then {P_j(x) \neq P_j(y)} since {P_j} is {2^{j}}-bounded, implying that {d_{T}(F(x),F(y)) \geq 2^{j+3}-4 \geq 2^{j+1}}.

Now we are ready to prove Theorem 1.

Proof: Again, assume that {1 < d(x,y) \leq 2^M} for all {x,y \in X}. For {0 < k < M}, let {\mathcal P_k} be the {2^k}-bounded random partition guaranteed by Theorem 2. Let {\mathcal P_0} be a partition into singletons, and let {\mathcal P_M = \{X\}}. Finally, let {\mathcal T=\mathcal T(\mathcal P_0, \ldots, \mathcal P_M)} be the tree constructed above, and let {F : X \rightarrow \mathcal T} be the corresponding (random) non-contractive mapping.

Now, fix {x,y \in X} and an integer {h} such that {2^h < d(x,y) \leq 2^{h+1}}. Using Claim 1 and (1), we have,

\displaystyle  \mathop{\mathbb E}\left[d_{\mathcal T}(F(x),F(y))\right] \leq \sum_{j=0}^M \mathop{\mathbb P}[\mathcal P_j(x) \neq \mathcal P_j(y)] \cdot 2^{j+3}

    \displaystyle  \leq O(d(x,y)) + \sum_{j=h+3}^M \frac{8\,d(x,y)}{2^j} 2^{j+3} H\left(|B(x, 2^{j-3})|,|B(x,2^{j})|\right)

    \displaystyle  \leq O(d(x,y)) + \sum_{j=h+3}^M \frac{8\,d(x,y)}{2^j} 2^{j+3} H\left(|B(x, 2^{j-3})|,|B(x,2^{j})|\right)

   \displaystyle  \leq O(d(x,y)) + O(1) \,d(x,y) \sum_{j=h+3}^M H\left(|B(x, 2^{j-3})|,|B(x,2^{j})|\right)

   \displaystyle  \leq O(H(1,n))\,d(x,y)

   \displaystyle  \leq O(\log n)\,d(x,y),

where in the penultimate line, we have evaluated three disjoint telescoping sums: For any numbers {1 \leq a_1 \leq a_2 \leq \cdots \leq a_k},

\displaystyle H(a_1,a_2)+H(a_2,a_3)+\cdots+H(a_{k-1},a_k)=H(a_1,a_k).


Since every tree embeds isometrically into {\ell_1}, this offers an alternate proof of Bourgain’s theorem when the target space is {\ell_1}. Since we know that expander graphs require {\Omega(\log n)} distortion into {{\ell}_1}, this also shows that Theorem 1 is asymptotically tight.

June 19, 2010

Random partitions of metric spaces

Filed under: lecture, Math — Tags: , , — James Lee @ 3:13 pm

In addition to the current sequence on Talagrand’s majorizing measures theory, I’m also going to be putting up a series of lecture notes about embeddings of finite metric spaces. The first few will concern random partitions of metric spaces, and their many applications.

Random partitions

Often when one is confronted with the problem of analyzing some problem on a metric space {(X,d)}, a natural way to proceed is by divide and conquer: Break the space into small pieces, do something (perhaps recursively) on each piece, and then glue these local solutions together to achieve a global result. This is certainly an important theme in areas like differential geometry, harmonic analysis, and computational geometry.

In the metric setting, “small” often means pieces of small diameter. Of course we could just break the space into singletons {X = \{x_1\} \cup \{x_2\} \cup \cdots}, but this ignores the second important aspect in a “divide and conquer” approach: what goes on at the boundary. If we are going to combine our local solutions together effectively, we want the “boundary” between distinct pieces of the partition to be small. Formalizing this idea in various ways leads to a number of interesting ideas which I’ll explore in a sequence of posts.

Example: The unit square

Consider the unit square {[0,1]^2} in the plane, equipped with the {\ell_1} metric. If we break the square into pieces of side length {\delta/2}, then every piece has diameter at most {\delta}, and a good fraction of the space is far from the boundary of the partition. In fact, it is easy to see that e.g. 60% of the measure is at least distance {\delta/20} from the boundary (the red dotted line).


But there is a problem with using this as a motivating example. First, observe that we had both a metric structure (the {\ell_1} distance) and a measure (in this case, the uniform measure on {[0,1]^2}). In many potential applications, there is not necessarily a natural measure to work with (e.g. for finite metric spaces, where counting the number of points is often a poor way of measuring the “size” of various pieces).

To escape this apparent conundrum, note that the uniform measure is really just a distraction: The same result holds for any measure on {[0,1]^2}. This is a simple application of the probabilistic method: If we uniformly shift the {\delta/2}-grid at random, then for any point {x \in [0,1]^2}, we have

\displaystyle {\mathbb P}(x \textrm{ is } \delta/20\textrm{-far from the boundary}) \geq 0.6.

Thus, by averaging, for any measure {\mu} on {[0,1]^2} there exists a partition where 60% of the {\mu}-measure is {\delta/20}-far from the boundary. This leads us to the idea that, in many situations, instead of talking about measures, it is better to talk about distributions over partitions. In this case, we want the “boundary” to be small on “average.”

Lipschitz random partitions

We will work primarily with finite metric spaces to avoid having to deal with continuous probability spaces; much of the theory carries over to general metric spaces without significant effort (but the development becomes less clean, as this requires certain measurability assumptions in the theorem statements).

Let {(X,d)} be a finite metric space. If {P} is a partition of {X}, and {x \in X}, we will write {P(x)} for the unique set in {P} containing {x}. We say that {P} is {\Delta}-bounded if {\mathrm{diam}(S) \leq \Delta} for all {S \in P}. We will also say that a random partition {\mathcal P} is {{\Delta}}-bounded if it is supported only on {\Delta}-bounded partitions of {X}.

Let’s now look at one way to formalize the idea that the “boundary” of a random partition {\mathcal P} is small.

A random partition {\mathcal P} of {X} is {{L}}-Lipschitz if, for every {x,y \in X},

\displaystyle \mathbb P\left(\mathcal P(x) \neq \mathcal P(y)\right) \leq L \cdot d(x,y).

Intuitively, the boundary is small if nearby points tend to end up in the same piece of the partition. There is a tradeoff between a random partition {\mathcal P} being {\Delta}-bounded and {{L}}-Lipschitz. As {\Delta} increases, we expect that we can make {{L}}, and hence the “boundary effect,” smaller and smaller. The following theorem can be derived from work of Leighton and Rao, or Linial and Saks. The form stated below comes from work of Bartal.

Theorem 1

If {(X,d)} is an {n}-point metric space and {n \geq 2}, then for every {\Delta > 0}, there is an {\frac{O(\log n)}{\Delta}}-Lipschitz, {\Delta}-bounded random partition {\mathcal P} of {X}.

We will prove a more general fact that will be essential later. For positive numbers {a,b}, define {H(a,b) = \sum_{i=a+1}^b \frac{1}{i}} (note that {H(a,a)=0}). The next theorem and proof are from Calinescu, Karloff, and Rabani.

Theorem 2

For every {\Delta > 0}, there is a {\Delta}-bounded random partition {\mathcal P} of {X} which satisfies, for every {x \in X} and {0 < r \leq \Delta/8},

\displaystyle {\mathbb P}\left(B(x,r) \nsubseteq \mathcal P(x)\right) \leq \frac{8r}{\Delta} H\left(|B(x,\Delta/8)|,|B(x,\Delta)|\right). \ \ \ \ \ (1)

Observe that Theorem 1 follows from Theorem 2 by setting {r = d(x,y)}, and noting that {\mathcal P(x) \neq \mathcal P(y)} implies {B(x,r) \nsubseteq \mathcal P(x)}. We also use {H(1,n) = O(\log n)} for {n \geq 2}.

Suppose that {|X|=n}. Let {\alpha \in [\frac14, \frac12]} be chosen uniformly at random, and let {\pi} be a uniformly random bijection {\pi : \{1,2,\ldots,n\} \rightarrow X}. We will think of {\pi} as giving a random ordering of {X}. We define our random partition {\mathcal P} as follows.

For {i \geq 1}, define

\displaystyle C_i = B(\pi(i), \alpha \Delta) \setminus \bigcup_{j=1}^{i-1} C_j.

It is straightforward that {\mathcal P = C_1 \cup C_2 \cup \cdots \cup C_n} is a partition of X, with perhaps many of the sets {C_i} being empty. Furthermore, by construction, {\mathcal P} is {\Delta}-bounded. Thus we are left to verify (1).

To this end, fix {x \in X} and {r > 0}, and enumerate the points of {X = \{y_1, y_2, \ldots, y_n\}} so that {d(x,y_1) \leq d(x,y_2) \leq \cdots \leq d(x,y_n)}. Let {B = B(x,r)}. We will say that a point {y_i} sees {B} if {d(y_i, x) \leq \alpha \Delta + r}, and we will say that {y_i} cuts {B} if {\alpha \Delta - r\leq d(y_i, x) \leq \alpha \Delta + r}.
(In the picture below, y_1 and y_2 see B, while y_3 does not. Only y_2 cuts B.)

Observe that for any {y \in X},

\displaystyle \mathbb P\left(y \textrm{ cuts } B\right) = \mathbb P\left(\alpha \Delta \in [d(x,y)-r,d(x,y)+r]\right) \leq \frac{2r}{\Delta/4} = \frac{8r}{\Delta}.\ \ \ \ \ (2)

Using this language, we can now reveal our analysis strategy: Let {y_{\min}} be the minimal element (according to the ordering {\pi}) which sees {B}. Then {B(x,r) \nsubseteq \mathcal P(x)} can only occur if {y_{\min}} also cuts {B}. The point is that the fate of {B(x,r)} is not decided until some point sees {B}. Then, if {y_{\min}} does not cut {B}, we have {B(x,r) \subseteq C_{\pi^{-1}(y_{\min})}}, hence {B(x,r) \subseteq \mathcal P(x)}.

Thus we can write,

\displaystyle \mathbb P\left(B(x,r) \nsubseteq \mathcal P(x)\right) \leq \mathbb P\left(y_{\min} \textrm{ cuts } B \right) \leq \sum_{i=1}^n \mathbb P\left(y_{\min}=y_i \textrm{ and } y_i \textrm{ cuts } B\right).\ \ \ \ \ (3)

To analyze the latter sum, first note that if {y \notin B(x,\Delta/2+r)}, then {y} can never see {B} since {\alpha \Delta \leq \Delta/2} always. On the other hand, if {y \in B(x,\Delta/4-r)} then {y} always sees {B}, but can never cut {B} since {\alpha \Delta \geq \Delta/4} always.
Recalling that {r \leq \Delta/8} by assumption, and setting {a=|B(x,\Delta/8)|} and let {b=|B(x,\Delta)|}, we can use (3) to write

\displaystyle \mathbb P\left(B(x,r) \nsubseteq \mathcal P(x)\right)\leq \sum_{i=a+1}^b \mathbb P\left(y_{\min}=y_i \textrm{ and } y_i \textrm{ cuts }B\right). \ \ \ \ \ (4)

Now we come to the heart of the analysis: For any {i \geq 1},

\displaystyle \mathbb P\left(y_{\min} = y_i\,|\, y_i \textrm{ cuts } B\right) \leq \frac{1}{i}.\ \ \ \ \ (5)

The idea is that if {y_i} cuts {B}, then {\alpha \Delta \geq d(x,y_i)-r}. Thus if any {y_j} for {j < i} comes before {y_i} in the {\pi}-ordering, then {y_j} also sees {B} since {d(x,y_j) \leq d(x,y_i)}, hence {y_i \neq y_{\min}}. But the probability that {y_i} is the {\pi}-minimal element of {\{y_1, \ldots, y_i\}} is precisely {\frac{1}{i}}, proving (5).

To finish the proof, we combine (2), (4), and (5), yielding

\displaystyle \mathbb P\left(B(x,r) \nsubseteq \mathcal P(x)\right) \leq \sum_{i=a+1}^b \mathbb P\left(y_i \textrm{ cuts } B\right) \cdot \mathbb P\left(y_i = y_{\min}\,|\, y_i \textrm{ cuts } B\right) \leq  \frac{8r}{\Delta} \sum_{i=a+1}^b \frac{1}{i}\,.


Tightness for expanders. Before ending this section, let us mention that Theorem 1 is optimal up to a constant factor. Let {\{G_k\}} be a family of degree-3 expander graphs, equipped with their shortest-path metric {d_k}. Assume that {|E(S, \bar S)| \geq c|S|} for some {c > 0} and all {|S| \leq \frac12 |G_k|}. Let {\Delta_k = \frac{\log |G_k|}{10}}, and suppose that {G_k} admits a {\Delta_k}-bounded {{L}}-Lipschitz random partition. By averaging, this means there is a fixed {\Delta_k}-bounded partition {P} of {G_k} which cuts at most an {{L}}-fraction of edges. But every {S \in P} has {\mathrm{diam}(S) \leq \Delta_k}, which implies {|S| < n/2}. Hence the partition must cut an {\Omega(1)} fraction of edges, implying {L = \Omega(1)}.

In the next post, we’ll see how these random partitions allow us to reduce may questions on finite metric spaces to questions on trees.

June 15, 2010

The generic chaining

Filed under: lecture, Math — Tags: , , , — James Lee @ 11:27 am

In the last post, we considered a Gaussian process {\{X_t\}_{t \in T}} and were trying to find upper bounds on the quantity {\mathop{\mathbb E}\sup_{t \in T} X_t}. We saw that one could hope to improve over the union bound by clustering the points and then taking mini union bounds in each cluster.

Hierarchical clustering

To specify a clustering, we’ll take a sequence of progressively finer approximations to our set {T}. First, recall that we fixed {t_0 \in T}, and we have used the observation that {\mathop{\mathbb E}\sup_{t \in T} X_t = \mathop{\mathbb E}\sup_{t \in T} (X_t-X_{t_0})}.

Now, assume that {T} is finite. Write {T_0 = \{t_0\}}, and consider a sequence of subsets {\{T_n\}} such that {T_0 \subseteq T_1 \subseteq T_2 \subseteq \cdots \subseteq T}. We will assume that for some large enough {m}, we have {T_n = T} for {n \geq m}. For every {n \geq 0}, let {\pi_n : T \rightarrow T_n} denote a “closest point map” which sends t \in T to the closest point in T_n.

The main point is that we can now write, for any {t \in T},

\displaystyle   X_t - X_{t_0} = \sum_{n \geq 1} X_{\pi_n(t)} - X_{\pi_{n-1}(t)}. \ \ \ \ \ (1)

This decomposition is where the term “chaining” arises, and now the idea is to bound the probability that {X_t - X_{t_0}} is large in terms of the segments in the chain.

What should {T_n} look like?

One question that arises is how we should think about choosing the approximations {T_n}. We are trading off two measures of quality: The denser {T_n} is in the set {T} (or, more precisely, in the set {T_{n-1}}) the smaller the variances of the segments {X_{\pi_n(t)}-X_{\pi_{n-1}(t)}} will be. On the other hand, the larger {T_n} is, the more segments we’ll have to take a union bound over.

So far, we haven’t used any property of our random variables except for the fact that they are centered. To make a more informed decision about how to choose the sets {\{T_n\}}, let’s recall the classical Gaussian concentration bound.

Lemma 1 For every {s,t \in T} and {\lambda > 0},

\displaystyle   \mathop{\mathbb P}(X_s - X_t > \lambda) \leq \exp\left(-\frac{\lambda^2}{2\, d(s,t)^2}\right). \ \ \ \ \ (2)

This should look familiar: {X_s-X_t} is a mean-zero Gaussian with variance {d(s,t)^2}.

Now, a first instinct might be to choose the sets {T_n} to be progressively denser in {T}. In this case, a natural choice would be to insist on something like {T_n} being a {2^{-n}}-net in {T}. If one continues down this path in the right way, a similar theory would develop. We’re going to take a different route and consider the other side of the tradeoff.

Instead of insisting that {T_n} has a certain level of accuracy, we’ll insist that {T_n} is at most a certain size. Should we require {|T_n| \leq n} or {|T_n| \leq 2^n}, or use some other function? To figure out the right bound, we look at (2). Suppose that {g_1, g_2, \ldots, g_m} are i.i.d. {N(0,1)} random variables. In that case, applying (2) and a union bound, we see that to achieve

\displaystyle  \mathop{\mathbb P}(\exists i : g_i > B) \leq m \mathop{\mathbb P}(g_1 > B) < 1,

we need to select {B \asymp \sqrt{\log m}}. If we look instead at {m^2} points instead of {m} points, the bound grows to {\sqrt{2 \log m}}. Thus we can generally square the number of points before the union bound has to pay a constant factor increase. This suggests that the right scaling is something like {|T_{n+1}| = |T_n|^2}. So we’ll require that {|T_n| \leq 2^{2^n}} for all {n \geq 1}.

The generic chaining

This leads us to the generic chaining bound, due to Fernique (though the formulation we state here is from Talagrand).

Theorem 2 Let {\{X_t\}_{t \in T}} be a Gaussian process, and let {T_0 \subseteq T_1 \subseteq \cdots \subseteq T} be a sequence of subsets such that {|T_0|=1} and {|T_n| \leq 2^{2^{n}}} for {n \geq 1}. Then,

\displaystyle   \mathop{\mathbb E}\sup_{t \in T} X_t \leq O(1) \sup_{t \in T} \sum_{n \geq 0} 2^{n/2} d(t, T_n). \ \ \ \ \ (3)

Proof: As before, let {\pi_n : T \rightarrow T_n} denote the closest point map and let {T_0 = \{t_0\}}. Using (2), for any {n \geq 1}, {t \in T}, and {u > 0}, we have

\displaystyle  \mathop{\mathbb P}\left(|X_{\pi_n(t)} - X_{\pi_{n-1}(t)}| > u 2^{n/2} d(\pi_n(t),\pi_{n-1}(t))\right) \leq \exp\left(-\frac{u^2}{2} 2^n\right).

Now, the number of pairs {(\pi_n(t),\pi_{n-1}(t))} can be bounded by {|T_n| \cdot |T_{n-1}| \leq 2^{2^{n+1}}}, so we have

\displaystyle   \mathop{\mathbb P}\left(\exists t : |X_{\pi_n(t)} - X_{\pi_{n-1}(t)}| > u 2^{n/2} d(\pi_n(t),\pi_{n-1}(t))\right) \leq 2^{2^{n+1}} \exp\left(-\frac{u^2}{2} 2^n\right). \ \ \ \ \ (4)

If we define the event

\displaystyle  \Omega_u = \left\{ \forall n \geq 1, t \in T : |X_{\pi_n(t)} - X_{\pi_{n-1}(t)}| \leq u 2^{n/2} d(\pi_n(t),\pi_{n-1}(t))\right\},

then summing (4) yields,

\displaystyle   \mathop{\mathbb P}(\overline{\Omega_u}) \leq \sum_{n \geq 1} 2^{2^{n+1}} \exp\left(-\frac{u^2}{2} 2^n\right) \leq O(1)\, e^{-u^2} \ \ \ \ \ (5)

for {u \geq 4}, since we get geometrically decreasing summands.


\displaystyle  S = \sup_{t \in T} \sum_{n \geq 1} 2^{n/2} d(\pi_n(t), \pi_{n-1}(t)).

Note that if {\Omega_u} occurs, then {\sup_{t \in T} (X_t - X_{t_0}) \leq uS}. Thus (5) implies that for u \geq 4,

\displaystyle  \mathop{\mathbb P}(\sup_{t \in T} X_t - X_{t_0} > uS) \leq O(1) \, e^{-u^2},

which implies that

\displaystyle   \mathop{\mathbb E} \sup_{t \in T} X_t \leq O(S) \leq O(1) \sup_{t \in T} \sum_{n \geq 1} 2^{n/2} d(\pi_n(t), \pi_{n-1}(t)). \ \ \ \ \ (6)

Finally, by the triangle inequality,

\displaystyle d(\pi_n(t), \pi_{n-1}(t)) \leq d(t, T_n) + d(t, T_{n-1}) \leq 2\,d(t,T_{n-1}).

Plugging this into (6) recovers (3). \Box

Theorem 1.2 gives us a fairly natural way to upper bound the expected supremum using a hierarchical clustering of {T}. Rather amazingly, as we’ll see in the next post, this upper bound is tight. Talagrand’s majorizing measure theorem states that if we take the best choice of {\{T_n\}} in Theorem 1.2, then the upper bound in (3) is within a constant factor of {\mathop{\mathbb E} \sup_{t \in T} X_t}.

December 19, 2008

Lecture 8a. A primer on simplicial complexes and collapsibility

Filed under: CSE 599S, lecture, Math — Tags: , , — James Lee @ 3:17 am

Before we can apply more advanced fixed point theorems to the Evasiveness Conjecture, we need a little background on simplicial complexes, and everything starts with simplices.


It’s most intuitive to start with the geometric viewpoint, in which case an n-simplex is defined to be the convex hull of n+1 affinely independent points in \mathbb R^n.  These points are called the vertices of the simplex.  Here are examples for n=0,1,2,3.



A simplicial complex is then a collection of simplices glued together along lower-dimensional simplices.  More formally, if S \subseteq \mathbb R^n is a (geometric) simplex, then a face of S is a subset F \subseteq S formed by taking the convex hull of a subset of the vertices of S.

Finally, a (geometric) simplicial complex is a collection \mathcal K of simplices such that

  1. If S \in \mathcal K and F is a face of S, then F \in \mathcal K, and
  2. If S,S' \in \mathcal K and S \cap S' \neq \emptyset, then S \cap S' is a face of both S and S'.

Property (1) gives us downward closure, and property (2) specifies how simplices can be glued together (only along faces).  For instance, the first picture depicts a simplicial complex.  The second does not.



October 23, 2008

Lecture 7: The evasiveness conjecture

Filed under: CSE 599S, lecture, Math — Tags: , — James Lee @ 2:59 am

Continuing our look at some toplogical methods, today we’ll see the evasiveness conjecture in decision tree complexity.  In the next lecture, we’ll see how we can sometimes analyze the complexity using fixed point theorems, and their generalizations (like the Hopf index formula), following the work of Kahn, Saks, and Sturtevant.  These two lectures are co-blogged with Elisa Celis, with a lot of input from Lovasz’s lecture notes.

Decision tree complexity and evasiveness

Consider a boolean function f : \{0,1\}^n \to \{0,1\} on n bits.  We define the decision tree complexity of f as follows.  Given an unknown input x \in \{0,1\}^n, you are allowed to ask about the values of various bits of x, e.g. x_{17}, x_{34}, x_3, \ldots.  Your goal is to compute f(x) using as few questions as possible, and your questions can be adaptive, depending on answers to previous questions.  The complexity of such a strategy is the maximum number of questions asked for any x \in \{0,1\}^n.  The decision tree complexity, written D(f), is the minimum complexity of any strategy that computes f.  (There are many other interesting models of decision complexity, see e.g. this survey.)

Clearly D(f) \leq n, because we can trivially query all the bits of x, and then output f(x).  A function f is called evasive if this upper bound is met, i.e. D(f)=n.  As an example, consider the parity function \mathsf{PARITY}(x) = x_1 \oplus x_2 \oplus \cdots \oplus x_n, where \oplus is addition modulo 2.  Clearly \mathsf{PARITY} is evasive because after n-1 bits of x are asked about, the setting of the final bit determines the value of f.

For a more general example, consider any f : \{0,1\}^n \to \{0,1\} such that \#\{ x : f(x)=1 \} is odd.  In this case, for every i =1,2,\ldots, n, exactly one of f|_{x_i=0} or f|_{x_i=1} has the same property that the number of inputs resulting in a 1 is odd.  (These two functions are the natural restriction of f to functions on n-1 bits, which results from fixing the value of the ith bit.)  Thus an adversary could keep answering questions “x_i?” so that the restricted function retains this property.  Since the number of inputs yielding a 1 is always odd, the restricted function always takes both possible values, implying that f is evasive–the advesary ensures that the value cannot be determined until all n possible questions are asked.

For an example of a non-evasive property, think of a point x \in \{0,1\}^{N \choose 2} as speciying a directed graph on N vertices, where there is exactly one directed edges connecting every pair of vertices, and x specifies the direction of this edge (this is called a tournament).  Thinking of the vertices as players, a directed edge from u to v means that u defeats v.  Now f(x)=1 if the digraph specified by x has one vertex that defeats everyone else.  What is D(f)?

Well, first we can conduct a single elimination tournament, where vertex 1 plays vertex 2, and the winner players vertex 3, and the winner of that players vertex 4, etc.  At the end, there is only one remaining vertex i that remains undefeated.  Now asking N-2 more questions, we can determine whether i indeeds defeats everyone else.  The total number of questions was (N-1)+(N-2) = 2N-3, hence D(f) \leq 2N-3 \ll {N \choose 2}, implying that f is not evasive.

Montone graph properties and the evasiveness conjecture

Let m = {N \choose 2}.  In general, we can encode an arbitrary undirected N-vertex graph as an element G \in \{0,1\}^{m}.  A function f: \{0,1\}^m \to \{0,1\} is called a graph property if relabeling the vertices of G doesn’t affect the value of f(G).  The function f is monotone if the value of the function can never change from 1 to 0 when flipping one of the input bits from 0 to 1.  In the setting of graph properties, this corresponds to those which are maintained under addition of edges to the graph, e.g. f(G) = “is G connected?” or f(G)= “does G have a k-clique?”

Evasiveness Conjecture (Aanderaa-Karp-Rosenberg): Every non-trivial monotone graph property is evasive.

Here, non-trivial means that f(\bar 0) \neq f(\bar 1), where \bar 0 and \bar 1 denote the all-zeros and all-ones strings, respectively.

For example, consider the example f(G) = “is G connected?”  The adversary is simple:  When asked about a possible edge \{i,j\}, she answers NO unless this answer would imply that the graph is disconnected.  In other words, she answers NO unless she has answered NO already for all edges across a cut (S,\bar S) except for \{i,j\}, in which case she has to answer YES.

Now, suppose there is a strategy which figures out the connectivity of G without asking a question about some edge \{i,j\}.  In this case, the conclusion must be that f(G) = \textrm{ YES} because the adversary always maintains that by answering everything in the future YES, she could force the graph to be connected.  In this case, the edges answered YES have to form a spanning tree of G (otherwise by answering all unasked questions NO, the graph would become disconnected).  Consider a path P from i to j in this YES spanning tree.  Let \{u,v\} be the edge of P which was asked about last.  Clearly the adversary answered YES for \{u,v\}, but this contradicts the advesary’s strategy.  Since \{i,j\} has not been asked yet, the adversary is safe to answer NO for \{u,v\}, and still later by answering YES on \{i,j\}, she could force the graph to be connected.  Thus no such strategy exists, and connectivity is evasive.


October 16, 2008

Lecture 6: Borsuk-Ulam and some combinatorial applications

Filed under: CSE 599S, lecture, Math — Tags: , , , — James Lee @ 4:07 am

Now we’ll move away from spectral methods, and into a few lectures on topological methods.  Today we’ll look at the Borsuk-Ulam theorem, and see a stunning application to combinatorics, given by Lovász in the late 70′s.  A great reference for this material is Matousek’s book, from which I borrow heavily.  I’ll also discuss why the Lovász-Kneser theorem arises in theoretical CS.

The Borsuk-Ulam Theorem

We begin with a statement of the theorem.  We will think of the n-dimensional sphere as the subset of \mathbb R^{n+1} given by

S^n = \left\{ (x_1, \ldots, x_{n+1}) : x_1^2 + \cdots + x_{n+1}^2 = 1\right\}.

Borsuk-Ulam Theorem: For every continuous mapping f : S^n \to \mathbb R^n, there exists a point x \in S^n with f(x)=f(-x).

Pairs of points x,-x \in S^n are called antipodal.

There are a couple of common illustrative examples for the case n=2.  The theorem says that if you take the air out of a basketball, crumple it (no tearing), and flatten it out, then there are two points directly on top of each other which were antipodal before.  Another common example states that at every point in time, there must be two points on the earth which both have exactly the same temperature and barometric pressure (assuming, of course, that these two parameters vary continuously over the surface of the eath).

The n=1 case is completely elementary.  For the rest of the lecture, let’s use N = (0,0,\ldots,1) and S=(0,0,\ldots,-1) to denote the north and south poles (the dimension will be obvious from context).  To prove the n=1 case, simply trace out the path in \mathbb R starting at f(N) and going clockwise around S^1.  Simultaneously, trace out the path starting at f(S) and going counter-clockwise at the same speed.  It is easy to see that eventually these two paths have to collide:  At the point of collision, f(x)=f(-x).

We will give the sketch of a proof for n \geq 2.  Let g(x)=f(x)-f(-x), and note that our goal is to prove that g(x)=0 for some x \in S^n.  Note that g is antipodal in the sense that g(x)=-g(-x) for all x \in S^n.  Now, if g(x) \neq 0 for every x, then by compactness there exists an \epsilon > 0 such that \|g(x)\| > \epsilon for all x \in S^n.  Because of this, we may approximate g arbitrarily well by a smooth map, and prove that the approximation has a 0.  So we will assume that g itself is smooth.

Now, define h : S^n \to \mathbb R^n by h(x_1, \ldots, x_{n+1}) = (x_1, \ldots, x_n), i.e. the north/south projection map.  Let X = S^n \times [0,1] be a hollow cylinder, and let F : X \to \mathbb R^n be given by F(x,t)=t g(x) + (1-t)h(x) so that F linearly interpolates between g and h.

Also, let’s define an antipodality on X by \nu(x,t) = (-x,t).  Note that F is antipodal with respect to \nu, i.e. F(\nu(x,t))=-F(x,t), because both g and h are antipodal.  For the sake of contradiction, assume that g(x) \neq 0 for all x \in S^n.

Now let’s consider the structure of the zero set Z = F^{-1}(0).  Certainly (N,0), (S,0) \in Z since h(N)=h(S)=0, and these are h’s only zeros.  Here comes the sketchy part:  Since g is smooth, F is also smooth, and thus locally F can be approximated by an affine mapping F_{\mathrm{loc}} : \mathbb R^{n+1} \to \mathbb R^n.  It follows that if F_{\mathrm{loc}}^{-1}(0) is not empty, then it should be a subspace of dimension at least one.  By an arbitrarily small perturbation of the initial g, ensuring that F is sufficiently generic, we can ensure that F_{\mathrm{loc}}^{-1}(0) is either empty or a subspace of dimension one.  Thus locally, F^{-1}(0) should look like a two-sided curve, except at the boundaries t=0 and t=1, where F^{-1}(0) (if non-empty) would look like a one-sided curve.  But, for instance, F^{-1}(0) cannot contain any Y-shaped branches.

It follows that Z is a union of closed cycles and paths whose endpoints must lie at the boundaries t=0 and t=1.  (Z is represented by red lines in the cylinder above.)  But since there are only two zeros on the t=0 sphere, and none on the t=1 sphere, Z must contain a path \gamma from (N,0) to (S,0).  Since F is antipodal with respect to \nu, \gamma must also satisfy this symmetry, making it impossible for the segment initiating at N to ever meet up with the segment initiating at S.  Thus we arrive at a contradiction, implying that g must take the value 0.

Notice that the only important property we used about h (other than its smoothness) is that is has a number of zeros which is twice an odd number.  If h had, e.g. four zeros, then we could have two \nu-symmetric paths emanating from and returning to the bottom.  But if h has six zeros, then we would again reach a contradiction.


October 13, 2008

Lecture 5: Uniformizing graphs, multi-flows, and eigenvalues

In the previous lecture, we gave an upper bound on the second eigenvalue of the Laplacian of (bounded degree) planar graphs in order to analyze a simple spectral partitioning algorithm.  A natural question is whether these bounds extend to more general families of graphs.  Well-known generalizations of planar graphs are those which can be embedded on a surface of fixed genus, and, more generally, families of graphs that arise by forbidding minors.  In fact, Spielman and Teng conjectured that for any graph excluding K_h as a minor, one should have \lambda_2 \lesssim \mathrm{poly}(h) d_{\max}/n.   Of course planar graphs have genus 0, and by Wagner’s theorem, are precisely the graphs which exclude K_5 and K_{3,3} as minors.  In this lecture, we will follow an intrinsic approach of Biswal, myself, and Rao which, in particular, is able to resolve the conjecture of Spielman and Teng.  First, we see why even pushing the conformal approach to bounded genus graphs is difficult.

Bounded genus graphs

For graphs of bounded genus, there is hope to use an approach based on conformal mappings.  In 1980, Yang and Yau proved that

\displaystyle \lambda_2(M) \lesssim \frac{g+1}{\mathrm{vol}(M)}

for any compact Riemannian surface M of genus g.  (Note that for the Laplace-Beltrami operator, one usually writes \lambda_1 as the first non-zero eigenvalue, rather than \lambda_2.)  In analog with Hersch’s proof of the genus 0 case, they use Riemann-Roch to obtain a degree-(g+1) conformal mapping to the Riemann sphere, then try to pull back a second eigenfunction.  A factor of the degree is lost in the Rayleigh quotient (hence the g+1 factor in the preceding bound), and Hersch’s Möbius trick is still required.

An analogous proof for graphs G of bounded genus would proceed by constructing a circle packing of G on the sphere S^2, but instead of the circles having disjoint interiors, we would be assured that every point of S^2 is contained in at most g circles.  Unfortunately, such a result is impossible (this has to do with the handling of branch points in the discrete setting).  Kelner has to take a different approach in his proof that \lambda_2(G) \leq d_{\max}^{O(1)} (g+1)/n for graphs G of genus at most g.

He starts with a circle packing of G on a compact surface \mathbb S_0 of genus g (whose existence follows from results of Beardon and Stephenon and He and Schramm).  Then Kelner randomly subdivides G repeatedly, and these subdivisions give progressively better approximations to some sequence of surfaces \{\mathbb S_i\}.  Once the approximation is of high enough quality, one applies Riemann-Roch to \mathbb S_k, and infers something about a subdivision of G.  The final element is to track how the second eigenvalue of G changes (in expectation) under random subdivision.

Needless to say, this approach is already quite delicate, and for graphs that can’t be equipped with some kind of conformal structure, we seem to have reached a dead end.  In this lecture, we’ll see how to use intrinsic deformations of the geometry of G in order to bound its eigenvalues.  Eventually, this will reduce to the study of certain kinds of multi-commodity flows.

Metrics on graphs

Let G=(V,E) be an arbitrary n-vertex graph with maximum degree d_{\max}.  Recall that we can write

\displaystyle \lambda_2 = \min_{f \neq 0 : \sum_{x \in V} f(x)=0} \frac{\sum_{xy \in E} |f(x)-f(y)|^2}{\sum_{x \in V} f(x)^2}.

where f : V \to \mathbb R.  (Also recall that we can replace \mathbb R by any Hilbert space, and the same formula holds.)  The first step is to prepare this equality for “non-linearization” by getting rid of the linear condition \sum_{x \in V} f(x)=0 and the sum \sum_{x \in V} f(x)^2.  (This is a popular sort of passage in the non-linear geometry of Banach spaces, which also plays a rather important role in applications to the theoretical CS.)  The goal is to get only terms that look like |f(x)-f(y)|.  Fortunately, there is a well-known way to do this:

\displaystyle \lambda_2 = 2 n \cdot \min_{f : V \to \mathbb R} \frac{\sum_{xy \in E} |f(x)-f(y)|^2}{\sum_{x,y \in V} |f(x)-f(y)|^2},

which follows easily from the equality \sum_{x,y \in V} |f(x)-f(y)|^2 = 2n \sum_{x \in V} f(x)^2 when \sum_{x \in V} f(x)=0.

Thus if we want to bound \lambda_2 = O(1/n), we need to find an f : V \to \mathbb R for which the latter ratio (without the 2n) is O(1/n^2).  Now, for someone who works a lot with linear programming relaxations, it’s very natural to consider a “relaxation”

\displaystyle \gamma(G) = \min_{d} \frac{\sum_{xy \in E} d(x,y)^2}{\sum_{x,y \in V} d(x,y)^2},

where the minimization is over all pseudo-metrics d, i.e. symmetric non-negative functions d : V \times V \to \mathbb R which satisfy the triangle inequality, but might have d(x,y)=0 even for x \neq y.  Certainly \gamma(G) \leq \lambda_2/2n, but Bourgain’s embedding theorem (which states that every n-point metric space embeds into a Hilbert space with distortion at most O(\log n)) also assures us that \lambda_2(G) \leq O(n \log^2 n) \gamma(G).  Since we are trying to show that \gamma(G) = O(1/n^2), this O(\log^2 n) term is morally negligible.  One can see the paper for a more advanced embedding argument that doesn’t lose this factor, but for now we concentrate on proving that \gamma(G) = O(1/n^2).  The embedding theorems allow us to concentrate on finding an intrinsic metric on the graph with small “Rayleigh quotient,” without having to worry about an eventual geometric representation.

As a brief preview… we are going to find a good metric d by taking a certain kind of all-pairs multi-commodity flow at optimality, and weighting the edges by their congestion in the optimal flow.  Thus as the flow spreads out on the graph, it has the effect of “uniformizing” its geometry.

Discrete Riemannian metrics, convexification, and duality

Let’s now assume that G is planar.  We want to show that \gamma(G) = O(d_{\max}/n^2).  First, let’s restrict ourselves to vertex weighted metrics on G.  Given any non-negative weight function \omega : V \to \mathbb R, we can define the length of a path P in G by summing the weights of vertices along it:  \mathsf{len}_{\omega}(P) = \sum_{x \in P} \omega(x).  Then we can define a vertex-weighted shortest-path pseudo-metric on G in the natural way

\displaystyle \mathsf{dist}_{\omega}(x,y) = \min \left\{ \mathsf{len}_{\omega}(P) : P \in \mathcal P_{xy}\right\},

where \mathcal P_{uv} is the set of all u-v paths in G.  We also have the nice relationship

\displaystyle \sum_{xy \in E} \mathsf{dist}_{\omega}(x,y)^2 \leq 2 d_{\max} \sum_{x \in V} \omega(x)^2,\qquad(1)

since \mathsf{dist}_{\omega}(x,y) \leq [\omega(x)+\omega(y)]^2.

So if we define

\displaystyle \Lambda_0(\omega) = \frac{\displaystyle \sum_{x \in V} \omega(x)^2}{\displaystyle \sum_{x,y \in V} \mathsf{dist}_{\omega}(x,y)^2}

then by (1), we have \gamma(G) \leq 2 d_{\max} \min_{\omega} \Lambda_0(\omega).

Examples. Let’s try to exhibit weights \omega for two well-known examples:  the grid, and the complete binary tree.

For the \sqrt{n} \times \sqrt{n} grid, we can simply take \omega(x)=1 for all x \in V.  Clearly \sum_{x \in V} \omega(x)^2 = n.  On the other hand, a random pair of points in the grid is \Omega(\sqrt{n}) apart, hence \sum_{x,y \in V} \mathsf{dist}_{\omega}(x,y)^2 \approx n^2 \cdot (\sqrt{n})^2 = n^3.  It follows that \Lambda_0(\omega) = O(1/n^2), as desired.

For the complete binary tree with root r, we can simply put \omega(r)=1 and \omega(x)=0 for x \neq r.  (Astute readers will guess the geometrically decreasing weights are actually the optimal choice.)  In this case, \sum_{x \in V} \omega(x)^2 = 1, while all the pairs x,y on opposite sides of the root have \mathsf{dist}_{\omega}(x,y)=1.  It again follows that \Lambda_0(\omega) = O(1/n^2).  Our goal is to provide such a weight \omega for any planar graph.


October 8, 2008

Lecture 4: Conformal mappings, circle packings, and spectral geometry

In Lecture 2, we used spectral partitioning to rule out the existence of a strong parallel repetition theorem for unique games.  In practice, spectral methods are a very successful heuristic for graph partitioning, and in the present lecture we’ll see how to analyze these partitioning algorithms for some common families of graphs.

Balanced separators, eigenvalues, and Cheeger’s inequality

Lipton and Tarjan proved that every planar graph has a negligibly small set of nodes whose remval splits the graph into two roughly equal pieces.  More specifically, every n-node planar graph can be partitioned into three disjoint sets A,B,S such that there are no edges from A to B, the separator S has at most O(\sqrt{n}) nodes, and |A|,|B| \geq n/3.  This allows one to do all sorts of things, e.g. a simple divide-and-conquer algorithm gives a linear time (1+\epsilon)-approximation for the maximum independent set problem in such graphs, for any \epsilon > 0.

So there is a natural question of how well spectral methods do, for example, on planar graphs.  Spielman and Teng showed that for bounded-degree planar graphs, a simple recursive spectral algorithm recovers a partition V=A \cup B of the vertex set so that |E(A,B)| = O(\sqrt{n}).  In other words, for bounded-degree planar graphs, spectral methods recover the Lipton-Tarjan separator theorem!  This is proved by combining Cheeger’s inequality with their main theorem.

Theorem [Spielman-Teng]: Every n-node planar graph with maximum degree d_{\max} has \displaystyle \lambda_2(G) = O\left(\frac{d_{\max}}{n}\right), where \lambda_2(G) is the second eigenvalue of the combinatorial Laplacian on G.

Recall that we introduced the combinatorial Laplacian in Lecture 2.  If G=(V,E) is an arbitrary finite graph, in this lecture it will make more sense to think about the Laplacian \Delta as an operator on functions f : V \to \mathbb R given by

\displaystyle \Delta f(x) = \mathrm{deg}(x) f(x) - \sum_{y : xy \in E} f(y).

If we define the standard inner product \langle f,g\rangle = \sum_{x \in V} f(x)g(x), then one can easily check that for any such f, we have \langle f, \Delta f\rangle = \sum_{xy \in E} |f(x)-f(y)|^2.  In particular, this implies that \Delta is a positive semi-definite operator.  If we denote its eigenvalues by \lambda_1 \leq \lambda_2 \leq \cdots \leq \lambda_n, then it is also easy to check that \lambda_1 = 0, with corresponding eigenfunction f(x)=1 for every x\in V.

Thus by standard variational principles, we have

\displaystyle \lambda_2 = \min_{f \neq 0 : \sum_{x \in V} f(x)=0} \frac{\sum_{xy \in E} |f(x)-f(y)|^2}{\sum_{x \in V} f(x)^2}.

Let us also define the Cheeger constant h_G.  For an arbitrary subset S \subseteq V, let

\displaystyle h(S) = \frac{|E(S, \bar S)|}{\min(|S|,|\bar S|)},

note that this definition varies from the h we defined in Lecture 2, because we will be discussing eigenfunctions without boundary conditions.  Now one defines h_G = \min_{S \subseteq V} h(S).

Finally, we have the version of Cheeger’s inequality (proved by Alon and Milman in the discrete setting) for graphs without boundary.

Cheeger’s inequality: If G=(V,E) is any graph with maximum degree d_{\max}, then

\displaystyle \lambda_2(G) \geq \frac{h(G)^2}{2d_{\max}}.

This follows fairly easy from the Dirichlet version of Cheeger’s inequality presented in Lecture 2.  Here’s a sketch:  Let f : V \to \mathbb R satisfy \Delta f = \lambda_2 f, and suppose, without loss of generality, that V_+ = \{ x : f(x) > 0 \} has |V_+| \geq n/2.  Define f_+(x)=f(x) for f(x) > 0 and f_+(x)=0 otherwise.  Then f_+|_B = 0 for B = V \setminus V_+, so we can plug f_+ into the Dirichlet version of Cheeger’s inequality with boundary conditions on B.  For the full analysis, see this note which essentially follows this approach.  By examining the proof, note that one can find a subset S \subseteq V with h(S) \leq \sqrt{2 d_{\max} \lambda_2} by a simple “sweep” algorithm:  Arrange the vertices V = \{v_1, v_2, \ldots, v_n\} so that f(v_1) \leq f(v_2) \leq \cdots \leq f(v_n), and output the best of the n-1 cuts \{v_1, \ldots, v_i\}, \{v_{i+1}, \ldots, v_n\}.

So using the eigenvalue theorem of Spielman and Teng, along with Cheeger’s inequality, we can find a set S \subseteq V with h(S) \lesssim \sqrt{d_{\max}/n}.  While this cut has the right Cheeger constant, it is not necessarily balanced (i.e. \min(S, \bar S) could be very small).  But one can apply this algorithm recursively, perhaps continually cutting small chunks off of the graph until a balanced cut is collected.  Refer to the Spielman-Teng paper for details.  A great open question is how one might use spectral information about G to recover a balanced cut immediately, without the need for recursion.

Conformal mappings and circle packings

Now we focus on proving the bound \lambda_2(G) \lesssim d_{\max}/n for any planar graph G.  A natural analog is to look at what happens for the Laplace-Beltrami operator for a Riemannian metric on the 2-sphere.  In fact, Hersch considered this problem almost 40 years ago and proved that \lambda_2(M) \lesssim 1/\mathrm{vol}(M), for any such Riemannian manifold M.  His approach was to first use the uniformization theorem to get a conformal mapping from M onto S^2, and then try to pull-back the standard second eigenfunctions on S^2 \subseteq \mathbb R^3 (which are just the three coordinate projections).  Since the Dirichlet energy is conformally invariant in dimension 2, this almost works, except that the pulled-back map might not be orthogonal to the constant function.  To fix this, he has to post-process the initial conformal mapping with an appropriate Möbius transformation.

Unaware of Hersch’s work, Spielman and Teng derived eigenvalue bounds for planar graphs using the discrete analog of this approach:  Circle packings replace conformal mappings, and one still has to show the existence of an appropriate post-processing Möbius transformation.


Older Posts »

The Shocking Blue Green Theme Blog at


Get every new post delivered to your Inbox.

Join 59 other followers