The generic chaining

In the last post, we considered a Gaussian process {\{X_t\}_{t \in T}} and were trying to find upper bounds on the quantity {\mathop{\mathbb E}\sup_{t \in T} X_t}. We saw that one could hope to improve over the union bound by clustering the points and then taking mini union bounds in each cluster.

Hierarchical clustering

To specify a clustering, we’ll take a sequence of progressively finer approximations to our set {T}. First, recall that we fixed {t_0 \in T}, and we have used the observation that {\mathop{\mathbb E}\sup_{t \in T} X_t = \mathop{\mathbb E}\sup_{t \in T} (X_t-X_{t_0})}.

Now, assume that {T} is finite. Write {T_0 = \{t_0\}}, and consider a sequence of subsets {\{T_n\}} such that {T_0 \subseteq T_1 \subseteq T_2 \subseteq \cdots \subseteq T}. We will assume that for some large enough {m}, we have {T_n = T} for {n \geq m}. For every {n \geq 0}, let {\pi_n : T \rightarrow T_n} denote a “closest point map” which sends t \in T to the closest point in T_n.

The main point is that we can now write, for any {t \in T},

\displaystyle   X_t - X_{t_0} = \sum_{n \geq 1} X_{\pi_n(t)} - X_{\pi_{n-1}(t)}. \ \ \ \ \ (1)

This decomposition is where the term “chaining” arises, and now the idea is to bound the probability that {X_t - X_{t_0}} is large in terms of the segments in the chain.

What should {T_n} look like?

One question that arises is how we should think about choosing the approximations {T_n}. We are trading off two measures of quality: The denser {T_n} is in the set {T} (or, more precisely, in the set {T_{n-1}}) the smaller the variances of the segments {X_{\pi_n(t)}-X_{\pi_{n-1}(t)}} will be. On the other hand, the larger {T_n} is, the more segments we’ll have to take a union bound over.

So far, we haven’t used any property of our random variables except for the fact that they are centered. To make a more informed decision about how to choose the sets {\{T_n\}}, let’s recall the classical Gaussian concentration bound.

Lemma 1 For every {s,t \in T} and {\lambda > 0},

\displaystyle   \mathop{\mathbb P}(X_s - X_t > \lambda) \leq \exp\left(-\frac{\lambda^2}{2\, d(s,t)^2}\right). \ \ \ \ \ (2)

This should look familiar: {X_s-X_t} is a mean-zero Gaussian with variance {d(s,t)^2}.

Now, a first instinct might be to choose the sets {T_n} to be progressively denser in {T}. In this case, a natural choice would be to insist on something like {T_n} being a {2^{-n}}-net in {T}. If one continues down this path in the right way, a similar theory would develop. We’re going to take a different route and consider the other side of the tradeoff.

Instead of insisting that {T_n} has a certain level of accuracy, we’ll insist that {T_n} is at most a certain size. Should we require {|T_n| \leq n} or {|T_n| \leq 2^n}, or use some other function? To figure out the right bound, we look at (2). Suppose that {g_1, g_2, \ldots, g_m} are i.i.d. {N(0,1)} random variables. In that case, applying (2) and a union bound, we see that to achieve

\displaystyle  \mathop{\mathbb P}(\exists i : g_i > B) \leq m \mathop{\mathbb P}(g_1 > B) < 1,

we need to select {B \asymp \sqrt{\log m}}. If we look instead at {m^2} points instead of {m} points, the bound grows to {\sqrt{2 \log m}}. Thus we can generally square the number of points before the union bound has to pay a constant factor increase. This suggests that the right scaling is something like {|T_{n+1}| = |T_n|^2}. So we’ll require that {|T_n| \leq 2^{2^n}} for all {n \geq 1}.

The generic chaining

This leads us to the generic chaining bound, due to Fernique (though the formulation we state here is from Talagrand).

Theorem 2 Let {\{X_t\}_{t \in T}} be a Gaussian process, and let {T_0 \subseteq T_1 \subseteq \cdots \subseteq T} be a sequence of subsets such that {|T_0|=1} and {|T_n| \leq 2^{2^{n}}} for {n \geq 1}. Then,

\displaystyle   \mathop{\mathbb E}\sup_{t \in T} X_t \leq O(1) \sup_{t \in T} \sum_{n \geq 0} 2^{n/2} d(t, T_n). \ \ \ \ \ (3)

Proof: As before, let {\pi_n : T \rightarrow T_n} denote the closest point map and let {T_0 = \{t_0\}}. Using (2), for any {n \geq 1}, {t \in T}, and {u > 0}, we have

\displaystyle  \mathop{\mathbb P}\left(|X_{\pi_n(t)} - X_{\pi_{n-1}(t)}| > u 2^{n/2} d(\pi_n(t),\pi_{n-1}(t))\right) \leq \exp\left(-\frac{u^2}{2} 2^n\right).

Now, the number of pairs {(\pi_n(t),\pi_{n-1}(t))} can be bounded by {|T_n| \cdot |T_{n-1}| \leq 2^{2^{n+1}}}, so we have

\displaystyle   \mathop{\mathbb P}\left(\exists t : |X_{\pi_n(t)} - X_{\pi_{n-1}(t)}| > u 2^{n/2} d(\pi_n(t),\pi_{n-1}(t))\right) \leq 2^{2^{n+1}} \exp\left(-\frac{u^2}{2} 2^n\right). \ \ \ \ \ (4)

If we define the event

\displaystyle  \Omega_u = \left\{ \forall n \geq 1, t \in T : |X_{\pi_n(t)} - X_{\pi_{n-1}(t)}| \leq u 2^{n/2} d(\pi_n(t),\pi_{n-1}(t))\right\},

then summing (4) yields,

\displaystyle   \mathop{\mathbb P}(\overline{\Omega_u}) \leq \sum_{n \geq 1} 2^{2^{n+1}} \exp\left(-\frac{u^2}{2} 2^n\right) \leq O(1)\, e^{-u^2} \ \ \ \ \ (5)

for {u \geq 4}, since we get geometrically decreasing summands.


\displaystyle  S = \sup_{t \in T} \sum_{n \geq 1} 2^{n/2} d(\pi_n(t), \pi_{n-1}(t)).

Note that if {\Omega_u} occurs, then {\sup_{t \in T} (X_t - X_{t_0}) \leq uS}. Thus (5) implies that for u \geq 4,

\displaystyle  \mathop{\mathbb P}(\sup_{t \in T} X_t - X_{t_0} > uS) \leq O(1) \, e^{-u^2},

which implies that

\displaystyle   \mathop{\mathbb E} \sup_{t \in T} X_t \leq O(S) \leq O(1) \sup_{t \in T} \sum_{n \geq 1} 2^{n/2} d(\pi_n(t), \pi_{n-1}(t)). \ \ \ \ \ (6)

Finally, by the triangle inequality,

\displaystyle d(\pi_n(t), \pi_{n-1}(t)) \leq d(t, T_n) + d(t, T_{n-1}) \leq 2\,d(t,T_{n-1}).

Plugging this into (6) recovers (3). \Box

Theorem 1.2 gives us a fairly natural way to upper bound the expected supremum using a hierarchical clustering of {T}. Rather amazingly, as we’ll see in the next post, this upper bound is tight. Talagrand’s majorizing measure theorem states that if we take the best choice of {\{T_n\}} in Theorem 1.2, then the upper bound in (3) is within a constant factor of {\mathop{\mathbb E} \sup_{t \in T} X_t}.

5 thoughts on “The generic chaining

  1. Hi James! I don’t see an earlier post about Gaussian processes, did you forget to post it?

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s