A bit more on expanders…

So the lecture notes I posted a year ago giving a “simpler proof” of expansion for the Margulis-Gabber-Galil graphs turns out to be very similar to arguments of M. Berger (1991) and Y. Shalom (1999) for proving that SL_2(\mathbb Z) \ltimes \mathbb Z^2  has property (T). See Section 4.2 of the book of Bekka, de la Harpe, and Valette.

Anyone interested in expander graphs should take a look at Amir Yehudayoff’s recent article in the SIGACT newsletter: Proving expansion in three steps. Amir uses a ping-pong lemma argument (which is essentially the same as the “combinatorial lemma” in the preceding post) to exemplify the “opening” of the three part strategy of Bourgain and Gamburd to proving expansion in Cayley graphs.

Gabber-Galil analysis of Margulis’ expanders

I’m currently teaching a course on spectral graph theory, and I decided to lecture on Margulis’ expander construction and the analysis of its spectral gap by Gabber and Galil. I had never realized how intuitive the analysis could be; the lectures notes I had seen didn’t quite reflect this. In particular, we won’t do any unsightly calculations. Everything I’m about to say is probably well-known, but I thought I would write it down anyway.

The idea is to first start with an initial “expanding object,” and then try to construct a family of graphs out of it. First, consider the infinite graph {G} with vertex set {\mathbb Z^2} . The edges are given by two maps {S} and {T} , where {S(x,y) = (x,x+y)} and {T(x,y) = (x+y,y)} . So the edges are {\{(x,y), S(x,y)\}} and {\{(x,y), T(x,y)\}} . Clearly {(0,0)} is not adjacent to anything. Except for the origin, this graph is an expander in the following sense.

Lemma 1 For any subset {A \subseteq \mathbb Z^2 \setminus \{0\}} , we have

\displaystyle  |E(A)| \geq \frac{1}{3} |A|\,,

where {E(A)} denotes the edges leaving {A} .

The following simple proof is inspired by a paper of Linial and London.

Proof: First consider a subset {A} that does not intersect the coordinate axes. Let {Q_1, Q_2, Q_3, Q_4} represent the four quadrants of {\mathbb Z^2} , and let {A_i = A \cap Q_i} . Consider that {S(A_1), T(A_1) \subseteq Q_1} and {S(A_1) \cap T(A_1) = \emptyset} . The latter fact follows because if {(x,x+y)=(x'+y',y')} then {x'=-y} . Similarly, {S^{-1}(A_2), T^{-1}(A_2) \subseteq Q_2} and {S(A_3), T(A_3) \subseteq Q_3} and {S^{-1}(A_4), T^{-1}(A_4) \subseteq Q_4} , while all the respective pairs {S^{-1}(A_2), T^{-1}(A_2)} and {S(A_3), T(A_3)} and {S^{-1}(A_4), T^{-1}(A_4)} are disjoint.

Combining this with the fact that {S} and {T} are bijections on {\mathbb Z^2} immediately shows that {|S(A) \cup S^{-1}(A) \cup T(A) \cup T^{-1}(A)|} is at least

\displaystyle  |S(A_1)| + |T(A_1)| + |S^{-1}(A_2)| + |T^{-1}(A_2)|

\displaystyle  + |S(A_3)| + |T(A_3)| + |S^{-1}(A_4)| + |T^{-1}(A_4)|

\displaystyle  = 2(|A_1| + |A_2| + |A_3| + |A_4|)

\displaystyle  = 2 |A|\,.

Dealing with the coordinate axes is not too hard. Suppose now that {A \subseteq \mathbb Z^2 \setminus \{0\}} is arbitrary. Let {A_{+} = A \cap \{ (x,0), (0,x) : x \in \mathbb Z \}} . The preceding argument shows {|E(A)| \geq |A\setminus A_+|-|A_+| = |A|-2|A_+|} . We will show that {|E(A)| \geq |A_+|} . Averaging these two inequalities with weights {1/3} and {2/3} completes the proof.

Let {A_\times = A \cap \{ (x,x), (x,-x) : x \in \mathbb Z \}} , and {A_\circ = A \setminus (A_1 \cup A_2)} . Then we have the bounds:

\displaystyle |E(A_+, \bar A)| \geq |A_+| - |A_\times|

\displaystyle |E(A_\times, \bar A)| \geq 2|A_\times| - |A_\circ|

\displaystyle |E(A_\circ, \bar A)| \geq |A_\circ| - |A_\times|\,.

The first equation follows since each element of {A_+} is connected to exactly two elements of {A_{\times}} , and each element of {A_{\times}} is connected to exactly two elements of {A_+} . For instance, {(x,0)} is connected to {(x,x)} and {(x,-x)} , while {(x,x)} is connected to {(x,0)} and {(0,x)} .

The second follows because every point of {A_\times} is connected to two unique points of {A_\circ} , e.g. {(x,x)} is connected to {(x,2x)} and {(2x,x)} . The final inequality follows from the fact that {|E(A_\circ)| \geq |A_\circ|} from our first argument (since {A_{\circ}} does not touch the axes), and because {A_\circ} has no edges to {A_+} . Summing these three inequalities yields {|E(A)| \geq |A_+|} . \Box

Of course, {\mathbb Z^2} is not a finite graph, so for a parameter {n} , we define the graph {G_n=(V_n,E_n)} with vertex set {V_n = \mathbb Z_n \oplus \mathbb Z_n} , where {\mathbb Z_n = \mathbb Z/(n \mathbb Z)} . There are four types of edges in {E_n} : A vertex {(x,y)} is connected to the vertices

\displaystyle \{(x,y+1), (x+1, y), (x,x+y), (x+y,y)\}\,.

This yields a graph of degree at most 8.

Theorem 2 There is a constant {c > 0} such that for every {n \in \mathbb N} ,

\displaystyle  \lambda_2(G_n) \geq c\,,

where {\lambda_2} is the second-smallest eigenvalue of the Laplacian on {G_n} .

Of course, the graphs {\{G_n\}} now have torsion, and thus our expansion result for {\mathbb Z^2} is not, a priori, very useful. The non-trivial idea of the proof is that the torsion doesn’t matter, making Lemma 1 applicable. This is not difficult to show using some Fourier analysis. It turns out to be better though, if we first move to the continuous torus.

Recall that,

\displaystyle  \lambda_2(G_n) = \min_{f : V_n \rightarrow \mathbb R} \left\{\frac{\sum_{uv \in E_n} (f(u)-f(v))^2}{\sum_{u \in V_n} f(u)^2} : \sum_{u \in V_n} f(u) = 0\right\}\,,

Let {\mathbb T^2 = \mathbb R^2/\mathbb Z^2} be the 2-dimensional torus equipped with the Lebesgue measure, and consider the complex Hilbert space

\displaystyle  L^2(\mathbb T^2) = \left\{ f : \mathbb T^2 \rightarrow \mathbb C : \int_{\mathbb T^2} |f|^2 < \infty \right\}\,.

equipped with the inner product {\langle f,g\rangle_{L^2} = \int_{\mathbb T^2} f \bar g} .

We might also define a related value,

\displaystyle   \lambda_2(\mathbb T^2) = \min_{f \in L^2(\mathbb T^2)} \left\{\frac{\|f-f\circ S\|_{L^2}^2 + \|f-f \circ T\|_{L^2}^2}{\|f\|_{L^2}^2} : \int_{\mathbb T^2} f = 0\right\}\,. \ \ \ \ \ (1)

Note that this is just defined as a number; the eigenvalue notation is merely suggestive, but we have not introduced an operator on the torus.

Claim 1 There is some {\varepsilon > 0} such that for any {n \in \mathbb N} , we have {\lambda_2(G_n) \geq \varepsilon \lambda_2(\mathbb T^2)} \

Proof: We simply sketch a proof, which is rather intuitive. Suppose we are given some map {f : V_n \rightarrow \mathbb R} such that {\sum_{u \in V_n} f(u)=0} . Define its continuous extension {\tilde f : \mathbb T^2 \rightarrow \mathbb R} as follows: Under the natural embedding of {\mathbb Z_n \oplus \mathbb Z_n} into {\mathbb T^2} , every point {z \in \mathbb T^2} sits inside a grid square with four corners {u_1,u_2,u_3,u_4 \in \mathbb Z_n \oplus \mathbb Z_n} . Writing a local convex combination {z = \lambda_1 u_1 + \lambda_2 u_2 + \lambda_3 u_3 + \lambda_4 u_4} , we define

\displaystyle  \tilde f(z) = \lambda_1 f(u_1) + \lambda_2 f(u_2) + \lambda_3 f(u_3) + \lambda_4 f(u_4)\,.

Now it is elementary to verify that {\int_{\mathbb T^2} \tilde f = 0} . It is also easy to verify that {\|\tilde f\|^2_{L^2} \geq c\frac{1}{n} \sum_{v \in V} f(v)^2} for some {c > 0} . (Even if {f(u_1)+f(u_2)+f(u_3)+f(u_4)=0} , we still get a contribution from {f(u_1)^2 + f(u_2)^2 + f(u_3)^2 + f(u_4)^2} to {\|\tilde f\|_{L^2}} on this square because we are taking a weighted average.)

Finally, for any {z \in \mathbb T^2} , there is a path of length {O(1)} in {G_n} connecting each of the corners of {z} ‘s square to the corners of {S(z)} ‘s square. A similar fact holds for {T(z)} . In fact, this is the only place where we need to use the fact that edges of the form {(x,y) \leftrightarrow (x,y+1)} and {(x,y) \leftrightarrow (x+1,y)} are present in {G_n} . Thus any contribution {|\tilde f(z)-\tilde f(S(z))|^2} to {\|\tilde f-\tilde f\circ S\|_{L^2}^2} can be charged against a term in {\sum_{uv \in E_n} (f(u)-f(v))^2} , and similarly for {T} . \Box

Now our goal is to show that {\lambda_2(\mathbb T^2) > 0} . We will use the Fourier transform to “remove the torsion.” The point is that {S} and {T} , being shift operators, will act rather nicely on the Fourier basis.

We recall that if {m,n \in \mathbb N} and we define {\chi_{m,n} \in L^2(\mathbb T^2)} by {\chi_{m,n}(x,y) = \exp(2\pi i(mx+ny))} , then {\{\chi_{m,n}\}} forms an orthonormal Hilbert basis for {L^2(\mathbb T^2)} . In particular, every {f \in L^2(\mathbb T^2)} can be written uniquely as

\displaystyle  f = \sum_{m,n \in \mathbb Z} \hat f_{m,n} \chi_{m,n}\,,

where {\hat f_{m,n} = \langle f, \chi_{m,n}\rangle_{L^2}} . The Fourier transform is defined as a map {\mathcal F~:~L^2(\mathbb T^2) \rightarrow \ell^2(\mathbb Z^2)} , where {\mathcal F f(m,n) = \hat f_{m,n}} . In particular, {\mathcal F} is a linear isometry.

Define {S^*(f) = f \circ S} and {T^*(f) = f \circ T} . Then for any {m,n \in \mathbb Z} , we have

\displaystyle  S^*(\chi_{m,n}) = \chi_{m+n,n} \quad\textrm{ and }\quad T^*(\chi_{m,n}) = \chi_{m,n+m}.

In particular, for any {f \in L^2(\mathbb T^2)} , we have {\widehat{S^*(f)} = \sum_{m,n} \hat f_{m-n,n} \chi_{m,n}} and {\widehat{T^*(f)} = \sum_{m,n} \hat f_{m,n-m} \chi_{m,n}} . The final thing to note is that {\hat f(0,0) = \langle f, \chi_{0,0} \rangle = \int_{\mathbb T^2} f} . So now if we simply apply the Fourier transform (a linear isometry) to the expression in (1), we get a reformulation that {\lambda_2(\mathbb T^2)} is precisely

\displaystyle  \min_{g \in \ell^2(\mathbb Z^2)} \left\{ \frac{\sum_{z \in \mathbb Z^2} |g(z)-g(S(z))|^2 + |g(z)-g(T(z))|^2}{\|g\|^2_{\ell^2}}: g(0,0)=0 \right\}\,.

Here we have simply replaced {f} by {\mathcal F f} in (1), and then written {g = \mathcal F f} .

But now recall our initial infinite graph {G} on {\mathbb Z^2} . If we denote by {L_G} the Laplacian on {G} , then we can rewrite this as,

\displaystyle  \lambda_2(\mathbb T^2) = \min_{g \in \ell^2(\mathbb Z^2)} \left\{ \frac{\langle g, L_G g\rangle_{\ell^2}}{\|g\|^2_{\ell^2}}: g(0,0)=0 \right\}\,.

In other words, it is precisely the first Dirichlet eigenvalue on {G} , subject to the boundary condition {g(0,0)=0} .

But now the discrete Cheeger inequality tells us that

\displaystyle  \lambda_2(\mathbb T^2) \geq \frac1{2 d_{\max}} h^2,

where {h} is the minimal expansion of any set not containing the origin. Thus we have indeed unwrapped the torsion and returned to our initial question. Lemma 1 shows that {h \geq 1/3} , yielding the desired lower bound on {\lambda_2(\mathbb T^2)} .

The pseudorandom subspace problem

In this post, I will talk about the existence and construction of subspaces of \ell_1^N which are “almost Euclidean.” In the next few, I’ll discuss the relationship between such subspaces, compressed sensing, and error-correcting codes over the reals.

A good starting point is Dvoretzky’s theorem:

For every N \in \mathbb N and \varepsilon > 0, the following holds. Let |\cdot| be the Euclidean norm on \mathbb R^N, and let \|\cdot\| be an arbitrary norm. Then there exists a subspace X \subseteq \mathbb R^N with \mathrm{dim}(X) \geq c(\varepsilon) \log N, and a number A > 0 so that for every x\in X,

\displaystyle A|x| \leq \|x\| \leq (1+\varepsilon) A |x|.

Here, c(\varepsilon) > 0 is a constant that depends only on \varepsilon.

The theorem as stated is due to Vitali Milman, who gave the (optimal) dependence of \log N on the dimension of X, and proved that the above fact actually holds with high probability when X is chosen uniformly at random from all subspaces of this dimension. In other words, for d \approx \log N, a random d-dimensional subspace of an arbitrary N-dimensional normed space is almost Euclidean. Milman’s proof was one of the starting points for the “local theory” of Banach spaces, which views Banach spaces through the lens of sequences of finite-dimensional subspaces whose dimension goes to infinity. It’s a very TCS-like philosophy; see the book of Milman and Schechtman.

One can associate to any norm its unit ball B = \left\{ x \in \mathbb R^N : \|x\| \leq 1 \right\}, which will be a centrally symmetric convex body. Thus we can restate Milman’s proof of Dvoretzky’s theorem geometrically: If B is an arbitrary convex body, and H is a random c(\varepsilon) \log N-dimensional hyperplane through the origin, then with high probability, B \cap H will almost be a Euclidean ball. This is quite surprising if one starts, for instance, with a polytope like the unit cube or the cross polytope. The intersection B \cap H is again a polytope, but which is approximated closely by a smooth ball. Here’s a pictorial representation lifted from a talk I once gave:

(Note: Strictly speaking, the intersection will only be an ellipsoid, and one might have to pass to a subsection to actually get a Euclidean sphere.)

Euclidean subspaces of \ell_1^N and Error-Correcting Codes.

It is not too difficult to see that the dependence \mathrm{dim}(X) \approx \log N is tight when we consider \mathbb R^N equipped with the \ell_\infty norm (see e.g. Claim 3.10 in these notes), but rather remarkably, results of Kasin and Figiel, Lindenstrauss, and Milman show that if we consider the \ell_1 norm, we can find Euclidean subspaces of proportional dimension, i.e. \mathrm{dim}(X) \approx N.

For any x \in \mathbb R^N, Cauchy-Schwarz gives us

\|x\|_2 \leq \|x\|_1 \leq \sqrt{N} \|x\|_2.

To this end, for a subspace X \subseteq \mathbb R^N, let’s define

\displaystyle \Delta(X) = \max_{0 \neq x \in \mathbb R^N} \frac{\sqrt{N} \|x\|_2}{\|x\|_1},

which measures how well-spread the \ell_2-mass is among the coordinates of vectors x \in X. For instance, if \Delta(X) = O(1), then every non-zero x \in \mathbb X has at least \Omega(N) non-zero coordinates (because \|x\|_1 \leq \sqrt{|\mathrm{supp}(x)|} \|x\|_2). This has an obvious analogy with the setting of linear error-correcting codes, where one would like the same property from the kernel of the check matrix. Over the next few posts, I’ll show how such subspaces actually give rise to error-correcting codes over \mathbb R that always have efficient decoding algorithms.

The Kasin and Figiel-Lindenstrauss-Milman results actually describe two different regimes. Kasin shows that for every \eta > 0, there exists a subspace X \subseteq \mathbb R^N with \dim(X) \geq (1-\eta) N, and \Delta(X) = O_{\eta}(1). The FLM result shows that for every \varepsilon > 0, one can get \Delta(X) \leq 1 + \varepsilon and \dim(X) \geq \Omega_{\varepsilon}(N). In coding terminology, the former approach maximizes the rate of the code, while the latter maximizes the minimum distance. In this post, we will be more interested in Kasin’s “nearly full dimensional” setting, since this has the most relevance to sensing and coding. Work of Indyk (see also this) shows that the “nearly isometric” setting is useful for applications in high-dimensional nearest-neighbor search.

The search for explicit constructions

Both approaches show that the bounds hold for a random subspace of the proper dimension (where “random” can mean different things;we’ll see one example below). In light of this, many authors have asked whether there are explicit constructions of such subspaces exist, see e.g. Johnson and Schechtman (Section 2.2), Milman (Problem 8), and Szarek (Section 4). As I’ll discuss in future posts, this would also yield explicit constructions of codes over the reals, and of compressed sensing matrices.

Depending on the circles you run in, “explicit” can mean different things. Let’s fix a TCS-style definition: An explicit construction means a deterministic algorithm which, give N as input, outputs a description of a subspace X (e.g. in the form of a set of basis vectors), and runs in time \mathrm{poly}(N). Fortunately, the constructions I’ll discuss later will satisfy just about everyone’s sense of explicitness.

In light of the difficulty of obtaining explicit constructions, people have started looking at weaker results and partial derandomizations. One starting point is Kasin’s method of proof: He shows, for instance, that if one chooses a uniformly random N/2 \times N sign matrix A (i.e. one whose entries are chosen independently and uniformly from \{-1,1\}), then with high probability, one has \Delta(\mathrm{ker}(A)) = O(1). Of course this bears a strong resemblance to random parity check codes. Clearly we also get \dim(\mathrm{ker}(A)) \geq N/2.

In work of Guruswami, myself, and Razborov, we show that there exist explicit N/2 \times N sign matrices A for which \Delta(\mathrm{ker}(A)) \leq (\log N)^{O(\log \log \log N)} (recall that the trivial bound is \Delta(\cdot) \leq \sqrt{N}). Our approach is inspired by the construction and analysis of expander codes. Preceding our work, Artstein-Avidan and Milman pursued a different direction, by asking whether one can reduce the dependence on the number of random bits from the trivial O(N^2) bound (and still achieve \Delta(\mathrm{ker}(A)) = O(1)). Using random walks on expander graphs, they showed that one only needs O(N \log N) random bits to construct such a subspace. Lovett and Sodin later reduced this dependency to O(N). (Indyk’s approach based on Nisan’s pseudorandom generator can also be used to get O(N \log^2 N).) In upcoming work with Guruswami and Wigderson, we show that the dependence can be reduced to O(N^{\delta}) for any \delta > 0.

Kernels of random sign matrices

It makes sense to end this discussion with an analysis of the random case. We will prove that if A is a random N/2 \times N sign matrix (assume for simplicity that N is even), then with high probability \Delta(\mathrm{ker}(A)) = O(1). It might help first to recall why a uniformly random N/2 \times N matrix B with \{0,1\} entries is almost surely the check matrix of a good linear code.

Let \mathbb F_2 = \{0,1\} be the finite field of order 2. We would like to show that, with high probability, there does not exist a non-zero vector x \in \mathbb F_2^N with hamming weight o(N), and which satisfies Bx=0 (this equation is considered over \mathbb F_2). The proof is simple: Let B_1, B_2, \ldots, B_{N/2} be the rows of B. If x \neq 0, then \Pr[\langle B_i, x \rangle = 0] \leq \frac12. Therefore, for any non-zero x, we have

\Pr[Bx = 0] \leq 2^{-N/2}. (**)

On the other hand, for \delta > 0 small enough, there are fewer than 2^{N/2} non-zero vectors of hamming weight at most \delta N, so a union bound finishes the argument.

The proof that \Delta(\mathrm{ker}(A)) = O(1) proceeds along similar lines, except that we can no longer take a naive union bound since there are now infinitely many “bad” vectors. Of course the solution is to take a sufficiently fine discretization of the set of bad vectors. This proceeds in three steps:

  1. If x is far from \mathrm{ker}(A), then any x' with \|x-x'\|_2 small is also far from \mathrm{ker}(A).
  2. Any fixed vector x \in \mathbb R^N with \|x\|_2 = 1 is far from \mathrm{ker}(A) with very high probability. This is the analog of (**) in the coding setting.
  3. There exists a small set of vectors that well-approximates the set of bad vectors.

Combining (1), (2), and (3) we will conclude that every bad vector is far from the kernel of A (and, in particular, not contained in the kernel).

To verify (1), we need to show that almost surely the operator norm \|A\| = \max \left\{ \|Ax\|_2 : \|x\|_2 = 1\right\} is small, because if we define

\mathrm{dist}(x, \mathrm{ker}(A)) = \min_{y \in \mathrm{ker}(A)} \|x-y\|_2,


\mathrm{dist}(x', \mathrm{ker}(A)) \geq \mathrm{dist}(x,\mathrm{ker}(A)) - \|A\| \cdot \|x-x'\|_2

In order to proceed, we first need a tail bound on random Bernoulli sums, which follows immediately from Hoeffding’s inequality.

Lemma 1: If v \in \mathbb R^k and X \in \{-1,1\}^k is chosen uniformly at random, then

\displaystyle \Pr\left[|\langle v, X \rangle| \geq t\right] \leq 2 \exp\left(\frac{-t^2}{2 \|v\|_2^2}\right)

In particular, for y \in \mathbb R^{N/2} and x \in \mathbb R^N, we have \langle y, Ax \rangle = \sum_{i=1}^{N/2} \sum_{j=1}^N A_{ij} y_i x_j. We conclude that if we take \|x\|_2 = 1 and \|y\|_2 = 1, then

\Pr\left[ |\langle y, Ax \rangle| \geq t \sqrt{2N}\right] \leq 2 \exp\left(-t^2 N\right), (4)

observing that \sum_{i,j} y_i^2 x_j^2 = 1, and applying Lemma 1. Now we can prove that \|A\| is usually not too big.

Theorem 1: If A is a random N/2 \times N sign matrix, then

\Pr[\|A\| \geq 10 \sqrt{N}] \leq 2 e^{-6N}.


For convenience, let m = N/2, and let \Gamma_m and \Gamma_N be (1/3)-nets on the unit spheres of \mathbb R^m and \mathbb R^N, respectively. In other words, for every x \in \mathbb R^m with \|x\|_2 = 1, there should exist a point x' \in \Gamma_m for which \|x-x'\|_2 \leq 1/3, and similarly for \Gamma_N. It is well-known that one can choose such nets with |\Gamma_m| \leq 7^m and |\Gamma_N| \leq 7^N (see, e.g. the book of Milman and Schechtman mentioned earlier).

So applying (4) and a union bound, we have

\displaystyle \Pr\left[\exists y \in \Gamma_m, x \in \Gamma_N\,\, |\langle y, Ax \rangle| \geq 3\sqrt{2N}\right] \leq 2 |\Gamma_m| |\Gamma_N| \exp(-9N) \leq 2 \exp(-6N).

So let’s assume that no such x \in \Gamma_N and y \in \Gamma_m exist. Let u \in \mathbb R^N and v \in \mathbb R^m be arbitrary vectors with \|u\|_2=1,\|v\|_2=1, and write u = \sum_{i \geq 0} \alpha_i x_i and v = \sum_{i \geq 0} \beta_i y_i, where \{x_i\} \subseteq \Gamma_N and \{y_i\} \subseteq \Gamma_m, where |\alpha_i|, |\beta_i| \leq 3^{-i} (by choosing successive approximations). Then we have,

|\langle v, A u\rangle| \leq \sum_{i,j \geq 0} 3^{-i-j} |\langle y_i, A x_i\rangle| \leq (3/2)^2 3\sqrt{2N} \leq 10 \sqrt{N}.

We conclude by nothing that \|A\| = \max \left\{ |\langle u, Av \rangle| : \|u\|_2 = 1, \|v\|_2=1 \right\}.

This concludes the verification of (1). In the next post, we’ll see how (2) and (3) are proved.