In the last post, we saw how minimizing the relative entropy allowed us to obtain a “simple” feasible point of a particular polytope. We continue exploring this theme by giving a related proof of Chang’s Lemma, a Fourier-analytic result that arose in additive combinatorics. Our proof will reveal another aspect of entropy optimality: a connection to “projected sub-gradient descent” and a corresponding convergence analysis.
1.1. Discrete Fourier analysis and Chang’s Lemma
Fix $n \geq 1$ and a prime $q$. Let $\mathbb{F}_q$ denote the finite field with $q$ elements. We will work over $\mathbb{F}_q^n$. For convenience, denote $N = q^n$. Let $\mu$ denote the uniform measure on $\mathbb{F}_q^n$, and for a non-negative function $f : \mathbb{F}_q^n \to [0,\infty)$, define the entropy

$\mathrm{Ent}_\mu(f) = \mathbb{E}_\mu\left[ f \log \frac{f}{\mathbb{E}_\mu f} \right],$

where we use the shorthand $\mathbb{E}_\mu[g] = \sum_{x \in \mathbb{F}_q^n} \mu(x)\, g(x)$. If $\mathbb{E}_\mu[f] = 1$, we say that $f$ is a density (with respect to $\mu$). In that case,

$\mathrm{Ent}_\mu(f) = D(f\mu \,\|\, \mu),$

where $D(\cdot \,\|\, \cdot)$ denotes the relative entropy (introduced in the last post).
We briefly review some notions from discrete Fourier analysis. One could skip this without much loss; our proof will use only the boundedness of the Fourier characters and no other features. For functions $f, g : \mathbb{F}_q^n \to \mathbb{C}$, we will use the inner product

$\langle f, g \rangle = \mathbb{E}_\mu[f \bar{g}]$

and the norms

$\|f\|_p = \left(\mathbb{E}_\mu |f|^p\right)^{1/p}$

for $p \geq 1$.
For an element $\alpha \in \mathbb{F}_q^n$, we use

$\chi_\alpha(x) = e^{2\pi i \langle \alpha, x \rangle / q}$

to denote the corresponding Fourier character, where $\langle \alpha, x \rangle = \sum_{i=1}^n \alpha_i x_i$. Now every $f : \mathbb{F}_q^n \to \mathbb{C}$ can be written in the Fourier basis as

$f = \sum_{\alpha \in \mathbb{F}_q^n} \hat{f}(\alpha)\, \chi_\alpha,$

where $\hat{f}(\alpha) = \langle f, \chi_\alpha \rangle$. (For simplicity of notation, we have implicitly identified $\mathbb{F}_q^n$ and its dual group $\widehat{\mathbb{F}_q^n}$.)
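To make the definitions concrete, here is a minimal numerical sketch in Python with NumPy (the helpers chi and fourier_coeff are my own names, not from the post) that computes $\hat{f}(\alpha) = \mathbb{E}_\mu[f\, \overline{\chi_\alpha}]$ by brute force over $\mathbb{F}_q^n$:

```python
import itertools
import numpy as np

q, n = 3, 2  # a small example: F_3^2, so q**n = 9 points
points = list(itertools.product(range(q), repeat=n))

def chi(alpha, x):
    """The Fourier character chi_alpha(x) = exp(2*pi*i*<alpha, x>/q)."""
    return np.exp(2j * np.pi * (np.dot(alpha, x) % q) / q)

def fourier_coeff(f, alpha):
    """hat{f}(alpha) = E_mu[f * conj(chi_alpha)], mu uniform on F_q^n."""
    return np.mean([f(x) * np.conj(chi(alpha, x)) for x in points])

# Sanity check for a density f (i.e., E_mu[f] = 1): hat{f}(0) = 1,
# and |hat{f}(alpha)| <= max|f| for every alpha, since |chi_alpha| = 1.
A = {(0, 0), (1, 2), (2, 1)}
f = lambda x: (1.0 if x in A else 0.0) * (q**n / len(A))  # 1_A / mu(A)
print(fourier_coeff(f, (0, 0)))  # approximately (1+0j)
```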
We will be interested in the large Fourier coefficients of a function. To that end, for $\eta > 0$, we define the set

$\Delta_\eta(f) = \left\{ \alpha \in \mathbb{F}_q^n : |\hat{f}(\alpha)| \geq \eta \right\}.$

Now we are ready to state Chang’s Lemma; it asserts that if $\mathrm{Ent}_\mu(f)$ is small, then all of $f$‘s large Fourier coefficients are contained in a low-dimensional subspace.
Lemma 1 If $f : \mathbb{F}_q^n \to [0,\infty)$ is a density, then for every $\eta > 0$ the set $\Delta_\eta(f)$ is contained in an $\mathbb{F}_q$-subspace of dimension at most $d$, where

$d \leq O\left(\frac{\mathrm{Ent}_\mu(f)}{\eta^2}\right).$

In the special case $q = 2$, the proof will actually give a better constant (see the remark at the end of Section 1.4).
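As a quick sanity check on the form of the bound (a standard instantiation, not spelled out in the post): if $f = \mathbf{1}_A / \mu(A)$ is the normalized indicator of a subset $A \subseteq \mathbb{F}_q^n$, then

$\mathrm{Ent}_\mu(f) = \mathbb{E}_\mu[f \log f] = \log \frac{1}{\mu(A)},$

so Lemma 1 says that the $\eta$-large spectrum of $A$ spans at most $O\!\left(\eta^{-2} \log \frac{1}{\mu(A)}\right)$ dimensions, which is the usual formulation of Chang’s Lemma.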
1.2. A slightly stronger statement
In fact, we will prove the following slightly stronger statement.
Theorem 2 If $f : \mathbb{F}_q^n \to [0,\infty)$ is a density, then for every $\eta > 0$ there exists a density $g : \mathbb{F}_q^n \to [0,\infty)$ such that

$\Delta_\eta(f) \subseteq \Delta_{\eta/2}(g),$

and moreover

$g = \Psi\left(\chi_{\alpha_1}, \ldots, \chi_{\alpha_k}\right)$

for some $k \leq O\left(\frac{\mathrm{Ent}_\mu(f)}{\eta^2}\right)$ and $\alpha_1, \ldots, \alpha_k \in \mathbb{F}_q^n$ and some function $\Psi : \mathbb{C}^k \to \mathbb{R}$. In other words, $g$ is a function of at most $k$ characters.
It is an exercise in linear algebra to see that for such a density $g$, the set of non-zero Fourier coefficients $\{\alpha : \hat{g}(\alpha) \neq 0\}$ is contained in the span of $\alpha_1, \ldots, \alpha_k$. Since $\Delta_\eta(f) \subseteq \Delta_{\eta/2}(g) \subseteq \{\alpha : \hat{g}(\alpha) \neq 0\}$, this yields Lemma 1.
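To spell out the exercise (a sketch of one way to see it): each $\chi_{\alpha_i}$ takes values among the $q$-th roots of unity, so on this finite set of values $\Psi$ agrees with a polynomial in the variables $\chi_{\alpha_1}, \ldots, \chi_{\alpha_k}$ and their conjugates $\overline{\chi_{\alpha_i}} = \chi_{-\alpha_i}$. Every monomial collapses to a single character, since

$\chi_{\alpha_1}^{c_1} \chi_{\alpha_2}^{c_2} \cdots \chi_{\alpha_k}^{c_k} = \chi_{c_1 \alpha_1 + c_2 \alpha_2 + \cdots + c_k \alpha_k},$

so $g$ is a linear combination of characters $\chi_\beta$ with $\beta \in \mathrm{span}(\alpha_1, \ldots, \alpha_k)$, and these $\beta$ are the only places where $\hat{g}$ can be non-zero.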
1.3. Entropy optimality
Let $L^2(\mathbb{F}_q^n, \mu)$ denote the Hilbert space of real-valued functions on $\mathbb{F}_q^n$ equipped with the inner product $\langle f, g \rangle = \mathbb{E}_\mu[fg]$. Assume now that $f : \mathbb{F}_q^n \to [0,\infty)$ is a density and fix $\eta > 0$.
Denote by $\mathcal{F}$ the family of functions of the form $\pm\,\mathrm{Re}\, \chi_\alpha$ or $\pm\,\mathrm{Im}\, \chi_\alpha$ for some $\alpha \in \mathbb{F}_q^n$.
Let us associate to $f$ and $\eta$ the convex polytope $P = P(f,\eta)$ of functions $g \in L^2(\mathbb{F}_q^n, \mu)$ satisfying

$\langle g - f, \varphi \rangle \leq \frac{\eta}{4}$

for all $\varphi \in \mathcal{F}$.

Just as in the matrix scaling setting, $P$ encapsulates the family of “tests” that we care about (which Fourier coefficient magnitudes are large). Note that if $g \in P$, then $\Delta_\eta(f) \subseteq \Delta_{\eta/2}(g)$: since $\mathcal{F}$ contains $\pm\,\mathrm{Re}\,\chi_\alpha$ and $\pm\,\mathrm{Im}\,\chi_\alpha$, membership in $P$ bounds the real and imaginary parts separately, giving $|\hat{g}(\alpha) - \hat{f}(\alpha)| \leq \frac{\sqrt{2}}{4}\,\eta < \frac{\eta}{2}$ for every $\alpha$. We also let $\mathcal{D}$ denote the probability (density) simplex: the set of all $g : \mathbb{F}_q^n \to [0,\infty)$ such that $\mathbb{E}_\mu[g] = 1$.
Of course, $P(f,\eta) \cap \mathcal{D}$ is non-empty because $f \in P(f,\eta) \cap \mathcal{D}$. As before, we will try to find a “simple” representative from $P(f,\eta) \cap \mathcal{D}$ by doing entropy maximization (which is the same thing as relative entropy minimization):

$\text{minimize} \quad \mathrm{Ent}_\mu(g) \quad \text{subject to} \quad g \in P(f,\eta) \cap \mathcal{D}.$
1.4. Sub-gradient descent
Soon we will explore this optimization using convex duality. But in this post, let us give an algorithm that tries to locate a feasible point of $P(f,\eta) \cap \mathcal{D}$ by a rather naive form of (sub-)gradient descent.
We start with the initial point $g_0 \equiv \mathbf{1}$, the function that is identically $1$ on all of $\mathbb{F}_q^n$. Note that by strict convexity, this is the unique maximizer of the entropy (equivalently, the unique minimizer of $\mathrm{Ent}_\mu$) among all $g \in \mathcal{D}$. But clearly it may be that $g_0 \notin P(f,\eta)$.
So suppose we have a function $g_t \in \mathcal{D}$ for some $t \geq 0$. If $g_t \in P(f,\eta)$, we are done. Otherwise, there exists a violated constraint: some $\varphi_t \in \mathcal{F}$ such that $\langle g_t - f, \varphi_t \rangle > \frac{\eta}{4}$. Our goal is now to update $g_t$ to a density $g_{t+1} \in \mathcal{D}$ that doesn't violate this constraint quite as much.
Notice the dichotomy being drawn here between the “soft” constraints of $P(f,\eta)$ (which we try to satisfy slowly) and the “hard” constraints of $\mathcal{D}$ that we are enforcing at every step. (This procedure and the analysis that follows are a special case of the mirror descent framework; see, for example, Section 4 of Bubeck’s monograph. In particular, there is a more principled way to design the algorithm that follows.)
We might try to write $g_{t+1} = g_t - \varepsilon\, \varphi_t$ for some small $\varepsilon > 0$, but then $g_{t+1}$ could take negative values! Instead, we will follow the approximation $1 - \varepsilon\, \varphi_t \approx e^{-\varepsilon \varphi_t}$ (for $\varepsilon$ small) and make the update

$g_{t+1} = \frac{g_t\, e^{-\varepsilon \varphi_t}}{\mathbb{E}_\mu\left[g_t\, e^{-\varepsilon \varphi_t}\right]}.$

Observe that the exponential weight update keeps $g_{t+1}$ positive, and our explicit renormalization ensures that $\mathbb{E}_\mu[g_{t+1}] = 1$. Thus $g_{t+1} \in \mathcal{D}$.
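To make the procedure concrete, here is a minimal sketch of the whole loop in Python/NumPy (my own rendering under the notation above, not code from the post; the margin $\eta/4$ and the brute-force search for a violated constraint follow the discussion above, while the step size eps is left as a parameter, anticipating Lemma 3 below):

```python
import itertools
import numpy as np

def entropy_descent(f_vals, q, n, eta, eps, max_steps=100000):
    """Sketch of the sub-gradient descent: maintain a density g w.r.t. the
    uniform measure mu on F_q^n (f_vals is an array of length q**n with
    mean 1); while some test phi in F (the +/- real and imaginary parts of
    the characters) has <g - f, phi> > eta/4, apply the exponential update
    g <- g * exp(-eps * phi) and renormalize."""
    N = q**n
    points = np.array(list(itertools.product(range(q), repeat=n)))
    # chars[j, i] = chi_{alpha_j}(x_i), brute force over all alpha and x.
    chars = np.exp(2j * np.pi * ((points @ points.T) % q) / q)
    tests = np.concatenate([chars.real, -chars.real, chars.imag, -chars.imag])
    g = np.ones(N)  # g_0 = 1: the entropy maximizer in the simplex D
    for t in range(max_steps):
        slacks = tests @ (g - f_vals) / N   # <g - f, phi> = E_mu[(g - f) phi]
        j = int(np.argmax(slacks))
        if slacks[j] <= eta / 4:            # no violated constraint: g in P(f, eta)
            return g, t
        g = g * np.exp(-eps * tests[j])     # exponential weight update (g stays > 0)
        g = g / g.mean()                    # renormalize so that E_mu[g] = 1
    return g, max_steps
```

Each iteration touches one new test $\varphi_t$, so the number of characters appearing in the final $g$ is at most the number of steps, exactly as the proof of Theorem 2 requires.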
Of course, this whole procedure might seem fishy. Sure, we have improved the constraint corresponding to $\varphi_t$, but maybe we messed up some earlier constraints that we had satisfied. While this may happen, it cannot go on for too long. In the next post, we will prove the following.
Lemma 3 For an appropriate choice of step size $\varepsilon > 0$, it holds that $g_T \in P(f,\eta)$ for some

$T \leq O\left(\frac{\mathrm{Ent}_\mu(f)}{\eta^2}\right).$
Notice that this lemma proves Theorem 2 and thus finishes our proof of Chang’s Lemma. That’s because, by virtue of $g_T \in P(f,\eta)$, we know that $\Delta_\eta(f) \subseteq \Delta_{\eta/2}(g_T)$. Moreover, unrolling the update from $g_0 \equiv \mathbf{1}$ shows that

$g_T = \frac{\exp\left(-\varepsilon \sum_{t < T} \varphi_t\right)}{\mathbb{E}_\mu\left[\exp\left(-\varepsilon \sum_{t < T} \varphi_t\right)\right]},$

and each $\varphi_t$ is (up to sign) the real or imaginary part of some character $\chi_{\alpha_t}$. Thus $g_T$ is built out of at most $T$ characters $\chi_{\alpha_1}, \ldots, \chi_{\alpha_T}$, and it satisfies the conclusion of Theorem 2.
For the case $q = 2$, the characters are real-valued, so we do not have to handle the real and imaginary parts separately, and one saves the corresponding constant factor.
1.5. Bibliographic notes
This argument bears some similarity to the proof of Impagliazzo, Moore, and Russell. It was discovered in a joint work with Chan, Raghavendra, and Steurer on extended formulations. In both those arguments, the “descent” is implicit and is done by hand: the characters are carefully chosen not to interact. It turns out that this level of care is unnecessary. The analysis in the next post will allow the descent to choose any violated constraint at any time, and moreover will use only the fact that the characters are bounded.