I have recently started reading some seminal generative modeling papers thanks to the recommendation of Mehdy Bennani (PhD student at Université de Caen Normandie).
Since this is a new research area for me, I am committed to studying these papers thoroughly to develop a solid understanding of the subject.
However, as is the case with any area in Machine Learning that has evolved progressively over the years, some equations, theorems, or concepts are presented with the assumption that the reader is already familiar with them.
A new reader may accept them as having an inherent value of truth and move on, or they may ask “Why” and start going down the rabbit hole.
While trying to escape the rabbit hole, I’ll share some of the eye-opening things I encountered along the way, and explain them in a way that makes sense to me
(so that I’ll be able to remember them when I read these posts in the future).
In this post, I’ll start with the “Generative Modeling by Estimating Gradients of the Data Distribution” paper by Song & Ermon (2019) [1].
As indicated in the paper’s abstract, this work presents a “generative model where samples are produced via Langevin dynamics using gradients of the data distribution estimated with score matching.”
Rather than explaining the paper itself, I’ll limit myself to describing what I didn’t know and/or what caught my attention.
(Stein) Score
Let’s start with some notation.
We’re given a dataset $\{x_i \in \mathbb{R}^D\}_{i=1}^N$ whose samples come from an unknown data distribution $p_{\text{data}}(x)$.
Now, consider a general distribution $p(x)$.
If this distribution were parametrized by a set of parameters $\beta$, we would express it as $p(x;\beta)$.
We could try to use Maximum Likelihood Estimation (MLE) to find the parameters $\beta$ so that $p(x;\beta)$ approximates $p_{\text{data}}(x)$.
The challenge is that, in many settings, we know the un-normalized distribution $\tilde{p}(x;\beta)$ but not $p(x;\beta)$.
To calculate the latter, we need to normalize $\tilde{p}(x;\beta)$ as follows:
$$p(x;\beta) = \frac{\tilde{p}(x;\beta)}{\int_{\mathcal{X}} \tilde{p}(x;\beta)\,dx} = \frac{\tilde{p}(x;\beta)}{Z_\beta}.$$
Note that $Z_\beta$ may be simple to calculate when dealing with simple probability functions (e.g., Gaussian); however, its calculation
could be intractable when $\tilde{p}(x;\beta)$ is complex or high-dimensional, or when its integral doesn’t have a closed-form solution.
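To make that point concrete, here is a minimal sketch (in Python/NumPy, with an unnormalized 1D Gaussian I picked purely for illustration) of how $Z_\beta$ can be approximated by brute-force numerical integration, and why this strategy breaks down as the dimension grows:

```python
import numpy as np

# Unnormalized 1D density: p_tilde(x; beta) = exp(-beta * x^2)
# (a Gaussian written without its normalizing constant).
beta = 0.5

def p_tilde(x):
    return np.exp(-beta * x ** 2)

# Z_beta via brute-force numerical integration on a grid. This is fine in 1D,
# but a grid with k points per axis needs k**D evaluations in R^D, which is
# why the normalizing constant quickly becomes intractable in high dimensions.
x, dx = np.linspace(-10.0, 10.0, 20_001, retstep=True)
Z_beta = np.sum(p_tilde(x)) * dx

p = p_tilde(x) / Z_beta      # normalized density p(x; beta) on the grid
print(Z_beta)                # ~ sqrt(pi / beta) ≈ 2.5066
print(np.sum(p) * dx)        # ~ 1.0, i.e. the density now integrates to one
```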
Instead of doing that, let’s apply the log operation to the previous equation:
$$\log p(x;\beta) = \log \tilde{p}(x;\beta) - \log Z_\beta.$$
Since $Z_\beta$ results from integrating over all possible values of $x$, it’s independent of $x$.
Thus, if we take the gradient w.r.t. $x$, we obtain:
$$\nabla_x \log p(x;\beta) = \nabla_x \log \tilde{p}(x;\beta) - 0.$$
By doing so, we removed the influence of the normalizing constant.
$\nabla_x \log p(x)$ is known as the Stein score (or simply, “score”) function.
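As a quick sanity check of this “no normalizing constant needed” property, here is a small sketch (my own toy example with a 1D Gaussian; the finite-difference scheme is just a stand-in for the gradient) showing that the score computed from the unnormalized log-density matches the one computed from the normalized log-density:

```python
import numpy as np

# Score of a 1D Gaussian N(mu, sigma^2): d/dx log p(x) = -(x - mu) / sigma^2.
# The same value is obtained from the unnormalized log-density, since the
# normalizing constant disappears under the gradient.
mu, sigma = 1.0, 2.0

def log_p(x):        # normalized log-density
    return -0.5 * ((x - mu) / sigma) ** 2 - np.log(sigma * np.sqrt(2 * np.pi))

def log_p_tilde(x):  # unnormalized log-density (constant term dropped)
    return -0.5 * ((x - mu) / sigma) ** 2

x = np.linspace(-5.0, 5.0, 1_001)
eps = 1e-4           # central finite differences in place of the exact gradient
score_from_p = (log_p(x + eps) - log_p(x - eps)) / (2 * eps)
score_from_p_tilde = (log_p_tilde(x + eps) - log_p_tilde(x - eps)) / (2 * eps)

print(np.allclose(score_from_p, score_from_p_tilde))                   # True: Z plays no role
print(np.allclose(score_from_p, -(x - mu) / sigma ** 2, atol=1e-6))    # True: matches the closed form
```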
Score Matching
Let $s_\theta$ represent a score network; i.e., a neural network parameterized by $\theta$.
Instead of training a model to estimate $p_{\text{data}}(x)$ directly, the score network is trained to estimate the score of $p_{\text{data}}$; i.e.,
$$s_\theta(x) \approx \nabla_x \log p_{\text{data}}(x).$$
This task is known as score matching.
In score matching, the objective is to minimize the difference:
$$\min_\theta \; \frac{1}{2}\,\mathbb{E}_{p_{\text{data}}}\!\left[\lVert s_\theta(x) - \nabla_x \log p_{\text{data}}(x) \rVert_2^2\right],$$
which, according to Eq. 1 in Song & Ermon (2019) [1], is equivalent to:
$$\mathbb{E}_{p_{\text{data}}}\!\left[\operatorname{tr}\big(\nabla_x s_\theta(x)\big) + \frac{1}{2}\lVert s_\theta(x)\rVert_2^2\right],$$
where $\nabla_x s_\theta(x)$ denotes the Jacobian of $s_\theta(x)$.
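To make the objective more tangible, below is a rough sketch of how Eq. 1 could be estimated on a batch with PyTorch autograd. The tiny `score_net` architecture, the dimensionality `D`, and the random batch are my own illustrative choices, not the setup used in the paper:

```python
import torch
import torch.nn as nn

D = 2  # data dimensionality (kept tiny so the exact trace below stays cheap)

# A small score network s_theta: R^D -> R^D. Purely illustrative architecture.
score_net = nn.Sequential(nn.Linear(D, 64), nn.Tanh(), nn.Linear(64, D))

def score_matching_loss(x):
    """Monte Carlo estimate of E[ tr(grad_x s_theta(x)) + 0.5 * ||s_theta(x)||^2 ]."""
    x = x.clone().requires_grad_(True)
    s = score_net(x)                                # shape (batch, D)
    norm_term = 0.5 * (s ** 2).sum(dim=1)           # 0.5 * ||s_theta(x)||_2^2
    # tr(grad_x s_theta(x)) = sum_i d s_i / d x_i: one autograd pass per dimension.
    trace = torch.zeros(x.shape[0])
    for i in range(D):
        grad_i = torch.autograd.grad(s[:, i].sum(), x, create_graph=True)[0]
        trace = trace + grad_i[:, i]
    return (trace + norm_term).mean()

x_batch = torch.randn(128, D)     # stand-in for a batch of samples from p_data
loss = score_matching_loss(x_batch)
loss.backward()                   # gradients w.r.t. theta, ready for an optimizer step
print(loss.item())
```

Note that computing the trace of the Jacobian this way takes one backward pass per input dimension, which is part of the motivation for the denoising and sliced variants of score matching discussed in the paper.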
This equivalence between the two optimization problems was not immediately clear to me, so I started reading more about it.
It turns out that the proof can be found in Appendix 1 of the
“Estimation of Non-Normalized Statistical Models by Score Matching” paper by Aapo Hyvärinen (2005) [2].
Nevertheless, that proof didn’t seem intuitive enough, so I decided to work through it in a way I could understand:
Expanding the squared norm inside the expectation gives

$$\frac{1}{2}\,\mathbb{E}_{p_{\text{data}}}\!\left[\lVert s_\theta(x)\rVert_2^2\right] - \mathbb{E}_{p_{\text{data}}}\!\left[s_\theta(x)^\top \nabla_x \log p_{\text{data}}(x)\right] + \frac{1}{2}\,\mathbb{E}_{p_{\text{data}}}\!\left[\lVert \nabla_x \log p_{\text{data}}(x)\rVert_2^2\right].$$

From this, notice that the last term does not depend on $\theta$; thus, it can be ignored during optimization.
The new optimization problem can be written as:

$$\min_\theta \; \mathbb{E}_{p_{\text{data}}}\!\left[\frac{1}{2}\lVert s_\theta(x)\rVert_2^2\right] - \underbrace{\mathbb{E}_{p_{\text{data}}}\!\left[s_\theta(x)^\top \nabla_x \log p_{\text{data}}(x)\right]}_{M}.$$

Let’s focus on the second term, $M$.
Recall that applying the gradient to the log of a function can be expanded as:
$$\nabla_x \log f(x) = \frac{\nabla_x f(x)}{f(x)} \quad\Longrightarrow\quad f(x)\,\nabla_x \log f(x) = \nabla_x f(x).$$
Then, we can rewrite $M$ as:
$$M = \int_{\mathbb{R}^D} s_\theta(x) \cdot \nabla_x p_{\text{data}}(x)\,dx.$$
Integration by Parts:
If you have $y = uv$ and apply the derivative w.r.t. $x$ ($u$ and $v$ are functions of $x$), you obtain:
$$\frac{dy}{dx} = u\frac{dv}{dx} + v\frac{du}{dx}.$$
Then, isolating $u\frac{dv}{dx}$ and integrating both sides of the equation, we have:
$$\int u\frac{dv}{dx}\,dx = \int \frac{d(uv)}{dx}\,dx - \int v\frac{du}{dx}\,dx,$$
$$\int u v'\,dx = uv - \int v u'\,dx.$$
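Since I like to double-check identities numerically, here is a tiny sketch with arbitrarily chosen $u(x) = \sin x$ and $v(x) = x^2$ on $[0, 1]$ (a crude Riemann sum is enough for the purpose):

```python
import numpy as np

# Sanity check of integration by parts with u(x) = sin(x), v(x) = x^2 on [0, 1]:
# int u v' dx should equal [u v] evaluated at the endpoints minus int v u' dx.
x, dx = np.linspace(0.0, 1.0, 100_001, retstep=True)
u, du = np.sin(x), np.cos(x)   # u and its derivative u'
v, dv = x ** 2, 2 * x          # v and its derivative v'

lhs = np.sum(u * dv) * dx                                     # int u v' dx
rhs = (u[-1] * v[-1] - u[0] * v[0]) - np.sum(v * du) * dx     # [u v] - int v u' dx
print(lhs, rhs)   # both ~ 0.6023, agreeing up to discretization error
```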
Now, $M$ looks like the left-hand side of the integration-by-parts formula above, doesn’t it?
To make this clearer, consider that $u = s_\theta(x)$ and $v = p_{\text{data}}(x)$.
With that, let’s repeat the integration-by-parts process I explained before, but with the new expressions for $u$ and $v$.
But first, note that, since $s_\theta(x): \mathbb{R}^D \to \mathbb{R}^D$ and $p_{\text{data}}(x): \mathbb{R}^D \to \mathbb{R}$,
we can treat them as a vector field and a scalar field, respectively.
These definitions will help us to apply the gradient operator correctly.
In particular, recall that if the gradient operator is applied (as a dot product) to a vector field $r(x) = (r_1(x), \ldots, r_D(x))$, we express it as:
$$\nabla_x \cdot r(x) = \sum_{i=1}^{D} \frac{\partial r_i(x)}{\partial x_i}.$$
For the sake of notation, let’s call $\nabla_x \cdot$ the divergence operator.
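As a small illustration, the sketch below (my own toy example) approximates $\nabla_x \cdot r(x)$ with central finite differences for the identity field $r(x) = x$, whose divergence is exactly $D$:

```python
import numpy as np

# Central finite-difference approximation of the divergence of a vector field
# r: R^D -> R^D, i.e. sum_i d r_i / d x_i evaluated at a point x.
def divergence(r, x, eps=1e-5):
    div = 0.0
    for i in range(x.shape[0]):
        e = np.zeros_like(x)
        e[i] = eps
        div += (r(x + e)[i] - r(x - e)[i]) / (2 * eps)
    return div

# For the identity field r(x) = x, each d r_i / d x_i = 1, so the divergence is D.
r = lambda x: x
print(divergence(r, np.random.randn(4)))   # ~ 4.0
```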
Now, applying the divergence operator w.r.t. $x$ to $s_\theta(x)\,p_{\text{data}}(x)$ (the product is also a vector field) and applying the product rule, we have:

$$\nabla_x \cdot \big(s_\theta(x)\,p_{\text{data}}(x)\big) = p_{\text{data}}(x)\,\big(\nabla_x \cdot s_\theta(x)\big) + s_\theta(x) \cdot \nabla_x p_{\text{data}}(x).$$

Integrating both sides over $\mathbb{R}^D$ and rearranging, $M$ becomes

$$M = \int_{\mathbb{R}^D} \nabla_x \cdot \big(s_\theta(x)\,p_{\text{data}}(x)\big)\,dx - \int_{\mathbb{R}^D} p_{\text{data}}(x)\,\big(\nabla_x \cdot s_\theta(x)\big)\,dx.$$
The first term on the right side of the equation, $\int_{\mathbb{R}^D} \nabla_x \cdot \big(s_\theta(x)\,p_{\text{data}}(x)\big)\,dx$,
has a particular structure in vector calculus.
So, before moving on, let’s talk about the divergence theorem:
Divergence Theorem: The flux of a vector field through a closed surface is equal to the volume integral of the
divergence of the field over the region enclosed by the surface:
$$\int_V (\nabla \cdot F)\,dV = \int_S F \cdot n\,dS,$$
where:
$V$ is the volume in $\mathbb{R}^D$,
$S$ is the boundary (or surface) of $V$,
$F$ is a continuously differentiable vector field,
$n$ is the outward-pointing unit normal vector on $S$.
It may seem unclear why I bring this theorem up here, considering we’re not dealing with surfaces or volumes.
However, let’s pretend we are for a minute.
The product $s_\theta(x)\,p_{\text{data}}(x)$ is a vector field whose flux
passes through a surface enclosing a volume that is given by $\mathbb{R}^D$.
In this context, according to the divergence theorem, the sum of the divergences inside $\mathbb{R}^D$ (i.e., $\int_{\mathbb{R}^D} \nabla_x \cdot \big(s_\theta(x)\,p_{\text{data}}(x)\big)\,dx$)
is equal to the outward flux that crosses the surface of the volume $\mathbb{R}^D$.
Since the “volume” $\mathbb{R}^D$ is infinite, its surface is located at the boundary where $\lVert x \rVert \to \infty$.
Then (this is one of the regularity assumptions made in [2]):
$$\lim_{\lVert x \rVert \to \infty} s_\theta(x)\,p_{\text{data}}(x) = 0.$$
In other words, the outward flux as $\lVert x \rVert \to \infty$ is negligible, which implies that all divergences cancel inside $\mathbb{R}^D$ and

$$\int_{\mathbb{R}^D} \nabla_x \cdot \big(s_\theta(x)\,p_{\text{data}}(x)\big)\,dx = 0.$$

Therefore, $M = -\int_{\mathbb{R}^D} p_{\text{data}}(x)\,\big(\nabla_x \cdot s_\theta(x)\big)\,dx = -\mathbb{E}_{p_{\text{data}}}\!\left[\operatorname{tr}\big(\nabla_x s_\theta(x)\big)\right]$. Substituting $M$ back into the optimization problem recovers Eq. 1 of Song & Ermon (2019) [1]:

$$\mathbb{E}_{p_{\text{data}}}\!\left[\operatorname{tr}\big(\nabla_x s_\theta(x)\big) + \frac{1}{2}\lVert s_\theta(x)\rVert_2^2\right].$$
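Finally, since the whole derivation hinges on the two objectives differing only by a $\theta$-independent constant, here is a small Monte Carlo check (1D standard normal data and a linear “score model” $s_\theta(x) = ax + b$, both chosen purely for illustration) that the gap between them stays constant as the parameters change:

```python
import numpy as np

# 1D check that the explicit objective 0.5 * E[(s(x) - d/dx log p(x))^2] and the
# implicit one E[s'(x) + 0.5 * s(x)^2] differ only by a theta-independent constant.
# Here p_data = N(0, 1) (so d/dx log p(x) = -x) and s_theta(x) = a*x + b.
rng = np.random.default_rng(0)
x = rng.standard_normal(2_000_000)    # samples from p_data
true_score = -x

for a, b in [(-0.5, 0.3), (0.2, -1.0), (-1.0, 0.0)]:
    s = a * x + b
    ds = a                                              # ds/dx, the 1D "trace of the Jacobian"
    explicit = 0.5 * np.mean((s - true_score) ** 2)     # needs the (usually unknown) true score
    implicit = np.mean(ds + 0.5 * s ** 2)               # Eq. 1 of Song & Ermon (2019)
    print(round(explicit - implicit, 2))                # ~ 0.5 for every (a, b): a constant offset
```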
[1] Y. Song and S. Ermon, “Generative modeling by estimating gradients of the data distribution,” in Proceedings of the 33rd International Conference on Neural Information Processing Systems (2019), pp. 11918–11930.
[2] A. Hyvärinen, “Estimation of non-normalized statistical models by score matching,” Journal of Machine Learning Research (2005), vol. 6, pp. 695–709.