DDPM - Denoising Diffusion Probabilistic Models

last updated: 2026-01-01

Still a work in progress.

Forward Diffusion Process

Given an image data point $x_0$ sampled from the image dataset distribution, $x_0 \sim q(x)$, let us gradually add Gaussian noise in a series of $T$ time steps, producing a sequence of noisy images $(x_1, \dots, x_T)$ where $x_i$ is the image after the first $i$ steps.

Pasted image 20251017031402.png

More formally, this process can be described as follows: our image of $d$ pixels can be flattened to a vector in $\mathbb{R}^d$ and, for type-checking purposes, its pixel intensities can be normalized to $[-1, 1]$ (instead of $[0, 255]$). Each transition step is a conditional probability distribution $q(x_t \mid x_{t-1}) : \mathbb{R}^d \to \mathbb{R}_+$, giving us the probability density of the image $x_t$ given the previous time step's $x_{t-1}$. We call this process Markovian because it satisfies the Markov property: each step only depends on the previous step (more formally, $q(x_t \mid x_{0:t-1}) = q(x_t \mid x_{t-1})$).

$$x_t \sim q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t;\ \mu_t = \sqrt{1-\beta_t}\, x_{t-1},\ \Sigma_t = \beta_t I\right)$$
To clarify some notation:
  • Retrieving image $x_t$ means sampling from a Gaussian distribution denoted by $\mathcal{N}(\text{mean}, \text{covariance})$
    • Mean: $\mu_t = \sqrt{1-\beta_t}\, x_{t-1}$ for each pixel
    • Covariance: $\Sigma_t = \beta_t I$ (where $I \in \mathbb{R}^{d \times d}$ is the identity matrix)
      • What this means is that each individual pixel (with variance $\Sigma_{p,p} = \beta_t$) is distributed independently of the others, since the off-diagonal entries $\Sigma_{p,q} = 0$.
  • Using the reparameterization trick from VAEs, this means $x = \mu + \sigma \epsilon$ with $\epsilon \sim \mathcal{N}(0, I)$, $\epsilon \in \mathbb{R}^d$ (see the sketch right after this list)
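
As a small concrete sketch of that reparameterization step (the function name `forward_step` and the specific $\beta_t$ value here are illustrative assumptions, not taken from the paper):

```python
import numpy as np

def forward_step(x_prev: np.ndarray, beta_t: float, rng=np.random.default_rng()) -> np.ndarray:
    """Sample x_t ~ q(x_t | x_{t-1}) = N(sqrt(1 - beta_t) * x_{t-1}, beta_t * I)
    via the reparameterization trick x = mu + sigma * eps."""
    eps = rng.standard_normal(x_prev.shape)                    # eps ~ N(0, I)
    return np.sqrt(1.0 - beta_t) * x_prev + np.sqrt(beta_t) * eps

# Toy usage: a flattened "image" of d pixels normalized to [-1, 1]
x_prev = np.random.uniform(-1.0, 1.0, size=64)
x_t = forward_step(x_prev, beta_t=0.02)                        # beta_t chosen arbitrarily here
```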

$\beta_t \in [0, 1]$ is a constant given by our noise schedule $(\beta_1, \beta_2, \dots, \beta_T)$, specifying the variance (noise intensity) added at each time step. One interesting question is why we include the $\sqrt{1-\beta_t}$ coefficient on the mean. By the reparameterization trick, $x_t = \sqrt{1-\beta_t}\, x_{t-1} + \sqrt{\beta_t}\, \epsilon$, and because $x_{t-1}$ and $\epsilon$ are sampled from independent Gaussians we can state

$$\mathrm{var}(x_t) = \mathrm{var}\!\left(\sqrt{1-\beta_t}\, x_{t-1}\right) + \mathrm{var}\!\left(\sqrt{\beta_t}\, \epsilon\right) = (1-\beta_t)\,\mathrm{var}(x_{t-1}) + \beta_t \cdot 1$$

Since we normalized our image to $[-1, 1]$, $\mathrm{var}(x_0) \le 1$, so then for all $t$, $\mathrm{var}(x_t) \le 1$. Apparently, this is called "variance preserving": if our input has $\mathrm{var}(x_0) = 1$, then $\mathrm{var}(x_t) = 1$ as well. The point is that the variance stays constant through the entire forward process!
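
To see why the $\sqrt{1-\beta_t}$ scaling matters, here is a quick numerical sketch (assuming a constant $\beta$ for simplicity, which is not an actual DDPM schedule): with the scaling the variance stays at $1$, while without it the variance grows by roughly $\beta$ per step.

```python
import numpy as np

rng = np.random.default_rng(0)
n, T, beta = 100_000, 200, 0.02                     # constant beta, purely for illustration

x_scaled = rng.standard_normal(n)                   # pretend data with var(x_0) = 1
x_unscaled = x_scaled.copy()
for _ in range(T):
    eps = rng.standard_normal(n)
    x_scaled = np.sqrt(1.0 - beta) * x_scaled + np.sqrt(beta) * eps   # variance-preserving step
    x_unscaled = x_unscaled + np.sqrt(beta) * eps                     # same step without the scaling

print(x_scaled.var())     # ~1.0: variance preserved
print(x_unscaled.var())   # ~1 + T * beta = ~5.0: variance blows up
```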

Variance/Noise Schedule

Originally, the authors of DDPM used a linear schedule.

Variance Schedule of Linear (top) vs Cosine (bottom)
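
A minimal sketch of the two schedules in the figure; the endpoints $\beta_1 = 10^{-4}$, $\beta_T = 0.02$ with $T = 1000$ are the commonly quoted DDPM linear-schedule values, and the cosine variant follows the Improved-DDPM-style construction through $\bar\alpha_t$, so treat the exact constants as assumptions rather than something stated in this write-up.

```python
import numpy as np

def linear_beta_schedule(T: int, beta_start: float = 1e-4, beta_end: float = 0.02) -> np.ndarray:
    """Linearly increasing variances beta_1 ... beta_T (commonly quoted DDPM values)."""
    return np.linspace(beta_start, beta_end, T)

def cosine_beta_schedule(T: int, s: float = 0.008) -> np.ndarray:
    """Cosine schedule defined indirectly through alpha_bar (Improved-DDPM-style construction)."""
    steps = np.arange(T + 1)
    alpha_bar = np.cos((steps / T + s) / (1 + s) * np.pi / 2) ** 2
    alpha_bar = alpha_bar / alpha_bar[0]
    betas = 1.0 - alpha_bar[1:] / alpha_bar[:-1]
    return np.clip(betas, 0.0, 0.999)

betas = linear_beta_schedule(T=1000)                # one beta per forward step
print(betas[0], betas[-1])                          # 0.0001 ... 0.02
```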

Simplified Sampling Form

The joint distribution of the entire trajectory of $T$ time steps, the product of all $T$ conditional PDFs —

$$
\begin{aligned}
q(x_{1:T} \mid x_0) &= \prod_{t=1}^{T} q(x_t \mid x_{0:t-1}) && \text{(Chain rule)}\\
&= \prod_{t=1}^{T} q(x_t \mid x_{t-1}) && \text{(Markov Property)}
\end{aligned}
$$

— can be expressed as a simpler closed-form expression if we define the additional variables $\alpha_t := 1 - \beta_t$ and $\bar\alpha_t := \prod_{s=1}^{t} \alpha_s$:

$$
\begin{aligned}
x_1 &= \sqrt{\alpha_1}\, x_0 + \sqrt{1-\alpha_1}\,\epsilon_0 = \sqrt{\bar\alpha_1}\, x_0 + \sqrt{1-\bar\alpha_1}\,\epsilon_0 && \text{(Base Case)}\\
x_t &= \sqrt{\alpha_t}\, x_{t-1} + \sqrt{1-\alpha_t}\,\epsilon_{t-1} && \text{(Inductive Case)}\\
&= \sqrt{\alpha_t}\left(\sqrt{\bar\alpha_{t-1}}\, x_0 + \sqrt{1-\bar\alpha_{t-1}}\,\epsilon\right) + \sqrt{1-\alpha_t}\,\epsilon_{t-1} && \text{(Inductive Hypothesis)}\\
&= \sqrt{\alpha_t\bar\alpha_{t-1}}\, x_0 + \sqrt{\alpha_t\left(1-\bar\alpha_{t-1}\right)}\,\epsilon + \sqrt{1-\alpha_t}\,\epsilon_{t-1}\\
&= \sqrt{\bar\alpha_t}\, x_0 + \left[\sqrt{\alpha_t\left(1-\bar\alpha_{t-1}\right)}\,\epsilon + \sqrt{1-\alpha_t}\,\epsilon_{t-1}\right] && \text{(Combine Variance Step)}\\
&= \sqrt{\bar\alpha_t}\, x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon && \text{(See below)}
\end{aligned}
$$

The key combined-variance step is as follows: since $\epsilon$ and $\epsilon_{t-1}$ are sampled independently, a linear combination of independent Gaussians stays Gaussian, and yields a merged variance as follows

$$
\begin{aligned}
\mathrm{var}(X+Y) &= \mathrm{var}(X) + \mathrm{var}(Y) + 2\,\mathrm{cov}(X,Y)\\
&= \alpha_t\left(1-\bar\alpha_{t-1}\right)I + (1-\alpha_t)I && (\mathrm{cov}(X,Y) = 0\text{ by independence})\\
&= \left(1-\alpha_t\bar\alpha_{t-1}\right)I\\
&= \left(1-\bar\alpha_t\right)I
\end{aligned}
$$

which allows us to replace $\epsilon$, $\epsilon_{t-1}$ with samples from a single shared $\epsilon \sim \mathcal{N}(0, I)$. Thus we can write

$$x_t = \sqrt{\bar\alpha_t}\, x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon$$

and thus produce a sample

$$x_t \sim q(x_t \mid x_0) = \mathcal{N}\!\Bigl(x_t;\ \underbrace{\sqrt{\bar\alpha_t}\, x_0}_{\text{mean}},\ \underbrace{\left(1-\bar\alpha_t\right)I}_{\text{covariance}}\Bigr)$$

As $T \to \infty$, we should reach an isotropic Gaussian distribution, one where $x_T \sim \mathcal{N}(0, I)$ follows a pure Gaussian distribution with mean $0$. Note that this is because $\bar\alpha_T \to 0$!
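
Putting the closed form to work, here is a minimal sketch of sampling $x_t$ directly from $x_0$ in one shot (the schedule values, shapes, and helper name `q_sample` are illustrative assumptions):

```python
import numpy as np

def q_sample(x0: np.ndarray, t: int, alphas_bar: np.ndarray, rng=np.random.default_rng()) -> np.ndarray:
    """Sample x_t ~ q(x_t | x_0) = N(sqrt(alpha_bar_t) * x_0, (1 - alpha_bar_t) * I)."""
    a_bar = alphas_bar[t]
    eps = rng.standard_normal(x0.shape)                  # shared eps ~ N(0, I)
    return np.sqrt(a_bar) * x0 + np.sqrt(1.0 - a_bar) * eps

T = 1000
betas = np.linspace(1e-4, 0.02, T)                       # linear schedule (values assumed)
alphas_bar = np.cumprod(1.0 - betas)                     # alpha_bar_t = prod_{s<=t} alpha_s

x0 = np.random.uniform(-1.0, 1.0, size=(3, 32, 32))      # toy image in [-1, 1]
x_mid = q_sample(x0, t=500, alphas_bar=alphas_bar)       # partially noised image
x_T = q_sample(x0, t=T - 1, alphas_bar=alphas_bar)       # alpha_bar_T ~ 0, so x_T ~ N(0, I)
```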

This is advantageous because we already know how to sample Gaussian noise, so figuring out how to reverse the Gaussian noise in the reverse diffusion process allows us to generate random images!

Reverse Diffusion Process

Pasted image 20251225235102.png

We want to learn the reverse distribution $q(x_{t-1} \mid x_t)$ so that we can generate new images resembling our dataset's $x_0$. We approximate it with a deep learning model $p_\theta$, which estimates a mean and variance through parameters $\theta$:

$$p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\!\left(x_{t-1};\ \mu_\theta(x_t, t),\ \Sigma_\theta(x_t, t)\right)$$

One specific path we may take from $x_T$ to $x_0$ is represented by

$$p_\theta(x_{0:T}) = p(x_T) \prod_{t=1}^{T} p_\theta(x_{t-1} \mid x_t)$$

But the PDF of the entire reverse diffusion process is an "integral" over all the possible pathways we can take to reach $x_0$:

$$p_\theta(x_0) = \int p_\theta(x_{0:T})\, dx_{1:T}$$
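
A minimal sketch of ancestral sampling along one such path $x_T \to x_0$, assuming we already have a trained model exposing $\mu_\theta(x_t, t)$ and a diagonal $\Sigma_\theta(x_t, t)$; the `model.mean` / `model.variance` interface is hypothetical, not a real library API:

```python
import numpy as np

def reverse_sample(model, T: int, shape: tuple, rng=np.random.default_rng()) -> np.ndarray:
    """Follow one path of p_theta(x_{0:T}) = p(x_T) * prod_t p_theta(x_{t-1} | x_t)."""
    x = rng.standard_normal(shape)                   # x_T ~ N(0, I), the isotropic Gaussian prior
    for t in range(T, 0, -1):
        mu = model.mean(x, t)                        # mu_theta(x_t, t)        (hypothetical interface)
        var = model.variance(x, t)                   # diagonal of Sigma_theta (hypothetical interface)
        noise = rng.standard_normal(shape) if t > 1 else 0.0   # no noise added on the final step
        x = mu + np.sqrt(var) * noise                # reparameterized sample of x_{t-1}
    return x                                         # x_0, a newly generated image
```
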
Summary

Classifier-Free Guidance

Latent DDPM

Stable Diffusion

Sources