Project 5:
Fun with diffusion models!
In this project we will have some fun with diffusion. We will implement and deploy diffusion models for
image generation.
As with our last project, we will split this project into two parts.
Part A: The power of diffusion models
In Part A we will implement a diffusion sampling loop, play around with pretrained diffusion models, and use them for inpainting and creating optical illusions.
Part 0: Setup
We are going to use the diffusion model DeepFloyd to generate our images from text input. It is a two-stage model by Stability AI: the first stage generates a 64x64 image from the text prompt, and the second stage upsamples it to a 256x256 image.
Let's see what our diffusion model generates if we feed it some prompts. We use the following prompts:
"an oil painting of a snowy mountain village"
"a man wearing a hat"
"a rocket ship"
The results are shown below; I used different numbers of inference steps to generate the images:
I used 2 inference steps to generate these images; we can see that using only 2 steps does not yield a good result, and we end up with mostly noise.
Here I used 10 inference steps and the results are pretty good; we can see that the generated images actually correspond to the prompts.
These images were generated using 40 inference steps. Even though that is four times as many steps as for the previous three images, I don't see a big difference in quality.
Part 1: Sampling loops
How does diffusion work? If we have a clean image of something, we can progressively add noise to it at every step. Let's say we have a clean image at time step 0. We keep adding noise at every step until we reach time step T, where the image is pure noise.
A diffusion model tries to reverse this process. It starts with a noisy image at time step T and removes noise until it ends up with a clean image. In other words, the model is trained to remove the noise.
1.1 Implementing the Forward Process
Now we want to implement the sampling loop, and we start with the forward process, where we take a clean image and add noise to it. The following function describes how noise is added to the image at timestep t:
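As a minimal sketch, this forward step can be implemented like so (I assume alphas_cumprod is the cumulative product of alphas taken from the DeepFloyd scheduler; the variable name is mine):

import torch

def forward(im, t, alphas_cumprod):
    # x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps, with eps ~ N(0, I)
    alpha_bar = alphas_cumprod[t]
    eps = torch.randn_like(im)
    return alpha_bar.sqrt() * im + (1 - alpha_bar).sqrt() * eps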
Using the Berkeley Campanile as our example image, and the previously mentioned function with 1000 time steps, we get the following results:
Berkeley Campanile
Noisy Campanile at t=250
Noisy Campanile at t=500
Noisy Campanile at t=750
1.2 Classical Denoising
Now we want to try to denoise the images we generated in the previous step. We can use the classical method of Gaussian blur filtering to remove the noise. It improves the noisy images somewhat, but the results are not satisfying yet.
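As a reference, such a classical baseline can be done with torchvision's Gaussian blur (the kernel size and sigma below are illustrative defaults, not necessarily the exact values I used):

from torchvision.transforms.functional import gaussian_blur

def classical_denoise(noisy_im, kernel_size=5, sigma=2.0):
    # Gaussian blur suppresses high-frequency noise, but also blurs away image detail.
    return gaussian_blur(noisy_im, kernel_size=kernel_size, sigma=sigma)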
Noisy Campanile at t=250
Noisy Campanile at t=500
Noisy Campanile at t=750
Gaussian Blur Denoising at t=250
Gaussian Blur Denoising at t=500
Gaussian Blur Denoising at t=750
1.3 One-Step Denoising
The model we are using comes with a denoiser that has been trained on a very large image dataset. We can use this pretrained denoiser to denoise the noisy images we generated in the previous step, by letting it predict the Gaussian noise in the image and then removing that noise. The denoiser is a UNet, which also needs the timestep t as an additional input.
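Given the UNet's noise estimate (how exactly it is obtained from DeepFloyd is omitted here), the clean-image estimate follows by solving the forward equation for x_0; a minimal sketch:

def estimate_x0(x_t, noise_est, t, alphas_cumprod):
    # Invert x_t = sqrt(ab_t) * x_0 + sqrt(1 - ab_t) * eps using the predicted noise.
    ab_t = alphas_cumprod[t]
    return (x_t - (1 - ab_t).sqrt() * noise_est) / ab_t.sqrt()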
The original image, the noisy image and the estimate of the original image are shown below:
Original image: Berkeley Campanile
Noisy Campanile at t=250
Noisy Campanile at t=500
Noisy Campanile at t=750
Assumed noise at t=250
Assumed noise at t=500
Assumed noise at t=750
Estimate of the original image at t=250
Estimate of the original image at t=500
Estimate of the original image at t=750
1.4 Iterative Denoising
Previously we tried to denoise the image in one step, but one-step denoising performs worse the more noise there is, so we now denoise the image iteratively, which is how diffusion models are meant to be used. In theory we would start at timestep 1000 and denoise one step at a time until we reach timestep 0 and end up with the clean image, but this would cost a lot of time and compute, so we denoise using a larger stride of 30 timesteps, which still yields similarly good results.
We use the following formula from the DDPM paper to obtain the less noisy image at the next timestep:
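Rendered as a code sketch (variable names are mine: x0_est is the current estimate of the clean image, v_sigma the random noise term weighted by the predicted variance, and t' the less noisy timestep):

def denoise_step(x_t, x0_est, t, t_prime, alphas_cumprod, v_sigma):
    # One step of iterative denoising from timestep t to the less noisy timestep t'.
    ab_t, ab_tp = alphas_cumprod[t], alphas_cumprod[t_prime]
    alpha = ab_t / ab_tp   # alpha_t
    beta = 1 - alpha       # beta_t
    return ((ab_tp.sqrt() * beta / (1 - ab_t)) * x0_est
            + (alpha.sqrt() * (1 - ab_tp) / (1 - ab_t)) * x_t
            + v_sigma)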
Below I show the result of every 5th iteration of the iterative denoising loop; each iteration uses a stride of 30 timesteps:
Estimate of the original image at t=690
Estimate of the original image at t=540
Estimate of the original image at t=390
Estimate of the original image at t=240
Estimate of the original image at t=90
Original image: Berkeley Campanile
Iteratively Denoised Campanile
One Step Denoised Campanile
Gaussian Blurred Campanile
We can see that the iterative denoising method works better than the one-step denoising method, but it still does not recover the original image. The result looks like a tower, but it is not exactly the Berkeley Campanile.
1.5 Diffusion Model Sampling
Now we want to sample from our diffusion model by using the prompt "a high quality image" and generating an image from pure noise. Here are 5 sample images generated with this prompt:
Sample 1
Sample 2 (Who is this?)
Sample 3
Sample 4
Sample 5
1.6 Classifier-Free Guidance (CFG)
The images above are reasonable but not great and a little nonsensical, which is why we use classifier-free guidance (CFG) to guide the diffusion model towards better images.
CFG works by combining a conditional noise estimate, i.e. a denoising prediction that aligns with the given prompt, with an unconditional noise estimate that uses no prompt. We use the following formula for it:
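In code the combination is a one-liner (gamma is the CFG scale):

def cfg_noise_estimate(eps_uncond, eps_cond, gamma=7.0):
    # gamma > 1 pushes the estimate further in the direction of the conditional prediction.
    return eps_uncond + gamma * (eps_cond - eps_uncond)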
We implement classifier-free guidance according to the paper linked here and use a CFG scale of 7 to generate the new images. We can see that this improves the image quality:
Sample 1 with CFG
Sample 2 with CFG
Sample 3 with CFG
Sample 4 with CFG
Sample 5 with CFG
1.7 Image-to-image Translation
Image-to-image translation is the process of taking one image and transforming it into a different but related image. Here we do that through noise: we add varying amounts of noise to an image and then use our diffusion model to predict and remove that noise. Depending on how much noise we add, the result will look only slightly or vastly different, since the model hallucinates new content while removing the noise.
Here we follow the SDEdit algorithm to force an image that has been perturbed with some noise back onto the manifold of natural images, without using any conditioning.
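A rough sketch of the procedure, reusing the forward() function from 1.1 and the iterative denoising loop from 1.4 (the iterative_denoise signature and strided_timesteps name here are illustrative):

def sdedit(im, i_start, strided_timesteps, alphas_cumprod, prompt_embeds):
    # Noise the image up to the timestep at index i_start, then denoise from there.
    x = forward(im, strided_timesteps[i_start], alphas_cumprod)
    return iterative_denoise(x, i_start=i_start, prompt_embeds=prompt_embeds)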
Here we show the SDEdit results for 3 images at noise levels [1, 3, 5, 7, 10, 20] with the text prompt "a high quality photo". We start with the Berkeley Campanile image, then a nonsensical image of our Oranapple, and finally a very commonly seen image, our Earth:
Berkeley Campanile:
Original image: Berkeley Campanile
SDEdit with i_start=1
SDEdit with i_start=3
SDEdit with i_start=5
SDEdit with i_start=7
SDEdit with i_start=10
SDEdit with i_start=20
Oranapple (since this is a very unrealistic image, the model has difficulties recreating it):
Original image: Oranapple
SDEdit with i_start=1
SDEdit with i_start=3
SDEdit with i_start=5
SDEdit with i_start=7
SDEdit with i_start=10
SDEdit with i_start=20
Earth (since this is a very common image, the model recreates it pretty early on):
Original image: Earth
SDEdit with i_start=1
SDEdit with i_start=3
SDEdit with i_start=5
SDEdit with i_start=7
SDEdit with i_start=10
SDEdit with i_start=20
1.7.1 Editing Hand-Drawn and Web Images
We can also use the SDEdit algorithm to edit hand-drawn and web images. Here we first show the result of the SDEdit algorithm on an unrealistic web image:
Unrealistic image of a cat, with a graphic of a rainbow and a pixel cat reflected in its glasses:
Original image: Cat with glasses and rainbow pixel cat reflection
SDEdit with i_start=1
SDEdit with i_start=3
SDEdit with i_start=5
SDEdit with i_start=7
SDEdit with i_start=10
SDEdit with i_start=20
My drawing of a person:
Original image: My drawing of a person
SDEdit with i_start=1
SDEdit with i_start=3
SDEdit with i_start=5
SDEdit with i_start=7
SDEdit with i_start=10
SDEdit with i_start=20
My drawing of a house:
Original image: My drawing of a house
SDEdit with i_start=1
SDEdit with i_start=3
SDEdit with i_start=5
SDEdit with i_start=7
SDEdit with i_start=10
SDEdit with i_start=20
1.7.2 Inpainting
Inpainting is the process of using a mask to mark areas of an image that should be regenerated by our diffusion model. We adjust our denoising loop so that, after each step, the denoised image is forced to have the same pixels as the (appropriately noised) original image wherever the mask m is 0; the rest we leave as is. We are basically applying the following formula at every step:
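As code, applied after every denoising step (forward() is the noising function from 1.1):

def inpaint_step(x_t, x_orig, mask, t, alphas_cumprod):
    # Keep the freshly denoised pixels where mask == 1, and re-noise the original
    # image to timestep t everywhere else so both regions stay consistent.
    return mask * x_t + (1 - mask) * forward(x_orig, t, alphas_cumprod)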
We end up with the following result:
Original image: Campanile
Mask
Hole to Fill
Campanile Inpainted
For the cool cat, we end up with the following result:
Original image: Cool Cat
Mask
Hole to Fill
Cool Cat Inpainted
Castle on a snowy mountain:
Original image: Castle on mountain
Mask
Hole to Fill
Castle on mountain Inpainted
1.7.3 Text-Conditional Image-to-image Translation
For text-conditional image-to-image translation we use basically the same method as for standard image-to-image translation, but this time we choose different prompts to condition the diffusion model.
For this text conditional image to image translation, we used the prompt "a rocket ship" and our Campanile image:
Rocket Ship at noise level 1
Rocket Ship at noise level 3
Rocket Ship at noise level 5
Rocket Ship at noise level 7
Rocket Ship at noise level 10
Rocket Ship at noise level 20
Campanile
For this text conditional image to image translation, we used the prompt "a man wearing a hat" and our Oranapple image:
a man wearing a hat at noise level 1
a man wearing a hat at noise level 3
a man wearing a hat at noise level 5
a man wearing a hat at noise level 7
a man wearing a hat at noise level 10
a man wearing a hat at noise level 20
Oranapple
For this text conditional image to image translation, we used the prompt "a pencil" and our earth image:
a pencil at noise level 1
a pencil at noise level 3
a pencil at noise level 5
a pencil at noise level 7
a pencil at noise level 10
a pencil at noise level 20
Our Earth
1.8 Visual Anagrams
In this section we are going to create visual anagrams, a kind of optical illusion: these images appear to show something else when flipped upside down.
To achieve this, we denoise the image using 2 different prompts. We use the prompt "an oil painting of an old man" to obtain a first noise estimate, and then use a different prompt, "an oil painting of people around a campfire", to obtain a noise estimate for the flipped input image.
We then combine both noise estimates and use this combined noise estimate to perform the diffusion steps. The procedure can be summarized with the following functions:
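A sketch of the combined noise estimate (noise_pred stands in for the CFG noise prediction from 1.6; its signature is simplified here):

import torch

def anagram_noise_estimate(noise_pred, x_t, t, emb_upright, emb_flipped):
    # Noise estimate for the upright image under the first prompt.
    eps1 = noise_pred(x_t, t, emb_upright)
    # Noise estimate for the vertically flipped image under the second prompt,
    # flipped back so it lives in the upright image's coordinates.
    eps2 = torch.flip(noise_pred(torch.flip(x_t, dims=[-2]), t, emb_flipped), dims=[-2])
    return (eps1 + eps2) / 2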
We show 4 resulting visual anagrams generated using the described procedure:
An Oil Painting of People around a Campfire
An Oil Painting of an Old Man
An Oil Painting of People around a Campfire
An Oil Painting of an Old Man
A photo of the Amalfi cost
A photo of a dog
A photo of the Amalfi cost
A photo of a dog
1.9 Hybrid Images
In this last part we will implement something similar to Factorized Diffusion to create hybrid images, like in our Project 2. We use a process similar to 1.8, where we create 2 noise estimates with 2 different text prompts and then combine the high-frequency part of one noise estimate with the low-frequency part of the other.
I used a Gaussian kernel of size 33 and a sigma of 2 for the low-pass filter. The process can be described with the following formulas:
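A sketch of the combination, using the Gaussian blur as the low-pass filter (eps_low and eps_high are the noise estimates for the "far away" and "close up" prompts respectively):

from torchvision.transforms.functional import gaussian_blur

def hybrid_noise_estimate(eps_low, eps_high, kernel_size=33, sigma=2.0):
    # Low frequencies come from one prompt's noise estimate,
    # high frequencies from the other prompt's noise estimate.
    low = gaussian_blur(eps_low, kernel_size=kernel_size, sigma=sigma)
    high = eps_high - gaussian_blur(eps_high, kernel_size=kernel_size, sigma=sigma)
    return low + high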
These are our results:
(To see the low frequency, move far away from the screen and squint your eyes :D)
Close up: A lithograph of waterfalls (high frequency)
Far away: A lithograph of a skull (low frequency)
Close up: A pencil (high frequency)
Far away: A rocket ship (low frequency)
Close up: An oil painting of people around a campfire (high frequency)
Far away: An oil painting of a snowy mountain village (low frequency)
For the last image, since both prompts are similar and can coexist, the image became an oil painting of people around a campfire in a snowy mountain village.
Part B: Diffusion Models from Scratch!
Part 1: Training a Single-Step Denoising U-Net
1.1 Implementing the UNet
First we want to create a single-step denoiser. We train a model to take a noisy image z as input and output the denoised image x. We use the following loss function:
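Written out, this is just an L2 loss between the denoised output and the clean image (D_θ is the UNet denoiser, z the noisy input):

L(\theta) = \mathbb{E}_{x,\,z}\,\big\lVert D_\theta(z) - x \big\rVert_2^2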
We implemented the following U-Net architecture in pytorch for our model:
Unconditional Unet
This Unconditional Unet is made up of the following simple and composed operations:
Standard UNet Operations
The blocks perform the following operations:
(1) Conv is a convolutional layer that doesn't change the image resolution, only the channel dimension.
(2) DownConv is a convolutional layer that downsamples the tensor by 2.
(3) UpConv is a convolutional layer that upsamples the tensor by 2.
(4) Flatten is an average pooling layer that flattens a 7x7 tensor into a 1x1 tensor. 7 is the resulting height and width after the downsampling operations.
(5) Unflatten is a convolutional layer that unflattens/upsamples a 1x1 tensor into a 7x7 tensor.
(6) Concat is a channel-wise concatenation between tensors with the same 2D shape. This is simply torch.cat().
(7) ConvBlock is similar to Conv but includes an additional Conv. Note that it has the same input and output shape as (1) Conv.
(8) DownBlock is similar to DownConv but includes an additional ConvBlock. Note that it has the same input and output shape as (2) DownConv.
(9) UpBlock is similar to UpConv but includes an additional ConvBlock. Note that it has the same input and output shape as (3) UpConv.
1.2 Using the UNet to Train a Denoiser
To train the UNet as a denoiser, we first add noise to our MNIST images in the following way:
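A minimal sketch of this noising step:

import torch

def add_noise(x, sigma):
    # z = x + sigma * eps, with eps ~ N(0, I); x is a batch of clean MNIST images.
    return x + sigma * torch.randn_like(x)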
Here are the resulting noisy images, that were generated using different sigmas:
Varying levels of noise on MNIST digits
Now we can finally start training our model on the noised images. We use a noise level of sigma = 0.5, a batch size of 256, and train for 5 epochs. We also use a hidden dimension of D=128 and a learning rate of 1e-4.
We get the following loss curve (I used tensorboard for some visualizations):
Loss curve for the training of the denoiser, visualized in TensorBoard (steps on the x-axis)
For the first and 5th epoch we get the following results: (The left column shows the ground truth, the middle column shows the noisy input images and the right column the denoised output)
Result after 1 epoch
Result after 5 epochs
Out-of-Distribution Testing
Our model was trained on images with added noise of sigma = 0.5. We can now try to denoise images that were noised with a different amount of noise. Here we visualize the results for various noise levels from 0 to 1:
Results on digits from the test set with varying noise levels.
Part 2: Training a Diffusion Model
Now we can start with diffusion proper; we will follow the Denoising Diffusion Probabilistic Models (DDPM) paper.
We first have to make a small change to our model. Instead of predicting the denoised image x, we will try to predict the noise that was added to the image:
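In other words, the loss becomes an L2 loss on the noise (ε_θ is the UNet and x_t the image noised to timestep t):

L(\theta) = \mathbb{E}_{x_0,\,\epsilon,\,t}\,\big\lVert \epsilon_\theta(x_t, t) - \epsilon \big\rVert_2^2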
Here is the unet that outputs the predicted noise
Since we already saw in Part A that one-step denoising is not as good as iterative denoising, we will use iterative denoising in this part, with T = 300, to improve our results.
2.1 Adding Time Conditioning to UNet
There are many ways to inject the timestep t into our UNet model; we adjust our UNet and use fully connected blocks to add the time conditioning:
Our New Unet with the time conditioning
Our Fully Connected block
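As a rough sketch, such a fully connected block can be as simple as a small MLP; the exact layers here (Linear → GELU → Linear) are an assumption for illustration:

import torch.nn as nn

class FCBlock(nn.Module):
    # Maps a conditioning signal (e.g. the normalized timestep) to a vector
    # that is broadcast onto a feature map inside the UNet.
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_ch, out_ch), nn.GELU(), nn.Linear(out_ch, out_ch))

    def forward(self, x):
        return self.net(x)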
2.2 Training the UNet
After adjusting our UNet architecture we can now begin training. We train for 20 epochs in total, with a batch size of 128 and a hidden dimension of 64. For the optimizer we use a learning rate of 1e-3 with an exponential learning rate decay of gamma = 0.1^(1.0/num_epochs). The training algorithm can be seen in the next image:
Algorithm 1: Training time-conditioned UNet
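A sketch of one training step of this algorithm (the unet(x_t, t) call and the t/T normalization are how I pass the timestep in; the exact interface may differ):

import torch
import torch.nn.functional as F

def train_step(unet, x0, alphas_cumprod, T, optimizer):
    # Sample a random timestep per image, noise the batch, and regress the noise.
    t = torch.randint(0, T, (x0.shape[0],), device=x0.device)
    ab = alphas_cumprod[t].view(-1, 1, 1, 1)
    eps = torch.randn_like(x0)
    x_t = ab.sqrt() * x0 + (1 - ab).sqrt() * eps
    loss = F.mse_loss(unet(x_t, t.float() / T), eps)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()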
Training the model for 20 epochs, we get the following loss curve:
Loss curve for the training of the time-conditioned unet
2.3 Sampling from the UNet
Now that the model is trained, we can try to sample some images from it. We sample some digits after epochs 5 and 20.
We use the following sampling algorithm to sample from our model:
Algorithm 2: Sampling from time-conditioned UNet
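A sketch of this sampling loop (the indexing conventions and the unet(x, t/T) signature are simplifications of my actual code):

import torch

@torch.no_grad()
def sample(unet, betas, alphas, alphas_cumprod, T, shape=(16, 1, 28, 28), device="cpu"):
    # Start from pure noise and apply the DDPM update from t = T-1 down to t = 0.
    x = torch.randn(shape, device=device)
    for t in range(T - 1, -1, -1):
        t_norm = torch.full((shape[0],), t / T, device=device)
        eps = unet(x, t_norm)  # predicted noise at this timestep
        z = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = (x - (1 - alphas[t]) / (1 - alphas_cumprod[t]).sqrt() * eps) / alphas[t].sqrt()
        x = x + betas[t].sqrt() * z
    return x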
We can see that the results improve with more epochs; the images look more like digits, but they still aren't perfect, and some look like combinations of different digits:
Sample result after epoch 5
Sample result after epoch 20
2.4 Adding Class-Conditioning to UNet
We can now add class conditioning to our model. The class conditioning improves the performance of our model by telling it which digit is which, and at the same time it gives us more control during sampling.
Since we still want our model to work even when there is no class conditioning, we implement dropout, setting the class-conditioning vector to 0 10% of the time.
We use the following method to add our class conditioning to the model:
fc1_t = FCBlock(...)
fc1_c = FCBlock(...)
fc2_t = FCBlock(...)
fc2_c = FCBlock(...)

t1 = fc1_t(t)
c1 = fc1_c(c)  # c is the one-hot class vector (zeroed out 10% of the time)
t2 = fc2_t(t)
c2 = fc2_c(c)

# Follow diagram to get unflatten.
# Replace the original unflatten with modulated unflatten.
unflatten = c1 * unflatten + t1
# Follow diagram to get up1.
...
# Replace the original up1 with modulated up1.
up1 = c2 * up1 + t2
# Follow diagram to get the output.
We can use the following algorithm to train our model:
Algorithm: Training class-conditioned UNet
Training yields the following loss curve this time:
Loss curve for the training of the class-conditioned unet
2.5 Sampling from the Class-Conditioned UNet
We use classifier free guidance and the following adjusted sampling algorithm to sample from our model:
Algorithm 4: Sampling from class-conditioned UNet
Using the conditioning variable we are able to control the generated output of our model. We use it to generate 4 samples of each digit, which yields the following results:
Sample result after epoch 5
Sample result after epoch 20