CS180 - Project 5 - Diffusion Models
Overview
In the first portion of this project, I explored the DeepFloyd IF diffusion model. Specifically, I implemented diffusion sampling loops and then used them for inpainting and to create optical illusions.
PART A:
Set Up
As stated earlier, we will be using the DeepFloyd IF diffusion model, which is a two-stage model. During the first stage, the model creates 64×64 images, while in the second stage the model takes those images as input and generates 256×256 images. Note that throughout this project I am using the seed 180.
Sampling from the Model
Below I have sampled from the model using a variety of prompts, varying the number of inference steps for each prompt.
Note that the text prompts are captioned below each image
num_inference_steps = 10
num_inference_steps = 20
num_inference_steps = 50
Sampling Loops
Now in this part of the project I wrote my own sampling loop code, which uses the pretrained DeepFloyd denoisers. As illustrated above, the model's job is to translate pure noise into a clean image. To train the model to do this, we start with a clean image $x_0$ and iteratively add noise to it; the diffusion model then learns to reverse this process. To generate images with a diffusion model, you start with pure noise $x_T$ sampled from a normal distribution, then predict and remove part of the noise at each iterative timestep until you are left with a clean image. For the DeepFloyd model, $T = 1000$.
Note that throughout a large portion of part A we used the image of the Campanile on the right as our test image
So here the forward process takes a clean image $x_0$ and adds noise to it to produce $x_t$: $x_t = \sqrt{\bar\alpha_t}\,x_0 + \sqrt{1 - \bar\alpha_t}\,\epsilon$, where $\epsilon \sim \mathcal{N}(0, 1)$. To implement this we can do the following:
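Here is a rough sketch of that forward process, assuming `alphas_cumprod` is the precomputed tensor of cumulative products $\bar\alpha_t$ (the variable names are my placeholders, not necessarily the project scaffold's):

```python
import torch

def forward(im, t, alphas_cumprod):
    """Noise a clean image to timestep t:
    x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps, with eps ~ N(0, I)."""
    alpha_bar = alphas_cumprod[t]            # cumulative product of alphas at timestep t
    eps = torch.randn_like(im)               # fresh Gaussian noise, same shape as the image
    return alpha_bar.sqrt() * im + (1 - alpha_bar).sqrt() * eps
```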
This is what the test image looks like at noise levels 250, 500, and 750 respectively.
Classical Denoising
Now we will try to denoise these images using a more classical approach: Gaussian blur filtering. We will see that the results aren't that great.
My best result was with a kernel size of 7, which was a relatively good balance between removing some of the noise and not blurring away too much of the original structure of the image.
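For reference, the classical baseline is just a low-pass filter; a minimal sketch using torchvision (kernel size 7 as above, sigma left to the library default):

```python
import torchvision.transforms.functional as TF

def blur_denoise(noisy_im, kernel_size=7):
    """Classical baseline: 'denoise' by low-pass filtering with a Gaussian blur."""
    return TF.gaussian_blur(noisy_im, kernel_size=kernel_size)
```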
One Step Denoising
Now, we'll use a pretrained diffusion model to denoise our images. Here we are using a UNet that has already been trained on a very large dataset of image pairs. We can use it to estimate the Gaussian noise in an image, and then remove that noise to recover (something close to) the original image.
The prompt used here is “a high quality photo”
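The one-step estimate just rearranges the forward equation: given the UNet's noise prediction, solve for the clean image. A sketch, where `unet` stands in for the stage-1 DeepFloyd denoiser and `prompt_embeds` for the embedding of "a high quality photo" (both are placeholder names, not the exact API):

```python
def one_step_denoise(x_t, t, unet, prompt_embeds, alphas_cumprod):
    """One-step estimate of the clean image, by rearranging the forward equation:
    x0_hat = (x_t - sqrt(1 - alpha_bar_t) * eps_hat) / sqrt(alpha_bar_t)."""
    alpha_bar = alphas_cumprod[t]
    eps_hat = unet(x_t, t, prompt_embeds)    # UNet's estimate of the noise in x_t
    return (x_t - (1 - alpha_bar).sqrt() * eps_hat) / alpha_bar.sqrt()
```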
Implementing Iterative Denoising
Rather than denoising in a single step, diffusion models denoise iteratively.
Theoretically, we could start with pure noise at timestep $T$, denoise for one step to get an estimate of $x_{T-1}$, and continue this all the way until we reach $x_0$. However, this is computationally inefficient and not cost effective.
By skipping steps we can speed this up. The rationale for why this is possible comes from a connection with differential equations, as described here. To do this, we create a strided list of timesteps and step through it; at each step we compute the following update:
$$x_{t'} = \frac{\sqrt{\bar\alpha_{t'}}\,\beta_t}{1 - \bar\alpha_t}\,x_0 + \frac{\sqrt{\alpha_t}\,(1 - \bar\alpha_{t'})}{1 - \bar\alpha_t}\,x_t + v_\sigma$$

where:
- $x_t$ is your image at timestep $t$
- $x_{t'}$ is your noisy image at timestep $t'$, where $t' < t$ (less noisy)
- $\bar\alpha_t$ is defined by `alphas_cumprod`, which contains all the cumulative products of the $\alpha$'s
- $\alpha_t = \bar\alpha_t / \bar\alpha_{t'}$
- $\beta_t = 1 - \alpha_t$
- $x_0$ is the current estimate of the clean image

The $v_\sigma$ term is random noise, which in the case of DeepFloyd is also predicted. We can interpret this process as a type of linear interpolation between the signal and the noise.
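A condensed sketch of that update step (again, `unet`, `prompt_embeds`, and `strided_timesteps` are my placeholder names; the variance term $v_\sigma$ is omitted):

```python
def iterative_denoise(x, strided_timesteps, i_start, unet, prompt_embeds, alphas_cumprod):
    """Walk down the strided timestep list (noisy -> clean), applying the update above."""
    for i in range(i_start, len(strided_timesteps) - 1):
        t, t_prime = strided_timesteps[i], strided_timesteps[i + 1]      # t > t'
        a_bar_t, a_bar_tp = alphas_cumprod[t], alphas_cumprod[t_prime]
        alpha_t = a_bar_t / a_bar_tp
        beta_t = 1 - alpha_t

        eps_hat = unet(x, t, prompt_embeds)                              # predicted noise
        x0_hat = (x - (1 - a_bar_t).sqrt() * eps_hat) / a_bar_t.sqrt()   # clean-image estimate

        # interpolate between the clean estimate and the current noisy image
        x = (a_bar_tp.sqrt() * beta_t / (1 - a_bar_t)) * x0_hat \
            + (alpha_t.sqrt() * (1 - a_bar_tp) / (1 - a_bar_t)) * x
    return x
```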
By using `i_start = 10`, meaning the denoising process starts at the 10th timestep in our strided list, we get the following, and we can see the denoising occurring over time.
As we can see above, the image under the title “Iterative Denoising” is the final image after iterative denoising, while the image under the title “One-Step Denoising” shows the result of a single denoising step, which looks much worse. Lastly, on the right, the predicted clean image using Gaussian blurring is displayed.
Diffusion Model Sampling
Previously, we used the diffusion model to denoise an image. Now we will use it to generate images from scratch by passing in random noise, essentially denoising pure noise. The prompt used here was “a high quality photo”.
Classifier-Free Guidance
In order to improve image quality, we can use a technique called classifier-free guidance (CFG). Note that this comes at the expense of image diversity. In this approach we estimate both the noise conditioned on the text prompt, $\epsilon_c$, and the unconditional noise, $\epsilon_u$. The new noise estimate becomes $\epsilon = \epsilon_u + \gamma(\epsilon_c - \epsilon_u)$, with $\gamma$ being the strength of CFG. Here we chose $\gamma = 7$.
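As a sketch (with the same placeholder `unet` and embedding names as before), each denoising step now runs the UNet twice:

```python
def cfg_noise_estimate(x_t, t, unet, cond_embeds, uncond_embeds, gamma=7.0):
    """Classifier-free guidance: push the conditional noise estimate away from the
    unconditional one by a factor gamma."""
    eps_cond = unet(x_t, t, cond_embeds)       # conditioned on the text prompt
    eps_uncond = unet(x_t, t, uncond_embeds)   # conditioned on the empty prompt ""
    return eps_uncond + gamma * (eps_cond - eps_uncond)
```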
Image-to-Image Translation
In this section, we will create images that are similar to our test image. Essentially, we add a little bit of noise and then force the image back onto the natural image manifold unconditionally. The results below are shown at the noise levels [1, 3, 5, 7, 10, 20], with the text prompt “a high quality photo”.
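Conceptually this is just: noise the image to an intermediate timestep, then run the iterative denoiser from above. A sketch of the loop over starting indices (`test_im`, `strided_timesteps`, and the other names are my placeholders):

```python
# Noise the test image to an intermediate timestep, then denoise it back.
# Smaller i_start -> more noise -> edits that drift further from the original.
for i_start in [1, 3, 5, 7, 10, 20]:
    t = strided_timesteps[i_start]
    noisy = forward(test_im, t, alphas_cumprod)          # push the image off the manifold
    edited = iterative_denoise(noisy, strided_timesteps, i_start,
                               unet, prompt_embeds, alphas_cumprod)
```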
Then I input the following new images:
And after applying the image-to-image translation, the results were:
Editing Hand-Drawn and Web Images
Now here I input three more images into the model: one from the internet (middle) and two that I hand drew (left, right).
Which resulted in the following:
Note that the noise levels [1, 3, 5, 7, 10, 20] and the text prompt remained the same.
Inpainting
Using the method described in the RePaint paper by Lugmayr et al., we can create a mask so that we run the diffusion process on only a selected part of the image.
Everything outside the mask is left untouched, while everything inside the mask undergoes the transformation.
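A sketch of the per-step projection, where the mask `m` is 1 in the region to regenerate and 0 elsewhere, and `forward` is the noising function from earlier (names are mine):

```python
def inpaint_step(x_t, t, x_orig, mask, alphas_cumprod):
    """After each denoising update, force the region outside the mask back to the
    (appropriately noised) original image:
        x_t <- m * x_t + (1 - m) * forward(x_orig, t)."""
    return mask * x_t + (1 - mask) * forward(x_orig, t, alphas_cumprod)
```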
Here is the resulting denoising process. Below are the intermediate results at timesteps 990, 840, 690, 540, 390, 240, and 60.
Here is the final resulting image
Here are some examples with my own images
With the masks being right over my face and over my cat
Text-Conditioned Image-to-image Translation
Here we change the prompt from “a high quality photo” to “a rocket ship”. These are the results:
And then on my own images:
My biggest takeaway is that the bigger the mask, the better this works.
So I did my face with a bigger mask and got the following result
Visual Anagrams
We can combine everything to create visual anagrams. In this part, we will create an image that looks like "an oil painting of an old man", but when flipped upside down will reveal "an oil painting of people around a campfire".
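The trick is to average two noise estimates: one from the upright image with the first prompt, and one from the flipped image with the second prompt (flipped back before averaging). A sketch with my placeholder names:

```python
import torch

def anagram_noise_estimate(x_t, t, unet, embeds_upright, embeds_flipped):
    """Visual anagram: average the upright estimate under prompt 1 with the
    (un-flipped) estimate of the upside-down image under prompt 2."""
    eps1 = unet(x_t, t, embeds_upright)
    eps2_flipped = unet(torch.flip(x_t, dims=[-2]), t, embeds_flipped)  # flip upside down
    eps2 = torch.flip(eps2_flipped, dims=[-2])                          # flip back
    return (eps1 + eps2) / 2
```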
Left is the original, right is the flipped (I’ve outlined the fire in pink to make it more clear)
Here is another attempt at the same visual anagram with the old man and campfire prompts. On the right, the campfire is at the bottom with lots of logs in it (the hair of the man on the left) and smoke (the beard of the man on the left).
I’ve outlined the rocket ship with the stream of exhaust behind it in the flipped version because it’s a little harder to see. The man with the hat has a beard.
Hybrid Images
Here we can create hybrid images using a diffusion model, inspired by the techniques from our earlier frequency project. To do this we first create a hybrid noise estimate, using the noise estimates from two different text prompts. We combine the two estimates by taking the high frequencies from one and the low frequencies from the other. Mathematically it looks like:
$$\epsilon = f_{\text{lowpass}}(\epsilon_1) + f_{\text{highpass}}(\epsilon_2), \quad \epsilon_1 = \text{UNet}(x_t, t, p_1), \quad \epsilon_2 = \text{UNet}(x_t, t, p_2)$$

where $p_1$ and $p_2$ are the two text prompt embeddings, and $f_{\text{lowpass}}$ and $f_{\text{highpass}}$ are low- and high-pass filters. This is known as factorized diffusion.
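A sketch using a Gaussian blur as the low-pass filter (the kernel size and sigma defaults here are my own choices, and the function names are placeholders):

```python
import torchvision.transforms.functional as TF

def hybrid_noise_estimate(x_t, t, unet, embeds_low, embeds_high, kernel_size=33, sigma=2.0):
    """Factorized diffusion: low frequencies from one prompt, high from the other.
        eps = lowpass(eps1) + highpass(eps2), with highpass(e) = e - lowpass(e)."""
    eps1 = unet(x_t, t, embeds_low)
    eps2 = unet(x_t, t, embeds_high)
    low = TF.gaussian_blur(eps1, kernel_size=kernel_size, sigma=sigma)
    high = eps2 - TF.gaussian_blur(eps2, kernel_size=kernel_size, sigma=sigma)
    return low + high
```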
Here are some examples:
Hybrid image of “a lithograph of a skull” and “a lithograph of waterfalls”
Further away you can see one big skull off to the right, potentially with smaller skulls on top of it.
Here’s another attempt at the skull and waterfall:
On the left you can see a waterfall passing through some trees and turning into a river, while on the right, from further away, you can make out the facial features of a skull.
Here are two other examples
This combines 'an oil painting of a snowy mountain village’ with 'a photo of the amalfi cost’
Up close it looks like the snowy mountain village with the mountains in the background, but from far away you can more clearly see the coast and the shoreline.
And here are two different examples of combining 'a man wearing a hat’ with ‘a rocket ship’
In both, from far away you can see a man with a hat taking up the whole frame, but up close it’s a rocket ship in the sky.
PART B:
Overview
In the second part of this project, we instead build diffusion models from the ground up, training them on the MNIST dataset.
Training a Single-Step Denoising UNet
For this part of the project we start by building a single-step denoising UNet using the following architecture
Along with the following simple and composed operators
Using the UNet to Train a Denoiser
If we have a noisy image $z$ and we want to train a denoiser $D_\theta$ to map $z$ back to a clean image $x$, we can use the L2 loss:

$$L = \mathbb{E}_{z,x}\,\lVert D_\theta(z) - x \rVert^2$$

We need training data pairs $(z, x)$ to train the denoiser, so during each training batch we can generate $z$ from a clean image $x$ as follows:

$$z = x + \sigma\epsilon, \quad \epsilon \sim \mathcal{N}(0, I)$$
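A sketch of generating these pairs on the fly inside the training loop (`denoiser` is a placeholder for the UNet; images assumed already scaled to [0, 1]):

```python
import torch

def add_noise(x, sigma):
    """Build a training pair (z, x): z = x + sigma * eps, with eps ~ N(0, I)."""
    eps = torch.randn_like(x)
    return x + sigma * eps

# inside the training loop, for a clean MNIST batch `x`:
#   z = add_noise(x, sigma=0.5)
#   loss = torch.nn.functional.mse_loss(denoiser(z), x)   # the L2 loss above
```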
Here we show what the noising process looks like over a range of noise levels $\sigma$.
Training Unconditional Net
Now we can use this data to train the unconditional UNet to denoise. This results in the following after 1 and 5 epochs of training with $\sigma = 0.5$.
And this is the loss curve over the course of training.
Out-of-Distribution Testing
Now, as noted earlier, we trained our model with $\sigma = 0.5$; let’s see if it can denoise outside that distribution by using other values of $\sigma$.
As we can see, it denoises relatively well for other values of $\sigma$. Note that here we tested on a range of $\sigma$ values both below and above the training value of 0.5.
Training a Diffusion Model
Adding Time Conditioning to UNet
To inject the scalar $t$ into the UNet model and condition it on time, the 180 course staff suggested using the following architecture.
Where, more specifically, those fully connected (FC) blocks are shown as follows.
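As a sketch, assuming the FCBlock is a small two-layer MLP (Linear → GELU → Linear is one reasonable reading of the diagram) whose output later modulates a UNet feature map:

```python
import torch.nn as nn

class FCBlock(nn.Module):
    """Small MLP that maps the scalar conditioning signal to a feature vector."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_ch, out_ch),
                                 nn.GELU(),
                                 nn.Linear(out_ch, out_ch))

    def forward(self, t):
        # t: (B, in_ch) -> (B, out_ch); later broadcast over a (B, out_ch, H, W) feature map,
        # e.g. feat = feat * fc_t(t)[:, :, None, None]
        return self.net(t)
```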
Training the UNet
Here, to train the UNet, we take inspiration from Ho et al. (2020) and their paper Denoising Diffusion Probabilistic Models, following the pseudocode shown above. We repeat this for different images and values of $t$ until the model converges, as shown below.
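A condensed sketch of that training step (Algorithm 1 from Ho et al., adapted to the time-conditioned UNet; the names, the normalization of $t$, and the default $T$ are my assumptions):

```python
import torch
import torch.nn.functional as F

def train_step(unet, x0, alphas_cumprod, optimizer, T=300):
    """One DDPM-style training step: noise a clean batch to a random timestep t
    and regress the UNet's output onto the injected noise."""
    t = torch.randint(1, T + 1, (x0.shape[0],), device=x0.device)   # random timesteps
    a_bar = alphas_cumprod[t].view(-1, 1, 1, 1)
    eps = torch.randn_like(x0)
    x_t = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * eps               # forward process
    loss = F.mse_loss(unet(x_t, t.float() / T), eps)                 # predict the noise
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```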
Sampling from the UNet
In order to sample, we use the algorithm above, also described by Ho et al. Here are the resulting samples after epochs 5 and 20.
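And a matching sketch of the sampler (Algorithm 2 in Ho et al.), walking from pure noise at $t = T$ back to $t = 1$; the schedule tensors and timestep normalization are assumptions consistent with the training sketch above:

```python
import torch

@torch.no_grad()
def sample(unet, betas, alphas, alphas_cumprod, T, shape=(1, 1, 28, 28), device="cpu"):
    """DDPM ancestral sampling: iteratively remove predicted noise, adding fresh
    noise at every step except the last."""
    x = torch.randn(shape, device=device)
    for t in range(T, 0, -1):
        z = torch.randn_like(x) if t > 1 else torch.zeros_like(x)
        t_batch = torch.full((shape[0],), t / T, device=device)      # normalized timestep
        eps_hat = unet(x, t_batch)
        x = (x - (1 - alphas[t]) / (1 - alphas_cumprod[t]).sqrt() * eps_hat) / alphas[t].sqrt() \
            + betas[t].sqrt() * z
    return x
```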
Adding Class-Conditioning to UNet
In order to improve our results, we can add class-conditioning so the model can focus on each digit. To do so, we add 2 more FCBlocks and condition on a one-hot encoded class vector rather than a scalar. We also drop the class conditioning 10% of the time (zeroing out the one-hot vector) so the model still learns to generate unconditionally.
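A sketch of how that conditioning vector might be built inside the training loop (names are mine; `p_uncond = 0.1` is the 10% dropout above):

```python
import torch
import torch.nn.functional as F

def make_class_conditioning(labels, num_classes=10, p_uncond=0.1):
    """One-hot encode the digit labels, then zero out the conditioning vector for
    ~10% of the batch so the model also learns the unconditional distribution."""
    c = F.one_hot(labels, num_classes).float()                            # (B, 10)
    keep = (torch.rand(labels.shape[0], device=labels.device) > p_uncond).float()
    return c * keep[:, None]                                              # dropped rows -> all zeros
```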
Sampling from the Class-Conditioned UNet
To sample, we again reference Ho et al. for their sampling algorithm. Here are the resulting samples from epochs 5 and 20. However, they look very similar, so for a starker contrast I’ve also included epoch 1.
And lastly, here is the training loss over time.