CS180 - Project 5 - Diffusion Models

Overview

In the first portion of this project, I explored the DeepFloyd IF diffusion model. Specifically, I implemented diffusion sampling loops and then used them to create inpainted images and optical illusions.

PART A:

Set Up

As stated earlier, we will be using the DeepFloyd IF diffusion model, which is a two-stage model. During the first stage, the model creates 64×64 images, while in the second stage the model takes those images as input and generates 256×256 images. Note that throughout this project I am using the seed 180.
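For reference, here is a minimal sketch of how the two stages might be loaded with Hugging Face diffusers; the checkpoint names, fp16 variant, and device choice are my assumptions, not details from the report.

```python
import torch
from diffusers import DiffusionPipeline

# Stage 1 generates 64x64 images; stage 2 upsamples them to 256x256.
stage_1 = DiffusionPipeline.from_pretrained(
    "DeepFloyd/IF-I-XL-v1.0", variant="fp16", torch_dtype=torch.float16
).to("cuda")
stage_2 = DiffusionPipeline.from_pretrained(
    "DeepFloyd/IF-II-L-v1.0", variant="fp16", torch_dtype=torch.float16
).to("cuda")

torch.manual_seed(180)  # seed used throughout the project
```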

Sampling from the Model

Below I have sampled from the model using a variety of prompts, varying the number of inference steps for each prompt.

Note that the text prompts are captioned below each image

num_inference_steps = 10

an oil painting of a snowy mountain village
a man wearing a hat
a rocket ship

num_inference_steps = 20

an oil painting of a snowy mountain village
a man wearing a hat

a rocket ship

num_inference_steps = 50

an oil painting of a snowy mountain village
a man wearing a hat
a rocket ship

Sampling Loops

In this part of the project I wrote my own sampling loop code, which uses the pretrained DeepFloyd denoisers. As illustrated above, the model's job is to translate pure noise into a clean image. To train the model to do this, we start with a clean image ($x_0$) and iteratively add noise to it; the diffusion model then learns to reverse this process. To generate images with a diffusion model, you start with pure noise $x_T$ sampled from a normal distribution, then predict and remove part of the noise at each iterative time step until you reach the clean image $x_0$. For the DeepFloyd model, $T = 1000$.

Note that throughout a large portion of Part A, we use the image of the Campanile (shown on the right) as our test image.

The forward process takes a clean image and adds noise to it. To implement this, we compute the following:

$$q(x_t \mid x_0) = N\!\left(x_t;\ \sqrt{\bar\alpha_t}\, x_0,\ (1 - \bar\alpha_t)\mathbf{I}\right)$$

which is the same as computing

$$x_t = \sqrt{\bar\alpha_t}\, x_0 + \sqrt{1 - \bar\alpha_t}\,\epsilon \qquad \text{where } \epsilon \sim N(0, 1)$$
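A minimal sketch of this forward process, assuming `alphas_cumprod` holds the scheduler's cumulative products $\bar\alpha_t$ indexed by timestep:

```python
import torch

def forward(im, t, alphas_cumprod):
    """Noise a clean image x_0 to timestep t: x_t = sqrt(a_bar_t) x_0 + sqrt(1 - a_bar_t) eps."""
    alpha_bar = alphas_cumprod[t]
    eps = torch.randn_like(im)                      # eps ~ N(0, I)
    return alpha_bar.sqrt() * im + (1 - alpha_bar).sqrt() * eps
```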

This is what the test image looks like at noise levels 250, 500, and 750, respectively.

Classical Denoising

Now we will try to denoise these images using a more classical approach: Gaussian blur filtering. We will see that the results aren’t that great.

My best result was with a kernel size of 7, which struck a relatively good balance between removing some of the noise and not blurring away too much of the original structure of the images.

Noisy Images
Denoised with a Gaussian of kernel size 7
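A sketch of this Gaussian-blur baseline; the sigma value here is a hypothetical choice, only the kernel size of 7 comes from my experiments.

```python
import torchvision.transforms.functional as TF

def classical_denoise(noisy_im, kernel_size=7, sigma=2.0):
    # Simple low-pass filtering: blurs away high-frequency noise, but also detail.
    return TF.gaussian_blur(noisy_im, kernel_size=kernel_size, sigma=sigma)
```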

One Step Denoising

Now, we'll use a pretrained diffusion model to denoise our images. Here we are using a UNet that has already been trained on a very, very large dataset of $(x_0, x_t)$ pairs of images. We can use it to estimate the Gaussian noise in the image and then remove this noise to recover (something close to) the original image.

The prompt used here is “a high quality photo”
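Roughly, the one-step denoise looks like the sketch below. The UNet call and the split of its 6-channel output into a noise estimate follow the diffusers DeepFloyd IF stage-1 UNet, but treat those details as assumptions.

```python
import torch

def one_step_denoise(x_t, t, unet, prompt_embeds, alphas_cumprod):
    with torch.no_grad():
        out = unet(x_t, t, encoder_hidden_states=prompt_embeds).sample
    eps = out[:, :3]                                # predicted noise (drop the variance channels)
    a_bar = alphas_cumprod[t]
    # Invert x_t = sqrt(a_bar) x_0 + sqrt(1 - a_bar) eps to estimate the clean image.
    x_0 = (x_t - (1 - a_bar).sqrt() * eps) / a_bar.sqrt()
    return x_0, eps
```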

 Implementing Iterative Denoising

Rather than denoising in one step, diffusion models are designed to denoise iteratively.

Theoretically, we could start with pure noise $x_{1000}$ at timestep $T = 1000$, denoise for one step to get an estimate of $x_{999}$, and continue this all the way until we reach $x_0$. However, this is slow and computationally expensive.

We can speed this up by skipping steps. The rationale for why this is possible comes from a connection with differential equations, as described here. To do this, we create a strided list of timesteps and, at each step, apply the following update:

$$x_{t'} = \frac{\sqrt{\bar\alpha_{t'}}\,\beta_t}{1 - \bar\alpha_t}\, x_0 + \frac{\sqrt{\alpha_t}\,(1 - \bar\alpha_{t'})}{1 - \bar\alpha_t}\, x_t + v_\sigma$$

where:

- $t'$ is the next timestep in our strided list (so $t' < t$, i.e. less noisy),
- $\bar\alpha_t$ comes from the scheduler's cumulative products of alphas, $\alpha_t = \bar\alpha_t / \bar\alpha_{t'}$, and $\beta_t = 1 - \alpha_t$,
- $x_0$ is our current estimate of the clean image (computed with the one-step denoising formula from the previous section), and
- $v_\sigma$ is random noise, which in the case of DeepFloyd is also predicted.

We can interpret this process as a type of linear interpolation between the signal and the noise.
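Putting it together, here is a sketch of the iterative denoising loop; `estimate_x0_and_eps` stands in for the one-step denoise above, and I drop the $v_\sigma$ term for brevity.

```python
def iterative_denoise(x_t, strided_timesteps, i_start, estimate_x0_and_eps, alphas_cumprod):
    for i in range(i_start, len(strided_timesteps) - 1):
        t, t_prime = strided_timesteps[i], strided_timesteps[i + 1]   # t > t' (t' is less noisy)
        a_bar_t, a_bar_tp = alphas_cumprod[t], alphas_cumprod[t_prime]
        alpha_t = a_bar_t / a_bar_tp
        beta_t = 1 - alpha_t
        x_0, _ = estimate_x0_and_eps(x_t, t)                          # current clean-image estimate
        x_t = (a_bar_tp.sqrt() * beta_t / (1 - a_bar_t)) * x_0 \
            + (alpha_t.sqrt() * (1 - a_bar_tp) / (1 - a_bar_t)) * x_t
    return x_t
```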

Using i_start = 10, meaning the denoising process starts at index 10 of the strided timesteps, we get the following, and we can see the denoising occurring over time:

t=660
t=510
t=360
t=210
t=60

As we can see above, the image under the title “Iterative Denoising” is the final result of iterative denoising, while the image under the title “One-Step Denoising” shows the result of a single denoising step, which looks much worse. Lastly, on the right, the predicted clean image using Gaussian blurring is displayed.

Diffusion Model Sampling

Previously, we used the diffusion model to denoise an image. Now we will use it to generate images from scratch by passing in random noise and essentially denoising pure noise. The prompt used here was “a high quality photo”.

Classifier Free Guidance

In order to improve the image quality, we can use a technique called classifier-free guidance (CFG). Note that this comes at the expense of image diversity. In this approach we estimate both the noise conditioned on the text prompt and the unconditional noise. The new noise estimate becomes $\epsilon = \epsilon_u + \gamma (\epsilon_c - \epsilon_u)$, with $\gamma$ being the strength of CFG. Here we chose $\gamma = 7$.
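In code, the CFG update is just a weighted combination of two noise estimates; `eps_cond` and `eps_uncond` are assumed to come from two UNet passes, one with the text prompt embedding and one with the null ("") prompt embedding.

```python
def cfg_noise(eps_cond, eps_uncond, gamma=7.0):
    # gamma = 0 -> unconditional, gamma = 1 -> conditional, gamma > 1 -> extrapolate past it
    return eps_uncond + gamma * (eps_cond - eps_uncond)
```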

Image-to-Image Translation

In this section, we will create images that are similar to our test image. Essentially, we will add a little bit of noise and then force the image back onto the image manifold unconditionally. The results below are shown at the noise levels [1, 3, 5, 7, 10, 20] with the text prompt “a high quality photo”.
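A sketch of this SDEdit-style procedure, reusing the `forward` helper sketched earlier; `iterative_denoise_cfg` is an assumed CFG variant of the iterative denoising loop.

```python
edits = []
for i_start in [1, 3, 5, 7, 10, 20]:
    t_start = strided_timesteps[i_start]
    x_t = forward(test_im, t_start, alphas_cumprod)                       # noise the test image a little
    edits.append(iterative_denoise_cfg(x_t, strided_timesteps, i_start))  # project back onto the manifold
```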

Then I input the following new images:

And after applying the image-to-image translation it resulted in

Editing Hand-Drawn and Web Images

Now here I input three more images to the model, one from the internet (middle) and two that I hand drew (left, right)

Which resulted in the following:

Corresponds to hand drawn frown on the left
Corresponds to Kirby image in the middle
Corresponds to hand drawn stars and triangles on the right

Note that the noise levels of [1, 3, 5, 7, 10, 20] and text prompt remained the same.

Inpainting

Using the methods described in the RePaint paper by Lugmayr et al., we can use a binary mask so that the diffusion process only generates new content in a selected part of the image. At each denoising step we apply:

$$x_t \leftarrow \mathbf{m}\, x_t + (1 - \mathbf{m})\,\text{forward}(x_{orig}, t)$$

Everything inside the mask (where $\mathbf{m} = 1$) keeps the diffusion model's denoised result, while everything outside the mask is forced to match the original image with the appropriate amount of noise for timestep $t$. This way, only the masked region gets new content.
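Inside the denoising loop, the masked update is a one-liner; `denoise_step_cfg` below is an assumed helper for a single CFG denoising step.

```python
for i in range(i_start, len(strided_timesteps) - 1):
    t = strided_timesteps[i]
    x_t = denoise_step_cfg(x_t, t)                        # propose new content everywhere
    x_t = m * x_t + (1 - m) * forward(x_orig, t)          # keep original (noised) pixels outside the mask
```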

Here are the original image, the mask, and the region to replace:

image
mask
to replace
Below are timesteps 990, 840, 690, 540, 390, 240, and 60.

Here is the final resulting image

Here are some examples with my own images

With the masks being right over my face and over my cat

Text-Conditioned Image-to-image Translation

Here we change the prompt from “a high quality photo” to “a rocket ship”. These are the results:

And then on my own images:

The biggest takeaway is that the bigger the mask, the better this seems to work.

So I did my face with a bigger mask and got the following result

Visual Anagrams

We can combine everything to create visual anagrams. In this part, we will create an image that looks like "an oil painting of an old man", but when flipped upside down will reveal "an oil painting of people around a campfire".
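The trick, following the visual anagrams approach used in the project, is to estimate the noise once with the first prompt, once on the vertically flipped image with the second prompt, flip that second estimate back, and average. A sketch, with `estimate_noise_cfg` as an assumed CFG noise-estimation helper:

```python
import torch

def anagram_noise(x_t, t, p_old_man, p_campfire):
    eps_1 = estimate_noise_cfg(x_t, t, p_old_man)                     # noise estimate for the upright image
    eps_2 = estimate_noise_cfg(torch.flip(x_t, dims=[-2]), t, p_campfire)
    eps_2 = torch.flip(eps_2, dims=[-2])                              # flip the estimate back
    return (eps_1 + eps_2) / 2                                        # average the two estimates
```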

Left is the original, right is the flipped (I’ve outlined the fire in pink to make it more clear)

“an oil painting of an old man”

“an oil painting of people around a campfire” [flipped]
flipped and annotated

Here is another attempt at the same visual anagram with the old man and campfire prompts. On the right, the campfire is at the bottom with lots of logs in it (the hair in the image of the man on the left) and smoke (the beard of the man on the left).

“an oil painting of an old man”
“an oil painting of people around a campfire” [flipped]

'a lithograph of waterfalls’
'a lithograph of skulls’ [flipped]
'a man wearing a hat’
'a rocket ship’

I’ve outlined the rocketship with the stream of gas behind it in the flipped version because it’s a little harder to see. The man with the hat has a beard.

Hybrid Images

Here we can create hybrid images using a diffusion model, inspired by the techniques we used in our earlier frequency project. To do this we first create a hybrid noise estimate from the noise estimates for two different text prompts, combining them by taking the low frequencies from one and the high frequencies from the other. Mathematically it looks like:

$$\epsilon_1 = \text{UNet}(x_t, t, p_1)$$
$$\epsilon_2 = \text{UNet}(x_t, t, p_2)$$
$$\epsilon_{\text{hybrid}} = f_{\text{lowpass}}(\epsilon_1) + f_{\text{highpass}}(\epsilon_2)$$

where p1p_1 and p2p_2 are the two text prompt embeddings. This is known as factorized diffusion.
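A sketch of the hybrid noise estimate, where the low-pass filter is a Gaussian blur and the high-pass is the residual; the kernel size and sigma here are assumptions, and `estimate_noise_cfg` is the same assumed helper as before.

```python
import torchvision.transforms.functional as TF

def hybrid_noise(x_t, t, p1, p2, kernel_size=33, sigma=2.0):
    eps_1 = estimate_noise_cfg(x_t, t, p1)
    eps_2 = estimate_noise_cfg(x_t, t, p2)
    low = TF.gaussian_blur(eps_1, kernel_size, sigma)           # low frequencies from prompt 1
    high = eps_2 - TF.gaussian_blur(eps_2, kernel_size, sigma)  # high frequencies from prompt 2
    return low + high
```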

Here are some examples:

Hybrid image of “a lithograph of a skull” and “a lithograph of waterfalls”

From further away you can see one big skull off to the right, potentially with smaller skulls on top of it.

Here’s another attempt at the skull and waterfall:

On the left you can see a waterfall passing through some trees and turning into a river, while on the right, from further away, you can make out the facial features of a skull.

Here are two other examples

This combines 'an oil painting of a snowy mountain village’ with 'a photo of the amalfi cost’

Up close it looks like the snowy mountain village with the mountains in the background, but from far away you can more clearly see the coast and the shoreline.

And here are two different examples of combining 'a man wearing a hat’ with ‘a rocket ship’

In both, from far away you can see a man with a hat taking up the whole frame, but up close it’s a rocket ship in the sky.

PART B:

Overview

In the second part of this project, we instead build a diffusion model from the ground up, training it on the MNIST dataset.

Training a Single-Step Denoising UNet

For this part of the project we start by building a single-step denoising UNet using the following architecture

Along with the following simple and composed operators

Using the UNet to Train a Denoiser

Given a noisy image $z$, we want to train a denoiser $D_\theta$ to map $z$ to a clean image $x$. We can use the L2 loss as follows:

$$L = \mathbb{E}_{z,x}\left\|D_\theta(z) - x\right\|^2$$

We need training data pairs $(z, x)$ to train the denoiser. During each training batch, we generate $z$ from $x$ as follows:

$$z = x + \sigma\epsilon, \quad \text{where } \epsilon \sim N(0, I)$$
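A sketch of how the $(z, x)$ pairs can be generated on the fly inside the training loop, assuming `x` is a batch of clean MNIST digits:

```python
import torch

def add_noise(x, sigma=0.5):
    eps = torch.randn_like(x)
    return x + sigma * eps          # z = x + sigma * eps
```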

Here we show the noising process over $\sigma = [0.0, 0.2, 0.4, 0.5, 0.6, 0.8, 1.0]$.

Training the Unconditional UNet

Now we can use this data to train the unconditional UNet to denoise. Below are the results after 1 and 5 epochs of training with $\sigma = 0.5$.

Results on digits from the test set after 1 epoch of training
Results on digits from the test set after 5 epochs of training

And this is the training loss curve of the training process.

Out-of-Distribution Testing

As noted earlier, we trained our model with $\sigma = 0.5$. Let’s see how the model performs out of distribution by denoising images noised with other values of $\sigma$.

As we can see, it denoises relatively well for other values of $\sigma$. Note that here we used $\sigma = [0.0, 0.2, 0.4, 0.5, 0.6, 0.8, 1.0]$.

Training a Diffusion Model

Adding Time Conditioning to UNet

To condition the UNet on the scalar $t$, the CS180 course staff suggested injecting it with the following architecture.

More specifically, those fully connected (FC) blocks are shown as follows.
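A rough sketch of what such an FC block and the conditioning injection might look like; the exact layer sizes and the places where $t$ is added follow the course diagram, so the specifics here are assumptions.

```python
import torch.nn as nn

class FCBlock(nn.Module):
    """Small MLP that maps the (normalized) timestep to a conditioning vector."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_ch, out_ch), nn.GELU(),
            nn.Linear(out_ch, out_ch),
        )

    def forward(self, t):
        return self.net(t)

# Inside the UNet forward pass (t normalized to [0, 1], shape (B, 1)):
#   t_emb = self.fc1(t).view(-1, C, 1, 1)   # make it broadcastable over a feature map
#   feat  = feat + t_emb                    # inject the timestep into an intermediate feature map
```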

Training the UNet

To train the UNet we take inspiration from Ho et al. (2020) and their paper Denoising Diffusion Probabilistic Models, following the pseudocode seen above. We repeat this for different images and values of $t$ until the model converges, as shown below.
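A sketch of a single training step in this style; the total number of timesteps, the normalization of $t$, and the `alphas_cumprod` indexing are assumptions based on the DDPM recipe.

```python
import torch
import torch.nn.functional as F

def train_step(unet, optimizer, x0, alphas_cumprod, T=300):
    t = torch.randint(1, T + 1, (x0.shape[0],), device=x0.device)   # random timestep per image
    a_bar = alphas_cumprod[t - 1].view(-1, 1, 1, 1)
    eps = torch.randn_like(x0)
    x_t = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * eps              # noise the clean digit
    eps_pred = unet(x_t, t.float() / T)                             # UNet conditioned on normalized t
    loss = F.mse_loss(eps_pred, eps)                                # regress the injected noise
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```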

Sampling from the UNet

To sample, we use the algorithm above, also described by Ho et al. The resulting samples after epochs 5 and 20 are shown after the sketch below.
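A sketch of that sampling loop; again, the schedule tensors and indexing conventions are assumptions.

```python
import torch

@torch.no_grad()
def sample(unet, betas, alphas, alphas_cumprod, T=300, shape=(16, 1, 28, 28), device="cuda"):
    x = torch.randn(shape, device=device)                            # start from pure noise
    for t in range(T, 0, -1):
        ts = torch.full((shape[0],), t, device=device, dtype=torch.float32)
        eps = unet(x, ts / T)
        a_t, a_bar_t = alphas[t - 1], alphas_cumprod[t - 1]
        x = (x - (1 - a_t) / (1 - a_bar_t).sqrt() * eps) / a_t.sqrt()
        if t > 1:
            x = x + betas[t - 1].sqrt() * torch.randn_like(x)        # add noise except at the last step
    return x
```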

Sampling from time conditioned UNet after 5 epochs
Sampling from time conditioned UNet after 20 epochs

Adding Class-Conditioning to UNet

To improve our results, we can add class conditioning so the model can focus on each digit. To do so, we add 2 more FCBlocks and condition on a one-hot encoded class vector rather than a scalar. We also drop the class conditioning 10% of the time so the model still learns to generate unconditionally.
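A sketch of building the one-hot class vector with the 10% unconditional dropout; the exact dropout mechanism is my assumption of what is meant here.

```python
import torch
import torch.nn.functional as F

def make_class_cond(labels, num_classes=10, p_uncond=0.1):
    c = F.one_hot(labels, num_classes).float()                       # (B, 10) one-hot class vector
    keep = (torch.rand(labels.shape[0], device=labels.device) >= p_uncond).float()
    return c * keep.unsqueeze(1)                                     # zero the vector ~10% of the time
```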

Sampling from the Class-Conditioned UNet

To sample, we again reference Ho et al.’s sampling algorithm. Here are the resulting samples after epochs 5 and 20. However, they look very similar, so for a starker contrast I’ve also included epoch 1.

Sampling from class conditioned UNet after 5 epochs
Sampling from class conditioned UNet after 20 epochs
Sampling from class conditioned UNet after 1 epoch

And lastly, here is the training loss over time.