Conditional Inpainting — Using the captions for image inpainting (Part 2)

In this series of posts I detail how I incorporated the image captions into my model in order to perform image inpainting. Link to Part 1 of Using the captions for image inpainting.

In this second part, I finish the implementation and show results of using the captions. Furthermore, I will expand on some elements I previously mentioned could help my performance (running the optimization for image reconstruction multiple times)!

This post relates to the class project for my Deep Learning class. For more information regarding this project, or for all other related posts, please follow this link.

Implementation (cont’d)

After Part 1, I have a model that can convert the words of a caption into vectors. However, as previously discussed, I still need to deal with the limited vocabulary and with embedding the full sentence.

Ideally I would preprocess all the captions into embeddings ahead of time; for now, however, I embed them each time they are required. Hopefully this is not too much of a bottleneck. If it is, I will reevaluate and maybe invest the time and memory to preprocess the captions.

For now it doesn’t seem to be an issue: loading the embedding model takes about 3 minutes, but once it is loaded it generates the vectors really quickly. For example, for a given mini-batch of size 128, the 128 x 300 matrix of encoded captions is generated in less than 0.02 seconds.
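As a rough illustration of how such a 128 x 300 matrix could be built, here is a minimal sketch. It assumes the embedding model can be treated as a plain dictionary from words to 300-dimensional vectors. The names `embed_captions`, `embed_batch` and `word_vectors` are hypothetical, and skipping out-of-vocabulary words is only one possible way of handling the limited vocabulary.

```python
import numpy as np

EMBED_DIM = 300  # caption vector length used in this post


def embed_captions(captions, word_vectors, dim=EMBED_DIM):
    """Average the word vectors of every word in every caption of one image.

    `word_vectors` is a hypothetical dict mapping a word to a 1-D array of
    length `dim`; it stands in for the pre-trained embedding model.
    Skipping out-of-vocabulary words is an assumption made for illustration.
    """
    vectors = [
        word_vectors[word]
        for caption in captions
        for word in caption.lower().split()
        if word in word_vectors
    ]
    if not vectors:  # no known word at all: fall back to a zero vector
        return np.zeros(dim, dtype=np.float32)
    return np.mean(vectors, axis=0).astype(np.float32)


def embed_batch(batch_captions, word_vectors, dim=EMBED_DIM):
    """Stack one averaged vector per image into a (batch_size, dim) matrix."""
    return np.stack([embed_captions(c, word_vectors, dim) for c in batch_captions])
```

Here `batch_captions` is a list with one entry per image, each entry being the list of caption strings for that image.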

Conditional GAN

The Conditional GAN is basically a traditional GAN whose input is extended with additional information (that’s the conditional part). In the original GAN we feed only noise to the generator, and the discriminator is only fed images.

A couple of options can be considered for feeding the captions to the model: feed them only to the generator, or feed them to both the generator and the discriminator.

For now, I am particularly interested in the first option, where we only feed the captions to the generator. Maybe later I will incorporate the captions into the discriminator, but first I am interested in seeing the performance of the generator with that additional information.

The way I add the captions is simply as a second input layer to the network, which is concatenated with the noise input that was originally used. The rest of the model is the same! To add the captions to the discriminator, we could add them in a later layer once the image (fake or real) has been encoded, i.e. once in the dense layer of the network.
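The concatenation itself can be sketched in a few lines of NumPy. This is only an illustration of the shapes involved, not the code from my repo: the function name `generator_input` is hypothetical, and drawing the noise uniformly in [-1, 1] is an assumption.

```python
import numpy as np

NOISE_DIM = 100    # noise units fed to the generator
CAPTION_DIM = 300  # length of the averaged caption embedding


def generator_input(caption_vectors, rng, noise_dim=NOISE_DIM):
    """Concatenate fresh noise with the caption vectors along the feature axis.

    The uniform noise distribution is an assumption made for this sketch.
    """
    batch_size = caption_vectors.shape[0]
    noise = rng.uniform(-1.0, 1.0, size=(batch_size, noise_dim)).astype(np.float32)
    return np.concatenate([noise, caption_vectors.astype(np.float32)], axis=1)
```

For a mini-batch of 128 captions this gives a 128 x 400 input, where the first 100 columns are noise and the last 300 are the caption embedding.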

For a detailed implementation of my model with the captions, you can refer to my repo (link to interesting part of conditional GAN). All the other available code can now be used with captions by simply adding the flag ‘-c’ when launching.

Results of Conditional GAN pre-training

The following results were obtained only by training the new conditional GAN with the captions. As a reminder, the captions of each image are embedded into a single vector of length 300, which is the average of the vectors of all the words in all of the different captions for that image. The hyper-parameters are the same as in the previous training, except that I train the discriminator for 15 steps and then the generator for 10. The model was trained for 95 epochs.
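One way to read the 15/10 schedule is as a repeating cycle of updates. The sketch below only produces the order in which the two networks would be updated; the function name `update_schedule` is hypothetical, and whether the cycle restarts per mini-batch or runs continuously across the epoch is an assumption.

```python
from itertools import cycle, islice


def update_schedule(d_steps=15, g_steps=10):
    """Yield which network to update next, forever.

    Each cycle is d_steps discriminator updates followed by g_steps
    generator updates. A continuously repeating cycle is an assumption.
    """
    return cycle(["discriminator"] * d_steps + ["generator"] * g_steps)


# Example: look at the first full cycle of 25 updates.
first_cycle = list(islice(update_schedule(), 25))
```

A training loop would then pull one entry per mini-batch from this schedule and call the matching update function.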


You might be wondering why I didn’t include any of the captions on which the images were conditioned. It’s simple: they relate in absolutely no way to the images that were generated. Unfortunately, it seems that using the captions did not help, in the sense that they didn’t give instructions to the generator. As an example, the top left image is supposed to be ‘a man holding a bunch of bananas’, but it sure looks more like a horse to me…

We can notice, however, the great detail and structure that the generator has created in these images. I think this can be explained simply by considering that the generator now receives an input vector of size 400 (100 units of pure noise and 300 units of caption embedding). I believe the additional 300 units of information it receives are effectively noise, and therefore the generator is getting a larger array of noise. My assumption is that our method of embedding the captions might be encoding them into noise.

This could be caused by the method used to create the caption embedding. As previously mentioned, this is a whole research field in itself, and I do not intend to explore it further.

Improvements to model

Before considering the results of the reconstruction, recall that at the end of a previous post I mentioned that using a mini-batch of noise for the optimization part of the image reconstruction might be useful. As a reminder, the approach I use to reconstruct the image is to treat the pre-trained GAN as optimal, in the sense that it can generate images that lie on the true distribution manifold.

If my GAN is good enough, then using the contextual and perceptual losses defined in this post, I can find by gradient descent the noise that generates the best possible image. I suggested that running different trajectories of minimization might allow escaping from local minima. The way I implemented this is by doing the optimization multiple times on the given image, each time with a different initial noise input.

Only the 100 units of noise are changed; the 300 units of caption are kept untouched during the optimization.
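To make the idea concrete, here is a toy sketch of the multi-start optimization. It is not my actual model: `toy_generator` is a hypothetical single tanh layer standing in for the trained generator, the loss is only a masked squared error playing the role of the contextual loss, and the perceptual loss is left out because it needs the discriminator. The step size and number of steps are arbitrary.

```python
import numpy as np


def toy_generator(z, caption, w_z, w_c):
    """Hypothetical stand-in for the trained generator: one tanh layer."""
    return np.tanh(w_z @ z + w_c @ caption)


def reconstruct(corrupted, mask, caption, w_z, w_c, rng,
                n_restarts=5, n_steps=200, lr=0.05):
    """Run the noise optimization from several random starting points.

    Only z (the noise) is updated; the caption vector stays fixed.
    `mask` is 1 on known pixels and 0 on the missing region.
    Returns (loss, image) pairs sorted from lowest to highest loss.
    """
    results = []
    for _ in range(n_restarts):
        z = rng.uniform(-1.0, 1.0, size=w_z.shape[1])  # fresh initial noise
        for _ in range(n_steps):
            image = toy_generator(z, caption, w_z, w_c)
            residual = mask * (image - corrupted)
            # Gradient of sum(residual**2) with respect to z, through the tanh.
            grad = w_z.T @ (2.0 * residual * (1.0 - image ** 2))
            z -= lr * grad
        image = toy_generator(z, caption, w_z, w_c)
        loss = float(np.sum((mask * (image - corrupted)) ** 2))
        results.append((loss, image))
    return sorted(results, key=lambda pair: pair[0])
```

Each restart produces one candidate reconstruction, so five restarts give five images per corrupted input, and the caption part of the input is identical across all of them.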

Another possible improvement was to use Poisson blending, as recommended by the authors of the reference paper for the approach I am using. However, they were using it because their reconstruction doesn’t blend well with the contour of the image. This however isn’t the case for

Image reconstruction using Conditional GAN

Below are some results of the inpainting reconstruction. As previously mentioned, I now have multiple results for each image, in the hope of getting better images. On the left you have the corrupted images, followed by 5 reconstructions and the real image on the right. These results are for the same model as shown above (trained for 95 epochs with 15 discriminator and 10 generator training steps).


Even though the results do not look terribly wrong, I believe they are worse than the reconstructions shown in this post, where the same methodology was used but without the captions. I do believe, however, that having multiple results for one image does help; one could incorporate this into the previous approach.


Following these results, my initial thought that the 300-unit caption vector acts as noise is somewhat confirmed. The intuition behind this conclusion follows from the method used to optimize.

When using the captions, I have a 400-unit input to my generator, of which only 100 units are tweaked to find the best matching image according to the method used. This means that only 25% of the input can be tweaked, and the other 75% is kept intact (because it represents the caption). I am getting worse results than without the captions, where I was using purely 100 units of noise and 100% of them were tweakable. I therefore think the generator treats the additional 300 units of input as noise, and that if these were tweakable too, we could get decent images.

I am fairly confident in concluding that, given my implementation and the method used, using the captions does not help the model reconstruct the images.
