Conditional Inpainting: Generating a colored interior based on the image contour
In this post I detail my first objective for the IFT6266 project: being able to generate the colors of the missing center based on the input (the contour of the image).
Expanding on what has been covered so far in class on generative models (not much), I understand that the generative part is achieved simply by producing an output and minimizing its difference with the target output.
Just like in a regular neural net, that difference (the loss) is backpropagated to the parameters, and hopefully the output gets closer to the target as training goes on.
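To make the "produce an output, minimize the difference" idea concrete, here is a minimal sketch on a toy problem. The mean squared error, the single weight `w`, and the input/target values are all assumptions made for illustration; the actual model and loss are not specified at this point.

```python
def mse(pred, target):
    """Mean squared error between two equally sized lists of values."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def train_toy(inputs, targets, lr=0.5, steps=50):
    """Fit pred = w * x by gradient descent on the MSE.

    Returns the learned weight and the loss measured before each update.
    """
    w = 0.0
    history = []
    for _ in range(steps):
        preds = [w * x for x in inputs]
        history.append(mse(preds, targets))
        # d(MSE)/dw = (2 / N) * sum((w * x - t) * x)
        grad = 2.0 * sum((p - t) * x for p, t, x in zip(preds, targets, inputs)) / len(inputs)
        w -= lr * grad
    return w, history


inputs = [0.1, 0.4, 0.7, 1.0]    # hypothetical input features
targets = [0.2, 0.8, 1.4, 2.0]   # hypothetical targets (here exactly 2 * input)
w, history = train_toy(inputs, targets)
print("learned w:", w)
print("first loss:", history[0], "last loss:", history[-1])
```

The loss shrinks at every step and `w` approaches 2, which is the same mechanism a convolutional network uses, only with many more parameters.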
As mentioned in my plan for the project, I will not include any of the captions at this stage.
Dealing with the black square
My initial thought for dealing with the black 32x32 square in the middle of the input image is to leave it as is and proceed with regular convolutions. It will be interesting to see how well this performs and whether it affects the training time.
In theory, the model should be able to learn that the blacked-out portion of the image carries no useful information.
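For reference, here is a small sketch of how the input and the target relate under this setup: the input is the full 64x64 image with the central 32x32 square set to zero, and the target is that central square. The helper names are hypothetical, and nested lists are used instead of arrays only to keep the example dependency-free.

```python
def black_out_center(image, size=32):
    """Return a copy of a (channels, height, width) nested-list image
    with the central size x size square set to 0."""
    height, width = len(image[0]), len(image[0][0])
    top, left = (height - size) // 2, (width - size) // 2
    masked = [[row[:] for row in channel] for channel in image]
    for channel in masked:
        for y in range(top, top + size):
            for x in range(left, left + size):
                channel[y][x] = 0
    return masked


def crop_center(image, size=32):
    """Return the central size x size square of each channel (the target)."""
    height, width = len(image[0]), len(image[0][0])
    top, left = (height - size) // 2, (width - size) // 2
    return [[row[left:left + size] for row in channel[top:top + size]]
            for channel in image]


# Hypothetical all-white 3 x 64 x 64 image.
image = [[[255] * 64 for _ in range(64)] for _ in range(3)]
contour = black_out_center(image)   # network input
target = crop_center(image)         # what the network should produce

print(contour[0][15][15], contour[0][16][16])  # border pixel vs. blacked-out pixel
print(len(target), len(target[0]), len(target[0][0]))
```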
As a first model, with the goal of just outputting something that makes the slightest sense, I considered a CNN encoder followed by a CNN decoder. The structure of the network is as follows, with shapes given as (channels, height, width):
- input shape = (3, 64, 64)
- conv1 output shape = (32, 60, 60)
- pool1 output shape = (32, 30, 30)
- conv2 output shape = (32, 24, 24)
- pool2 output shape = (32, 6, 6)
- conv3 output shape = (32, 2, 2)
- pool3 output shape = (32, 1, 1)
- dense1 output shape = (256)
- dense2 output shape = (256)
- reshape output shape = (64, 4, 4)
- tconv1 output shape = (32, 7, 7)
- upsamp1 output shape = (32, 14, 14)
- tconv2 output shape = (3, 16, 16)
- upsamp2 output shape = (3, 32, 32)
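The spatial sizes in this list can be checked with a few lines of arithmetic. The kernel sizes and pooling factors below are not stated anywhere in the post; they are inferred under the assumption of stride-1 convolutions without padding and non-overlapping pooling, and other settings could produce the same sizes. The sketch only follows height/width, and it does not try to reconcile the reshape with the 256-unit dense layer (64 x 4 x 4 holds 1024 values).

```python
def conv_valid(size, kernel):
    """Output size of a stride-1 convolution without padding."""
    return size - kernel + 1


def pool(size, factor):
    """Output size of non-overlapping pooling."""
    return size // factor


def tconv(size, kernel):
    """Output size of a stride-1 transposed convolution without padding."""
    return size + kernel - 1


def upsample(size, factor):
    """Output size of upsampling by an integer factor."""
    return size * factor


def trace(start, layers):
    """Apply (name, op, arg) layers in order and record each output size."""
    sizes, size = [], start
    for name, op, arg in layers:
        size = op(size, arg)
        sizes.append((name, size))
    return sizes


# Kernel sizes / factors are assumed values chosen to match the listed shapes.
encoder = trace(64, [("conv1", conv_valid, 5), ("pool1", pool, 2),
                     ("conv2", conv_valid, 7), ("pool2", pool, 4),
                     ("conv3", conv_valid, 5), ("pool3", pool, 2)])
decoder = trace(4, [("tconv1", tconv, 4), ("upsamp1", upsample, 2),
                    ("tconv2", tconv, 3), ("upsamp2", upsample, 2)])

for name, size in encoder + decoder:
    print(f"{name}: {size} x {size}")
```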
To summarize the above network, it encodes the contour into a 32 x 1 x 1 representation, and then applies two fully connected hidden layers of 256 units each. The result is then reshaped to 64 x 4 x 4 and decoded into a 3 x 32 x 32 output that matches our desired output shape.
A sigmoid is applied to the output of the last transposed convolution layer. This output, which lies between 0 and 1, is then multiplied by 255 to be in the appropriate range for generating images.
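As a small illustration of that last step, the sketch below maps a raw (pre-sigmoid) value to a pixel intensity. Rounding to the nearest integer and the function names are my assumptions here; the essential part is only the sigmoid followed by the multiplication by 255.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function, with output in [0, 1]."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def to_pixel(pre_activation):
    """Squash a raw network output with a sigmoid and rescale it to 0..255."""
    return int(round(255 * sigmoid(pre_activation)))


for value in (-5.0, 0.0, 5.0):
    print(value, "->", to_pixel(value))
```

Whatever the raw value is, the result always stays within the valid pixel range, so no clipping is needed afterwards.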
As a way to evaluate performance, images from the validation set are generated by combining the true contour of the image with the output of the network. Below is the evolution of the predicted images for a random set over the training epochs, with the corresponding true images.
N.B. (update, March 29, 2017): Due to technical issues, I lost the original images and wasn’t able to regenerate or find them. I was, however, able to recover one of the GIFs.
We can see that the model does learn some different shades of color and is able to fill the center of the image with the right intensity and colors.
Some possible improvements/modifications: First, increase the number of feature maps as the encoding progresses, to end up with more inputs to the fully connected layers and the decoder. Second, the last layer shouldn’t be an upsampling operation (thanks to Francis for pointing this out on his blog), as it reduces the output resolution to blocks of the same color!
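The block effect from ending on an upsampling layer is easy to see on a tiny example. The sketch below assumes nearest-neighbor upsampling by a factor of 2 (each value is simply repeated), which matches the blocky output described above; `upsample_nearest` is a hypothetical helper, not the layer used in the network.

```python
def upsample_nearest(grid, factor=2):
    """Nearest-neighbor upsampling of a 2D grid: every value is repeated
    factor times horizontally and factor times vertically."""
    out = []
    for row in grid:
        wide = [value for value in row for _ in range(factor)]
        out.extend(wide[:] for _ in range(factor))
    return out


small = [[1, 2],
         [3, 4]]
big = upsample_nearest(small)
for row in big:
    print(row)
```

The 4 x 4 result still contains only four distinct values, each filling a 2 x 2 block. In the same way, a 16 x 16 map upsampled to 32 x 32 carries no more detail than the 16 x 16 map did, so a layer with learned weights as the final step would be needed to produce per-pixel variation.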