After a few months with no side projects on my plate, I was eager to create something new. I’ve made it a routine to try to create AI that competes with my nephew in whatever game he’s playing (just like in my previous posts). So, I asked my nephew what game he’s playing at the moment, hoping to write something that could play it too (and maybe even better than he can). The problem is that my nephew, Yali, is playing chess, and he is freakishly good at it.
Fortunately, Yali’s younger sister Gili is into coloring books. While playing with her, I noticed that she kept choosing different colors from the ones that I would choose. This got me thinking. After verifying that Gili isn’t color blind, my thoughts were that she was probably choosing colors I wouldn’t because I have seen more color combinations and patterns than she has (elderly wisdom) and I was using this knowledge while coloring with her.
The main goal of this post is to train a few models to color the same images as Gili, but each network will “witness” different color combinations and will probably produce different results, each coming from its own point of view. For example, if the training set contains only pictures taken at night, a model will never color the sky blue. Let’s start by looking at transfer learning examples.
Transfer Learning Examples
“Transfer learning is a research problem in machine learning (ML) that focuses on storing knowledge gained while solving one problem and applying it to a different but related problem. For example, the knowledge gained while learning to recognize cars could apply when trying to recognize trucks.”
There are plenty of other transfer learning examples. However, in my experiment, I want to train a model only on images from the movie Frozen and use this model to color other cartoons, including Iron Man and Moana. My theory is that Iron Man will be blue and Moana will be colored with a white color scheme and bright hair color.
Dataset
In my case, the training dataset is made of images before and after they were colored. While I would really like to color thousands of coloring books and scan them, I don’t have the time to do that. So, my solution was to go the opposite way: start from a color picture and apply filters to make it look like it was taken from a coloring book.
Searching Google for Frozen images returned some good results. However, most of the good pictures are basically the same image, and not all of the results are really related to Frozen.
So, I decided to use the movie itself. If you watch the movie (and you should), you can see that there are scenes where the picture barely changes. A long dialogue, for example, produces many near-identical frames, which might cause the model to overfit. To reduce this, I sampled one frame every 0.5 seconds.
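A quick way to do this sampling is with a short OpenCV script. The sketch below is just an illustration (the file path and output folder are placeholders, not my actual setup):

```python
# Minimal frame-sampling sketch using OpenCV.
# "frozen.mp4" and "frames/frozen" are placeholder paths.
import os
import cv2

def extract_frames(video_path, out_dir, every_seconds=0.5):
    os.makedirs(out_dir, exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS)
    step = max(1, int(round(fps * every_seconds)))  # frames between samples
    index = saved = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % step == 0:
            cv2.imwrite(os.path.join(out_dir, f"frame_{saved:05d}.png"), frame)
            saved += 1
        index += 1
    cap.release()
    return saved

# extract_frames("frozen.mp4", "frames/frozen")
```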
Usually in image detection, it’s common to rotate and mirror the images to gain more data. When detecting a face in an image, we want to detect it even when the image is rotated. When coloring images, however, a horizontal line might have a different meaning than a vertical line. For example, we might color the sky blue above a horizontal line, but something to the right of a vertical line doesn’t have to be blue sky. So I skipped this kind of augmentation here.
After obtaining the images, a filter needs to be applied to all of them. Making an image look like a coloring book image is quite simple. Luckily, you can write JavaScript scripts directly in Photoshop.
From a real image to coloring book look.
Running this script on all of the images led to this result. It isn’t perfect, but it’s a good point to start from.
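For reference, roughly the same effect can be approximated in Python with OpenCV. This is only a sketch of the idea, not the Photoshop script I actually ran:

```python
# Approximate "coloring book" filter: keep the line work, drop the colors.
import cv2

def to_coloring_book(in_path, out_path):
    img = cv2.imread(in_path)
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    gray = cv2.medianBlur(gray, 5)  # suppress texture and noise
    # Dark outlines on a white background
    lines = cv2.adaptiveThreshold(gray, 255, cv2.ADAPTIVE_THRESH_MEAN_C,
                                  cv2.THRESH_BINARY, 9, 2)
    cv2.imwrite(out_path, lines)

# to_coloring_book("frames/frozen/frame_00000.png", "lineart/frame_00000.png")
```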
My first intuition was to create a three-layer (red, green, blue) output from the black and white input. However, a better solution was to work in the LAB color space.
In RGB, all three layers contribute to the black and white appearance; in LAB, only the first layer does. The L stands for lightness, and the other two layers (A and B) are responsible for the green-red and blue-yellow color spectrums.
LAB separated into channels (A and B are merged with L)
In my case, the input is already black and white so I only need to generate two layers (A and B) instead of RGB’s three layers. Predicting two layers is way easier. It won’t add new shadows or depth to my image (since those are represented by the L layer that doesn’t change), but I’m fine with that.
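In code, the preprocessing (and the merge back to a color image) looks roughly like this. It’s a sketch assuming scikit-image; the scaling constants are one common normalization, not necessarily the exact values I used:

```python
# L channel -> network input, A/B channels -> network target (both ~[-1, 1]).
import numpy as np
from skimage import color, io, transform

def make_training_pair(path, size=128):
    rgb = transform.resize(io.imread(path), (size, size))  # floats in [0, 1]
    lab = color.rgb2lab(rgb)
    L = lab[:, :, 0:1] / 50.0 - 1.0   # L in [0, 100] -> [-1, 1]
    AB = lab[:, :, 1:] / 128.0        # A, B roughly [-128, 127] -> [-1, 1]
    return L, AB

def merge_prediction(L, AB):
    # Stack the unchanged L channel with the predicted A/B channels
    lab = np.concatenate([(L + 1.0) * 50.0, AB * 128.0], axis=-1)
    return color.lab2rgb(lab)
```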
Encoder-decoder
Starting from an L channel input (a 1x128x128 matrix, in my case), each layer in the encoder extracts multiple features and shrinks the image. Each feature can detect something different: faces, trees, hair, and more. The decoder then takes each feature and produces the colors that bring the result as close to the target as possible.
In my case, if the encoder created a feature that detects trees, the decoder will apply color for those trees to look as close to the target as possible. In Frozen, all the trees are white because of the snow, so it will color the trees as white. For Moana, trees will be green.
The output layer is a 2x128x128 matrix that represents the A and B channels. Merging the L channel (the input) with the predicted A and B channels gives us the final image.
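A minimal Keras sketch of an encoder-decoder of this kind is shown below. The layer counts and filter sizes are illustrative, not my exact architecture; the tanh output matches A/B channels scaled to [-1, 1]:

```python
# Encoder-decoder colorizer sketch: L channel in, A/B channels out.
from tensorflow.keras import layers, models

def build_colorizer(size=128):
    inp = layers.Input(shape=(size, size, 1))  # L channel
    x = inp
    # Encoder: extract features while shrinking the image (128 -> 16)
    for filters in (64, 128, 256):
        x = layers.Conv2D(filters, 3, strides=2, padding="same", activation="relu")(x)
    # Decoder: grow back to full resolution and predict colors (16 -> 128)
    for filters in (128, 64, 32):
        x = layers.UpSampling2D()(x)
        x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    out = layers.Conv2D(2, 3, padding="same", activation="tanh")(x)  # A and B
    return models.Model(inp, out)

model = build_colorizer()
model.compile(optimizer="adam", loss="mse")
# model.fit(L_batch, AB_batch, ...) with pairs from make_training_pair
```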
The results are not exactly what I expected, so I decided to try something else.
Generative adversarial network
A typical network predicts the item’s label based on its features. A generative adversarial network (GAN) does the opposite. It generates an item out of a label.
A GAN has two main parts:
- A generator, which creates new items
- A discriminator, which verifies that the items are, in fact, related to the label
It helps to think about this like police fighting against money counterfeiting. Stay with me here. The police are trying to get better at recognizing fake money (discriminator) while the counterfeiters are trying to get better at fooling the police into thinking their money is real (generator). Each party is trying to be better at what it does and, after multiple iterations, we will have good counterfeiters (and policemen).
The discriminator is a network trained to predict whether an image is real or not, just like the police in the previous example. It starts by identifying real money out of a pile of something that looks like money drawn by a 5-year-old, and it does a great job (as we all probably would).
Unlike the discriminator network, where the input is an image and the output is whether the image is real or fake, the generator network starts from some noise (a random array, for example) and transforms this noise into a fake item. The noise itself isn’t generated any differently over time; the only thing that improves is the transformation method.
Both sides have different ways to improve.
The money counterfeiters have an “inside man” in the police, who tells them how good (or bad) the last batch of fake money was. They then change their transformation method based on this intel. Once in a while, the police get their hands on some fake money and they use it to improve the discriminator network.
After a few rounds of this cycle, both of the networks get pretty good at what they do. We can use the improved generative network to create some good fakes. Or, at least we hope so.
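In code, one round of this cycle looks roughly like the following. It’s a sketch that assumes a generator, a discriminator, and a combined model (the generator feeding a frozen discriminator) have already been built; all of the names here are placeholders:

```python
# One alternating training round for a generic GAN.
import numpy as np

def train_round(generator, discriminator, combined, real_images, noise_dim=100):
    batch = real_images.shape[0]
    noise = np.random.normal(size=(batch, noise_dim))
    fakes = generator.predict(noise, verbose=0)

    # "Police" turn: the discriminator learns from real money and seized fakes
    d_loss_real = discriminator.train_on_batch(real_images, np.ones((batch, 1)))
    d_loss_fake = discriminator.train_on_batch(fakes, np.zeros((batch, 1)))

    # "Counterfeiter" turn: the generator updates its transformation so that
    # the (frozen) discriminator is fooled into labeling its output as real
    g_loss = combined.train_on_batch(noise, np.ones((batch, 1)))
    return d_loss_real, d_loss_fake, g_loss
```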
Typically, a GAN is used to generate new things (new faces, for example) by feeding random values to the generator. What I was aiming for was to use the GAN mechanism to end up with a good coloring model.
Although my parents taught me that there’s no such thing as bad coloring, the discriminator will be the judge of how good the coloring is (whether it stays inside the lines, for example). I used the same model from the encoder-decoder example as the generator and created a new discriminator model. The results are still far from what I was hoping for.
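For concreteness, here is roughly how the pieces could be wired together for the coloring task, reusing the build_colorizer sketch from the encoder-decoder section as the generator. This is an illustration of the idea, not my exact setup:

```python
# Coloring GAN sketch: the encoder-decoder is the generator, and a small CNN
# discriminator judges stacked (L, A, B) images as real or fake colorings.
from tensorflow.keras import layers, models, optimizers

def build_discriminator(size=128):
    inp = layers.Input(shape=(size, size, 3))  # L + A + B stacked
    x = inp
    for filters in (64, 128, 256):
        x = layers.Conv2D(filters, 3, strides=2, padding="same")(x)
        x = layers.LeakyReLU(0.2)(x)
    x = layers.Flatten()(x)
    out = layers.Dense(1, activation="sigmoid")(x)  # real vs. fake coloring
    return models.Model(inp, out)

generator = build_colorizer()  # from the encoder-decoder sketch above
discriminator = build_discriminator()
discriminator.compile(optimizer=optimizers.Adam(2e-4), loss="binary_crossentropy")

# Combined model: L channel in, the discriminator's verdict on the coloring out
discriminator.trainable = False
L_in = layers.Input(shape=(128, 128, 1))
ab = generator(L_in)
verdict = discriminator(layers.Concatenate()([L_in, ab]))
combined = models.Model(L_in, verdict)
combined.compile(optimizer=optimizers.Adam(2e-4), loss="binary_crossentropy")
```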
Let it paint
The results were generated by the encoder-decoder model trained on images from both Frozen movies. Like a little kid, the model gets better at coloring images over time. It starts with very little motor control, picking a few colors and throwing them onto the page.
Equivalent to a 3-year-old.
Like a new father, I watched how the model improved as much as I could and sent updates to friends (even if they didn’t ask for it). After a few more rounds, my little kid started to detect faces.
The model can detect faces now.
Blue Iron Man with a green background.
Her face is white but the background is quite cool.
More transformation
With the results from a network trained on Frozen in hand, I wanted to try another transfer. This time, the model will learn from the best artworks of all time.
I don’t have coloring-book versions of those famous artworks (how cool would that be?), so I’m going to train on pairs going from a black and white version of each artwork to its color version. Hopefully, the model will then color the coloring book images in the same style.
Scrolling through the images, I noticed a big problem. Many of them have a black background. Since the model only predicts the A and B layers of LAB, it can’t color anything black (black is represented by the L layer, which is the input and doesn’t change). Orange is a very dominant color in the dataset, so it’s possible the model will use orange instead of black.
Just as I thought, the model colored the image mostly orange. However, we can see that it did detect Moana.
Let’s try scaling up the predicted A and B values by multiplying them, which should give us stronger colors.
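A sketch of that trick, using the same normalization as in the LAB preprocessing sketch (the multiplier here is arbitrary):

```python
# Exaggerate the predicted A/B channels before merging back with L.
import numpy as np
from skimage import color

def boost_colors(L, predicted_AB, factor=1.8):
    AB = np.clip(predicted_AB * factor, -1.0, 1.0)
    lab = np.concatenate([(L + 1.0) * 50.0, AB * 128.0], axis=-1)
    return color.lab2rgb(lab)
```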
It seems like the model actually figured out which parts of the image are interesting to color, but the color differences are too small.
Pivot
The model isn’t good enough to color a coloring book of animated characters where the image doesn’t have any depth. So, let’s try to color images with more depth.
Menashe Kadishman
An image of me
Gustav Klimt
Moving forward
My results are fine, but there are a few changes I’ll make moving forward to make them even better. First, converting 3D images to a coloring book format is a good start, but it’s definitely not perfect. In the next experiment, I’ll use older movies with less depth and see how that impacts the results. Second, LAB works well when predicting only two layers, but if I want the colors to become darker or lighter, this isn’t the way to go; I’ll have to adjust for that as well.