Real-time Artwork Generation using Deep Learning
Adaptive Instance Normalisation (AdaIN) for style transfer between any arbitrary content-style image pair.
In this post we will look at the paper “Arbitrary Style Transfer in Real-time with Adaptive Instance Normalization” (AdaIN) by Huang et al. We focus on this paper because it had some key advantages over the other state-of-the-art methods at the time of release.
Most important of all, this method, once trained, can transfer style between any arbitrary content-style image pair, even ones not seen during training. While the method proposed by Gatys et al. can also transfer style between any content-style image pair, it is drastically slower than this method because it iteratively optimises the stylised image during inference. The AdaIN method is also flexible: it allows control over the strength of the transferred style in the stylised image, and it supports extensions such as style interpolation and spatial control.
Going forward, we will first familiarise ourselves with instance normalisation and adaptive instance normalisation, and then dive into how the AdaIN paper works. Finally, we will look at some outputs and at the code to implement Adaptive Instance Normalisation and to train the style transfer network.
Instance Normalisation
Batch Normalisation normalises each feature map using the mean and variance computed across the entire batch. Instance normalisation (a.k.a. Contrast Normalisation), on the other hand, normalises each channel of each sample separately:

IN(x) = γ · (x − μ(x)) / σ(x) + β (Eq. 1)

with μ(x) and σ(x) computed over the spatial dimensions of each channel of each sample.
From Eq. 1, we can clearly see that each channel of each sample is normalised separately. Another difference from batch norm is that instance norm is also applied during inference.
Instance norm normalises the contrast of the input image, making the stylised image independent of the input image’s contrast and thus improving the output quality [2]. The authors of [1] also state that instance normalisation acts as a form of style normalisation by matching the feature statistics (mean and variance). Here, the style to which the images are normalised is defined by the learnable affine parameters γ and β [1].
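As a sanity check, the per-channel, per-sample normalisation can be sketched in a few lines of PyTorch (a minimal illustration without the affine parameters γ and β; the function name is my own):

```python
import torch

def instance_norm(x, eps=1e-5):
    """Normalise each channel of each sample over its spatial dimensions."""
    # x: (N, C, H, W); statistics are per sample, per channel
    mean = x.mean(dim=(2, 3), keepdim=True)
    var = x.var(dim=(2, 3), keepdim=True, unbiased=False)
    return (x - mean) / torch.sqrt(var + eps)

x = torch.randn(2, 3, 8, 8)
y = instance_norm(x)
```

Each (sample, channel) slice of `y` now has zero mean and unit variance, independent of the other samples in the batch.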
To learn more about instance norm and other normalisation techniques, check out Aakash Bindal’s post here.
Conditional Instance Normalisation
Conditional Instance Normalisation was introduced in [4], where the authors proposed learning a different γ and β for each style image.
While this method works well, it requires 2×C×S extra parameters (C: # channels, S: # styles), increasing the size of the network. Since instance norm performs a form of style normalisation [1], having a separate set of parameters for each style allows us to normalise images to each of these styles.
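The idea can be sketched as a small PyTorch module (an illustrative sketch of the approach in [4]; class and attribute names are my own, not from the paper’s implementation):

```python
import torch
import torch.nn as nn

class ConditionalInstanceNorm2d(nn.Module):
    """Instance norm with a separate (gamma, beta) pair per style:
    2 x C x S extra parameters in total."""

    def __init__(self, num_channels, num_styles):
        super().__init__()
        self.norm = nn.InstanceNorm2d(num_channels, affine=False)
        self.gamma = nn.Parameter(torch.ones(num_styles, num_channels))
        self.beta = nn.Parameter(torch.zeros(num_styles, num_channels))

    def forward(self, x, style_idx):
        # normalise, then apply the affine parameters of the chosen style
        h = self.norm(x)
        g = self.gamma[style_idx].view(1, -1, 1, 1)
        b = self.beta[style_idx].view(1, -1, 1, 1)
        return g * h + b
```

Adding a new style means adding one more row of γ and β, so the network grows with the number of styles it supports.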
Adaptive Instance Normalisation
The authors extend the idea that the affine parameters γ and β define the normalisation style in instance norm to propose Adaptive Instance Normalisation. In their proposal, instance normalisation is modified to adapt to different content-style image pairs.
AdaIN takes content features x and style features y as input and simply matches the statistics (mean and variance) of the content features to those of the style features y. There are no learnable parameters here; the transformation is obtained as

AdaIN(x, y) = σ(y) · (x − μ(x)) / σ(x) + μ(y)

where σ(⋅) is the standard deviation, σ²(⋅) the variance and μ(⋅) the mean, all computed along the spatial dimensions as in instance norm.
AdaIN Style Transfer Network
Architecture
The AdaIN style transfer network follows an encoder-decoder architecture (Fig. 2). The encoder f(⋅) consists of the first few layers (up to relu4_1) of a pre-trained VGG19 network. The encoder is fixed and not trained.
Adaptive Instance Normalisation is performed on the outputs of the encoder as

t = AdaIN(f(c), f(s))

where f(c) and f(s) are the feature maps produced by the encoder for the content and style images respectively. At inference, a mixing coefficient α∈[0,1] controls the strength of the style in the stylised image via t = α·AdaIN(f(c), f(s)) + (1−α)·f(c); α is set to 1 during training.
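The content-style trade-off can be sketched as follows (a minimal illustration; both helper names are my own):

```python
import torch

def adain(x, y, eps=1e-5):
    # match the channel-wise mean/std of content features x to style features y
    mu_x = x.mean(dim=(2, 3), keepdim=True)
    std_x = (x.var(dim=(2, 3), keepdim=True, unbiased=False) + eps).sqrt()
    mu_y = y.mean(dim=(2, 3), keepdim=True)
    std_y = (y.var(dim=(2, 3), keepdim=True, unbiased=False) + eps).sqrt()
    return std_y * (x - mu_x) / std_x + mu_y

def stylised_features(fc, fs, alpha=1.0):
    # alpha = 0 keeps the content features, alpha = 1 is fully stylised
    return alpha * adain(fc, fs) + (1 - alpha) * fc
```

Because the interpolation happens in feature space, α can be changed at runtime without retraining anything.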
The decoder g(⋅) mirrors the encoder, with the pooling layers replaced by 2× nearest-neighbour upsampling. The decoder is initialised with random weights, which are learned during training. The stylised image is obtained by passing the transformed feature map t through the decoder.
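A decoder in this spirit might look like the following minimal PyTorch sketch (the layer counts and channel widths here are illustrative, mirroring VGG19 up to relu4_1, and are not copied from the paper’s exact architecture):

```python
import torch
import torch.nn as nn

# Mirror of the encoder: convolutions shrink the channel count back down,
# and each pooling step is replaced by 2x nearest-neighbour upsampling.
decoder = nn.Sequential(
    nn.Conv2d(512, 256, 3, padding=1), nn.ReLU(),
    nn.Upsample(scale_factor=2, mode="nearest"),
    nn.Conv2d(256, 256, 3, padding=1), nn.ReLU(),
    nn.Conv2d(256, 128, 3, padding=1), nn.ReLU(),
    nn.Upsample(scale_factor=2, mode="nearest"),
    nn.Conv2d(128, 64, 3, padding=1), nn.ReLU(),
    nn.Upsample(scale_factor=2, mode="nearest"),
    nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
    nn.Conv2d(64, 3, 3, padding=1),
)

t = torch.randn(1, 512, 32, 32)  # AdaIN output for a 256x256 input
img = decoder(t)                 # back to image resolution: (1, 3, 256, 256)
```

Note there is no normalisation layer inside the decoder: the paper found that normalising the decoder’s features would wash out the style injected by AdaIN.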
Loss Function
The loss objective is a weighted combination of content and style loss: L = Lc + λLs.
The content loss is the Euclidean distance between the VGG encoder’s features of the stylised image and the AdaIN output t:

Lc = ‖f(g(t)) − t‖₂
The AdaIN output t is used as the content target because it helps the model converge faster. The style loss matches the channel-wise mean and standard deviation of the stylised image’s features to those of the style image:

Ls = Σᵢ ‖μ(ϕᵢ(g(t))) − μ(ϕᵢ(s))‖₂ + ‖σ(ϕᵢ(g(t))) − σ(ϕᵢ(s))‖₂
Each ϕᵢ denotes a layer of the VGG19 network; here relu1_1, relu2_1, relu3_1 and relu4_1 are used.
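Under these definitions, the two losses can be sketched in PyTorch as follows (a sketch only: mean-squared error stands in for the distances, as is common in implementations, and the function names are my own):

```python
import torch
import torch.nn.functional as F

def mean_std(feat, eps=1e-5):
    # channel-wise statistics over the spatial dimensions, as in instance norm
    mu = feat.mean(dim=(2, 3))
    std = (feat.var(dim=(2, 3), unbiased=False) + eps).sqrt()
    return mu, std

def content_loss(f_gt, t):
    # f_gt: encoder features of the stylised image g(t); t: the AdaIN output
    return F.mse_loss(f_gt, t)

def style_loss(feats_gt, feats_s):
    # feats_*: lists of feature maps from relu1_1 ... relu4_1
    loss = 0.0
    for fg, fs in zip(feats_gt, feats_s):
        mu_g, std_g = mean_std(fg)
        mu_s, std_s = mean_std(fs)
        loss = loss + F.mse_loss(mu_g, mu_s) + F.mse_loss(std_g, std_s)
    return loss
```

The total objective is then `content_loss(...) + lam * style_loss(...)`, with the weight λ trading off content fidelity against stylisation strength.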
Output Samples
Here are some artworks generated by the model.
Code and Pre-trained Models
Code Snippet to implement Adaptive Instance Normalisation:
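A minimal PyTorch version of the operation could look like this (my own sketch, not necessarily identical to the repository’s code):

```python
import torch
import torch.nn as nn

class AdaIN(nn.Module):
    """Adaptive Instance Normalisation: align the channel-wise mean and
    standard deviation of the content features x with those of the style
    features y. No learnable parameters."""

    def __init__(self, eps=1e-5):
        super().__init__()
        self.eps = eps

    def forward(self, x, y):
        mu_x = x.mean(dim=(2, 3), keepdim=True)
        std_x = (x.var(dim=(2, 3), keepdim=True, unbiased=False) + self.eps).sqrt()
        mu_y = y.mean(dim=(2, 3), keepdim=True)
        std_y = (y.var(dim=(2, 3), keepdim=True, unbiased=False) + self.eps).sqrt()
        # normalise x, then rescale and shift with the style statistics
        return std_y * (x - mu_x) / std_x + mu_y
```

After the forward pass, the output’s per-channel mean and standard deviation match those of the style features, which is the entire trick behind the method.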
The code for training the AdaIN style transfer network can be found here:
The weights for pre-trained models (in both PyTorch and ONNX formats) can be found here:
Conclusion
This post introduced instance normalisation and showed how adaptive instance normalisation can be used for style transfer between arbitrary content-style image pairs. By training a decoder network and using adaptive instance normalisation, we are able to transfer style between any content-style image pair and to control the strength of the style in the generated image at runtime. Another advantage of the method is its inference speed compared to the other models at the time. Finally, some outputs were shown, and the code and pre-trained weights for inference were made available for free, non-commercial use.
If you find any mistakes in the post, please leave a comment, I will fix them!
If you liked the post, please consider following the author, Aadhithya Sankar.
References
[1] X. Huang and S. Belongie. “Arbitrary style transfer in real-time with adaptive instance normalization.” In ICCV, 2017.
[2] D. Ulyanov, A. Vedaldi, and V. Lempitsky. “Improved texture networks: Maximizing quality and diversity in feed-forward stylization and texture synthesis.” In CVPR, 2017.
[3] Y. Wu and K. He. “Group normalization.” In ECCV, 2018.
[4] V. Dumoulin, J. Shlens, and M. Kudlur. “A learned representation for artistic style.” In ICLR, 2017.
[5] Y. Li, N. Wang, J. Liu, and X. Hou. “Demystifying neural style transfer.” arXiv preprint arXiv:1701.01036, 2017.
[6] https://github.com/aadhithya/AdaIN-pytorch
[7] https://en.wikipedia.org/wiki/File:Side_profile,_Brihadeeswara.jpg
[9] https://en.wikipedia.org/wiki/Taj_Mahal#/media/File:Taj_Mahal_in_India_-_Kristian_Bertel.jpg
Other Works by the Author
If you liked this post, here are some other posts you might enjoy: