Real-time Artwork Generation using Deep Learning

Adaptive Instance Normalisation (AdaIN) for style transfer between arbitrary content-style image pairs.

Aadhithya Sankar
Towards Data Science


Artwork generated by AI. Image Source: [6]

In this post we will look into the paper “Arbitrary Style Transfer in Real-time with Adaptive Instance Normalization” (AdaIN) by Huang et al. We are looking at this paper because it offered some key advantages over other state-of-the-art methods at the time of its release.

Most important of all, this method, once trained, can be used to transfer style between any arbitrary content-style image pair, even ones not seen during training. While the method proposed by Gatys et al. can also transfer styles between any content-style image pair, it is drastically slower than this method because it iteratively optimises the stylised image during inference. The AdaIN method is also flexible: it allows control over the strength of the transferred style in the stylised image and enables extensions such as style interpolation and spatial control.

Going forward, we will first familiarise ourselves with instance normalisation and adaptive instance normalisation, and then dive deeper into how the AdaIN paper works. Finally, we will look at some outputs and at the code to implement Adaptive Instance Normalisation and to train the style transfer network.

Instance Normalisation

Fig. 1: Different Normalisation Techniques. Image Source: [3].

Batch Normalisation normalises each feature channel across the entire fixed-size batch, using the mean and variance computed over the whole batch. Instance Normalisation (a.k.a. Contrast Normalisation), on the other hand, normalises each channel of each sample separately.

Eq. 1: Instance Normalisation

From Eq. 1, we can clearly see that each channel of each sample is normalised separately. Another difference is that instance norm is applied unchanged at inference time, whereas batch norm switches to its accumulated running statistics.

Instance norm normalises the contrast of the input image, making the stylised output independent of the content image's contrast and thus improving output quality [2]. The authors of [1] also argue that instance normalisation acts as a form of style normalisation by normalising the feature statistics (mean and variance). Here, the style to which the images are normalised is defined by the learnable affine parameters γ and β [1].
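
To make this concrete, here is a minimal PyTorch sketch (the tensor shape and ε value are assumptions for the example) that computes instance normalisation by hand and checks it against torch.nn.InstanceNorm2d:

```python
import torch
import torch.nn as nn

# Toy batch of feature maps: (N, C, H, W)
x = torch.randn(4, 3, 32, 32)

# Instance norm by hand: mean and variance per sample and per channel,
# computed over the spatial dimensions only.
eps = 1e-5  # assumed epsilon, matching the PyTorch default
mu = x.mean(dim=(2, 3), keepdim=True)                    # shape (4, 3, 1, 1)
var = x.var(dim=(2, 3), keepdim=True, unbiased=False)
x_in = (x - mu) / torch.sqrt(var + eps)

# The built-in layer gives the same result (affine=False -> no learnable gamma/beta).
layer = nn.InstanceNorm2d(num_features=3, affine=False, eps=eps)
print(torch.allclose(x_in, layer(x), atol=1e-5))         # True
```

With affine=True, the layer would additionally learn the γ and β parameters mentioned above.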

To learn more about instance norm and other normalisation techniques, check out Aakash Bindal’s post here.

Conditional Instance Normalisation

Conditional Instance Normalisation was introduced in [4], where the authors proposed learning different γs and βs for each style image.

Eq. 2: Conditional Instance Normalisation.

While this method works well, it requires 2 × C × S extra parameters (C: number of channels, S: number of styles), thus increasing the size of the network. Since instance norm performs a form of style normalisation [1], having a set of affine parameters for each style allows us to normalise images to each of these styles.
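
A minimal sketch of this idea, assuming a fixed set of styles selected by an integer index (the module and parameter names are illustrative, not the exact implementation from [4]):

```python
import torch
import torch.nn as nn

class ConditionalInstanceNorm2d(nn.Module):
    """Instance norm with one (gamma, beta) pair per style: 2 x C x S extra parameters."""
    def __init__(self, num_channels: int, num_styles: int):
        super().__init__()
        self.norm = nn.InstanceNorm2d(num_channels, affine=False)
        self.gamma = nn.Parameter(torch.ones(num_styles, num_channels))
        self.beta = nn.Parameter(torch.zeros(num_styles, num_channels))

    def forward(self, x: torch.Tensor, style_idx: int) -> torch.Tensor:
        # Normalise each channel of each sample, then apply the affine
        # parameters belonging to the selected style.
        gamma = self.gamma[style_idx].view(1, -1, 1, 1)
        beta = self.beta[style_idx].view(1, -1, 1, 1)
        return gamma * self.norm(x) + beta

# Example: 64 channels, 10 learned styles, normalise to style #3
cin = ConditionalInstanceNorm2d(num_channels=64, num_styles=10)
out = cin(torch.randn(2, 64, 32, 32), style_idx=3)
```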

Adaptive Instance Normalisation

The authors of [1] extend the idea that the affine parameters γ and β define the style to which instance norm normalises, and propose Adaptive Instance Normalisation: instance normalisation modified to adapt to arbitrary content-style image pairs.

AdaIN takes content features x and style features y as input and simply matches the channel-wise statistics (mean and variance) of the content features to those of the style features. There are no learnable parameters here, and the transformation is obtained as

Eq. 3: Adaptive Instance Normalisation.

where σ(⋅) is the standard deviation (σ²(⋅) the variance) and μ(⋅) the mean, all computed across the spatial dimensions of each channel and each sample, as in instance norm.
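
As a quick sanity check, here is a short PyTorch sketch (the tensor shapes are arbitrary) verifying that the output of Eq. 3 indeed takes on the style features' channel-wise statistics:

```python
import torch

x = torch.randn(1, 8, 16, 16) * 2.0 + 1.0   # "content" features
y = torch.randn(1, 8, 16, 16) * 5.0 - 3.0   # "style" features

def mu(t):
    return t.mean(dim=(2, 3), keepdim=True)

def sigma(t):
    return t.std(dim=(2, 3), keepdim=True)

out = sigma(y) * (x - mu(x)) / sigma(x) + mu(y)   # Eq. 3

# The transformed features now carry the style statistics.
print(torch.allclose(mu(out), mu(y), atol=1e-4))        # True
print(torch.allclose(sigma(out), sigma(y), atol=1e-4))  # True
```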

AdaIN Style Transfer Network

Architecture

Fig. 2: Proposed Style Transfer Network Architecture. Image Source: [1]

The AdaIN style transfer network follows an encoder-decoder architecture (Fig. 2). The encoder f(⋅) consists of the first few layers (up to relu4_1) of a pre-trained VGG19 network. The encoder is kept fixed and is not trained.

Adaptive Instance Normalisation is performed on the outputs of the encoder as

Eq. 4: AdaIN mixing.

where f(c) and f(s) are the feature maps produced by the encoder for the content and style images, respectively. The mixing coefficient α ∈ [0, 1] controls the strength of the style in the stylised image; α is set to 1 during training.

The decoder g(⋅) mirrors the encoder, with the pooling layers replaced by 2× nearest-neighbour upsampling. The decoder is initialised with random weights, and its weights are learned during training. The stylised image is obtained by passing the transformed feature map t through the decoder.

Eq. 5: Stylised Image generation
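
Putting the pieces together, here is a hedged sketch of the forward pass. The encoder slice index, the torchvision weights argument, and the simplified decoder are assumptions for illustration; the paper's decoder mirrors the encoder more closely (e.g. with reflection padding and additional conv layers).

```python
import torch
import torch.nn as nn
from torchvision.models import vgg19

def adain(fc, fs, eps=1e-5):
    # Eq. 3: match channel-wise mean/std of the content features to the style features.
    mu_c, std_c = fc.mean((2, 3), keepdim=True), fc.std((2, 3), keepdim=True) + eps
    mu_s, std_s = fs.mean((2, 3), keepdim=True), fs.std((2, 3), keepdim=True)
    return std_s * (fc - mu_c) / std_c + mu_s

# Encoder f: pre-trained VGG19 features up to relu4_1 (index 20 in torchvision), frozen.
# The weights argument assumes a recent torchvision version.
encoder = vgg19(weights="IMAGENET1K_V1").features[:21].eval()
for p in encoder.parameters():
    p.requires_grad_(False)

# Decoder g: roughly mirrors the encoder, pooling replaced by nearest-neighbour upsampling.
decoder = nn.Sequential(
    nn.Conv2d(512, 256, 3, padding=1), nn.ReLU(),
    nn.Upsample(scale_factor=2, mode="nearest"),
    nn.Conv2d(256, 128, 3, padding=1), nn.ReLU(),
    nn.Upsample(scale_factor=2, mode="nearest"),
    nn.Conv2d(128, 64, 3, padding=1), nn.ReLU(),
    nn.Upsample(scale_factor=2, mode="nearest"),
    nn.Conv2d(64, 3, 3, padding=1),
)

def stylise(content, style, alpha=1.0):
    fc, fs = encoder(content), encoder(style)
    t = adain(fc, fs)
    t = alpha * t + (1 - alpha) * fc   # Eq. 4: alpha controls the style strength
    return decoder(t)                  # Eq. 5: the stylised image g(t)
```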

Loss Function

The loss objective is a combination of style and content loss.

Eq. 6: Total Loss

The content loss is the Euclidean distance between the feature map of the stylised image (obtained from the VGG encoder) and the AdaIN output t:

Eq. 7: Content Loss

The AdaIN output t, rather than the content image's features, is used as the content target because it helps the model converge faster. The style loss is obtained as:

Eq. 8: Style Loss

Each ϕᵢ denotes a layer of the VGG19 network; here, relu1_1, relu2_1, relu3_1 and relu4_1 are used. Rather than matching Gram matrices, the style loss matches the mean and standard deviation of the stylised image's features to those of the style image at each of these layers.
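
A hedged sketch of both losses; the function names and the use of mse_loss as the (squared) Euclidean distance are illustrative choices, not the repository's exact code.

```python
import torch
import torch.nn.functional as F

def mean_std(feat, eps=1e-5):
    # Channel-wise statistics over the spatial dimensions, as in AdaIN.
    return feat.mean(dim=(2, 3)), feat.std(dim=(2, 3)) + eps

def content_loss(f_gt, t):
    # Eq. 7: distance between the encoder features of the stylised image
    # and the AdaIN output t, which serves as the content target.
    return F.mse_loss(f_gt, t)

def style_loss(feats_gt, feats_style):
    # Eq. 8: match the mean and standard deviation of the stylised image's
    # features to the style image's features at relu1_1 ... relu4_1.
    loss = 0.0
    for fg, fs in zip(feats_gt, feats_style):
        mu_g, std_g = mean_std(fg)
        mu_s, std_s = mean_std(fs)
        loss = loss + F.mse_loss(mu_g, mu_s) + F.mse_loss(std_g, std_s)
    return loss

# Eq. 6: the total loss is the content loss plus the style loss weighted by a scalar.
# total = content_loss(f_gt, t) + style_weight * style_loss(feats_gt, feats_style)
```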

Output Samples

Here are some artworks generated by the model.

Style Transfer in action. Image Sources: (a) [7], (b) [8], (c) [6]
Style Transfer in action. Image Sources: (a) [9], (b) [10], (c) [6]
Controlling the strength of the style on the stylised image during inference using α. Image Source: [1]

Code and Pre-trained Models

Code Snippet to implement Adaptive Instance Normalisation:
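
Below is a minimal, self-contained PyTorch version (a sketch following Eq. 3; the implementation in the repository [6] may differ in details such as the ε handling):

```python
import torch
import torch.nn as nn

class AdaIN(nn.Module):
    """Adaptive Instance Normalisation (Eq. 3): aligns the channel-wise
    mean and standard deviation of content features to style features."""

    def __init__(self, eps: float = 1e-5):
        super().__init__()
        self.eps = eps  # avoids division by zero for flat feature maps

    def _stats(self, feat: torch.Tensor):
        # Mean/std per sample and per channel, over the spatial dimensions.
        mean = feat.mean(dim=(2, 3), keepdim=True)
        std = feat.std(dim=(2, 3), keepdim=True) + self.eps
        return mean, std

    def forward(self, content: torch.Tensor, style: torch.Tensor) -> torch.Tensor:
        c_mean, c_std = self._stats(content)
        s_mean, s_std = self._stats(style)
        return s_std * (content - c_mean) / c_std + s_mean

# Usage: feature maps from the encoder, shape (N, C, H, W)
adain = AdaIN()
t = adain(torch.randn(1, 512, 32, 32), torch.randn(1, 512, 32, 32))
```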

The code for training the AdaIN style transfer network can be found in the AdaIN-pytorch repository [6].

The weights for the pre-trained models (in both PyTorch and ONNX formats) can be found here:

Conclusion

This post introduced instance normalisation and showed how adaptive instance normalisation can be used for style transfer between arbitrary content-style image pairs. By training a decoder network and using AdaIN, we are able to transfer style between any content-style image pair and to control the strength of the style on the generated image at runtime. Another advantage of the method is its inference speed compared with other models at the time. Finally, some outputs were shown, and the code and pre-trained weights for inference are made available for free, non-commercial use.

If you find any mistakes in the post, please leave a comment, I will fix them!

If you liked the post, please consider following the author, Aadhithya Sankar.

References

[1] X. Huang and S. Belongie. Arbitrary style transfer in real-time with adaptive instance normalization. In ICCV, 2017.

[2] D. Ulyanov, A. Vedaldi, and V. Lempitsky. Improved texture networks: Maximizing quality and diversity in feed-forward stylization and texture synthesis. In CVPR, 2017.

[3] Y. Wu and K. He. Group normalization. In ECCV, 2018.

[4] V. Dumoulin, J. Shlens, and M. Kudlur. A learned representation for artistic style. In ICLR, 2017.

[5] Y. Li, N. Wang, J. Liu, and X. Hou. Demystifying neural style transfer. arXiv preprint arXiv:1701.01036, 2017.

[6] https://github.com/aadhithya/AdaIN-pytorch

[7] https://en.wikipedia.org/wiki/File:Side_profile,_Brihadeeswara.jpg

[8] https://en.wikipedia.org/wiki/Olive_Trees_(Van_Gogh_series)#/media/File:Van_Gogh_The_Olive_Trees..jpg

[9] https://en.wikipedia.org/wiki/Taj_Mahal#/media/File:Taj_Mahal_in_India_-_Kristian_Bertel.jpg

[10] https://en.wikipedia.org/wiki/The_Starry_Night#/media/File:Van_Gogh_-_Starry_Night_-_Google_Art_Project.jpg
