Thoughts and Theory

A Primer on Atrous (Dilated) and Depth-wise Separable Convolutions

What are atrous/dilated and depth-wise separable convolutions? How are they different from standard convolutions? What are their uses?

Published in Towards Data Science, Sep 15, 2021 (5 min read)

With properties such as weight sharing and translation invariance, convolutional layers and CNNs have become ubiquitous in computer vision and image processing tasks that use deep learning. With that in mind, this article discusses two developments in convolutional networks: atrous (dilated) convolutions and depth-wise separable convolutions. We will see how these two types of convolution work, how they differ from standard convolutions, and why we may want to use them.

Convolutional Layer

Before we get into the topic, let’s quickly remind ourselves how a convolutional layer works. At their core, convolutional filters are simply feature extractors. What were once hand-crafted feature filters are now learned through the “magic” of back-propagation. We have a kernel (the weights of the conv layer) that is slid over the input feature map. At each location, the kernel and the input patch beneath it are multiplied element-wise and the products are summed to obtain a scalar value. Fig. 1 shows this in action.

Fig. 1: A 3x3 Convolution filter in action[1].

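The convolutional filter detects a particular feature by sliding over the input feature map, i.e., it looks for that feature at each location. This intuitively explains the translation invariance property of convolutions: the same feature is detected wherever it appears, and the response simply shifts with it (strictly speaking, this is translation equivariance).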
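The sliding-window operation described above can be sketched in a few lines of plain Python. This is only an illustration, assuming no padding and a stride of 1; the function name `conv2d_naive` and the toy input are made up for this example and are not taken from any library.

```python
def conv2d_naive(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and return the output map."""
    k_h, k_w = len(kernel), len(kernel[0])
    out_h = len(image) - k_h + 1
    out_w = len(image[0]) - k_w + 1
    output = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # Element-wise multiply the kernel with the patch under it, then sum.
            total = 0
            for di in range(k_h):
                for dj in range(k_w):
                    total += image[i + di][j + dj] * kernel[di][dj]
            row.append(total)
        output.append(row)
    return output


# Hypothetical 5x5 input with a single bright pixel, and a 3x3 kernel of ones.
image = [[0] * 5 for _ in range(5)]
image[2][2] = 1
kernel = [[1] * 3 for _ in range(3)]
result = conv2d_naive(image, kernel)  # 3x3 output map
```

Note that the same nine weights are reused at every location, which is the weight sharing mentioned in the introduction.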

Atrous (Dilated) Convolution

To understand how atrous convolution differs from standard convolution, we first need to know what a receptive field is. The receptive field is the size of the region of the input feature map that produces each output element. In the case of Fig. 1, the receptive field is 3x3, as each element in the output feature map sees (uses) 3x3 input elements.

Deep CNNs use a combination of convolutions and max-pooling. This has the disadvantage that each pooling (or strided) step reduces the spatial resolution of the feature map, typically by half. Overlaying the resulting feature map on the original image results in sparse feature extraction. This effect can be seen in Fig. 2. The conv. filter downsamples the input image by a factor of two. Upsampling the feature map and overlaying it on the image shows that the responses correspond to only one quarter of the image locations (sparse feature extraction).

Fig. 2: Sparse Feature Extraction in DCNN[2].

Atrous (dilated) convolution fixes this problem and allows for dense feature extraction. This is achieved through a new parameter called the rate (r). Put simply, atrous convolution is akin to standard convolution except that the weights of an atrous convolution kernel are spaced r locations apart, i.e., the kernels of dilated convolution layers are sparse. A rate of r = 1 recovers the standard convolution.
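To make the "weights spaced r locations apart" idea concrete, here is a minimal 1-D sketch in plain Python. It assumes no padding and a stride of 1; `dilated_conv1d` and the toy signal are hypothetical names chosen for this illustration.

```python
def dilated_conv1d(signal, kernel, rate=1):
    """1-D convolution (no padding, stride 1) with kernel taps spaced `rate` apart."""
    span = (len(kernel) - 1) * rate + 1  # number of input elements the kernel covers
    output = []
    for start in range(len(signal) - span + 1):
        # Tap k reads the input at start + k * rate, skipping rate - 1 elements in between.
        output.append(sum(w * signal[start + k * rate] for k, w in enumerate(kernel)))
    return output


signal = [1, 2, 3, 4, 5, 6, 7]
kernel = [1, 1, 1]
standard = dilated_conv1d(signal, kernel, rate=1)  # taps at offsets 0, 1, 2
dilated = dilated_conv1d(signal, kernel, rate=2)   # taps at offsets 0, 2, 4
```

Both calls use the same three weights. Only the spacing between the input elements they read changes.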

Fig. 4: 3x3 Atrous(dilated) Convolution in action[1].

Fig. 3(a) shows a standard kernel and Fig. 3(b) a dilated 3x3 kernel with rate r = 2. By controlling the rate parameter, we can arbitrarily control the receptive field of the conv. layer. This allows the conv. filter to look at larger areas of the input (a larger receptive field) without a decrease in spatial resolution or an increase in the number of kernel weights. Fig. 4 shows a dilated convolutional filter in action.
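As a rough sketch of how the rate controls the receptive field, the helpers below compute the input window covered by a single dilated kernel, and the receptive field of several stride-1 layers stacked one after another. The function names are made up for this example, and the stacking formula assumes stride 1 with no pooling between layers.

```python
def effective_kernel_size(kernel_size, rate):
    """Side length of the input window covered by a dilated kernel."""
    return kernel_size + (kernel_size - 1) * (rate - 1)


def stacked_receptive_field(kernel_size, rates):
    """Receptive field of stride-1 conv layers stacked in sequence, one layer per rate."""
    field = 1
    for rate in rates:
        # Each layer widens the field by (kernel_size - 1) * rate input elements.
        field += (kernel_size - 1) * rate
    return field


window_r1 = effective_kernel_size(3, rate=1)                 # standard 3x3 kernel
window_r2 = effective_kernel_size(3, rate=2)                 # 3x3 kernel with rate 2
plain_stack = stacked_receptive_field(3, rates=[1, 1, 1])    # three standard layers
dilated_stack = stacked_receptive_field(3, rates=[1, 2, 4])  # three dilated layers
```

In every case each kernel still holds only 3x3 = 9 weights per channel, so the larger receptive field comes at no cost in parameters.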

Fig. 5: Dense feature extraction using dilated convolutions[2].

Compared to the standard convolution used in Fig. 2, Fig. 5 shows that dense features are extracted by using a dilated kernel with rate r = 2. In PyTorch, a dilated convolution is implemented simply by setting the dilation parameter of the convolution layer to the required rate.

Dilated Convolution: PyTorch implementation
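A minimal sketch of what this might look like with `torch.nn.Conv2d`. The channel counts, input size, and padding values are arbitrary choices for illustration and are not taken from the figures.

```python
import torch
import torch.nn as nn

# Standard 3x3 convolution: each output element sees a 3x3 input window.
standard_conv = nn.Conv2d(in_channels=3, out_channels=8, kernel_size=3, padding=1)

# Dilated 3x3 convolution with rate 2: same number of weights, but a 5x5 input window.
# padding=2 keeps the output at the same spatial size as the input.
dilated_conv = nn.Conv2d(in_channels=3, out_channels=8, kernel_size=3, padding=2, dilation=2)

x = torch.randn(1, 3, 32, 32)  # (batch, channels, height, width), arbitrary sizes
standard_out = standard_conv(x)
dilated_out = dilated_conv(x)
```

Both layers hold weight tensors of identical shape, and with the padding chosen here both outputs keep the 32x32 spatial resolution of the input.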

Depth-wise Separable Convolution

Depth-wise separable convolution was popularized by the Xception network[3]. Fig. 6 shows a standard convolution operation, where each kernel acts on all input channels at once. For the configuration shown in Fig. 6, we have 256 kernels of size 5x5x3.

Fig. 6: Standard 5x5 convolution applied to a 12x12x3 input.

Fig. 7(a) shows depth-wise convolution, where a separate filter is applied to each input channel. This is what differentiates a depth-wise separable convolution from a standard convolution. The output of the depth-wise convolution has the same number of channels as the input. For the configuration shown in Fig. 7(a), we have 3 kernels of size 5x5x1, one for each channel. Inter-channel mixing is achieved by convolving the output of the depth-wise convolution with 1x1 kernels, one per required output channel (Fig. 7(b)).

Fig. 7: 5x5 depth-wise convolution (a) followed by a 1x1 convolution (b).

Why choose Depth-wise Separable Convolution?

To answer this we take a look at the number of multiplications required to perform a standard convolution and a depth-wise separable convolution.

Standard Convolution
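For the configuration specified in Fig. 6, we have 256 kernels of size 5x5x3. Sliding a 5x5 kernel over the 12x12 input without padding gives 12 - 5 + 1 = 8 positions along each axis, i.e., 8*8 output locations. The total number of multiplications required to compute the convolution is: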
256*5*5*3*(8*8 locations) = 1228800

Depth-wise Separable Convolution
For the configuration specified in Fig. 7, we have 2 convolution operations:
1) 3 kernels of size 5x5x1. Here, the number of multiplications required is: 5*5*3*(8*8 locations) = 4800
2) 256 kernels of size 1x1x3 for the 1x1 convolution. The number of multiplications required: 256*1*1*3*(8*8 locations) = 49152
Total multiplications required for the depth-wise separable convolution: 4800 + 49152 = 53952.

We can quite clearly see that the depth-wise separable convolution requires far fewer multiplications than the standard convolution: roughly 23 times fewer for this configuration.
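The counts above can be reproduced with a short plain-Python sketch. The function and argument names are invented for this illustration. Like the calculation above, it counts multiplications only and ignores additions and biases.

```python
def standard_conv_mults(k, c_in, c_out, out_h, out_w):
    """Multiplications for c_out kernels of size k x k x c_in at every output location."""
    return c_out * k * k * c_in * out_h * out_w


def depthwise_separable_mults(k, c_in, c_out, out_h, out_w):
    """Multiplications for a depth-wise k x k conv followed by a 1x1 conv."""
    depthwise = k * k * c_in * out_h * out_w  # one k x k x 1 kernel per input channel
    pointwise = c_out * c_in * out_h * out_w  # c_out kernels of size 1 x 1 x c_in
    return depthwise + pointwise


# Configuration from Fig. 6 and Fig. 7: 5x5 kernels, 3 input channels,
# 256 output channels, 8x8 output locations.
standard = standard_conv_mults(5, 3, 256, 8, 8)         # 1228800
separable = depthwise_separable_mults(5, 3, 256, 8, 8)  # 4800 + 49152 = 53952
ratio = standard / separable
```

Dividing the two expressions shows where the savings come from: the ratio of separable to standard cost simplifies to 1/c_out + 1/k², so the gain grows with both the number of output channels and the kernel size.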

In PyTorch, the depth-wise step can be implemented by setting the groups parameter of nn.Conv2d to the number of input channels; following it with a 1x1 convolution gives the full depth-wise separable convolution.
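A minimal sketch of such a layer, using the configuration from Fig. 6 and Fig. 7. The class name `DepthwiseSeparableConv` is made up for this example, and biases are switched off only to keep the parameter count easy to read.

```python
import torch
import torch.nn as nn


class DepthwiseSeparableConv(nn.Module):
    """Depth-wise k x k convolution followed by a 1x1 (point-wise) convolution."""

    def __init__(self, in_channels, out_channels, kernel_size):
        super().__init__()
        # groups=in_channels gives one kernel_size x kernel_size filter per input channel.
        self.depthwise = nn.Conv2d(
            in_channels, in_channels, kernel_size, groups=in_channels, bias=False
        )
        # The 1x1 convolution mixes channels and sets the number of output channels.
        self.pointwise = nn.Conv2d(in_channels, out_channels, kernel_size=1, bias=False)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))


layer = DepthwiseSeparableConv(in_channels=3, out_channels=256, kernel_size=5)
x = torch.randn(1, 3, 12, 12)  # 12x12x3 input, as in Fig. 6
out = layer(x)                 # no padding, so 12 - 5 + 1 = 8 along each axis
n_params = sum(p.numel() for p in layer.parameters())  # 3*5*5 + 256*3 weights
```

For comparison, a standard 5x5 convolution with the same input and output channels would hold 256 * 5 * 5 * 3 = 19200 weights.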

Note: In PyTorch, both in_channels and out_channels must be divisible by the groups parameter. This is because the input channels are divided into groups=g groups, each convolved with its own set of filters. See the PyTorch nn.Conv2d documentation for more information.

Conclusion

This post covered two popular types of convolution: atrous (dilated) convolution and depth-wise separable convolution. We saw what they are, how they differ from the standard convolution operation, and the advantages they offer over it. Finally, we saw how atrous (dilated) and depth-wise separable convolutions can be implemented using PyTorch.

References

[1] Convolution arithmetic (https://github.com/vdumoulin/conv_arithmetic)

[2] Chen, Liang-Chieh, et al. “Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs.” IEEE transactions on pattern analysis and machine intelligence 40.4 (2017): 834–848.

[3] Chollet, François. “Xception: Deep learning with depthwise separable convolutions.” Proceedings of the IEEE conference on computer vision and pattern recognition. 2017.