Advanced Convolution Methods

This project introduces advanced convolution methods, namely spatial separable and depthwise separable convolutions.

NORMAL CONVOLUTION OPERATION
First of all, let's introduce the normal convolution operation. In simple words, a convolution uses a number of kernels (also called filters) that slide across the input feature map to generate the output feature map. Therefore, there are two main parameters that define the convolution operation:

Size of the window patches used to extract meaningful information from the inputs. These are typically 3 x 3 or 5 x 5, and they correspond to the kernel width and height.

Depth of the output feature map. The framework creates a number of kernels equal to the number of output features. The kernel values are the parameters of the convolutional layer, and they are learned during neural network training, starting from a random initialization. The proper number of output features depends on each case under analysis and on the layer position within the neural network. Both parameters are illustrated in the short sketch below.
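As a quick illustration, here is a minimal sketch of how both parameters map to a convolution layer in Keras (Conv2D and its arguments are the real Keras API; the specific values simply mirror the example used later in this project):

```python
from tensorflow.keras import layers

# Window patch size: 5 x 5 (kernel width and height).
# Output feature map depth: 256 (the framework creates 256 kernels,
# randomly initialized and learned during training).
conv = layers.Conv2D(filters=256, kernel_size=(5, 5), padding="valid")
```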

Once the convolution parametrization is defined, the kernel, sized according to the window patch, slides across the input feature map, as depicted below.
The full pass of one kernel over the input feature map generates one output feature, as depicted below.
Therefore, multiple kernels need to be created and applied across the input feature map to generate the defined number of output features. Each kernel is parametrized differently during the neural network training process to extract meaningful information from the input data. In the picture below, there are 256 kernels generating an output feature map with a depth of 256.
That is the normal convolution operation. For the particular case in the pictures, the convolution transforms a 12 x 12 x 3 input feature map into an 8 x 8 x 256 output feature map. Note the reduction from 12 x 12 to 8 x 8 is due to border effects of applying a 5 x 5 kernel with no padding.

With these numbers, calculating the multiplications involved in this transformation is straightforward. It applies 256 kernels of 5 x 5 x 3 across an output feature space of 8 x 8, so the multiplications amount to 256 x 5 x 5 x 3 x 8 x 8 = 1228800. That number is high, so the computational workload of a convolution operation is a factor to consider, and a research topic for further optimization. Next, some improved methods are explained.
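To make the arithmetic concrete, below is a minimal NumPy sketch of the normal convolution with a multiplication counter (naive loops for clarity, not an optimized implementation; function and variable names are illustrative):

```python
import numpy as np

def normal_conv(inputs, kernels):
    """Naive valid convolution. inputs: (H, W, C); kernels: (N, K, K, C)."""
    H, W, C = inputs.shape
    N, K, _, _ = kernels.shape
    out = np.zeros((H - K + 1, W - K + 1, N))
    mults = 0
    for n in range(N):                          # one full pass per output feature
        for i in range(H - K + 1):
            for j in range(W - K + 1):
                patch = inputs[i:i + K, j:j + K, :]
                out[i, j, n] = np.sum(patch * kernels[n])
                mults += K * K * C              # multiplications per window position
    return out, mults

x = np.random.rand(12, 12, 3)
k = np.random.rand(256, 5, 5, 3)
y, mults = normal_conv(x, k)
print(y.shape, mults)  # (8, 8, 256) 1228800
```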
For further details on convolutional neural networks and the convolution operation, please check the link below.

SPATIAL SEPARABLE CONVOLUTION
The main idea behind spatial separable convolution is to split the width and height dimensions of the kernel, meaning two 1D kernels are used instead of one 2D kernel.

For example, the next picture depicts the transformation of a 2D 3 x 3 kernel into two 1D kernels of sizes 3 x 1 and 1 x 3.
The application of the two 1D kernels is depicted below, requiring an intermediate image prior to reaching the final output image.
Therefore, the number of multiplications is the sum of the multiplications involved in the two transformations (intermediate and output image). For the particular example in the picture below, reaching the intermediate image requires applying a 3 x 1 kernel across a 3 x 5 intermediate image, leading to 3 x 1 x 3 x 5 = 45 multiplications. The output image requires 1 x 3 x 3 x 3 = 27 multiplications, leading to a total of 72 multiplications.

As a comparison, the normal convolution requires applying a 3 x 3 kernel across a 3 x 3 output image: 3 x 3 x 3 x 3 = 81 multiplications.
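A minimal sketch can verify both the equivalence and the counts, assuming a 3 x 3 kernel that factors as the outer product of a 3 x 1 column and a 1 x 3 row (the Sobel kernel below is a classic example of such a separable kernel):

```python
import numpy as np

col = np.array([[1.], [2.], [1.]])   # 3 x 1 kernel
row = np.array([[-1., 0., 1.]])      # 1 x 3 kernel
full = col @ row                     # equivalent 3 x 3 (Sobel) kernel

def conv2d_valid(img, k):
    kh, kw = k.shape
    oh, ow = img.shape[0] - kh + 1, img.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * k)
    return out

img = np.random.rand(5, 5)
direct = conv2d_valid(img, full)                        # 3 x 3 x 3 x 3 = 81 mults
separable = conv2d_valid(conv2d_valid(img, col), row)   # 45 + 27 = 72 mults
print(np.allclose(direct, separable))                   # True
```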
The number of multiplications is reduced, but the main downside of spatial separable convolution is that not all kernels can be split into two smaller kernels. That limitation prevents its widespread use in deep learning applications. Instead, depthwise separable convolution, explained below, does not have this limitation.

DEPTHWISE SEPARABLE CONVOLUTION
This convolution performs a spatial convolution on each channel of the input feature map independently (depthwise convolution), and then it mixes the output channels via a 1 x 1 convolution (pointwise convolution). Both steps are explained next.

PART 1 - DEPTHWISE CONVOLUTION
In this operation, each channel of the input feature map is convolved independently. Thus, kernels of depth 1 are applied to each input channel separately. The output of this convolution has the same depth as the original feature map. The depthwise convolution is depicted next.

PART 2 - POINTWISE CONVOLUTION
At this point, the feature map depth equals the number of channels of the input feature map, so a channel convolution is needed to generate as many output features as required. That is done using a 1 x 1 kernel, also known as a pointwise kernel. Therefore, a 1 x 1 kernel with depth equal to the current feature map depth slides across the intermediate feature map, generating one output feature. That process is then repeated a number of times equal to the required output features, as depicted below for the particular case of 256 output features.
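Both parts can be written down in a short NumPy sketch under the same 12 x 12 x 3 to 8 x 8 x 256 shapes (naive loops for clarity; function and variable names are illustrative):

```python
import numpy as np

def depthwise_separable_conv(inputs, dw_kernels, pw_kernels):
    """inputs: (H, W, C); dw_kernels: (K, K, C), one 2D kernel per channel;
    pw_kernels: (C, N), one 1 x 1 x C kernel per output feature."""
    H, W, C = inputs.shape
    K = dw_kernels.shape[0]
    oh, ow = H - K + 1, W - K + 1
    # Part 1 - depthwise: each channel convolved independently (depth stays C).
    mid = np.zeros((oh, ow, C))
    for c in range(C):
        for i in range(oh):
            for j in range(ow):
                mid[i, j, c] = np.sum(inputs[i:i + K, j:j + K, c] * dw_kernels[:, :, c])
    # Part 2 - pointwise: 1 x 1 convolution mixing the C channels into N outputs.
    out = mid @ pw_kernels           # (oh, ow, C) @ (C, N) -> (oh, ow, N)
    return out

x = np.random.rand(12, 12, 3)
y = depthwise_separable_conv(x, np.random.rand(5, 5, 3), np.random.rand(3, 256))
print(y.shape)  # (8, 8, 256)
```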
The concept of depthwise separable convolution amounts to separating the learning of spatial features from the learning of channel features. In the depthwise convolution, only spatial features are learned, and once the spatial features are convolved, the pointwise convolution focuses on channelwise learning.

That approach is correct when assuming that spatial locations in the input are highly correlated while channels are fairly independent, which tends to be a good representation of reality. And as it is a more representationally efficient scheme, it also tends to learn better representations using less data, resulting in better performing models with a lower computational workload due to the reduced number of multiplications.
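In practice there is no need to implement the two steps by hand, since frameworks expose the operation directly; in Keras, for instance, SeparableConv2D is the corresponding layer (a real Keras layer; the values mirror the running example):

```python
from tensorflow.keras import layers

# Drop-in replacement for Conv2D(256, (5, 5)): a depthwise 5 x 5 convolution
# per channel, followed by a 1 x 1 pointwise convolution into 256 outputs.
sep_conv = layers.SeparableConv2D(filters=256, kernel_size=(5, 5), padding="valid")
```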

As a final exercise, let's compare the computational workload between normal convolution and depthwise separable convolution.

- In the transformation from 12 x 12 x 3 to 8 x 8 x 256, the normal convolution required 1228800 multiplications. 

- Instead, the multiplications in the depthwise separable convolution equal the sum of the depthwise and pointwise convolution operations. The depthwise operation needs 3 kernels of 5 x 5 x 1, each going across an 8 x 8 intermediate map, leading to a total of 3 x 5 x 5 x 1 x 8 x 8 = 4800 multiplications. The pointwise operation needs 256 kernels of 1 x 1 x 3, each going across the 8 x 8 output map, leading to a total of 256 x 1 x 1 x 3 x 8 x 8 = 49152 multiplications. Thus, the full depthwise separable convolution involves 53952 multiplications, a significant decrease compared to the normal convolution (roughly 23 times fewer).
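The same arithmetic generalizes into a small helper, assuming stride 1 and no padding (the formulas follow directly from the counts above; the function name is illustrative):

```python
def conv_costs(h_out, w_out, k, c_in, c_out):
    """Multiplication counts for producing an h_out x w_out x c_out output."""
    normal = h_out * w_out * k * k * c_in * c_out
    separable = h_out * w_out * (k * k * c_in + c_in * c_out)
    return normal, separable

normal, separable = conv_costs(8, 8, 5, 3, 256)
print(normal, separable, round(normal / separable, 1))  # 1228800 53952 22.8
```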
I appreciate your attention and I hope you find this work interesting.

Luis Caballero