
Image Classification using CNN Part 1

This project focuses on a binary image classification exercise, so the objective is to create a model able to differentiate between images of two classes. The dataset consists of 2000 images of bikes and another 2000 images of cars. During this project, a convolutional neural network will be tuned to learn the basic structure of a car and a bike, and to make accurate predictions for new images of cars and bikes. The project is split into two parts; this publication covers part 1.

Part 1 is focused on creating the convolutional neural network architecture.

Part 2 is focused on tuning the network to maximize the learning and generalization power of the model.

As a reference, the work is done with Python and the Keras framework, and the dataset is an open dataset from Kaggle. The code and dataset can be found in the links below.

Firstly, some images included in the dataset are plotted to get familiar with the dataset. Below there are 9 pictures of each class. The pictures come in different sizes, so a preprocessing step is required to unify them to a common size matching the input size of the convolutional neural network. All images have been resized to 150 x 150 pixels by applying bilinear interpolation.
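
The preprocessing code is not shown in this publication, so below is a minimal sketch of how that resizing could be done with Keras generators; the directory name 'data/train' and the batch size are assumptions.

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Rescale pixel values to [0, 1] and resize every image to 150 x 150
# using bilinear interpolation ('data/train' and batch_size are assumptions).
datagen = ImageDataGenerator(rescale=1. / 255)
train_generator = datagen.flow_from_directory(
    'data/train',
    target_size=(150, 150),
    interpolation='bilinear',
    batch_size=32,
    class_mode='binary')
```
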
CONVOLUTIONAL NEURAL NETWORK ARCHITECTURE DEFINITION
The convolutional neural network used in this exercise is based on depthwise separable convolution layers instead of conventional convolutional layers. That is because depthwise separable convolution layers are more efficient in terms of memory and computational usage, and they generally lead to better results than standard convolutional layers by applying a more realistic representation: they assume that spatial locations are highly correlated while channels are fairly independent. If needed, further explanation about depthwise separable convolution layers can be found in the link below.
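
As a quick illustration of that efficiency, the hypothetical comparison below counts the parameters of a standard 3 x 3 convolution against a depthwise separable one for the same input and output shapes (the 32-channel input and 64 output features are arbitrary example sizes):

```python
from tensorflow.keras import layers, models

# Same 3 x 3 convolution producing 64 features from a 32-channel input,
# first as a standard layer and then as a depthwise separable one.
standard = models.Sequential(
    [layers.Conv2D(64, (3, 3), input_shape=(74, 74, 32))])
separable = models.Sequential(
    [layers.SeparableConv2D(64, (3, 3), input_shape=(74, 74, 32))])

standard.summary()   # 3*3*32*64 + 64 = 18496 parameters
separable.summary()  # 3*3*32 + 32*64 + 64 = 2400 parameters
```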


The network applies a basic structure of a depthwise separable convolution layer followed by a max pooling operation. The convolution layers use 3 x 3 kernels (or filters) with stride = 1, and the max pooling operations use 2 x 2 windows with stride = 2. The network consists of two parts:

1. Convolutional part to extract meaningful features from the input image. Several consecutive layers will be applied to avoid learning only small local patterns of the input image and to extract progressively higher-level information about the target class.

2. Model output part consisting of flattening all information to 1D and applying an output layer with a sigmoid activation function. The sigmoid function is the suitable choice for the present exercise of binary classification.

CONVOLUTIONAL PART
Defining the convolutional part implies selecting the number of layers and the number of output features per layer. Some guidelines are below.

- The spatial size of the input image must be reduced to a manageable level of parameters prior to applying the flatten layer. This requires using several pooling layers, which are normally applied after a convolutional layer. For example, outputting 132 features while keeping the original 150 x 150 input size (i.e. without any pooling) would lead to almost 3 million nodes in the flatten layer (150 x 150 x 132 = 2,970,000), which would cause computational constraints.

- The output features need to focus on large areas of the input image, which also requires applying several pooling layers. Otherwise, the spatial hierarchy is lost, since the filters of the last layer would focus only on a small local area of the input image, not allowing the network to learn general high-level patterns of the target class.

- The first layers focus on small local areas of the input image, so a reduced number of output features is normally applied in these initial layers.

- The last layers focus on high-level patterns of the target class, so a progressively higher number of output features is applied as the network becomes deeper.

In the current scenario with an input spatial size of 150 x 150, the number of pooling layers should be around 4 or 5 to reduce the output spatial size below 10 x 10. A good exercise to check whether 4 layers is a suitable choice is to assess the output of each layer and confirm the network is deep enough to reach high-level patterns in the last layer, and not just small local patterns of the input image. That can be done using the code below, which creates a new Keras functional API model exposing the output of each available layer.
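
The original listing is attached as an image, so here is a minimal sketch of that idea; `model` is assumed to be the network defined in the next step, and `input_image` a preprocessed batch of shape (1, 150, 150, 3).

```python
from tensorflow import keras

# Functional API model that returns the output of every layer of the
# trained network, one feature map per layer.
layer_outputs = [layer.output for layer in model.layers]
activation_model = keras.Model(inputs=model.input, outputs=layer_outputs)

# 'activations' is a list with one activation array per layer.
activations = activation_model.predict(input_image)
```
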
The defined model has 32 output features in the first layer, and that value is then doubled for each deeper layer: the second layer outputs 64 features and the third layer outputs 128. Following that pattern, the fourth layer would have 256 output features, but it is kept at 128 to reduce the computational burden and will only be extended to 256 if really needed. After assessing the fourth layer output, it will be decided whether a fifth layer is needed. The code to create the model is as follows.
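
The original listing is shown as an image; a sketch of the equivalent Keras code follows. The 'relu' activations are an assumption, while the default 'valid' padding is consistent with the layer sizes reported later.

```python
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    # Convolutional part: separable convolutions followed by max pooling.
    layers.SeparableConv2D(32, (3, 3), activation='relu',
                           input_shape=(150, 150, 3)),
    layers.MaxPooling2D((2, 2)),
    layers.SeparableConv2D(64, (3, 3), activation='relu'),
    layers.MaxPooling2D((2, 2)),
    layers.SeparableConv2D(128, (3, 3), activation='relu'),
    layers.MaxPooling2D((2, 2)),
    layers.SeparableConv2D(128, (3, 3), activation='relu'),
    layers.MaxPooling2D((2, 2)),
    # Model output part: flatten to 1D and apply a single sigmoid node.
    layers.Flatten(),
    layers.Dense(1, activation='sigmoid'),
])
```
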
The exercise below applies the above code to the next picture and plots the output of each layer to understand what each convolutional layer is actually seeing.
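
The plotting code is not included in this publication either; a possible helper to display the first channels of a layer activation could look like the sketch below (matplotlib assumed, `plot_layer_channels` is a hypothetical name).

```python
import matplotlib.pyplot as plt

# Display the first n_channels feature maps of one layer as a grid.
def plot_layer_channels(activation, n_channels=16, cols=4):
    rows = n_channels // cols
    fig, axes = plt.subplots(rows, cols, figsize=(cols * 2, rows * 2))
    for i, ax in enumerate(axes.flat):
        ax.matshow(activation[0, :, :, i], cmap='viridis')
        ax.axis('off')
    plt.show()

plot_layer_channels(activations[0])  # first separable convolution layer
```
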
The output features of the first layer are depicted next. The first layer acts as a collection of various edge detectors. At that stage, the activations retain almost all of the information present in the initial image.
Next is depicted the output of the second layer. Going deeper into the network, the activations become increasingly abstract and less visually interpretable, since they begin to encode higher-level patterns, although they are still visually recognizable at this stage.
The next picture depicts the output of the third layer. The layer encodes even more abstract data. Note that higher representations carry increasingly less information about the visual contents of the input image, and increasingly more information related to the class of the image, which is key for good model generalization.
The next picture depicts the output of the fourth layer. The outputs are no longer interpretable, which means they gather class information, so a depth of four layers already looks suitable for the convolutional network. A fifth layer would increase the computational burden with a potentially limited benefit.

As a general observation, the sparsity of the activations increases with the depth of the layer. For example, in the first layer all filters are activated by the input image, whereas in the following layers progressively more filters are blank. A blank filter means the pattern it encodes is not found in the input image.
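
That sparsity can also be quantified. For instance, the sketch below (reusing the hypothetical `activations` list from above) prints the fraction of completely blank channels per layer of the convolutional part:

```python
import numpy as np

# Fraction of all-zero (blank) channels in each layer's activation,
# for the 8 layers of the convolutional part.
for layer, act in zip(model.layers[:8], activations[:8]):
    blank = np.mean([np.all(act[0, ..., c] == 0)
                     for c in range(act.shape[-1])])
    print(f'{layer.name}: {blank:.0%} blank channels')
```
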
As complementary information, the exercise is repeated with the car image below, reaching the same conclusions as with the bike example.
The results of the previous exercise evidence an important characteristic of deep neural networks: the features extracted by a layer become increasingly abstract with the depth of the layer. This means the output of deeper layers carries less information about the specific input image, but more information about the target class (in this case, either bike or car).

Therefore, a deep neural network acts as an information distillation pipeline, processing raw input image data so that irrelevant information is filtered out while useful information is magnified and refined.

The summary of the convolutional network is as follows: four convolutional layers with 32, 64, 128 and 128 output features respectively, applying a max pooling layer after each convolutional step.
The input size is 150 x 150, transformed to 148 x 148 with 32 features in the first convolutional layer due to border effects of the 3 x 3 kernel, and then to 74 x 74 by the max pooling layer.

Second convolutional layer outputs 72 x 72 with 64 features, transformed to 36 x 36 after max pooling. 

Third convolutional layer outputs 34 x 34 with 128 features, transformed to 17 x 17 after max pooling. 

Fourth convolutional layer outputs 15 x 15 with 128 features, transformed to 7 x 7 after max pooling. 
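
These figures can be cross-checked with `model.summary()`. For the sketch above, the expected shapes and parameter counts (computed from the depthwise separable formula: 3 x 3 x input channels for the depthwise step, plus input channels x output features, plus one bias per feature) would be:

```python
model.summary()
# Expected output for the sketch above:
#   separable_conv2d    (148, 148, 32)    155
#   max_pooling2d       (74, 74, 32)        0
#   separable_conv2d_1  (72, 72, 64)     2400
#   max_pooling2d_1     (36, 36, 64)        0
#   separable_conv2d_2  (34, 34, 128)    8896
#   max_pooling2d_2     (17, 17, 128)       0
#   separable_conv2d_3  (15, 15, 128)   17664
#   max_pooling2d_3     (7, 7, 128)         0
#   flatten             (6272,)             0
#   dense               (1,)             6273
# Total params: 35,388 (the ~35k mentioned below)
```
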
MODEL OUTPUT PART
The convolutional part can be considered complete when the input image spatial size has been reduced to manageable values and some meaningful output features are generated based on high-level patterns of the target class (and not on local patterns of the input image). At this point, standard dense layers need to be applied, but first the information needs to be flattened to 1D.

The output of the convolutional part is 7 x 7 x 128, leading to 6272 nodes after flattening. The output layer, as explained, is a layer with a single node and a sigmoid activation function, since the current exercise is a binary classification. It might be a good option to use a standard dense layer between the 6272 flatten nodes and the single node of the output layer. However, an intermediate hidden layer of, for example, 512 nodes would imply more than 3 million parameters (6272 x 512 weights plus biases). As a reference, the input dataset only has 4000 images, so tuning 3 million parameters with 4000 images might not be an optimal scenario, apart from the extra computational workload associated with that tuning. For simplicity, the model of this project will not use that hidden layer and will connect the flatten layer directly to the model output.

Therefore, the convolutional neural network architecture to use in this exercise is as follows. As observed, it requires around 35k parameters, which is a much more manageable value than the previous 3 million.
After creating the convolutional neural network, the next step is to tune it to maximize the learning and generalization power of the network. That work is captured in a different publication (see introduction).
I appreciate your attention and I hope you find this work interesting.

Luis Caballero