
Image Classification using CNN Part 2

This project focuses on a binary image classification exercise. The objective is to create a model able to differentiate between images of two classes. The dataset for this exercise consists of 2000 images of bikes and another 2000 images of cars. During this project, a convolutional neural network is tuned to learn the basic structure of a car and a bike and to make accurate predictions for new images of cars and bikes. The project is split into two parts, and this publication covers part 2.

Part 1 is focused on creating the convolutional neural network architecture.

Part 2 is focused on tuning the network to maximize the learning and generalization power of the model.

For reference, the work is done with Python and the Keras framework, and the dataset is an open dataset from Kaggle. The code and dataset can be found in the links below.

During part 1 of the project, the convolutional neural network architecture below was defined.
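Since the architecture itself was built in part 1, the snippet below is only a minimal sketch of a plausible baseline in Keras; the number of layers, filter counts and image size are illustrative assumptions, not the exact values from part 1.

```python
# Minimal sketch of a baseline binary CNN in Keras (layer sizes are assumptions)
from tensorflow.keras import layers, models

def build_baseline_model(input_shape=(150, 150, 3)):
    model = models.Sequential([
        layers.Conv2D(32, (3, 3), activation='relu', input_shape=input_shape),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(64, (3, 3), activation='relu'),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(128, (3, 3), activation='relu'),
        layers.MaxPooling2D((2, 2)),
        layers.Flatten(),
        layers.Dense(512, activation='relu'),
        layers.Dense(1, activation='sigmoid'),  # binary output: car vs bike
    ])
    model.compile(optimizer='rmsprop',
                  loss='binary_crossentropy',
                  metrics=['accuracy'])
    return model
```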
OVERFIT MODEL RUN
Therefore, the next step is to run the model based on the above architecture and check its performance. Cross validation is applied to obtain more robust results than the hold-out technique. The main difference is that hold-out relies on a single validation split, whereas cross validation loops over different validation splits to obtain more realistic and significant results instead of trusting a single split outcome.

The train vs test split is 75-25%, leading to 3000 images in the training set and 1000 images in the testing set. The training set is in turn split into 4 folds of 750 images each, and the process iterates, assigning one fold as validation data each time. Therefore, the model runs 4 times, training on 2250 images and validating on a different set of 750 images each time. Note that the 1000 images reserved as the testing set are not included in any validation or model tuning process; they are only used for the final evaluation of the definitive tuned model. This is very important to ensure the model has high learning and generalization power.

The summary of the cross validation applied is depicted as follows.
It is important to rescale the input data of a neural network prior to running the model. Otherwise, the model will not be able to capture any significant trend if each input has a different scale. A rescaling between 0 and 1 is applied to all input images by simply dividing by 255 (the maximum value of each RGB pixel) using the ImageDataGenerator from Keras.
Moreover, an early stopping callback is defined to avoid running the model longer than needed once it has already converged.
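A minimal sketch of the rescaling generator and the early stopping callback; the monitored metric and patience value are assumptions.

```python
# Sketch of the rescaling generator and early stopping (patience is an assumption)
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from tensorflow.keras.callbacks import EarlyStopping

# Rescale each RGB pixel from 0-255 to 0-1
datagen = ImageDataGenerator(rescale=1.0 / 255)

# Stop training once the validation loss stops improving
early_stop = EarlyStopping(monitor='val_loss', patience=5,
                           restore_best_weights=True)
```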
The model runs using the previous ImageDataGenerator, flowing the images from directory for both the validation and training sets. A loop controls which directory the images flow from in each cross validation split. The batch size is set to 20 because the model's ability to generalize degrades significantly with larger batch sizes. The number of steps per epoch is defined as the number of images in the directory divided by the batch size, so that all images in the folder are assessed during each epoch. The code to run the model is as follows.
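The listing below is a minimal sketch of such a cross validation loop, reusing the generator, callback and model builder sketched above; the fold directory layout, target image size and epoch count are assumptions.

```python
# Sketch of the 4-fold cross validation training loop
# (the data/fold_k/train and data/fold_k/validation layout is an assumption)
histories = []
for fold in range(1, 5):
    train_gen = datagen.flow_from_directory(
        f'data/fold_{fold}/train',           # 2250 training images per fold
        target_size=(150, 150),
        batch_size=20,
        class_mode='binary')
    val_gen = datagen.flow_from_directory(
        f'data/fold_{fold}/validation',      # 750 validation images per fold
        target_size=(150, 150),
        batch_size=20,
        class_mode='binary')

    model = build_baseline_model()
    history = model.fit(
        train_gen,
        steps_per_epoch=train_gen.samples // train_gen.batch_size,
        epochs=50,
        validation_data=val_gen,
        validation_steps=val_gen.samples // val_gen.batch_size,
        callbacks=[early_stop])
    histories.append(history)
```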
The results are depicted as follows, showing clear overfitting. The mean cross validation accuracy is around 0.93, but starting at epoch 10, the training accuracy keeps increasing while the validation accuracy freezes, meaning the model is learning particular details of the training set and losing generalization and learning power.
The overfit can also be detected in the loss function results. On the training set, the loss converges to zero because the model becomes tuned to make a perfect fit with the training data. However, the validation loss starts to increase once overfitting begins, due to the loss of model generalization power.
BATCH NORMALIZATION MODEL RUN
One improvement to apply to the model is the batch normalization technique. The input data was rescaled to 0-1 to ensure the neural network can learn from uniform data, but there is no guarantee that the intermediate data between layers remains correctly scaled once it is processed. Therefore, a good practice is to apply a batch normalization layer at the output of each layer to rescale the intermediate data. This way, all layers see scaled input data and the maximum benefit from the neural network is reached.

The model including batch normalization is depicted as follows.
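As a reference, the sketch below shows how batch normalization can be attached to the baseline sketched earlier; layer sizes remain illustrative assumptions.

```python
# Sketch of the baseline with BatchNormalization after each convolution and dense layer
from tensorflow.keras import layers, models

def build_batchnorm_model(input_shape=(150, 150, 3)):
    model = models.Sequential([
        layers.Conv2D(32, (3, 3), activation='relu', input_shape=input_shape),
        layers.BatchNormalization(),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(64, (3, 3), activation='relu'),
        layers.BatchNormalization(),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(128, (3, 3), activation='relu'),
        layers.BatchNormalization(),
        layers.MaxPooling2D((2, 2)),
        layers.Flatten(),
        layers.Dense(512, activation='relu'),
        layers.BatchNormalization(),
        layers.Dense(1, activation='sigmoid'),
    ])
    model.compile(optimizer='rmsprop',
                  loss='binary_crossentropy',
                  metrics=['accuracy'])
    return model
```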
The model including batch normalization is run to check if there is any improvement, and results are depicted below.

The mean cross validation accuracy is similar to the previous scenario and the same overfit behavior is detected. However, this time the model converges faster because it can learn faster from the input data thanks to having properly scaled data at all stages of the neural network. This is evidence that batch normalization benefits the neural network, but it does not fix the overfit issue.
As for the loss function plot, overfit is also detected since the validation loss stops improving and starts to increase. However, this time the validation loss increase follows a less steep trend due to the application of batch normalization.
DATA AUGMENTATION TECHNIQUE
One common technique to avoid overfitting is data augmentation. It is especially important when the number of samples in the input dataset is limited, as in the current case. The main purpose of data augmentation is to modify the input dataset pictures to make them slightly different, so that the model can learn new patterns from the modified pictures. It can be understood as artificially increasing the number of samples in the input dataset. The model sees the same number of samples in each epoch, but each sample is slightly different each time, leading to further model generalization power and reducing the likelihood of overfitting.

Data augmentation is applied in the ImageDataGenerator in Keras, allowing the framework to randomly modify different picture features such as zoom, spatial shift, rotation or flip.
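A minimal sketch of such an augmenting generator follows; the specific parameter values are assumptions chosen only to illustrate the transformations mentioned.

```python
# Sketch of a data augmentation generator (parameter values are assumptions)
from tensorflow.keras.preprocessing.image import ImageDataGenerator

augmented_datagen = ImageDataGenerator(
    rescale=1.0 / 255,
    rotation_range=30,        # random rotation up to 30 degrees
    width_shift_range=0.2,    # random horizontal shift
    height_shift_range=0.2,   # random vertical shift
    zoom_range=0.2,           # random zoom
    horizontal_flip=True)     # random horizontal flip
```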
For a clearer understanding of data augmentation, the technique is applied 9 times to the next original picture.
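A short sketch of how those 9 augmented versions can be generated with the generator's flow method; the image path is only a placeholder.

```python
# Sketch to visualize 9 augmented versions of a single picture
# (the image path is a placeholder)
import matplotlib.pyplot as plt
from tensorflow.keras.preprocessing.image import load_img, img_to_array

img = img_to_array(load_img('data/train/bikes/bike_0001.jpg',
                            target_size=(150, 150)))
img = img.reshape((1,) + img.shape)   # the generator expects a batch dimension

plt.figure(figsize=(8, 8))
for i, batch in enumerate(augmented_datagen.flow(img, batch_size=1)):
    plt.subplot(3, 3, i + 1)
    plt.imshow(batch[0])              # values already rescaled to 0-1
    plt.axis('off')
    if i == 8:                        # stop after 9 augmented samples
        break
plt.show()
```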
It can be observed that the output picture is different each time the function is applied, which helps the generalization power of the model. Thus, data augmentation is a very good strategy to artificially increase the number of samples in the input dataset, avoid overfitting and increase model learning power.
Next, as a reference, the same exercise is applied to a car image.
DROPOUT AND L2 REGULARIZATION
Other alternatives to avoid overfitting are DROPOUT and REGULARIZATION.

Dropout is a layer which sets to zero a certain percentage of the output features of the previous layer. Thus, the model has more difficulty overfitting, since some features are randomly disabled. The core idea behind dropout is that introducing noise in the output values of a layer can prevent the network from memorizing non-meaningful patterns, which would be learned if no noise were present. Therefore, by applying dropout, the network is challenged to really learn what is meaningful in the input dataset.

Regularization, instead, focuses on limiting the effect of each parameter in the loss function, and so limiting the coefficients of each parameter. Overfitting happens when parameters are tuned to match the training set, which normally implies a very complex model with high parametrization and only a few features carrying most of the weight in the loss function. When regularizing, the weight in the loss function is forced to be more uniformly distributed among features, increasing the generalization power. There are two main regularization approaches:

- L1 regularization: the cost added is proportional to the absolute value of the weight coefficients. In practice, it tends to force some parameters to be exactly 0.

- L2 regularization: the cost added is proportional to the square of the weight coefficients. In practice, it tends to reduce the magnitude of each parameter to the minimum possible. L2 regularization is normally preferred; a short sketch of both options and dropout follows below.
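In Keras, the penalties are attached to a layer through kernel_regularizer and dropout is added as its own layer. This is a minimal sketch; the 0.05 factor matches the L2 value used in the final model below, while the L1 line is purely illustrative.

```python
# Minimal sketch of L1/L2 penalties and dropout on Keras layers
from tensorflow.keras import layers, regularizers

l2_dense = layers.Dense(512, activation='relu',
                        kernel_regularizer=regularizers.l2(0.05))
l1_dense = layers.Dense(512, activation='relu',
                        kernel_regularizer=regularizers.l1(0.05))  # illustrative only
drop = layers.Dropout(0.5)   # zeroes 50% of the previous layer's output features
```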
DATA AUGMENTATION, DROPOUT AND L2 REG MODEL RUN
Having explained the major techniques to avoid overfitting, all three techniques are applied to the current model to check whether the performance improves. The final model is depicted as follows. It uses an L2 regularization factor of 0.05 in each layer and applies a 50% dropout layer just after the convolutional part, where all features are flattened to 1D.
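The sketch below illustrates such a final model; the 0.05 L2 factor and the 50% dropout after flattening follow the description above, while keeping the batch normalization layers from the previous run and the layer sizes are assumptions.

```python
# Sketch of the final model: L2 (0.05) on each layer, 50% dropout after flattening
from tensorflow.keras import layers, models, regularizers

def build_final_model(input_shape=(150, 150, 3), l2_factor=0.05):
    reg = regularizers.l2(l2_factor)
    model = models.Sequential([
        layers.Conv2D(32, (3, 3), activation='relu', kernel_regularizer=reg,
                      input_shape=input_shape),
        layers.BatchNormalization(),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(64, (3, 3), activation='relu', kernel_regularizer=reg),
        layers.BatchNormalization(),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(128, (3, 3), activation='relu', kernel_regularizer=reg),
        layers.BatchNormalization(),
        layers.MaxPooling2D((2, 2)),
        layers.Flatten(),
        layers.Dropout(0.5),              # 50% dropout right after flattening
        layers.Dense(512, activation='relu', kernel_regularizer=reg),
        layers.Dense(1, activation='sigmoid'),
    ])
    model.compile(optimizer='rmsprop',
                  loss='binary_crossentropy',
                  metrics=['accuracy'])
    return model
```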
The results look much better when applying the anti-overfitting techniques. The model is no longer overfitting and the generalization power has increased, leading to a mean cross validation accuracy close to 0.96 (it was only around 0.93 when overfitting). The simulation requires more epochs since the model learns more slowly due to the constraints introduced to avoid overfitting, but at the same time, it avoids learning very detailed patterns of the training set.
The same improved performance is detected in the loss function assessment. The loss for both the training and validation sets is progressively reduced, leading to higher model generalization and learning power. Therefore, based on these results, this model looks suitable for final evaluation against the testing data reserved at the beginning of the project.
FINAL MODEL EVALUATION
Once the model has demonstrated high generalization and learning power against the validation set, the last step is to run it against the testing set reserved at the beginning of the project. The results are as follows.
The model for split 1 leads to 0.935 accuracy, while the other three models reach accuracies of 0.961, 0.969 and 0.968, which overall confirms the generalization power of the models.

At this point, there are four models available and only one must be selected. The easiest option would be to select the one with the highest accuracy against the testing set and complete the project with an accuracy of 0.969. However, a more sophisticated and promising option is to ensemble all four models, applying an optimal weight to each one. Thus, the learning of each particular model can be leveraged to obtain a combined model with higher accuracy.

ENSEMBLING MODELS
To ensemble models efficiently, the weight of each model in the ensemble must be accurately defined. Otherwise, the ensemble might not operate as expected if higher weights are assigned to the less efficient models.

The code below loads the four split models and calculates the predictions of each model for both the training and testing sets. Then, it uses the minimize function from the scipy.optimize library to find the optimal weight for each model, minimizing the difference between the predicted training set class and the real training set class. It is very important to tune the weights against the training set, because the testing set is only used for final model evaluation and never for tuning purposes.

One remark is that scipy.optimize.minimize works as a local minimum finder, so a loop with random initializations is created to find the potential local minima and then select the minimum among them.

Once the minimization loop has concluded and the global minimum is found, these weights are applied to the testing set for final model evaluation, to check whether the performance improves compared to the non-ensembled scenario.
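The listing below is a sketch of this weight search; the model file names, the number of random restarts and the mean squared error objective are assumptions, and it relies on the make_predictions helper sketched further below.

```python
# Sketch of the ensemble weight search with scipy.optimize.minimize
# (model file names and number of restarts are assumptions)
import numpy as np
from scipy.optimize import minimize
from tensorflow.keras.models import load_model

split_models = [load_model(f'model_split_{k}.h5') for k in range(1, 5)]

# Per-model probabilities and true labels for the training and testing sets
train_preds, y_train = make_predictions(split_models, 'data/train')
test_preds, y_test = make_predictions(split_models, 'data/test')

def ensemble_error(weights, preds, y_true):
    # Normalize weights so they sum to 1, then average the model probabilities
    weights = np.abs(weights) / np.sum(np.abs(weights))
    combined = np.tensordot(weights, preds, axes=(0, 0))
    # Mean squared difference between ensembled probability and true class
    return np.mean((combined - y_true) ** 2)

# minimize only finds local minima, so repeat from random starting weights
best_weights, best_error = None, np.inf
for _ in range(100):
    w0 = np.random.dirichlet(np.ones(len(split_models)))
    res = minimize(ensemble_error, w0, args=(train_preds, y_train),
                   method='Nelder-Mead')
    if res.fun < best_error:
        best_error = res.fun
        best_weights = np.abs(res.x) / np.sum(np.abs(res.x))

# Final evaluation of the weighted ensemble on the reserved testing set
test_combined = np.tensordot(best_weights, test_preds, axes=(0, 0))
test_accuracy = np.mean((test_combined > 0.5).astype(int) == y_test)
print('Ensemble weights:', best_weights, 'test accuracy:', test_accuracy)
```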
The above code calls the make_predictions function to obtain both the predictions and the real targets for the training and testing sets. The code for that function is depicted below as a reference, using the computer vision library (import cv2).
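The version below is a sketch of such a helper; the directory layout (one sub-folder per class) and the image size are assumptions.

```python
# Sketch of a make_predictions helper using OpenCV
# (directory layout and image size are assumptions)
import os
import cv2
import numpy as np

def make_predictions(models_list, directory, target_size=(150, 150)):
    images, labels = [], []
    classes = sorted(os.listdir(directory))            # e.g. ['bikes', 'cars']
    for label, class_name in enumerate(classes):
        class_dir = os.path.join(directory, class_name)
        for file_name in os.listdir(class_dir):
            img = cv2.imread(os.path.join(class_dir, file_name))
            if img is None:
                continue                               # skip unreadable files
            img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB) # cv2 loads BGR, model expects RGB
            img = cv2.resize(img, target_size) / 255.0 # same rescaling as training
            images.append(img)
            labels.append(label)
    x = np.array(images, dtype=np.float32)
    y = np.array(labels)
    # One row of probabilities per model: shape (n_models, n_images)
    preds = np.array([m.predict(x).ravel() for m in models_list])
    return preds, y
```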
The final outcome is depicted next. By applying an optimal weight to each of the four split models in the ensemble, the testing accuracy escalates to 0.982, which is significantly higher than the best accuracy of any model independently. The weights are also depicted below. For example, the algorithm assigns almost 40% to the model of split 3, which was the one with the highest testing accuracy on its own, and only 17% to the model of split 1, which was the one with the lowest testing accuracy on its own.

The main reason for the increased accuracy when ensembling all four models is that the ensemble combines the learning of several models, each contributing different approaches and particularities which would not be learned by a single model.
Even a model with lower accuracy on its own might be worth including in the ensemble, because it might provide fresh learning unknown to the other models in the ensemble. This can be observed in the example below, which removes from the ensemble the model of the first split, the one reaching only 0.935 accuracy compared to values higher than 0.96 for the other three models.

When ensembling only the models of splits 2, 3 and 4, excluding the model of the first split, the final testing accuracy is 0.981, slightly lower than when ensembling all four split models (0.982). Therefore, the model of the first split, despite having a lower accuracy on its own compared to the other three models, has been demonstrated to be a positive contribution that increases the ensemble accuracy.
As a conclusion, the next table summarizes all model results obtained during this project, concluding that the ensemble of the four split models with data augmentation, dropout and L2 regularization leads to the best performance, with an accuracy of 0.982.
I appreciate your attention and I hope you find this work interesting.

Luis Caballero