
Deep Learning Neural Network Introduction

This project introduces the neural networks used in deep learning, explaining their working principle and the training algorithm.
NEURAL NETWORK INTRODUCTION
A neural network is a group of interconnected nodes (or neurons) communicating with each other. The neurons are grouped in layers, and information flows from the initial layer to the last layer. All neurons in a layer have a direct connection to the neurons of the next layer. Note that some advanced networks include recurrent connections to provide memory capability when processing sequences. The main layers in a neural network are as follows:

Input Layer 
The purpose of the input layer is to take in the raw information. There is no computational workload; it just passes the raw dataset features on to the next layers.
Output Layer
It is the final layer of the network, which delivers the predicted outcome based on the internal learning capability of the network.
Hidden Layers
Hidden layers are the group of non-exposed internal layers in the network. They process the raw information to extract progressively more meaningful information. For that purpose, the hidden layers perform all kinds of computation on the features entered through the input layer and transfer the result to the output layer.

An example of a neural network is depicted as follows.
Let's take a deeper look at the neuron architecture. The mathematical operation in each neuron is defined as follows.

Output = activation function(dot(W, Inputs) + b)

Inputs refers to all input connections to the particular neuron, W refers to the weight of each input connection and b represents the neuron bias. Dot refers to the scalar dot product, which corresponds to the cumulative sum of the products of each input and its weight, as depicted below.
A graphical representation of a neuron operation is depicted below.
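As a minimal NumPy sketch of this operation (the input values, weights and bias below are made-up numbers, and ReLU is used only as a placeholder activation):

import numpy as np

def relu(z):
    # Placeholder activation: outputs the input if positive, zero otherwise
    return np.maximum(z, 0.0)

def neuron_output(inputs, weights, bias, activation=relu):
    # Output = activation(dot(W, Inputs) + b)
    z = np.dot(weights, inputs) + bias  # cumulative sum of input*weight products, plus bias
    return activation(z)

# Hypothetical neuron with three input connections
x = np.array([0.5, -1.2, 3.0])
w = np.array([0.8, 0.1, 0.4])
b = 0.2
print(neuron_output(x, w, b))  # 1.68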
Both connection weights and biases are parameters of the neural network which must be optimized during training. For the particular case of one neuron, the parametrization might look manageable, but the scenario changes when training multiple hidden layers with hundreds of neurons each, leading to millions of parameters. Just as a reference, the next picture shows an example of a neural network with three hidden layers of just 25 neurons each, and the network parametrization already increases greatly in complexity due to the large number of connections.
As mentioned, each input connection to the neuron is weighted. That enables controlling the impact of each input feature, and of each combination of input features, on the neuron output. This is a very important characteristic of neural networks, since they perform feature engineering work themselves while training. By comparison, that work normally has to be done manually by the data scientist before applying shallow learning algorithms, and its outcome highly affects model accuracy. Neural networks do that work by themselves, but it is still advisable to apply preprocessing steps to simplify and optimize the training process.

Once the weights are known, the scalar dot product is performed and an activation function is finally applied. The activation function decides whether a neuron should be activated or not, meaning it decides whether the neuron input information is meaningful for the network prediction. Therefore, the activation function transforms the summed weighted input of the neuron into an output value to be fed to the next hidden layer (or delivered as the final network output), depending on the learning power of the neuron input information.

The activation function concept is clear, but why is this transformation needed? Would it be possible to proceed just with the outcome of the scalar dot product? The reason is to add non-linear behavior to the neural network, so a non-linear activation function is required. Otherwise, the network would only be able to learn linear behaviors based on the scalar dot product and the bias summation, which would limit the learning power of the model.
NETWORK INFORMATION FLOW
In a neural network, information commonly flows from the input layer to the output layer, which is known as feedforward propagation and is the usual direction when using the network to predict new data. However, information can also flow in the opposite direction, from the output layer to the input layer, which is known as backpropagation. Backpropagation is used to find the optimal parametrization during network training.

Feedforward Propagation
The flow of information goes in the forward direction. The input is used to calculate intermediate functions in the hidden layers, which are then used to calculate the output.
In feedforward propagation, the activation function acts as a mathematical gate or filter between the input feeding the current neuron and its output going to the next layer.

Backpropagation
This is used for training the network. The process to train a network is to initialize the network parameters and calculate the network output, which is compared with the true output for the given input features. Knowing that deviation, the LOSS FUNCTION can be calculated. The loss function is the way to evaluate the network performance, and backpropagation is the technique to find the optimal network parametrization (neuron weights and biases) that minimizes the loss function. To do that, the information flows backwards, from the output layer to the input layer, to understand which parameters are the major contributors to the loss function and update them accordingly. This is performed by applying the partial derivatives of the loss function neuron by neuron and layer by layer. This process is detailed later.

Loss function selection depends on the exercise under analysis. Some guidelines are summarized next, followed by a short sketch of how the loss is selected in practice.

- Use binary_crossentropy for Binary classification, Multiclass multilabel classification and Regression to values between 0 and 1

- Use categorical_crossentropy for Multiclass single label classification

- Use MSE for Regression to arbitrary values (also usable for Regression to values between 0 and 1)
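As a hedged illustration of these guidelines, assuming a Keras-style API (the loss names above follow the Keras naming; the model architecture below is a placeholder):

from tensorflow import keras

# Placeholder model: 10 input features, one hidden layer, sigmoid output
model = keras.Sequential([
    keras.Input(shape=(10,)),
    keras.layers.Dense(32, activation="relu"),
    keras.layers.Dense(1, activation="sigmoid"),
])

# Binary classification (or regression to values between 0 and 1)
model.compile(optimizer="sgd", loss="binary_crossentropy")

# Multiclass single label classification would instead use:
# model.compile(optimizer="sgd", loss="categorical_crossentropy")
# Regression to arbitrary values would use:
# model.compile(optimizer="sgd", loss="mse")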
ACTIVATION FUNCTION
As explained, the activation function plays an important role in introducing non-linearity into the network, but there are several activation function alternatives to choose from, as depicted in the below diagram.
As mentioned earlier, the activation function must be non-linear to introduce non-linearities into the network, enable deeper learning power and allow backpropagation when the network is under training. With a linear activation function, the derivative is a constant, so the gradient carries no information about the neuron inputs and backpropagation cannot tell which inputs contributed more to the output; in addition, stacked linear layers collapse into a single linear transformation. By using non-linear activation functions, backpropagation is enabled and it becomes feasible to go back and understand which weights in the input neurons can provide a better prediction.

The most common non-linear activation functions are discussed below.

ReLU (Rectified Linear Unit) ACTIVATION FUNCTION
The rectified linear activation function (ReLU) is a non-linear (piecewise linear) function that outputs the input directly if it is positive and outputs zero otherwise.

ReLU might look linear, but it has a derivative and allows backpropagation while being computationally efficient. It acts as a linear function for positive values and outputs zero for negative values, and the kink at zero provides the non-linearity. This near linearity accelerates the convergence of gradient descent towards the global minimum of the loss function, which makes ReLU a proper choice as the activation function for hidden layers.

The main limitation of ReLU is the problem of "dead neurons" (dying ReLU), which occurs when a neuron gets stuck on the negative side and constantly outputs zero. Because the gradient on that side is also 0, it is unlikely for the neuron to recover. This happens when the learning rate is too high or the negative bias is quite large.
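A minimal NumPy sketch of ReLU and its derivative (the input values are arbitrary), which also shows why a neuron stuck on the negative side stops learning:

import numpy as np

def relu(z):
    # Outputs the input directly if positive, zero otherwise
    return np.maximum(z, 0.0)

def relu_derivative(z):
    # Gradient is 1 for positive inputs and 0 for negative inputs,
    # so a neuron that always lands on the negative side receives no updates
    return (z > 0).astype(float)

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(z))             # [0.  0.  0.  0.5 2. ]
print(relu_derivative(z))  # [0. 0. 0. 1. 1.]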
SIGMOID/LOGISTIC ACTIVATION FUNCTION
This function takes any real value as input and outputs values in the range of 0 to 1. 
The larger the input (more positive), the closer the output value will be to 1.0, whereas the smaller the input (more negative), the closer the output will be to 0.0, as shown below.

It is commonly used for models in which a probability prediction is required as output. Since probabilities only exist in the range between 0 and 1, sigmoid is the right choice because of its range. The function is differentiable and provides a smooth gradient, which prevents abrupt jumps in the output values.

The main downside is that the output is limited between 0 and 1 no matter the magnitude of the input. That can lead to very small derivatives, causing the vanishing gradient problem (explained at the end of this project). It makes this activation function not suitable for hidden layers.
TANH ACTIVATION FUNCTION
The tanh function is very similar to the sigmoid/logistic activation function, and even has the same S-shape, with the difference that the output range is -1 to 1. The graph below compares tanh vs sigmoid as reference. In tanh, the larger the input (more positive), the closer the output value will be to 1.0, whereas the smaller the input (more negative), the closer the output will be to -1.0.

Tanh, in the same way as sigmoid, also suffers from vanishing gradient problems because the output is limited regardless of the input range. It makes this activation function not suitable for hidden layers.
SOFTMAX ACTIVATION FUNCTION
The softmax function can be described as a combination of multiple sigmoids that calculates relative probabilities, returning the probability of each class. The output is an array of probabilities sized according to the number of classes, with the sum of all probabilities equal to 1. This makes it a perfect option as the activation function for the last layer of a neural network in the case of multi-class classification.
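A minimal NumPy sketch of softmax (the scores below are made-up logits for a three-class problem):

import numpy as np

def softmax(z):
    # Subtracting the maximum is a common trick for numerical stability
    e = np.exp(z - np.max(z))
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.1])
probs = softmax(scores)
print(probs)        # one probability per class, roughly [0.66 0.24 0.10]
print(probs.sum())  # 1.0 (up to floating point precision)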
The activation function for each layer is an important decision when creating a network, so some general common rules are summarized next:

HIDDEN LAYERS ACTIVATION FUNCTION
- All hidden layers usually use the same activation function. Starting with the ReLU activation function is a good approach, and then move to other activation functions if results are not optimal.

- ReLU activation function should only be used in the hidden layers, not in the output layer.

- Sigmoid/Logistic and Tanh functions should not be used in hidden layers as they make the model more susceptible to problems during training due to vanishing gradients.

OUTPUT LAYER ACTIVATION FUNCTION
- The output layer will normally use a different activation function than the hidden layers, depending on the goal or type of prediction made by the model. Some guidelines are depicted below, followed by a short sketch of the corresponding output layers.

- Use Sigmoid for Binary classification, Multiclass multilabel classification and regression to values between 0 and 1.

- Use Softmax for Multiclass single label classification.

- Use None for Regression to arbitrary values.
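As a hedged sketch in a Keras-style API (the layer sizes are placeholders), these guidelines translate into output layers such as:

from tensorflow import keras

# Binary classification, multiclass multilabel, or regression to values in [0, 1]
binary_output = keras.layers.Dense(1, activation="sigmoid")

# Multiclass single label classification (here assuming 10 classes)
multiclass_output = keras.layers.Dense(10, activation="softmax")

# Regression to arbitrary values (no activation, i.e. linear output)
regression_output = keras.layers.Dense(1, activation=None)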
BACKPROPAGATION AND STOCHASTIC GRADIENT DESCENT
As mentioned earlier, the training algorithm for a neural network uses the backpropagation technique, meaning that from the network output it is determined how much to modify the network parameters (neuron weights and biases) to minimize the loss function.

The first idea would be to apply minor changes to the parameters and check how the loss function behaves: if the change improved the loss function, keep that trend; otherwise, apply the opposite trend, and repeat for each parameter. That approach is not a real option even for small neural networks, since the number of parameters quickly increases with the network architecture. Moreover, it would require lots of forward passes through the network to assess the minor change in each particular parameter.

Thus, the widely accepted algorithm is STOCHASTIC GRADIENT DESCENT (SGD), which calculates the gradient of the loss function and updates the parameters accordingly. This algorithm can be applied because the operations in a neural network are differentiable, which in turn enables applying the derivative. For a simple function f(x), the derivative determines the function slope. Knowing the slope (derivative) at a certain value x0, the effect on the function output can be estimated when applying a step to x0: if the slope is positive at x0, and assuming step > 0, f(x0 + step) will be higher than f(x0), and vice versa. Therefore, the decision can be taken to either add or subtract the step according to our purpose of either maximizing or minimizing the function. This explanation extends to a function of multidimensional inputs, such as a matrix of parameters; in that case, the collection of partial derivatives is called the gradient.
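As a minimal illustration of this slope idea (the function, starting point and step size below are made up for the example):

# f(x) = (x - 3)^2 has derivative 2*(x - 3) and its minimum at x = 3
def f(x):
    return (x - 3.0) ** 2

def df(x):
    return 2.0 * (x - 3.0)

x = 0.0     # arbitrary starting point
step = 0.1  # step size (learning rate)
for _ in range(100):
    x = x - step * df(x)  # move against the slope to minimize f
print(x)  # close to 3.0, the minimum of f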

The gradient calculation for a neural network is not trivial since the complexity increases rapidly with the number of parameters and layers. For example, the relationship between a parameter in the first layer and its effect on the network output depends on all internal layers. The way to capture those internal relationships is to use the derivative chain rule to calculate the partial derivatives of the loss function with respect to all network parameters. For example, for a composition L(f(g(w))), the chain rule gives dL/dw = dL/df * df/dg * dg/dw, so the contribution of a deep parameter is obtained by multiplying the local derivatives layer by layer. The next picture provides a diagram of the application of the derivative chain rule.
Once the gradient is calculated, the next step would ideally be to identify the parameters that make the gradient equal to 0 and select the combination leading to the lowest value of the original function; that would be the parametrization minimizing the loss function. However, that approach is only plausible for a case with a limited number of coefficients, and it is definitely not feasible for the number of coefficients required in a neural network. Instead, the common practice is to update the parameters in the opposite direction of the gradient at the current parametrization, since the gradient points in the direction in which the function value increases most, and our purpose is to minimize the loss function.

Next, the stochastic gradient descent algorithm is summarized step by step; a minimal training-loop sketch follows the list.

1. Select a batch from the input data. Depending on the size of the batch, the algorithm is named differently: if a single sample is selected, the method is known as true SGD; with a small batch it is known as mini-batch SGD; and with the full input data it is known as batch SGD. The efficient compromise between accuracy and time consumption is mini-batch SGD. However, the batch size of mini-batch SGD needs to be defined. Based on current literature, a significant degradation in the ability of the model to generalize has been observed when using larger batch sizes. Gradient descent can be seen as making a linear approximation to the loss function, so it is more efficient to move with small step sizes (meaning using a small batch size). A batch size close to 32 tends to lead to good accuracy results if the dataset is not very large.

2. Run the neural network for the batch to calculate the model predictions (forward pass)

3. Compute the loss function between the predictions and the true targets

4. Compute the gradient of the loss function with regard to the current parametrization, applying partial derivatives and the chain rule (backward pass)

5. Update the parameters in the opposite direction from the gradient to reduce the loss function. The update is applied according to the defined learning rate, which must be properly chosen to avoid undesired situations such as oscillating without converging or progressing too slowly, both depicted below.

6. Repeat until the loss function converges to a minimum value
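The following is a minimal sketch of this loop on a deliberately tiny linear model with an MSE loss (the synthetic data, learning rate and batch size are arbitrary choices; a real network would backpropagate the gradient through every layer with the chain rule):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))                # synthetic input features
true_w = np.array([1.5, -2.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=1000)  # synthetic targets

w = np.zeros(3)   # parameters to learn
lr = 0.1          # learning rate
batch_size = 32   # mini-batch SGD

for epoch in range(20):
    idx = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        batch = idx[start:start + batch_size]  # 1. select a batch
        xb, yb = X[batch], y[batch]
        preds = xb @ w                          # 2. forward pass
        error = preds - yb                      # 3. loss = mean(error**2)
        grad = 2 * xb.T @ error / len(xb)       # 4. gradient of the loss w.r.t. w
        w -= lr * grad                          # 5. update against the gradient
print(w)  # 6. after enough epochs, close to [1.5, -2.0, 0.5]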
NEURAL NETWORK CHALLENGES
To conclude, let's explain two major issues faced when training a neural network.

VANISHING GRADIENTS
As explained, the model is trained by backpropagating the partial derivatives of the loss function with respect to each neuron weight. Therefore, there is a potential risk for deep networks to lose the information of the initial layers, especially with certain activation functions that limit the output to a small range regardless of the input values. For example, the sigmoid activation function limits the output between 0 and 1, and a large change in the input of the sigmoid function will only cause a small change in the output, leading to a small derivative too. This can become a major issue in deep networks with several hidden layers by causing the gradient to be too small for effective training. That is the reason why it is not recommended to use activation functions such as sigmoid or tanh in the hidden layers.
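A small NumPy sketch of why this happens with sigmoid (the input value and the number of layers are illustrative only):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_derivative(z):
    s = sigmoid(z)
    return s * (1.0 - s)  # never larger than 0.25

# Backpropagating through many layers multiplies such derivatives together,
# so the gradient reaching the first layers shrinks very quickly
layer_derivatives = [sigmoid_derivative(0.0)] * 10  # 0.25 per layer, 10 hidden layers
print(np.prod(layer_derivatives))  # ~9.5e-07, almost no signal left for the first layers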

EXPLODING GRADIENTS
Training a model typically works best when the updates are small and controlled. If the neuron weight values become very large, the neural network might overflow without completing the training. This problem is known as exploding gradients, and it is caused when a significant error gradient accumulates, resulting in very large updates to the neural network weights during training and leading to unstable performance.
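One common mitigation, mentioned here only as a hedged side note beyond the discussion above, is to clip the gradient so that a single large update cannot destabilize training; in a Keras-style API this can look like:

from tensorflow import keras

# Limit the gradient norm before each parameter update
optimizer = keras.optimizers.SGD(learning_rate=0.01, clipnorm=1.0)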
I appreciate your attention and I hope you find this work interesting.

Luis Caballero