
Weather Forecast using RNN and CNN

This project focuses on a timeseries forecasting exercise: the objective is to create a model able to "predict the future" based on past performance. Note this exercise is only possible when the past data has a clear correlation with the future data. Even with some degree of correlation between past and future data, the results are potentially affected by external perturbations and uncontrolled irregularities, so some degree of mismatch between the prediction and the real future is expected.

Let's illustrate the above statement with examples. On the one hand, stock movements or lottery outcomes have little or no correlation between past and future performance, so using machine learning for those purposes will not lead to accurate results. On the other hand, weather might have a certain degree of correlation between past and future performance, because a storm does not happen abruptly; there is a natural phenomenon behind it which might be statistically characterized using certain features. Therefore, weather forecasting is the exercise this project focuses on.

The dataset for this exercise consists of weather information for several weather stations in Ireland from 1990 to 2020. Each data point corresponds to one hour, leading to more than 266k data points in total. Along with the hourly temperature of the weather station, each data point includes 12 environmental features such as rain, wind direction, wind speed, sun or dew point. The dataset provides information about several Irish weather stations, but this project focuses on the station CASEMENT, located in Dublin.

The input for the exercise will therefore be a timeseries, in which the most recent timesteps carry the most relevant information. Chronological order is important for this exercise, so a recurrent neural network needs to be considered. Additionally, for comparison purposes, the work will be repeated with a 1D convolutional neural network, which, unlike recurrent networks, has no memory capability.

As reference, the work is done with Python and the Keras framework, and the dataset is an open dataset from Kaggle. The code and dataset can be found in the links below.

BASELINE DEFINITION
The aim of the project is to predict the temperature for the next day, meaning the temperature in 24h. Therefore, the first step is to define the metric and the baseline scenario against which the developed models will be compared.

The selected metric is MAE (mean absolute error), since the exercise is a regression. The baseline scenario is a dummy model which assumes the temperature in 24h to be exactly the same as the current temperature. The final model from this project should predict better than the baseline to demonstrate learning capability.

The picture below depicts the dummy model defined as baseline, which leads to an MAE score of 2.37 °C on the testing set. That is therefore the target any model developed in this project must improve on to be accepted. Note the target is challenging: an error of a couple of degrees is good performance for such a simple model, since the dummy model benefits from the 24h cyclic behavior of the temperature, as depicted in the plot (remember each sample represents one hour).
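As reference, a minimal sketch of how this baseline can be scored, assuming the hourly temperatures of the testing split are available as a 1D NumPy array (the array name in the usage comment is illustrative):

import numpy as np

def baseline_mae(temperature, lookforward=24):
    # Dummy baseline: predict the temperature 24h ahead to be exactly
    # the current temperature, and score the error with MAE
    preds = temperature[:-lookforward]   # "no change" prediction
    truth = temperature[lookforward:]    # actual temperature 24h later
    return np.mean(np.abs(truth - preds))

# baseline_mae(test_temperature) yields about 2.37 on this dataset
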
DATA SCRUBBING
Once the aim of the project and the target to improve on are clear, the next step is to get familiar with the input dataset. For that purpose, all features are plotted below against the temperature.

There are some outlier values, depicted in red above. These points are identified as outliers because they are clearly separated from the rest of the points and all of them are tied to the same value, which is a sign of a saturated, erroneous measurement. Note other features also have some isolated points out of the feature trend, but those look related to exceptional conditions rather than wrong measurements, so they are not modified.
Instead of removing the identified outliers, which would break the continuity of the timeseries, the selected option is to use the scikit-learn classes SIMPLE IMPUTER and COLUMN TRANSFORMER to assign them the mean value of the corresponding column. Simple imputer allows replacing certain items with a value computed from the existing elements. In this particular case, the values to replace are first transformed to NaN to simplify the use of the simple imputer, and then these values are replaced by the mean value of the corresponding column. Column transformer allows executing the simple imputer on the corresponding columns only.

Additionally, there are some empty values in the dataset, and the same approach is applied to them. After executing the code below with the function REPLACE_VALUES, no empty, NaN or outlier values are present in the dataset anymore.
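A minimal sketch of this scrubbing step, assuming the dataset is loaded in a pandas DataFrame; the helper mirrors the role of the REPLACE_VALUES function rather than reproducing it exactly, and the column names and saturated values in the usage comment are illustrative:

import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

def replace_values(df, columns, outlier_values):
    # Convert the known saturated value of each affected column to NaN,
    # so a single mean imputation covers both outliers and empty entries
    df = df.copy()
    for col, bad in outlier_values.items():
        df[col] = df[col].replace(bad, np.nan)
    # Run the mean imputer only on the selected columns
    transformer = ColumnTransformer(
        [("mean_impute", SimpleImputer(strategy="mean"), columns)],
        remainder="drop")
    df[columns] = transformer.fit_transform(df)
    return df

# Illustrative call:
# df = replace_values(df, ["wetb", "clht"], {"wetb": 99.9, "clht": 999})
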
Depicted below is the new plot of each feature against temperature after scrubbing the dataset. The outliers in WETB and CLHT have disappeared and are instead assigned the mean value of the feature, which can be seen as a horizontal line in the middle of the feature plot.
The histograms also look good, with most features following a normal distribution without outliers. The most unequal distribution is RAIN, which has a very low mean with almost all points close to zero, plus a very small group of points close to 20 corresponding to heavy rains. However, that profile does not look unrealistic for this feature.
Finally, let's assess the correlation matrix of the input dataset. As depicted below, features such as WETB, DEWPT and VAPPR have a strong positive correlation with temperature. Other variables such as RHUM show some negative correlation, and the rest of the variables do not show any significant correlation.
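A minimal sketch of computing and visualizing such a correlation matrix, assuming the scrubbed dataset is in a pandas DataFrame called df (the plotting style is illustrative, not the original code):

import matplotlib.pyplot as plt

corr = df.corr()   # pairwise Pearson correlations between all features
plt.matshow(corr, cmap="coolwarm", vmin=-1, vmax=1)
plt.xticks(range(len(corr)), corr.columns, rotation=90)
plt.yticks(range(len(corr)), corr.columns)
plt.colorbar()
plt.show()
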
DATA GENERATOR
Both the recurrent and 1D convolutional layers analyzed in this project require a 3D tensor with shape [batch, timesteps, features] as input. Batch refers to the number of samples grouped together for model fitting purposes. Timesteps defines the length of the input sequence. And features corresponds to the different input variables of the model.

A recommended approach to feed data into the neural network when operating with sequences is a data generator yielding consecutive 3D tensors with the corresponding shape while moving across the input dataset. The next picture depicts an example of how to create a data generator from an input dataset with 17 timesteps, 5 features and batch size = 3, with the purpose of predicting the feature values 6 timesteps into the future (called lookforward) based on the data of the 6 previous timesteps (called lookback).
The current step is depicted in blue and starts at timestep 5. Note it is not possible to start earlier because 6 previous timesteps are required as lookback. Therefore, the data from timesteps 0, 1, 2, 3, 4 and 5 itself is used to predict the future data 6 timesteps ahead, meaning at timestep = 11. This constitutes a single sample for the neural network, but the target information for each sample is also required in order to tune the network parameters accordingly. The timesteps for the sample are depicted in green and the target information for this particular sample is depicted in orange. However, samples are not usually loaded into a neural network one by one; a mini-batch approach is commonly applied, loading groups of samples. Thus, the process is repeated three times, advancing one timestep in each occurrence, to complete the required batch of 3 in the present exercise. The combination of the three samples and targets for timesteps 5, 6 and 7 constitutes one batch to load into the neural network, with a shape of 3 samples x 6 timesteps x 5 features.

The process is repeated until reaching the last usable timestep in the input dataset, which is the timestep whose lookforward target is the last index of the input dataset. In the example, the last usable timestep is 10, as depicted below, using lookback data from timesteps 5, 6, 7, 8, 9 and 10 itself to predict the feature values 6 timesteps ahead, meaning timestep = 16. Therefore, a second batch with a shape of 3 samples x 6 timesteps x 5 features would be generated from the timesteps 8, 9 and 10.
The data generator is coded in the function GENERATOR below as reference, using the min and max index as explained in this section and captured below.

min index = first index + lookback - 1 (0 + 6 - 1 = 5 for the example)
max index = last index - lookforward (16 - 6 = 10 for the example)
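A sketch of such a generator following the indexing scheme above; the original GENERATOR function lives in the linked code, so the exact signature and the target_col parameter here are illustrative. Here data is a 2D NumPy array of shape (timesteps, features) and target_col is the column index of the variable to predict:

import numpy as np

def generator(data, lookback, lookforward, min_index, max_index,
              batch_size, target_col):
    i = min_index
    while True:
        # Wrap around once the last usable timestep has been consumed
        if i + batch_size > max_index + 1:
            i = min_index
        rows = np.arange(i, min(i + batch_size, max_index + 1))
        i += len(rows)
        samples = np.zeros((len(rows), lookback, data.shape[-1]))
        targets = np.zeros((len(rows),))
        for j, row in enumerate(rows):
            # Lookback window ending at the current timestep (inclusive)
            samples[j] = data[row - lookback + 1: row + 1]
            # Target value lookforward steps ahead of the current timestep
            targets[j] = data[row + lookforward, target_col]
        yield samples, targets

With lookback = lookforward = 6 and min/max index 5 and 10, the first batch reproduces the example above: samples built from timesteps 0-5, 1-6 and 2-7, targeting timesteps 11, 12 and 13.
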
The current dataset has around 266k samples and a hold-out validation will be performed, meaning a validation set is required for tuning purposes prior to testing the final model against the testing set. Therefore, the input dataset is split among training, validation and testing as follows:

TRAINING DATA --> samples from 0 to 180k
VALIDATION DATA --> samples from 180k to 225k
TESTING DATA --> samples from 225k to end of dataset

Each set needs its own data generator based on the GENERATOR function above. The full dataset is passed in each call of the function, but changing the min and max index according to the defined training, validation and testing split allows focusing on different samples each time, for example as follows.
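A sketch using the 24h lookback/lookforward defined by the project aim; data is assumed to be the full scrubbed dataset as a NumPy array, and the batch size and temperature column index are illustrative:

TEMP_COL = 0                                    # temperature column (assumed)
LOOKBACK, LOOKFORWARD, BATCH_SIZE = 24, 24, 128

train_gen = generator(data, LOOKBACK, LOOKFORWARD,
                      min_index=LOOKBACK - 1, max_index=179_999,
                      batch_size=BATCH_SIZE, target_col=TEMP_COL)
val_gen = generator(data, LOOKBACK, LOOKFORWARD,
                    min_index=180_000, max_index=224_999,
                    batch_size=BATCH_SIZE, target_col=TEMP_COL)
test_gen = generator(data, LOOKBACK, LOOKFORWARD,
                     min_index=225_000,
                     max_index=len(data) - 1 - LOOKFORWARD,
                     batch_size=BATCH_SIZE, target_col=TEMP_COL)
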
RECURRENT NEURAL NETWORK DEFINITION
The recurrent neural network consists of a single GRU layer followed by a dropout layer and a standard dense layer. The dropout layer affects the GRU outputs and can be parametrized to define the level of dropout, so the layer is effectively bypassed when defining null dropout. Note dropout on the GRU inputs might be applied through the GRU layer settings if needed. The dense layer has a single output and no activation function, since this is a regression exercise. The summary of the network is as follows.
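In code, a minimal Keras sketch of this architecture could look as follows; using MAE directly as the training loss and RMSprop as optimizer are assumptions, while the units, dropout and learning rate parameters are the ones swept in the next sections:

from tensorflow import keras
from tensorflow.keras import layers

def build_gru_model(lookback, n_features, units=32, dropout=0.0,
                    recurrent_dropout=0.5, learning_rate=1e-4):
    model = keras.Sequential([
        keras.Input(shape=(lookback, n_features)),
        layers.GRU(units, recurrent_dropout=recurrent_dropout),
        layers.Dropout(dropout),   # no effect when dropout is 0
        layers.Dense(1),           # linear output for regression
    ])
    model.compile(optimizer=keras.optimizers.RMSprop(learning_rate),
                  loss="mae")
    return model
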
The purpose of the exercise is to predict the temperature 24h ahead, so lookforward must be 24, since data points are sampled hourly. The lookback is also set to 24h to limit the computational cost. With those considerations, a sweep over the number of GRU units is performed with no dropout and a learning rate low enough to assess the network capacity. The results are as follows.

A GRU layer with 16 units has very low capacity, since the system is not able to overfit. By contrast, layers with 64 and 128 units have excessive capacity, since they overfit heavily. The trade-off solution is a GRU layer with 32 units.
The next step is to tune the learning rate. For that purpose, a sweep over different learning rates is performed, fixing units to 32, lookback to 24 and no dropout, and is depicted below. The aim is to find a learning rate that allows the model to learn in optimal and stable steps. The results show that with a learning rate of 0.00001 the model is not able to overfit due to the limited learning power, while for learning rates higher than 0.0005 the learning steps might exceed the optimum. The trade-off learning rate is 0.0001.
Once a neural network with enough capacity is defined, the next step is to apply dropout to avoid overfitting and obtain better results. Both standard dropout and recurrent dropout can be applied in recurrent networks. Standard dropout applies to the inputs/outputs of the layer, while recurrent dropout applies to the inner recurrent activations of the layer, using the same dropout mask at every timestep. In practice, recurrent dropout normally has a higher impact than a standard dropout mask that varies randomly from timestep to timestep, since such a varying mask might disrupt the error signal and be harmful to the learning process.

Therefore, a recurrent dropout sweep is depicted below. No recurrent dropout leads to strong overfitting, and the overfitting is mitigated when applying a certain level of recurrent dropout. Recurrent dropout = 0.75 makes the learning process hard and would require a larger number of epochs to converge. The trade-off for recurrent dropout lies between 0.25 and 0.5; 0.5 is finally selected since it might lead to better results when increasing the number of epochs.
The next picture depicts the standard dropout sweep. The overfitting trend reduces when applying standard dropout, but at the expense of a worse validation score. Thus, for this exercise standard dropout is kept at 0 and overfitting mitigation is managed by recurrent dropout.
Therefore, the definitive recurrent neural network is defined with the following parameters:
LOOKFORWARD = 24
LOOKBACK = 24
DROPOUT = 0
RECURRENT DROPOUT = 0.5
OUTPUTS = 32
LEARNING RATE = 0.0001
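
A sketch of training and evaluating this final configuration, reusing the generator and model helpers sketched earlier; the number of epochs and the step counts per epoch are illustrative, not the original settings:

model = build_gru_model(lookback=24, n_features=12, units=32,
                        dropout=0.0, recurrent_dropout=0.5,
                        learning_rate=1e-4)
model.fit(train_gen, steps_per_epoch=1400, epochs=40,
          validation_data=val_gen, validation_steps=350)
test_mae = model.evaluate(test_gen, steps=320)   # about 2.07 here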

Let's apply this model to the testing set for final model validation. Results are depicted below. The testing set MAE score is 2.07 °C, which improves on the baseline of 2.37 °C. Therefore, it confirms the learning power of the recurrent neural network.
CONVOLUTIONAL NEURAL NETWORK
As a last exercise, the same work is repeated, this time using a 1D convolutional neural network.

Note an introduction to convolutional neural networks can be found at the following link: https://www.behance.net/gallery/163223919/Convolutional-Neural-Network-Introduction

The convolutional network is defined with two 1D convolution layers, a max pooling layer in between and a global max pooling layer at the end. The convolutional neural network summary is depicted below.
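A minimal Keras sketch of this convolutional architecture, reusing the imports from the recurrent sketch; the number of filters and the kernel size are assumptions, since only the layer structure is described above:

def build_cnn_model(lookback, n_features, filters=32, kernel_size=5,
                    dropout=0.5, learning_rate=1e-4):
    model = keras.Sequential([
        keras.Input(shape=(lookback, n_features)),
        layers.Conv1D(filters, kernel_size, activation="relu"),
        layers.MaxPooling1D(2),        # downsample between convolutions
        layers.Conv1D(filters, kernel_size, activation="relu"),
        layers.GlobalMaxPooling1D(),   # collapse the time axis
        layers.Dropout(dropout),
        layers.Dense(1),               # linear output for regression
    ])
    model.compile(optimizer=keras.optimizers.RMSprop(learning_rate),
                  loss="mae")
    return model
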
The network parametrization remains the same as for the recurrent network, using lookback and lookforward = 24 and learning rate = 0.0001. To define the required dropout and avoid overfitting, a dropout sweep is performed and depicted below. The conclusion is that both 0.25 and 0.5 would be suitable dropout values. For this exercise, dropout = 0.5 is selected.

Note the execution time for the 1D convolutional network is around 10 seconds per epoch, whereas the recurrent network needed 50 seconds per epoch, so the 1D convolutional network is much faster.
Once the 1D convolutional network is defined, it must be applied to the testing set for final model validation. Results are depicted below. The testing set MAE score is 2.22 °C, which improves on the baseline of 2.37 °C but falls short of the 2.07 °C achieved by the recurrent network. Therefore, the convolutional network is able to identify some significant patterns in the timeseries, generating enough learning power to make more accurate predictions than the baseline. However, the results are inferior to the recurrent network's because convolutional networks, unlike recurrent networks, are not able to identify the chronological order of the timeseries patterns. The identified patterns are treated in the same way regardless of their temporal location, which ignores the fact that the immediate past becomes more significant when operating with timeseries.
CONCLUSIONS
- Both 1D convolutional and recurrent networks have been demonstrated to have learning power when applied to data sequences with some degree of correlation between past and future performance.

- Recurrent networks have memory capability thanks to considering past performance and past states when processing the current timestep, which enables chronological assessment of the sequence.

- 1D convolutional networks are much faster than recurrent networks, but they process each input window independently, identifying isolated patterns with no consideration of past performance.

- 1D convolutional networks are a great and faster alternative to recurrent networks for sequence processing not affected by chronological order, such as some natural language applications.

- Recurrent networks, in spite of their computational cost, are the best option for sequence processing affected by chronological order, such as timeseries, given their memory of past performance.
I appreciate your attention and I hope you find this work interesting.

Luis Caballero