- Marnix Hamelberg

# Healthy air quality predictions

In the __previous blog post__ we used machine learning to predict air quality. Here we went over the basics of a decision tree and the linear relationships between the input data. The nitrogen dioxide (NO2) predictions derived from the raw input variables, which included weather and road traffic data, can be seen in the graph below.

Initially the predictions look quite similar to the observed NO2 values, but there is definitely room for improvement. A common phrase in machine learning is **garbage in - garbage out**, nicely summarized by the XKCD comic below.

The comic speaks for itself, where we should prepare the input data properly to get any decent predictions and to understand the contribution of each variable. This is what we are going to do in this blog post. First, let us have a deeper look at the current prediction performance. The scatter plot below surrounded by linked histograms shows the current relationship between the air quality predictions and observations. Here the unprepared input data (garbage in) was used, resulting in suboptimal predictions (garbage out). This is visually clear by how far the related data points (gray dots) deviate from the black diagonal line. Ideally you want the data points as close as possible (low variance) and evenly distributed (no bias) around this line.

**Statistics**

Besides a visual interpretation of the relationship between the predictions and observations, we should also have a look at the hard numbers, provided at the bottom left of the scatter plot. These statistics are the correlation (**r**), coefficient of determination (**r²**), mean squared error (**mse**), root mean squared error (**rmse**), and residual prediction deviation (**rpd**). If you are already familiar with these statistics you can of course skip the following paragraph ;)

The **r** tells us something about how evenly the data points are distributed around the black diagonal line (i.e. identity line or 1:1 line). If all the data points would be evenly distributed around this line a high correlation is implied, where the **r** reaches 1. If there is little to no correlation, the **r** would be close to 0, and if they are negatively correlated, the **r** reaches -1. The slope and intersection of the **r **are visualized by a dashed red line. This red line follows the smallest distance between all the data points. The spread of the data points around this red line is quantified by the **r²**. Where an **r² **close to 1 implies that the data points are narrowly located around this line spreading further out as the **r² **decreases to 0. The exact values of this spread is quantified by the **mse**, which is the average of the squared distances between the data points and the red line. The square is taken to eliminate negative values (data points below the red line). To make this more readable, often the root is taken from the **mse**, simply indicated as the **rmse**. Ideally you want the **mse **and **rmse** to be as small as possible relative to the range of the tested data. To make this relative value more objective, the variances of the observed values are used to rationalize the variances between both datasets. This is the **rpd**, dividing the standard deviation of the observed values by the **rmse **of the observed and predicted data. The resulting ratio should ideally be higher than 2.

**Testing and improving predictions**

Now with the statistics out of the way, we can start testing and improving the predictions. Just like with stock prices, we are usually interested in mid to long term trends. Small fluctuations on an hourly basis are not that interesting and relevant, even disturbing our carefully planned stop losses. They also do not really say that much about the stock performance, as the overall trend is what matters. Just like with the input data for our prediction problem, volatility disturbs the overall trend of the data and makes it unnecessarily complicated. Besides, small fluctuations may also be caused by small errors in the sensors. Therefore, let us start with smoothing the input data over time. Since the data is temporal, we cannot just filter small outliers by a certain threshold over the whole datasets, as long term valid fluctuations could be removed. Instead, we look at a specific part in time. This part is called a window, which can be of a certain size, in this case 6 hours. The window moves over the timeline hour by hour replacing the hourly values by an average value of the neighboring 6 hours. This is visualized in the graphs below, where the dark lines are the smoothened raw data points by taking the moving mean.

After smoothening the data we get the following prediction results:

Statistics before temporal smoothing:

**r**=0.734, **r²**=0.539, **mse**=0.302, **rmse**=0.550, **rpd**=1.427

Statistics after temporal smoothing:
**r**=0.882, **r²**=0.778, **mse**=0.133, **rmse**=0.365, **rpd**=1.992

**Cyclical data**

The smoothened input data provides quite drastic improvements in the prediction performance, but there is still something fundamental we should not overlook. As mentioned by Brian de Vogel at Geodan, converting cyclical data is a necessary step for a decision tree not to get confused. Simply put, when looking for example at the wind direction, a direction of 356° is metrologically close to 5°, but numerically they are far apart. This can confuse a decision tree and the thresholds located in each statement. Where values around 0° cause completely different branching while in reality they form a resemblance. One solution for this is to convert the degrees of wind direction to numeric variables that represent the actual distances between each degree in the wind direction. This can be achieved by converting the arbitrary degrees to the trigonomic ratios of sine and cosine. The graph below shows the converted wind directions in the cartesian coordinate plane, where the variable wind gusts (i.e. speed) is indicated by color and dot size. Interesting to note, the South-West is the most wind intense, which makes sense in the Netherlands.

The same principle is applied to the temporal variable 'Hour of the day'. Again, where in actuality 23:00 and 01:00 are closely related while numerically far apart. The 24 hour day cycle is therefore converted to a sine and cosine variable resembling the cyclical nature of an analog clock. Now after converting the cyclical data properly, let us see how it impacts the prediction performance.

Statistics before cyclical data correction:

**r**=0.882, **r²**=0.778, **mse**=0.133, **rmse**=0.365, **rpd**=1.992

Statistics after cyclical data correction:

**r**=0.915, **r²**=0.838, **mse**=0.104, **rmse**=0.323, **rpd**=2.253

**Feature importance**

The cyclical correction worked! Especially the **rpd **saw a decent improvement of about 13% which is now higher than the elusive threshold of 2. Now it is all coming together, for one final optimization step where we will remove redundant input variables (i.e. features). In the previous blog post the feature importance was already discussed. This was based on the branching within the decision tree, which could provide a decent indication of the contribution of each feature. However, this method is limited to decision trees, with unforeseen biases still being present propagating down the branch statements. A model independent method is more appropriate for testing the importance of each feature. One such method is where you randomly shuffle the values of one single feature, rerun the prediction model, and see how much this shuffling contributed to the decrease in prediction performance. If this decrease in performance is neglectable, you may assume that this feature can be left out as it has a small contribution to the overall prediction performance. This method is called the permutation feature importance, where the results, together with the corresponding feature correlation, can be seen in the graph below.

Even though some features such as moisture and temperature are linearly correlated to NO2, their contribution to the overall model performance seems neglectable. In other words, we can throw them out. Still, you have to be careful with this as the features you remove may actually contribute to the prediction performance in other scenarios, say at other measurement stations or time ranges. Therefore, before any conclusive exclusion of any features, more tests at different locations and times ranges should be performed. For now for this specific dataset, let us remove the features below wind direction sin. After rerunning the prediction model, no significant performance hit was detected, indicating that these features are redundant.

**Final predictions**

Finally after all the preparation steps of the input data, we can do our final prediction. With some final tweaking, the performance of the air quality predictions at this specific location gives us the following results:

We started at a prediction performance of:

**r**=0.747, **r²**=0.557, **mse**=0.302, **rmse**=0.549, **rpd**=1.429

Working our way up to a prediction performance of:

**r**=0.935, **r²**=0.874, **mse**=0.086, **rmse**=0.294, **rpd**=2.474

Going back to the visual interpretation of the statistics as seen in the graph above, the red line representing the **r** is much closer to the black identity line. The spread of the data points around the red line is much narrower, resulting in a larger **r²**. The **mse **and **rmse **are closer to the zero relative to a range of about 1 to 4 log transformed NO2 concentrations. The **rpd **is well above 2, suggesting that the prediction model provides an objectively decent performance. Now this is not the whole story, as this is tested on a single location over a span of a week. In future blog posts a more scientific approach will batch analyze more locations to test the validity of the prediction model and methods. Before we get into the deep scientific stuff, let us look at something completely different and refreshing: the addition of **satellite data**!

**Satellite data**

No worries, we are reaching the end of this long blog post. Thanks for making it this far! In the next blog post we will go deeper into satellite and forecasting data that may replace the sparsely located ground sensors. The new satellite and forecasting data is spatially much denser, potentially providing predictions at locations devoid of ground sensors. As a sneak peak, the map below shows the sparsely located air quality ground sensors compared to a single satellite image of NO2 values captured by the Sentinel-5 Precursor mission.