- Marnix Hamelberg

# Trees for air quality

In machine learning, as the name suggests, a machine figures out how pieces of information relate to each other. The machine learns that if something is this, then something else should be that. Take for example the total size and the number of rooms of apartments in Amsterdam. Each combination of size and rooms has a (realistically depicted) rental price attached to it, as seen in the table below.

**Apartment specifications**

It seems logical to us that an increase in size and rooms results in a higher price. We know this from looking at the table (and of course from real-world experience). Now the trick is to make a machine learn this too, and even give us a prediction of the price when it knows the size and rooms of a new apartment. Different algorithms have been developed to perform this task. One such algorithm is a **decision tree**, which is in essence very simple. But how exactly does it work? Let us break down a decision tree for the apartment variables. The tree below shows the steps one has to take, using the size and rooms of an apartment, to get the corresponding price.

The decision tree uses simple rules:

1. Start at the top statement.

2. If the statement is true, go to the right.

3. If the statement is false, go to the left.

4. The final statement is the expected outcome.

**A simple decision tree**

Take for example apartment #2 and go down the decision tree step by step, starting at the top. Since apartment #2 has two rooms, we must go to the left, as 2 ≤ 1.50 is false. At the second statement, the number of rooms does satisfy the statement, so we must go to the right. Here only two options are left, as the tree has already accounted for the number of rooms. Now size is what matters. The size is below 31.50 m², making us go right. And indeed, we get to the correct price of €1100 as seen in the table above. Now imagine we want to know the price of a new apartment with a size of 12 m² and 1 room. Going down the tree gives a predicted price somewhere above €600 but not higher than €850.
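The four rules above can be sketched in a few lines of Python. This is only an illustration: the thresholds 1.50 and 31.50 and the €1100 outcome come from the walk-through, while the second rooms threshold, the other prices, and the exact size of apartment #2 are made-up placeholders, since the table and tree figure are not reproduced here.

```python
# Each internal node holds a statement "feature <= threshold";
# each leaf holds a predicted price in euros.
# Only 1.50, 31.50 and the 1100 leaf come from the text;
# every other number is a hypothetical placeholder.
tree = {
    "feature": "rooms", "threshold": 1.50,
    "true": {"price": 725},                      # hypothetical one-room leaf
    "false": {
        "feature": "rooms", "threshold": 2.50,   # hypothetical threshold
        "true": {
            "feature": "size", "threshold": 31.50,
            "true": {"price": 1100},
            "false": {"price": 1300},            # hypothetical
        },
        "false": {"price": 1600},                # hypothetical
    },
}


def predict(node, apartment):
    """Start at the top, branch on each statement, stop at a final outcome."""
    while "price" not in node:
        holds = apartment[node["feature"]] <= node["threshold"]
        node = node["true"] if holds else node["false"]   # true = right, false = left
    return node["price"]


apartment_2 = {"rooms": 2, "size": 30}    # size is an assumed value below 31.50
new_apartment = {"rooms": 1, "size": 12}

print(predict(tree, apartment_2))     # 1100
print(predict(tree, new_apartment))   # 725
```

A real decision tree learns these thresholds and outcomes from the data instead of having them typed in by hand, but the way a prediction is read off is the same.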

Now this same principle is applied to predict air quality. Here, road traffic data and weather data take the place of size and rooms, and air quality takes the place of price. When predicting air quality, many more variables with thousands of data points are at play. Trees resulting from this data can become ridiculously large. See below a static image of a single decision tree at maximum depth to predict air quality. Each tiny blue dot (i.e. node) is a statement, and the final node is the expected outcome based on all the statements above it.

**A full decision tree to predict air quality**

For demonstration purposes, a decision tree is used with only a limited depth of decision branches. In reality there are many more branches going into finer and finer detail. Using the shallow tree below we can trace statements down to an expected air quality value, in this case a nitrogen dioxide (NO2) concentration. What is already interesting to note is that some variables, such as wind gusts and wind direction, play very decisive roles in the branching, both in position (high in the hierarchy) and in quantity (frequently stated). The decisiveness of the variables (i.e. features) can be formalized as the **feature importance** (i.e. the influence each feature has on the decided outcome). A high importance can work out either negatively or positively for the prediction, depending on the quality and relevance of the feature.

**A single decision tree to predict air quality (NO2)**

(note: it has a maximum depth of 5 statements)
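To make the idea of a depth-limited tree and its feature importance concrete, here is a minimal sketch. The post does not say which library was used, so scikit-learn's `DecisionTreeRegressor` is an assumption, and the feature names, value ranges and the relationship to NO2 are all synthetic stand-ins for the real data.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(seed=0)
n = 500

# Synthetic stand-ins for the real features (names and ranges are assumptions).
wind_gust = rng.uniform(0, 20, n)          # m/s
wind_direction = rng.uniform(0, 360, n)    # degrees
traffic_count = rng.uniform(0, 2000, n)    # vehicles per hour
X = np.column_stack([wind_gust, wind_direction, traffic_count])

# Made-up relationship: NO2 drops with stronger gusts and rises with traffic.
no2 = 40 - 1.5 * wind_gust + 0.01 * traffic_count + rng.normal(0, 2, n)

# Limit the tree to at most 5 statements from top to bottom.
tree = DecisionTreeRegressor(max_depth=5, random_state=0)
tree.fit(X, no2)

# Importances are normalized so that they sum to 1.
names = ["wind_gust", "wind_direction", "traffic_count"]
for name, importance in zip(names, tree.feature_importances_):
    print(f"{name:15s} {importance:.3f}")
```

In this toy data the wind direction has no influence on NO2 at all, so it should end up with the lowest importance. With real measurements the ranking can look very different.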

The feature importance and the branching that comes with it are a very important aspect of decision trees. For an individual tree, a bias can occur where a certain path down the branches takes you to an inaccurate, biased or non-optimal outcome. This may be because the branching further down the line did not account for certain relationships in the data. Say you take a path down to the left, reaching the Road traffic count statement. Here you realize the Road traffic count value deviates drastically from this statement, and corresponds more with a Road traffic count threshold elsewhere in the tree. Even knowing this, there is no way back at this point, and you are essentially stuck with a subpar outcome. This may happen every time the top statement is false, never reaching the right side of the tree. To account for this error, a method has been developed that involves a multitude of individual decision trees, each built from randomly sampled values of each feature. Each of these trees, and there can be hundreds of them, follows its own branching based on its randomly sampled feature values, so each produces a slightly different outcome. All the outcomes of these trees are then aggregated by taking either their mean (continuous data) or mode (discrete data). This provides a less biased and more accurate prediction, as errors of individual trees cancel out and extremes are filtered out, in the spirit of the law of large numbers. The algorithm for these randomly sampled trees is called a **random forest model**.
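The aggregation step can be illustrated without building any real trees. The sketch below pretends that each of 100 trees is off by its own random error around a hypothetical true NO2 value, which is an idealisation: trees in a real forest are correlated, so the gain is smaller in practice. All numbers and class labels are made up.

```python
import statistics

import numpy as np

rng = np.random.default_rng(seed=0)

true_no2 = 30.0   # hypothetical "true" NO2 concentration
n_trees = 100

# Pretend every tree makes its own independent error (an idealisation).
tree_predictions = true_no2 + rng.normal(loc=0.0, scale=5.0, size=n_trees)

# Continuous outcome: aggregate with the mean.
forest_prediction = tree_predictions.mean()

typical_tree_error = np.abs(tree_predictions - true_no2).mean()
forest_error = abs(forest_prediction - true_no2)
print(f"typical single-tree error: {typical_tree_error:.2f}")
print(f"forest error:              {forest_error:.2f}")

# Discrete outcome: aggregate with the mode (majority vote).
tree_votes = ["good", "moderate", "good", "poor", "good"]   # hypothetical classes
forest_vote = statistics.mode(tree_votes)
print(forest_vote)   # good
```

The error of the averaged prediction can never exceed the average error of the individual trees, and it is usually far smaller because overshoots and undershoots cancel each other.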

**Exploring and relating data**

Now that we know more about the underlying principles of this relatively simple but powerful machine learning algorithm, we can finally start using it for predicting air quality. It is important to prepare the data in a way that avoids disruptions and biases in tree branching. Before we do that, we will explore the raw unprocessed features, assess how they relate to each other, and see how well they predict air quality.

Let us start by taking a look at the relationships between all features. These relationships do not necessarily have to be linear. The random forest regressor is very good at finding these nonlinear relationships, while for us humans it can be quite hard to directly observe any interconnection. See my __previous blog post__ where the air quality (gray line in the top graph) is compared to underlying features (red and blue lines in the bottom two graphs). It is difficult to see from each individual graph how the lines relate to each other. Correlating all features provides some insight. See the correlation matrix below displaying the Pearson correlation coefficient (**r**) for all feature combinations.

**Correlation matrix**

You can see there is some correlation between the features, but mostly it is not very strong, which means they could all make an individual contribution to the nonlinear prediction of air quality. This raises the question: how does each feature actually relate to air quality? The bar plot below visualizes these relationships, followed by scatter plots with linked histograms showing the marginal distribution of each feature versus a logarithmic (log) transformation of air quality. This log transformation is performed to convert the exponential nature of the raw NO2 values to the linear nature found in the other features. This is already one preprocessing step that increases the prediction performance of the random forest regressor.
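A correlation matrix and the log transformation take only a few lines with NumPy. The sketch below uses synthetic data in which the log of NO2 depends linearly on two features. The feature names, coefficients and noise level are assumptions chosen for illustration, not values from the real dataset.

```python
import numpy as np

rng = np.random.default_rng(seed=0)
n = 1000

wind_gust = rng.uniform(0, 20, n)
traffic_count = rng.uniform(0, 2000, n)

# Made-up skewed NO2 values: the log of NO2 is linear in the features.
no2 = np.exp(3.5 - 0.05 * wind_gust + 0.0003 * traffic_count
             + rng.normal(0, 0.3, n))

# Pearson r for every combination of columns (3 x 3 matrix).
data = np.column_stack([wind_gust, traffic_count, no2])
corr_raw = np.corrcoef(data, rowvar=False)
print(np.round(corr_raw, 3))

# Correlation of one feature with raw versus log-transformed NO2.
log_no2 = np.log(no2)
r_raw = np.corrcoef(wind_gust, no2)[0, 1]
r_log = np.corrcoef(wind_gust, log_no2)[0, 1]
print(f"wind gust vs NO2:      r = {r_raw:.3f}")
print(f"wind gust vs log(NO2): r = {r_log:.3f}")
```

The diagonal of the matrix is always 1, since every feature correlates perfectly with itself, and a negative **r** indicates an inverse relationship.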

**Feature correlation to air quality**

**Further visualizing the feature correlation to air quality**

(note: not all data points and features are included)

A few insights can be gained from the correlations and the scatter plots. It is clear that some features are inversely correlated, which is perfectly valid for a prediction model. Most features barely have any direct correlation to air quality, with large variances around the regression line resulting in a low coefficient of determination (**r²**). But interestingly enough, this does not necessarily mean a feature will perform poorly in the random forest model, as there may be a nonlinear relationship that we as humans do not immediately observe. Take for example the road traffic count from the road traffic data. It has a weak **r** of 0.200 and an **r²** of 0.040, but looking at the calculated feature importance in the graph below, it ranks pretty high in terms of the decisive influence it has in the decision tree.
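A toy example shows how a feature can have almost no linear correlation and still be very useful to a tree. The U-shaped relationship below is made up for illustration and is not taken from the air quality data, and scikit-learn is again an assumed choice of library.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(seed=0)

# Made-up U-shaped relationship: y is high at both ends of x.
x = rng.uniform(-1, 1, 2000)
y = x**2 + rng.normal(0, 0.05, 2000)

# A straight line barely describes this relationship.
r = np.corrcoef(x, y)[0, 1]
r_squared = r**2
print(f"Pearson r = {r:.3f}, r squared = {r_squared:.3f}")

# A shallow tree can split x into ranges and follow the curve.
tree = DecisionTreeRegressor(max_depth=4, random_state=0)
tree.fit(x.reshape(-1, 1), y)
tree_r2 = tree.score(x.reshape(-1, 1), y)   # R² on the training data itself
print(f"tree R squared = {tree_r2:.3f}")
```

Note that the tree is scored on the same data it was trained on, which is optimistic. The point is only that a near-zero **r** does not rule out a strong nonlinear relationship.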

**Correlation versus feature importance**

We can conclude that most linear relations between features and air quality say little about their final influence on the predictive power of a decision tree, but they do give some indication of what to expect. See for example the wind gusts feature, which both has a high inverse correlation and ranks at the top in feature importance (i.e. high up in the decision tree with numerous statements). The wind direction also has a very high feature importance, but barely any correlation. This is problematic, but we leave it for the next blog post. The scatter plots also give us more insight into the data continuity and potential outliers.

**Predicting air quality**

So now that we know a bit more about decision trees, we can finally make our prediction using the random forest model. It uses 100 random trees and takes the mean of all individual tree predictions (in this case NO2 concentrations) at an hourly interval over 7 days. In order to make this prediction, the model established a relationship between data from a single air quality sensor and nearby road traffic and weather data in the same time intervals. See the graph below with the air quality predictions and the actual observed values. The observed values were of course not included when building the trees, and are used to assess the prediction accuracy.
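A setup like this could look roughly as follows with scikit-learn's `RandomForestRegressor`. This is a sketch under assumptions, not the code behind the graphs: the library choice, the 60 days of hourly records, the feature names and the synthetic relationships are all invented for illustration. Only the 100 trees, the mean aggregation and the 7 days of hourly predictions come from the text.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(seed=0)
hours = 24 * 60                       # 60 days of hourly records (assumed)
hour_of_day = np.arange(hours) % 24

# Synthetic features and target; names and relationships are assumptions.
traffic_count = (1000 + 800 * np.sin((hour_of_day - 8) / 24 * 2 * np.pi)
                 + rng.normal(0, 100, hours))
wind_gust = rng.uniform(0, 20, hours)
no2 = 20 + 0.015 * traffic_count - 0.8 * wind_gust + rng.normal(0, 3, hours)

X = np.column_stack([traffic_count, wind_gust])

# Hold out the final 7 days (168 hours); the model never sees their NO2 values.
test_hours = 7 * 24
X_train, X_test = X[:-test_hours], X[-test_hours:]
y_train, y_test = no2[:-test_hours], no2[-test_hours:]

forest = RandomForestRegressor(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)
predicted = forest.predict(X_test)

# The forest output is the mean of the 100 individual tree predictions.
per_tree = np.array([tree.predict(X_test) for tree in forest.estimators_])
agrees = bool(np.allclose(per_tree.mean(axis=0), predicted))
print(agrees)
```

Keeping the last 7 days out of training mirrors the idea that the observed values are only used afterwards to check the prediction.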

**Predicted air quality: NO2**

**Further visualizing the observed versus the predicted air quality**

It is clear that the random forest model is doing something right, as there is a correlation (**r** of 0.747) between the observed and predicted air quality values. The spread around the regression line is still relatively high (**r²** of 0.557). The mean squared error (**mse**) and root mse (**rmse**) should be smaller, and the residual prediction deviation (**rpd**) should preferably be above 2.
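These scores can be computed from two arrays of observed and predicted values. The sketch below assumes that **r²** is the square of the Pearson **r** (consistent with the values quoted above) and that **rpd** is the standard deviation of the observed values divided by the **rmse**, which is a common definition but is not spelled out in the post. The function name `evaluate` and the four example values are hypothetical.

```python
import numpy as np


def evaluate(observed, predicted):
    """Return r, r squared, mse, rmse and rpd for two equally long sequences."""
    observed = np.asarray(observed, dtype=float)
    predicted = np.asarray(predicted, dtype=float)

    r = np.corrcoef(observed, predicted)[0, 1]    # Pearson correlation
    mse = np.mean((observed - predicted) ** 2)    # mean squared error
    rmse = np.sqrt(mse)                           # same units as NO2
    # Assumed definition: sample standard deviation of observations over rmse.
    rpd = np.std(observed, ddof=1) / rmse
    return {"r": r, "r2": r**2, "mse": mse, "rmse": rmse, "rpd": rpd}


# Hypothetical NO2 values, only to show the call.
observed = [10, 20, 30, 40]
predicted = [12, 18, 33, 37]

scores = evaluate(observed, predicted)
for name, value in scores.items():
    print(f"{name}: {value:.3f}")
```

An **rpd** above 2 means the prediction errors are less than half as large as the natural spread in the observations.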

Now that we know how a decision tree works, we can use that to our advantage and optimize branching by properly preprocessing the data. This includes temporal smoothing, converting cyclical data, and feature selection. More on this in my __next blog post__, where we will boost our prediction performance.