Predicting air quality
Welcome to my blog, where I will update you on my work as a research intern at Geodan. In each blog post I will take you through my preliminary findings, examine technical details, and discuss potential obstacles. Most of this will make its way into a 'dynamic article', which will hopefully show the complete picture by the end of the four-month internship.
The goal of my research is to predict air quality using road traffic behavior and weather patterns. All three are measured by sensors on the ground. Data from these sensors consists of hundreds of thousands of hourly measurement values, changing over time and space. For example, as expected, the road traffic data shows a clear drop in the number of cars passing a given sensor when the coronavirus regulations were introduced. In the weather data, seasonal temperature fluctuations are visible and the wind direction is constantly changing. None of this is surprising. What is interesting is how these traffic and weather changes affect air quality. Of course you can measure air quality directly, but air quality sensors are sparsely located in the Netherlands, leaving large gaps with no air quality information at all. Luckily, the recently launched Sentinel-5P satellite fills most of these gaps, with the caveat that it only passes over the Netherlands once a day. More on this in a later blog post.
Let us jump into the actual prediction problem. Air quality sensors measure specific atmospheric components, such as nitrogen oxides (NOx) and particulate matter (PM). For now we will stick to predicting nitrogen dioxide (NO2). First we establish a relationship between NO2 and certain components from the road traffic and weather data. For this a machine learning model is used, namely a random forest regressor. The model uses the explanatory variables, i.e. road traffic and weather data, to predict the response variable, i.e. air quality data (here NO2). Hourly historical data points from roughly the past one and a half years, for both the explanatory and response variables, are fed into the random forest regressor. The model fits the data nonlinearly using an ensemble of decision trees (regression trees, since NO2 is a continuous quantity), each built with random sampling to reduce overfitting. When this process is completed, new road traffic and weather data can be fed to the model to predict NO2 concentrations for the same time span, at hourly intervals. The first results, spanning 7 days, can already be seen in the graph below.
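To make the fitting step a bit more concrete, here is a minimal sketch using scikit-learn's RandomForestRegressor. It is not my actual pipeline: the data is synthetic, and the three features, the made-up NO2 formula, and the settings (100 trees, a split after 1600 hours) are all assumptions chosen purely for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n_hours = 2000

# Synthetic stand-ins for the explanatory variables (NOT real sensor data).
traffic = rng.uniform(0, 1500, n_hours)      # vehicles per hour
wind_speed = rng.uniform(0, 12, n_hours)     # metres per second
temperature = rng.uniform(-5, 30, n_hours)   # degrees Celsius
X = np.column_stack([traffic, wind_speed, temperature])

# Made-up nonlinear response: traffic raises NO2, wind dilutes it, plus noise.
no2 = 10 + 0.03 * traffic / (1 + 0.3 * wind_speed) + rng.normal(0, 2, n_hours)

# Fit on the first 1600 hours, then predict the remaining 400 hours.
split = 1600
model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X[:split], no2[:split])

predicted = model.predict(X[split:])
r2 = model.score(X[split:], no2[split:])  # R^2 on the held-out hours

print(f"Predicted {len(predicted)} hourly NO2 values, R^2 = {r2:.2f}")
```

Each tree in the forest is grown on a bootstrap sample of the training rows, and the forest's prediction is the average over all trees, which is what keeps a single overfitted tree from dominating.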
Air quality (NO2) data
The gray line represents the observed NO2 concentrations, i.e. the actual measurements at an air quality sensor. The black line represents the predicted NO2 concentrations. The graphs below show some of the road traffic and weather data components used to predict the NO2 concentrations. In a later blog post I will go into more detail about the prediction accuracy.
Road traffic data
The predictions are made possible by fitting the model to historical data from the 1st of January 2019 to the 6th of May 2020; data outside this range is not used for fitting. When dragging the above graphs to the right you can see the historical data used for fitting the machine learning model. During the prediction period, the road traffic and weather data are used as explanatory variables, whilst the observed NO2 data serves only as a reference to compare the predictions against. Comparing the road traffic and weather patterns to the NO2 predictions, it is actually hard to tell which explanatory variables had the most influence on the predictions. This is something we will tackle in my next blog post, diving deeper into the relationships between the datasets and how the prediction model actually works.
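The separation between the fitting period and the prediction period can be sketched with nothing more than Python's datetime module. The fitting dates are the ones mentioned above; that the 7 prediction days directly follow the fitting period, and that the 6th of May is included up to its last hour, are my assumptions for this sketch. The names (hourly_range, FIT_START and so on) are hypothetical.

```python
from datetime import datetime, timedelta

FIT_START = datetime(2019, 1, 1, 0)
FIT_END = datetime(2020, 5, 6, 23)   # assumption: last fitted hour, inclusive
PREDICT_DAYS = 7                     # assumption: window starts right after FIT_END


def hourly_range(start, end):
    """Return every hourly timestamp from start to end, both inclusive."""
    hours = int((end - start).total_seconds() // 3600)
    return [start + timedelta(hours=h) for h in range(hours + 1)]


predict_start = FIT_END + timedelta(hours=1)
predict_end = predict_start + timedelta(days=PREDICT_DAYS) - timedelta(hours=1)

fit_hours = hourly_range(FIT_START, FIT_END)
predict_hours = hourly_range(predict_start, predict_end)

# The model must never see the prediction window while fitting.
assert max(fit_hours) < min(predict_hours)

print(len(fit_hours), "hours for fitting,", len(predict_hours), "hours to predict")
```

Under these assumptions the fitting period covers 11,808 hourly timestamps per sensor and the prediction window covers 168. Splitting by time rather than at random matters here, because neighbouring hours are strongly correlated and a random split would leak information from the prediction period into the fit.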
I hope you enjoyed reading my first blog post! If you have any thoughts or feedback, feel free to leave a comment below.