DeepAI
Log In Sign Up

Bias detection of PM_2.5 monitor readings using hidden dynamic geostatistical calibration model

Accurate and reliable data stream plays an important role in air quality assessment. Air pollution data collected from monitoring networks, however, could be biased due to instrumental error or other interventions, which covers up the real pollution status and misleads policy making. In this study, motivated by the needs for objective bias detection, we propose a hidden dynamic geostatistical calibration (HDGC) model to automatically identify monitoring stations with constantly biased readings. The HDGC model is a two-level hierarchical model whose parameters are estimated through an efficient Expectation-Maximization algorithm. Effectiveness of the proposed model is demonstrated by a simple numerical study. Our method is applied to hourly PM_2.5 data from 36 stations in Hebei province, China, over the period from March 2014 to February 2017. Significantly abnormal readings are detected from stations in two cities.

READ FULL TEXT VIEW PDF

page 1

page 2

page 3

page 4

03/27/2020

AirRL: A Reinforcement Learning Approach to Urban Air Quality Inference

Urban air pollution has become a major environmental problem that threat...
11/05/2018

Matrix Completion With Variational Graph Autoencoders: Application in Hyperlocal Air Quality Inference

Inferring air quality from a limited number of observations is an essent...
10/15/2021

Spatially Adaptive Calibrations of AirBox PM_2.5 Data

Two networks are available to monitor PM_2.5 in Taiwan, including the Ta...
06/16/2020

A joint bayesian space-time model to integrate spatially misaligned air pollution data in R-INLA

In air pollution studies, dispersion models provide estimates of concent...
01/12/2021

HighAir: A Hierarchical Graph Neural Network-Based Air Quality Forecasting Method

Accurately forecasting air quality is critical to protecting general pub...
11/19/2020

Interpretable and Transferable Models to Understand the Impact of Lockdown Measures on Local Air Quality

The COVID-19 related lockdown measures offer a unique opportunity to und...

1. Introduction

In recent years, China has experienced severe regional air pollutions (Hu et al., 2010), because of rapid industrialization and alarming increase in energy consumption. Particulate matters with diameter smaller than 2.5 microns () is one of the major pollutants and receives a lot of public attention. Lots of studies have shown that is very harmful to human health through penetrating deeply into lung tissues and entering the bloodstream. For example, long-term exposure to may cause lung cancer (Pope III et al., 2002) or cardiorespiratory failure (Hoek et al., 2013), and regional pollution may increase mortality rate (e.g. Schwartz et al. (1996), Shang et al. (2013) and Lelieveld et al. (2015)). High concentrations of also influence people’s daily life such as outdoor activities and routine business (e.g. Pope et al. (1995) and Chapko and Solomon (1976)).

The Chinese government identifies the needs for air quality assessment and emission control, and has built a large monitoring network all over the country since 2013. Now there are over 1,500 national stations in over 300 cities. Hourly readings of air pollutants are regularly recorded and directly transferred to China National Environmental Monitoring Center (CNEMC). To guarantee appropriate air pollution management, accurate and reliable data stream is pressingly required, otherwise serious health damages and economic implications would follow. Data quality of monitor readings, however, may vary across regions and time. It can be affected by instrumental failures, improper maintenance or certain interventions (e.g., using sprinklers and fog guns around monitoring stations). So far, the CNEMC simply picks up outliers by visual checking, which requires a lot of manpower and may not identify abnormal data promptly. Therefore, an efficient and objective way to detect inaccurate readings is urgently needed.

Methods to evaluate the reliability of air pollution data are limited in literature. In Fassò et al. (2007), a spatio-temporal calibration model for data in Italy was developed to manage a heterogeneous air pollution monitoring network caused by instrumental biases, where some instruments were known to underestimate the “true” level. In their method, dimension reduction was addressed by deterministic basis function from decomposition of the spatial covariance. However, such a technique failed to provide a sufficient description of a spatio-temporal process evolution (Huang et al., 2007). Another work from Ghanem and Zhang (2014)

claimed that the evidence of data manipulation could be found in the discontinuities of the probability density functions of air pollution index (API, usually determined by

) around the threshold value of 100 for different Chinese cities ( indicating “blue-sky days”; otherwise, indicating “non-blue-sky days”). This choice of threshold was too arbitrary, though informative in some cases, can not be directly applied to general pollutants and more complex pollution scenarios.

Due to the complex nature of environmental data, especially their underlying spatiotemporal structure, it is better to fully use the information from the nearby monitoring stations and time points to calibrate the readings of target site. Spatiotemporal statistical model, which considers both the spatial covariance and temporal dependence, is thus a good and natural choice (Cressie and Wikle, 2015). Huang and Cressie (1996) introduced a widely used spatiotemporal model in environmental field, which included a dynamic random field with a separable spatiotemporal covariance structures. By referring the model as the hidden dynamic geostatistical (HDG) model, Calculli et al. (2015) generalized the HDG model by adding the effects from covariates. Although the generalized HDG model performed well in modelling air pollutant dynamics, it cannot be directly applied to calibration problems.

In our study, based on the generalized HDG model, we develop a hidden dynamic geostatistical calibration (HDGC) model to identify biased readings from monitoring stations. Particularly, in the HDGC model, we add a calibration component for each station to account for the station-specific effect. If certain calibration component is substantially lower than neighboring stations, it will send an alarm that those stations have abnormal readings. In addition to incorporating the calibration component, the HDGC model can also take covariates into consideration, and reveal spatial correlation structure and temporal dynamic mechanism in data. Therefore, we explore the association between and other pollutants, meteorological variables, as well as geographical factors. To some degree, it helps provide insight of the formation mechanism of pollution. In the study, we apply the modified HDGC model to readings of 36 stations in Hebei, China, a province often suffering from severe air pollution where many heavy industry factories are deployed.

The paper is organized as follows. In Section 2, we describe the data of the study region. In Section 3, we establish the HDGC model with calibration components, and introduce Kalman filter and expectation maximization (EM) algorithm for parameter estimation, following by variance covariance matrix for model parameters. Then we conduct a simulation study to validate the performance of our model in Section 4. We give a comprehensive interpretation of the results in Sections 5 and provide conclusions in Section 6.

2. Data Description

In this study, we specially target Hebei province, as it is one of the most heavy air polluted areas in China. In the list of top ten polluted cities in China, more than half belong to Hebei province. This province is thus the main target of complaints regarding air pollution. Hebei province locates in north China with most of its central and southern parts lie within the North China Plain. The western part of Hebei rises to the Taihang Mountain, while the eastern part is bordered by Bohai Gulf. In the province, there locates many heavy industry enterprises, especially iron and steel manufactures. For years, the coal consumption in Hebei has been the highest among all provinces in China.

In this study, we collect the hourly concentration of (in ) from 36 monitoring stations from 8 cities in Hebei for three seasonal years from March 2014 to February 2017. The spatial locations of the 36 stations are shown in Figure 1 (highlighted with black dots).

For covariates, firstly, we consider one-hour lagged concentration of other four pollutant gases: sulfur dioxide (), nitrogen dioxide (), ozone (), and carbon monoxide (), since a great portion of contains secondary particles, which are formed in the atmosphere through chemical reactions involving other accumulated gases in the past few hours (Wang et al., 2013). The four gases are all measured in . Secondly, we collect meteorological data: barometric pressure (, in hectopascal), air temperature (, in degree celsius), dew point temperature (, in degree celsius), integrated rainfall (, in millimeter), integrated wind speed (, in meter per second), and wind directions in four categories (northeast, northwest, southeast, and south west), as well as the interaction between integrated wind speed and wind directions. Previous studies have shown that there are non-neglectable meteorological impacts on (Liang et al., 2015). Moreover, we measure the distance of each site to the Taihang Mountains (, in kilometer) to further eliminate the influence of different dispersion conditions caused by topography.

Both the meteorological variables and pollutants are recorded in the same monitoring network, which can be matched for each monitoring station. In Table 1, we show each station’s number, name, and the city where it locates. Due to significantly different patterns of across the four seasons, and the association between air pollution and the meteorological variables can also vary across seasons (e.g. Zheng et al. (2005), Ito et al. (2007) and Liang et al. (2015)), we estimate the model for each season respectively. The stations with missing value rate of more than 20% in one season are removed from the data set. All the variables are standardized to ease the comparison of their effects.

Site City Site Names
16 Baoding Dibiaoshuichang  Huadianerqu  Jiancezhan
Jiaopianchang  Jiedaizhongxin  Youyongguan
79 Cangzhou Cangxianchengjianju  Dianshizhuanbozhan
Shihuanbaoju
1013 Handan Congtaigongyuan  Dongwushuichulichang
Huanbaoju  Kuangyuan
1416 Hengshui Dianjibeichang  Shihuanbaoju  Shijiancezhan
1720 Langfang Beihuahangtianxueyuan  Jiancezhongxin
Kaifaqu  Yaocaigongsi
2126 Shijiazhuang Gaoxinqu  Renminhuitang  Shijigongyuan
Xibeishuiyuan  Xinangaojiao  Zhigongyiyuan
2732 Tangshan Gongxiaoshe  Leidazhan  Shierzhong
Taocigongsi  Wuziju  Xiaoshan
3336 Xingtai Dahuoquan  Luqiaogongsi  Shihuanbaoju
Xingshigaozhuan
Table 1. 36 monitoring stations in Hebei Province, displaying in specific city and location

Figure 1. Spatial locations of the 36 monitoring stations in Hebei, China (black dots).

3. Methodology

In this section, we first introduce the setting of the hidden dynamic geostatistical calibration model. In the model, we add calibration components to account for the heterogeneity among readings from different stations. Next we introduce Kalman filter and smoother to obtain the estimation of hidden random variable. Then the main process of EM algorithm

(Krishnan and McLachlan, 1997) is discussed to obtain the maximum likelihood estimates of model parameters. Eventually, computational information matrix is evaluated to assess the accuracy of model parameters.

3.1. The hidden dynamic geostatistical calibration model

Denote as the measured concentration of at site , and a discrete time . Denote as the set of locations of an environmental monitoring network with sties, each of which is identified by its latitude and longitude coordinates. Exogenous variables, or fixed impact can enter the observation equation (1) to make it more extensive. Hence, we include fixed effects of dimension to account for the effects of other pollutants, meteorological variables and geographic information. The residual is a time-space independent Gaussian measurement error with variance . We also assume that is independent of and . Our model can be expressed as follows,

(1)
(2)

In this model, represents the observed concentration at location and time . After ruling out the influences from other pollutants, meteorological factors and dispersion conditions, the random effect can be considered as the “true” emission of at location and time as discussed in Fassò et al. (2007). The are the calibration components accounting for heterogeneity among readings of different stations. The relationship between the “true” emission and observed is assumed to be similar in trend for all the stations. That is, as long as there is no instrumental error or interference, the calibration component should be similar for all stations .

The state equation (2) determines the rule for generation of latent variable from past , reflected by an order one autogression . The innovation is a sequence of unit-variance time homogenous Gaussian random fields, with exponential spatial covariance function , where is the Euclidean distance between and , and is a scale parameter.

To give the matrix presentation, we introduce the observations at time stored in the

dimension vector

Hence we can rewrite the model in matrix notation as follows

(3)

where is a design matrix of dimension , with elements given by dimension vector in analogy to equation 1. Moreover, the random effect is a also dimension vector. Correspondingly, the calibration parameter is an matrix given by

Based on the matrix presentation, we show how to obtain unobserved latent variable, parameter estimation and variance covariance matrix. Since the HDGC model contains random effect and air pollution data has missing values, the expectation maximization (EM) algorithm is a natural choice for parameter estimation (Smith et al., 2003). Consistency of the estimators, especially the ’s, are guaranteed under Gaussianity assumptions of the random effects (see Chapter 6 in Sumway and Stoffer (2006); Wu (1988)).

3.2. Kalman filter and smoother

We define the conditional mean and conditional variance covariance for the underlying hidden variable , given the observed data to time ,

For simplicity, we use the notation when .

The Kalman filter specifies how to update from to once the new observation is obtained, without having to reprocess the entire dataset . The filtered values are

where is called Kalman gain. It is obvious the Kalman predicted value should be given by

Now we consider the problem of obtaining estimators for based on the entire data sample , namely . The Kalman smoother recursion for are as follows,

3.3. EM algorithm

Suppose for observed data and latent

, we have the following conditional Gaussian probability distributions,

:

:

:

Let the unknown parameter set be . The complete-data log-likelihood function can be written as,

The initial are set as estimated from least square estimate, for all stations , and the other parameters . Conditioning on the observed data , we calculate the expectation of the complete data log-likelihood function under the parameter ,

At the M-step, the parameter set maximizes the conditional expectation function. It can be shown that, for the iteration, the parameter set is updated as below,

:
(4)

where is the unit vector with the s-th component equals to 1.

:
(5)
:
(6)
:
(7)
:
(8)

where

Since the geographic parameter can not be solved in a closed form, we update through numerical optimization by the Nelder-Mead algorithm. The EM algorithm stops when the change of Euclidean distance of the parameter set and log likelihood are both significantly small.

3.4. Computational information matrix from marginal likelihood

The direct likelihood computation is based on matrix, which is exceedingly high dimension. Therefore, based on the kalman filter we calculate the information matirx, which is computationally affordable as it requires inversions of dimensional matrices. Standard results in Sumway and Stoffer (2006) 6.3 allow the computational information matrix. And we define

where and are the Kalman filter outputs at convergence. The loglikelihood function excluding an additive constant is

Denote the information matrix , we can obtain the variance-covariance matrix by,

where

and the involved derivatives are given in the sequel given by recursions.

  • ,

  • ,

  • ,

  • ,

  • ,

The closed form for the derivativers is given in the following to compute the above recursions,

  • if and else;

  • if and else;

  • if and else;

  • if and else;

  • if and else;

4. Simulation study

To demonstrate the ability of the calibration component to detect biased readings from certain site, we conduct the following simulation studies. The responses are generated from the model with a grid cell,

(9)
(10)

where , and . The residuals in equation (9

) are independently generated from a common Gaussian distribution with zero mean and

. The innovation in equation (10) is generated from a unit-variance time homogenous Gaussian random fields, with scale parameter used for the exponential spatial covariance. The 9 stations are arranged in Figure 2.

Figure 2. Grid cell of 9 stations.

We consider two different scenarios. In the first scenario, we assume site 5 in the center is the site with biased readings. We set , while for other stations. In the second scenario, the site with biased readings is at the corner (site 3), with , while

for other stations. For each scenario, we report the empirical means, standard error, and 95% confidence intervals (CIs) of the estimated parameters in Table

2 and 3 based on 1000 simulation. From Table 2 and 3, we can see that no matter where the suspicious site locates, it can be identified by the estimated . Moreover, other parameters , and can be accurately estimated.

g
T=100 0.8 0.8 0.8 0.8 0.5 0.8 0.8 0.8 0.8 0.5 100 0.1
Mean 0.828 0.828 0.826 0.830 0.522 0.831 0.828 0.828 0.830 0.495 95 0.108
Sd 0.048 0.049 0.049 0.048 0.046 0.050 0.048 0.049 0.048 0.041 10 0.005
LB 0.734 0.734 0.726 0.735 0.433 0.723 0.731 0.734 0.718 0.411 77 0.098
UB 0.928 0.927 0.923 0.924 0.612 0.933 0.929 0.930 0.926 0.571 117 0.121
T=500 0.8 0.8 0.8 0.8 0.5 0.8 0.8 0.8 0.8 0.5 100 0.1
Mean 0.825 0.824 0.825 0.824 0.521 0.825 0.825 0.823 0.825 0.500 95 0.106
Sd 0.019 0.018 0.019 0.018 0.018 0.018 0.018 0.019 0.019 0.017 4.4 0.002
LB 0.788 0.790 0.790 0.786 0.485 0.789 0.792 0.786 0.789 0.465 86 0.102
UB 0.866 0.862 0.867 0.860 0.559 0.861 0.864 0.865 0.864 0.534 103 0.110
Table 2.

Scenario 1. Site 5 is the site with biased reports. Based on 1000 simulation, we show mean, standard deviation and 95% lower and upper bounds of estimate results.

g
T=100 0.8 0.8 0.5 0.8 0.8 0.8 0.8 0.8 0.8 0.5 100 0.1
Mean 0.831 0.827 0.524 0.826 0.827 0.828 0.826 0.826 0.828 0.497 95 0.107
Sd 0.046 0.048 0.048 0.048 0.047 0.047 0.047 0.049 0.048 0.041 10 0.005
LB 0.739 0.730 0.429 0.724 0.727 0.732 0.728 0.728 0.727 0.414 78 0.097
UB 0.925 0.921 0.623 0.919 0.925 0.918 0.921 0.926 0.925 0.571 115 0.117
T=500 0.8 0.8 0.5 0.8 0.8 0.8 0.8 0.8 0.8 0.5 100 0.1
Mean 0.825 0.824 0.524 0.823 0.821 0.823 0.824 0.822 0.825 0.500 94 0.106
Sd 0.018 0.017 0.019 0.018 0.017 0.017 0.018 0.018 0.018 0.017 4.5 0.002
LB 0.792 0.791 0.486 0.788 0.787 0.790 0.788 0.788 0.790 0.463 86 0.102
UB 0.864 0.861 0.563 0.862 0.856 0.862 0.862 0.860 0.863 0.537 103 0.110
Table 3. Scenario 2. Site 3 is the site with biased reports. Based on 1000 simulation, we show mean, standard deviation and 95% lower and upper bounds of estimate results.

The simulations results illustrate that the HDGC model is able to reveal those stations whose readings have different patterns from their neighbors. Although we have spatially-correlated time series data, the EM algorithm can still provide consistent estimates of model parameters (e.g. Smith et al. (2003); Fassò and Cameletti (2009)).

5. Analysis of pollution in Hebei

In the following sections, we show and interpret the parameter estimates as well as the goodness of fit of the model. We then compare the estimated values for all stations (see Table 5), and check for the stations with consistent outlying values. Note that lower values indicate that observed PM2.5 substantially deviates from true , which calls for further investigation of these stations.

5.1. Covariates impact

Table 4 shows the estimated values for parameters at each season over three years. Since all the variables in our model are standardized, the estimated for each covariate can provide insight in its contribution to level. Most of the estimates are statistical significant in each season, indicating the complicated formation mechanism of pollution.

The root of mean square error of model residuals ranges from to , with a mean of , which is shown in the last second line in Table 4. Comparing to the mean level of the measured of the 12 seasons, which ranges from to with an overall mean of , we claim that our model has a decent goodness of fit. The last line of the table shows missing rate of for all 36 monitoring stations, with a maximum of in autumn 2016.

Period Year 2014 Year 2015 Year 2016
Spring Summer Autumn Winter Spring Summer Autumn Winter Spring Summer Autumn Winter
0.05*** 0.02*** 0.02*** 0.08*** 0.06*** 0.03*** 0.10*** 0.00 0.07*** 0.00 0.05*** 0.02***
0.15*** 0.09*** 0.19*** 0.20*** 0.13*** 0.09*** 0.26*** 0.18*** 0.21*** 0.06*** 0.22*** 0.22***
-0.05*** -0.05*** -0.01*** 0.01*** -0.05*** 0.02*** 0.18*** 0.03*** 0.12*** 0.06*** 0.11*** 0.01***
0.07*** 0.06*** 0.08*** 0.26*** 0.09*** 0.07*** 0.29*** 0.17*** 0.10*** 0.03*** 0.19*** 0.28***
0.03 -0.13*** -0.11*** -0.06*** 0.03 -0.08*** 0.01 -0.09*** -0.17*** -0.01 -0.12*** -0.10***
-0.22*** -0.00 -0.33*** -0.09*** -0.17*** 0.06*** -0.33*** -0.17*** -0.47*** -0.11*** -0.51*** -0.12***
0.32*** 0.32*** 0.29*** 0.28*** 0.33*** 0.34*** 0.21*** 0.39*** 0.37*** 0.34*** 0.34*** 0.24***
-0.04*** -0.00 -0.04*** -0.01*** -0.03*** -0.01*** -0.05*** -0.03*** -0.06*** -0.05*** -0.07*** -0.03***
0.01 0.01 0.05** 0.01 0.03 0.05 -0.03 -0.01 0.13*** -0.01 -0.06*** 0.07***
0.01 0.01 0.02*** 0.01*** 0.01 0.01 0.02** 0.01*** 0.03*** 0.02*** 0.02*** 0.03***
0.02*** 0.01 0.02*** 0.00 0.03*** 0.01 -0.01 0.01** 0.04*** 0.01* 0.03*** 0.02***
0.00 -0.01 0.03*** 0.02*** -0.00 -0.00 0.04*** 0.02*** 0.05*** -0.01 0.02*** 0.02***
0.01 0.01** 0.00 0.00 0.01 -0.00 -0.01* 0.01 0.01 -0.00 -0.01** 0.00
-0.01 -0.01 -0.05*** -0.03* -0.04 -0.07 -0.09 -0.05** -0.13*** -0.03** -0.02 -0.11***
-0.04 -0.01 -0.07*** -0.01 -0.08* -0.07 -0.01 -0.04 -0.16*** -0.01 -0.03** -0.10***
-0.02 0.00 -0.08*** -0.07*** -0.02 -0.05 -0.04 -0.04** -0.16*** 0.02* 0.00 -0.09***
-0.02 -0.02 -0.03* -0.01 -0.04 -0.01 0.05 -0.02 -0.12*** 0.01 0.07*** -0.06***
-0.08** -0.02 -0.09** -0.17*** -0.01 -0.10*** -0.08** -0.07*** -0.03 0.01 -0.12*** -0.14***
g 0.94*** 0.92*** 0.96*** 0.92*** 0.94*** 0.94*** 0.94*** 0.93*** 0.93*** 0.93*** 0.94*** 0.93***
(km) 17*** 13*** 25*** 17*** 52*** 26*** 111*** 23*** 110*** 9*** 125*** 160***
0.03 0.05 0.01 0.02 0.05 0.04 0.03 0.02 0.05 0.03 0.03 0.02
root of MSE 8.94 7.79 7.34 8.27 9.28 5.49 10.38 9.40 9.14 3.69 9.47 15.14
Missing Rate 0.019 0.026 0.032 0.017 0.011 0.021 0.031 0.010 0.013 0.043 0.056 0.054
Table 4. Parameter estimates and their levels of significance (* for p-value ; ** for p-value ; *** for p-value ) from spring 2014 to winter 2016.

The three gaseous precursors , , and almost all have significantly positive effects (with -values less than 0.001) on pollution. Among them, has the largest influence on average. The second large effect on observed is from , and has the similar impact as in autumn and winter. Besides, we cannot determine the exact effect of on . is predominantly produced by photochemical reactions involving two major precursors, volatile organic compounds () and oxides of nitrogen (). An inverse relationship has been found between and (e.g. De Nevers (2010), Han et al. (2011) and Sillman (1999)). Hence, a positive association between and may cover the relationship between and .

For meteorological variables, dew point has the strongest positive effect on among all meteorological variables. An increase in dew point temperature indicates the amount of water vapor in the atmosphere is increasing, which would promote the formation of (Yang et al., 2011). Mostly, temperature is negatively related with . Rainfall can help dilute the haze and increase the sedimentation of aerosols, and thus significantly reduce concentration (Jacob and Winner, 2009). The barometric pressure is found to have a negative effect on , although not always significant, which is similar to the findings in Liang et al. (2015). Integrated wind speed interacted with directions can disperse and dilute , especially in the winter heating period (Dawson et al., 2007).

As expected, the distance to Taihang Mountain () mostly has a significant negative effect on , since it can reflect the dispersion condition caused by topography. That is the nearer to Taihang Mountain, the easier to accumulate the polluted air, while the farther, the easier to disperse.

In addition, the mode number of estimated parameter is , which is the lag one coefficient of random effect , and thus reveals a strong temporal persistency of hourly air pollution data. Range parameter presents the spatial correlation, and from Table 4 we can conclude that the spatial stations are correlated in city range (around kilometers) in most case, whereas have become more correlated since the end of 2016.

5.2. Comparison of estimates

In this section, we specially analyze the estimated values of , which indicate the relationships between the observed and the “true” PM2.5 level for each site. The estimated values and their standard deviation of the 36 monitoring stations for 12 seasons are shown in Table 5 and Table LABEL:alphanew2.

Site Year 2014 Year 2015 Year 2016
Spring Summer Autumn Winter Spring Summer Autumn Winter Spring Summer Autumn Winter
1 0.5860 0.6817 0.6192 0.5687 0.4239 0.3158 0.4566 0.4522 0.3616 0.2852 0.4682 0.4229
(0.0153) (0.0151) (0.0134) (0.0137) (0.0174) (0.0187) (0.0149) (0.0138) (0.0177) (0.0205) (0.0147) (0.0146)
2 0.4002 0.4539 0.4609 0.5896 0.3988 0.3452 0.5588 0.6891 0.5423 0.6132 0.4628 0.4933
(0.0171) (0.0173) (0.0134) (0.0139) (0.0175) (0.0176) (0.0141) (0.0127) (0.0151) (0.0145) (0.0146) (0.0141)
3 0.5671 0.4082 0.5004 0.4807 0.3505 0.4699 0.4735 0.4434 0.3978 0.7120 0.4544 0.4382
(0.0150) (0.0179) (0.0130) (0.0141) (0.0181) (0.0158) (0.0145) (0.0136) (0.0166) (0.0138) (0.0144) (0.0142)
4 0.4116 0.4176 1.1242 0.4452 0.3361 0.3086 0.4271 0.4587 0.4087 0.2421 0.3900 0.3979
(0.0169) (0.0181) (0.0122) (0.0147) (0.0187) (0.0189) (0.0153) (0.0136) (0.0167) (0.0220) (0.0153) (0.0149)
5 0.4343 0.9967 0.5612 0.7375 0.3632 0.2940 0.4890 0.4165 0.4879 0.3020 0.4332 0.4356
(0.0168) (0.0138) (0.0132) (0.0140) (0.0185) (0.0193) (0.0147) (0.0142) (0.0158) (0.0199) (0.0149) (0.0145)
6 0.9225 1.8331 1.0667 0.4917 0.3870 0.4468 0.5146 0.3844 0.3908 0.3138 0.4672 0.4314
(0.0134) (0.0123) (0.0116) (0.0139) (0.0171) (0.0154) (0.0141) (0.0138) (0.0165) (0.0184) (0.0142) (0.0142)
7 0.2452 0.2712 0.2071 0.1561 0.2796 0.2778 0.3421 0.2201 0.3077 0.1946 0.3088 0.2526
(0.0222) (0.0232) (0.0197) (0.0230) (0.0225) (0.0205) (0.0182) (0.0191) (0.0203) (0.0248) (0.0178) (0.0185)
8 0.2335 0.2532 0.2126 0.1384 0.2095 0.3791 0.3758 0.2185 0.3363 0.2165 0.3690 0.2622
(0.0215) (0.0223) (0.0182) (0.0231) (0.0240) (0.0174) (0.0173) (0.0184) (0.0192) (0.0225) (0.0165) (0.0180)
9 0.2347 0.2900 0.2025 0.1351 0.1977 0.2843 0.3003 0.1743 0.2925 0.1801 0.3413 0.2662
(0.0213) (0.0210) (0.0181) (0.0228) (0.0238) (0.0187) (0.0184) (0.0191) (0.0197) (0.0233) (0.0166) (0.0177)
10 0.4230 0.4247 1.2155 0.2967 0.3131 0.3978 0.5074 0.5382 0.4179 0.2455 0.4604 0.4495
(0.0164) (0.0174) (0.0117) (0.0162) (0.0191) (0.0164) (0.0144) (0.0137) (0.0165) (0.0209) (0.0148) (0.0144)
11 0.3060 0.4073 0.3195 0.2597 0.3086 0.3701 0.4769 0.6549 0.4866 0.2722 0.4377 0.4404
(0.0194) (0.0184) (0.0153) (0.0173) (0.0199) (0.0176) (0.0148) (0.0135) (0.0160) (0.0207) (0.0151) (0.0146)
12 0.3220 0.4212 0.4644 0.4031 0.3901 0.4439 0.4037 0.5642 0.4302 0.2604 0.4298 0.4507
(0.0177) (0.0169) (0.0129) (0.0141) (0.0176) (0.0156) (0.0153) (0.0134) (0.0160) (0.0196) (0.0148) (0.0142)
13 0.2614 0.3361 0.3934 0.3109 0.3606 0.3852 0.3939 0.5572 0.4860 0.3091 0.4289 0.4377
(0.0216) (0.0209) (0.0146) (0.0164) (0.0191) (0.0176) (0.0162) (0.0140) (0.0163) (0.0199) (0.0156) (0.0149)
14 0.2391 0.2488 0.2467 0.2589 0.2518 0.3825 0.4405 0.3103 0.4486 0.5801 0.4114 0.3599
(0.0224) (0.0235) (0.0175) (0.0175) (0.0221) (0.0173) (0.0159) (0.0161) (0.0168) (0.0149) (0.0156) (0.0159)
15 0.2651 0.2966 0.2509 0.2490 0.2843 0.3570 0.4603 0.2881 0.4144 0.3178 0.4097 0.3412
(0.0206) (0.0215) (0.0171) (0.0173) (0.0208) (0.0173) (0.0156) (0.0163) (0.0170) (0.0183) (0.0156) (0.0159)
16 0.2296 0.2585 0.2318 0.3458 0.2680 0.3847 0.4194 0.3041 0.4145 0.3822 0.3706 0.3445
(0.0233) (0.0240) (0.0185) (0.0160) (0.0226) (0.0179) (0.0163) (0.0167) (0.0175) (0.0179) (0.0164) (0.0162)
17 0.1980 0.2115 0.1944 0.1485 0.2611 0.2446 0.4595 0.2215 0.4229 0.2218 0.2670 0.2427
(0.0235) (0.0252) (0.0193) (0.0224) (0.0213) (0.0209) (0.0160) (0.0182) (0.0166) (0.0232) (0.0190) (0.0188)
18 0.2007 0.2239 0.1786 0.1479 0.2468 0.2625 0.4848 0.1993 0.4072 NaN NaN NaN
(0.0234) (0.0247) (0.0199) (0.0227) (0.0217) (0.0202) (0.0158) (0.0189) (0.0169) (NaN) (NaN) (NaN)
19 0.1928 0.3124 0.2001 0.1729 0.3059 0.2433 NaN 0.1665 0.4084 0.2678 0.2551 0.2298
(0.0258) (0.0222) (0.0202) (0.0225) (0.0214) (0.0227) (NaN) (0.0220) (0.0175) (0.0217) (0.0201) (0.0200)
20 0.1920 0.2199 0.1876 0.1444 0.2420 0.2613 0.4187 0.2008 0.4009 0.2584 0.2592 0.2255
(0.0258) (0.0265) (0.0208) (0.0241) (0.0233) (0.0217) (0.0167) (0.0200) (0.0177) (0.0221) (0.0199) (0.0199)
21 0.3257 0.2851 0.2579 0.3294 0.4767 0.6090 0.4404 0.3411 0.4144 0.4995 0.4600 0.4579
(0.0198) (0.0231) (0.0179) (0.0163) (0.0173) (0.0154) (0.0154) (0.0159) (0.0172) (0.0163) (0.0150) (0.0149)
22 0.2818 0.2803 0.2657 0.2404 0.4073 0.4579 0.4407 0.2999 0.5427 0.4895 0.5175 0.4687
(0.0194) (0.0214) (0.0160) (0.0172) (0.0171) (0.0157) (0.0146) (0.0153) (0.0148) (0.0157) (0.0139) (0.0143)
23 0.2994 0.3057 0.2703 0.3134 0.3426 0.4912 0.4511 0.3964 0.4488 0.7917 0.5295 0.5167
(0.0192) (0.0209) (0.0164) (0.0155) (0.0184) (0.0157) (0.0147) (0.0141) (0.0158) (0.0140) (0.0141) (0.0142)
24 0.2915 0.3406 0.2984 0.2708 0.4359 0.4786 0.4219 0.3037 0.4241 0.7829 0.5281 0.5098
(0.0216) (0.0217) (0.0170) (0.0181) (0.0183) (0.0169) (0.0161) (0.0171) (0.0174) (0.0150) (0.0148) (0.0147)
25 0.3414 0.2773 0.2604 0.4262 0.3266 0.3939 0.4078 0.2519 0.5057 0.8064 0.4645 0.4960
(0.0193) (0.0230) (0.0174) (0.0144) (0.0198) (0.0175) (0.0156) (0.0175) (0.0155) (0.0143) (0.0149) (0.0145)
26 0.2952 0.3587 0.2600 0.2629 0.3224 0.5451 0.4192 0.2700 0.5214 0.4358 0.5040 0.4993
(0.0201) (0.0199) (0.0170) (0.0172) (0.0197) (0.0154) (0.0154) (0.0168) (0.0153) (0.0168) (0.0144) (0.0144)
27 0.2843 0.3802 0.2992 0.2408 0.3750 0.3772 0.4377 0.2115 0.4591 0.3829 0.4387 0.3575
(0.0186) (0.0178) (0.0152) (0.0169) (0.0174) (0.0164) (0.0150) (0.0177) (0.0157) (0.0163) (0.0146) (0.0152)
28 0.2568 0.3829 0.2780 0.2999 0.4873 0.3446 0.4541 0.2067 0.4115 0.3698 0.4152 0.3935
(0.0206) (0.0186) (0.0166) (0.0160) (0.0164) (0.0177) (0.0151) (0.0187) (0.0166) (0.0174) (0.0151) (0.0150)
29 0.3090 0.4838 NaN 0.2490 0.4704 0.3834 0.4976 0.1925 0.5238 0.4670 0.4539 0.3602
(0.0180) (0.0166) (NaN) (0.0164) (0.0161) (0.0162) (0.0144) (0.0182) (0.0150) (0.0151) (0.0144) (0.0152)
30 0.2674 0.3750 0.2569 0.2420 0.3766 0.3756 0.4564 0.2018 0.4188 0.4652 0.3871 0.3366
(0.0206) (0.0191) (0.0172) (0.0179) (0.0184) (0.0174) (0.0152) (0.0194) (0.0167) (0.0162) (0.0156) (0.0159)
31 0.2747 0.2810 0.2852 0.2743 0.4059 0.4015 0.4331 0.2039 0.4369 0.3531 0.4292 0.3543
(0.0181) (0.0192) (0.0149) (0.0152) (0.0164) (0.0153) (0.0153) (0.0173) (0.0155) (0.0162) (0.0144) (0.0150)
32 0.2814 0.3621 0.2885 0.2777 0.4317 0.3955 0.4195 0.2104 0.5036 0.3291 0.4665 0.3672
(0.0192) (0.0186) (0.0154) (0.0161) (0.0168) (0.0163) (0.0153) (0.0181) (0.0153) (0.0178) (0.0145) (0.0152)
33 0.6782 0.3635 0.3431 0.2569 0.3591 0.4925 0.4426 0.3092 0.3879 0.3103 0.4241 0.4325
(0.0146) (0.0196) (0.0150) (0.0174) (0.0187) (0.0158) (0.0152) (0.0162) (0.0173) (0.0193) (0.0153) (0.0148)
34 0.7057 0.7703 0.4077 0.3058 0.3802 0.4870 0.5149 0.3133 0.5033 0.4312 0.4717 0.4025
(0.0144) (0.0144) (0.0142) (0.0162) (0.0185) (0.0159) (0.0144) (0.0162) (0.0158) (0.0168) (0.0149) (0.0150)
35 0.4773 0.5307 0.3696 0.2997 0.3601 0.3963 0.4544 0.2985 0.4507 1.0139 0.4542 0.4014
(0.0159) (0.0160) (0.0144) (0.0162) (0.0186) (0.0169) (0.0151) (0.0161) (0.0163) (0.0135) (0.0149) (0.0149)
36 0.5652 0.5458 0.3376 0.3300 0.3686 0.4409 0.4465 0.2123 0.3741 0.3234 0.4014 0.4008
(0.0152) (0.0160) (0.0152) (0.0156) (0.0184) (0.0165) (0.0151) (0.0188) (0.0174) (0.0189) (0.0154) (0.0150)
Note: Table 1 shows the specific city and site names with respect to site number.
Table 5. estimates and standard deviation in parentheses for twelve seasons from spring 2014 to winter 2016.

Since within the same city, the estimated values of the 36 monitoring stations for 12 seasons bear close resemblance, we take the average of values of the monitoring stations in the same city to facilitate comparison. Thus, we compute the average values for the 8 cities in 12 seasons of 3 years. We rank them from 1 to 8, with 1 indicating the lowest average value among the cities. Those with low values are suspect of having biased readings. To assist visual inspection, we plot them in Figure 3 for the four seasons respectively. Figure 3 shows that the two cities–Cangzhou and Langfang, have consistently lower values.

(a) Rank of average in spring.
(b) Rank of average in summer.
(c) Rank of average in autumn.
(d) Rank of average in winter.
Figure 3. The average values for 8 cities, ranking from lowest to highest for each season.

We use hierarchical clustering to classify 36 stations into several groups based on the distance matrix (Euclidean metric) of

values, and Figure 4 shows the classification tree using the hierarchical clustering algorithm with Ward’s method. The figure clearly presents two categories. The mean and standard deviation of values for the two groups are summarized in Table 6. Again, we find that the seven monitoring stations in the low value category all belong to the two cities–Cangzhou and Langfang.

Figure 4. Cluster Result by Hierarchical Clustering. Monitoring stations in the box belong to Langfang and Cangzhou cities.
Group Mean Sd
1. stations in Langfang and Cangzhou cities 0.2548 0.0769
2. stations in other cities 0.4156 0.1578
Table 6. Mean and standard deviation of estimated values for the two groups.

Moreover, we compare the estimates by randomly choosing one site in the two identified cities with its nearest site in other city. For Cangzhou city, we choose the Shihuanbaoju site and its nearest site–the Shijiancezhan site in Hengshui city. The results for estimate and its 95% standard error bands of the two stations (see Figure 5) show a big difference emerges since late 2014, thus indicating that the readings from the site in Cangzhou may become biased since that time. In Figure 6, we compare the Jiancezhongxin site in Langfang city and its nearest site Huadianerqu in Baoding city, and find that the biggest difference occurs in winter heating period with extremely high , which indicates the readings in Langfang site may be biased when heavy air pollution comes up.

Figure 5. Estimated with its 95% standard error bands for Site 9–Shihuanbaoju site in Cangzhou city, compared with Site 16–Shijiancezhan site in Hengshui city.
Figure 6. Estimated with its 95% standard error bands for Site 18-Jiancezhongxin site in Langfang city, compared with Site 2–Huadianerqu site in Baoding city. for Jiancezhongxin site is not estimated in summer, autumn, or winter 2016 due to the high missing rate (more than 20%) of the observed .

To further investigate these two stations, we examine their pollution data through comparing both and levels. We consider the concentration ratio of to . If the concentration is much lower than that of , it may indicate biased readings. After checking the data, we do find some suspicious cases. For example, in Table 7, we find that the average from Shihuanbaoju, Cangzhou, was much lower than that from Shijiancezhan, Hengshui, whereas the average was much closer in the same period from February 8 to 10, 2016. In Table 8, from June 19 to 20, 2014, the average from Jiancezhongxin site in Langfang city and its nearest site–the Huadianerqu site in Baoding city– was quite high, while the average from Jiancezhongxin, Langfang was much less than that of Baoding. The big difference in the average ratio of to again provides evidence that the readings of in Langfang city may be biased.

Biased reading of in Langfang is also mentioned in the Air Quality Assessment Report (4) –Regional Pollution Situation Assessment of Beijing, Tianjin and Hebei in 2013-2016, published by Peking University. In the report, the four stations in Langfang are compared to the two geographically nearest monitoring stations in the south of Beijing. In the four seasons of 2016, the quarterly average concentration of the two stations in Beijing was much higher than the quarterly average of the four stations in Langfang city, which also hints that there may exist biased reading in Langfang.

Ratio ()
Shihuanbaoju in Cangzhou city 60 152 38%
(Site 9) (35) (80) (9%)
Shijiancezhan in Hengshui city 134 187 70%
(Nearest site, Site 16) (75) (97) (10%)
Table 7. Mean of , and their ratio, with standard deviation in parentheses for the two stations from February 8 to 10, 2016.
Ratio ()
Jiancezhongxin in Langfang city 104 226 46%
(site 18) (56) (124) (4%)
Huadianerqu in Baoding city 160 242 69%
(Nearest site, site 2) (76) (113) (13%)
Table 8. Mean of , and their ratio, with standard deviation in parentheses for the two stations from June 19 to 20, 2014.

6. Discussions

In this paper, we propose a hidden dynamic geostatistical calibration model to detect biased readings from monitoring stations. The method has several advantages. First, the station-specific calibration component of a spatial-temporal model help us automatically find out stations with biased readings. We further investigate those detected stations through the comparison of calibration component between monitoring stations, and from another perspective the concentration ratio of to . Second, as a hierarchical spatial temporal statistical model, the model is flexible enough to handle data with missing values and latent variable while capturing spatio-temporal dynamics.

Moreover, the method is efficient in discovering relational patterns from chemical reaction, meteorological, and geographical factors on the formation of . Among the four gaseous precursors, contributes to the largest influence on on average. Given that motor vehicle exhaust fumes are the main source of , this result indicates a more specific traffic-related type of pollution (e.g. Richter et al. (2005) and Zhang et al. (2007)). Motor vehicles data from National Bureau of Statistics of the People’s Republic of China shows a phenomenal surge in the number and use of motor vehicles in Hebei province. For example, the number of possessions of private cars in Hebei was 719 million in 2014, and it increased to 1143 million (nearly 60%) in 2016. So we suggest urban traffic should be one focus of policy action if air quality is to be improved. In the heating period (autumn and winter), the impact from exceeds that of . can come from fossil fuel combustions as well as motor vehicle emission. This result hints that fossil fuel combustions can largely exacerbate air pollution in addition to the increasing motor vehicle. Nearly all the meteorological variables have significant impact on , and dew point is the most influential meteorological factor. Beside, the geographic variable, distance to mountain, negatively relates to concentration, as it is easier to accumulate the polluted air near the mountain.

Finally, confronted with the great air-pollution challenges in China, we appeal that all Chinese cities should commit to ensuring the accuracy and reliability of air pollution data and further reducing levels.

Acknowledgements

This research is funded by a China’s National Key Research Special Program Grants 2016YFC0207701 and 2016YFC0207702. We would also like to thank the support from Center for Statistical Science in Peking University, and the Key Laboratory of Mathematical Economics and Quantitative Finance (Peking University), Ministry of Education.

References

  • Calculli et al. (2015) Calculli, C., A. Fassò, F. Finazzi, A. Pollice, and A. Turnone (2015). Maximum likelihood estimation of the multivariate hidden dynamic geostatistical model with application to air quality in apulia, italy. Environmetrics 26(6), 406–417.
  • Chapko and Solomon (1976) Chapko, M. K. and H. Solomon (1976). Air pollution and recreational behavior. The Journal of social psychology 100(1), 149–150.
  • Cressie and Wikle (2015) Cressie, N. and C. K. Wikle (2015). Statistics for spatio-temporal data. John Wiley & Sons.
  • Dawson et al. (2007) Dawson, J., P. Adams, and S. Pandis (2007). Sensitivity of pm2.5 to climate in the eastern us: a modeling case study. Atmospheric chemistry and physics 7(16), 4295–4309.
  • De Nevers (2010) De Nevers, N. (2010). Air pollution control engineering. Waveland press.
  • Fassò and Cameletti (2009) Fassò, A. and M. Cameletti (2009). The em algorithm in a distributed computing environment for modelling environmental space–time data. Environmental Modelling & Software 24(9), 1027–1035.
  • Fassò et al. (2007) Fassò, A., M. Cameletti, and O. Nicolis (2007). Air quality monitoring using heterogeneous networks. Environmetrics 18(3), 245–264.
  • Ghanem and Zhang (2014) Ghanem, D. and J. Zhang (2014). ‘effortless perfection:’do chinese cities manipulate air pollution data? Journal of Environmental Economics and Management 68(2), 203–225.
  • Han et al. (2011) Han, S., H. Bian, Y. Feng, A. Liu, X. Li, F. Zeng, and X. Zhang (2011). Analysis of the relationship between o3, no and no2 in tianjin, china. Aerosol Air Qual. Res 11(2), 128–139.
  • Hoek et al. (2013) Hoek, G., R. M. Krishnan, R. Beelen, A. Peters, B. Ostro, B. Brunekreef, and J. D. Kaufman (2013). Long-term air pollution exposure and cardio-respiratory mortality: a review. Environmental Health 12(1), 43.
  • Hu et al. (2010) Hu, H., Q. Yang, X. Lu, W. Wang, S. Wang, and M. Fan (2010). Air pollution and control in different areas of china. Critical Reviews in Environmental Science and Technology 40(6), 452–518.
  • Huang and Cressie (1996) Huang, H.-C. and N. Cressie (1996). Spatio-temporal prediction of snow water equivalent using the kalman filter. Computational Statistics & Data Analysis 22(2), 159–175.
  • Huang et al. (2007) Huang, H.-C., F. Martinez, J. Mateu, and F. Montes (2007). Model comparison and selection for stationary space–time models. Computational Statistics & Data Analysis 51(9), 4577–4596.
  • Ito et al. (2007) Ito, A., A. Ito, and H. Akimoto (2007). Seasonal and interannual variations in co and bc emissions from open biomass burning in southern africa during 1998–2005. Global biogeochemical cycles 21(2).
  • Jacob and Winner (2009) Jacob, D. J. and D. A. Winner (2009). Effect of climate change on air quality. Atmospheric environment 43(1), 51–63.
  • Krishnan and McLachlan (1997) Krishnan, T. and G. McLachlan (1997). The em algorithm and extensions. Wiley 1(997), 58–60.
  • Lelieveld et al. (2015) Lelieveld, J., J. Evans, M. Fnais, D. Giannadaki, and A. Pozzer (2015). The contribution of outdoor air pollution sources to premature mortality on a global scale. Nature 525(7569), 367–371.
  • Liang et al. (2015) Liang, X., T. Zou, B. Guo, S. Li, H. Zhang, S. Zhang, H. Huang, and S. X. Chen (2015). Assessing beijing’s pm2. 5 pollution: severity, weather impact, apec and winter heating. Proc. R. Soc. A 471(2182), 20150257.
  • Pope et al. (1995) Pope, C. A., D. W. Dockery, and J. Schwartz (1995). Review of epidemiological evidence of health effects of particulate air pollution. Inhalation toxicology 7(1), 1–18.
  • Pope III et al. (2002) Pope III, C. A., R. T. Burnett, M. J. Thun, E. E. Calle, D. Krewski, K. Ito, and G. D. Thurston (2002). Lung cancer, cardiopulmonary mortality, and long-term exposure to fine particulate air pollution. Jama 287(9), 1132–1141.
  • Richter et al. (2005) Richter, A., J. P. Burrows, H. Nüß, C. Granier, and U. Niemeier (2005). Increase in tropospheric nitrogen dioxide over china observed from space. Nature 437(7055), 129–132.
  • Schwartz et al. (1996) Schwartz, J., D. W. Dockery, and L. M. Neas (1996). Is daily mortality associated specifically with fine particles? Air Repair 46(10), 927.
  • Shang et al. (2013) Shang, Y., Z. Sun, J. Cao, X. Wang, L. Zhong, X. Bi, H. Li, W. Liu, T. Zhu, and W. Huang (2013). Systematic review of chinese studies of short-term exposure to air pollution and daily mortality. Environment international 54, 100–111.
  • Sillman (1999) Sillman, S. (1999). The relation between ozone, nox and hydrocarbons in urban and polluted rural environments. Atmospheric Environment 33(12), 1821–1845.
  • Smith et al. (2003) Smith, R. L., S. Kolenikov, and L. H. Cox (2003). Spatiotemporal modeling of pm2. 5 data with missing values. Journal of Geophysical Research: Atmospheres 108(D24).
  • Sumway and Stoffer (2006) Sumway, R. H. and D. S. Stoffer (2006). Time series analysis and its applications with r examples. Time series analysis and its applications with R examples.
  • Wang et al. (2013) Wang, Y., Q. Zhang, K. He, Q. Zhang, and L. Chai (2013). Sulfate-nitrate-ammonium aerosols over china: response to 2000–2015 emission changes of sulfur dioxide, nitrogen oxides, and ammonia. Atmospheric Chemistry and Physics 13(5), 2635–2652.
  • Wu (1988) Wu, Y. (1988).

    Strong consistency and exponential rate of the “minimum l1-norm” estimates in linear regression models.

    Computational Statistics & Data Analysis 6(3), 285–295.
  • Yang et al. (2011) Yang, F., J. Tan, Q. Zhao, Z. Du, K. He, Y. Ma, F. Duan, and G. Chen (2011). Characteristics of pm 2.5 speciation in representative megacities and across china. Atmospheric Chemistry and Physics 11(11), 5207–5219.
  • Zhang et al. (2007) Zhang, X., P. Zhang, Y. Zhang, X. Li, and H. Qiu (2007). The trend, seasonal cycle, and sources of tropospheric no2 over china during 1997–2006 based on satellite measurement. Science in China Series D: Earth Sciences 50(12), 1877–1884.
  • Zheng et al. (2005) Zheng, M., L. G. Salmon, J. J. Schauer, L. Zeng, C. Kiang, Y. Zhang, and G. R. Cass (2005). Seasonal trends in pm2. 5 source contributions in beijing, china. Atmospheric Environment 39(22), 3967–3976.