On-board diagnostics (OBD) data is multi-attribute trajectory data obtained from sensors in vehicles. It contains time series of engine and vehicle performance parameters. Guided by a low-order combustion-physics-based model, this paper aims to develop an OBD-data-driven AI model to predict vehicle emissions values.
This problem is of significant societal importance because transportation is the biggest worldwide contributor to greenhouse gases such as $CO_2$ (carbon dioxide) and toxic gases such as $NO_x$ (oxides of nitrogen such as $NO$, $NO_2$, etc.). These emissions impair people’s health [13] and the global climate. An understanding of vehicle and engine behavior in the real world is essential for tracking and eventually mitigating these emissions by aiding the design of cleaner and more efficient vehicle systems.
This problem is challenging because the processes by which $NO_x$ emissions are produced are complex and depend on many parameters. Traditional laboratory experiments conducted to measure emissions values are usually based on engine-specific steady-state measurements. However, data collection inside a laboratory is expensive compared to low-cost sensor data from vehicles on the road.
Two types of related work are relevant: purely phenomenological methods and purely AI approaches. Prior work [2] introduced a low-order physics (LOP) model that uses a purely phenomenological method for predicting the emission index of $NO_x$ ($EINO_x$, grams of $NO_x$ per kilogram of fuel) of a diesel engine. It assumes that $EINO_x$ depends on the intake oxygen concentration, the duration of combustion, and the peak adiabatic (i.e., without heat loss) flame temperature. The model was validated using engine testing observations in laboratory conditions. To assess its performance in the real world, we evaluated it using an OBD dataset from transit buses operated by Metro Transit (the local public transportation agency); the LOP method had poor accuracy on this dataset (Figure 1).
An example of a purely AI approach is [8], which evaluates the performance of an artificial neural network (ANN) on data from a laboratory test rig for an engine. However, it provides no understanding of $NO_x$ formation and produced spurious non-physical results. Instead, engine scientists prefer an approach to predicting emissions that is interpretable using domain knowledge [3, 4, 5, 6].
Contributions: Here, we propose a novel physics-aware AI model that leverages the concepts of variability across driving scenarios, co-occurrence patterns, and a low-order combustion-physics-based model. We evaluate the proposed model using OBD data from a transit bus in the Metro Transit fleet (Twin Cities, MN). The evaluation results show that the predictions of the proposed physics-aware AI model are more accurate (lower RMSE on the training data) than those of the low-order physics model for our OBD dataset.
Scope: The scope of this paper is limited to physics-aware, transparent, and interpretable AI models such as co-occurrence rules guided by a low-order physics-based model. Other AI models such as neural networks fall outside the scope of this work. Proprietary manufacturers’ physics-based AI models are also outside the scope, because proprietary engine calibration data is not publicly available. This paper focuses on the prediction of $NO_x$ emissions from vehicles; other vehicular emissions are not considered.
2 Proposed Baseline Approach
We first introduce a baseline physics-aware AI model to predict $NO_x$ emission values. The emission index for $NO_x$ is given by a chemical kinetic equation for the extended Zeldovich mechanism [7] (Equation 1).
In Equation 1, the predicted quantity is the $NO_x$ emission index, in grams per kilogram of fuel, at time k plus a time lag. The inputs are the adiabatic flame temperature, in kelvin, at time k and the duration of combustion, in seconds, at time stamp k, which is approximately equal to the fuel injection duration. The remaining terms are constants, and the time lag separates the adiabatic flame temperature and duration of combustion from the corresponding emission index. More details of the physics calculations are provided in the appendix sections of [9].
We evaluated the baseline physics-aware AI model using the same OBD dataset as the one used for the LOP method. First, we used six engine attributes (i.e., intake air flow rate (kilograms per hour), fuel consumed (kilograms per hour), rail pressure (pascal), intake pressure (pascal), intake temperature (kelvin), and engine speed (revolutions per minute)) to calculate the adiabatic flame temperature and the duration of combustion. Then, we applied a nonlinear regression method from the Python scikit-learn package [10] to estimate the values of the constants in Equation 1. The value of the time lag was derived using hand computation and data visualization.
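The parameter-estimation step can be illustrated with a minimal sketch. It assumes, purely for illustration, an Arrhenius-type form EI = a * t_comb * exp(-b / T_adiab), which is a common shape for thermal $NO_x$ kinetics but is not confirmed here to be the paper's Equation 1. It also swaps the scikit-learn nonlinear regression for a log-linear least-squares fit in plain Python, ignores the time lag, and uses synthetic values. The names `fit_arrhenius`, `predict`, `a`, and `b` are hypothetical.

```python
import math

def fit_arrhenius(t_adiab, t_comb, ei_nox):
    """Fit EI = a * t_comb * exp(-b / T_adiab) (an ASSUMED form, not the paper's
    confirmed equation) by ordinary least squares on the linearized relation
    ln(EI / t_comb) = ln(a) - b * (1 / T_adiab)."""
    xs = [1.0 / temp for temp in t_adiab]
    ys = [math.log(ei / dur) for ei, dur in zip(ei_nox, t_comb)]
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return math.exp(intercept), -slope  # (a, b)

def predict(a, b, temp, dur):
    """Evaluate the assumed form for one record."""
    return a * dur * math.exp(-b / temp)

# Synthetic, noise-free records generated from known constants (hypothetical values).
true_a, true_b = 5.0e6, 30000.0
temps = [2200.0, 2300.0, 2400.0, 2500.0, 2600.0]      # kelvin
durations = [0.002, 0.003, 0.0025, 0.004, 0.0035]     # seconds
observed = [predict(true_a, true_b, T, t) for T, t in zip(temps, durations)]

a_hat, b_hat = fit_arrhenius(temps, durations, observed)
```

On noise-free data the fit recovers the generating constants. With real OBD records, a nonlinear least-squares routine fitted in the original (non-log) space would weight large emission values differently from this log-space fit.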
3 Proposed Variability-Aware Approach
To overcome the limitations of the baseline method, we propose a spatiotemporal (ST) variability-aware AI approach. Since a single set of estimated parameter values in Equation 1 does not fit all scenarios well, it may be beneficial to first partition the data into multiple homogeneous groups and estimate parameter values group-wise.
The top half of Figure 3 shows our proposed ST variability-aware AI framework. First, we test incoming OBD data to identify emissions that diverge from the predictions made by the baseline model. We define divergence as a large (i.e., above a given threshold) absolute error between the observed and predicted values. In general, when a vehicle exhibits divergence, there are two potential pathways for understanding the issues and improving the model: (1) using AI to improve the prediction results, or (2) using physics-based methods to develop new and refined process-based mechanistic models.
This paper focuses on AI model refinement based on data partitioning and fitting separate models to each partition. Because the partitioning is based on ST correlates of divergent observations, we call it an ST variability-aware AI approach. This approach can potentially reduce prediction errors, as illustrated in Figure 3 (lower half).
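The divergence test described above can be sketched in a few lines. This is an illustrative sketch, not the authors' implementation: the function name `divergent_flags`, the threshold value, and the sample readings are all hypothetical.

```python
def divergent_flags(observed, predicted, threshold):
    """Flag each record whose absolute prediction error exceeds the threshold."""
    return [abs(obs - pred) > threshold for obs, pred in zip(observed, predicted)]

# Hypothetical observed and baseline-predicted emission values.
observed = [120.0, 310.0, 95.0, 480.0]
predicted = [110.0, 200.0, 100.0, 250.0]

flags = divergent_flags(observed, predicted, threshold=50.0)
```

Records two and four have absolute errors of 110 and 230, so they are flagged as divergent; the other two fall under the threshold.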
A divergent window of emissions refers to a period of a certain length in a time series of OBD data records within which the prediction errors of the baseline approach exceed an input threshold (summationThreshold). A co-occurrence pattern in a time series of OBD data records is similar to a sequential association pattern [12], except that it uses a spatial statistical interest measure, namely a temporal form of Ripley’s cross-K function [1, 11]. These patterns represent subsets of engine attributes, with their specific value ranges, that are present together in many divergent (time) windows and have cross-K function values above a given threshold. Engine scientists review and group co-occurrence patterns into scenarios (for example, cold start of an engine, sudden acceleration, etc.) to physically interpret situations where the baseline model performs poorly.
Given the co-occurrence pattern groups formed in the first step, the original OBD data is split into multiple subsets corresponding to different pattern groups. Within each subset, we use the baseline approach to calculate the adiabatic flame temperature and the duration of combustion, and then estimate the parameter values in Equation 1 by fitting nonlinear regression models independently. Since the scenarios in which the baseline approach does not perform well are handled separately, the ST variability-aware AI model is expected to yield better predictive accuracy.
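A minimal sketch of the windowing idea follows. It assumes that a window is divergent when the sum of absolute errors inside it exceeds summationThreshold, which is one reading of the parameter name. It also replaces the temporal cross-K interest measure with a much simpler support fraction, so it is not the measure used in [1]. Apart from `EngTq`, the attribute name, the bin labels, and the sample values are hypothetical.

```python
def divergent_windows(abs_errors, window_length, summation_threshold):
    """Start indices of sliding windows whose summed absolute error exceeds
    the threshold (an ASSUMED reading of summationThreshold)."""
    starts = []
    for start in range(len(abs_errors) - window_length + 1):
        if sum(abs_errors[start:start + window_length]) > summation_threshold:
            starts.append(start)
    return starts

def pattern_support(records, window_starts, window_length, pattern):
    """Fraction of divergent windows containing at least one record that matches
    every attribute bin in the pattern. A simple stand-in for the cross-K measure."""
    if not window_starts:
        return 0.0
    hits = 0
    for start in window_starts:
        window = records[start:start + window_length]
        if any(all(rec.get(attr) == val for attr, val in pattern.items())
               for rec in window):
            hits += 1
    return hits / len(window_starts)

# Hypothetical per-second absolute errors (ppm) and binned attribute values.
abs_errors = [5, 12, 20, 3, 2, 15, 18, 1]
records = [
    {"EngTq": "high", "EngSpeed": "low"},
    {"EngTq": "high", "EngSpeed": "high"},
    {"EngTq": "high", "EngSpeed": "high"},
    {"EngTq": "low", "EngSpeed": "low"},
    {"EngTq": "low", "EngSpeed": "low"},
    {"EngTq": "high", "EngSpeed": "low"},
    {"EngTq": "high", "EngSpeed": "low"},
    {"EngTq": "low", "EngSpeed": "low"},
]

starts = divergent_windows(abs_errors, window_length=3, summation_threshold=30)
support = pattern_support(records, starts, 3, {"EngTq": "high", "EngSpeed": "high"})
```

Here four windows are divergent, and the high-torque, high-speed pattern appears in two of them, giving a support of 0.5.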
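The group-wise estimation step can be sketched as follows. As before, the Arrhenius-type form EI = a * t_comb * exp(-b / T_adiab) is an assumption made for illustration rather than the paper's confirmed Equation 1, and a log-linear least-squares fit stands in for the nonlinear regression. The partition labels, constants, and function names are hypothetical.

```python
import math
from collections import defaultdict

def fit_log_linear(rows):
    """Least-squares fit of ln(EI / t_comb) = ln(a) - b / T_adiab for rows of
    (T_adiab, t_comb, EI); returns (a, b). The functional form is ASSUMED."""
    xs = [1.0 / temp for temp, _, _ in rows]
    ys = [math.log(ei / dur) for _, dur, ei in rows]
    x_mean, y_mean = sum(xs) / len(xs), sum(ys) / len(ys)
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return math.exp(y_mean - slope * x_mean), -slope

def fit_per_partition(rows, labels):
    """Group rows by partition label and fit each group independently."""
    groups = defaultdict(list)
    for row, label in zip(rows, labels):
        groups[label].append(row)
    return {label: fit_log_linear(group) for label, group in groups.items()}

def make_row(a, b, temp, dur):
    """Generate one synthetic record from known constants."""
    return (temp, dur, a * dur * math.exp(-b / temp))

# Two synthetic partitions generated with different constants (hypothetical values).
rows = [make_row(5.0e6, 30000.0, T, 0.003) for T in (2200.0, 2400.0, 2600.0)]
rows += [make_row(2.0e6, 25000.0, T, 0.002) for T in (2100.0, 2300.0, 2500.0)]
labels = ["non_divergent"] * 3 + ["pattern_1"] * 3

params = fit_per_partition(rows, labels)
```

Each partition recovers its own constants, which a single fit over the pooled data could not do. That is the motivation for fitting separate models per scenario.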
4 Experimental Evaluation and Discussion
We conducted experiments to compare the predictive accuracy of the proposed approaches with that of the low-order physics (LOP) approach detailed in Section 1 [2], addressing the following questions: (1) How do the predictions of the proposed approaches compare with those from the low-order physics approach? (2) How sensitive is the proposed spatiotemporal variability-aware AI approach to the number of partitions, input divergence threshold, and window length?
Data: The dataset used in the experiments is the Metro Transit OBD dataset that was used to evaluate the LOP approach and the proposed baseline approach in the earlier sections. It contains 99,895 data entries with measurements of 90 engine and vehicle attributes. The OBD data was obtained from transit buses traversing 3 different routes over 16 different runs in the Minneapolis-St. Paul region. We used 8 runs as the training sample and the remaining 8 runs for testing, ensuring that each route is represented in both samples. More details are provided in [9].
Candidate methods and metrics: The methods evaluated in the experiments include the low-order physics model (LOP), the proposed baseline (P-Base), and the proposed spatiotemporal variability-aware AI approach (P-STVA). Predictive accuracy was measured using $R^2$ values, root mean square error (RMSE), and mean absolute error (MAE).
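One simple way to obtain a run-level split that keeps every route in both samples is sketched below. The paper does not state how the runs were assigned, so this is only an illustration: the route labels, the number of runs per route, and the alternating assignment rule are all assumptions.

```python
def split_runs_by_route(runs):
    """Sort runs by route, then alternate train/test assignment so that any
    route with at least two runs appears in both samples."""
    ordered = sorted(runs, key=lambda run: (run[1], run[0]))
    train = [run for i, run in enumerate(ordered) if i % 2 == 0]
    test = [run for i, run in enumerate(ordered) if i % 2 == 1]
    return train, test

# 16 hypothetical runs as (run_id, route) pairs over 3 hypothetical routes.
runs = [(run_id, "ABC"[run_id % 3]) for run_id in range(16)]

train_runs, test_runs = split_runs_by_route(runs)
```

Splitting by run rather than by individual record keeps temporally adjacent records from the same trip out of both samples at once.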
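For reference, the accuracy metrics can be computed as in the following sketch. RMSE and MAE are named in the text; the coefficient of determination is included on the assumption that it is the remaining metric. The function name and sample values are hypothetical.

```python
import math

def regression_metrics(observed, predicted):
    """Return R^2, RMSE, and MAE for paired observed and predicted values."""
    n = len(observed)
    residuals = [obs - pred for obs, pred in zip(observed, predicted)]
    ss_res = sum(r * r for r in residuals)
    mean_obs = sum(observed) / n
    ss_tot = sum((obs - mean_obs) ** 2 for obs in observed)
    return {
        "r2": 1.0 - ss_res / ss_tot,
        "rmse": math.sqrt(ss_res / n),
        "mae": sum(abs(r) for r in residuals) / n,
    }

# Hypothetical observed and predicted emission values.
metrics = regression_metrics([100.0, 200.0, 300.0, 400.0],
                             [110.0, 190.0, 310.0, 380.0])
```

RMSE penalizes large errors more heavily than MAE, so a gap between the two indicates that a few records dominate the error.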
Experimental Results: Figure 4 shows a comparison of the refined prediction using the P-STVA method with the observed values in the training data. Compared with Figures 1 and 2, the dots in Figure 4 are closer to the line, and the number of green and yellow dots seen in the upper-left part of Figure 2 is dramatically reduced, which indicates improved predictive accuracy.
How do the predictions of the proposed approaches compare with those from the low-order physics approach? Tables 1 and 2 summarize predictive accuracy metrics for the candidate methods on training and testing data, respectively, with summationThreshold = 30 ppm (parts per million). The proposed physics-aware AI methods outperformed the low-order physics model. For the training data, the P-Base method improves both RMSE and MAE compared to the LOP method, and the RMSE and MAE of the P-STVA method with n = 4 are both smaller than those of the P-Base method. The same ordering holds for the testing data: the P-Base method improves both RMSE and MAE compared to the LOP method, and the P-STVA method with n = 4 further reduces both relative to the P-Base method.
|P-STVA n = 4||0.3900||132.60||102.13|
|P-STVA n = 4||0.4769||117.39||92.99|
How sensitive is the proposed spatiotemporal variability-aware AI approach to the number of partitions, input divergence threshold, and window length? Figures 5a, 5b, and 5c show the sensitivity of the predictive accuracy metrics of the P-STVA method to the number of patterns n (the number of partitions is n + 1), summationThreshold, and window length L, respectively, for training and testing data. Optimum values are found at n = 4, summationThreshold = 30 ppm, and L = 3 s.
|Pattern1||EngTq:||High Engine Load|
|Pattern2||EngTq:||High Load Transient|
|High Engine Idling|
|Low Engine Speed|
|Subscript||Scale of values|
|0,1||Very low value|
Domain interpretation of partitions: The P-STVA method uses n + 1 partitions: one for the non-divergent case (and divergent cases not covered by the patterns) and n partitions of divergent cases, one for each co-occurrence pattern. Table 3 shows 4 co-occurrence patterns for the case n = 4, summationThreshold = 30 ppm, and L = 3 seconds, along with their domain interpretation in terms of different scenarios.
5 Conclusions and Future Work
We proposed a novel physics-aware AI emission prediction model and evaluated it with an on-board diagnostics dataset. The experimental evaluation shows the proposed models outperform the non-AI low-order physics model. Furthermore, the resultant models were interpreted using domain concepts as different vehicle scenarios.
In the future, we will explore other AI models such as neural networks guided by combustion physics. We will characterize the sensitivity of the computation time of the proposed P-STVA method to parameters such as the number of partitions. We will also investigate physics-aware AI models to predict vehicle emissions other than $NO_x$, as well as to predict energy use. This will assess the applicability of the co-occurrence pattern based approach (shown in the top half of Figure 3) to other problems. Currently, engine scientists review and group co-occurrence patterns into scenarios for domain interpretation. In future work, we will explore building a machine learning model for grouping the patterns to assist human experts and reduce manual labor.
This material is based upon work supported by the National Science Foundation under Grant No. 1901099. The views and opinions of authors expressed herein do not necessarily state or reflect those of the United States Government or any agency thereof.
[1] (2017) Discovering non-compliant window co-occurrence patterns. GeoInformatica 21 (4), pp. 829–866. Springer, US.
[2] (2008-04) Late intake valve closing as an emissions control strategy at Tier 2 Bin 5 engine-out NOx level. SAE Int. J. Engines 1, pp. 427–443. SAE Int.
[3] (2018) Physics guided machine learning: a new paradigm for modeling dynamical systems. AGUFM 2018, pp. IN12A–03. AGU.org.
[4] Theory-guided data science: a new paradigm for scientific discovery from data. IEEE Trans. on Knowledge and Data Engg. 29 (10), pp. 2318–2331. IEEE.
[5] (2020) Physics-guided energy-efficient path selection using on-board diagnostics data. ACM Trans. on Data Science (TDS) 1 (1). ACM.
[6] (2018) Physics-guided energy-efficient path selection: a summary of results. In SIGSPATIAL’18, pp. 99–108.
[7] (1998) Skeletal mechanism for NOx chemistry in diesel engines. SAE Trans. 107, pp. 786–801. SAE Int.
[8] (2009-10) Evaluation of artificial NNs performance in predicting diesel engine NOx emissions. Research Journal of Applied Sciences, Engg. and Technology 33. Maxwell Scientific Publ.
[9] (2020) Vehicle Emissions Prediction with Physics-Aware AI Models: Technical Report (20-003). Technical report, retrieved from the University of Minnesota Digital Conservancy, http://hdl.handle.net/11299/216628.
[10] (2011) Scikit-learn: ML in Python. Journal of ML Research 12 (85), pp. 2825–2830. JMLR.org.
[11] The second-order analysis of stationary point processes. Journal of Applied Probability 13 (2), pp. 255–266. Cambridge University Press.
[12] (1996) Mining sequential patterns: generalizations and performance improvements. In Proc. of the 5th Int. Conf. on Extending Database Technology: Advances in Database Technology, EDBT ’96, Berlin, Heidelberg, pp. 3–17. Springer-Verlag.
[13] (2020) Reducing mortality from air pollution in the US by targeting specific emission sources. Env. Science & Tech. Letters. American Chemical Society.