Missing data in a database complicates decision-making and data analysis tasks. Decisions are likely to be more accurate and reliable when they are based on complete datasets rather than on incomplete datasets containing missing entries. Likewise, data analysis and data mining tasks yield more representative results and statistics when all the required data is available. As a result, there has been a lot of research interest in missing data imputation, with researchers developing novel techniques to perform this task accurately and in a reasonable amount of time, owing to the time-sensitive nature of some real-life applications [Aydilek and Arslan (2012), Rana et al. (2015), Koko et al. (2015), Mistry et al. (2009), Nelwamondo et al. (2007), Leke et al. (2014), Mohamed et al. (2007), Abdella and Marwala (2005), Zhang et al. (2011) and Zhang (2011)]. Applications in medicine, manufacturing or energy that rely on sensors in instruments to report vital information for time-sensitive decisions may fail when there are missing data in the database. In such cases, it is very important to have a system capable of imputing the missing data from the failed sensors with high accuracy as quickly as possible. The imputation procedure in such cases requires approximating the missing value while taking into account the interrelationships that exist between the values of the other sensors in the system.
There are several reasons why data may be missing from a dataset. These include data entry errors and respondents not answering certain questions in a survey during the data collection phase. Failures of instruments and sensors are a further cause of missing data entries. Table 1 depicts a database consisting of seven feature variables, with the values of some of the variables missing.
Suppose the database in question holds several records of the seven variables, with some of the data entries for some variables unavailable. The question of interest is whether we can say with some degree of certainty what the missing data entries are. Furthermore, can we introduce techniques for approximating the missing data when correlations and interrelationships between the variables in the database are considered? We aim to use deep learning techniques, Genetic Algorithms (GAs), Maximum Likelihood Estimator (MLE) and Swarm Intelligence (SI) techniques to approximate the missing data in databases, with the different models created catering to the different missing data mechanisms and patterns. Therefore, knowing whether or not interrelationships exist between the feature variables in a given dataset, one will know which model is relevant to the imputation task at hand. We also plan to use fuzzy logic with deep learning techniques to perform the imputation tasks.
In this section, we present details on the problem of missing data and the Deep Learning techniques we aim to use to solve the problem.
2.1 Missing Data
Missing data is a scenario in which some of the components of the dataset are not available for all feature variables, or may not even be defined within the problem domain, in the sense that the values do not match the problem definition because they are either outliers or inaccurate (Rubin, 1978). This produces a variety of problems in several application domains that rely on access to complete, high-quality data. As a result, techniques aimed at handling the problem have been an area of research for some time in several disciplines [Allison (1999), Little and Rubin (2014) and Rubin (1978)]. Missing data may occur in several ways in a dataset. For example, it may occur because participants do not respond to questions during the data collection process, or because of errors in the data entry process. In other situations, missing data may occur due to failures of sensors or instruments in the data recording process, in sectors that use these. The following subsubsections present the different missing data mechanisms.
2.1.1 Missing Data Mechanisms
How missing data can reasonably be handled depends on how the data points go missing. According to Little and Rubin (2014), there exist three missing data mechanisms: Missing Completely at Random (MCAR), Missing at Random (MAR), and Missing Not at Random (MNAR), also known as the Non-Ignorable (NI) case.
Missing Completely at Random
The MCAR scenario arises when the chance of there being a missing data entry for a feature variable does not depend on the feature variable itself or on any of the other feature variables in the dataset (Leke et al., 2014). This implies that the missing value is independent of the feature variable being considered and of the other feature variables within the dataset (Rubin, 1978). In Table 1, a missing value in one of the seven variables for row 5 is said to be MCAR if its missingness depends neither on the other six variables nor on the variable itself.
Missing at Random
MAR occurs if the chance of there being a missing value in a specific feature variable depends on the other feature variables within the dataset, but not on the feature variable of interest (Leke et al., 2014). In other words, the value for the feature variable is missing conditional on some other feature variable observed in the dataset, but not on the feature variable of interest itself (Scheffer, 2002). In Table 1, a missing value in one of the seven variables is said to be MAR if its missingness depends on the other six variables but not on the variable itself.
Missing Not at Random or Non-Ignorable Case
The third type of missing data mechanism is the non-ignorable case. It occurs when the chance of there being a missing entry in a variable is influenced by the value of that variable itself, regardless of whether or not the other variables in the dataset are altered and modified [Leke et al. (2014), Allison (1999)]. In this case, the pattern of missing data is not random and it is impossible to predict the missing data using the rest of the variables in the dataset. Non-ignorable missing data is more difficult to approximate and model than data under the other two missing data mechanisms (Rubin, 1978). In Table 1, a missing value in one of the seven variables is said to be non-ignorable if its missingness depends on the variable itself and not on the other variables.
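The three mechanisms can be made concrete by simulating them on a complete dataset. The following is a minimal sketch, not the procedure used in this work: the function name `make_missing`, the median threshold and the `rate` and `driver` parameters are illustrative assumptions, chosen only so that the dependence structure of each mechanism is visible.

```python
import numpy as np

def make_missing(data, target, mechanism, rate=0.3, driver=0, seed=0):
    """Return (copy of data with NaNs in column `target`, boolean mask of removed rows).

    mechanism: "MCAR", "MAR" or "MNAR".
    driver: index of the fully observed column that controls missingness under MAR.
    All names and thresholds here are hypothetical choices for illustration.
    """
    rng = np.random.default_rng(seed)
    out = np.asarray(data, dtype=float).copy()
    n = out.shape[0]
    if mechanism == "MCAR":
        # Missingness ignores every value in the dataset.
        prob = np.full(n, rate)
    elif mechanism == "MAR":
        # Missingness depends only on another, observed column.
        prob = np.where(out[:, driver] > np.median(out[:, driver]), 2 * rate, 0.0)
    elif mechanism == "MNAR":
        # Missingness depends on the value of the target column itself.
        prob = np.where(out[:, target] > np.median(out[:, target]), 2 * rate, 0.0)
    else:
        raise ValueError(f"unknown mechanism: {mechanism}")
    mask = rng.random(n) < prob
    out[mask, target] = np.nan
    return out, mask
```

Under MNAR the information needed to explain the missingness is removed together with the value, which is why this case is the hardest of the three to model.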
2.2 Missing Data Patterns
There are two main missing data patterns defined by Little and Rubin (2014): the arbitrary and the monotone missing data patterns. In the arbitrary pattern, missing observations may occur anywhere and the ordering of the variables is of no importance, as in rows 1 to 5 of Table 1. In the monotone pattern, the ordering of the variables is of importance and the occurrence of missing values is not random. Given a dataset with ordered variables as in Table 1, the pattern is said to be monotone if, whenever the variable in position j of the ordering is observed for a particular record, all the preceding variables in positions k, where k < j, are also observed for that record (Little and Rubin, 2014). Table 1 shows an arbitrary missing data pattern in rows 1 to 5 and a monotone missing data pattern in rows 6 to 9: in rows 1 to 5 the missing values can occur at any point, while in rows 6 to 9 they follow a common order. Equivalently, under the monotone pattern, if the value of the variable in position j is missing, so are the values of the variables in positions k, where k > j.
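Whether a block of records follows the monotone pattern can be checked mechanically. The sketch below assumes missing entries are encoded as NaN and that the column order is the variable ordering; the function name `is_monotone` is hypothetical.

```python
import numpy as np

def is_monotone(data):
    """True if, in every row, an observed value implies all earlier columns are observed."""
    missing = np.isnan(np.asarray(data, dtype=float))
    # Once a value is missing in a row, every later column must be missing too,
    # so the missing indicator must be non-decreasing from left to right.
    return bool(np.all(missing[:, 1:] >= missing[:, :-1]))
```

Any block of records that fails this check is treated as an arbitrary pattern.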
2.3 Deep Learning
Deep Learning comprises several machine learning algorithms that make use of a cascade of nonlinear processing units organized into a number of layers that extract and transform features from the input data [Deng et al. (2013), Deng and Yu (2014)]. Each layer uses the output from the previous layer as input, and a supervised or unsupervised algorithm may be used in the training or building phase. With these come applications in supervised and unsupervised problems, like classification and pattern analysis respectively. Deep Learning is also based on the unsupervised learning of multiple levels of features or representations of the input data, whereby higher-level features are obtained from lower-level features to yield a hierarchical representation of the data (Deng and Yu, 2014). By learning multiple levels of representations that depict different levels of abstraction of the data, we obtain a hierarchy of concepts.
There are different types of Deep Learning architectures, such as Convolutional Neural Networks (CNN), Convolutional Deep Belief Networks (CDBN), Deep Neural Networks (DNN), Deep Belief Networks (DBN), Stacked (Denoising) Auto-Encoders (SAE/SDAE) and Deep/Stacked Restricted Boltzmann Machines (DBM). We intend to make use predominantly of DNNs and SAEs, and of the others with the exception of CNNs and CDBNs. DNNs are commonly understood in terms of the Universal Approximation Theorem, Probabilistic Inference or Discrete Signal Processing. An artificial neural network (ANN) with numerous hidden layers of nodes between the input layer and the output layer is known as a DNN. DNNs are typically designed as feed-forward networks and can be trained discriminatively using standard back propagation, with the weights updated by stochastic gradient descent. Typical choices for the activation and cost functions are the softmax and cross entropy functions for classification tasks, with sigmoid and standard error functions used for regression or prediction tasks with normalized inputs.
In Figures 1-3, the architectures of four deep learning techniques are depicted. Figure 1 shows a DNN with eight input nodes in the input layer, three hidden layers each with nine nodes, and an output layer with four nodes. The nodes of each layer are connected to those of the subsequent and preceding layers. Figure 2 shows a DBN and a DBM, whereby the first layer of nodes (bottom-up) is the input layer (v), with visible units representing the database feature variables, and the subsequent layers of nodes are binary hidden nodes (h). The arrows in the DBN indicate that the training is a top-down approach, while the lack of arrows in the DBM is a result of the training being both top-down and bottom-up. In Figure 3, we see individual RBMs being stacked together to form the encoder part of an autoencoder, which is transposed to yield the decoder part. The autoencoder is then fine-tuned using back propagation to modify the interconnecting weights, with the aim of minimizing the network error.
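To illustrate the feed-forward DNN described for Figure 1, the sketch below builds a network with layer sizes 8-9-9-9-4 and evaluates one forward pass with sigmoid hidden units, a softmax output layer and a cross-entropy cost. The initialization scale, the function names and the use of sigmoid in the hidden layers are assumptions made for the example; training by back propagation is omitted.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def softmax(a):
    # Subtract the row maximum for numerical stability.
    e = np.exp(a - a.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def init_network(sizes, seed=0):
    """Small random weights and zero biases for consecutive layer sizes."""
    rng = np.random.default_rng(seed)
    return [(rng.normal(0.0, 0.1, size=(m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def forward(params, x):
    """Feed-forward pass: sigmoid hidden layers, softmax output layer."""
    h = x
    for W, b in params[:-1]:
        h = sigmoid(h @ W + b)
    W, b = params[-1]
    return softmax(h @ W + b)

def cross_entropy(probs, labels):
    """Mean cross-entropy cost for integer class labels."""
    return float(-np.mean(np.log(probs[np.arange(len(labels)), labels])))

# Eight inputs, three hidden layers of nine nodes, four outputs (as in Figure 1).
params = init_network([8, 9, 9, 9, 4])
```

Each row of the output of `forward` is a probability distribution over the four output nodes.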
2.4 Related Work
In this section, we present some of the work that has been done by researchers to address the problem of missing data. Zhang et al. (2011) suggest that information within incomplete cases, that is, instances with missing values, be used when estimating missing values. They propose a nonparametric iterative imputation algorithm (NIIA) that leads to a root mean squared error value of at least 0.5 on the imputation of continuous values and a classification accuracy of at most 87.3% on the imputation of discrete values with varying ratios of missingness. Lobato et al. (2015) present a multi-objective genetic algorithm approach for missing data imputation. The results obtained are observed to outperform some of the well-known missing data methods, with accuracies in the 90th percentile. In Zhang (2011), the shell-neighbor method is applied to missing data imputation by means of the Shell-Neighbor Imputation (SNI) algorithm, which is observed to perform better than the k-Nearest Neighbor (k-NN) imputation method in terms of imputation and classification accuracy, as it takes into account the left and right nearest neighbors of the missing data as well as a varying number of nearest neighbors, unlike k-NN, which considers just a fixed k nearest neighbors. Rana et al. (2015) use robust regression imputation for missing data in the presence of outliers and investigate its effectiveness. Abdella and Marwala (2005) implement a hybrid genetic algorithm-neural network system to perform missing data imputation tasks with a varying number of missing values within a single instance, while Aydilek and Arslan (2012) create a hybrid k-Nearest Neighbor-Neural Network system for the same purpose. In some cases, neural networks were used with Principal Component Analysis (PCA) and genetic algorithms, as in Mistry et al. (2009), Mohamed et al. (2007) and Nelwamondo et al. (2007). Leke et al. (2014) use a hybrid of auto-associative neural networks, or autoencoders, with genetic algorithm, simulated annealing and particle swarm optimization to impute missing data with high levels of accuracy in cases where just one feature variable has missing input entries. Novel algorithms for missing data imputation and comparisons between existing techniques can be found in papers such as Schafer and Graham (2002), Liew et al. (2011), Myers (2011), Lee and Carlin (2010), Baraldi and Enders (2010), Van Buuren (2012), Jerez et al. (2010) and Kalaycioglu et al. (2015).
3 Theoretical Model
In this section, we outline the methodology used to address the problem of missing data. The approach used to design the novel imputation techniques with SAE/SDAE involves the following six steps, which are depicted in Figure 4:
1. Train the Deep Neural Network with a complete set of records to recall the inputs as the outputs. Inputs are the dataset feature variables, for example the seven feature variables in Table 1, and the outputs are these same feature variables, as the aim is to reproduce the inputs at the output layer. For the network to be able to do this, it needs to extract information from the input data, which is captured in the updated network weights and biases. The extraction of information is done during the training phase, whereby lower-level features are extracted from the input data, after which higher-level features are extracted from these in turn, yielding a hierarchical representation of the input data. The overall idea is that features are extracted from features to get as good a representation of the data as possible. In the encoder phase mentioned in the previous section, a deterministic mapping function, $f_\theta$, creates a hidden representation, $y$, of the input data $x$. It is typically represented by an affine mapping followed by a nonlinearity $s$, $f_\theta(x) = s(Wx + b)$ (Isaacs, 2014). The parameter $\theta = \{W, b\}$ comprises the matrix of weights $W$ and the vector of offsets/biases $b$. In the decoder phase, the hidden representation $y$ is remapped to $z$, a vector reconstruction in the input space, with $z = g_{\theta'}(y)$ (Isaacs, 2014). The function $g_{\theta'}$ is the decoder function, an affine mapping optionally followed by a non-linearity with squashing traits, which takes either the form $g_{\theta'}(y) = W'y + b'$ or $g_{\theta'}(y) = s(W'y + b')$, with the parameter set $\theta' = \{W', b'\}$ comprising the transpose of the weights and the biases from the encoder (Isaacs, 2014).
2. Obtain the objective function from step 1, as depicted in Figure 4, as input to the optimization techniques. The updated weights and biases mentioned in step 1 are obtained by back-propagating through the network the error at the output layer, which is found by comparing the actual output to the network output. The function used to compare the actual output to the network output serves as the objective function. The reconstruction $z$ from step 1 is not interpreted as an exact regeneration of $x$ but rather, in probabilistic terms, as the parameters of a distribution $p(X \mid Z = z)$ that may yield $x$ with high probability (Isaacs, 2014). This leads to $p(X \mid Y = y) = p(X \mid Z = g_{\theta'}(y))$. From this, we obtain an associated reconstruction error, which is to be optimized by the optimization techniques and is of the form $L(x, z) \propto -\log p(x \mid z)$. Summed over the training examples $x^{(t)}$, this could also be written as $J_{AE}(\theta, \theta') = \sum_t L(x^{(t)}, g_{\theta'}(f_\theta(x^{(t)})))$ (Bengio et al., 2013). For a denoising autoencoder, the reconstruction error to be optimized is expressed as $J_{DAE}(\theta, \theta') = \sum_t E_{q(\tilde{x} \mid x^{(t)})}[L(x^{(t)}, g_{\theta'}(f_\theta(\tilde{x})))]$, where $E_{q(\tilde{x} \mid x^{(t)})}[\cdot]$ averages over the corrupted examples $\tilde{x}$ drawn from a corruption process $q(\tilde{x} \mid x^{(t)})$ (Bengio et al., 2013).
3. Approximate the missing data entries using the approximation techniques. MCAR, MAR and MNAR missing data mechanisms will be considered, as well as arbitrary and monotone missing data patterns. Different models will be created to experiment with these and test the hypothesis. In testing the hypothesis, we use the test set of data, which consists of known feature variable values and unknown or missing feature variable values, as input to the trained deep learning technique. The known values are passed as input to the network, while the unknown values are first estimated by the approximation techniques before being passed into the network as input. The optimal estimate is obtained when the objective function from step 2 is minimized.
4. Use the now completed database with the approximated missing values in the trained Deep Learning method from step 1 to observe whether or not the objective has been minimized. In this case, that means checking whether the error is minimized as we attempt to reconstruct the input.
5. If so, the complete dataset is presented as output.
6. If not, return to step 3.
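The loop formed by the six steps can be sketched end to end on a toy scale. The code below is an illustration under stated assumptions and not the proposed method: a single-hidden-layer autoencoder with tied weights stands in for the SAE/SDAE, a squared reconstruction error stands in for the objective function, and a simple random-search hill climber stands in for the GA, SI and MLE optimizers. The names `TiedAutoencoder` and `impute_row` are hypothetical, and features are assumed to be scaled to [0, 1].

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

class TiedAutoencoder:
    """One-hidden-layer autoencoder whose decoder reuses the transposed encoder weights."""

    def __init__(self, n_in, n_hidden, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(0.0, 0.1, size=(n_in, n_hidden))
        self.b = np.zeros(n_hidden)   # encoder biases
        self.c = np.zeros(n_in)       # decoder biases

    def reconstruct(self, x):
        y = sigmoid(x @ self.W + self.b)        # encoder: hidden representation
        return sigmoid(y @ self.W.T + self.c)   # decoder: reconstruction of the input

    def error(self, x):
        """Objective of step 2: squared reconstruction error."""
        return float(np.sum((self.reconstruct(x) - x) ** 2))

    def fit(self, x, epochs=2000, lr=0.5):
        """Step 1: batch gradient descent on complete records scaled to [0, 1]."""
        n = x.shape[0]
        for _ in range(epochs):
            y = sigmoid(x @ self.W + self.b)
            z = sigmoid(y @ self.W.T + self.c)
            dz = (z - x) * z * (1.0 - z) / n       # error at the output layer
            da = (dz @ self.W) * y * (1.0 - y)     # error back-propagated to the hidden layer
            self.W -= lr * (dz.T @ y + x.T @ da)   # tied weights collect both gradients
            self.b -= lr * da.sum(axis=0)
            self.c -= lr * dz.sum(axis=0)
        return self

def impute_row(model, row, n_iter=500, step=0.1, seed=0):
    """Steps 3 to 6: search over the missing entries to minimize the objective.

    Random-search hill climbing is a stand-in for the GA/SI optimizers.
    """
    rng = np.random.default_rng(seed)
    missing = np.isnan(row)
    best = np.where(missing, 0.5, row)   # initial estimate of the unknown values
    best_err = model.error(best)
    for _ in range(n_iter):
        cand = best.copy()
        cand[missing] = np.clip(
            cand[missing] + rng.normal(0.0, step, missing.sum()), 0.0, 1.0)
        err = model.error(cand)
        if err < best_err:               # keep the candidate only if the error drops
            best, best_err = cand, err
    return best, best_err
```

In this sketch the known entries of a record are never modified; only the missing positions are searched, which mirrors the distinction in step 3 between known values passed straight to the network and unknown values that are estimated first.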
4 Possible Benefits
In this research, we are introducing novel data imputation techniques which we expect will be of benefit to the research community interested in missing data imputation. Some of these are:
With the techniques introduced, we expect to achieve improved missing data imputation accuracy compared to existing methods, as measured by the relative prediction accuracy, correlation coefficient, standard square error, mean and root mean squared errors and other relevant representative metrics. This expectation stems from the manner in which deep learning methods extract information and features from the input data.
Since the literature states that deep neural networks are capable of representing and approximating more complex functions and relations than simple neural networks, we hope these techniques will be applicable with high accuracy in a variety of sectors, regardless of the complexity of the problem. This will be tested against existing techniques using the aforementioned metrics.
Possible parallelization of the imputation tasks using the methods to be introduced could lead to faster imputation of missing values, which benefits time-sensitive applications.
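Several of the metrics named above can be computed directly from the true and imputed values. The sketch below is illustrative: the function name `imputation_metrics` is hypothetical, and the definition of relative prediction accuracy as the percentage of estimates within a 10% tolerance of the true value is an assumption, since the text does not define it.

```python
import numpy as np

def imputation_metrics(actual, predicted, tolerance=0.10):
    """Compare imputed values against the known true values."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    err = predicted - actual
    mse = float(np.mean(err ** 2))
    within = np.abs(err) <= tolerance * np.abs(actual)
    return {
        "mse": mse,
        "rmse": float(np.sqrt(mse)),
        "correlation": float(np.corrcoef(actual, predicted)[0, 1]),
        # Assumed definition: share of estimates within `tolerance` of the truth.
        "relative_accuracy": float(np.mean(within) * 100.0),
    }
```

For example, with true values 1, 2, 3, 4 and estimates 1.05, 2.0, 3.6, 4.0, three of the four estimates fall within the tolerance, giving a relative accuracy of 75%.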
5 Possible Limitations
Although there are possible benefits to using the novel techniques to be introduced, there could possibly be limitations observed, for example:
Using Deep Neural Networks could require a lot of time to perform the imputations and obtain a complete dataset, due to the number of parameters that need to be optimized during training and the number of computations done during testing. The full effect of long computation times could be felt in time-sensitive applications such as in medicine, finance or manufacturing. The slow computation time could be addressed by parallelizing the processes on a multicore system, with each core handling the imputation of the missing data value(s) in different rows, depending on the number of cores. Dynamic programming could also be used to speed up the computation.
Besides time being a factor, there could also be a problem of space required to do the computations. To address these two drawbacks, a complexity analysis will be done to verify the time and space complexities of the proposed methods. Anything less than will be preferable with being regarded as the ideal complexity for both.
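The row-wise parallelization suggested above can be sketched with Python's standard library. This is an illustration only: `impute_rows_parallel` and the placeholder imputer `mean_fill` are hypothetical names, missing entries are encoded as `None`, and a thread pool is used to keep the example self-contained.

```python
from concurrent.futures import ThreadPoolExecutor

def impute_rows_parallel(rows, impute_fn, workers=4):
    """Apply `impute_fn` to each row independently, preserving row order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(impute_fn, rows))

def mean_fill(row):
    # Placeholder imputer: replace None with the mean of the observed entries.
    observed = [v for v in row if v is not None]
    fill = sum(observed) / len(observed)
    return [fill if v is None else v for v in row]
```

For CPU-bound imputers on a multicore system, `ProcessPoolExecutor` from the same module offers the same `map` interface and would distribute the rows across cores.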
In this article, we propose a new hypothesis: that the use of deep learning techniques in conjunction with swarm intelligence, genetic algorithm and maximum likelihood estimator methods will lead to better imputations, because deep learning methods obtain a hierarchical representation of the input data as higher-level features are extracted from lower-level features. This hypothesis is investigated by comparing the techniques to be introduced against existing methods such as Neural Networks with Genetic Algorithm, Auto-Associative Neural Network with Genetic Algorithm, K-Nearest Neighbor with Neural Networks, and Neural Networks with Principal Component Analysis and Genetic Algorithm, among others. The main motivation behind this hypothesis is the need to provide datasets with highly representative and accurate feature values from which trustworthy decisions, data analytics and statistics will emerge.
- Abdella and Marwala (2005) Abdella, M. and Marwala, T. The use of genetic algorithms and neural networks to approximate missing data in database. Computational Cybernetics, 2005. ICCC 2005. IEEE 3rd International Conference on. IEEE. pp.207-212. 2005.
- Allison (1999) Allison, Paul D. Multiple imputation for missing data: A cautionary tale. Philadelphia. 1999.
- Arel et al. (2010) Arel, I., Rose, D. C. and Karnowski, T. P. Deep machine learning-a new frontier in artificial intelligence research [research frontier]. Computational Intelligence Magazine, IEEE. IEEE. 5(4):13-18. 2010.
- Aydilek and Arslan (2012) Aydilek, I. B. and Arslan, A. A novel hybrid approach to estimating missing values in databases using k-nearest neighbors and neural networks. International Journal of Innovative Computing, Information and Control. 7(8):4705-4717. 2012.
- Baraldi and Enders (2010) Baraldi, A. N. and Enders, C. K. An introduction to modern missing data analyses. Journal of School Psychology. Elsevier. 48(1):5-37. 2010.
- Bengio et al. (2013) Bengio, Y., Courville, A. and Vincent, P. Representation learning: A review and new perspectives. Pattern Analysis and Machine Intelligence, IEEE Transactions on. IEEE. 35(8):1798-1828. 2013.
- Deng et al. (2013) Deng, L., Li, J., Huang, J.-T., Yao, K., Yu, D., Seide, F., Seltzer, M., Zweig, G., He, X., Williams, J. and others. Recent advances in deep learning for speech research at Microsoft. Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on. IEEE. pp.8604-8608. 2013.
- Deng and Yu (2014) Deng, L. and Yu, D. Deep learning: methods and applications. Foundations and Trends in Signal Processing. Now Publishers Inc. 7(3-4):197-387. 2014.
- Donders et al. (2006) Donders, A. R. T., van der Heijden, G. J., Stijnen, T., Moons, K. G. Review: a gentle introduction to imputation of missing values. Journal of clinical epidemiology. Elsevier. 59(10):1087-1091. 2006.
- Erhan et al. (2010) Erhan, D., Bengio, Y., Courville, A., Manzagol, P.-A., Vincent, P. and Bengio, S. Why does unsupervised pre-training help deep learning? The Journal of Machine Learning Research. JMLR. org. (11):625-660. 2010.
- Gu and Matloff (2015) Gu, X. and Matloff, N. A Different Approach to the Problem of Missing Data. arXiv preprint arXiv:1509.04992. 2015.
- Hinton and Salakhutdinov (2006) Hinton, Geoffrey E and Salakhutdinov, Ruslan R. Reducing the dimensionality of data with neural networks. Science. American Association for the Advancement of Science. 313(5786):504-507. 2006.
- Hinton et al. (2006) Hinton, G. E., Osindero, S. and Teh, Y.-W. A fast learning algorithm for deep belief nets. Neural computation. MIT Press. 18(7):1527-1554. 2006.
- Isaacs (2014) Isaacs, J. C. Representational learning for sonar ATR. Proc. SPIE. doi: 10.1117/12.2053057. http://dx.doi.org/10.1117/12.2053057. 9072:907203-907203-9, 2014.
- Jerez et al. (2010) Jerez, J. M., Molina, I., García-Laencina, P. J., Alba, E., Ribelles, N., Martín, M. and Franco, L. Missing data imputation using statistical and machine learning methods in a real breast cancer problem. Artificial intelligence in medicine. Elsevier. 50(2):105-115. 2010.
- Kalaycioglu et al. (2015) Kalaycioglu, O., Copas, A., King, M. and Omar, R. Z. A comparison of multiple-imputation methods for handling missing data in repeated measurements observational studies. Journal of the Royal Statistical Society: Series A (Statistics in Society), Wiley Online Library. 2015.
- Koko et al. (2015) Koko, E. E. M. and Mohamed, A. I. A. Missing data treatment method on cluster analysis. International Journal of Advanced Statistics and Probability, 3(2):191-209. 2015.
- Lee and Carlin (2010) Lee, K. J. and Carlin, J. B. Multiple imputation for missing data: fully conditional specification versus multivariate normal imputation. American journal of epidemiology. Oxford Univ Press. 171(5):624-632. 2010.
- Leke et al. (2014) Leke, C., Twala, B. and Marwala, T. Modeling of missing data prediction: Computational intelligence and optimization algorithms. Systems, Man and Cybernetics (SMC), 2014 IEEE International Conference on. IEEE. pp.1400-1404. 2014.
- Liew et al. (2011) Liew, A. W.-C., Law, N.-F. and Yan, H. Missing value imputation for gene expression data: computational techniques to recover missing data from available information. Briefings in bioinformatics. Oxford Univ Press. 12(5):498-513. 2011.
- Lobato et al. (2015) Lobato, F., Sales, C., Araujo, I., Tadaiesky, V., Dias, L., Ramos, L. and Santana, A. Multi-Objective Genetic Algorithm For Missing Data Imputation. Pattern Recognition Letters, Elsevier. 2015.
- Little and Rubin (2014) Little, R. J. and Rubin, D. B. Statistical analysis with missing data. John Wiley & Sons. 2014.
- Ma and Zhong (2015) Ma, X. and Zhong, Q. Missing value imputation method for disaster decision-making using K nearest neighbor. Journal of Applied Statistics. Taylor & Francis. pp.1-15. 2015.
- Mistry et al. (2009) Mistry, F. J., Nelwamondo, F. V. and Marwala, T. Missing Data Estimation Using Principle Component Analysis and Autoassociative Neural Networks. Journal of Systemics, Cybernetics and Informatics. 7(3):72-79. 2009.
- Mohamed et al. (2007) Mohamed, A. K., Nelwamondo, F. V., Marwala, T. Estimating missing data using neural network techniques, principal component analysis and genetic algorithms. Proceedings of the Eighteenth Annual Symposium of the Pattern Recognition Association of South Africa. 2007.
- Myers (2011) Myers, T. A. Goodbye, listwise deletion: Presenting hot deck imputation as an easy and effective tool for handling missing data. Communication Methods and Measures. Taylor & Francis. 5(4):297-310. 2011.
- Nelwamondo et al. (2007) Nelwamondo, F. V., Mohamed, S. and Marwala, T. Missing data: A comparison of neural network and expectation maximisation techniques. arXiv preprint arXiv:0704.3474. 2007.
- Rana et al. (2015) Rana, S., John, A. H., Midi, H., and Imon, A. Robust Regression Imputation For Missing Data in the Presence of Outliers. Far East Journal of Mathematical Sciences. Pushpa Publishing House. 97(2):183. 2015.
- Rubin (1978) Rubin, Donald B. Multiple imputations in sample surveys-a phenomenological Bayesian approach to nonresponse. Proceedings of the survey research methods section of the American Statistical Association. American Statistical Association. 1:20-34. 1978.
- Salakhutdinov and Hinton (2009) Salakhutdinov, R. and Hinton, G. E. Deep boltzmann machines. International Conference on Artificial Intelligence and Statistics. pp. 448-455. 2009.
- Sandberg and Barnard (1997) Sandberg, J. and Barnard, Y. Deep learning is difficult. Instructional Science. Springer. 25(1):15-36. 1997.
- Schafer and Graham (2002) Schafer, J. L. and Graham, J. W. Missing data: our view of the state of the art. Psychological methods. American Psychological Association. 7(2):147. 2002.
- Scheffer (2002) Scheffer, J. Dealing with missing data. Massey University. 2002.
- Schmidhuber (2015) Schmidhuber, J. Deep learning in neural networks: An overview. Neural Networks. Elsevier. (61):85-117. 2015.
- Van Buuren (2012) Van Buuren, S. Flexible imputation of missing data. CRC press. 2012.
- Zhang et al. (2011) Zhang, S., Jin, Z. and Zhu, X. Missing data imputation by utilizing information within incomplete instances. Journal of Systems and Software. Elsevier. 84(3):452-459. 2011.
- Zhang (2011) Zhang, S. Shell-neighbor method and its application in missing data imputation. Applied Intelligence. Springer. 35(1):123-133. 2011.