Previous research across a wide range of academic fields suggests that decision-making and data analysis tasks are made nontrivial by the presence of missing data. As such, it can be assumed that decisions are likely to be more accurate and reliable when complete/representative datasets are used instead of incomplete datasets. This assumption has led to a lot of research in the data mining domain, with novel techniques being developed to perform this task accurately -
. Research suggests that applications in various professional fields such as in medicine, manufacturing or energy that use sensors in instruments to report vital information and enable decision-making processes may fail and lead to incorrect outcomes due to the presence of missing data. In such cases, it is very important to have a system capable of imputing the missing data from the failed sensors with high accuracy. The imputation procedure will require the approximation of missing values taking into account the interrelationships that exist between the data from sensors in the system. Another instance where the presence of missing data poses a threat in decision-making is in image recognition systems, whereby the absence of pixel values renders the image prediction or classification task difficult and as such, systems capable of imputing the missing values with high accuracy are needed to make the task more feasible.
Consider a high dimensional dataset such as the Mixed National Institute of Standards and Technology (MNIST) dataset with 784 feature variables being the pixel values as shown in Fig. 1
above. Assuming that pixel values are missing at random as observed in the bottom row and a statistic analysis is required to classify the above dataset, the questions of interest would be: (i) Can we impute with some degree of certainty the missing data in high dimensional datasets with high accuracy? (ii) Can new techniques be introduced for approximation of the missing data when correlation and interrelationships between the variables are considered? This paper therefore aims to use a Deep Learning (DL) technique built with Restricted Boltzmann machines stacked together to form an autoencoder in tandem with a swarm intelligence (SI) algorithm to estimate the missing data with the model created which would cater to the mechanisms of interest and the arbitrary pattern. The dataset used is the MNIST database of handwritten digits by Yann LeCun. It has a training set of 60,000 sample images and a test set of 10,000 sample images with 784 features. These images show handwritten digits from 0 to 9. Due to the fact that the research discussed in this paper was conducted at a time when there was little or no interest in the DL-SI missing data predictors on high dimensional data, this paper seeks to exploit the use of this technique on the MNIST dataset.
The remainder of this paper is structured as follows, Section 2 introduces missing data, the deep learning techniques used as well as the swarm intelligence algorithm implemented. This section also presents related work in the domain. Section 3 presents the experimental design and procedures used, while Section 4 focuses on the results and key findings from the experiments conducted in this article. Discussions, concluding remarks and suggestions for future research are further presented in Section 5.
This article implements a Deep Learning technique referred to as a Stacked Autoencoder built using Restricted Boltzmann machines, all of which have been individually trained using the Contrastive Divergence algorithm and stacked together in a bottom-up manner. The estimation of missing values is performed by using the Firefly Algorithm, which is the swarm intelligence method. However, this article will first briefly discuss the methods used and the problem it aims to solve.
2.1 Missing Data and Deep Learning
Missing data is a situation whereby some features within a dataset are lacking components . With this ensues problems in application domains that rely on the access to complete and quality data which can affect every academic/professional fields and sectors. Techniques aimed at rectifying the problem have been an area of research in several disciplines -. The manner in which data points go missing in a dataset determines the approach to be used in estimating these values. As per , there exist three missing data mechanisms. This article focuses on investigating the Missing Completely at Random (MCAR) and Missing at Random (MAR) mechanisms. Previous research suggests that MCAR scenario arises when the chances of there being a missing data entry for a feature is not dependent on the feature itself or on any of the other features in the dataset . This implies a lack of correlation or cross-correlation between features including the feature of interest . MAR on the other hand arises when missingness in a specific feature is reliant upon the other features within the dataset, but not the feature of interest itself . According to , there are two main missing data patterns. These are the arbitrary and monotone missing data patterns. In the arbitrary pattern, missing observations may occur anywhere and the ordering of the variables is of no importance. In monotone missing patterns, the ordering of the variables is of importance and occurrence is not random. Based upon this realization, this article will go on to focus on the arbitrary missing pattern.
Deep Learning comprises of several algorithms in machine learning that make use of a cataract of nonlinear processing units organized into a number of layers that extract and transform features from the input data, . Each of the layers use the output from the previous layer as input and a supervised or unsupervised algorithm could be used in the training phase. With these come applications in supervised and unsupervised problems like classification and pattern analysis, respectively. It is also based on the unsupervised learning of multiple levels of features or representations of the input data whereby higher-level features are obtained from lower level features to yield a hierarchical representation of the data . By learning multiple levels of representations that depict different levels of abstraction of the data, we obtain a hierarchy of concepts. In this article, the Deep Learning technique used is the Stacked AutoEncoder.
2.2 Restricted Boltzmann Machine (RBM)
Firstly, a Boltzmann machine (BM) is an undirected network with nodes possessing stochastic traits that can be described as a neural network. It is used amongst other things to extract vital information from an unknown probability distribution using samples from the distribution, which is generally a difficult process. This learning process is made simple by implementing restrictions on the network structure leading to Restricted Boltzmann machines (RBMs). An RBM can be described as an undirected, probabilistic, parameterized graphical model also known as a Markov random field (MRF). RBMs became techniques of interest after being suggested as components of multi-layer topologies termed deep networks . The idea is that hidden nodes extract vital information from the observations, which subsequently represent inputs to the next RBM. Stacking these RBMs together has as objective, obtaining high level representations of data by learning features from features. An RBM which is also an MRF associated with a bipartite undirected graph consists of visible nodes, representing input data, and hidden nodes, capturing interdependencies between features in the input layer . In this article, the features have as values, , while have as values, . The distribution given by the Gibbs distribution has as energy function :
In (2), , which is the most important part of an RBM model is a real valued weight between units and , while and are the bias terms for the visible and hidden variables, respectively. If is negative, and and are equal to one, the probability decreases leading to a high energy. On the contrary, if is positive, and and are equal to zero, the probability increases leading to a lower energy. If is negative and , increases leading to a low probability. Therefore, there is a preference for instead of . However, if is positive and , decreases leading to a high probability, and a preference for instead of . A negative value decreases the second term in (2), while a positive value for increases this second term. The same applies for and the third term in (2). The Gibbs distributions or probabilities from (1) or (2) are then obtained by :
Here, the exponential terms are factors of a markov network with vector nodes, whileis the intractable partition function. It is intractable courtesy of the exponential number of values it can take. For an RBM, . Another key aspect of RBMs is that is conditionally independent of and vice versa, due to the fact that there are no connections between nodes in the same layer. This property is expressed mathematically as :
2.3 Contrastive Divergence (CD)
The objective in training an RBM is to minimize the average negative log-likelihood (loss) without regularization using a stochastic gradient descent algorithm as it scales well with high-dimensional datasets. Achieving this objective requires the partial derivative of any parameter,
, of the loss function as per the following equation:
The first term in (5) is the expectation over the data distribution and is referred to as the positive phase, while and represent the same variables as in (1)-(4). The second term, which is the expectation over the model distribution is termed the negative phase. This phase is hard to compute and also intractable because an exponential sum is required over both and
. Furthermore, many sampling steps are needed to obtain unbiased estimates of the log-likelihood gradient. However, it has been shown recently that running a markov chain for just a few steps leads to estimates that are sufficient for training a model. This approach has led to the contrastive divergence (CD) algorithm. CD is a training method for undirected probabilistic graphical models with the idea being to do away with the double expectations in the negative phase in (5) and instead focus on estimation. It basically implements a Monte-Carlo estimate of the expectation over a single input data point. The idea of -step CD (CD-) is that rather than the second term being approximated in (5) by a sample from the model distribution, steps of a Gibbs chain is run, with frequently set to 1 . The Gibbs chain starts with a training sample of the training data and returns after steps . Each step, , entails sampling from , then obtaining samples from 
. For one training pattern,, the log-likelihood gradient w.r.t. is approximated by :
Due to the fact that is not a obtained from the stationary model distribution, the approximation (6) is biased. The bias in effect fades away as . Another aspect that points to CD being biased is that it maximizes the difference between two Kullback-Liebler (KL) divergences :
Here, the experimental distribution is and the distribution of the visible variables after steps of the Markov chain is . If stationarity in the execution of the chain is already attained, holds, and therefore , and the error of the approximation by CD fades away .
2.4 Autoencoder (AE)
An Autoencoder is an artificial neural network that attempts to reproduce its input at the output layer. The basic idea behind autoencoders is that the mapping from the input to the output, reveals vital information and the essential structure in the input vector that is otherwise abstract. An autoencoder takes an input vector
and maps it to a hidden representationvia a deterministic mapping function of the form . The parameter comprises of the matrix of weights and the vector of offsets/biases .
is the sigmoid activation function expressed as:
The hidden representation is then mapped to a reconstructed vector which is obtained by the functions :
Here, the parameter set comprises of the transpose of the matrix of weights and vector of biases from the encoder prior to the fine-tuning phase . When the aforementioned transposition of weights is done, the autoencoder is said to have tied weights. is not explained as a rigorous regeneration of but instead as the parameters of in probabilistic terms, which may yield with high probability . This thus leads to:
From this, we obtain a reconstruction error which is to be optimized by the optimization technique and is of the form . This equation as per  could also be expressed as:
2.5 Firefly Algorithm (FA)
FA is a nature-inspired metaheuristic algorithm based on the flashing patterns and behavior of fireflies . It is based on three main rules being: (i) Fireflies are unisex so all fireflies are attracted to all other fireflies, (ii) Attractiveness is proportional to the brightness and they both decrease as the distance increases. The idea is the less brighter firefly will move towards the brighter one. If there is no obvious brighter firefly, they move randomly, and, (iii) Brightness of a firefly is determined by the landscape of the objective function . Considering that attractiveness is proportional to light intensity, the variation of attractiveness can be defined with respect to the distance as :
In (12), is the attractiveness of a firefly, is the initial attractiveness of a firefly, and is the distance between two fireflies. The movement of a firefly towards a brighter one is determined by :
Here, and are the positions of two fireflies, and the second term is due to the attraction between the fireflies. and represent different time steps, is the randomization parameter controlling the step size in the third term, while
is a vector with random numbers obtained from a Gaussian distribution. If, the movement is then a simple random walk . If
, the movement reduces to a variant of the particle swarm optimization algorithm. The parameters used in this research are: (i) n = number of missing cases per sample, (ii) 1000 iterations, (iii) , (iv) and (v)
. The parameters were selected as they yielded the more optimal results after experimemtation with different permutations and combinations of values. The FA algorithm is used because although it has been successfully applied in a number of domains such as digital image compression, eigenvalue optimization, feature selection and fault detection, scheduling and TSP, etc., its efficiency has not been investigated in missing data estimation tasks on high-dimensional datasets.
2.6 Related Work
We present some of the work that has been done by researchers to address the problem of missing data. The research done in 
implements a hybrid genetic algorithm-neural network system to perform missing data imputation tasks with varying number of missing values within a single instance while creates a hybrid k-Nearest Neighbor-Neural Network system for the same purpose. In 
, a hybrid Auto-Associative neural network or autoencoder with genetic algorithm, simulated annealing and particle swarm optimization model is used to impute missing data with high levels of accuracy in cases where just one feature variable has missing input entries. In some cases, neural networks were used with Principal Component Analysis (PCA) and genetic algorithm as in-. In 
, they use robust regression imputation for missing data in the presence of outliers and investigate its effectiveness. In, it is suggested that information within incomplete cases, that is, instances with missing values be used when estimating missing values. A nonparametric iterative imputation algorithm (NIIA) is proposed that leads to a root mean squared error value of at least 0.5 on the imputation of continuous values and a classification accuracy of at most 87.3% on the imputation of discrete values with varying ratios of missingness. In , the shell-neighbor method is applied in missing data imputation by means of the Shell-Neighbor Imputation (SNI) algorithm which is observed to perform better than the k-Nearest Neighbor imputation method in terms of imputation and classification accuracy as it takes into account the left and right nearest neighbors of the missing data as well as varying number of nearest neighbors contrary to k-NN that considers just fixed k nearest neighbors. In , a multi-objective genetic algorithm approach is presented for missing data imputation. It is observed that the results obtained outperform some of the well known missing data methods with accuracies in the 90 percentile. Novel algorithms for missing data imputation and comparisons between existing techniques can be found in papers such as -.
3 Experimental Design and Procedure
In the design of the experiments, MATLAB R2014a software was used on a Dell Desktop computer with Intel(R) Core(TM) i3-2120 CPU @ 3.30GHz processor, 4.00 GB RAM, 32 GB virtual RAM, 64-bit Operating System running Windows 8.1 Pro. Additionally, the MNIST database was used and it contains 60,000 training images and 10,000 test images. Each of these images is of size pixels. This results in a training set of size and a test of size . Data preprocessing was performed normalizing all pixel values in the range [0, 1]. The individual network layers of the Deep AE were pretrained using RBMs and CD to initialize the weights and biases in a good solution space. The individual layers pretrained were of size , , , and . These are stacked and subsequently transposed to obtain the encoder and decoder parts of the autoncoder network, respectively. The resulting network architecture is of size,
, with an input and output layer with the same number of nodes, and seven hidden layers with varying number of nodes. The network is then fine-tuned using backpropagation, minimizing the mean squared network error. The error value obtained after training is 0.0025. The training is done using the entire training set of data that are divided into 600 balanced mini-batches. The weight and bias updates are done after every mini-batch. Training higher layers of weights is achieved by having the real-valued activations of the visible nodes in preceeding RBMs being transcribed as the activation probabilities of the hidden nodes in lower level RBMs. The Multilayer Perceptron (MLP) AE has an input and output layer, both consisting of 784 nodes, and one hidden layer consisting of 400 nodes obtained by experimenting with different numbers of nodes in the hidden layer, and observing which architecture leads to the lowest mean squared network error. A
network architecture led to the lowest mean squared network error value of 0.0032. The hidden and output layer activation function used is the sigmoid funtion. The training is done using the scaled conjugate gradient descent algorithm for 1000 epochs. Missingness in the test set of data is then created at random according to the MAR and MCAR mechanisms, as well as the arbitrary pattern, and these missing values are approximated using the swarm intelligence algorithm which has as objective function minimizing the loss function of the fine-tuned network. The tolerance error is intially set to 0.05 (5% ) in one of the networks, and is considered reasonable for a first time investigation of the proposed method. The overall approach consist of four consecutive steps being:
Train the individual RBMs on a training set of data with complete records using the greedy layer-by-layer pre-training algorithm described in  starting from the bottom layer. Each layer is trained for 50 epochs with the learning rate for the weights, visible unit biases and hidden unit biases set to 0.1. The initial and final momentum are set to 0.5 and 0.9, respectively. The final parameter is the weight cost which is set to 0.0002.
Stack the RBMs to form the Encoder and Decoder phases of a Deep Autoencoder with tied weights.
Fine-tune the Deep Autoencoder using back-propagation for 1000 epochs through the entire set of training data.
Estimate the missing data with the fine-tuned deep network as part of the objective function in the Firefly Algorithm parsing the known variable values to the objective function, while first estimating the unknown values before parsing these estimates to the objective function. The estimation procedure is terminated when a stopping criterion is achieved, which is either an error tolerance of 5% (0.05), or the maximum number of function evaluations being attained.
4 Experimental Results
In the investiagtion of the imputation technique, we used the test set of data which contained missing data entries accounting for approximately 10% of the data. We present in Tables 2 and 2, Actual, Estimate and Squared Error values from the proposed Deep Autoencoder system without tolerance (Table 2), and from MLP Autoencoder system (Table 2). The distance, , from the estimate to the actual value, added to the squared error are parameters that determine the performance of the method. In all cases presented in both tables, the Deep Autoencoder system shows , while for the same entries (actual values), the MLP Autoencoder shows that , respectively. They show better performance of the proposed technique without a set error tolerance when compared to the existing MLP Autoencoder. This knowledge is validated by the squared error which is always smaller for the proposed technique, for all cases presented in Tables 2 and 2. We could consider this enough to conclude of on the performance of both compared techniques, but we need to analyse the processing time, which seems to be better for the existing method when compared to the proposed Deep Autoencoder system. This is demonstrated by Fig. 3, where we compare processing times for both techniques. It is evident that setting an error tolerance value makes the estimation process faster as observed in Fig. 3. However, this is at the expense of accuracy which is the main aspect in such a task as seen in Fig. 3. The bigger the error tolerance value, the faster the estimation of the missing data.
5 Discussion and Conclusion
This paper investigates the use of a deep neural network with a swarm intelligence algorithm to impute missing data in a high-dimensional dataset. According to the arbitrary missing data pattern, MAR and MCAR mechanisms, missing data could occur anywhere in the dataset. The experiment in this paper considers a scenario in which 10% of the test set of data is missing. These values are to be estimated with a set error tolerance of 5%, as well as no set error tolerance. Also, the proposed method is compared to an MLP Autoencoder estimation system. The results obtained reveal that the proposed system yields the more accurate estimates, especially when there is no set error tolerance value. This is made evident when the distance and squared error values are considered. The AE systems both yield better estimates than the MLP system. However, with these accurate estimates come longer running times which are observed to become smaller when error tolerance values are set. The bigger the tolerance value, the smaller the running time. Based on the findings in this article, we intend to perform an in-depth parameter analysis in any future research in order to observe which parameters are optimal for the task and we will generalize this aspect using several datasets. Another obstacle faced in this research was the computation time to estimate the missing values and to address this, we will parallelize the process on a multi-core system to observe whether parallelizing the task does indeed speed up the process and maintain efficiency.
-  Abdella, M., Marwala, T.: The use of genetic algorithms and neural networks to approximate missing data in database. 3rd International Conference on Computational Cybernetics, ICCC 2005, pp. 207-212. IEEE (2005)
-  Aydilek, I.B., Arslan, A.: A novel hybrid approach to estimating missing values in databases using k-nearest neighbors and neural networks. International Journal of Innovative Computing, Information and Control, vol. 7, no. 8, pp. 4705-4717 (2012)
Koko, E.E.M., Mohamed, A.I.A.: Missing data treatment method on cluster analysis.International Journal of Advanced Statistics and Probability. vol. 3, no. 2, pp. 191-209 (2015)
-  Leke, C., Twala, B., Marwala, T.: Modeling of missing data prediction: Computational intelligence and optimization algorithms. International Conference on Systems, Man and Cybernetics (SMC). pp.1400-1404. IEEE (2014)
-  Mistry, F.J., Nelwamondo, F.V., Marwala, T.: Missing Data Estimation Using Principle Component Analysis and Autoassociative Neural Networks. Journal of Systemics, Cybernatics and Informatics. vol. 7, no. 3, pp. 72-79 (2009)
-  Nelwamondo, F.V., Mohamed, S., Marwala, T.: Missing data: A comparison of neural network and expectation maximisation techniques. arXiv preprint arXiv:0704.3474. (2007)
-  Rana, S., John, A.H., Midi, H., Imon, A.: Robust Regression Imputation for Missing Data in the Presence of Outliers. Far East Journal of Mathematical Sciences. vol. 97, no. 2, pp. 183. Pushpa Publishing House (2015)
-  Zhang, S., Jin, Z., Zhu, X.: Missing data imputation by utilizing information within incomplete instances. Journal of Systems and Software. vol. 84, no. 3, pp. 452-459. Elsevier (2011)
-  Zhang, S.: Shell-neighbor method and its application in missing data imputation. Applied Intelligence. vol. 35, no. 1, pp. 123-133. Springer (2011)
-  LeCun, Y. The MNIST database of handwritten digits http://yann.lecun.com/exdb/mnist/. last accessed: 1 January 2016.
-  Rubin, D.B.: Multiple imputations in sample surveys-a phenomenological Bayesian approach to nonresponse. Proceedings of the survey research methods section of the American Statistical Association. vol. 1, pp. 20-34. American Statistical Association (1978)
-  Allison, Paul D.: Multiple imputation for missing data: A cautionary tale. Philadelphia (1999)
-  Little, R.J., Rubin, D.B.: Statistical analysis with missing data. John Wiley & Sons (2014)
-  Deng, L., Li, J., Huang, J.-T., Yao, K., Yu, D., Seide, F., Seltzer, M., Zweig, G., He, X., Williams, J.: Recent advances in deep learning for speech research at Microsoft. International Conference on Acoustics, Speech and Signal Processing (ICASSP). pp. 8604-8608. IEEE (2013)
-  Deng, L., Yu, D.: Deep learning: methods and applications. Foundations and Trends in Signal Processing. vol. 7, no. 3-4, pp. 197-387. Now Publishers Inc (2014)
-  Fischer, A., Igel, C.: An Introduction to Restricted Boltzmann Machines. . isbn: 978-3-642-33275-3, pp. 14-36. Springer Berlin Heidelberg (2012).
-  Isaacs, J.C.: Representational learning for sonar ATR. Proceedings SPIE 9072, Detection and Sensing of Mines, Explosive Objects, and Obscured Targets XIX. 907203 (June 9, 2014), http://dx.doi.org/10.1117/12.2053057 (2014)
-  Bengio, Y., Courville, A., Vincent, P.: Representation learning: A review and new perspectives. Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 8, pp. 1798-1828. IEEE (2013)
-  Yang, X.-S.: Firefly algorithm, Levy flights and global optimization In: Research and Development in Intelligent Systems XXVI (Eds M. Bramer, R. Ellis, M. Petridis). pp. 209-218. Springer London (2010)
-  Lobato, F., Sales, C., Araujo, I., Tadaiesky, V., Dias, L., Ramos, L., Santana, A.: Multi-Objective Genetic Algorithm For Missing Data Imputation. Pattern Recognition Letters. Elsevier (2015)
-  Jerez, J.M., Molina, I., García-Laencina, P.J., Alba, E., Ribelles, N., Martín, M., Franco, L.: Missing data imputation using statistical and machine learning methods in a real breast cancer problem. Artificial intelligence in medicine. vol. 50, no. 2, pp. 105-115. Elsevier (2010)
-  Kalaycioglu, O., Copas, A., King, M., Omar, R.Z.: A comparison of multiple-imputation methods for handling missing data in repeated measurements observational studies. Journal of the Royal Statistical Society: Series A (Statistics in Society). Wiley Online Library (2015)
-  Lee, K.J., Carlin, J.B.: Multiple imputation for missing data: fully conditional specification versus multivariate normal imputation. American journal of epidemiology. vol. 171, no. 5, pp. 624-632. Oxford Univ Press (2010)
-  Liew, A.W.-C., Law, N.-F., Yan, H.: Missing value imputation for gene expression data: computational techniques to recover missing data from available information. Briefings in bioinformatics. vol. 12, no. 5, pp. 498-513. Oxford Univ Press (2011)
-  Myers, T.A.G.: Listwise deletion: Presenting hot deck imputation as an easy and effective tool for handling missing data. Communication Methods and Measures. vol. 5, no. 4, pp. 297-310. Taylor & Francis (2011)
-  Schafer, J.L., Graham, J.W.: Missing data: our view of the state of the art. Psychological methods. vol. 7, no. 2, pp. 147. American Psychological Association (2002)
-  Van Buuren, S.: Flexible imputation of missing data. CRC press. (2012)
-  Hinton, G.E., Salakhutdinov, R.R.: Reducing the dimensionality of data with neural networks. Science. vol. 313, no. 5786, pp. 504-507. American Association for the Advancement of Science (2006)