Machine learning architectures that support a varying number of input features can be a game changer in many real-life applications that involve learning in dynamic, complex environments. Examples include imparting intelligence to a node in an ad hoc communication network, a device in a smart-city environment, or an autonomous vehicle in a complex driving environment. To model such a dynamic environment of inconsistent and scalable nature, we assume some reliable data channels, e.g., from an on-board sensor array, as base channels; we refer to the corresponding input features as base input features and denote them by $x^b_{n_b}$, where $b$ in the superscript denotes a base feature and $n_b$ in the subscript denotes the number of base features. In addition, the system may receive other information about the environment through auxiliary sensor arrays or communication channels. We call the corresponding input features auxiliary input features and denote them by $x^a_{n_a}$, where $a$ in the superscript denotes an auxiliary feature and $n_a$ in the subscript denotes the number of auxiliary features. Due to intermittent availability, only a subset of the auxiliary features arrives along with the base features at any time instance, as shown in Figure 2. This problem can be approached in either a minimalist or a maximalist manner. In the minimalist approach, all uncertain inputs are dropped and a single knowledge model is trained using only the base input features. This knowledge model provides a certain base accuracy but does not utilize the additional information from the auxiliary inputs; the trade-off is the lost opportunity for better performance. In the maximalist approach, an ensemble of networks is formed to cater to all possible combinations of availability of the auxiliary features. The network with the smallest dimensionality then caters to only the base features and the network with the largest dimensionality caters to all the base and auxiliary features, where we refer to the number of inputs to a network as its dimensionality.
However, learning the knowledge model in such an ensemble of networks is cumbersome, as explained next. Given $n_a$ intermittently available auxiliary features, $2^{n_a}$ subsets of these features can be formed, and the network corresponding to each subset needs to be trained. This results in long training durations, as further illustrated in the supplementary. Another trade-off is that a huge number of networks needs to be maintained throughout. An ideal solution would be an agile and scalable network architecture that adapts itself to the availability of the auxiliary inputs without needing to maintain or train multiple networks.
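The combinatorial cost of the maximalist approach can be sketched in a few lines (an illustrative sketch, not code from the paper; the function name is ours):

```python
# Sketch: counting the networks a maximalist ensemble must maintain.
# With n_aux intermittently available auxiliary features, one network
# per subset of auxiliary features is needed.

def maximalist_ensemble_size(n_aux: int) -> int:
    """Number of networks covering every auxiliary-availability pattern."""
    return 2 ** n_aux
```

For the 12 auxiliary features used in the experiments below, this already amounts to 4096 separate networks.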
In this paper, we present a new paradigm of learning in the presence of inconsistently available auxiliary inputs, which we call the auxiliary network (Aux-Net). The keystone of Aux-Net is the separation of the learning corresponding to the auxiliary inputs and the base inputs into separate, parallel modules (see Figure 2). The base features are processed as a chunk in the base module, while the auxiliary module contains one independent layer per auxiliary input, in parallel with the other layers. Therefore, the dimensionality of the active network can be varied by simply freezing the portion corresponding to an unavailable auxiliary input. In this manner, the knowledge base of Aux-Net corresponds to the maximalist approach, whereas an active knowledge model of suitable dimensionality can be invoked from it in an agile manner. Support for knowledge models of various dimensionalities makes our approach scalable. At the same time, the agility of our framework and its stability during dynamic operation are attributed to the special output weighting mechanism of the auxiliary block, which dynamically adjusts the relative contribution of the auxiliary data depending on the availability of the auxiliary features and their influence on the final outcome.
We construct the initial framework on the basis of the online deep learning (ODL) method and show our results on the Italy power demand dataset. We observe robust, agile, and scalable performance of Aux-Net in situations as challenging as half of the inputs being available only 50% of the time, and when all inputs except one are intermittently available. We show that in the most challenging scenarios, Aux-Net performs quite close to ODL (trained using only the base features) even while supporting agility, and in the more favorable scenarios it performs better than ODL.
The outline of the paper is as follows. Related work is discussed in section 2. Aux-Net is presented in section 3, and diverse numerical experiments and their discussion are presented in section 4. The paper is concluded in section 5, and the broader impact of this work is presented in the last section.
2 Related Works
3 Auxiliary Network (Aux-Net)
3.1 Problem Setting
Let us denote the streaming classification data by $\{(x_t, y_t)\}_{t=1}^{T}$, where $x_t$ is the input at time instance $t$. The base features are denoted by $x^b_t = \{x^b_{t,1}, \dots, x^b_{t,n_b}\}$, where $b$ in the superscript denotes the base features, $n_b$ in the subscript denotes the total number of base features, and $x^b_{t,i}$ denotes the $i$th base feature at time instance $t$. The auxiliary features at any time instance $t$ are represented by $x^a_t$, where $x^a_t$ is the subset of the auxiliary features $\{x^a_{t,1}, \dots, x^a_{t,n_a}\}$ received at time instance $t$. Here $a$ in the superscript denotes the auxiliary features, $n_a$ in the subscript denotes the total number of auxiliary features, and $x^a_{t,j}$ denotes the $j$th auxiliary feature at time instance $t$. The input is $x_t = \{x^b_t, x^a_t\} \in \mathbb{R}^{d_t}$, where $d_t$ is the dimension of $x_t$, varying with time as shown in Figure 2. The output $y_t \in \{1, \dots, C\}$ is the class label associated with $x_t$, where $C$ is the total number of classes. Aux-Net learns a mapping from $x_t$ to $y_t$. The prediction of the model is given by $\hat{y}_t$. The model trains in an online setting where, at any time instance $t$, the input feature $x_t$ arrives, the model predicts an output $\hat{y}_t$, the actual output $y_t$ is revealed, and an update is made to the model based on the loss incurred. An exhaustive list of all the notations is given in the supplementary.
3.2 Aux-Net Architecture
Consider a DNN with $l_b$ base layers, one middle layer, $n_a$ auxiliary layers, and $l_e$ end layers. The base layers, middle layer, auxiliary layers, and end layers constitute the base module, middle module, auxiliary module, and end module, respectively. The base, middle, and end modules are stacked sequentially, and the auxiliary module is placed in parallel to the base and middle modules with a connection to the end module, as shown in Figure 2. A softmax classifier is attached to each of the layers. The detailed architecture of the model is presented in Figure 3. The output of the Aux-Net model is given as the weighted combination of all the classifiers by the equation:

$\hat{y}_t = \sum_{Z \in \mathcal{Z}} \sum_{z=1}^{l_Z} \alpha^Z_z f^Z_z,$
where $\mathcal{Z}$ denotes the set of all the modules, $Z$ in the superscript denotes the module name, and $l_Z$ in the subscript denotes the total number of layers in the module. The notations $f^Z_z$ and $\alpha^Z_z$ represent the output of the classifier associated with the $z$th layer of module $Z$ and the weight of that classifier, respectively.
The architecture of each layer is shown in Figure 3(b). Each layer is attached to a classifier parameterized by $\Theta^Z_z$ that gives an output $f^Z_z = \mathrm{softmax}(h^Z_z \Theta^Z_z)$, where $h^Z_z$ is the hidden feature of the layer. Each layer is parameterized by $W^Z_z$ and $c^Z_z$, takes the hidden feature of the previous layer as input, and generates its hidden feature as $h^Z_z = \sigma(W^Z_z h^Z_{z-1} + c^Z_z)$, where $\sigma$ is the activation function. The parameters $W^Z_z$ and $c^Z_z$ are learnt using the OGD approach. A hedge block is used to compute $\alpha^Z_z$ based on the loss incurred by the classifier.
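The per-layer computation described above can be sketched as follows (a minimal NumPy sketch; the shapes, seed, and function names are ours, not from the paper):

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def layer_forward(h_prev, W, c):
    # Hidden feature of a layer: h = sigma(W h_prev + c), with sigma = ReLU.
    return np.maximum(0.0, W @ h_prev + c)

def classifier_output(h, Theta):
    # Per-layer softmax classifier f = softmax(Theta h) over C classes.
    return softmax(Theta @ h)

rng = np.random.default_rng(0)
h_prev = rng.normal(size=8)            # hidden feature of the previous layer
W, c = rng.normal(size=(8, 8)), rng.normal(size=8)
Theta = rng.normal(size=(3, 8))        # C = 3 classes (illustrative)
h = layer_forward(h_prev, W, c)
f = classifier_output(h, Theta)
```

Every layer, not only the last, produces such a class-probability vector; the final prediction is the hedge-weighted sum of these per-layer outputs.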
Now, we describe the inputs to the different layers. The first base layer receives the complete $x^b_t$ as its input. The subsequent base layers receive the hidden feature of their previous layer as input. The middle layer receives the hidden feature of the last base layer as its input. The $j$th auxiliary layer receives the auxiliary feature $x^a_{t,j}$ as its input. All the end layers, except the first end layer, receive the hidden feature of their previous layer as input. The first end layer is special in terms of its input, since this input needs to support the agility arising from only a subset of the auxiliary features being available at any time instance $t$. The input to the first end layer at time instance $t$ is a vector derived by concatenating the $\eta$-weighted hidden features of the middle layer and of the auxiliary layers corresponding to the currently available auxiliary inputs, where $\eta$ is the importance of a layer connected to the first end layer, denoting the fraction of the connected layer's output passed as input to the first end layer.
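A minimal sketch of how the first end layer's input could be assembled (the masking convention and names are ours, not from the paper):

```python
import numpy as np

def first_end_layer_input(h_middle, h_aux_list, eta_middle, eta_aux, available):
    """Concatenate the eta-weighted hidden feature of the middle layer with
    the eta-weighted hidden features of only those auxiliary layers whose
    feature arrived at this time instance."""
    parts = [eta_middle * h_middle]
    for j, h in enumerate(h_aux_list):
        if available[j]:
            parts.append(eta_aux[j] * h)
    return np.concatenate(parts)

h_mid = np.ones(4)
h_aux = [np.full(4, 2.0), np.full(4, 3.0)]
# Only the middle layer and the first auxiliary layer contribute here.
x_end = first_end_layer_input(h_mid, h_aux, 0.5, [0.25, 0.25], [True, False])
```

Note that the length of this input vector changes with the availability pattern, which is exactly the agility the first end layer must support.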
3.3 Parameter Learning
The learning of the model occurs in an online setting through the use of a loss function defined as:

$\mathcal{L}(F(x_t), y_t) = \sum_{Z \in \mathcal{Z}} \sum_{z=1}^{l_Z} \alpha^Z_z \, \mathcal{L}(f^Z_z, y_t),$

where $\mathcal{L}(f^Z_z, y_t)$ is the loss of the classifier associated with layer $z$ of module $Z$. On the basis of the loss incurred at each time step, the parameters $\eta$, $\Theta$, $W$, $c$, and $\alpha$ are updated as described next.
Updating $\eta$: The highlight of Aux-Net is the update of $\eta$, which allows for soft handling of the asynchronous availability of the auxiliary features. The $\eta$ of a layer depends only on the weights of its classifiers and is calculated from them.
Updating $\Theta$: The classifier parameters $\Theta$ are learned through OGD. The parameter $\Theta^Z_z$ is associated with only one classifier and does not depend on the other classifiers; therefore, its update is made only with respect to the loss of its own classifier. After every time instance $t$, the $\Theta^Z_z$ of the classifier of the $z$th layer of module $Z$ is updated as:

$\Theta^Z_z \leftarrow \Theta^Z_z - \eta_{lr} \, \nabla_{\Theta^Z_z} \, \alpha^Z_z \, \mathcal{L}(f^Z_z, y_t),$

where $\eta_{lr}$ is the learning rate of the parameters $\Theta$, $W$, and $c$.
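This update can be sketched for a softmax classifier with cross-entropy loss (our gradient derivation under those standard choices, not code from the paper):

```python
import numpy as np

def ogd_classifier_step(Theta, h, y_onehot, alpha, lr):
    """One OGD step on Theta for the loss alpha * CE(softmax(Theta h), y).
    For softmax + cross-entropy the gradient w.r.t. Theta is (p - y) h^T,
    scaled by the classifier's hedge weight alpha."""
    logits = Theta @ h
    p = np.exp(logits - logits.max())
    p /= p.sum()
    return Theta - lr * alpha * np.outer(p - y_onehot, h)

def ce_loss(Theta, h, y_onehot):
    # Cross-entropy of the softmax classifier for the true class.
    logits = Theta @ h
    p = np.exp(logits - logits.max())
    p /= p.sum()
    return -float(np.log(p[y_onehot.argmax()]))

h = np.array([1.0, 0.0, 1.0])
y = np.array([1.0, 0.0])
Theta0 = np.zeros((2, 3))
Theta1 = ogd_classifier_step(Theta0, h, y, alpha=1.0, lr=0.1)
# The classifier's own loss decreases after the step.
```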
Learning $W$ and $c$: The weights ($W$) and biases ($c$) of a layer are learned by backpropagation on the final loss, similarly to OGD. However, since each layer is associated with a classifier, unlike a traditional DNN where only the last layer gives a prediction, the gradient descent is different. Here, the parameters of a layer depend on the losses of all its successive layers that it directly or indirectly influences. The weight update rule is:

$W^Z_z \leftarrow W^Z_z - \eta_{lr} \sum_{(Z', z')} \alpha^{Z'}_{z'} \, \nabla_{W^Z_z} \, \mathcal{L}(f^{Z'}_{z'}, y_t),$

where the sum runs over all layers $(Z', z')$ that succeed the $z$th layer of module $Z$ and are therefore influenced by it, and the same rule is applicable for the bias too.
Learning $\alpha$: We learn the value of $\alpha$ through the hedge algorithm. Initially, $\alpha$ is uniformly distributed, i.e., $\alpha^Z_z = 1/M$, where $M$ is the total number of layers. The loss incurred by the classifier of the $z$th layer of module $Z$ at time instance $t$ is $\mathcal{L}(f^Z_z, y_t)$ and its weight is $\alpha^Z_z$. The weights of the classifiers are updated on the basis of their losses as:

$\alpha^Z_z \leftarrow \alpha^Z_z \, \beta^{\mathcal{L}(f^Z_z, y_t)},$

where $\beta \in (0, 1)$ is the discount rate parameter. There may come a situation where $\alpha^Z_z \rightarrow 0$. To avoid that, since we do not want to neglect any layer, a smoothing parameter $s$ is introduced, where $s \in (0, 1)$. It ensures a minimum weight for each classifier by using the equation $\alpha^Z_z \leftarrow \max(\alpha^Z_z, s/M)$. The values of all $\alpha^Z_z$ are then normalized such that $\sum_{Z} \sum_{z} \alpha^Z_z = 1$.
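The three steps (discount, floor, normalize) can be sketched as follows (a sketch under our reading of the update; note that since the floor is applied before normalization, the normalized minimum can dip slightly below $s/M$):

```python
import numpy as np

def hedge_update(alpha, losses, beta=0.99, s=0.2):
    """Discount each classifier weight by beta**loss, enforce the
    smoothing floor s / M, then renormalize so the weights sum to 1."""
    alpha = alpha * np.power(beta, losses)
    M = alpha.size
    alpha = np.maximum(alpha, s / M)
    return alpha / alpha.sum()

a0 = np.full(4, 0.25)
a1 = hedge_update(a0, losses=np.array([0.0, 1.0, 2.0, 3.0]))
# Classifiers with larger losses end up with smaller weights.
```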
Aux-Net follows a test-then-train approach, and since the number of auxiliary features changes, the trained model learned at time step $t-1$ cannot be used as-is for training or testing at time step $t$. We define a knowledge base that is updated after each time instance $t$. We denote the knowledge base at time instance $t$ by $KB_t$ and represent all of its parameters, i.e., the $W$, $c$, $\Theta$, $\alpha$, and $\eta$ of all the layers, by $\theta_t$.
Before training or testing, the model needs to incorporate the incoming dynamic auxiliary features. We define a model $M_t$, given by equation 8, that handles the asynchronous availability of the auxiliary features ($x^a_t$) by introducing an availability indicator. The model predicts an output $\hat{y}_t$ given $x_t$ and updates its parameters, giving $M'_t$, based on the loss incurred. Before moving to the next instance, we update the final parameters of $KB_t$ based on $M'_t$, giving the knowledge base $KB_{t+1}$. The block diagram of the algorithm is shown in Figure 4 and the algorithm is given in Algorithm 1.
Creating Model ($M_t$): Based on the auxiliary features received at time step $t$ and the knowledge base $KB_t$, the model $M_t$ is created before prediction and training. The auxiliary layers corresponding to the received auxiliary features are kept active and all the other auxiliary layers are frozen. Freezing a layer means that none of the parameters associated with that layer are trained; in other terms, freezing means removing the layer from the model. Since some of the auxiliary layers are removed, the set of active $\alpha$ values in the model changes, and an indicator parameter is introduced that is 1 for the layers active at time instance $t$ and 0 for the frozen auxiliary layers.
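A dictionary-based sketch of invoking the active model from the knowledge base (the data layout is ours; the point is that active layers are shared by reference, so their updates flow back into the knowledge base, while frozen auxiliary layers are simply excluded):

```python
def create_active_model(knowledge_base, available):
    """Keep the base, middle and end modules; activate only the auxiliary
    layers whose feature arrived; freeze (exclude) the rest."""
    active = {k: v for k, v in knowledge_base.items() if k != "aux_layers"}
    active["aux_layers"] = {
        j: layer
        for j, layer in knowledge_base["aux_layers"].items()
        if available[j]
    }
    return active

kb = {
    "base_layers": ["b1", "b2"],
    "middle_layer": "m",
    "end_layers": ["e1", "e2"],
    "aux_layers": {0: "a0", 1: "a1", 2: "a2"},
}
model = create_active_model(kb, available={0: True, 1: False, 2: True})
```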
Obtaining the knowledge base for the next instance: The parameters of the model $M_t$ are updated based on the loss incurred at time instance $t$, giving the updated model $M'_t$, whose parameters are obtained by updating those of $M_t$ using equations 2, 4, 5, and 6. After training at time step $t$, we create the knowledge base $KB_{t+1}$ before moving to the next iteration: all the parameters updated at time step $t$ and the parameters of the frozen layers are collected, so that the parameters of the active layers come from $M'_t$ while those of the frozen layers are carried over unchanged from $KB_t$.
4 Experimental Results
We evaluate our model using the Italy power demand dataset. It has 1096 data instances with 24 input features. In all the studies, we retain the original order of the input features. To the best of our knowledge, there is no method that incorporates intermittently available input data in an online setting. Thus, we compare the Aux-Net model with the ODL model. We train both models in a purely online setting where, after each instance, the model predicts and then trains.
Architecture details We fix the number of base layers ($l_b$) to 5, the number of middle layers to 1, and the number of end layers ($l_e$) to 5 for Aux-Net. The number of auxiliary layers ($n_a$) equals the number of auxiliary features. The number of layers for ODL is set as 11 ($l_b + 1 + l_e$). For both Aux-Net and ODL, we used the ReLU activation function and set the number of nodes in each layer to 50. The Adam optimizer was used for backpropagation. The smoothing rate and the discount rate were set as $s = 0.2$ and $\beta = 0.99$, respectively. The cross-entropy loss was chosen as the loss function. The above settings hold for all the following experiments.
Varying the probability of the availability of auxiliary inputs in Aux-Net
The first 12 input features of the Italy power demand dataset are considered as the base features and the remaining ones as the auxiliary features. The availability of each auxiliary feature at a given time instance is modeled as an independent Bernoulli trial with probability $p$. The same value of $p$ is used for all the auxiliary features, but the availability of each is computed independently. The results of Aux-Net with varying values of $p$, ODL with all 24 features, and ODL with only the 12 base features are presented in Table 1. We report the average of the losses observed at all the time instances, and similarly the average accuracy observed across all the time instances. The cumulative average loss and accuracy curves are shown in Fig. 5. We study the performance of Aux-Net and compare it with ODL with the following aims:
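The availability process used in this experiment can be simulated as independent Bernoulli draws (a sketch; the seed and array layout are ours):

```python
import numpy as np

def availability_mask(n_aux, p, rng):
    """Each of the n_aux auxiliary features arrives independently with
    probability p at a given time instance."""
    return rng.random(n_aux) < p

rng = np.random.default_rng(42)
# 12 auxiliary features over the 1096 instances of the dataset, p = 0.5.
masks = np.array([availability_mask(12, 0.5, rng) for _ in range(1096)])
```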
Sensitivity of Aux-Net to $p$ and its performance: The average accuracy and loss over all the time instances in the dataset show a monotonic trend as a function of $p$, as noted in Table 1. This shows that Aux-Net is sensitive to the availability of the auxiliary inputs, as expected. Yet, the performance of Aux-Net degrades gracefully as $p$ reduces. Moreover, Aux-Net still performs better than ODL with 12 features for the higher values of $p$ (note that ODL cannot work with inconsistent features). Further, the best-case performance of Aux-Net, when $p = 1$, is comparable to the scenario of ODL with 24 features. This means that even though the knowledge base of Aux-Net supports $2^{12}$ knowledge models, only the knowledge model with the largest dimensionality is invoked and trained. In this case the loss of Aux-Net is poorer than ODL, but the accuracy is better. In the case of $p = 0.5$, which means no consistency in either the availability or the unavailability of the auxiliary inputs, the observed poorer performance of Aux-Net in comparison to ODL is only marginal, indicating the robustness of Aux-Net in this extremely challenging scenario and its graceful degradation.
Agile adaptation of Aux-Net: The demand on agility increases significantly as $p$ reduces. For example, for $p = 0.5$ in Figure 5(c), not only does a different knowledge model need to be invoked at every instance, but the same knowledge model may not be invoked again for many time instances. The situation is easier for larger $p$, even though there are still many time instances when a different knowledge model is invoked. Nonetheless, Aux-Net remains stable in either case and adapts to the agility needs in an efficient manner, as indicated in the accuracy and loss plots in Figure 5(a,b). Indeed, the accuracy is better and the loss decreases faster over time for larger $p$. Nonetheless, when $p = 0.5$, the accuracy and loss curves closely follow ODL with 12 features, indicating that even though new knowledge models are dynamically invoked at every single instance, the performance of Aux-Net does not deteriorate in comparison to ODL; Aux-Net is indeed able to maintain agility over time, contributing to reduced loss and improved accuracy as time passes.
Decreased loss and improved accuracy over time: We note that, for the situation of 12 auxiliary inputs, support for $2^{12}$ knowledge models and the invocation of each knowledge model multiple times are needed to study aspects such as the convergence of the knowledge base over time. Yet, the decreasing loss in Figure 5(b) is a positive indicator of performance improvement over time and of possible convergence.
Varying number of base features In this experiment, we fix $p$ as 0.9 but vary the number of base features ($B$) from 1 to 23. The number of auxiliary features ($A$) is consequently $(24 - B)$. The first $B$ features in the dataset are used as the base features in Aux-Net and as the only features in ODL. The average losses of Aux-Net and ODL are compared in Fig. 6 as a function of $B$. We observe the following:
Extreme scalability: As expected, the performance of both Aux-Net and ODL deteriorates as $B$ reduces. Nonetheless, the loss of Aux-Net is significantly smaller than that of ODL in the challenging scenarios where more than 4 inputs are inconsistently available. This clearly indicates that Aux-Net is able to leverage the auxiliary inputs for better learning even if they are inconsistently available. In particular, the extremely challenging scenarios ($B = 1$, for example) demonstrate that Aux-Net is indeed able to step up to the need of supporting several knowledge models of varying inputs and dimensionalities and provide better performance than the minimalist approach.
Poorer performance than ODL for large $B$: During initialization, Aux-Net assigns the same weight ($\alpha$) to each classifier. However, the classifier corresponding to an auxiliary feature will be lossier compared to the classifier of the middle layer that uses the base features. As time progresses, the value of $\alpha$ for each layer gets customized to suit its contribution towards accurate classification. Often, this means that the $\alpha$ of an auxiliary layer reduces in the first few time instances, indicating that Aux-Net has learnt that the feature's inconsistent availability may cause increased loss if the $\alpha$ corresponding to it is high.
5 Discussion and conclusion
Scalability and knowledge entity: Aux-Net supports scalability in situations ranging from no auxiliary input being available to all auxiliary inputs being available. Aux-Net incorporates knowledge models corresponding to all possible combinations of auxiliary inputs within a single knowledge base. The architectural support in Aux-Net for auxiliary inputs, in the form of dedicated parallel layers, is a critical feature for scalability. At the same time, being able to update the pertinent knowledge models selectively and reflect the new knowledge back into the main knowledge base (see Fig. 4) ensures that only a single knowledge base needs to be maintained, as opposed to the resource-heavy maximalist approach. Furthermore, online learning in the current framework dispenses with the need for offline storage of data. Nonetheless, in a future application-specific framework, maintaining a stash of offline data may provide an advantage, and exploring that is a possibility.
Agility and stability: Agility in Aux-Net is characterized by its ability to dynamically invoke the relevant knowledge model without making the network unstable or unadaptive. A key factor that supports dynamic stability and agility is the importance parameter $\eta$, which automatically adjusts the contributions of the base inputs (through the middle layer) and the currently available auxiliary inputs so that the new auxiliary features neither introduce inordinate instability nor are suppressed.
It is of interest to investigate the convergence of Aux-Net, which could not be investigated rigorously in this study, though indicators of convergence were observed (results in the supplementary). Since no prior method could deal with intermittently available inputs, there is a dearth of suitable benchmark datasets and applications that would allow us to empirically assess aspects such as convergence. We deem that more elaborate studies on suitably designed datasets and the investigation of rigorous theoretical proofs of convergence will be significantly useful.
So far, we have demonstrated scalability, agility, and stability of Aux-Net and its ability to deal with intermittently available inputs in a completely online manner. This, in our observation is not only the first such architecture, it is also a first demonstration of results on intermittently available input features. Having set a new paradigm, we hope that new datasets, new frameworks, new applications, and more extensive studies are developed in the near future to exploit the possibility of learning in extremely dynamic and uncertain scenarios. We hope that advanced artificial intelligence for dynamic complex environments will soon emerge. We will work on providing further conceptual groundwork to Aux-Net, identifying or creating new benchmark datasets, extending Aux-Net to perform tasks such as prediction and detection or deal with more variety of inputs, such as images, adapt it to use convolutional kernels, and working with asynchronous multi-modal inputs in the future.
The machine learning community has long been afflicted by rigid architectures, even though activities in extreme learning, neuro-evolution, and incremental or online learning have been explored to ease the problem of architectural rigidity. Yet the dream of performing advanced artificial intelligence in a highly dynamic, situation-adaptive manner for efficient operation in real-world dynamic complex environments is far from accomplished. Even the most advanced AI agents, such as autonomous cars, rely on rigid architectures even as the amount and type of available data change. We are all waiting for a truly agile, scalable, self-adapting, non-rigid artificial intelligence approach that redefines how intelligent machines deal with varying amounts and types of input features and ad hoc environments.
Aux-Net is a baby step towards the above-mentioned dream. For the first time (to our knowledge), we have showcased that an architecture can be online, scalable, and agile in nature. In the future, scalable and agile machine learning will bring the next wave of research and development activities, which can address the pressing needs of advanced machine learning in complex dynamic environments. To support the initial activity in this new direction, we will release the Aux-Net source code and its development platform for the benefit of further research and development activities.
Aux-Net in its current form has limitations. It needs to be developed for online image-based classification similar to the current state-of-the-art deep learning architectures. It is currently built on Keras with a TensorFlow backend, which lacks native support for Aux-Net-type online, scalable, and agile learning. We will soon release basic functionality libraries that support scalable, agile, and online learning. We invite researchers to develop suitable benchmarking datasets from dynamic complex environments with scalable and agile machine learning needs, as well as to contribute better implementations and support for other AI tasks.
-  (2006) A framework for on-demand classification of evolving data streams. IEEE Transactions on Knowledge and Data Engineering 18 (5), pp. 577–589. Cited by: §2.
-  (2019) Autonomous deep learning: continual learning approach for dynamic environments. In SIAM International Conference on Data Mining, pp. 666–674. Cited by: §2.
-  FERNN: a fast and evolving recurrent neural network model for streaming data classification. In International Joint Conference on Neural Networks (IJCNN), pp. 1–8. Cited by: §2.
-  (2019) MUSE-rnn: a multilayer self-evolving recurrent neural network for data stream classification. In IEEE International Conference on Data Mining (ICDM), pp. 110–119. Cited by: §2.
-  (2016) IeRSPOP: a novel incremental rough set-based pseudo outer-product with ensemble learning. Applied Soft Computing 46, pp. 170–186. Cited by: §2.
-  (2019) The ucr time series archive. IEEE/CAA Journal of Automatica Sinica 6 (6), pp. 1293–1305. Cited by: §1, §4.
-  An incremental learning algorithm for non-stationary environments and class imbalance. In International Conference on Pattern Recognition, pp. 2997–3000. Cited by: §2.
-  (2000) Mining high-speed data streams. In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 71–80. Cited by: §2.
-  A desicion-theoretic generalization of on-line learning and an application to boosting. In European Conference on Computational Learning Theory, pp. 23–37. Cited by: §2.
-  (2012) A survey on learning from data streams: current and future trends. Progress in Artificial Intelligence 1 (1), pp. 45–55. Cited by: §2.
-  (2017) Learning with feature evolvable streams. In Advances in Neural Information Processing Systems (NIPS), pp. 1417–1427. Cited by: §2.
-  (2017) One-pass learning with incremental and decremental features. IEEE Transactions on Pattern Analysis and Machine Intelligence 40 (11), pp. 2776–2792. Cited by: §2.
-  (2018) PIE-RSPOP: a brain-inspired pseudo-incremental ensemble rough set pseudo-outer product fuzzy neural network. Expert Systems with Applications 95, pp. 172–189. Cited by: §2.
-  (2007) Multiple classifiers based incremental learning algorithm for learning in nonstationary environments. In International Conference on Machine Learning and Cybernetics, Vol. 6, pp. 3618–3623. Cited by: §2.
-  (2008) Learn ++ nc: combining ensemble of classifiers with dynamically weighted consult-and-vote for efficient incremental learning of new classes. IEEE Transactions on Neural Networks 20 (1), pp. 152–168. Cited by: §2.
-  (2015) A survey on data stream clustering and classification. Knowledge and Information Systems 45 (3), pp. 535–569. Cited by: §2.
-  (2010) Learn++. mf: a random subspace approach for the missing feature problem. Pattern Recognition 43 (11), pp. 3817–3832. Cited by: §2.
-  (2001) Learn++: an incremental learning algorithm for supervised neural networks. IEEE Transactions on Systems, Man, and Cybernetics, part C 31 (4), pp. 497–508. Cited by: §2.
-  (2017) Online deep learning: learning deep neural networks on the fly. In International Joint Conference on Artificial Intelligence (IJCAI), pp. 2660–2666. Cited by: §1, §2.
-  (2009) Indexing density models for incremental learning and anytime classification on data streams. In International Conference on Extending Database Technology: Advances in Database Technology, pp. 311–322. Cited by: §2.
-  (2007) Simpler core vector machines with enclosing balls. In International Conference on Machine Learning (ICML), pp. 911–918. Cited by: §2.
-  (2003) Online convex programming and generalized infinitesimal gradient ascent. In International Conference on Machine Learning (ICML), pp. 928–936. Cited by: §2.