In this paper, we introduce Multi-Output Dependence (MOD) learning as an algorithmic family that models dependencies between multiple outputs. Traditional supervised learning seeks to map an input vector to an output vector drawn from a set of possible outputs. Multi-label classification specifically examines the case where multiple target labels must be assigned to each instance, while structured prediction predicts multiple target labels where the output space is structured. MOD addresses problems where the outputs are dependent on each other and where there are multiple correct output vectors for a given input vector. Any one output may be considered correct or incorrect only when considered in the context of the other outputs.
An example MOD problem is the following. Assume we want to propose an action for a company to take that generates sales and/or retains a customer. For example, a particular customer may be contemplating switching to a competitor. What should the company do to retain this customer? There could be multiple correct actions. A salesperson could write to the customer and offer incentives for staying, or the CEO could call the customer to express how important the customer is to the company. Of course, each action incurs a certain cost. Having the CEO call a customer is more expensive than having a help-desk employee write the customer an e-mail, but both are viable solutions. However, if that customer happens to be the largest source of revenue for the company, then sending a generic e-mail may not be the best course of action. It may be the case that calling would only be correct if the CEO made the phone call, whereas writing an e-mail would be correct if it came from a different person.
MOD problems can be seen as those problems where the outputs are important in addition to the inputs when making a decision. MOD learning requires approximating a relation, as opposed to the more traditional function approximation. We define a training data set to be a set of input vectors, each labeled with an appropriate output vector. An input vector can be associated with multiple distinct output vectors. In this case, there are multiple correct outputs for that input. Some outputs may be more desirable than others, but there are still multiple outputs that would be acceptable given the input. This changes the task from finding a mapping function to finding a relation from the input space to the output space. We consider the relation where there are multiple outputs (that is, more than one output variable) and there is a dependency between the outputs. This gives rise to interesting questions about which correct solution to choose when multiple correct solutions are available.
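The difference between a function and a relation can be made concrete with a small sketch. The code below is illustrative only: the toy customer data and the helper names `build_relation` and `inputs_with_multiple_outputs` are assumptions introduced here, not part of the original work. It groups a training set by input vector and reports the inputs that are labeled with more than one output vector.

```python
from collections import defaultdict


def build_relation(training_data):
    """Group output vectors by input vector, so one input may map to many outputs."""
    relation = defaultdict(set)
    for x, y in training_data:
        relation[tuple(x)].add(tuple(y))
    return dict(relation)


def inputs_with_multiple_outputs(relation):
    """Return the input vectors that have more than one correct output vector."""
    return sorted(x for x, ys in relation.items() if len(ys) > 1)


# Hypothetical toy data: (input vector, output vector) pairs.
toy_data = [
    (("large", "at_risk"), ("call", "ceo")),
    (("large", "at_risk"), ("email", "account_manager")),
    (("small", "at_risk"), ("email", "help_desk")),
]

relation = build_relation(toy_data)
ambiguous = inputs_with_multiple_outputs(relation)
print(ambiguous)
```

A function approximator would have to pick a single output for the first input; a relation keeps both labeled output vectors as acceptable answers.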
Many current algorithms fail to directly support multiple outputs. Current approaches either induce one model per output or create a single model that gives multiple outputs without explicitly modeling the dependencies. Different models support multiple independent outputs with a varying degree of success, without further modification. Decision Tree learning algorithms must either induce multiple trees in order to produce multiple outputs or must induce a single tree that blows up exponentially, but neither of these approaches can model dependence between the output variables.
K-Nearest Neighbor algorithms can support multiple outputs with little change to the basic algorithm. Multi-Layer Perceptron (MLP) models can give multiple outputs with either a single-model or a multiple-model approach. However, none of these algorithms explicitly models dependent outputs. Auto-associative models, such as Hopfield networks, come close to this capability, but they are unable to handle arbitrary input and output mappings, in contrast to the hetero-associative model that we present.
We introduce the Hierarchical Multi-Output Nearest Neighbor model (HMONN) in order to solve the MOD problem. This hierarchical model has two layers. The first layer is a naïve approach with one learning model per output. The models that comprise the first layer can be any traditional machine learning model. The second layer is a modified nearest neighbor model that refines the predictions made by the first layer. HMONN is shown graphically in Figure 1. Though HMONN focuses on tasks with nominal features, it also gives improvement for some tasks with real-valued features by implicitly modeling a similarity function for the feature space.
2 Related Work
Other work has examined classification with multiple labels, although the labels are generally not considered to be dependent on each other and multiple correct labels are not considered. Multi-label classification considers problems with multiple outputs, but no dependency between the outputs is modeled. Tsoumakas et al. [12] give an overview of multi-label classification. They define two main approaches for multi-label classification. The first approach is problem transformation, where the given data is transformed into a single problem that already has a well-defined solution. The second approach is algorithm adaptation, where current algorithms are modified to solve the multi-label classification problem. Recent work has looked at correlations between labels in multi-label classification to improve accuracy. Read et al. [9, 10] introduced classifier chains for supporting these correlations, which could be viable for supporting MOD problems.
Many problems have a structure that is missed by standard classification algorithms. Structured Prediction (SP) seeks to solve this problem by modeling the structure of the outputs. This structure could be a sequence, a tree, a graph, or an image. This allows for multiple outputs as well as output dependencies. However, these dependencies are almost always limited to Markovian dependencies, that is, outputs related by time or space. Theoretically, SP algorithms are capable of modeling any problem with structure, and MOD problems would be an example of this kind of problem. The main difference between MOD and SP is that MOD problems are assumed to have some inputs with multiple correct outputs, whereas current SP algorithms assume a single correct output for each input. Bakır et al. [1] give an overview of the state of the art in SP.
While MOD learning is relation approximation, this should not be confused with relational learning. Statistical Relational Learning [8, 4] and Multi-Relational Learning [2] both handle relational data, not relation approximation. These relational learning models learn a function from relational data and handle specially formatted and structured data.
We present the Hierarchical Multi-Output Nearest Neighbor model (HMONN) to solve the MOD problem. We define an output prediction to be correct for a given input vector if some training instance in the training data has the same input vector labeled with the predicted output vector (or if the prediction matches the label of the current test instance). The traditional definition of a correct prediction only takes into account the labeling of the instance currently being tested. Our definition allows the model to use information contained within the training data to determine which output predictions should be counted as correct.
Traditional models are not able to model the dependencies between outputs. This is, in part, because traditional models are function approximators, whereas MOD problems are relational. As an example, consider a training set that contains two training patterns with the same input vector and different output vectors. A traditional MLP will oscillate between the multiple possible outputs and may not give any of the possible correct output vectors. The MLP will adjust its weights towards the first output vector whenever the first instance is encountered, and towards the second output vector whenever the second instance is encountered. The network will consequently adjust the weights towards a compromise between the two outputs (without ever stabilizing), rather than towards either of the correct outputs. A graphical example of the problem faced by an algorithm trying to learn a problem with multiple correct outputs is shown in Figure 2. The solid curve represents the relation in the training data and the dotted curve represents the function that could be learned, for example by an MLP. An appropriate algorithm should output both branches of the relation (following the relation exactly), choose one of the branches arbitrarily, or choose one based on some criterion. It should not, however, output something completely different.
Figure 2. Left panel: Actual Relation. Right panel: Possible Learned Function.
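The behavior described above can be reproduced in a deliberately simplified setting. The sketch below is an assumption-laden stand-in for an MLP: a single scalar output trained by online gradient descent on squared error, presented alternately with two conflicting targets for the same input. The function name `train_on_conflict` and the targets 0 and 1 are hypothetical.

```python
def train_on_conflict(targets, learning_rate=0.1, epochs=200):
    """Online gradient descent on squared error for one scalar output.

    Returns the output value after each update in the final epoch.
    """
    output = 0.0
    last_epoch = []
    for _ in range(epochs):
        last_epoch = []
        for target in targets:
            # Gradient of (output - target)^2 with respect to output.
            output -= learning_rate * 2.0 * (output - target)
            last_epoch.append(output)
    return last_epoch


# The same input is labeled 0.0 in one instance and 1.0 in another.
after_first, after_second = train_on_conflict([0.0, 1.0])
print(after_first, after_second)
```

In this toy setting the output keeps moving back and forth between two values near the midpoint of the targets and never settles on either labeled value, which is the failure mode that motivates modeling a relation rather than a function.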
HMONN favors one output vector over the others. Even though we could give a distribution of potential outputs from the neighborhood of the initial prediction, this version gives one of the possible correct output vectors for the given input vector. This output vector is the most common in the given neighborhood, and thus varies with neighborhood size and makeup. HMONN is a first step towards solving the MOD problem.
HMONN starts with an initial prediction and then uses the extra information provided by that initial prediction to give the final output. The initial prediction can be obtained using any machine learning method. Here, we train an MLP classifier for each output, and the outputs of the MLP classifiers are combined into an initial prediction. A modified K-Nearest Neighbor (KNN) algorithm then gives the final prediction. HMONN uses a different distance function, in which the initial prediction is used as part of the features. The distance function ranges over both the features of the input space and the outputs, and a weight emphasizes either the input space or the output space as more important. This modification of KNN captures the dependency between output variables by incorporating them into the input feature space. In summary, HMONN takes the initial prediction from the MLP classifiers, uses this initial prediction as part of the features in a KNN algorithm, and chooses the majority output vector from the neighborhood as the final prediction.
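A minimal sketch of the second layer follows. The exact distance function is not recoverable from the text here, so the form used below is an assumption: a weighted sum of the fraction of mismatched nominal input features and the fraction of mismatched outputs, with a weight `w` between 0 and 1. The names `hmonn_distance` and `hmonn_predict`, the toy data, and the default values are hypothetical, and the first-layer prediction is passed in directly rather than produced by MLP classifiers.

```python
from collections import Counter


def hmonn_distance(x_a, y_a, x_b, y_b, w=0.5):
    """Assumed form: weighted fraction of mismatched nominal values."""
    input_term = sum(a != b for a, b in zip(x_a, x_b)) / len(x_a)
    output_term = sum(a != b for a, b in zip(y_a, y_b)) / len(y_a)
    return (1.0 - w) * input_term + w * output_term


def hmonn_predict(x, initial_prediction, training_data, k=3, w=0.5):
    """Refine a first-layer prediction by majority vote over the k nearest instances."""
    ranked = sorted(
        training_data,
        key=lambda inst: hmonn_distance(x, initial_prediction, inst[0], inst[1], w),
    )
    votes = Counter(tuple(y) for _, y in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical training data: (input vector, output vector) pairs.
training_data = [
    (("a", "b"), ("call", "ceo")),
    (("a", "b"), ("call", "ceo")),
    (("a", "b"), ("email", "staff")),
    (("c", "d"), ("email", "staff")),
]

# Independent per-output models may produce a combination never seen together.
initial = ("call", "staff")
refined = hmonn_predict(("a", "b"), initial, training_data, k=3, w=0.5)
print(refined)
```

Because the vote is taken over whole output vectors, the final prediction is always a combination that actually occurs in the training data, which is how the second layer can repair an inconsistent first-layer prediction.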
The relationship between the dependence between outputs and multiple correct output vectors for a given input vector is shown in Theorem 3.1. Theorem 3.1 claims that we can observe the dependence between two output variables directly in the training data. There is also a case for loose dependence that relies on different vectors being only similar, but this work considers exact equality.
Theorem 3.1. Consider an input vector of nominal features and two scalar output variables taken from the output vector, all treated as random variables. If the two output variables are conditionally dependent on each other given the input and the training data, then there is some input vector that is associated with multiple output vectors in the training data.
Proof. Assume that the two outputs are conditionally dependent given the input variable and the training data. By the definition of statistical dependence, this implies that, for some input vector, the joint conditional probability of the two outputs differs from the product of their individual conditional probabilities. Assume that a single output vector is the only possible correct output for that input vector. Then the joint conditional probability equals the product of the individual conditional probabilities, because all of the probability mass lies on that single output vector. This contradicts the definition of statistical dependence. Thus, there must be multiple possible output vectors for the input vector. ∎
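The argument can be checked empirically on a toy data set. The sketch below is illustrative: it estimates the conditional probabilities by counting within the training data, and the function name `dependent_inputs` and the example data are assumptions. It lists the inputs for which the empirical joint distribution of two outputs differs from the product of the marginals.

```python
from collections import Counter, defaultdict


def dependent_inputs(training_data, tol=1e-12):
    """Return inputs whose two outputs are empirically dependent given the input."""
    by_input = defaultdict(list)
    for x, (y1, y2) in training_data:
        by_input[tuple(x)].append((y1, y2))

    found = []
    for x, pairs in by_input.items():
        n = len(pairs)
        joint = Counter(pairs)
        first = Counter(a for a, _ in pairs)
        second = Counter(b for _, b in pairs)
        # Dependence: some joint probability differs from the product of marginals.
        if any(
            abs(joint[(a, b)] / n - first[a] * second[b] / (n * n)) > tol
            for a in first
            for b in second
        ):
            found.append(x)
    return found


# Hypothetical data: input "u" has two output vectors, input "v" has only one.
data = [
    (("u",), ("call", "ceo")),
    (("u",), ("email", "staff")),
    (("v",), ("email", "staff")),
    (("v",), ("email", "staff")),
]

dependent = dependent_inputs(data)
print(dependent)
```

Only the input with more than one labeled output vector is flagged, in line with the theorem: an input with a single output vector always yields a joint probability equal to the product of the marginals.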
4 Experimental Results
The accuracy of MOD classifiers was evaluated on three different types of data: synthetic data, UCI repository data, and real-world data. This accuracy was compared to a baseline model that consists of a single classifier trained separately for each output, which we call the naïve model, where each separate prediction is combined into a single output vector. Accuracy is defined as follows.
Accuracy is the fraction of instances in the test set whose predicted output vector is counted as correct. Correctness is evaluated against the training set and is given by a Kronecker delta function that returns 1 if its expression is true and 0 otherwise. This metric counts a prediction as correct if the same input vector appears in the data set labeled with the predicted output vector. It treats all correct output vectors as equally good.
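A sketch of this accuracy measure is given below, assuming exact equality of input vectors and normalization by the size of the test set. The function name `mod_accuracy` and the toy data are hypothetical. A prediction counts as correct if it matches the test label or any output vector attached to the same input vector in the training set.

```python
def mod_accuracy(test_set, predictions, training_set):
    """Fraction of test instances whose prediction is an acceptable output vector."""
    correct_outputs = {}
    for x, y in training_set:
        correct_outputs.setdefault(tuple(x), set()).add(tuple(y))

    hits = 0
    for (x, y), y_hat in zip(test_set, predictions):
        # Acceptable outputs: training labels for this input plus the test label.
        acceptable = correct_outputs.get(tuple(x), set()) | {tuple(y)}
        hits += tuple(y_hat) in acceptable
    return hits / len(test_set)


# Hypothetical data with nominal values.
train = [((1,), ("a", "b")), ((1,), ("c", "d")), ((2,), ("a", "d"))]
test = [((1,), ("a", "b")), ((2,), ("a", "d")), ((3,), ("c", "b"))]
predicted = [("c", "d"), ("a", "b"), ("c", "b")]

accuracy = mod_accuracy(test, predicted, train)
print(accuracy)
```

The first prediction differs from the test label but is still counted, because the training set labels the same input with that output vector.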
Standard machine learning tasks with only nominal input features are common, and we assume that the same will hold for MOD data sets. HMONN shows clear improvement on these data sets. Many tasks also have real-valued features. While it is more difficult to find a duplicate in these data sets, real-valued features will often have some level of discretization done to them, through either binning or rounding that increases the likelihood of finding duplicate vectors in the data set. This alters the amount of dependence between the output variables. Thus, in many current data sets, real-valued features do not necessarily take on a large range of values. This allows the given definition of accuracy to work in many cases with real-valued features. To better handle real-valued features, the definition of accuracy could be extended to allow for similar values, as opposed to requiring values to be exactly equal. We are currently working on extending MOD accuracy metrics to better support real-valued features.
Despite the issue of the frequency of exact vectors for real-valued features, HMONN improves the accuracy in some of the experiments on synthetic and UCI data that have real-valued inputs. This is due to the fact that the nearest neighbor portion of the algorithm creates an implicit similarity function for the feature space. The similarity function behaves differently based on the neighborhood size. This gives a distance-based voting for which outputs are correct for any given portion of the feature space. This causes the majority class for any given neighborhood in the feature space to always be the correct value. Selecting outputs in this fashion avoids some of the difficulty with real-valued features, even though it does not solve the problem completely. We are currently exploring ways to fully resolve this problem as future work.
Some initial experimentation was used to determine values for the neighborhood size and the distance weight. We tested neighborhood sizes from 1 to 11 and a range of values for the weight. We found little difference among most of the tested settings, with one exception that performed slightly worse. In the following experiments, we use representative values: a neighborhood size that allows for a reasonably sized neighborhood, and a weight that gives an equal balance between the input and output features. Experiments are run using 10-fold cross validation. The naïve neural network layer used a standard MLP with a single hidden layer for each output, with the number of hidden nodes set from the number of attributes, including the outputs, in the corresponding data set. All experiments are run with a learning rate of 0.1 and stop after 10 epochs without any improvement on a held-out validation set. Statistical significance is determined using the Wilcoxon signed rank test at a fixed significance level.
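The stopping rule can be sketched independently of any particular network. The code below is an illustration under stated assumptions: it takes a precomputed list of per-epoch validation errors (a hypothetical stand-in for evaluating an MLP on the held-out set), and the function name `early_stopping_epoch` is introduced here.

```python
def early_stopping_epoch(validation_errors, patience=10):
    """Return (stop_epoch, best_error) for a patience-based stopping rule."""
    best_error = float("inf")
    epochs_without_improvement = 0
    for epoch, error in enumerate(validation_errors):
        if error < best_error:
            best_error = error
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch, best_error
    # Ran out of epochs before the patience limit was reached.
    return len(validation_errors) - 1, best_error


# Hypothetical validation curve: improves for three epochs, then plateaus.
errors = [0.5, 0.4, 0.3] + [0.35] * 20
stop_epoch, best = early_stopping_epoch(errors, patience=10)
print(stop_epoch, best)
```

Training halts ten epochs after the last improvement, and the best validation error seen so far is kept as the reference point.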
4.1 Synthetic Data
Two different types of synthetic data were created. One used real-valued features in order to determine whether HMONN implicitly models a similarity function for the feature space, as hypothesized. The other used only nominal features.
Real-valued synthetic data was created using the following process. Given a number of output variables, a data set is generated by selecting a number of points in the input space as centroids. These points are each randomly assigned a number of probability vectors. A probability vector contains a probability distribution over possible output vectors. To generate an instance, a centroid is selected at random, and the input values for that instance are generated by randomly perturbing the centroid according to a Gaussian distribution. An output vector is then chosen according to the probability distribution of a randomly chosen probability vector for that centroid. This generation process attempts to model the fact that, for MOD problems, a portion of the input space can belong to more than one output vector.
Nominal synthetic data was created using the process outlined above with one difference. To generate a centroid, a center point for each feature was chosen from a fixed set of values. New inputs were generated by adding a randomly selected offset to the center point for that feature. Values above the upper bound were set to the upper bound, and values below the lower bound were set to the lower bound.
The number of outputs was set to 2, 3, or 4 (with 4 possible values for each output), and the number of centroids was varied. The number of inputs was set to 3 times the number of outputs. The number of probability vectors was the same as the number of centroids, 1.5 times the number of centroids, or 2 times the number of centroids. 5000 instances were generated for each data set. This results in 12 data sets for each of 2 outputs, 3 outputs, and 4 outputs, giving a total of 36 data sets.
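The real-valued generation process can be sketched as follows. Several details are assumptions made for illustration and are not taken from the original experiments: centroids are drawn uniformly from the unit hypercube, the Gaussian perturbation uses a standard deviation of 0.1, each probability vector places its mass on three randomly drawn output vectors, and every centroid receives at least one probability vector. The function name and parameters are hypothetical.

```python
import random


def make_real_valued_mod_data(n_instances, n_outputs, n_centroids, n_prob_vectors,
                              values_per_output=4, sigma=0.1, support=3, seed=0):
    """Generate (input vector, output vector) pairs around random centroids."""
    rng = random.Random(seed)
    n_inputs = 3 * n_outputs  # from the text: three inputs per output
    centroids = [[rng.random() for _ in range(n_inputs)] for _ in range(n_centroids)]

    def random_probability_vector():
        # Assumed representation: a few candidate output vectors with normalized weights.
        outputs = [tuple(rng.randrange(values_per_output) for _ in range(n_outputs))
                   for _ in range(support)]
        weights = [rng.random() for _ in range(support)]
        total = sum(weights)
        return outputs, [weight / total for weight in weights]

    # Every centroid gets one probability vector; the rest are assigned at random.
    assigned = [[random_probability_vector()] for _ in range(n_centroids)]
    for _ in range(n_prob_vectors - n_centroids):
        rng.choice(assigned).append(random_probability_vector())

    data = []
    for _ in range(n_instances):
        c = rng.randrange(n_centroids)
        x = tuple(rng.gauss(mu, sigma) for mu in centroids[c])
        outputs, weights = rng.choice(assigned[c])
        y = rng.choices(outputs, weights=weights, k=1)[0]
        data.append((x, y))
    return data


sample = make_real_valued_mod_data(n_instances=200, n_outputs=2,
                                   n_centroids=4, n_prob_vectors=6)
print(sample[0])
```

Because one centroid can carry several probability vectors, nearby inputs can be labeled with different output vectors, which is the property the synthetic data is meant to exhibit.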
Table 1. HMONN compared with the naïve model on synthetic data. Column groups: Real-Valued Features, Nominal Features.
The results of comparing HMONN to the naïve model for real-valued and nominal features are given in Table 1. HMONN outperformed the naïve model for the real-valued synthetic data, and the improvement was always statistically significant. This is likely due to the fact that HMONN exploits the information contained in the local neighborhood in order to produce outputs. HMONN will have more information available with more outputs. This will make the neighborhood more specific, thus giving the algorithm a higher chance of finding a correct output. HMONN outperformed the naïve model for the nominal synthetic data as well, and the improvement was always statistically significant.
4.2 UCI Data
The UCI repository [3] does not contain any data sets that are MOD decision problems. Therefore, MOD data sets were created from the original UCI data sets by allowing each nominal feature to act as an output class for a derivative data set. If, for example, the number of outputs was set to two, each data set would create as many derived data sets as it has nominal features. Each of these derived data sets consists of a nominal feature combined with the original output class acting as the output classes, with all of the other features acting as inputs. Similarly, for three or four outputs the original output class is combined with two or three (respectively) nominal features to act as the outputs. The number of data sets scales linearly in the number of inputs with two outputs, quadratically with three outputs, and cubically with four outputs. This is a contrived solution, but we assume that there is some dependency between the input variables and the output variable, especially for data sets from the UCI repository. Twenty UCI data sets were used for the experiments. These data sets were chosen arbitrarily from those that had more than five nominal input features. Nominal input features were necessary in order to create the derivative data sets. Information for each data set is provided in Table 2. Missing values were replaced by the mean or mode.
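The construction of derived data sets can be sketched by enumerating which features are promoted to outputs. The code below is a sketch under the assumption that unordered combinations of nominal features are used, which matches the stated linear, quadratic, and cubic growth. The function name `derived_output_sets` and the feature names are hypothetical.

```python
from itertools import combinations


def derived_output_sets(nominal_features, class_name, n_outputs):
    """List the output sets of all derived data sets for a given number of outputs.

    Each output set is the original class plus (n_outputs - 1) nominal features.
    """
    return [combo + (class_name,)
            for combo in combinations(nominal_features, n_outputs - 1)]


# Hypothetical data set with six nominal features and one class attribute.
features = ("f1", "f2", "f3", "f4", "f5", "f6")
counts = {m: len(derived_output_sets(features, "class", m)) for m in (2, 3, 4)}
print(counts)
```

For each output set, every remaining feature would act as an input to the derived data set.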
Table 2 shows the results for the UCI data set experiments. The table contains values for both HMONN and the naïve model, compared by number of outputs. Each number is obtained by averaging the results across all the derived data sets from the original UCI data set for the given algorithm. Statistically significant results are highlighted. HMONN outperformed the naïve model 79% of the time, with the difference being statistically significant 68% of the time (see the Total columns). In some cases there was no significant difference. In four cases, the naïve model outperformed HMONN, but never by as large a margin as HMONN achieves when it is better. This further demonstrates the potential of HMONN as a model to solve MOD decision problems. It also supports the assumption that there is some dependence between the output variable and the input variables in the UCI data sets.
4.3 Business Application Data
The motivation for defining MOD problems stems from a local business, InsideSales.com, that provided data for a real-world MOD task. Due to the proprietary nature of this business data, we are only permitted to reproduce a de-identified version of it. The data includes a two-output data set and a three-output data set. The data sets have fourteen nominal features and eight real-valued features. The two-output data set has 32544 instances, and the three-output data set has 32774 instances.
The task is to determine the timing and method to contact business leads. Business practices would imply that these variables are dependent given the input: the time at which a lead is contacted depends on the method used, and the method used depends on the timing. The results are shown in Table 3. HMONN outperformed the naïve model in both cases. This shows that the improvement of HMONN seen in the UCI and synthetic data can also be seen in real-world MOD problems. The synthetic data and the real-world business data are definitely MOD problems. However, the synthetic data is not necessarily representative of real data, and there is little real data. The UCI data is used to supplement the other data sources, although it can only be assumed to represent MOD data.
We provided a definition for MOD problems, as well as a method to solve such problems. We have defined the Hierarchical Multi-Output Nearest Neighbor model, with a naïve independence model as the first layer and a modified nearest neighbor model as the second layer. This model is based on the assumption that local context is a key element in solving MOD problems. HMONN consistently outperforms the baseline model, typically with statistical significance. This holds true for synthetic data, UCI repository data, and one real-world business task.
Future work will develop solutions using other types of models (such as relaxation networks), an improved method for calculating accuracy on MOD problems, improved methods for validating new MOD algorithms, and new methods for identifying and collecting MOD data. With MOD problems, it is difficult to know how much dependency any given problem may have. Many of the data sets that we used for validation could only be assumed to have some level of dependency. A method to identify the degree of output dependency in a given data set is another piece of future work.
[1] Bakır, G., Hofmann, T., Schölkopf, B.: Predicting Structured Data. The MIT Press (2007)
[2] Džeroski, S.: Multi-relational data mining: an introduction. ACM SIGKDD Explorations Newsletter 5(1), 1–16 (2003)
[3] Frank, A., Asuncion, A.: UCI machine learning repository (2010), http://archive.ics.uci.edu/ml
[4] Getoor, L., Mihalkova, L.: Learning statistical models from relational data. In: Proceedings of the 2011 International Conference on Management of Data. pp. 1195–1198. ACM (2011)
[5] Godbole, S., Sarawagi, S.: Discriminative methods for multi-labeled classification. In: Proceedings of the 8th Pacific-Asia Conference on Knowledge Discovery and Data Mining. pp. 22–30. Springer (2004)
[6] Heath, D., Zitzelberger, A., Giraud-Carrier, C.: A multiple domain comparison of multi-label classification methods. Working Notes of the 2nd International Workshop on Learning from Multi-Label Data p. 21 (2010)
[7] Hopfield, J., Tank, D.: Neural computation of decisions in optimization problems. Biological Cybernetics 52(3), 141–152 (1985)
[8] Neville, J., Rattigan, M., Jensen, D.: Statistical relational learning: Four claims and a survey. In: Workshop SRL, Int. Joint. Conf. on AI (2003)
[9] Read, J., Pfahringer, B., Holmes, G., Frank, E.: Classifier chains for multi-label classification. In: Proceedings of the European Conference on Machine Learning and Knowledge Discovery in Databases: Part II. pp. 254–269. Springer-Verlag (2009)
[10] Read, J., Pfahringer, B., Holmes, G., Frank, E.: Classifier chains for multi-label classification. Machine Learning 85(3), 333–359 (2011)
[11] Taskar, B., Guestrin, C., Koller, D.: Max-margin Markov networks. In: Thrun, S., Saul, L., Schölkopf, B. (eds.) Advances in Neural Information Processing Systems 16. MIT Press (2004)
[12] Tsoumakas, G., Katakis, I., Vlahavas, I.: Mining multi-label data. Data Mining and Knowledge Discovery Handbook pp. 667–685 (2010)