1 Introduction
The reemergence of Deep Learning DeepLearningBook2016 has found significant and successful applications in difficult real-world domains such as image krizhevsky2012imagenet, audio audio and video processing videoCVPR. Deep Learning has also been increasingly applied to structured domains, where the data is represented using richer symbolic or graph features that capture the relational structure between entities and attributes in the domain. Intuitively, deep learning architectures are naturally suited to learning and reasoning over such multi-relational domains, as their deeper layers can capture increasingly complex interactions between features. However, the combinatorial complexity of reasoning over a large number of relations and objects has remained a significant bottleneck.
Recent work in relational deep learning has sought to address this particular issue. This includes relational neural networks
KazemiPoole18RelNNs; SourekEtAl15LRNNs, relational Restricted Boltzmann machines
KaurEtAl18RRBM and neuro-symbolic architectures such as CILP FrancaEtAlCILP14. In our work, we focus on the framework of Column Networks (CLNs) developed by pham2017column. Column networks are composed of several (feed-forward) mini-columns, each of which represents an entity in the domain. Relationships between two entities are modeled through edges between mini-columns. These edges allow for the short-range exchange of information over successive layers of the column network; the true power of column networks, however, emerges as the depth of interactions increases, which allows for the natural modeling of long-range interactions. Column networks are an attractive approach for several reasons: (1) hidden layers of a CLN share parameters, which means that making the network deeper does not introduce more parameters, (2) as the depth increases, the CLN can model feature interactions of considerable complexity, which is especially attractive for relational learning, and (3) learning and inference are linear in the size of the network and the number of relations, which makes CLNs highly efficient. However, like other deep learning approaches, CLNs rely on vast amounts of data and incorporate little to no knowledge about the problem domain. While this may not be an issue for low-level applications such as image or video processing, it is a significant issue in relational domains, whose relational structure encodes rich, semantic information. Ignoring such domain knowledge can considerably hinder generalization.
It is well known that biasing learners is necessary in order to allow them to inductively leap from training instances to true generalization over new instances Mitchell80. Indeed, the inductive bias towards “simplicity and generality” leads to network architectures with simplifying assumptions through regularization strategies that aim to control the complexity of the neural/deep network. While deep learning does incorporate one such bias in the form of domain knowledge (for example, through parameter tying or convolution, which exploits neighborhood information), we are motivated to develop systems that can incorporate richer and more general forms of domain knowledge. This is especially germane for deep relational models, as they inherently construct and reason over richer representations. Such domain-knowledge-based inductive biases have been applied to a diverse array of machine learning approaches, variously known as advice-based, knowledge-based or human-guided machine learning.
One way in which a human can guide learning is by providing rules over training examples and features. The earliest such approaches combined explanationbased learning (EBLNN, shavlik89ebnn) or symbolic domain rules with ANNs (KBANN, towell1994knowledge). Domain knowledge as rules over input features
can also be incorporated into support vector machines (SVMs,
Cortes1995; Scholkopf98; fung2003knowledge; LeSmolaGartner06; kunapuli2010online). Another natural way a human could guide learning is by expressing preferences; this has been studied extensively within the preference-elicitation framework due to Boutilier et al. BoutilierEtAl06. We are inspired by this form of advice, as it has been successful within the context of inverse reinforcement learning KunapuliEtAl13; odomaaai15 and planning DasEtAl18. These approaches span diverse machine learning formalisms, and they all exhibit the same remarkable behavior: better generalization with fewer training examples, because they effectively exploit and incorporate domain knowledge as an inductive bias. This is the prevailing motivation for our approach: to develop a framework that allows a human to guide deep learning by incorporating rules and constraints that define the domain and its aspects. The incorporation of prior knowledge into deep learning has begun to receive attention recently, for instance, in work on incorporating prior knowledge of color and scene information into deep learning for image classification DingEtAl18. However, in many such approaches the guidance comes not from a human but from a preprocessing algorithm that generates it. Our framework is more general in that a human provides guidance during learning. Furthermore, the human providing the domain advice need not be an AI/ML expert, but rather a domain expert who provides rules naturally. We exploit the rich representational power of relational methods to capture, represent and incorporate such rules into relational deep learning models.
We make the following contributions: (1) we propose the formalism of Knowledge-augmented Column Networks, (2) inspired by previous work (such as KBANN), we present an approach to inject generalized domain knowledge into a CLN and develop a learning strategy that exploits this knowledge, and (3) we demonstrate, across four real problems, on some of which CLNs have been previously employed, the effectiveness and efficiency of injecting domain knowledge. Specifically, our results across the domains clearly show statistically superior performance with small amounts of data. As far as we are aware, this is the first work on human-guided CLNs.
The rest of the paper is organized as follows. We first review the background necessary for the paper, including CLNs. Then we present the formalism of K-CLNs and demonstrate with examples how to inject knowledge into CLNs. Next, we present experimental results across four domains before concluding the paper by outlining areas for future research.
2 Background and Related Work
The idea of using several processing layers to learn increasingly complex abstractions of the data originated with the perceptron model rosenblatt1958perceptron and was further strengthened by the advent of the backpropagation algorithm lecun1998gradient. A deep architecture was proposed by krizhevsky2012imagenet and has since been adapted for different problems across the entire spectrum of domains, such as Atari games via deep reinforcement learning mnih2013playing, sentiment classification glorot2011domain and image super-resolution dong2014learning. Applying advice to models has long been explored as a way to construct models that are more robust to noisy data fung2003knowledge; LeSmolaGartner06; towell1994knowledge; kunapuli2010adviceptron; odom2018human. fu1995introduction
presents a unified view of the different variations of knowledge-based neural networks, namely rule-based, decision-tree-based and semantic-constraint-based neural networks. Rule-based approaches translate symbolic rules into neural architectures; decision-tree-based ones impose bounded regions in the parameter space; and in constraint-based neural networks, each node denotes a concept and each edge denotes a relationship between concepts. Such advice-based learning has been proposed for support vector machines fung2003knowledge; le2006simpler in propositional cases and for probabilistic logic models odom2018human in relational cases. There has also been some work on applying advice to neural networks: towell1994knowledge introduce the KBANN algorithm, which compiles first-order logic rules into a neural network, and kunapuli2010adviceptron present the first work on applying advice, in the form of constraints, to the perceptron. In rule-based neural networks, the data attributes are assigned as input nodes, the target concept(s) as the output nodes and the intermediate concept(s) as the hidden nodes. A decision-tree-based network inherits its structure from the underlying decision tree: each decision node in the tree can be viewed as an input-space hyperplane, and a decision region bounded by these hyperplanes can be viewed as a leaf node. The knowledge-based neural network framework has been applied successfully to various real-world problems such as recognizing genes in DNA sequences
noordewier1991training, microwave design wang1997knowledge, robotic control handelman1990integrating and, recently, personalised learning systems melesko2018semantic. Combining relational (symbolic) and deep learning methods has recently gained significant research thrust, since relational approaches are indispensable for faithful and explainable modeling of implicit domain structure, a major limitation of most deep architectures in spite of their success. While extensive literature exists that aims to combine the two sutskever2009modelling; rocktaschel2014low; lodhi2013deep; battaglia2016interaction, to the best of our knowledge there has been little or no work on incorporating advice in any such framework. Column networks transform relational structures into a deep architecture in a principled manner and are designed especially for collective classification tasks pham2017column. The architecture and formulation of the column network are well suited to adaptation for the advice framework. The GraphSAGE algorithm hamilton2017inductive shares similarities with column networks, since both architectures operate by aggregating neighborhood information, but differs in the way the aggregation is performed. Graph convolutional networks kipf2016semi are another architecture very similar to the way a CLN operates, again differing in the aggregation method. diligenti2017semantic present a method of incorporating constraints, expressed as first-order logic statements with fuzzy semantics, as a regularization term in a neural model, which can be extended to collective classification problems. While similar in spirit to our proposed approach, it differs in its representation and problem setup.
Several recent approaches aim to make deep architectures robust to label noise. They include (1) learning from easy samples (with small loss) by using MentorNets, neural architectures that estimate a curriculum (i.e., importance weights on samples) jiang2018mentornet, (2) noise-robust loss functions via additional noise adaptation layers goldberger2016training or via multiplicative modifiers over the error/network parameters patrini2017making and (3) the introduction of a regularizer in the loss function for smoothing in the presence of adversarial randomizations on the distribution of the response variable miyato2018virtual. While the above approaches enable effective learning of deep models in the presence of noise, there are some fundamental differences between our problem setting and these related approaches.

[Type of noise]: Our approach aims to handle a specific type of noise, namely systematic/targeted noise odom2018human. It occurs frequently in real-world data due to several factors, including the cognitive biases of the humans who record the data (or errors in the recording processes) and sample sparsity.

[Type of error]: Systematic noise leads to generalization errors in the learned model (see Example 1).

[Structured data]: K-CLN works in the context of structured data (entities/relations). Faithful modeling of structure is crucial in most real domains, but such data is inherently sparse (most entities are not related to each other, i.e., most relations are false in the real world).

[Noise prior]: All the noise-handling approaches for deep models mentioned earlier explicitly try to model the noise, either via prior knowledge of the noise distribution or by estimating it with some proposal distribution. While adversarial regularization miyato2018virtual uses the learned label distribution as a proposal for generating perturbations, it requires a lot of data for the learner to converge. K-CLN does not explicitly model noise, but instead allows expert knowledge to guide the learner towards better generalization via an inductive bias.
3 Knowledge-augmented Column Networks
Column Networks pham2017column
allow for encoding interactions/relations between entities, as well as the attributes of such entities, in a principled manner without explicit relational feature construction or vector embedding. This is important when dealing with structured domains, especially in the case of collective classification. This enables us to seamlessly transform a multi-relational knowledge graph into a deep architecture, making CLNs among the more robust relational deep models. Figure 1 illustrates an example column network w.r.t. the knowledge graph on the left. Note how each entity forms its own column and relations are captured via the sparse inter-column connectors. Consider a graph $G = (V, A)$, where $V$ is the set of vertices/entities. For brevity, we assume only one entity type; however, there is no such theoretical limitation in the formulation. $A$ is the set of arcs/edges between two entities $i$ and $j$, denoted $\langle i, j \rangle$. Note that the graph is multi-relational, i.e., arcs are of the form $\langle i, r, j \rangle$, where $r \in \mathcal{R}$ and $\mathcal{R}$ is the set of relation types in the domain. To obtain the equivalent Column Network from $G$, let $\mathbf{x}_i$ be the feature vector representing the attributes of an entity $i$ and $y_i$ its label predicted by the model (since in our formulation every entity is uniquely indexed by $i$, we use $i$ and the entity $e_i$ interchangeably). $h_i^t$ denotes a hidden node w.r.t. entity $i$ at hidden layer $t$ ($t$ is the index of the hidden layers). As mentioned earlier, the context between 2 consecutive layers captures the dependency of the immediate neighborhood (based on arcs/edges/inter-column connectors). For entity $i$, the context w.r.t. relation $r$ and the hidden nodes are computed as,
(1) $c_{i,r}^{t} = \sum_{j \in \mathcal{N}_r(i)} h_j^{t-1}$
(2) $h_i^{t} = g\Big(b^{t} + W^{t} h_i^{t-1} + \frac{1}{z} \sum_{r \in \mathcal{R}} V_r^{t}\, c_{i,r}^{t}\Big)$
where $\mathcal{N}_r(i)$ is the set of all neighbors of $i$ w.r.t. relation $r$ in the knowledge graph $G$. Note that context connectors are absent between columns whose entities share no relation in the knowledge graph (compare the right and left sides of Figure 1). The activation of the hidden nodes (Equation 2) is computed as the sum of the bias, the weighted output of the previous hidden layer and the weighted contexts, where $W^{t}$ and $V_r^{t}$ are weight parameters and $b^{t}$ is a bias, for some activation function $g$. $z$ is a predefined constant that controls the parameterized contexts from growing too large for complex relations; setting $z$ to the average number of neighbors of an entity is a reasonable assumption. The final output layer is a softmax over the last hidden layer,

(3) $P(y_i = \ell \mid h_i^{T}) = \mathrm{softmax}\big(b_\ell + W_\ell\, h_i^{T}\big)$

where $\ell \in \mathcal{L}$ is the label ($\mathcal{L}$ is the set of labels) and $T$ is the index of the last hidden layer.
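To make the layer computation concrete, the following is a minimal NumPy sketch of Equations 1-2. The function name, the dictionary encoding of the neighborhood and the choice of ReLU for the activation $g$ are our own illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def cln_layer(h_prev, neighbors, W, V, b, z):
    """One Column Network layer (sketch of Equations 1-2).

    h_prev    : (n_entities, d) hidden states from layer t-1
    neighbors : dict mapping relation r -> {entity i: list of neighbor ids}
    W         : (d, d) column weight matrix, shared across entities
    V         : dict mapping relation r -> (d, d) context weight matrix
    b         : (d,) bias vector
    z         : normalizing constant (e.g., average neighbor count)
    """
    n, d = h_prev.shape
    out = np.tile(b, (n, 1)) + h_prev @ W.T
    for r, nbrs in neighbors.items():
        for i, js in nbrs.items():
            if js:
                # Equation 1: context = sum of neighbors' previous states
                c_ir = h_prev[js].sum(axis=0)
                # Equation 2: add the (1/z)-scaled, weighted context
                out[i] += (V[r] @ c_ir) / z
    return np.maximum(out, 0.0)  # ReLU as the activation g
```

Note that all columns share $W$, $V_r$ and $b$, which is why adding depth does not add parameters.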
Following pham2017column, we formulate our approach in the context of relation-sensitive predictive modeling, specifically collective classification tasks. However, structured data is implicitly sparse, since most entities in the world are not related to each other, which adds to the existing challenge of faithfully modeling the underlying structure. The challenge is amplified as we aim to learn in knowledge-rich, data-scarce problems. As we show empirically, sparse samples (or targeted noise) may lead to suboptimal learning or slower convergence.
Example 1
Consider a problem of classifying whether a published article is about carcinoid metastasis
zuetenhorst2005metastatic or is irrelevant, from a citation network and textual features extracted from the articles themselves. There are several challenges: (1) data is implicitly sparse due to the rarity of studied cases and experimental findings, (2) some articles may cite other articles related to carcinoid metastasis and contain a subset of the textual features, yet address another topic, and (3) there is targeted noise, where some important citations were not extracted properly by a citation parser and/or the abstracts are not informative enough.
The above cases may prevent the model from effectively capturing certain dependencies, or cause it to converge more slowly, even if those dependencies are captured somewhere in the advanced layers of the deep network. Our approach attempts to alleviate this problem via augmented learning of Column Networks using human advice/knowledge. We formally define our problem in the following manner:
Given: A sparse multi-relational graph $G$, attributes $\mathbf{x}_i$ of each entity (sparse or noisy) in $G$, the equivalent Column Network and access to a human expert. To Do: More effective and efficient collective classification by knowledge-augmented training of the network parameters $\Theta$, where $\Theta$ is the set of all network parameters of the Column Network.
We develop Knowledge-augmented CoLumn Networks (K-CLN), which incorporate human knowledge for more effective and efficient learning from relational data (Figure 2 illustrates the overall architecture). While knowledge-based connectionist models are not entirely new, our formulation provides (1) a principled approach for incorporating advice specified in an intuitive logic-based encoding/language and (2) a deep model for collective classification in relational data.
3.1 Knowledge Representation
Any model-specific encoding of domain knowledge, such as numeric constraints or modified loss functions, has several limitations: (1) it is counter-intuitive to the humans providing the knowledge, since they are domain experts and not machine learning experts, and (2) the resulting framework is brittle and not generalizable. Consequently, we employ preference rules (akin to IF-THEN statements) to capture human knowledge.
Definition 1
A preference is a modified Horn clause,

$\bigwedge_k \mathtt{attr}_k(E_{x_k}) \wedge \ldots \wedge \mathtt{rel}(E_x, E_y) \Rightarrow \big[\mathit{label}^{+}(E_x) \vee \mathit{label}^{-}(E_x)\big]$

where the $E$s are variables over entities, the $\mathtt{attr}_k$ are attributes of those entities and $\mathtt{rel}$ is a relation between them. $\mathit{label}^{+}$ and $\mathit{label}^{-}$ indicate the preferred and non-preferred labels respectively. Quantification is implicitly universal and is hence dropped. We denote a set of preference rules as $\mathbf{P}$.
Note that we can always either have just the preferred label in the head of the clause and assume all others to be non-preferred, or treat the entire head expression as a single literal. Intuitively, a rule can be interpreted as a conditional: IF [conditions hold] THEN a label is preferred. A preference rule can also be partially instantiated, i.e., one or more of the variables may be substituted with constants.
Example 2
For the prediction task mentioned in Example 1, a possible preference rule could be,

$\mathtt{hasWord}(E_1, \text{``domain''}) \wedge \mathtt{cites}(E_1, E_2) \wedge \mathtt{hasType}(E_2, \text{``AI''}) \Rightarrow \mathit{label}^{-}(E_1, \mathrm{relevant})$

Intuitively, this rule denotes that an article is not a relevant clinical work on carcinoid metastasis if it cites an ‘AI’ article and contains the word “domain”, since it is likely to be another AI article that merely uses carcinoid metastasis as an evaluation domain.
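In a plain-Python encoding, the rule of this example might look as follows. The class and field names are our own illustrative choices, not a syntax prescribed by the framework:

```python
from dataclasses import dataclass

@dataclass
class Preference:
    """A preference rule (Definition 1): a conjunctive body of attribute
    and relation literals, and a (non-)preferred label for a target."""
    body_attrs: list   # attribute literals, e.g. ("hasWord", "E1", "domain")
    body_rels: list    # relation literals, e.g. ("cites", "E1", "E2")
    target: str        # entity variable the head label applies to
    label: str         # the label in the head
    preferred: bool    # True -> preferred label; False -> non-preferred

# Example 2: citing an 'AI' article while containing the word "domain"
# makes 'relevant' a non-preferred label for article E1.
rule = Preference(
    body_attrs=[("hasWord", "E1", "domain"), ("hasType", "E2", "AI")],
    body_rels=[("cites", "E1", "E2")],
    target="E1",
    label="relevant",
    preferred=False,
)
```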
3.2 Knowledge Injection
Given that knowledge is provided as partially-instantiated preference rules $\mathbf{P}$, more than one entity may satisfy a preference rule, and more than one preference rule may apply to a single entity. The main intuition is that we aim to consider the error of the trained model w.r.t. both the data and the advice. Consequently, in addition to the “data gradient”, as with original CLNs, there is an “advice gradient”. This gradient acts as feedback, augmenting the learned weight parameters (both column and context weights) in the direction of the advice gradient. Not all parameters are augmented: only the parameters w.r.t. the entities and relations (contexts) that satisfy $\mathbf{P}$ should be affected. Let $\mathbf{E(P)}$ be the set of entities and relations that satisfy the set of preference rules $\mathbf{P}$. The hidden nodes (Equation 2) can now be expressed as,
(4) $h_i^{t} = g\Big(b^{t} + W^{t} h_i^{t-1}\,\Gamma_i^{(W)} + \frac{1}{z} \sum_{r \in \mathcal{R}} V_r^{t}\, c_{i,r}^{t}\,\Gamma_{i,r}^{(c)}\Big) \quad \forall i \in \mathbf{E(P)}$
where $i \in \mathbf{E(P)}$, and $\Gamma_i^{(W)}$ and $\Gamma_{i,r}^{(c)}$ are advice-based soft gates with respect to a hidden node and its context respectively, computed by applying a gating function $\mathcal{K}(\cdot)$ to the trade-off-weighted “advice gradient” $\nabla_i^{(A)}$ ($\alpha$ is the trade-off parameter explained later). The key aspect of the soft gates is that they attempt to enhance or decrease the contribution of particular edges in the column network, in alignment with the direction of the “advice gradient”. We choose the gating function to be an exponential, $\mathcal{K}(x) = e^{x}$. The intuition is that soft gates are natural here, as they are multiplicative: a positive gradient increases the value/contribution of the respective term, while a negative gradient pushes it down. We now present the “advice gradient” (the gradient with respect to preferred labels).
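As a sketch, the multiplicative behavior of the exponential gate is easy to verify. The function name and the explicit trade-off argument are our own; in K-CLN the trade-off-weighted advice gradient is fed into the gate:

```python
import numpy as np

def advice_gate(advice_grad, tradeoff=1.0):
    """Exponential soft gate: Gamma = exp(tradeoff * advice_gradient).

    A positive advice gradient (the model under-predicts the preferred
    label) yields a gate > 1 that amplifies the term it multiplies; a
    negative gradient yields a gate in (0, 1) that damps it; a zero
    gradient leaves the term unchanged (gate = 1).
    """
    return np.exp(tradeoff * np.asarray(advice_grad, dtype=float))
```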
Proposition 1
Under the assumption that the loss function with respect to the advice / preferred labels is a log-likelihood of the form $\mathcal{L}^{(A)} = \log P\big(y_i^{(A)} \mid h_i^{T}\big)$, the advice gradient is $\nabla_i^{(A)} = \mathbb{I}\big(y_i^{(A)}\big) - P(y_i)$, where $y_i^{(A)}$ is the preferred label of entity $i$ and $\mathbb{I}$ is an indicator function over the preferred label. For binary classification the indicator is inconsequential, but for multi-class scenarios it is essential ($\mathbb{I} = 1$ for the preferred label and $\mathbb{I} = 0$ otherwise).
Since an entity can satisfy multiple advice rules, we take the MAX preferred label, i.e., we take $y_i^{(A)}$ to be the label $\ell$ if $\ell$ is given by most of the advice rules that entity $i$ satisfies. In case of conflicting advice (i.e., different labels are equally advised), we simply set the advice label to be the label given by the data, $y_i^{(A)} = y_i$.
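A minimal sketch of Proposition 1 together with the tie-breaking rule above (function names are hypothetical):

```python
import numpy as np
from collections import Counter

def preferred_label(advice_labels, data_label):
    """Resolve multiple (possibly conflicting) advice rules for an entity.

    advice_labels : labels suggested by every rule the entity satisfies
    data_label    : label observed in the data (fallback on ties/no advice)
    """
    if not advice_labels:
        return data_label
    counts = Counter(advice_labels).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return data_label              # conflicting advice: defer to data
    return counts[0][0]                # majority (MAX) preferred label

def advice_gradient(pred_probs, y_advice, n_labels):
    """Advice gradient I(y^A) - P(y) from Proposition 1."""
    indicator = np.zeros(n_labels)
    indicator[y_advice] = 1.0
    return indicator - pred_probs
```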
Proof Sketch: Most advice-based learning methods formulate the effect of advice as a constraint on the parameters or a regularization term in the loss function. We consider a regularization term based on the advice loss $\mathcal{L}^{(A)} = \log P\big(y_i^{(A)} \mid h_i^{T}\big)$, and we know that $P(y_i \mid h_i^{T}) \propto \exp\big(\psi(y_i, h_i^{T})\big)$. We consider $\psi$ in its functional form, following prior non-parametric boosting approaches odomaaai15. Thus $P\big(y_i^{(A)} \mid h_i^{T}\big) = \exp\big(\psi(y_i^{(A)}, h_i^{T})\big) / \sum_{\ell} \exp\big(\psi(\ell, h_i^{T})\big)$. A functional gradient of $\mathcal{L}^{(A)}$ w.r.t. $\psi$ yields

$\nabla_i^{(A)} = \frac{\partial \mathcal{L}^{(A)}}{\partial \psi} = \mathbb{I}\big(y_i^{(A)}\big) - P\big(y_i \mid h_i^{T}\big).$
Alternatively, assuming a squared loss such as $\big(y_i^{(A)} - P(y_i)\big)^2$ would result in an advice gradient of the form $2\big(\mathbb{I}(y_i^{(A)}) - P(y_i)\big)\,P(y_i)\big(1 - P(y_i)\big)$.
As illustrated in the K-CLN architecture (Figure 2), at the end of every epoch of training the advice gradients are computed, and the soft gates are used to augment the values of the hidden units as shown in Equation 4.

Proposition 2
Given that the loss function of the original CLN is cross-entropy (binary or sparse-categorical for the binary and multi-class prediction cases respectively) and that the objective w.r.t. the advice is a log-likelihood, the functional gradient of the modified objective for K-CLN is,

(5) $\frac{\partial \mathcal{LL}_{K\text{-}CLN}}{\partial \psi} = \alpha\big(\mathbb{I}(y_i) - P(y_i)\big) + (1 - \alpha)\big(\mathbb{I}(y_i^{(A)}) - P(y_i)\big)$

where $\alpha \in [0, 1]$ is the trade-off parameter between the effect of the data and the effect of the advice, $\mathbb{I}(y_i)$ and $\mathbb{I}(y_i^{(A)})$ are the indicator functions on the label w.r.t. the data and the advice respectively, and the two bracketed terms are, similarly, the gradients w.r.t. the data and the advice respectively.
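The combined gradient of Equation 5 can be sketched directly (a hypothetical helper; `alpha` is the data/advice trade-off):

```python
import numpy as np

def kcln_gradient(pred_probs, y_data, y_advice, alpha):
    """Combined functional gradient of Equation 5 (sketch).

    alpha = 1 recovers the pure data gradient; alpha = 0 the pure
    advice gradient.
    """
    n = pred_probs.shape[0]
    i_data, i_adv = np.zeros(n), np.zeros(n)
    i_data[y_data] = 1.0
    i_adv[y_advice] = 1.0
    return alpha * (i_data - pred_probs) + (1 - alpha) * (i_adv - pred_probs)
```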
Proof Sketch: The original objective function (w.r.t. data) of CLNs is cross-entropy. For clarity, consider the binary prediction case, where the objective function is a binary cross-entropy of the form

$\mathcal{LL} = -\sum_i \big[\, y_i \log P(y_i) + (1 - y_i) \log\big(1 - P(y_i)\big) \big].$

Ignoring the summation for brevity, for every entity $i$, $\partial \mathcal{LL} / \partial \psi = \mathbb{I}(y_i) - P(y_i)$. The extension to the multi-label prediction case with a sparse categorical cross-entropy is straightforward and is an algebraic manipulation task. Now, from Proposition 1, the loss function w.r.t. the advice is the log-likelihood $\mathcal{L}^{(A)} = \log P\big(y_i^{(A)} \mid h_i^{T}\big)$. Thus the modified objective is expressed as,

(6) $\mathcal{LL}_{K\text{-}CLN} = \alpha\, \mathcal{LL} + (1 - \alpha)\, \mathcal{L}^{(A)}$

where $\alpha$ is the trade-off parameter ($1 - \alpha$ can be implicitly understood). Now we know from Proposition 1 that the distributions $P(y_i)$ and $P\big(y_i^{(A)}\big)$ can be expressed in their functional forms, given that the activation function of the output layer is a softmax, as $P(y_i = \ell \mid h_i^{T}) = \exp\big(\psi(\ell, h_i^{T})\big) / \sum_{\ell'} \exp\big(\psi(\ell', h_i^{T})\big)$. Taking the functional (partial) gradients of the modified objective (Equation 6) w.r.t. $\psi$, followed by some algebraic manipulation, we recover Equation 5.
Hence, it follows from Proposition 2 that the data and the advice balance the training of the K-CLN network parameters $\Theta$ via the trade-off hyperparameter $\alpha$. When the data is noisy (or sparse, with negligible examples for a region of the parameter space), the advice (if correct) induces a bias on the output distribution towards the correct label. Even if the advice is incorrect, the network still tries to learn the correct distribution to some extent from the data (if the data is not noisy). The relative contributions of the data and the advice depend primarily on $\alpha$. If both the data and the human advice are suboptimal (noisy), the correct label distribution is not learnable.

3.3 The Algorithm
Algorithm 1 outlines the key steps involved in our approach. KCLN() is the main procedure [lines 1-14] that trains a Column Network using both the data (the knowledge graph $G$) and the human advice (the set of preference rules $\mathbf{P}$). It returns a K-CLN with network parameters $\Theta$, which are initialized to arbitrary values [line 3]. As described earlier, the network parameters of K-CLN (as with CLN) are manipulated (stored and updated) via tensor algebra, with appropriate indexing for entities and relations. Also recall that our gating functions are piecewise/non-smooth and apply only to the subspace of entities, features and relations where the preference rules are satisfied. Thus, as a preprocessing step, we create tensor masks that compactly encode this subspace with a call to the procedure CreateMask() [line 4], which we elaborate on below. The network is then trained through multiple epochs until convergence [lines 6-12]. At the end of every epoch, the output probabilities and the gradients are computed and stored in a shared data structure [line 11] so that they can subsequently be accessed to compute the advice gates [lines 7-8]. Our network is trained largely like the original CLN, with two key modifications [line 9], namely:
Equation 4 is used as the modified expression for the hidden units

The data trade-off $\alpha$ is multiplied with the original loss, while its counterpart, the advice trade-off $(1 - \alpha)$, is used to compute the gates [line 8]
Procedure CreateMask() [lines 15-27] constructs the tensor masks over the space of entities, features and relations/contexts that are required to compute the gates (as seen in line 8). The data (the ground knowledge graph $G$) and the set of preference rules $\mathbf{P}$ are provided as inputs. There are three key components of the advice mask:

The entity mask $\mathbf{M}_E$ (a tensor of dimensions #entities × feature-vector length), which indicates which entities, and which of their features, are affected by the advice/preference

The context mask $\mathbf{M}_C$ (#entities × #entities), which indicates the contexts that are affected (relations are directed, so this matrix is asymmetric)

The label mask $\mathbf{M}_L$, which indicates the preferred label of the affected entities, in one-hot encoding
All the components are initialized to zeros. The masks are then computed iteratively for every preference rule [lines 19-25]. This includes satisfiability checking for a given preference rule [line 20], which is achieved via subgraph matching on the knowledge graph, since a preference rule (a Horn clause; Definition 1) can be viewed as a subgraph template. For more details, we refer the reader to the work on employing hypergraph/graph databases for counting instances of Horn clauses richards95; dasSDM2016; DasAAAI19. For all entities, relevant features and relations/contexts that satisfy a rule, the corresponding elements of the tensor masks are set to 1 [lines 21-23]. The components $\mathbf{M}_E$ and $\mathbf{M}_C$ are used in the gate computation in the KCLN procedure, and $\mathbf{M}_L$ is used for the indicator in the advice gradient.
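A compact sketch of the three masks follows. The function signature and the `satisfied` tuple format are our own simplification; in the actual system, satisfiability is checked via subgraph matching on the knowledge graph:

```python
import numpy as np

def create_masks(n_entities, n_features, n_labels, satisfied):
    """Build the entity, context and label masks (sketch of CreateMask).

    satisfied : list of (entity, feature_ids, context_entity_ids, label)
                tuples describing where each preference rule fires.
    """
    entity_mask = np.zeros((n_entities, n_features))
    context_mask = np.zeros((n_entities, n_entities))  # directed, asymmetric
    label_mask = np.zeros((n_entities, n_labels))      # one-hot preferred label
    for i, feats, ctxs, label in satisfied:
        entity_mask[i, feats] = 1.0
        context_mask[i, ctxs] = 1.0
        label_mask[i, label] = 1.0
    return entity_mask, context_mask, label_mask
```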
Having presented the formulation and the learning of K-CLNs, we now turn our attention to the empirical evaluation of the proposed work.
4 Experiments
We investigate the following questions as part of our experiments:

(Q1) Can K-CLNs learn effectively with noisy, sparse samples, i.e., performance?

(Q2) Can K-CLNs learn efficiently with noisy, sparse samples, i.e., speed of learning?

(Q3) How does the quality of advice affect the performance of K-CLN, i.e., reliance on robust advice?
We compare against the original Column Network architecture with no advice (Vanilla CLN denotes the original architecture of pham2017column) as a baseline. Our intention is to show how advice/knowledge can guide model learning towards better predictive performance and efficiency in the context of collective classification using Column Networks. Also, we have discussed in detail earlier how our problem setting is distinct from most existing noise-robust deep learning approaches; hence, we restrict our comparisons to the original work.
4.1 Experimental Setup
System: K-CLN has been developed by extending the original CLN architecture, which uses Keras as the functional deep learning API with a Theano backend for tensor manipulation. We extend this system to include: (1) advice-gradient feedback at the end of every epoch, (2) modified hidden-layer computations and (3) a preprocessing wrapper to parse the advice/preference rules and create the appropriate tensor masks. Since it is not straightforward to access final-layer output probabilities from inside a hidden layer in Keras, we use Callbacks to write/update the predicted probabilities in a shared data structure at the end of every epoch. This data structure is then fed, via inputs, to the hidden layers. Each mini-column with respect to an entity is a dense network of hidden layers (similar to the most effective settings outlined in pham2017column). The preprocessing wrapper acts as an interface between the advice, encoded in a symbolic language (Horn clauses), and the tensor-based computation architecture. The advice masks encode $\mathbf{E(P)}$, i.e., the set of entities and contexts where the gates are applicable (Algorithm 1).
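The end-of-epoch feedback loop can be illustrated framework-agnostically. The following is a plain-Python stand-in for the Keras Callback described above; the class name and the shared-dictionary convention are our own:

```python
import numpy as np

class AdviceFeedback:
    """End-of-epoch hook that refreshes the advice gradients (sketch).

    predict_fn : callable returning (n_entities, n_labels) output
                 probabilities of the current model
    label_mask : one-hot preferred labels for advised entities,
                 all-zero rows elsewhere (from CreateMask)
    shared     : dict read by the gated hidden layers next epoch
    """
    def __init__(self, predict_fn, label_mask, shared):
        self.predict_fn = predict_fn
        self.label_mask = label_mask
        self.shared = shared

    def on_epoch_end(self, epoch):
        probs = self.predict_fn()
        advised = self.label_mask.sum(axis=1) > 0
        grads = np.zeros_like(probs)
        # Proposition 1: I(y^A) - P(y), only for advised entities
        grads[advised] = self.label_mask[advised] - probs[advised]
        self.shared["advice_grads"] = grads
```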
Domains: We evaluate our approach on four relational domains: Pubmed Diabetes and Corporate Messages, which are multi-class classification problems, and Internet Social Debates and Social Network Disaster Relevance, which are binary. Pubmed Diabetes (https://linqs.soe.ucsc.edu/data) is a citation network for predicting whether a peer-reviewed article is about Diabetes Type 1, Type 2 or neither, using textual features (TF-IDF vectors) from PubMed abstracts as well as the citation relationships between articles. It comprises articles, considered as entities, with bag-of-words textual features (TF-IDF weighted word vectors) and citation relationships among one another. Internet Social Debates (http://nldslab.soe.ucsc.edu/iac/v2/) is a data set for predicting stance (‘for’/‘against’) about a debate topic from online posts on social debates. It contains posts (entities), characterized by TF-IDF vectors extracted from the text and header, and relations of 2 types, ‘sameAuthor’ and ‘sameThread’. Corporate Messages (https://www.figureeight.com/dataforeveryone/) is an intention-prediction data set of flier messages sent by corporate groups in the finance domain, with sameSourceGroup relations. The target is to predict the intention of the message (Information, Action or Dialogue). Finally, Social Network Disaster Relevance (same source) is a relevance-prediction data set of 8000 Twitter posts, curated and crowd-annotated with relevance scores. Along with bag-of-words features, we use confidence-score features and relations among tweets (of types ‘same author’ and ‘same location’). Table 1 outlines the important aspects of the 4 domains (data sets) used in our experimental evaluation. As indicated earlier, inspired by the original CLNs, we evaluate our approach on both binary and multi-class prediction problems.
Domain/Data set  #Entities  #Relations  #Features  Target type 

Pubmed Diabetes  Multiclass  
Corporate Messages  Multiclass  
Online Social Debates  Binary  
Disaster Relevance  Binary 
Metrics: Following pham2017column, we report macro-F1 and micro-F1 scores for the multi-class problems, and F1 scores and AUC-PR for the binary ones. Macro-F1 computes the F1 score independently for each class and takes the average, whereas micro-F1 aggregates the contributions of all classes to compute the average F1 score. For all experiments, the number of hidden layers and the number of hidden units per column in each layer are kept fixed. All results are averaged over 5 runs. Other settings are consistent with the original CLN.
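For reference, the two averaging schemes differ as follows (a self-contained sketch; standard libraries such as scikit-learn provide equivalent functionality):

```python
import numpy as np

def f1_scores(y_true, y_pred, n_labels):
    """Per-class F1 averaged two ways: returns (macro, micro)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    f1s, tp_all, fp_all, fn_all = [], 0, 0, 0
    for c in range(n_labels):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        f1s.append(2 * tp / (2 * tp + fp + fn) if tp else 0.0)
        tp_all, fp_all, fn_all = tp_all + tp, fp_all + fp, fn_all + fn
    macro = float(np.mean(f1s))                          # mean of per-class F1
    micro = 2 * tp_all / (2 * tp_all + fp_all + fn_all)  # pooled counts
    return macro, micro
```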
Human Advice: K-CLN is designed to handle arbitrarily complex expert advice, provided it is encoded as preference rules. However, even with relatively simple preference rules, K-CLN is more effective in sparse samples. For instance, in Pubmed, the longest among the 4 preference rules used is . Note how even a simple rule, indicating that an article citing another article discussing obesity is likely to be about Type 2 diabetes, proved effective. Expert knowledge from real physicians could thus prove even more effective. In Disaster Relevance we used rules that did not require much domain expertise, such as: if a tweet is by a user who usually posts non-disaster tweets, then the tweet is likely to be a non-disaster one. Sub-optimal advice may point the Advice Gradient in a wrong direction. However, our soft gates do not alter the loss; instead, they promote/demote the contribution of nodes/contexts. Similar to Patrini et al. patrini2017making, the data still balances the effect of advice during training.
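The promote/demote behavior of a soft gate can be sketched as a multiplicative factor on an entity's hidden contribution. The exponential form and the parameter name `alpha` below are illustrative assumptions, not the exact K-CLN gate; the point is that the gate rescales activations for entities matched by a preference rule while leaving the loss function itself unchanged.

```python
import math

def advice_gate(h, preference, alpha=0.3):
    """Scale an entity's hidden contribution up or down based on advice.

    h          -- list of hidden activations for one entity
    preference -- +1 to promote (rule prefers this label),
                  -1 to demote, 0 if no rule fires
    alpha      -- gate strength (hypothetical parameterization)

    Gate = exp(alpha * preference): identity when no rule fires,
    >1 when promoting, <1 when demoting.
    """
    gate = math.exp(alpha * preference)
    return [x * gate for x in h]
```

When `preference` is 0 the gate is exactly 1, so entities untouched by advice train exactly as in the vanilla CLN, which mirrors the claim above that the gates reshape contributions rather than the loss.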
4.2 Experimental Results
Recall that our goal is to demonstrate the efficiency and effectiveness of K-CLNs with smaller sets of training examples. Hence, we present the aforementioned metrics with varying sample sizes and varying epochs, and compare our model against the vanilla CLN. We split the data sets into a training set and a hold-out test set with a 60%/40% ratio. For the varying-epoch experiments we learn on only 40% of the training split (i.e., 24% of the complete data) and test on the hold-out test set. Figures 3(a) and 3(b) illustrate the micro-F1 and macro-F1 scores for the PubMed Diabetes data, and Figures 6(a) and 6(b) show the F1 score and AUC-PR for the social network disaster relevance data. As the figures show, although both K-CLN and vanilla CLN converge to the same predictive performance, K-CLN converges significantly faster (fewer epochs). For corporate messages and internet social debates, K-CLN not only converges faster but also achieves better predictive performance than vanilla CLN, as shown in Figures 4(a) and 4(b) and Figures 5(a) and 5(b). The results show that K-CLNs learn more efficiently with noisy sparse samples, thereby answering (Q1) affirmatively.
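For concreteness, the split protocol described above can be rendered as a short index-partitioning sketch (seeding and shuffling are assumptions; only the 60/40 split and the 40%-of-training subset are stated in the text).

```python
import random

def epoch_study_split(n, seed=0):
    """Index split for the varying-epoch experiments: 60/40
    train/test split, then 40% of the training indices
    (i.e., 24% of all data) used for training."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)  # assumed: shuffle before splitting
    cut = int(0.6 * n)
    train, test = idx[:cut], idx[cut:]
    epoch_train = train[: int(0.4 * len(train))]
    return epoch_train, test
```

With 100 examples this yields 24 training and 40 test indices, matching the 24%/40% proportions quoted above.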
The effectiveness of K-CLN is illustrated by its performance across varying training-set sample sizes, especially small ones. The intuition is that domain knowledge should help guide the model to learn better when the amount of available training data is small. K-CLN is trained on gradually increasing sample sizes, from 5% of the training data (3% of the complete data) to 80% of the training data (48% of the complete data), and tested on the hold-out test set. Figures 3(c) and 3(d) present the micro-F1 and macro-F1 scores for PubMed Diabetes, and Figures 6(c) and 6(d) plot the F1 score and AUC-PR for social network disaster relevance. K-CLN outperforms vanilla CLN across all sample sizes on both metrics, which suggests that the advice remains relevant throughout the training phase across sample sizes. For corporate messages, K-CLN outperforms at small sample sizes on the micro-F1 metric (Figure 4(c)), gradually converging to similar predictive performance at larger sample sizes. Macro-F1 (Figure 4(d)), however, shows similar performance for both models across all sample sizes, although K-CLN does perform better with very small samples. Since this is a multi-class classification problem, similar macro-F1 performance suggests that the advice does not apply to some classes during learning while applying well to others, thereby averaging out in the final result. For internet social debate stance prediction, Figures 5(c) and 5(d) present the F1 score and AUC-PR respectively. K-CLN outperforms vanilla CLN on both metrics, and thus we can answer (Q2) affirmatively: K-CLNs learn effectively with noisy sparse samples.
An obvious question is: how robust is our learning system to noisy/incorrect advice? Conversely, how does the choice of the trade-off parameter affect the quality of the learned model? To answer these questions, we performed an additional experiment on the Internet Social Debates domain by augmenting the learner with incorrect advice, created by changing the preferred labels of the advice rules to incorrect values (based on our understanding). Also, recall that the contribution of advice depends on the trade-off parameter, which controls the robustness of K-CLN to advice quality. Consequently, we experimented with different values of the parameter across varying sample sizes.
Figure 7 shows how performance deteriorates at higher parameter values due to the effect of the noisy advice. One setting is not plotted since its performance is the same as the no-advice/vanilla CLN. Note that with reasonably low values, performance does not deteriorate much and is, in fact, better on some samples. Thus, with reasonably low trade-off values, K-CLN is robust to the quality of advice (Q3). We picked one domain to present these robustness results but observed similar behavior in all the domains.
4.3 Discussion
It is difficult to quantify the correctness or quality of human advice unless absolute ground truth is accessible in some manner. We evaluate on sparse samples of real data sets with no gold-standard labels available. Hence, to the best of our abilities, we have provided the most relevant/useful advice in the experiments aimed at answering (Q1) and (Q2), as indicated in the experimental setup. We emulate noisy advice (for Q3) by flipping/altering the preferred labels of the advice rules in the original set of preferences.
We have shown theoretically, in Propositions 1 and 2, that the robustness of K-CLN depends on the advice trade-off parameter, and we illustrated how it can control the contribution of the data versus the advice towards effective training. We postulate that even in the presence of noisy advice, the data (if not itself noisy) still contributes towards effective learning, with the complementary weight. Of course, if both the data and the advice are noisy, the concept is not learnable. Note that this is the case with any learning algorithm: when both the knowledge/hypothesis space and the data are incorrect, an incorrect hypothesis can result.
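The data-versus-advice balance discussed above can be pictured as a convex combination of two gradient signals. This is a schematic rendering of the trade-off, not the exact K-CLN update rule, and the parameter name `alpha` is an assumption (the symbol did not survive extraction): the data signal carries the complementary weight, so with noise-free data and noisy advice a low `alpha` lets the data dominate.

```python
def combined_update(w, data_grad, advice_grad, alpha, lr=0.1):
    """One gradient step balancing data and advice signals.

    w           -- current parameter vector (list of floats)
    data_grad   -- gradient of the data loss
    advice_grad -- Advice Gradient (hypothetical name per the text)
    alpha       -- trade-off in [0, 1]: weight on the advice signal;
                   the data keeps weight (1 - alpha)
    """
    return [wi - lr * ((1 - alpha) * g + alpha * a)
            for wi, g, a in zip(w, data_grad, advice_grad)]
```

At `alpha = 0` the update reduces to the vanilla data-only step (matching the unplotted no-advice setting in Figure 7), while `alpha = 1` follows the advice alone.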
The experiments w.r.t. Q3 (Figure 7) empirically support our theoretical analysis. We found that at sufficiently low trade-off values, K-CLN performs well even with noisy advice. In the earlier experiments, where we use potentially good advice, we report the results with a higher trade-off value, since the advice gradient is piecewise (it affects only a subset of entities/relations); it is therefore reasonable to assign a higher weight to the advice and to the contribution of the entities and relations/contexts affected by it, given that the advice is noise-free. Also, note that the drop in performance at very low sample sizes (Figure 7) highlights how challenging learning is in the noisy-data and noisy-advice scenario. This aligns with our general understanding of most human-in-the-loop/advice-based approaches in AI. Trading off data and advice via a weighted combination of the two is a well-studied solution in the related literature OdomNatarajan18, and hence we adopt the same in our context. Tracking the expertise of humans to infer advice quality is an interesting future research direction.
5 Conclusion
We considered the problem of providing guidance to CLNs. Specifically, motivated by treating domain experts as true domain experts and not CLN experts, we developed a formulation based on preferences, which allowed for the natural specification of guidance. We derived the gradients based on advice and outlined their integration with the original CLN formulation. Our experimental results across different domains clearly demonstrate the effectiveness and efficiency of the approach, specifically in knowledge-rich, data-scarce problems. Exploring other types of advice, including feature importances, qualitative constraints, and privileged information, is a potential future direction. Scaling our approach to web-scale data is a natural extension. Finally, extending the idea to other deep models remains an interesting direction for future research.