The re-emergence of Deep Learning DeepLearningBook2016 has demonstrated significant success in difficult real-world domains such as image krizhevsky2012imagenet, audio audio and video processing videoCVPR. Deep Learning is increasingly being applied to structured domains, where the data is represented using richer symbolic or graph features to capture relational structure between entities and attributes in the domain. Such models are able to capture increasingly complex interactions between features with deeper layers. However, the combinatorial complexity of reasoning over a large number of relations and objects remains a significant bottleneck.
While recent work in relational deep learning seeks to address the problem of faithful modeling of relational structure KazemiPoole18-RelNNs; SourekEtAl-15-LRNNs; KaurEtAl18-RRBM, we focus on Column Networks (CLNs) pham2017column, which are deep architectures composed of several (feedforward) mini-columns, each of which represents an entity in the domain. Relationships between two entities are modeled through edges between mini-columns. The true power of column networks comes from the natural modeling of long-range inter-entity interactions with progressively deeper layers; CLNs have been successfully applied to collective classification tasks. However, they rely on large amounts of data and incorporate little to no knowledge about the problem domain. While this may suffice for low-level applications such as image/video processing, it is a concern in relational domains consisting of rich, semantic information.
Biasing learners is necessary to allow them to make the inductive leap from training instances to true generalization over new instances Mitchell80. While deep learning does incorporate one such bias in the form of domain knowledge (for example, through parameter tying or convolution, which exploits neighborhood information), we are motivated to develop systems that can incorporate richer and more general forms of domain knowledge. This is especially germane for deep relational models, as they inherently construct and reason over richer representations.
One way in which a human can guide learning is by providing rules over training examples and features shavlik89ebnn; towell1994knowledge; fung2003knowledge; kunapuli2010online. Another way that has been studied extensively is expressing preferences within the preference-elicitation framework BoutilierEtAl06. We are inspired by this form of advice, as it has been successful within the contexts of inverse reinforcement learning KunapuliEtAl13; odomaaai15 and planning DasEtAl18.
Our motivation is to develop a framework that allows a human to guide deep learning by incorporating rules and constraints that define the domain and its aspects. Incorporating prior knowledge into deep learning has recently begun to receive attention DingEtAl18. However, in many such approaches the guidance comes not from a human but from a pre-processing algorithm that generates it. Our framework is more general, in that a domain expert provides guidance during learning. We exploit the rich representational power of relational methods to capture, represent and incorporate such rules into relational deep learning models.
We make the following contributions: (1) we propose the formalism of Knowledge-augmented Column Networks (K-CLN), (2) we present an approach to inject generalized domain knowledge into a CLN and develop the learning strategy that exploits this knowledge, and (3) we demonstrate the effectiveness and efficiency of injecting domain knowledge across two real problems, including settings where CLNs have been previously employed. Specifically, our results across the domains clearly show statistically superior performance with small amounts of data. As far as we are aware, this is the first work on human-guided CLNs.
2 Knowledge-augmented Column Networks
Column Networks pham2017column
allow for encoding interactions/relations between entities as well as the attributes of such entities in a principled manner, without explicit relational feature construction or vector embedding. This is important when dealing with structured domains, especially in the case of collective classification. It enables us to seamlessly transform a multi-relational knowledge graph into a deep architecture, making CLNs one of the more robust relational deep models. Figure 1 illustrates an example column network w.r.t. the knowledge graph on the left. Note how each entity forms its own column and relations are captured via the sparse inter-column connectors.
Consider a graph $\mathcal{G} = (V, A)$, where $V$ is the set of vertices/entities. $A$ is the set of arcs/edges between two entities $i$ and $j$, denoted as $r(i, j)$. Also, $\mathcal{G}$ is multi-relational, i.e., $r \in \mathcal{R}$, where $\mathcal{R}$ is the set of relation types in the domain. To obtain the equivalent Column Network $\mathcal{C}$ from $\mathcal{G}$, let $\mathbf{x}_i$ be the feature vector representing the attributes of an entity $e_i$ and $y_i$ its label predicted by the model ($i$ uniquely indexes an entity; we use $i$ and $e_i$ interchangeably). $\mathbf{h}_i^t$ denotes a hidden node w.r.t. entity $e_i$ at hidden layer $t$ ($t = 1, \ldots, T$ is the index of the hidden layers). As mentioned earlier, the context between two consecutive layers captures the dependency of the immediate neighborhood (based on arcs/edges/inter-column connectors). For entity $e_i$, the context w.r.t. relation $r$ and the hidden nodes are computed as,
$$\mathbf{c}_i^{t,r} = \frac{1}{|\mathcal{N}_r(i)|} \sum_{j \in \mathcal{N}_r(i)} \mathbf{h}_j^{t-1},$$
where $\mathcal{N}_r(i)$ are all the neighbors of $e_i$ w.r.t. relation $r$ in the knowledge graph $\mathcal{G}$. Note the absence of context connectors between certain hidden nodes (Figure 1, right) when no relation exists between the corresponding entities (Figure 1, left). The activation of the hidden nodes is computed as the sum of the bias, the weighted output of the previous hidden layer and the weighted contexts,
$$\mathbf{h}_i^t = g\Big(\mathbf{b}^t + W^t \mathbf{h}_i^{t-1} + \frac{1}{z} \sum_{r \in \mathcal{R}} V_r^t \mathbf{c}_i^{t,r}\Big),$$
where $W^t$ and $V_r^t$ are weight parameters and $\mathbf{b}^t$ is a bias, for some activation function $g$. $z$ is a pre-defined constant that keeps the parameterized contexts from growing too large for complex relations; setting $z$ to the average number of neighbors of an entity is a reasonable choice. The final output layer is a softmax over the last hidden layer,
$$P(y_i = \ell \mid \mathbf{h}_i^T) = \mathrm{softmax}\big(\mathbf{b}_\ell + W_\ell \mathbf{h}_i^T\big),$$
where $\ell \in L$ is the label ($L$ is the set of labels) and $T$ is the index of the last hidden layer.
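As a concrete illustration, the per-layer computation above can be sketched in plain NumPy. This is a minimal sketch with a dictionary-of-neighbors encoding of our choosing, not the authors' Keras/Theano implementation:

```python
import numpy as np

def cln_layer(h_prev, neighbors, W, V, b, z, g=np.tanh):
    """One Column Network layer: each entity i gets
    h_i^t = g(b + W h_i^{t-1} + (1/z) * sum_r V_r c_i^{t,r}),
    where the context c_i^{t,r} averages the previous-layer hidden
    states of i's neighbors under relation type r."""
    n_entities = h_prev.shape[0]
    h_next = np.zeros((n_entities, W.shape[0]))
    for i in range(n_entities):
        ctx = np.zeros(W.shape[0])
        for r, nbrs in neighbors[i].items():  # relation type -> neighbor ids
            if nbrs:
                c = h_prev[nbrs].mean(axis=0)  # context for relation r
                ctx += V[r] @ c
        h_next[i] = g(b + W @ h_prev[i] + ctx / z)
    return h_next

def softmax(s):
    """Numerically stable softmax for the final output layer."""
    e = np.exp(s - s.max())
    return e / e.sum()
```

Entities with no neighbors under a relation simply contribute no context term, matching the sparse inter-column connectors in Figure 1.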
While CLNs are relation-aware deep models that can faithfully represent and learn from structured data, they are not without limitations, especially the challenge of effective learning from sparse samples or under systematic noise. Several approaches jiang2018mentornet; goldberger2016training; miyato2018virtual enable effective learning of deep models in the presence of noise; however, our problem setting differs significantly in: (1) Type of noise: we aim to handle systematic/targeted noise odom2018human, which occurs frequently in the real world due to cognitive bias or sample sparsity. (2) Type of error: systematic noise leads to generalization errors. (3) Structured data: K-CLN works in the context of structured data (entities/relations); structured/relational data, though crucial, is inherently sparse (most relations are false in the real world). Given: a sparse multi-relational graph $\mathcal{G}$, attributes $\mathbf{x}_i$ of each entity (sparse or noisy) in $\mathcal{G}$, the equivalent Column Network $\mathcal{C}$ and access to a human expert. To Do: more effective and efficient collective classification by knowledge-augmented training of $\mathcal{K}(\theta)$, where $\theta$ is the set of all the network parameters of the Column Network. To this effect, we propose the Knowledge-augmented CoLumn Network (K-CLN), which incorporates human advice into deep models in a principled manner using a gated architecture, where 'advice gates' augment/modify the trained network parameters based on the advice.
2.1 Knowledge Representation
Any model-specific encoding of domain knowledge, such as numeric constraints or modified loss functions, has several limitations: (1) it is counter-intuitive to humans, who are domain experts but not machine learning experts, and (2) the resulting framework is brittle and not generalizable. Consequently, we employ preference rules (akin to IF-THEN statements) to capture human knowledge.
A preference is a modified Horn clause,
$$\wedge_{k}\, \texttt{attr}_k(E_{j_k}) \wedge \cdots \wedge r(E_1, E_2) \Rightarrow \big[\texttt{label}(E_x, \ell_\uparrow);\; \texttt{label}(E_x, \ell_\downarrow)\big],$$
where the $E_j$ are variables over entities, the $\texttt{attr}_k$ are attributes of an entity and $r$ is a relation. $\ell_\uparrow$ and $\ell_\downarrow$ indicate the preferred and non-preferred labels respectively. Quantification is implicitly $\forall$ and hence dropped. We denote the set of preference rules as $\mathcal{P}$.
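One plausible way to hold such a preference rule in code is a small record type. The class and field names below are our own illustration, not the paper's parser format; the example rule is the Pubmed one discussed later (an article citing another that discusses obesity is likely about Type 2 diabetes):

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class PreferenceRule:
    """Hypothetical container for a modified Horn clause
    attr_1(E_1) ^ ... ^ r(E_1, E_2) => label_up ; label_down."""
    attrs: List[Tuple[str, str]]            # (attribute, entity-variable) literals
    relations: List[Tuple[str, str, str]]   # (relation, E1, E2) literals
    target: str                             # entity variable the labels apply to
    preferred: str                          # preferred label (up)
    non_preferred: str                      # non-preferred label (down)

# Quantification is implicitly universal, so variables are just names.
rule = PreferenceRule(
    attrs=[("HasWordObesity", "E1")],
    relations=[("Cites", "E2", "E1")],
    target="E2",
    preferred="Type2",
    non_preferred="Type1",
)
```

Partially-instantiated rules would simply bind some of these variables to concrete entity identifiers.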
2.2 Knowledge Injection
Given that knowledge is provided as partially-instantiated preference rules $\mathcal{P}$, more than one entity may satisfy a preference rule. Also, more than one preference rule may be applicable to a single entity. The key idea is that we aim to consider the error of the trained model w.r.t. both the data and the advice. Consequently, in addition to the "data gradient" as in original CLNs, there is an "advice gradient". This gradient acts as feedback to augment the learned weight parameters (both column and context weights) in the direction of the advice gradient. Not all parameters are augmented: only the parameters w.r.t. the entities and relations (contexts) that satisfy $\mathcal{P}$ should be affected. Let $\mathcal{M}$ be the set of entities and relations that satisfy the set of preference rules $\mathcal{P}$. The hidden nodes (Equation 1) can now be expressed as,
$$\mathbf{h}_i^t = g\Big(\mathbf{b}^t + W^t \mathbf{h}_i^{t-1}\,\Gamma_i + \frac{1}{z} \sum_{r \in \mathcal{R}} V_r^t \mathbf{c}_i^{t,r}\,\Gamma_i^r\Big), \quad i \in \mathcal{M},$$
where $\Gamma_i$ and $\Gamma_i^r$ are advice-based soft gates with respect to a hidden node and its context respectively, $f$ is some gating function, $\nabla_i^{adv}$ is the "advice gradient" and $\lambda$ is the trade-off parameter explained later. The key aspect of soft gates is that they attempt to enhance or decrease the contribution of particular edges in the column network, aligned with the direction of the "advice gradient". We choose the gating function to be exponential, $\Gamma_i = \exp(\lambda \cdot \nabla_i^{adv})$. The intuition is that soft gates are natural, as they are multiplicative: a positive gradient increases the value/contribution of the respective term, while a negative gradient pushes it down. We now present the "advice gradient" (the gradient with respect to preferred labels).
Under the assumption that the loss function with respect to advice / preferred labels is a log-likelihood of the form $\mathcal{L}^{adv} = \log P(y_i^{adv} \mid \mathbf{h}_i^T)$, the advice gradient is
$$\nabla_i^{adv} = \mathbb{I}(y_i^{adv}) - P(y_i),$$
where $y_i^{adv}$ is the preferred label of entity $e_i$ and $\mathbb{I}(y_i^{adv})$ is an indicator function over the preferred label. For binary classification the indicator is inconsequential, but for multi-class scenarios it is essential ($\mathbb{I} = 1$ for the preferred label and $0$ for the others).
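A minimal numeric sketch of the advice gradient (indicator of the preferred label minus the current predicted distribution) and the exponential soft gate; the function names are ours:

```python
import numpy as np

def advice_gradient(probs, preferred_idx):
    """Advice gradient: indicator of the preferred label minus the
    model's current output distribution P(y)."""
    grad = -np.asarray(probs, dtype=float)
    grad[preferred_idx] += 1.0
    return grad

def soft_gate(grad, lam):
    """Exponential gate exp(lam * grad): values > 1 amplify a term's
    contribution, values < 1 demote it."""
    return np.exp(lam * grad)
```

For example, with predictions `[0.2, 0.5, 0.3]` and preferred class 0, the gradient is positive only on the preferred class, so the gate amplifies exactly that component.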
An entity can satisfy multiple advice rules, so we take the most preferred label. In case of conflicting advice (i.e., different labels are equally preferred), we set the advice label to the label given by the data, $y_i^{adv} = y_i^{data}$.
Given that the loss function of the original CLN is cross-entropy (binary or sparse-categorical for the binary and multi-class prediction cases respectively) and the objective with respect to advice is log-likelihood, the functional gradient of the modified objective for K-CLN is,
$$\nabla_i = (1-\lambda)\big(\mathbb{I}(y_i^{data}) - P(y_i)\big) + \lambda\big(\mathbb{I}(y_i^{adv}) - P(y_i)\big),$$
where $\lambda$ is the trade-off parameter between the effect of data and the effect of advice, $\mathbb{I}(y_i^{data})$ and $\mathbb{I}(y_i^{adv})$ are the indicator functions on the label w.r.t. the data and the advice respectively, and the two terms are, similarly, the gradients w.r.t. data and advice.
Hence, it follows from Proposition 2 that the data and the advice balance the training of the K-CLN network parameters via the trade-off hyperparameter $\lambda$. When the data is noisy (or sparse, with negligible examples for a region of the parameter space), the advice (if correct) induces a bias on the output distribution towards the correct label. Even if the advice is incorrect, the network still tries to learn the correct distribution to some extent from the data (if not noisy). The relative contribution of the data versus the advice depends primarily on $\lambda$. If both the data and the human advice are sub-optimal (noisy), the correct label distribution is not learnable at all. We exclude the formal proofs due to space limitations.
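The balance described above can be sketched numerically. We assume the convention that $\lambda$ weights the advice term and $1-\lambda$ the data term, so $\lambda = 0$ recovers the pure data gradient and $\lambda = 1$ trusts the advice alone; the function name is ours:

```python
import numpy as np

def kcln_gradient(probs, data_idx, advice_idx, lam):
    """(1 - lam) * (I(y_data) - P) + lam * (I(y_adv) - P):
    a convex combination of the data and advice gradients."""
    probs = np.asarray(probs, dtype=float)
    data_g = -probs.copy()
    data_g[data_idx] += 1.0       # data gradient: I(y_data) - P
    adv_g = -probs.copy()
    adv_g[advice_idx] += 1.0      # advice gradient: I(y_adv) - P
    return (1 - lam) * data_g + lam * adv_g
```

When the data label and advice label agree, the two terms reinforce each other regardless of $\lambda$; when they disagree, $\lambda$ decides which label the combined gradient pushes towards.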
2.3 The Algorithm
Algorithm 1 outlines the key steps involved in our approach. It trains a Column Network using both the data (the knowledge graph $\mathcal{G}$) and the human advice (the set of preference rules $\mathcal{P}$). It returns a K-CLN $\mathcal{K}(\theta)$, where $\theta$ are the network parameters. As described earlier, the network parameters of K-CLN (same as CLN) are manipulated (stored and updated) via tensor algebra with appropriate indexing for entities and relations. Also recall that our gating functions are piece-wise/non-smooth and apply only to the subspace of entities, features and relations where the preference rules are satisfied. Thus, as a pre-processing step, we create tensor masks that compactly encode this subspace with a call to the procedure CreateMask(), explained later. At the end of every epoch the output probabilities and the gradients are computed and stored in a shared data structure [line: 10] for computing the advice gates in the next epoch. The rest of the training strategy is similar to the original CLN, except for the modified hidden units (Equation 2.2) [line: 8] and the data/advice trade-off parameter $\lambda$.
Procedure CreateMask() constructs the advice tensor mask(s) over the space of entities, features and relations/contexts, based on the advice rules, that are required to compute the gates. The main components are: (1) the entity mask (an entity × feature tensor), which indicates the entity and feature indexes affected by the preferences; (2) the context mask (an entity × context tensor), which indicates the affected contexts/relations; and (3) the label mask, which stores the preferred label of the affected entities in one-hot encoding. Advice mask computation requires efficient satisfiability checking of each preference rule against the knowledge graph. We solve this via the efficient subgraph matching proposed by Das et al. DasAAAI19. The masks are binary, with 1 encoding true and 0 encoding false.
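A simplified sketch of what CreateMask() might produce; the tensor shapes and the `matches` format (one tuple per satisfied rule grounding, as returned by subgraph matching) are illustrative assumptions, not the paper's exact data layout:

```python
import numpy as np

def create_masks(n_entities, n_features, n_contexts, n_labels, matches):
    """Build binary advice masks (1 = true, 0 = false) from
    (entity, feature-ids, context-ids, preferred-label) matches."""
    entity_mask = np.zeros((n_entities, n_features), dtype=np.int8)
    context_mask = np.zeros((n_entities, n_contexts), dtype=np.int8)
    label_mask = np.zeros((n_entities, n_labels), dtype=np.int8)  # one-hot
    for ent, feats, ctxs, label in matches:
        entity_mask[ent, feats] = 1    # affected entity/feature indexes
        context_mask[ent, ctxs] = 1    # affected contexts/relations
        label_mask[ent, label] = 1     # preferred label, one-hot
    return entity_mask, context_mask, label_mask
```

Because the masks are binary and sparse, applying the gates reduces to element-wise tensor operations restricted to the advice subspace $\mathcal{M}$.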
3 Experiments
We investigate the following questions as part of our experiments: [Q1] Can K-CLNs learn effectively with noisy sparse samples, i.e., performance? [Q2] Can K-CLNs learn efficiently with noisy sparse samples, i.e., speed of learning? [Q3] How does the quality of advice affect the performance of K-CLN, i.e., reliance on robust advice? We compare against the original Column Network architecture with no advice (Vanilla CLN, i.e., the original architecture pham2017column) as a baseline. We show how advice/knowledge can guide model learning towards better predictive performance and efficiency, in the context of collective classification using Column Networks.
3.1 Experimental Setup
System: K-CLN has been developed by extending the original CLN architecture, which uses Keras as the functional deep learning API with a Theano backend for tensor manipulation. We extend this system to include: (1) advice-gradient feedback at the end of every epoch, (2) modified hidden-layer computations and (3) a pre-processing wrapper that parses the advice/preference rules and creates the appropriate tensor masks. Since it is not straightforward to access final-layer output probabilities from inside any hidden layer in Keras, we use Callbacks to write/update the predicted probabilities in a shared data structure at the end of every epoch. The rest of the architecture follows the original CLN. The advice masks encode $\mathcal{M}$, i.e., the set of entities and contexts where the gates are applicable.
Domains: We evaluate our approach on two relational domains: Pubmed Diabetes, a multi-class classification problem, and Internet Social Debates, a binary classification problem. Pubmed Diabetes (https://linqs.soe.ucsc.edu/data) is a citation network for predicting whether a peer-reviewed article is about Diabetes Type 1, Type 2 or none, using textual features from pubmed abstracts. It comprises articles, considered as entities, with bag-of-words textual features (TF-IDF weighted word vectors) and citation relationships among each other. Internet Social Debates (http://nldslab.soe.ucsc.edu/iac/v2/) is a data set for predicting stance ('for'/'against') about a debate topic from online posts on social debates. It contains posts (entities) characterized by TF-IDF vectors, extracted from the text and header, and two types of relations, 'sameAuthor' and 'sameThread'.
Metrics: Following pham2017column, we report the micro-F1 score, which aggregates the contributions of all classes to compute the average F1 score, for the multi-class problem, and AUC-PR for the binary one. All results are averaged over 5 runs, and our settings (including the number of hidden layers and hidden units per column in each layer) are consistent with the original CLN.
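For reference, micro-F1 pools true positives, false positives and false negatives across all classes before computing F1 (so for single-label multi-class prediction it coincides with accuracy). A small self-contained sketch:

```python
def micro_f1(y_true, y_pred):
    """Micro-averaged F1: pool per-class TP/FP/FN counts, then
    compute precision, recall and their harmonic mean."""
    labels = set(y_true) | set(y_pred)
    tp = fp = fn = 0
    for c in labels:
        tp += sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp += sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn += sum(t == c and p != c for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

In practice one would use a library implementation (e.g. scikit-learn's `f1_score` with `average='micro'`); the sketch only makes the aggregation explicit.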
Human Advice: K-CLN can handle arbitrarily complex advice (encoded as preference rules). However, even with relatively simple rules K-CLN is effective on sparse samples. For instance, in Pubmed, the longest preference rule used encodes that an article citing another article that discusses obesity is likely about Type 2 diabetes. Expert knowledge from real physicians could thus prove even more effective. Note that sub-optimal advice may push the Advice Gradient in the wrong direction. However, since the data balances the effect of advice during training, as shown by patrini2017making, our soft gates do not alter the loss but instead promote/demote the contribution of nodes/contexts.
3.2 Experimental Results
Our goal is to demonstrate the efficiency and effectiveness of K-CLNs with smaller sets of training examples. Hence, we present the aforementioned metrics with varying sample size and varying epochs, and compare our model against Vanilla CLN. We split the data sets into a training set and a hold-out test set in a 60%-40% ratio. For varying epochs we learn on only 40% of our training set (i.e., 24% of the complete data) and test on the hold-out test set. The left panels of Figure 4 illustrate the micro-F1 scores with varying epochs for the PubMed diabetes and internet social debate data sets respectively. K-CLN converges significantly faster (fewer epochs), at times with better predictive performance at convergence, which shows that K-CLNs learn more efficiently with noisy sparse samples, thereby answering (Q2) affirmatively.
The effectiveness of K-CLN is illustrated by its performance with respect to varying sample sizes of the training set, especially in the low-sample ranges. The intuition is that domain knowledge should help the model learn better when the amount of available training data is small. K-CLN is trained on sample sizes varying gradually from 5% of the training data (3% of the complete data) to 80% of the training data (48% of the complete data) and tested on the hold-out test set. The right panels of Figure 4 present the micro-F1 with varying sample sizes for PubMed diabetes and internet social debate respectively. For internet social debate stance prediction, K-CLN outperforms Vanilla CLN at all lower sample sizes. In the case of PubMed, K-CLN outperforms Vanilla CLN for all sample sizes we experimented with, thus answering (Q1) affirmatively: K-CLNs learn effectively with noisy sparse samples.
An obvious question is: how robust is our learning system to noisy/incorrect advice? Conversely, how does the choice of $\lambda$ affect the quality of the learned model? To answer these questions, we performed an additional experiment on the Internet Social Debates domain by augmenting the learner with incorrect advice. The incorrect advice is created by changing the preferred label of the advice rules to incorrect values (based on our understanding). Also, recall that the contribution of advice depends on the trade-off parameter $\lambda$, which controls the robustness of K-CLN to advice quality. Consequently, we experimented with different values of $\lambda$ across varying sample sizes.
Figure 5 shows how performance deteriorates at higher values of $\lambda$ due to the effect of noisy advice ($\lambda = 0$ is not plotted since its performance is the same as no-advice/Vanilla CLN). Note that with reasonably low values of $\lambda$, performance does not deteriorate much and is, in fact, better on some samples. Thus, with reasonably low values of $\lambda$, K-CLN is robust to the quality of advice (Q3). We picked one domain to present these robustness results, but have observed similar behavior in both domains. These experiments empirically support our theoretical analysis (Proposition 2): with a low $\lambda$, K-CLN performs well even with noisy advice. In the earlier experiments, where we use potentially good advice, we report results with a higher $\lambda$, since it is reasonable to assign higher weight to the advice, and to the contribution of the entities and relations/contexts affected by it, when the advice is noise-free. Also, note that the drop in performance towards very low sample sizes (Figure 5) highlights how challenging learning is in the noisy-data and noisy-advice scenario. This aligns with our general understanding of most human-in-the-loop/advice-based approaches in AI. Trading off data and advice via a weighted combination of both is a well-studied solution in the related literature OdomNatarajan18 and, hence, we adopt the same in our context. Tracking the expertise of humans to infer advice quality is an interesting future research direction.
4 Conclusion
We considered the problem of providing guidance for CLNs. Specifically, inspired by treating domain experts as true domain experts and not CLN experts, we developed a formulation based on preferences. This formulation allows for natural specification of guidance. We derived the gradients based on advice and outlined the integration with the original CLN formulation. Our initial evaluation across two domains clearly demonstrates the effectiveness and efficiency of the approach, specifically in knowledge-rich, data-scarce problems. We are also experimenting on a few more domains and the results will be included in the full version of the paper. Exploring other types of advice, including feature importance, qualitative constraints and privileged information, is a potential future direction. Scaling our approach to web-scale data is a natural extension. Finally, extending the idea to other deep models and applying it to more real domains remains an interesting direction for future research.
MD, GK & SN gratefully acknowledge the support of CwC Program Contract W911NF-15-1-0461 with the US Defense Advanced Research Projects Agency (DARPA) and the Army Research Office (ARO). SN also acknowledges the NSF grant IIS-1836565 and AFOSR award FA9550-18-1-0462. DSD acknowledges the National Institute of Health (NIH) grant no. R01 GM097628. Any opinions, findings and conclusions or recommendations are those of the authors and do not necessarily reflect the view of DARPA, ARO, AFOSR or the US government.