Human-Guided Learning of Column Networks: Augmenting Deep Learning with Advice

04/15/2019
by   Mayukh Das, et al.
The University of Texas at Dallas

Recently, deep models have been successfully applied in several applications, especially with low-level representations. However, sparse, noisy samples and structured domains (with multiple objects and interactions) are some of the open challenges in most deep models. Column Networks, a deep architecture, can succinctly capture such domain structure and interactions, but may still be prone to sub-optimal learning from sparse and noisy samples. Inspired by the success of human-advice guided learning in AI, especially in data-scarce domains, we propose Knowledge-augmented Column Networks that leverage human advice/knowledge for better learning with noisy/sparse samples. Our experiments demonstrate that our approach leads to either superior overall performance or faster convergence (i.e., both effective and efficient).



1 Introduction

The re-emergence of Deep Learning DeepLearningBook2016 has found significant and successful applications in difficult real-world domains such as image krizhevsky2012imagenet, audio audio and video processing videoCVPR. Deep Learning has also been increasingly applied to structured domains, where the data is represented using richer symbolic or graph features to capture relational structure between entities and attributes in the domain. Intuitively, deep learning architectures are naturally suited to learning and reasoning over such multi-relational domains as they are able to capture increasingly complex interactions between features with deeper layers. However, the combinatorial complexity of reasoning over a large number of relations and objects has remained a significant bottleneck to overcome.

Recent work in relational deep learning has sought to address this particular issue. This includes relational neural networks KazemiPoole18-RelNNs; SourekEtAl-15-LRNNs, relational Restricted Boltzmann machines KaurEtAl18-RRBM and neuro-symbolic architectures such as C-ILP FrancaEtAl-CILP-14. In our work, we focus on the framework of Column Networks (CLNs) developed by pham2017column. Column networks are composed of several (feedforward) mini-columns, each of which represents an entity in the domain. Relationships between two entities are modeled through edges between mini-columns. These edges allow for the short-range exchange of information over successive layers of the column network; however, the true power of column networks emerges as the depth of interactions increases, which allows for the natural modeling of long-range interactions.

Column networks are an attractive approach for several reasons: (1) hidden layers of a CLN share parameters, which means that making the network deeper does not introduce more parameters, (2) as the depth increases, the CLN can begin to model feature interactions of considerable complexity, which is especially attractive for relational learning, and (3) learning and inference are linear in the size of the network and the number of relations, which makes CLNs highly efficient. However, like other deep learning approaches, CLNs rely on vast amounts of data and incorporate little to no knowledge about the problem domain. While this may not be an issue for low-level applications such as image or video processing, it is a significant issue in relational domains, since the relational structure encodes rich, semantic information. This suggests that ignoring domain knowledge can considerably hinder generalization.

It is well known that biasing learners is necessary in order to allow them to inductively leap from training instances to true generalization over new instances Mitchell80. Indeed, the inductive bias towards “simplicity and generality” leads to network architectures with simplifying assumptions through regularization strategies that aim to control the complexity of the neural/deep network. While deep learning does incorporate one such bias in the form of domain knowledge (for example, through parameter tying or convolution, which exploits neighborhood information), we are motivated to develop systems that can incorporate richer and more general forms of domain knowledge. This is especially germane for deep relational models as they inherently construct and reason over richer representations. Such domain-knowledge-based inductive biases have been applied to a diverse array of machine learning approaches, variously known as advice-based, knowledge-based or human-guided machine learning.

One way in which a human can guide learning is by providing rules over training examples and features. The earliest such approaches combined explanation-based learning (EBL-NN, shavlik89ebnn) or symbolic domain rules with ANNs (KBANN, towell1994knowledge). Domain knowledge as rules over input features can also be incorporated into support vector machines (SVMs, Cortes1995; Scholkopf98; fung2003knowledge; LeSmolaGartner06; kunapuli2010online). Another natural way a human can guide learning is by expressing preferences; this has been studied extensively within the preference-elicitation framework due to Boutilier et al. (BoutilierEtAl06). We are inspired by this form of advice, as it has been successful within the context of inverse reinforcement learning KunapuliEtAl13, imitation learning odomaaai15 and planning DasEtAl18.

These approaches span diverse machine learning formalisms, and they all exhibit the same remarkable behavior: better generalization with fewer training examples, because they effectively exploit and incorporate domain knowledge as an inductive bias. This is the prevailing motivation for our approach: to develop a framework that allows a human to guide deep learning by incorporating rules and constraints that define the domain and its aspects. Incorporating prior knowledge into deep learning has recently begun to receive attention; for instance, recent work injects prior knowledge of color and scene information into deep learning for image classification DingEtAl18. However, in many such approaches the guidance comes not from a human but from a pre-processing algorithm that generates it. Our framework is more general in that a human provides guidance during learning. Furthermore, the human providing the domain advice is not an AI/ML expert but rather a domain expert who provides rules naturally. We exploit the rich representation power of relational methods to capture, represent and incorporate such rules into relational deep learning models.

We make the following contributions: (1) we propose the formalism of Knowledge-augmented Column Networks, (2) we present, inspired by previous work (such as KBANN), an approach to inject generalized domain knowledge into a CLN and develop the learning strategy that exploits this knowledge, and (3) we demonstrate, across four real-world problems, in some of which CLNs have previously been employed, the effectiveness and efficiency of injecting domain knowledge. Specifically, our results across these domains clearly show statistically superior performance with small amounts of data. As far as we are aware, this is the first work on human-guided CLNs.

The rest of the paper is organized as follows. We first review the background necessary for the paper including CLNs. Then we present the formalism of KCLNs and demonstrate with examples how to inject knowledge into CLNs. Next, we present the experimental results across four domains before concluding the paper by outlining areas for future research.

2 Background and Related Work

The idea of several processing layers to learn increasingly complex abstractions of the data was initiated by the perceptron model rosenblatt1958perceptron and was further strengthened by the advent of the back-propagation algorithm lecun1998gradient. A deep architecture was proposed by krizhevsky2012imagenet, and such architectures have since been adapted for different problems across the entire spectrum of domains, such as Atari games via deep reinforcement learning mnih2013playing, sentiment classification glorot2011domain and image super-resolution dong2014learning.

Applying advice to models is a long-explored approach for constructing models that are more robust to noisy data fung2003knowledge; LeSmolaGartner06; towell1994knowledge; kunapuli2010adviceptron; odom2018human. fu1995introduction presents a unified view of different variations of knowledge-based neural networks, namely rule-based, decision-tree-based and semantic-constraint-based neural networks. Rule-based approaches translate symbolic rules into neural architectures, decision-tree-based ones impose bounded regions in the parameter space, and in constraint-based neural networks each node denotes a concept and each edge denotes a relationship between these concepts. Such advice-based learning has been proposed for support vector machines fung2003knowledge; le2006simpler in propositional settings and for probabilistic logic models odom2018human in relational settings. There has also been some work on applying advice to neural networks: towell1994knowledge introduce the KBANN algorithm, which compiles first-order logic rules into a neural network, and kunapuli2010adviceptron present the first work on applying advice, in the form of constraints, to the perceptron. In rule-based neural networks, the data attributes are assigned as input nodes, the target concept(s) as the output nodes and the intermediate concept(s) as the hidden nodes. A decision-tree-based network inherits its structure from the underlying decision tree: each decision node in the tree can be viewed as an input-space hyperplane, and a decision region bounded by these hyperplanes can be viewed as a leaf node. The knowledge-based neural network framework has been applied successfully to various real-world problems such as recognizing genes in DNA sequences noordewier1991training, microwave design wang1997knowledge, robotic control handelman1990integrating and, recently, personalised learning systems melesko2018semantic. Combining relational (symbolic) and deep learning methods has recently gained significant research thrust, since relational approaches are indispensable for faithful and explainable modeling of implicit domain structure, a major limitation of most deep architectures in spite of their success. While extensive literature exists that aims to combine the two sutskever2009modelling; rocktaschel2014low; lodhi2013deep; battaglia2016interaction, to the best of our knowledge there has been little or no work on incorporating advice into any such framework.

Column networks transform relational structures into a deep architecture in a principled manner and are designed especially for collective classification tasks pham2017column. The architecture and formulation of the column network are well suited to adaptation within our advice framework. The GraphSAGE algorithm hamilton2017inductive shares similarities with column networks, since both architectures operate by aggregating neighborhood information, but differs in the way the aggregation is performed. Graph convolutional networks kipf2016semi are another architecture that operates much like CLNs, again differing in the aggregation method. diligenti2017semantic present a method of incorporating constraints, expressed as first-order logic statements with fuzzy semantics, as a regularization term in a neural model, which can be extended to collective classification problems. While it is similar in spirit to our proposed approach, it differs in its representation and problem setup.

Several recent approaches aim to make deep architectures robust to label noise. These include (1) learning from easy samples (with small loss) using MentorNets, neural architectures that estimate a curriculum (i.e., importance weights on samples) jiang2018mentornet, (2) noise-robust loss functions via additional noise-adaptation layers goldberger2016training or via multiplicative modifiers over the error/network parameters patrini2017making, and (3) the introduction of a regularizer into the loss function for smoothing in the presence of adversarial randomizations of the distribution of the response variable miyato2018virtual.

While the above approaches enable effective learning of deep models in the presence of noise, there are some fundamental differences between our problem setting and these related approaches.

  • [Type of noise]: Our approach aims to handle a specific type of noise, namely systematic/targeted noise odom2018human. It occurs frequently in real-world data due to several factors, including the cognitive biases of the humans (or errors in the processes) that record the data, as well as sample sparsity.

  • [Type of error]: Systematic noise leads to generalization errors in the learned model (see Example 1).

  • [Structured data]: K-CLN works in the context of structured data (entities/relations). Faithful modeling of structure is crucial in most real domains, but comes with the limitation that the data is inherently sparse (most entities are not related to each other, i.e., most relations are false in the real world).

  • [Noise prior]: All the noise-handling approaches for deep models mentioned earlier explicitly try to model the noise, either via prior knowledge of the noise distribution or by estimating it with some proposal distribution. In adversarial regularization miyato2018virtual, the learned label distribution is used as the proposal for generating perturbations, but this requires a lot of data for the learner to converge. K-CLN does not explicitly model noise, but instead allows expert knowledge to guide the learner towards better generalization via an inductive bias.

3 Knowledge-augmented Column Networks

Column Networks pham2017column allow for encoding interactions/relations between entities, as well as the attributes of such entities, in a principled manner without explicit relational feature construction or vector embedding. This is important when dealing with structured domains, especially in the case of collective classification. It enables us to seamlessly transform a multi-relational knowledge graph into a deep architecture, making CLNs one of the more robust relational deep models. Figure 1 illustrates an example column network with respect to the knowledge graph on its left. Note how each entity forms its own column and how relations are captured via the sparse inter-column connectors.

Figure 1: Original Column Network (diagram source: pham2017column)
Figure 2: Knowledge-augmented Column Network (K-CLN) architecture

Consider a graph $\mathcal{G} = (\mathcal{V}, \mathcal{A})$, where $\mathcal{V}$ is the set of vertices/entities. For brevity, we assume only one entity type; however, there is no such theoretical limitation in the formulation. $\mathcal{A}$ is the set of arcs/edges between two entities $i$ and $j$, denoted as $r(i, j)$. Note that the graph is multi-relational, i.e., $r \in \mathcal{R}$, where $\mathcal{R}$ is the set of relation types in the domain. To obtain the equivalent Column Network $\mathcal{C}$ from $\mathcal{G}$, let $\mathbf{x}_i$ be the feature vector representing the attributes of an entity $i$ and $y_i$ its label predicted by the model (since in our formulation every entity is uniquely indexed by $i$, we use $i$ and the entity interchangeably). $\mathbf{h}_i^t$ denotes a hidden node w.r.t. entity $i$ at hidden layer $t$ ($t$ is the index of the hidden layers). As mentioned earlier, the context between two consecutive layers captures the dependency of the immediate neighborhood (based on arcs/edges/inter-column connectors). For entity $i$, the context w.r.t. relation $r$ and the hidden nodes are computed as,

$\mathbf{c}_{ir}^{t} = \frac{1}{|\mathcal{N}_r(i)|}\textstyle\sum_{j \in \mathcal{N}_r(i)} \mathbf{h}_j^{t-1} \quad (1)$
$\mathbf{h}_i^{t} = g\big(\mathbf{b}^{t} + \mathbf{W}^{t}\mathbf{h}_i^{t-1} + \frac{1}{z}\textstyle\sum_{r \in \mathcal{R}} \mathbf{V}_r^{t}\,\mathbf{c}_{ir}^{t}\big) \quad (2)$

where $\mathcal{N}_r(i)$ are all the neighbors of $i$ w.r.t. relation $r$ in the knowledge graph $\mathcal{G}$. Note the absence of context connectors between certain mini-columns (Figure 1, right) wherever no relation exists between the corresponding entities (Figure 1, left). The activation of the hidden nodes (Equation 2) is computed as the sum of the bias $\mathbf{b}^{t}$, the weighted output of the previous hidden layer and the weighted contexts, where $\mathbf{W}^{t}$ and $\mathbf{V}_r^{t}$ are weight parameters and $g$ is some activation function. $z$ is a pre-defined constant that keeps the parameterized contexts from growing too large for complex relations. Setting $z$ to the average number of neighbors of an entity is a reasonable assumption. The final output layer is a softmax over the last hidden layer,

$P\big(y_i = \ell \mid \mathbf{h}_i^{T}\big) = \mathrm{softmax}\big(\mathbf{b}_\ell + \mathbf{W}_\ell\,\mathbf{h}_i^{T}\big) \quad (3)$

where $\ell \in \mathcal{L}$ is the label ($\mathcal{L}$ is the set of labels) and $T$ is the index of the last hidden layer.
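For illustration, the following is a minimal NumPy sketch of a single CLN layer update under the reconstruction above; the function name, argument layout and the choice of ReLU as the activation $g(\cdot)$ are our own assumptions rather than the original implementation.

```python
import numpy as np

def cln_layer(h_prev, neighbors, W, V, b, z):
    """One (reconstructed) Column Network layer update.

    h_prev    : (n_entities, d) hidden states from layer t-1
    neighbors : dict mapping relation type r -> {entity index -> list of neighbor indices}
    W         : (d, d) weight on an entity's own previous hidden state
    V         : dict mapping relation type r -> (d, d) context weight matrix
    b         : (d,) bias vector
    z         : damping constant (e.g., the average number of neighbors)
    """
    n, d = h_prev.shape
    h_new = np.zeros_like(h_prev)
    for i in range(n):
        ctx = np.zeros(d)
        for r, nbrs_of in neighbors.items():
            nbrs = nbrs_of.get(i, [])
            if nbrs:
                c_ir = h_prev[nbrs].mean(axis=0)   # context c_{ir}^t: mean of neighbor states
                ctx += V[r] @ c_ir                 # weighted context
        # bias + own previous state + damped contexts, passed through g(.) (here: ReLU)
        h_new[i] = np.maximum(b + W @ h_prev[i] + ctx / z, 0.0)
    return h_new
```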

Following pham2017column, we choose to formulate our approach in the context of relation-sensitive predictive modeling, specifically collective classification tasks. However, structured data is implicitly sparse, since most entities in the world are not related to each other, which adds to the existing challenge of faithfully modeling the underlying structure. The challenge is amplified as we aim to learn in the presence of knowledge-rich, data-scarce problems. As we show empirically, sparse samples (or targeted noise) may lead to sub-optimal learning or slower convergence.

Example 1

Consider the problem of classifying whether a published article is about carcinoid metastasis zuetenhorst2005metastatic or is irrelevant, from a citation network and textual features extracted from the articles themselves. There are several challenges: (1) the data is implicitly sparse due to the rarity of studied cases and experimental findings, (2) some articles may cite other articles related to carcinoid metastasis and contain a subset of the textual features, yet address another topic, and (3) there is targeted noise, where some important citations were not extracted properly by the citation parser and/or the abstracts are not informative enough.

The above cases may lead to the model failing to capture certain dependencies effectively, or converging more slowly, even if those dependencies are eventually captured in the deeper layers of the network. Our approach attempts to alleviate this problem via augmented learning of Column Networks using human advice/knowledge. We formally define our problem as follows:

Given: A sparse multi-relational graph $\mathcal{G}$, attributes $\mathbf{x}_i$ of each entity (sparse or noisy) in $\mathcal{G}$, the equivalent Column Network $\mathcal{C}$ and access to a human expert. To Do: More effective and efficient collective classification by knowledge-augmented training of $\mathcal{C}(\theta)$, where $\theta$ is the set of all the network parameters of the Column Network.

We develop Knowledge-augmented CoLumn Networks (K-CLN), which incorporate human knowledge for more effective and efficient learning from relational data (Figure 2 illustrates the overall architecture). While knowledge-based connectionist models are not entirely new, our formulation provides (1) a principled approach for incorporating advice specified in an intuitive logic-based encoding/language and (2) a deep model for collective classification in relational data.

3.1 Knowledge Representation

Any model-specific encoding of domain knowledge, such as numeric constraints or modified loss functions, has several limitations: (1) it is counter-intuitive to humans, since they are domain experts and not machine learning experts, and (2) the resulting framework is brittle and not generalizable. Consequently, we employ preference rules (akin to IF-THEN statements) to capture human knowledge.

Definition 1

A preference is a modified Horn clause,

$\wedge_{k}\,\mathtt{attr}_k(E_x) \wedge \ldots \wedge \mathtt{rel}(E_x, E_y) \wedge \ldots \Rightarrow \big[\mathtt{label}(E_z, \ell_1)\!\uparrow;\ \mathtt{label}(E_z, \ell_2)\!\downarrow\big]$

where $\ell_1, \ell_2 \in \mathcal{L}$ and the $E_x$ are variables over entities, $\mathtt{attr}_k$ are attributes of $E_x$ and $\mathtt{rel}$ is a relation. $\uparrow$ and $\downarrow$ indicate the preferred and non-preferred labels respectively. Quantification is implicitly universal ($\forall$) and hence dropped. We denote a set of preference rules as $\mathcal{P}$.

Note that we can always either have just the preferred label in the head of the clause and treat all others as non-preferred, or treat the entire head expression as a single literal. Intuitively, a rule can be interpreted as a conditional rule: IF [conditions hold] THEN label $\ell_1$ is preferred. A preference rule can also be partially instantiated, i.e., one or more of the variables may be substituted with constants.

Example 2

For the prediction task mentioned in Example 1, a possible preference rule could be,

$\mathtt{cites}(A, B) \wedge \mathtt{hasWord}(B, \text{`AI'}) \wedge \mathtt{hasWord}(A, \text{`domain'}) \Rightarrow \big[\mathtt{label}(A, \mathit{irrelevant})\!\uparrow;\ \mathtt{label}(A, \mathit{relevant})\!\downarrow\big]$

Intuitively, this rule denotes that an article is not a clinically relevant work on carcinoid metastasis if it cites an ‘AI’ article and contains the word “domain”, since it is likely to be another AI article that uses carcinoid metastasis as an evaluation domain.
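Purely as an illustration, such a preference rule could be encoded for a pre-processing wrapper roughly as follows; the predicate names and dictionary fields are hypothetical and not the paper's actual encoding.

```python
# Hypothetical encoding of the Example 2 preference: an article that cites an 'AI'
# article and contains the word "domain" should be pushed towards 'irrelevant'.
rule = {
    "body": [
        ("cites", "A", "B"),            # relation literal over entity variables A, B
        ("hasWord", "B", "AI"),         # attribute literal on entity B
        ("hasWord", "A", "domain"),     # attribute literal on entity A
    ],
    "head_entity": "A",
    "preferred_label": "irrelevant",    # label marked with an up-arrow
    "non_preferred_label": "relevant",  # label marked with a down-arrow
}
```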

3.2 Knowledge Injection

Given that knowledge is provided as partially-instantiated preference rules $\mathcal{P}$, more than one entity may satisfy a single preference rule. Also, more than one preference rule may be applicable to a single entity. The main intuition is that we aim to consider the error of the trained model w.r.t. both the data and the advice. Consequently, in addition to the “data gradient” as in the original CLNs, there is an “advice gradient”. This gradient acts as feedback to augment the learned weight parameters (both column and context weights) in the direction of the advice gradient. It must be mentioned that not all parameters will be augmented: only the parameters w.r.t. the entities and relations (contexts) that satisfy $\mathcal{P}$ should be affected. Let $\mathcal{M_P}$ be the set of entities and relations that satisfy the set of preference rules $\mathcal{P}$. The hidden nodes (Equation 2) can now be expressed as,

$\mathbf{h}_i^{t} = g\big(\mathbf{b}^{t} + \mathbf{W}^{t}\mathbf{h}_i^{t-1}\,\Gamma_i + \frac{1}{z}\textstyle\sum_{r \in \mathcal{R}} \mathbf{V}_r^{t}\,\mathbf{c}_{ir}^{t}\,\Gamma_{ir}\big) \quad (4)$

where $(i, r) \in \mathcal{M_P}$, and $\Gamma_i$ and $\Gamma_{ir}$ are advice-based soft gates with respect to a hidden node and its context respectively. $f$ is some gating function, $\nabla_i^{adv}$ is the “advice gradient” and $\alpha$ is the trade-off parameter explained later. The key aspect of the soft gates is that they attempt to enhance or decrease the contribution of particular edges in the column network, aligned with the direction of the “advice gradient”. We choose the gating function to be an exponential, $\Gamma = \exp\big(\alpha\,\nabla_i^{adv}\big)$. The intuition is that soft gates are natural, as they are multiplicative: a positive gradient increases the value/contribution of the respective term, while a negative gradient pushes it down. We now present the “advice gradient” (the gradient with respect to the preferred labels).
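A minimal sketch of this exponential soft gate, where `alpha` plays the role of the trade-off parameter and `adv_grad` the advice gradient (names are ours):

```python
import numpy as np

def advice_gate(adv_grad, alpha):
    """Exponential soft gate Gamma = exp(alpha * advice_gradient).

    A positive advice gradient yields a gate > 1 (the gated term is promoted),
    a negative gradient yields a gate < 1 (demoted), and a zero gradient yields
    exactly 1 (the term is left unchanged).
    """
    return np.exp(alpha * np.asarray(adv_grad))

# e.g., advice_gate([0.5, -0.5, 0.0], alpha=1.0) -> approx. [1.65, 0.61, 1.0]
```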

Proposition 1

Under the assumption that the loss function with respect to the advice / preferred labels is a log-likelihood of the form $\mathcal{L}^{adv} = \log P\big(y_i = y_i^{adv} \mid \mathbf{h}_i^{T}\big)$, the advice gradient is $\nabla_i^{adv} = \mathbb{I}\big(y_i^{adv}\big) - P(y_i)$, where $y_i^{adv}$ is the preferred label of entity $i$ and $\mathbb{I}$ is an indicator function over the preferred label. For binary classification the indicator is inconsequential, but for multi-class scenarios it is essential ($\mathbb{I} = 1$ for the preferred label and $\mathbb{I} = 0$ for all other labels).

Since an entity can satisfy multiple advice rules, we take the MAX preferred label, i.e., we take $\ell$ to be the preferred label $y_i^{adv}$ if it is given by the majority of the advice rules that entity $i$ satisfies. In case of conflicting advice (i.e., different labels are equally advised), we simply set the advice label to be the label given by the data, $y_i^{adv} = y_i$.

Proof Sketch: Most advice-based learning methods formulate the effect of advice as a constraint on the parameters or a regularization term on the loss function. We consider a regularization term based on the advice loss $\mathcal{L}^{adv} = \log P\big(y_i = y_i^{adv} \mid \mathbf{h}_i^{T}\big)$. We consider $P\big(y_i = y_i^{adv} \mid \mathbf{h}_i^{T}\big)$ in its functional form, following prior non-parametric boosting approaches odomaaai15: $P\big(y_i = y_i^{adv} \mid \mathbf{h}_i^{T}\big) = \exp\big(\psi(y_i^{adv})\big)\big/\sum_{y'}\exp\big(\psi(y')\big)$. A functional gradient of $\mathcal{L}^{adv}$ w.r.t. $\psi$ yields $\nabla_i^{adv} = \partial \mathcal{L}^{adv} / \partial \psi = \mathbb{I}\big(y_i^{adv}\big) - P(y_i)$.

Alternatively, assuming a squared loss such as $\frac{1}{2}\big(\mathbb{I}(y_i^{adv}) - P(y_i)\big)^2$ would result in an advice gradient of the form $\big(\mathbb{I}(y_i^{adv}) - P(y_i)\big)\,P(y_i)\,\big(1 - P(y_i)\big)$.
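A small sketch of the two advice-gradient forms above for the multi-class case, written against per-class predicted probabilities (a simplification of the paper's tensor-based computation):

```python
import numpy as np

def advice_gradient_loglik(probs, preferred):
    """Log-likelihood advice gradient: I(y_adv) - P(y)."""
    ind = np.zeros_like(probs)
    ind[preferred] = 1.0
    return ind - probs

def advice_gradient_squared(probs, preferred):
    """Squared-loss variant: (I(y_adv) - P(y)) * P(y) * (1 - P(y))."""
    ind = np.zeros_like(probs)
    ind[preferred] = 1.0
    return (ind - probs) * probs * (1.0 - probs)

# e.g., probs = np.array([0.2, 0.7, 0.1]) with preferred label index 0:
# advice_gradient_loglik(probs, 0) -> array([ 0.8, -0.7, -0.1])
```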

As illustrated in the K-CLN architecture (Figure 2), at the end of every epoch of training, the advice gradients are computed and the soft gates are used to augment the values of the hidden units, as shown in Equation 4.

Proposition 2

Given that the loss function of the original CLN is cross-entropy (binary or sparse-categorical for the binary and multi-class prediction cases respectively) and the objective w.r.t. the advice is a log-likelihood, the functional gradient of the modified objective for K-CLN is,

$\nabla_i = (1-\alpha)\big(\mathbb{I}(y_i) - P(y_i)\big) + \alpha\big(\mathbb{I}(y_i^{adv}) - P(y_i)\big) = (1-\alpha)\,\nabla_i^{data} + \alpha\,\nabla_i^{adv} \quad (5)$

where $\alpha$ is the trade-off parameter between the effect of the data and the effect of the advice, $\mathbb{I}(y_i)$ and $\mathbb{I}(y_i^{adv})$ are the indicator functions on the label w.r.t. the data and the advice respectively, and $\nabla_i^{data}$ and $\nabla_i^{adv}$ are, similarly, the gradients w.r.t. the data and the advice respectively.

Proof Sketch: The original objective function (w.r.t. the data) of CLNs is cross-entropy. For clarity, let us consider the binary prediction case, where the objective function is a binary cross-entropy of the form

$\mathcal{L}^{data} = -\textstyle\sum_i \big[\, y_i \log P(y_i) + (1 - y_i)\log\big(1 - P(y_i)\big)\,\big].$

Ignoring the summation (and the sign) for brevity, for every entity $i$ we work with the log-likelihood $\mathcal{L}_i^{data} = y_i \log P(y_i) + (1 - y_i)\log\big(1 - P(y_i)\big)$. Extension to the multi-class prediction case with a sparse categorical cross-entropy is straightforward and is an algebraic manipulation task. Now, from Proposition 1, the loss function w.r.t. the advice is the log-likelihood $\mathcal{L}_i^{adv} = \log P\big(y_i = y_i^{adv} \mid \mathbf{h}_i^{T}\big)$. Thus the modified objective is expressed as,

$\mathcal{L}_i = (1-\alpha)\,\mathcal{L}_i^{data} + \alpha\,\mathcal{L}_i^{adv} \quad (6)$

where $\alpha$ is the trade-off parameter (the summation over entities is implicitly understood). Now we know from Proposition 1 that the distributions $P(y_i)$ and $P(y_i = y_i^{adv})$ can be expressed in their functional forms, given that the activation function of the output layer is a softmax, as $P(y_i = \ell) = \exp\big(\psi(\ell)\big)\big/\sum_{y'}\exp\big(\psi(y')\big)$. Taking the functional (partial) gradients (w.r.t. $\psi$) of the modified objective function (Equation 6), followed by some algebraic manipulation, we obtain Equation 5.

Hence, it follows from Proposition 2 that the data and the advice balance the training of the K-CLN network parameters via the trade-off hyperparameter $\alpha$. When the data is noisy (or sparse, with negligible examples for a region of the parameter space), the advice (if correct) induces a bias on the output distribution towards the correct label. Even if the advice is incorrect, the network still tries to learn the correct distribution to some extent from the data (if it is not noisy). The relative contribution of the data versus the advice primarily depends on $\alpha$. If both the data and the human advice are sub-optimal (noisy), the correct label distribution is not learnable.

1: procedure KCLN(Knowledge graph G, Column network C, Advice P, Trade-off α)
2:     C ← K-CLN with the expression for the hidden units changed w.r.t. Eqn 4
3:     Initialize network parameters θ of the K-CLN to arbitrary values
4:     (M_E, M_C, M_L) ← CreateMask(G, P)    ▷ mask over entities/relations covered by advice
5:     ∇_adv(0) ← 0    ▷ advice gradient at epoch 0
6:     for epochs k = 1 to convergence do    ▷ convergence criteria same as original CLN
7:         Get advice gradients ∇_adv(k−1) w.r.t. the previous epoch
8:         Compute gates Γ_i and Γ_ir from ∇_adv(k−1), M_E and M_C
9:         Train C(θ) using Equation 4; update θ
10:        Compute output probabilities P(y_i) from C(θ) for the current epoch
11:        Store ∇_adv(k), obtained from P(y_i) and M_L
12:    end for
13:    return K-CLN C(θ)
14: end procedure
15: procedure CreateMask(Knowledge graph G, Advice P)
16:    d: feature length of an entity; n: # of entities
17:    M_E ← 0_{n×d}; M_C ← 0_{n×n}
18:    M_L ← 0_{n×L}, where L is the number of distinct labels
19:    ▷ M_E: entity mask; M_C: context mask & M_L: label mask, w.r.t. advice
20:    for each preference p ∈ P do
21:        if entity i (and context (i, j)) satisfies p then
22:            M_E[i, f] ← 1, where f is the feature affected by p
23:            M_C[i, j] ← 1, where j is the neighbor/context affected by p
24:            M_L[i, ℓ] ← 1, where ℓ ← LabelOf(p)
25:        end if
26:    end for
27:    return (M_E, M_C, M_L)
28: end procedure
Algorithm 1 K-CLN: Knowledge-augmented CoLumn Networks

3.3 The Algorithm

Algorithm 1 outlines the key steps involved in our approach. KCLN() is the main procedure [lines 1-14] that trains a Column Network using both the data (the knowledge graph G) and the human advice (the set of preference rules P). It returns a K-CLN C(θ), where θ are the network parameters, which are initialized to arbitrary values [line 3]. As described earlier, the network parameters of K-CLN (same as CLN) are manipulated (stored and updated) via tensor algebra with appropriate indexing for entities and relations. Also recall that our gating functions are piece-wise/non-smooth and apply only to the subspace of entities, features and relations where the preference rules are satisfied. Thus, as a pre-processing step, we create tensor masks that compactly encode this subspace with a call to the procedure CreateMask() [line 4], which we elaborate on later.

The network is then trained through multiple epochs until convergence [lines 6-12]. At the end of every epoch, the output probabilities and the gradients are computed and stored in a shared data structure [line 11] such that they can be accessed subsequently to compute the advice gates [lines 7-8]. Our network is trained in a manner largely similar to the original CLN, with two key modifications [line 9], namely,

  1. Equation 4 is used as the modified expression for the hidden units

  2. The data trade-off $(1-\alpha)$ is multiplied with the original loss, while its counterpart, the advice trade-off $\alpha$, is used to compute the gates [line 8]; a compact sketch of this training loop is shown below
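To tie these steps together, the following is a compact Python sketch of the control flow of Algorithm 1; `train_epoch_fn` is a placeholder for one epoch of gated CLN training (with the data loss scaled by $(1-\alpha)$) and is not the authors' code, and the gates are kept per entity/label rather than per hidden node and context for brevity.

```python
import numpy as np

def kcln_train(train_epoch_fn, rule_hits, n_entities, n_labels, alpha=1.0, max_epochs=50):
    """High-level control flow of Algorithm 1 (simplified).

    train_epoch_fn(gates) -- placeholder for one epoch of CLN training in which
    the hidden units are gated as in Equation 4 and the data loss is scaled by
    (1 - alpha); it must return per-entity class probabilities.
    rule_hits -- pre-computed advice matches: {entity index: preferred label index}.
    """
    # label mask and coverage mask from the grounded advice (CreateMask, simplified)
    m_lab = np.zeros((n_entities, n_labels))
    covered = np.zeros((n_entities, 1))
    for i, lbl in rule_hits.items():
        m_lab[i, lbl] = 1.0
        covered[i] = 1.0

    adv_grad = np.zeros((n_entities, n_labels))   # advice gradient at epoch 0
    probs = None
    for epoch in range(max_epochs):
        gates = np.exp(alpha * adv_grad)          # soft gates from the previous epoch
        probs = train_epoch_fn(gates)             # one epoch of gated CLN training
        adv_grad = (m_lab - probs) * covered      # I(y_adv) - P(y), only where advice applies
    return probs
```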

Procedure CreateMask() [lines 15-27] constructs the tensor mask(s) over the space of entities, features and relations/contexts that are required to compute the gates (as seen in line 8). The data (the ground knowledge graph G) and the set of preference rules P are provided as inputs. There are three key components of the advice mask:

  1. The entity mask (a tensor of dimensions #entities × feature-vector length), which indicates which entities and which of their features are affected by the advice/preferences

  2. The context mask (#entities × #entities), which indicates the contexts that are affected (relations are directed, so this matrix is asymmetric)

  3. The label mask (#entities × #labels), which indicates the preferred label of the affected entities, in one-hot encoding

All components are initialized to zeros. The masks are then computed iteratively for every preference rule [lines 20-26]. This includes satisfiability checking for a given preference rule [line 21], which is achieved via subgraph matching on the knowledge graph, since a preference rule (a Horn clause; Definition 1) can be viewed as a subgraph template. For more details, we refer to the work on employing hypergraph/graph databases for counting instances of Horn clauses richards95; dasSDM2016; DasAAAI19. For all entities, relevant features, and relations/contexts that satisfy the rule, the corresponding elements of the tensor masks are set to 1 [lines 22-24]. The entity and context masks are used in gate computation in the KCLN procedure, and the label mask is used for the indicator in the advice gradient.
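A sketch of how the three masks could be assembled from already-grounded rule matches (the subgraph-matching step itself is elided); the dictionary layout and shapes are our assumptions.

```python
import numpy as np

def create_masks(matches, n_entities, n_features, n_labels):
    """Build the three advice masks from grounded preference-rule matches.

    matches: list of dicts, one per satisfied (partially instantiated) rule, e.g.
        {"entity": 7, "features": [3, 12], "contexts": [(7, 5)], "label": 2}
    where "contexts" holds directed (entity, neighbor) pairs affected by the rule.
    """
    m_ent = np.zeros((n_entities, n_features))   # entity x feature mask
    m_ctx = np.zeros((n_entities, n_entities))   # directed context mask (asymmetric)
    m_lab = np.zeros((n_entities, n_labels))     # one-hot preferred-label mask
    for m in matches:
        i = m["entity"]
        m_ent[i, m["features"]] = 1.0
        for (src, dst) in m["contexts"]:
            m_ctx[src, dst] = 1.0
        m_lab[i, m["label"]] = 1.0
    return m_ent, m_ctx, m_lab
```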

Having presented the formulation and the learning procedure of K-CLNs, we now turn our attention to the empirical evaluation of the proposed work.

4 Experiments

We investigate the following questions as part of our experiments:

  1. Can K-CLNs learn effectively with noisy, sparse samples, i.e., performance? (Q1)

  2. Can K-CLNs learn efficiently with noisy, sparse samples, i.e., speed of learning? (Q2)

  3. How does the quality of advice affect the performance of K-CLN, i.e., reliance on robust advice? (Q3)

We compare against the original Column Network architecture with no advice (referred to as Vanilla CLN pham2017column) as a baseline. Our intention is to show how advice/knowledge can guide model learning towards better predictive performance and efficiency, in the context of collective classification using Column Networks. Also, we have discussed earlier, in detail, how our problem setting is distinct from most existing noise-robust deep learning approaches. Hence, we restrict our comparisons to the original work.

4.1 Experimental Setup

System: K-CLN has been developed by extending the original CLN architecture, which uses Keras as the functional deep learning API with a Theano backend for tensor manipulation. We extend this system to include: (1) advice-gradient feedback at the end of every epoch, (2) modified hidden-layer computations and (3) a pre-processing wrapper to parse the advice/preference rules and create the appropriate tensor masks. Since it is not straightforward to access the final-layer output probabilities from inside a hidden layer in Keras, we use Callbacks to write/update the predicted probabilities to a shared data structure at the end of every epoch. This data structure is then fed via inputs to the hidden layers. Each mini-column with respect to an entity is a dense network of hidden layers (with the layer and node counts following the most effective settings outlined in pham2017column).
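A minimal sketch of such a callback, written against the modern `tf.keras` API (the original system used Keras with a Theano backend); the shared dictionary and the `inputs` argument are illustrative.

```python
import tensorflow as tf

class EpochProbabilityWriter(tf.keras.callbacks.Callback):
    """Write the model's predicted class probabilities to a shared store at the
    end of every epoch, so the next epoch's advice gates can be computed from them."""

    def __init__(self, inputs, shared_store):
        super().__init__()
        self.inputs = inputs              # network inputs covering all entities
        self.shared_store = shared_store  # e.g., a plain dict shared with the training code

    def on_epoch_end(self, epoch, logs=None):
        probs = self.model.predict(self.inputs, verbose=0)
        self.shared_store["probs"] = probs
        self.shared_store["epoch"] = epoch
```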

The pre-processing wrapper acts as an interface between the advice, encoded in a symbolic language (Horn clauses), and the tensor-based computation architecture. The advice masks encode $\mathcal{M_P}$, i.e., the set of entities and contexts where the gates are applicable (Algorithm 1).

Domains: We evaluate our approach on four relational domains – Pubmed Diabetes and Corporate Messages, which are multi-class classification problems, and Internet Social Debates and Social Network Disaster Relevance, which are binary. Pubmed Diabetes (https://linqs.soe.ucsc.edu/data) is a citation network for predicting whether a peer-reviewed article is about Diabetes Type 1, Type 2 or none, using textual features (TF-IDF vectors) from PubMed abstracts as well as citation relationships between them. It comprises articles, considered as entities, with bag-of-words textual features (TF-IDF weighted word vectors) and citation relationships among them. Internet Social Debates (http://nldslab.soe.ucsc.edu/iac/v2/) is a data set for predicting stance (‘for’/‘against’) about a debate topic from online posts on social debates. It contains posts (entities) characterized by TF-IDF vectors, extracted from the text and header, and relations of two types, ‘sameAuthor’ and ‘sameThread’. Corporate Messages (https://www.figure-eight.com/data-for-everyone/) is an intention prediction data set of flier messages sent by corporate groups in the finance domain, with sameSourceGroup relations. The target is to predict the intention of the message (Information, Action or Dialogue). Finally, Social Network Disaster Relevance (same source) is a relevance prediction data set of 8000 Twitter posts, curated and annotated by crowd workers with relevance scores. Along with bag-of-words features, we use confidence-score features and relations among tweets (of types ‘same author’ and ‘same location’). Table 1 outlines the important aspects of the four domains (data sets) used in our experimental evaluation. As indicated earlier, inspired by the original CLN work, we evaluate our approach on both binary and multi-class prediction problems.

Domain/Data set #Entities #Relations #Features Target type
Pubmed Diabetes Multi-class
Corporate Messages Multi-class
Online Social Debates Binary
Disaster Relevance Binary
Table 1: Evaluation domains and their properties

Metrics: Following pham2017column, we report macro-F1 and micro-F1 scores for the multi-class problems, and F1 scores and AUC-PR for the binary ones. Macro-F1 computes the F1 score independently for each class and takes the average, whereas micro-F1 aggregates the contributions of all classes to compute the average F1 score. For all experiments, the number of hidden layers and the number of hidden units per column in each layer are kept fixed. All results are averaged over 5 runs. Other settings are consistent with the original CLN.
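For reference, the macro/micro averaging described here corresponds to the standard scikit-learn computation; a small sketch:

```python
from sklearn.metrics import f1_score

y_true = [0, 1, 2, 2, 1, 0]
y_pred = [0, 2, 2, 2, 1, 0]

macro_f1 = f1_score(y_true, y_pred, average="macro")  # unweighted mean of per-class F1
micro_f1 = f1_score(y_true, y_pred, average="micro")  # F1 over pooled counts of all classes
```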

Human Advice: K-CLN is designed to handle arbitrarily complex expert advice, given that it is encoded as preference rules. However, even with relatively simple preference rules, K-CLN is more effective with sparse samples. For instance, in Pubmed, the longest of the 4 preference rules used simply indicates that an article citing another article that discusses obesity is likely to be about Type 2 diabetes; even such a simple rule proved to be effective. Expert knowledge from real physicians could thus prove to be even more effective. In Disaster Relevance, we used rules that did not require much domain expertise, such as: if a tweet is by a user who usually posts non-disaster tweets, then the tweet is likely to be a non-disaster one. Sub-optimal advice may push the advice gradient in the wrong direction. However, our soft gates do not alter the loss; instead, they promote/demote the contribution of nodes/contexts. Similar to Patrini et al. patrini2017making, the data still balances the effect of the advice during training.

Figure 3: [Pubmed Diabetes publication prediction (multi-class)] Learning curves: (a) Micro-F1 and (b) Macro-F1 w.r.t. training epochs at a 24% (of total) sample; (c) Micro-F1 and (d) Macro-F1 w.r.t. varying sample sizes [best viewed in color].
Figure 4: [Corporate Messages intention prediction (multi-class)] Learning curves: (a) Micro-F1 and (b) Macro-F1 w.r.t. training epochs at a 24% (of total) sample; (c) Micro-F1 and (d) Macro-F1 w.r.t. varying sample sizes [best viewed in color].
Figure 5: [Internet Social Debates stance prediction (binary)] Learning curves: (a) F1 and (b) AUC-PR w.r.t. training epochs at a 24% (of total) sample; (c) F1 and (d) AUC-PR w.r.t. varying sample sizes [best viewed in color].
Figure 6: [Social Network Disaster relevance prediction (binary)] Learning curves: (a) F1 and (b) AUC-PR w.r.t. training epochs at a 24% (of total) sample; (c) F1 and (d) AUC-PR w.r.t. varying sample sizes [best viewed in color].

4.2 Experimental Results

Recall that our goal is to demonstrate the efficiency and effectiveness of K-CLNs with a smaller set of training examples. Hence, we present the aforementioned metrics with varying sample sizes and with varying epochs, and compare our model against Vanilla CLN. We split the data sets into a training set and a hold-out test set with a 60%-40% ratio. For the varying-epochs experiments, we train on only 40% of the already-split training set (i.e., 24% of the complete data) and test on the hold-out test set. Figures 3(a) - 3(b) illustrate the micro-F1 and macro-F1 scores for the PubMed diabetes data, and Figures 6(a) - 6(b) show the F1 score and AUC-PR for the social network disaster relevance data. As the figures show, although both K-CLN and Vanilla CLN converge to the same predictive performance, K-CLN converges significantly faster (fewer epochs). For the corporate messages and internet social debates domains, K-CLN not only converges faster but also achieves better predictive performance than Vanilla CLN, as shown in Figures 4(a) - 4(b) and Figures 5(a) - 5(b). The results show that K-CLNs learn more efficiently with noisy, sparse samples, thereby answering (Q2) affirmatively.

The effectiveness of K-CLN is illustrated by its performance with respect to varying sample sizes of the training set, especially at low sample sizes. The intuition is that domain knowledge should help guide the model to learn better when the amount of available training data is small. K-CLN is trained on gradually varying sample sizes, from 5% of the training data (3% of the complete data) to 80% of the training data (48% of the complete data), and tested on the hold-out test set. Figures 3(c) - 3(d) present the micro-F1 and macro-F1 scores for PubMed diabetes, and Figures 6(c) - 6(d) plot the F1 score and AUC-PR for social network disaster relevance. It can be seen that K-CLN outperforms Vanilla CLN across all sample sizes, on both metrics, which suggests that the advice is relevant throughout the training phase across sample sizes. For corporate messages, K-CLN outperforms Vanilla CLN with a small number of samples, as shown by the micro-F1 metric (Figure 4(c)), gradually converging to similar predictive performance with larger samples. Macro-F1 (Figure 4(d)), however, shows that the performance is similar for both models across all sample sizes, although K-CLN does perform better with very small samples. Since this is a multi-class classification problem, similar macro-F1 performance suggests that for some classes the advice is not applicable during learning, while it applies well to other classes, thereby averaging out in the final result. For internet social debate stance prediction, Figures 5(c) - 5(d) present the F1 score and AUC-PR respectively. K-CLN outperforms Vanilla CLN on both metrics, and thus we can answer (Q1) affirmatively: K-CLNs learn effectively with noisy, sparse samples.

An obvious question that arises is: how robust is our learning system to noisy/incorrect advice? Conversely, how does the choice of $\alpha$ affect the quality of the learned model? To answer these questions, we performed an additional experiment on the Internet Social Debates domain by augmenting the learner with incorrect advice. This incorrect advice is created by changing the preferred labels of the advice rules to incorrect values (based on our understanding). Also, recall that the contribution of advice depends on the trade-off parameter $\alpha$, which controls the robustness of K-CLN to advice quality. Consequently, we experimented with different values of $\alpha$, across varying sample sizes.

Figure 7 shows how, with higher values of $\alpha$, the performance deteriorates due to the effect of the noisy advice. The no-weight setting $\alpha = 0$ is not plotted since its performance is the same as no-advice/Vanilla CLN. Note that with reasonably low values of $\alpha$, the performance does not deteriorate much and is, in fact, better for some sample sizes. Thus, with reasonably low values of $\alpha$, K-CLN is robust to the quality of the advice (Q3). We picked one domain to present these robustness results, but have observed similar behavior in all the domains.

Figure 7: Performance (F1 and AUC-PR) of K-CLN on the Internet Social Debates data set across different sample sizes, with varying trade-off parameter $\alpha$ (on the advice gradient): (a) F1, (b) AUC-PR. Note that the advice here is incorrect/sub-optimal. $\alpha = 0$ has the same performance as no-advice (Vanilla CLN), hence it is not plotted.

4.3 Discussion

It is difficult to quantify the correctness or quality of human advice unless absolute ground truth is accessible in some manner. We evaluate on sparse samples of real data sets with no gold-standard labels available. Hence, to the best of our capabilities, we have provided the most relevant/useful advice in the experiments aimed at answering (Q1) and (Q2), as indicated in the experimental setup. We emulate noisy advice (for Q3) by flipping/altering the preferred labels of advice rules in the original set of preferences.

We have shown theoretically, in Propositions 1 and 2, that the robustness of K-CLN depends on the advice trade-off parameter $\alpha$. We illustrated how it can control the contribution of the data versus the advice towards effective training. We postulate that, even in the presence of noisy advice, the data (if not noisy) is expected to contribute towards effective learning with a weight of $(1-\alpha)$. Of course, if both the data and the advice are noisy, the concept is not learnable. Note that this is the case with any learning algorithm: when both the knowledge/hypothesis space and the data are incorrect, an incorrect hypothesis may be learned.

The experiments w.r.t. Q3 (Figure 7) empirically support our theoretical analysis. We found that when $\alpha$ is reasonably low, K-CLN performs well even with noisy advice. In the earlier experiments, where we use potentially good advice, we report the results with a higher value of $\alpha$, since the advice gradient is piecewise (it affects only a subset of entities/relations). It is therefore reasonable to assign a higher weight to the advice, and to the contribution of the entities and relations/contexts affected by it, given that the advice is noise-free. Also note that the drop in performance at very low sample sizes (in Figure 7) highlights how challenging learning is in the noisy-data and noisy-advice scenario. This aligns with our general understanding of most human-in-the-loop/advice-based approaches in AI. Trading off data and advice via a weighted combination of both is a well-studied solution in the related literature OdomNatarajan18 and, hence, we adopt the same in our context. Tracking the expertise of humans to infer advice quality is an interesting future research direction.

5 Conclusion

We considered the problem of providing guidance for CLNs. Specifically, inspired by treating domain experts as true domain experts and not as CLN experts, we developed a formulation based on preferences. This formulation allowed for the natural specification of guidance. We derived the gradients based on advice and outlined their integration with the original CLN formulation. Our experimental results across different domains clearly demonstrate the effectiveness and efficiency of the approach, specifically in knowledge-rich, data-scarce problems. Exploring other types of advice, including feature importances, qualitative constraints and privileged information, is a potential future direction. Scaling our approach to web-scale data is a natural extension. Finally, extending the idea to other deep models remains an interesting direction for future research.

References