Knowledge transfer between Artificial Intelligence systems has been the subject of extensive discussion in the literature for more than two decades Gorban:DAN:1991 , Hinton:NC:1991 , Pratt:ANIP:1992 , Schultz:2000 . The state-of-the-art approach to date is to use, or salvage, parts of the “teacher” AI system in the “student” AI, followed by re-training of the “student” yosinski2014transferable , chen2015net2net . Alternatives to AI salvaging include model compression Bucila:2006 , knowledge distillation Hinton:2015 , and privileged information vapnik2017knowledge . These approaches have demonstrated substantial success in improving the generalization capabilities of AIs, as well as in reducing computational overheads SqueezeNet:2016 , in cases of knowledge transfer from a larger AI to a smaller one. Regardless, however, of which of the above strategies is followed, their implementation often requires either significant resources, including large training sets and the power needed for training, or access to privileged information that may not necessarily be available to end-users. Thus new frameworks and approaches are needed.
In this contribution we provide a new framework for an automated, fast, and non-destructive process of knowledge spreading across AI systems of varying architectures. In this framework, knowledge transfer is accomplished by means of Knowledge Transfer Units comprising mere linear functionals and/or their small cascades. The main mathematical ideas are rooted in measure concentration Gromov:1999 , GAFA:Gromov:2003 , Gibbs1902 , Levi1951 , Gorban:2007 and stochastic separation theorems GorbanTyukin:NN:2017 revealing peculiar properties of random sets in high dimensions. We generalize some of the latter results here and show how these generalizations can be employed to build simple one-shot Knowledge Transfer algorithms between heterogeneous AI systems whose state may be represented by elements of a linear vector space of sufficiently high dimension. Once knowledge has been transferred from one AI to another, the approach also allows the new knowledge to be “unlearned” without requiring that a complete copy of the “student” AI be created prior to learning. We expect that the proposed framework may pave the way for a fully functional new phenomenon, a Nursery of AI systems, in which AIs quickly learn from each other whilst keeping their pre-existing skills largely intact.
The paper is organized as follows. Section 2 contains mathematical background needed to justify the proposed knowledge transfer algorithms. In Section 3 we present two algorithms for transferring knowledge between a pair of AI systems in which one operates as a teacher and the other functions as a student. Section 4 illustrates the approach with examples, and Section 5 concludes the paper.
2 Mathematical background
Let the set
be an i.i.d. sample from a distribution in . Pick another set
from the same distribution at random. What is the probability that there is a linear functional separating from ?
). These two special cases cover or, indeed, approximate a broad range of practically relevant situations, including e.g. Gaussian distributions (which reduce asymptotically to the equidistribution in for large enough) and data vectors in which each attribute is a numerical and independent random variable.
Consider the case when the underlying probability distribution is an equidistribution in the unit ball, and suppose that and are i.i.d. samples from this distribution. We are interested in determining the probability that there exists a linear functional separating and
. An estimate of this probability is provided in the following theorem.
Let and be i.i.d. samples from the equidistribution in . Then
Proof of Theorem 1. Given that elements in the set are independent, the probability that is
Consider an auxiliary set
Vectors belong to the sphere of radius centred at the origin (see Figure 1, (b)).
According to GorTyu:2016 (proof of Proposition 3 and estimate (26)), the probability that for a given all elements of are pair-wise -orthogonal, i.e.
can be estimated from below as:
for . Suppose now that (2) holds true. Let be chosen so that . If this is the case then there exists a set of pair-wise orthogonal vectors
such that (Figure 1, (c))
Finally, consider the set
The set belongs to the sphere of radius , and its elements are vertices of the corresponding -simplex in (Figure 1, (d)).
Consider the functional:
Recall that if
are orthonormal vectors in , then . Hence , and we can conclude that and for all . According to (3), for all . Therefore the functional
satisfies the following condition: and for all . This is illustrated with Figure 1, (e).
The functional partitions the unit ball into the union of two disjoint sets: the spherical cap
and its complement in , . The volume of the cap can be estimated from above as
Hence the probability that for all can be estimated from below as
Therefore, for fixed chosen so that , the probability that can be separated from by the functional can be estimated from below as:
Given that this estimate holds for all feasible values of , statement (1) follows.
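The separation mechanism used in the proof admits a simple numerical illustration. The sketch below is our own and not part of the paper; the helper names and the threshold choice are illustrative assumptions. It draws an i.i.d. sample from the equidistribution in the unit ball and checks whether a further random point x is separated from the sample by a single linear functional of the Fisher-discriminant-like form l(z) = <z, x> - (1 - eps)*||x||^2.

```python
# Empirical sketch (assumed setting, not the paper's code): in high dimension n,
# a point x drawn from the equidistribution in the unit ball is, with high
# probability, separated from an i.i.d. sample Y by the linear functional
#   l(z) = <z, x> - theta,   theta = (1 - eps) * ||x||^2.
# The names sample_ball, separates and the value of eps are illustrative.
import math
import random

def sample_ball(n, rng):
    """One point from the equidistribution in the unit ball of R^n."""
    g = [rng.gauss(0.0, 1.0) for _ in range(n)]
    norm = math.sqrt(sum(v * v for v in g))
    r = rng.random() ** (1.0 / n)          # radius with density proportional to r^(n-1)
    return [r * v / norm for v in g]

def separates(x, sample, eps=0.1):
    """True iff l(z) = <z, x> - (1 - eps)*||x||^2 is negative on all of sample."""
    theta = (1.0 - eps) * sum(v * v for v in x)
    return all(sum(a * b for a, b in zip(y, x)) < theta for y in sample)

rng = random.Random(0)
n, M = 100, 1000
Y = [sample_ball(n, rng) for _ in range(M)]
x = sample_ball(n, rng)
print(separates(x, Y))   # in n = 100 this is almost always True
```

Note that the same functional cannot separate a point of the sample from the rest of the sample: for y in Y the value l(y) is non-negative at y itself, so `separates(Y[0], Y)` returns False, consistent with the functional being aimed at the new point only.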
Figure 2 shows how estimate (1) of the probability behaves, as a function of , for fixed and . As one can see from this figure, when exceeds some critical value ( in this specific case), the lower bound estimate (1) of the probability drops. This is not surprising since the bound (1) is (a) based on rough, -like, estimates, and (b) derived for just one class of separating functionals . Furthermore, no prior pre-processing and/or clustering was assumed for the . An alternative estimate that allows us to account for possible clustering in the set is presented in Theorem 2.
Let and be i.i.d. samples from the equidistribution in . Let be a subset of elements from such that
Proof of Theorem 2. Consider the set . Observe that , , for all , with probability . Consider now the vector
and evaluate the following inner products
According to assumption (6), with probability ,
Let and . Consider the functional
It is clear that for all by the way the functional is constructed. The functional partitions the ball into two sets: the set defined as in (5) and its complement, . The volume of the set is bounded from above as
Estimate (7) now follows.
Examples of estimates (7) for various parameter settings are shown in Fig. 3. As one can see, in the absence of the pair-wise strictly positive correlation assumption, , the estimate’s behavior, as a function of , is similar to that of (1). However, the presence of moderate pair-wise positive correlation results in significant boosts to the values of .
Note that the proofs of Theorems 1, 2 are constructive. Not only do they provide lower-bound estimates of the probability that two random i.i.d. samples drawn from are linearly separable, but they also present the corresponding separating functionals explicitly as (4) and (8), respectively. The latter functionals are similar to Fisher linear discriminants. Whilst having explicit separation functionals is an obvious advantage from a practical viewpoint, the estimates associated with such functionals do not account for more flexible alternatives. In what follows we present a generalization of the above results that accounts for such a possibility, as well as extends the applicability of the approach to samples from product distributions. The results are provided in Theorem 3.
Consider the linear space , let the cardinality of the set be smaller than . Consider the quotient space . Let be a representation of in , and let the coordinates of , , be independent random variables i.i.d. sampled from a product distribution in a unit cube with variances , . Then for
with probability there is a linear functional separating and .
Proof of Theorem 3. Observe that, in the quotient space , elements of the set
are vectors whose coordinates coincide with that of the quotient representation of . This means that the quotient representation of consists of a single element, . Furthermore, dimension of is . Let and . According to Theorem 2 and Corollary 2 from GorbanTyukin:NN:2017 , for and satisfying
with probability the following inequalities hold:
for all , . This implies that the functional
separates and with probability .
3 AI Knowledge Transfer Framework
In this section we show how Theorems 1, 2 and 3 can be applied to develop a novel one-shot AI knowledge transfer framework. We will focus on the case of knowledge transfer between two AI systems, a teacher AI and a student AI, in which the input-output behaviour of the student AI is evaluated by the teacher AI. In this setting, the assignment of AI roles, i.e. student or teacher, is beyond the scope of this manuscript. The roles are supposed to be pre-determined or otherwise chosen arbitrarily.
3.1 General setup
Consider two AI systems, a student AI, denoted as , and a teacher AI, denoted as . These legacy AI systems process some input signals, produce internal representations of the input, and return some outputs. We further assume that some relevant information about the inputs, internal signals, and outputs of can be combined into a common object, , representing, but not necessarily defining, the state of . The objects are assumed to be elements of .
Over a period of activity, system generates a set of objects . The exact composition of the set may depend on the task at hand. For example, if is an image classifier, we may be interested only in a particular subset of input-output data related to images of a certain known class. The relevant inputs and outputs of corresponding to objects in are then evaluated by the teacher, . If the outputs of differ from those of for the same input, then an error is registered in the system. Objects associated with errors are combined into the set . The procedure gives rise to two disjoint sets:
A diagram schematically representing the process is shown in Fig. 4. The knowledge transfer task is to “teach” so that, with :
does not make such errors,
the existing competencies of on the set of inputs corresponding to internal states are retained, and
the knowledge transfer from to is reversible in the sense that can “unlearn” the new knowledge by modifying just a fraction of its parameters, if required.
Two algorithms for achieving such knowledge transfer are provided below.
3.2 Knowledge Transfer Algorithms
Our first algorithm, Algorithm 1, covers the case when Auxiliary Knowledge Transfer Units, i.e. functional additions to the existing student , are single linear functionals. The second algorithm, Algorithm 2, extends Auxiliary Knowledge Transfer Units to two-layer cascades of linear functionals.
Centering. For the given set , determine the set average, , and generate sets
Regularization. Determine covariance matrices , of the sets and . Let ,
be their corresponding eigenvalues, and
be the eigenvectors of . If some of the , are zero, or if the ratio is too large, project and onto an appropriately chosen set of eigenvectors, :
where is the matrix comprising the significant principal components of .
Whitening. For the centered and regularized dataset , derive its covariance matrix, , and generate whitened sets
Clustering. Pick , , , and partition the set into clusters so that the elements of these clusters are, on average, pairwise positively correlated. That is, there exist such that:
Construction of Auxiliary Knowledge Units. For each cluster , , construct separating linear functionals :
where , are the averages of and , respectively, and is chosen as .
Integration. Integrate the Auxiliary Knowledge Units into the decision-making pathways of . If, for an generated by an input to , any of the , then report accordingly (swap labels, report as an error, etc.).
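To make the steps above concrete, here is a compact sketch of Algorithm 1 for a single cluster. It is our own illustrative reading of the algorithm, not the authors' reference implementation: centering is done on the set of correct states, whitening uses the eigendecomposition of its regularized covariance, and the unit is a Fisher-discriminant-like functional aimed at the whitened mean of the error set. The names (`build_unit`, `lam_reg`) and the toy data are assumptions.

```python
# Sketch of Algorithm 1 with one cluster (illustrative, assumed reading).
import numpy as np

def build_unit(Y, Y_err, eps=0.0, lam_reg=1e-6):
    """Return a linear Knowledge Transfer Unit l(x) = <w, W(x - mu)> - theta.

    Y     : (M, n) array of 'correct' student states
    Y_err : (k, n) array of states on which the student erred
    Steps mirror Algorithm 1: centre on Y, regularize and whiten with the
    covariance of Y, then take a Fisher-discriminant-like functional
    pointing at the whitened mean of the errors."""
    mu = Y.mean(axis=0)
    C = np.cov((Y - mu).T) + lam_reg * np.eye(Y.shape[1])   # regularization step
    vals, vecs = np.linalg.eigh(C)
    W = vecs @ np.diag(vals ** -0.5) @ vecs.T               # whitening matrix C^{-1/2}
    h = W @ (Y_err.mean(axis=0) - mu)                       # whitened error mean
    w = h / np.dot(h, h)                                    # normalized so <w, h> = 1
    theta = 1.0 - eps
    return lambda x: np.dot(w, W @ (x - mu)) - theta        # unit "fires" when >= 0

rng = np.random.default_rng(0)
n = 50
Y = rng.normal(size=(500, n))             # toy 'correct' states
Y_err = rng.normal(size=(10, n)) + 4.0    # toy errors, shifted away from Y
unit = build_unit(Y, Y_err, eps=0.5)
```

On this toy data the unit fires on every error state and on none of the correct states, which is the behavior the integration step relies upon.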
The algorithms comprise two general stages, a pre-processing stage and a knowledge transfer stage. The purpose of the pre-processing stage is to regularize and “sphere” the data. This operation brings the setup close to the one considered in the statements of Theorems 1, 2. The knowledge transfer stage constructs Auxiliary Knowledge Transfer Units in a way that is very similar to the argument presented in the proofs of Theorems 1 and 2. Indeed, if then the term
is close to the identity matrix, and the functionals are good approximations of (8). In this setting, one might expect that the performance of the knowledge transfer stage would also be closely aligned with the corresponding estimates (1), (7).
Note that the regularization step in the pre-processing stage ensures that the matrix is non-singular. Indeed, consider
Denoting and rearranging the sum below as
we obtain that is non-singular as long as the sum is non-singular. The latter property, however, is guaranteed by the regularization step in Algorithm 1.
Auxiliary Knowledge Transfer Units in Step 2.b of Algorithm 1
are derived in accordance with the standard Fisher linear discriminant formalism. This, however, need not be the case, and other methods, such as e.g. support vector machines Vapnik2000 , could be employed for this purpose. It is worth mentioning, however, that support vector machines might be prone to overfitting Han:2014 and their training often involves iterative procedures such as e.g. sequential quadratic minimization Platt:1998 .
Furthermore, instead of the sets , one could use a somewhat more aggressive division: and , respectively.
Depending on the configuration of the samples and , Algorithm 1 may occasionally create knowledge transfer units, , that “filter” errors too aggressively. That is, some may accidentally trigger a non-negative response, , and as a result their corresponding inputs to could be ignored or mishandled. To mitigate this, one can increase the number of clusters and, correspondingly, of knowledge transfer units. This will increase the probability of successful separation and hence alleviate the issue. On the other hand, if increasing the number of knowledge transfer units is not desirable for some reason, then two-functional units could be a feasible remedy. Algorithm 2 presents a procedure for such an improved AI Knowledge Transfer.
Pre-processing. Do as in Step 1 in Algorithm 1
Clustering. Do as in Step 2.a in Algorithm 1
Construction of Auxiliary Knowledge Units.
1: Do as in Step 2.b in Algorithm 1. At the end of this step, first-stage functionals , will be derived.
2: For each set , , evaluate the functionals for all and identify elements such that and (incorrect error assignment). Let be the set containing such elements .
3: If (there is an such that ), then increment the value of : , and return to Step 2.a.
4: If (all sets are empty), then proceed to Step 2.c.
5: For each pair of and with not empty, project the sets and orthogonally onto the hyperplane and form the sets and :
6: Construct a linear functional separating from so that for all and for all .
Integration. Integrate the Auxiliary Knowledge Units into the decision-making pathways of . If, for an generated by an input to , any of the predicates hold true, then report accordingly (swap labels, report as an error, etc.).
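The resulting two-functional cascade can be sketched as follows. This is a hedged toy illustration of the assumed form of the unit, in which the first functional gates the decision and the second acts on the orthogonal projection onto the first functional's level hyperplane; the vectors and thresholds below are made up for illustration.

```python
# Toy sketch of a two-functional cascade unit (Algorithm 2, assumed form).
import math

def cascade_unit(w1, c1, w2, c2):
    """Return a predicate: fires iff l1(x) = <w1, x> - c1 >= 0 AND the second
    functional l2 is non-negative on the orthogonal projection of x onto the
    hyperplane with unit normal w1/||w1||."""
    n1 = math.sqrt(sum(v * v for v in w1))
    u = [v / n1 for v in w1]                        # unit normal of the hyperplane
    def fires(x):
        l1 = sum(a * b for a, b in zip(w1, x)) - c1
        if l1 < 0:
            return False                            # first stage rejects
        s = sum(a * b for a, b in zip(u, x))
        proj = [a - s * b for a, b in zip(x, u)]    # orthogonal projection of x
        return sum(a * b for a, b in zip(w2, proj)) - c2 >= 0
    return fires

unit = cascade_unit([1.0, 0.0], 0.5, [0.0, 1.0], 0.0)
print(unit([1.0, 1.0]), unit([1.0, -1.0]), unit([0.0, 1.0]))  # True False False
```

The second functional only ever sees points that passed the first one, which is what allows the cascade to filter errors less aggressively than a single functional with a tightened threshold.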
In what follows we illustrate the approach as well as the application of the proposed Knowledge Transfer algorithms in a relevant problem of a computer vision system design for pedestrian detection in live video streams.
Let and be two systems developed, e.g., for the purposes of pedestrian detection in live video streams. Technological progress in embedded systems and the availability of platforms such as e.g. the Nvidia Jetson TX2 have made hardware deployment of such AI systems at the edge of computer vision processing pipelines feasible. These AI systems, however, lack the computational power to run state-of-the-art large-scale object detection solutions such as e.g. ResNet ResNet in real time. Here we demonstrate that AI Knowledge Transfer can be successfully employed to compensate for this lack of power. In particular, we suggest that the edge-based system be “taught” by the state-of-the-art teacher in a non-iterative and near-real-time way. Since our building blocks are linear functionals, such learning will not lead to significant computational overheads. At the same time, as we will show later, the proposed AI Knowledge Transfer results in a major boost to the system’s performance in the conditions of the experiment.
4.1 Definition of and and rationale
In our experiments, the teacher AI, , was modeled by a deep Convolutional Network, ResNet 18 ResNet , with circa M trainable parameters. The network was trained on a “teacher” dataset comprised of M non-pedestrian (negative) and K pedestrian (positive) images. The student AI, , was modelled by a linear classifier with HOG features Dalal:2005 and trainable parameters. The values of these parameters were the result of training on a “student” dataset, a sub-sample of the “teacher” dataset comprising K positives and K negatives, respectively. This choice of and systems enabled us to emulate interaction between edge-based AIs and their more powerful counterparts that could be deployed on larger servers or computational clouds.
Moreover, to make the experiment more realistic, we assumed that the internal states of both systems are inaccessible for direct observation. To generate the sets and required in Algorithms 1 and 2, we augmented system with an external generator of HOG features of the same dimension. We assumed, however, that the covariance matrices of the positives and negatives from the “student” dataset are available for the purposes of knowledge transfer. A diagram representing this setup is shown in Figure 5.
A candidate image is evaluated by the two systems simultaneously as well as by a HOG feature generator. The latter generates -dimensional vectors of HOGs and stores these vectors in the set . If the outputs of and do not match, the corresponding feature vector is added to the set .
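This set-building loop can be sketched as below. The callables `student`, `teacher` and `features` are hypothetical stand-ins for the actual HOG-based classifier, ResNet teacher, and HOG feature generator of Figure 5, and the toy invocation at the end uses made-up data.

```python
# Hypothetical sketch of the set-building loop of Fig. 5: every candidate's
# feature vector joins Y; on teacher/student disagreement it also joins Y_err.
# student, teacher and features are stand-ins, not the paper's actual models.
def collect_sets(patches, student, teacher, features):
    Y, Y_err = [], []
    for patch in patches:
        v = features(patch)                   # e.g. a HOG feature vector
        Y.append(v)
        if student(patch) != teacher(patch):  # mismatch registered as an error
            Y_err.append(v)
    return Y, Y_err

# Toy illustration: the "teacher" labels everything 0, the "student"
# disagrees on odd-numbered inputs.
Y, Y_err = collect_sets(range(6), lambda p: p % 2, lambda p: 0, lambda p: [float(p)])
```

The two returned sets are exactly the disjoint sets consumed by Algorithms 1 and 2.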
4.2 Error types
In our setting, the evaluation of an image patch by (ResNet) took sec on an Nvidia K80, which was several orders of magnitude slower than that of (the linear HOG-based classifier). Whilst such behavior was expected, this imposed technical limitations on the process of mitigating errors of Type II. Each frame from our testing video produced K image patches to test. Evaluation of all these candidates by our chosen is computationally prohibitive. To overcome this technical difficulty, we tested only a limited subset of image proposals with regard to this error type. To get a computationally viable number of proposals for false negative testing, we increased the sensitivity of the HOG-based classifier by lowering its detection threshold from to . This way, our linear classifier with the lowered threshold acted as a filter letting through more true positives at the expense of a large number of false positives. In this operational mode, the Knowledge Transfer Units were tasked to separate true positives from negatives in accordance with object labels supplied by .
The approach was tested on two benchmark videos: the LINTHESCHER sequence Ess:2008 created by ETHZ and comprised of 1208 frames, and the NOTTINGHAM video Nottingham containing 435 frames of live footage taken with an action camera. In what follows we will refer to these videos as the ETHZ and NOTTINGHAM videos, respectively. The ETHZ video contains complete images of 8435 pedestrians, whereas the NOTTINGHAM video has 4039 full-figure images of pedestrians.
The performance and application of Algorithms 1, 2 for the NOTTINGHAM and ETHZ videos are summarized in Figs. 6 and 7. Each curve in these figures is produced by varying the value of the decision-making threshold in the HOG-based linear classifier. Red circles in Figure 6 show true positives as a function of false positives for the original linear classifier based on HOG features. The parameters of the classifier were set in accordance with the Fisher linear discriminant formulae. Blue stars correspond to after Algorithm 1 was applied to mitigate errors of Type I in the system. The value of (the number of clusters) in the algorithm was set equal to . Green triangles illustrate the application of Algorithm 2 for the same error type. Here Algorithm 2 was slightly modified so that the resulting Knowledge Transfer Unit had only one functional . This was due to the low number of errors reaching stage two of the algorithm. Black squares correspond to after the application of Algorithm 2 (error Type I) followed by the application of Algorithm 2 to mitigate errors of Type II.
Figure 7 shows the performance of the algorithms for the ETHZ sequence. Red circles show the performance of the original ; green triangles correspond to supplemented with Knowledge Transfer Units derived using Algorithm 2 for errors of Type I. Black squares correspond to the subsequent application of Algorithm 2 dealing with errors of Type II.
In all these cases, supplementing with Knowledge Transfer Units constructed with the help of Algorithms 1, 2 for both error types resulted in a significant boost to performance. Observe that in both cases the application of Algorithm 2 to address errors of Type II led to noticeable increases in the numbers of false positives at the beginning of the curves. Manual inspection of these false positives revealed that these errors are exclusively due to mistakes of