Knowledge Transfer Between Artificial Intelligence Systems

by Ivan Y. Tyukin, et al.
University of Leicester

We consider the fundamental question: how could a legacy "student" Artificial Intelligence (AI) system learn from a legacy "teacher" AI system or a human expert without complete re-training and, most importantly, without requiring significant computational resources? Here "learning" is understood as the ability of one system to mimic the responses of the other and vice versa. We call such learning an Artificial Intelligence knowledge transfer. We show that if the internal variables of the "student" AI system have the structure of an n-dimensional topological vector space and n is sufficiently high then, with probability close to one, the required knowledge transfer can be implemented by simple cascades of linear functionals. In particular, for n sufficiently large, with probability close to one, the "student" system can successfully and non-iteratively learn k ≪ n new examples from the "teacher" (or correct the same number of mistakes) at the cost of two additional inner products. The concept is illustrated with an example of knowledge transfer from a pre-trained convolutional neural network to a simple linear classifier with HOG features.




1 Introduction

Knowledge transfer between Artificial Intelligence systems has been the subject of extensive discussion in the literature for more than two decades Gorban:DAN:1991 , Hinton:NC:1991 , Pratt:ANIP:1992 , Schultz:2000 . The state-of-the-art approach to date is to use, or salvage, parts of the “teacher” AI system in the “student” AI, followed by re-training of the “student” yosinski2014transferable , chen2015net2net . Alternatives to AI salvaging include model compression Bucila:2006 , knowledge distillation Hinton:2015 , and privileged information vapnik2017knowledge . These approaches have demonstrated substantial success in improving the generalization capabilities of AIs as well as in reducing computational overheads SqueezeNet:2016 in cases of knowledge transfer from a larger AI to a smaller one. Regardless of which of the above strategies is followed, however, their implementation often requires either significant resources, including large training sets and the power needed for training, or access to privileged information that may not necessarily be available to end-users. Thus new frameworks and approaches are needed.

In this contribution we provide a new framework for an automated, fast, and non-destructive process of knowledge spreading across AI systems of varying architectures. In this framework, knowledge transfer is accomplished by means of Knowledge Transfer Units comprising mere linear functionals and/or their small cascades. The main mathematical ideas are rooted in measure concentration Gromov:1999 , GAFA:Gromov:2003 , Gibbs1902 , Levi1951 , Gorban:2007 and stochastic separation theorems GorbanTyukin:NN:2017 revealing peculiar properties of random sets in high dimensions. We generalize some of the latter results here and show how these generalizations can be employed to build simple one-shot Knowledge Transfer algorithms between heterogeneous AI systems whose state may be represented by elements of a linear vector space of sufficiently high dimension. Once knowledge has been transferred from one AI to another, the approach also allows the “student” to “unlearn” the new knowledge without the need to store a complete copy of the “student” AI created prior to learning. We expect that the proposed framework may pave the way for a fully functional new phenomenon: a Nursery of AI systems in which AIs quickly learn from each other whilst keeping their pre-existing skills largely intact.

The paper is organized as follows. Section 2 contains mathematical background needed to justify the proposed knowledge transfer algorithms. In Section 3 we present two algorithms for transferring knowledge between a pair of AI systems in which one operates as a teacher and the other functions as a student. Section 4 illustrates the approach with examples, and Section 5 concludes the paper.

2 Mathematical background

Let the set

be an i.i.d. sample from a distribution in . Pick another set

from the same distribution at random. What is the probability that there is a linear functional separating from ?
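
The question above can be probed numerically. The following is a minimal sketch, not the construction used in the theorems below: it assumes an equidistribution in the unit ball, uses NumPy, and tests only one simple candidate functional (the inner product with the mean of the new points). All function names and parameter values are hypothetical and chosen for illustration.

```python
import numpy as np


def sample_unit_ball(rng, count, n):
    """Draw `count` points uniformly from the unit ball in R^n."""
    g = rng.standard_normal((count, n))
    directions = g / np.linalg.norm(g, axis=1, keepdims=True)
    radii = rng.random(count) ** (1.0 / n)  # radius law for a uniform ball
    return directions * radii[:, None]


def mean_functional_separates(background, new_points):
    """True if <x, w> with w = mean(new_points) puts every new point
    strictly above every background point."""
    w = new_points.mean(axis=0)
    return bool((new_points @ w).min() > (background @ w).max())


def estimate_separation_probability(n, background_size, k, trials=100, seed=0):
    """Fraction of random draws in which the mean functional separates the sets."""
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(trials):
        background = sample_unit_ball(rng, background_size, n)
        new_points = sample_unit_ball(rng, k, n)
        hits += mean_functional_separates(background, new_points)
    return hits / trials


if __name__ == "__main__":
    # Separation by this single functional becomes typical as n grows.
    for n in (2, 20, 200):
        print(n, estimate_separation_probability(n, background_size=500, k=3))
```

Because only one candidate functional is tested, the returned frequency is a lower estimate of the probability that some separating functional exists.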

Below we provide three k-tuple separation theorems: for an equidistribution in the unit ball (Theorems 1 and 2) and for a product probability measure with bounded support (Theorem 3). These two special cases cover or, indeed, approximate a broad range of practically relevant situations, including e.g. Gaussian distributions (which reduce asymptotically to an equidistribution in the ball for n large enough) and data vectors in which each attribute is a numerical and independent random variable.

Consider the case when the underlying probability distribution is an equidistribution in the unit ball, and suppose that the two sets are i.i.d. samples from this distribution. We are interested in determining the probability that there exists a linear functional separating them. An estimate of this probability is provided in the following theorem.

Theorem 1

Let and be i.i.d. samples from the equidistribution in . Then


Proof of Theorem 1. Given that elements in the set are independent, the probability that is

Consider an auxiliary set

Vectors belong to the sphere of radius centred at the origin (see Figure 1, (b)).

(a) (b) (c) (d) (e)
Figure 1: Illustration to the proof of Theorem 1. Panel (a) shows , and in the set . Panel (b) shows , , and on the sphere . Panel (c): construction of . Note that . Panel (d) shows simplex formed by orthogonal vectors . Panel (e) illustrates derivation of functionals and .

According to GorTyu:2016 (proof of Proposition 3 and estimate (26)), the probability that for a given all elements of are pair-wise -orthogonal, i.e.


can be estimated from below as:

for . Suppose now that (2) holds true. Let be chosen so that . If this is the case then there exists a set of pair-wise orthogonal vectors

such that (Figure 1, (c))


Finally, consider the set

The set belongs to the sphere of radius , and its elements are vertices of the corresponding -simplex in (Figure 1, (d)).

Consider the functional:

Recall that if

are orthonormal vectors in

then . Hence , and we can conclude that and for all . According to (3), for all . Therefore the functional


satisfies the following condition: and for all . This is illustrated with Figure 1, (e).

The functional partitions the unit ball into the union of two disjoint sets: the spherical cap


and its complement in , . The volume of the cap can be estimated from above as

Hence the probability that for all can be estimated from below as

Therefore, for fixed chosen so that , the probability that can be separated from by the functional can be estimated from below as:

Given that this estimate holds for all feasible values of , statement (1) follows.
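
The first step of the proof rests on a standard concentration fact: in high dimension, almost all of the volume of the unit ball lies in a thin shell near its surface. The sketch below evaluates the probability that k independent, uniformly distributed points all fall in such a shell. The function name and the shell width `eps` are illustrative assumptions; the closed form follows from the fact that the ball of radius 1 - eps occupies a fraction (1 - eps)^n of the unit ball's volume.

```python
def shell_probability(n, k, eps):
    """Probability that k independent points, drawn uniformly from the unit
    ball in R^n, all have norm at least 1 - eps."""
    inner_fraction = (1.0 - eps) ** n  # volume share of the ball of radius 1 - eps
    return (1.0 - inner_fraction) ** k


if __name__ == "__main__":
    # For fixed eps and k the probability approaches one as n grows.
    for n in (10, 100, 1000):
        print(n, shell_probability(n, k=10, eps=0.05))
```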

Figure 2 shows how estimate (1) of the probability behaves, as a function of for fixed and . As one can see from this figure, when exceeds some critical value ( in this specific case), the lower bound estimate (1) of the probability drops. This is not surprising since the bound (1) is a) based on rough, -like, estimates, and b) these estimates are derived for just one class of separating functionals . Furthermore, no prior pre-processing and/or clustering was assumed for the . An alternative estimate that allows us to account for possible clustering in the set is presented in Theorem 2.

Figure 2: Estimate (1) of as a function of for and .
Theorem 2

Let and be i.i.d. samples from the equidistribution in . Let be a subset of elements from such that




Proof of Theorem 2. Consider the set . Observe that , , for all , with probability . Consider now the vector

and evaluate the following inner products

According to assumption (6), with probability ,

and, respectively,

Let and . Consider the functional


It is clear that for all by the way the functional is constructed. The functional partitions the ball into two sets: the set defined as in (5) and its complement, . The volume of the set is bounded from above as


Estimate (7) now follows.

Figure 3: Estimate (7) of as a function of for and . Red stars correspond to , . Blue triangles stand for , , and black circles stand for , .

Examples of estimates (7) for various parameter settings are shown in Fig. 3. As one can see, in the absence of the pair-wise strictly positive correlation assumption, , the estimate’s behavior, as a function of , is similar to that of (1). However, the presence of moderate pair-wise positive correlation results in significant boosts to the values of .

Remark 1

Estimates (1), (7) for the probability that follow from Theorems 1, 2 assume that the underlying probability distribution is an equidistribution in . They can, however, be generalized to equidistributions in ellipsoids and Gaussian distributions (cf. GorTyuRom2016b ).

Note that the proofs of Theorems 1, 2 are constructive. Not only do they provide estimates from below of the probability that two random i.i.d. drawn samples from are linearly separable, but they also present the corresponding separating functionals explicitly as (4) and (8), respectively. The latter functionals are similar to Fisher linear discriminants. Whilst having explicit separating functionals is an obvious advantage from a practical viewpoint, the estimates that are associated with such functionals do not account for more flexible alternatives. In what follows we present a generalization of the above results that accounts for such a possibility and extends the applicability of the approach to samples from product distributions. The results are provided in Theorem 3.

Theorem 3

Consider the linear space , let the cardinality of the set be smaller than . Consider the quotient space . Let be a representation of in , and let the coordinates of ,

be independent random variables i.i.d. sampled from a product distribution in a unit cube with variances

, . Then for

with probability there is a linear functional separating and .

Proof of Theorem 3. Observe that, in the quotient space , elements of the set

are vectors whose coordinates coincide with that of the quotient representation of . This means that the quotient representation of consists of a single element, . Furthermore, dimension of is . Let and . According to Theorem 2 and Corollary 2 from GorbanTyukin:NN:2017 , for and satisfying

with probability the following inequalities hold:

for all , . This implies that the functional

separates and with probability .

3 AI Knowledge Transfer Framework

In this section we show how Theorems 1, 2 and 3 can be applied for developing a novel one-shot AI knowledge transfer framework. We will focus on the case of knowledge transfer between two AI systems, a teacher AI and a student AI, in which the input-output behaviour of the student AI is evaluated by the teacher AI. In this setting, the assignment of AI roles, i.e. student or teacher, is beyond the scope of this manuscript. The roles are supposed to be pre-determined or otherwise chosen arbitrarily.

3.1 General setup

Consider two AI systems, a student AI, denoted as , and a teacher AI, denoted as . These legacy AI systems process some input signals, produce internal representations of the input and return some outputs. We further assume that some relevant information about the input, internal signals, and outputs of can be combined into a common object, , representing, but not necessarily defining, the state of . The objects are assumed to be elements of .

Over a period of activity system generates a set of objects . The exact composition of the set could depend on the task at hand. For example, if is an image classifier, we may be interested only in a particular subset of input-output data related to images of a certain known class. Relevant inputs and outputs of corresponding to objects in are then evaluated by the teacher, . If the outputs differ from those of for the same input then an error is registered in the system. Objects associated with errors are combined into the set . The procedure gives rise to two disjoint sets:


Figure 4: AI Knowledge transfer diagram. produces a set of its state representations, . The representations are labelled by into the set of correct responses, , and the set of errors, . The student system, , is then augmented by an additional “corrector” eliminating these errors.

A diagram schematically representing the process is shown in Fig. 4. The knowledge transfer task is to “teach” so that with

  • does not make such errors

  • existing competencies of on the set of inputs corresponding to internal states are retained, and

  • knowledge transfer from to is reversible in the sense that can “unlearn” new knowledge by modifying just a fraction of its parameters, if required.
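
The three requirements above can be pictured as a thin wrapper around the legacy student. The sketch below is an illustration under stated assumptions rather than the authors' implementation: the student is taken to be a binary classifier, its state is identified with the vector it receives, a corrector is any callable returning a real number, and a non-negative value is read as "this response is an error". The class name `CorrectedStudent` and its methods are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Sequence

State = Sequence[float]


@dataclass
class CorrectedStudent:
    """Legacy binary classifier wrapped with removable linear correctors."""
    student: Callable[[State], int]  # legacy decision rule, never modified
    correctors: List[Callable[[State], float]] = field(default_factory=list)

    def learn(self, functional: Callable[[State], float]) -> None:
        """Add a corrector; the student's own parameters stay untouched."""
        self.correctors.append(functional)

    def unlearn(self) -> None:
        """Reverse the transfer by discarding the correctors only."""
        self.correctors.clear()

    def predict(self, x: State) -> int:
        label = self.student(x)
        # A non-negative corrector response marks the state as an error.
        if any(f(x) >= 0 for f in self.correctors):
            return 1 - label  # swap the binary label
        return label
```

Because the correctors live outside the student, removing them restores the original behaviour exactly, without keeping a copy of the student made before learning.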

Two algorithms for achieving such knowledge transfer are provided below.

3.2 Knowledge Transfer Algorithms

Our first algorithm, Algorithm 1, considers cases when Auxiliary Knowledge Transfer Units, i.e. functional additions to existing student , are single linear functionals. The second algorithm, Algorithm 2, extends Auxiliary Knowledge Transfer Units to two-layer cascades of linear functionals.

  1. Pre-processing

    1. Centering. For the given set , determine the set average, , and generate sets

    2. Regularization. Determine covariance matrices , of the sets and . Let , be their corresponding eigenvalues, and be the eigenvectors of . If some of , are zero or if the ratio is too large, project and onto an appropriately chosen set of eigenvectors, :

      where is the matrix comprising the significant principal components of .

    3. Whitening. For the centered and regularized dataset , derive its covariance matrix, , and generate whitened sets

  2. Knowledge transfer

    1. Clustering. Pick , , , and partition the set into clusters so that elements of these clusters are, on average, pairwise positively correlated. That is there are such that:

    2. Construction of Auxiliary Knowledge Units. For each cluster , , construct separating linear functionals :

      where , are the averages of and , respectively, and is chosen as .

    3. Integration. Integrate Auxiliary Knowledge Units into decision-making pathways of . If, for an generated by an input to , any of then report accordingly (swap labels, report as an error etc.)

Algorithm 1 Single-functional AI Knowledge Transfer
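
The steps of Algorithm 1 can be sketched with NumPy as follows. This is a minimal illustration under several assumptions that are not taken from the text: the eigenvalue cut-off `max_condition`, the plain Lloyd iterations used for clustering, the Fisher-type form of each functional (pooled covariance inverse applied to the difference of means), and the threshold set to the smallest projection within the error cluster. All function names are hypothetical.

```python
import numpy as np


def fit_preprocessing(states, max_condition=1e4):
    """Steps 1.a-1.c: centre, keep significant principal components, whiten."""
    mean = states.mean(axis=0)
    eigvals, eigvecs = np.linalg.eigh(np.cov(states - mean, rowvar=False))
    keep = eigvals > eigvals.max() / max_condition  # drop (near-)zero directions
    transform = eigvecs[:, keep] / np.sqrt(eigvals[keep])
    return lambda X: (np.atleast_2d(X) - mean) @ transform


def kmeans_labels(X, p, rng, iterations=50):
    """Step 2.a: plain Lloyd iterations (any clustering method could be used)."""
    centres = X[rng.choice(len(X), size=p, replace=False)]
    for _ in range(iterations):
        distances = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
        labels = distances.argmin(axis=1)
        for j in range(p):
            if np.any(labels == j):
                centres[j] = X[labels == j].mean(axis=0)
    return labels


def fit_units(correct_w, errors_w, labels):
    """Step 2.b: one Fisher-type functional l(x) = <w, x> - c per error cluster."""
    units = []
    cov_correct = np.cov(correct_w, rowvar=False)
    for j in np.unique(labels):
        cluster = errors_w[labels == j]
        cov_cluster = np.cov(cluster, rowvar=False) if len(cluster) > 1 else 0.0
        w = np.linalg.solve(cov_correct + cov_cluster,
                            cluster.mean(axis=0) - correct_w.mean(axis=0))
        w /= np.linalg.norm(w)
        c = (cluster @ w).min() - 1e-9  # tiny slack against round-off
        units.append((w, c))
    return units


def flags(X_w, units):
    """Step 2.c: a state is flagged if any functional is non-negative on it."""
    return np.any([X_w @ w - c >= 0 for w, c in units], axis=0)
```

In use, the pre-processing map would be fitted once on all recorded states, the units fitted on the whitened correct responses and errors, and `flags` evaluated on every new whitened state; each evaluation costs one inner product per unit.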

The algorithms comprise two general stages: a pre-processing stage and a knowledge transfer stage. The purpose of the pre-processing stage is to regularize and “sphere” the data. This operation brings the setup close to the one considered in the statements of Theorems 1, 2. The knowledge transfer stage constructs Auxiliary Knowledge Transfer Units in a way that is very similar to the argument presented in the proofs of Theorems 1 and 2. Indeed, if then the term

is close to identity matrix, and the functionals

are good approximations of (8). In this setting, one might expect that performance of the knowledge transfer stage would be also closely aligned with the corresponding estimates (1), (7).

Remark 2

Note that the regularization step in the pre-processing stage ensures that the matrix is non-singular. Indeed, consider

Denoting and rearranging the sum below as

we obtain that is non-singular as long as the sum is non-singular. The latter property, however, is guaranteed by the regularization step in Algorithm 1.

Remark 3

Clustering at Step 2.a can be achieved by classical -means algorithms Lloyd:1982 or any other method (see e.g. DudaHart ) that would group elements of into clusters according to spatial proximity.

Remark 4

Auxiliary Knowledge Transfer Units in Step 2.b of Algorithm 1 are derived in accordance with the standard Fisher linear discriminant formalism. This, however, need not be the case, and other methods such as e.g. support vector machines Vapnik2000 could be employed for this purpose. It is worth mentioning, however, that support vector machines might be prone to overfitting Han:2014 and their training often involves iterative procedures such as e.g. sequential minimal optimization Platt:1998 .

Furthermore, instead of the sets , one could use a somewhat more aggressive division: and , respectively.

Depending on the configuration of samples and , Algorithm 1 may occasionally create knowledge transfer units, , that “filter” errors too aggressively. That is, some may accidentally trigger a non-negative response, , and as a result their corresponding inputs to could be ignored or mishandled. To mitigate this, one can increase the number of clusters and, respectively, of knowledge transfer units. This will increase the probability of successful separation and hence alleviate the issue. On the other hand, if increasing the number of knowledge transfer units is not desirable for some reason, then two-functional units could be a feasible remedy. Algorithm 2 presents a procedure for such an improved AI Knowledge Transfer.

  1. Pre-processing. Do as in Step 1 in Algorithm 1

  2. Knowledge Transfer

    1. Clustering. Do as in Step 2.a in Algorithm 1

    2. Construction of Auxiliary Knowledge Units.

      1:Do as in Step 2.b in Algorithm 1. At the end of this step first-stage functionals , will be derived.
      2:For each set , , evaluate the functionals for all and identify elements such that and (incorrect error assignment). Let be the set containing such elements .
      3:If (there is an such that ) then increment the value of : , and return to Step 2.a.
      4:If (all sets are empty) then proceed to Step 2.c.
      5:For each pair of and with not empty, project orthogonally sets and

      onto the hyperplane

      and form the sets and :
      6:Construct a linear functional separating from so that for all and for all .
    3. Integration. Integrate Auxiliary Knowledge Units into decision-making pathways of . If, for an generated by an input to , any of the predicates hold true then report accordingly (swap labels, report as an error etc.).

Algorithm 2 Two-functional AI Knowledge Transfer
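
The second stage of Algorithm 2 can be sketched as follows for a single first-stage functional. The sketch makes assumptions beyond the text: the first-stage direction `w1` has unit norm, the second functional is of Fisher type with a small ridge term (the projected data are degenerate along `w1`), its threshold is the smallest projection among the projected errors, and a state is flagged only when both functionals are non-negative. With this choice full separation of the projected sets is not guaranteed and should be checked on the data. Function names are hypothetical.

```python
import numpy as np


def project_onto_hyperplane(X, w, c):
    """Orthogonal projection of the rows of X onto {x : <w, x> = c}, ||w|| = 1."""
    return X - np.outer(X @ w - c, w)


def fit_second_stage(w1, c1, correct, errors, ridge=1e-6):
    """Return (w2, c2) for the second functional, or None if the first-stage
    functional raises no false alarms on the correct responses."""
    false_alarms = correct[correct @ w1 - c1 >= 0]
    if len(false_alarms) == 0:
        return None
    C = project_onto_hyperplane(false_alarms, w1, c1)
    Y = project_onto_hyperplane(errors, w1, c1)
    # Projected data have no spread along w1, hence the small ridge term.
    scatter = np.cov(np.vstack([C, Y]), rowvar=False) + ridge * np.eye(len(w1))
    w2 = np.linalg.solve(scatter, Y.mean(axis=0) - C.mean(axis=0))
    w2 /= np.linalg.norm(w2)
    c2 = (Y @ w2).min() - 1e-9  # tiny slack against round-off
    return w2, c2


def cascade_flags(X, w1, c1, second_stage):
    """Flag a state only if both functionals of the cascade are non-negative."""
    first = X @ w1 - c1 >= 0
    if second_stage is None:
        return first
    w2, c2 = second_stage
    return first & (project_onto_hyperplane(X, w1, c1) @ w2 - c2 >= 0)
```

The cascade costs two inner products per state, and the second functional only ever changes the outcome for states that the first functional has already flagged.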

In what follows we illustrate the approach as well as the application of the proposed Knowledge Transfer algorithms in a relevant problem of a computer vision system design for pedestrian detection in live video streams.

4 Example

Let and be two systems developed, e.g. for the purposes of pedestrian detection in live video streams. Technological progress in embedded systems and the availability of platforms such as e.g. Nvidia Jetson TX2 have made hardware deployment of such AI systems at the edge of computer vision processing pipelines feasible. These AI systems, however, lack the computational power to run state-of-the-art large scale object detection solutions such as e.g. ResNet ResNet in real time. Here we demonstrate that AI Knowledge Transfer can be successfully employed to compensate for this lack of power. In particular, we suggest that the edge-based system is “taught” by the state-of-the-art teacher in a non-iterative and near-real-time way. Since our building blocks are linear functionals, such learning will not lead to significant computational overheads. At the same time, as we will show later, the proposed AI Knowledge Transfer will result in a major boost to the system’s performance in the conditions of the experiment.

4.1 Definition of and and rationale

In our experiments, the teacher AI, , was modelled by a deep Convolutional Network, ResNet 18 ResNet with circa M trainable parameters. The network was trained on a “teacher” dataset comprising M non-pedestrian (negatives), and K pedestrian (positives) images. The student AI, , was modelled by a linear classifier with HOG features Dalal:2005 and trainable parameters. The values of these parameters were the result of training on a “student” dataset, a sub-sample of the “teacher” dataset comprising K positives and K negatives, respectively. This choice of and systems enabled us to emulate interaction between edge-based AIs and their more powerful counterparts that could be deployed on larger servers or computational clouds.

Moreover, to make the experiment more realistic, we assumed that internal states of both systems are inaccessible for direct observation. To generate sets and required in Algorithms 1 and 2 we augmented system with an external generator of HOG features of the same dimension. We assumed, however, that covariance matrices of positives and negatives from the “student” dataset are available for the purposes of knowledge transfer. A diagram representing this setup is shown in Figure 5.

Figure 5: Knowledge transfer diagram between ResNet and HOG-SVM object detectors

A candidate image is evaluated by the two systems simultaneously as well as by a HOG features generator. The latter generates dimensional vectors of HOGs and stores these vectors in the set . If the outputs of and do not match, the corresponding feature vector is added to the set .

4.2 Error types

In this experiment we consider and address two types of errors: false positives (Type I errors) and false negatives (Type II errors). The error types were determined as follows. An error is deemed a false positive if the student reported the presence of a correctly sized full-figure image of a pedestrian in a given image patch whereas no such object was there. Similarly, an error is deemed a false negative if a pedestrian was present in the given image patch but the student did not report it there.
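
The bookkeeping described above (record a feature vector for every candidate patch, and file it as an error when student and teacher disagree) can be sketched in a few lines. The function and argument names are hypothetical, and the teacher's output is treated as the reference label, which is an assumption of this sketch.

```python
def collect_errors(patches, student, teacher, features):
    """Return (all_states, false_positives, false_negatives) as lists of
    feature vectors, using the teacher's decision as the reference."""
    all_states, false_positives, false_negatives = [], [], []
    for patch in patches:
        x = features(patch)  # e.g. a HOG descriptor of the patch
        all_states.append(x)
        student_says, teacher_says = student(patch), teacher(patch)
        if student_says and not teacher_says:
            false_positives.append(x)  # Type I error
        elif teacher_says and not student_says:
            false_negatives.append(x)  # Type II error
    return all_states, false_positives, false_negatives
```

Each of the two error lists can then be used as the error set for a separate run of the knowledge transfer algorithms.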

In our setting, evaluation of an image patch by (ResNet) took sec on Nvidia K80, which was several orders of magnitude slower than that of (linear HOG-based classifier). Whilst such behavior was expected, this imposed technical limitations on the process of mitigating errors of Type II. Each frame from our testing video produced K image patches to test. Evaluation of all these candidates by our chosen is computationally prohibitive. To overcome this technical difficulty we tested only a limited subset of image proposals with regard to this error type. To get a computationally viable number of proposals for false negative testing, we increased the sensitivity of the HOG-based classifier by lowering its detection threshold from to . This way our linear classifier with lowered threshold acted as a filter letting through more true positives at the expense of a large number of false positives. In this operational mode, Knowledge Transfer Units were tasked to separate true positives from negatives in accordance with object labels supplied by .

4.3 Datasets

The approach was tested on two benchmark videos: the LINTHESCHER sequence Ess:2008 created by ETHZ and comprising 1208 frames, and the NOTTINGHAM video Nottingham containing 435 frames of live footage taken with an action camera. In what follows we will refer to these videos as ETHZ and NOTTINGHAM videos, respectively. The ETHZ video contains complete images of 8435 pedestrians, whereas the NOTTINGHAM video has 4039 full-figure images of pedestrians.

4.4 Results

Performance and application of Algorithms 1, 2 for NOTTINGHAM and ETHZ videos are summarized in Fig. 6 and 7. Each curve in these figures is produced by varying the value of the decision-making threshold in the HOG-based linear classifier. Red circles in Figure 6 show true positives as a function of false positives for the original linear classifier based on HOG features. Parameters of the classifier were set in accordance with Fisher linear discriminant formulae. Blue stars correspond to after Algorithm 1 was applied to mitigate errors of Type I in the system. The value of (number of clusters) in the algorithm was set to be equal to . Green triangles illustrate application of Algorithm 2 for the same error type. Here Algorithm 2 was slightly modified so that the resulting Knowledge Transfer Unit had only one functional . This was due to the low number of errors reaching stage two of the algorithm. Black squares correspond to after application of Algorithm 2 (error Type I) followed by application of Algorithm 2 to mitigate errors of Type II.
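
Curves of this kind can be produced by sweeping the decision threshold over the classifier scores and counting detections at each value. The sketch below assumes that a patch is reported as a pedestrian when its score is at least the threshold; the function name and the toy numbers in the usage example are illustrative only and are not results from the experiments.

```python
def detection_curve(scores, is_pedestrian, thresholds):
    """Return a list of (false_positives, true_positives) pairs,
    one pair per decision threshold."""
    points = []
    for t in thresholds:
        true_pos = sum(1 for s, y in zip(scores, is_pedestrian) if s >= t and y)
        false_pos = sum(1 for s, y in zip(scores, is_pedestrian) if s >= t and not y)
        points.append((false_pos, true_pos))
    return points


if __name__ == "__main__":
    toy_scores = [0.9, 0.8, 0.4, 0.2]
    toy_labels = [True, False, True, False]
    print(detection_curve(toy_scores, toy_labels, thresholds=[0.5, 0.0]))
```

Applying the same sweep to the student with and without its Knowledge Transfer Units gives directly comparable curves.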

Figure 6: True positives as a function of false positives for NOTTINGHAM video.

Figure 7 shows performance of the algorithms for ETHZ sequence. Red circles show performance of the original , green triangles correspond to supplemented with Knowledge Transfer Units derived using Algorithm 2 for errors of Type I. Black squares correspond to subsequent application of Algorithm 2 dealing with errors of Type II.

Figure 7: True positives as a function of false positives for ETHZ video.

In all these cases, supplementing with Knowledge Transfer Units constructed with the help of Algorithms 1, 2 for both error types resulted in a significant boost to performance. Observe that in both cases application of Algorithm 2 to address errors of Type II has led to noticeable increases in the number of false positives in the system at the beginning of the curves. Manual inspection of these false positives revealed that these errors are exclusively due to mistakes of