1 Introduction
Machine teaching (Zhu, 2015, 2013; Zhu et al., 2018) is the problem of constructing a minimal dataset for a target concept such that a student model (i.e., learner) can learn the target concept from this minimal dataset. Recently, machine teaching has proven useful in applications ranging from human-computer interaction (Suh et al., 2016) and crowdsourcing (Singla et al., 2014, 2013) to cyber security (Alfeld et al., 2016, 2017). Beyond these applications, machine teaching also has close connections with curriculum learning (Bengio et al., 2009; Hinton et al., 2015). In traditional machine learning, a teacher usually constructs a batch of training samples and provides it to a student in one shot without further interaction. The student then keeps learning from this batch dataset and tries to learn the target concept. The previous machine teaching paradigm (Zhu, 2013, 2015; Liu et al., 2016) usually focuses on constructing the smallest such dataset and characterizing its size, called the teaching dimension of the student model.
For machine teaching to work effectively in practical scenarios, Liu et al. (2017a) propose an iterative teaching framework that takes into account that the learner usually uses an iterative algorithm (e.g., gradient descent) to update its model. Unlike the traditional machine teaching framework, where the teacher interacts with the student only in one shot, iterative machine teaching allows the teacher to interact with the student in every iteration. It hence shifts the teaching focus from models to algorithms: the objective of teaching is no longer to construct a minimal dataset in one shot but to search for samples so that the student learns the target concept in a minimal number of iterations (i.e., fastest convergence for the student algorithm). Such a minimal number of iterations is called the iterative teaching dimension of the student algorithm. Liu et al. (2017a) mostly consider the simplest iterative case, where the teacher can fully observe the student. This case is interesting in theory but too restrictive in practice.
Human teaching is arguably the most realistic teaching scenario, in which the learner is completely a black box to the teacher. Analogously, the ultimate problem for machine teaching is how to teach a black-box learner. We call this problem black-box machine teaching. Inspired by the fact that the teacher and the student typically represent the same concept but in different ways, we present a step towards black-box machine teaching: cross-space machine teaching, where the teacher i) does not share the same feature representation with the student, and ii) cannot observe the student model. This setting is interesting in the sense that it can both relax the assumptions of iterative machine teaching and improve our understanding of human learning.
Inspired by the real-life fact that a teacher regularly examines the student to learn how well the student has mastered the concept, we propose an active teacher model to address the cross-space teaching problem. The active teacher is allowed to actively query the student with a few (limited) samples every certain number of iterations, and the student can only return the corresponding prediction results to the teacher. For example, if the student uses a linear regression model, it returns to the teacher its prediction $\langle w^t, x \rangle$, where $w^t$ is the student parameter at the $t$-th iteration and $x$ is the representation of the query example in the student’s feature space. Under suitable conditions, we show that the active teacher can always achieve a faster rate of improvement than a random teacher that feeds samples randomly. In other words, the student model guided by the active teacher can provably converge faster than stochastic gradient descent (SGD). Additionally, we discuss extensions of the active teacher to a learner with forgetting behavior and to a learner guided by multiple teachers.
To validate our theoretical findings, we conduct extensive experiments on both synthetic data and real image data. The results show the effectiveness of the active teacher.
2 Related Work
Machine teaching defines a task where we need to find an optimal training set given a learner and a target concept. Zhu (2015) describes a general teaching framework which has close connections to curriculum learning (Bengio et al., 2009) and knowledge distillation (Hinton et al., 2015). Zhu (2013) considers Bayesian learners in the exponential family and formulates machine teaching as an optimization problem over teaching examples that balances the future loss of the learner and the effort of the teacher. Liu et al. (2016) give the teaching dimension of linear learners. Machine teaching has been found useful in cyber security (Mei & Zhu, 2015), human-computer interaction (Meek et al., 2016), and human education (Khan et al., 2011). Johns et al. (2015) extend machine teaching to human-in-the-loop settings. Doliwa et al. (2014), Gao et al. (2015), Zilles et al. (2008), Samei et al. (2014), and Chen et al. (2018) study the machine teaching problem from a theoretical perspective.
Previous machine teaching work usually ignores the fact that a student model is typically optimized by an iterative algorithm (e.g., SGD), and that in practice we care more about how fast a student can learn from the teacher. Liu et al. (2017a) propose the iterative teaching paradigm and an omniscient teaching model where the teacher knows almost everything about the learner and provides training examples based on the learner’s status. Our cross-space teaching serves as a stepping stone towards black-box iterative teaching.
3 Cross-Space Iterative Machine Teaching
The cross-space iterative teaching paradigm differs from standard iterative machine teaching in two major aspects: i) the teacher does not share the feature representation with the student; ii) the teacher cannot observe the student’s current model parameter in each iteration. Specifically, we consider the following teaching settings:
Teacher. The teacher model observes a sample (e.g., image, text, etc.) and represents it as a feature vector $\tilde{x}$ and a label $\tilde{y}$. The teacher knows the model (e.g., loss function) and the optimization algorithm (including the learning rate) of the learner, and the teacher preserves an optimal parameter $v^*$ of this model in its own feature space. For simplicity, the teacher is assumed to know the learning rate of the learner, but this prior is not necessary, as discussed later. We denote the prediction of the teacher as $\langle v^*, \tilde{x} \rangle$. For simplicity, we omit the bias term throughout the paper; it is straightforward to add it back.
Learner. The learner observes the same sample and represents it as a vectorized feature $x$ and a label $y$. The learner uses a linear model $\langle w, x \rangle$, where $w$ is its model parameter, and updates it with SGD (if guided by a passive teacher). We denote the prediction of the student model as $\langle w^t, x \rangle$ in the $t$-th iteration.
Representation. Although the teacher and learner do not share the feature representation, we still assume their representations have an intrinsic relationship. For simplicity, we assume there exists an unknown one-to-one mapping $G$ from the teacher’s feature space to the student’s feature space such that $x = G(\tilde{x})$, where $\tilde{x}$ and $x$ denote the teacher’s and the student’s representations of the same sample. However, the conclusions in this paper are also applicable to injective mappings. Unless specified, we assume that the teacher’s and the learner’s labels coincide by default.
Interaction. In each iteration, the teacher provides a training example to the learner, and the learner updates its model using this example. The teacher cannot directly observe the model parameter of the student. In this paper, the active teacher is allowed to query the learner with a few examples every certain number of iterations. The learner can only return to the teacher its prediction $\langle w^t, x \rangle$ in the regression scenario, or its predicted label $\mathrm{sign}(\langle w^t, x \rangle)$ or confidence score $S(\langle w^t, x \rangle)$ in the classification scenario, where $w^t$ is the student’s model parameter at the $t$-th iteration, $x$ is the student’s representation of the query, and $S(\cdot)$ is some nonlinear function. Note that the teacher and student preserve the same loss function $\ell(\cdot, \cdot)$.
Similar to (Liu et al., 2017a), we consider three ways for the teacher to provide examples to the learner:
Synthesis-based teaching. In this scenario, the space of provided examples is
Combination-based teaching. In this scenario, the space of provided examples is ()
Rescalable pool-based teaching. This scenario further restricts the knowledge pool for samples. The teacher can pick examples from :
We also note that pool-based teaching (without rescalability) is the most restricted teaching scenario and is very close to practical settings.
4 The Active Teaching Algorithm
To address cross-space iterative machine teaching, we propose the active teaching algorithm, which actively queries its student for its prediction output. We first describe the general version of the active teaching algorithm. Then, without loss of generality, we discuss three specific examples: the least square regression (LSR) learner for regression, and the logistic regression (LR) and support vector machine (SVM) learners for classification (Friedman et al., 2001).
4.1 General Algorithm
Inspired by human teaching, we expand the teacher’s capabilities by enabling the teacher to actively query the student. The student returns its predictions to the teacher. Based on the student’s feedback, the teacher estimates the student’s status and determines which example to provide next. The student’s feedback enables the active teacher to teach without directly observing the student’s model.
The active teacher can choose to query the learner with a few samples in each iteration, and the learner will usually report the prediction $F(\langle w, x \rangle)$, where $w$ is the learner’s parameter, $x$ is the query in the learner’s feature space, and $F(\cdot)$ denotes some function of the inner product prediction. For example, we usually have $F(z) = z$ for regression and $F(z) = \mathrm{sign}(z)$ or a sigmoid $F(z) = \frac{1}{1 + \exp(-z)}$ for classification. Based on our assumption that there is an unknown mapping $G$ from the teacher’s features to the student’s features, there also exists a mapping from the model parameters of the teacher to those of the student. These active queries enable the teacher to estimate the student’s corresponding model parameter “in the teacher’s space” and maintain a virtual learner, the teacher’s estimation of the real learner, in its own space. The teacher decides which example to provide based on its current virtual learner model. The ideal virtual learner $v$ has the same prediction output as the real learner: $\langle v, \tilde{x} \rangle = \langle w, x \rangle$, where $x = G(\tilde{x})$. Equivalently, $v = G^\top w$ always holds for the ideal virtual learner, where $G^\top$ is the conjugate mapping of $G$. Note that for the purpose of analysis, we assume that $G$ is a generic linear operator, though our analysis can easily extend to general cases. In fact, one of the most important challenges in active teaching is to recover a virtual student that approximates the real learner as accurately as possible. The estimation error of the teacher may affect the quality of the training examples that the teacher provides to the real learner. Intuitively, if we can recover the virtual learner with appropriate accuracy, then we can still achieve a faster teaching speed than that of passive learning. Fig. 2 shows the pipeline of cross-space teaching.
With full access to the obtained virtual learner in the teacher’s space, the teacher can perform omniscient teaching as in (Liu et al., 2017a). Specifically, the active teacher will optimize the following objective:
(1) 
where $\ell$ is a loss function and $v^t$ is the teacher’s estimation of $G^\top w^t$ (with $w^t$ the student’s parameter and $G^\top$ the conjugate of the feature mapping) after the teacher performs an active query in the $t$-th iteration (i.e., the current model parameter of the virtual learner). $\eta_v$ is the learning rate of the virtual learner. The learning rate of the student model is not necessarily needed. The general teaching algorithm is given in Algorithm 1.
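As a hedged illustration of how the teacher could choose an example from its virtual learner, the sketch below assumes that the omniscient-style objective of Liu et al. (2017a) takes the form $\eta^2 \|g\|^2 - 2\eta \langle v - v^*, g \rangle$, with $g$ the loss gradient at the virtual learner $v$ and $v^*$ the teacher’s optimal parameter. It further assumes a square loss and a small finite pool. The names `teaching_score` and `pick_example` and all numbers are hypothetical and are not taken from the paper.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def grad_square_loss(v, x, y):
    """Gradient of 0.5 * (<v, x> - y)^2 with respect to v."""
    residual = dot(v, x) - y
    return [residual * xi for xi in x]


def teaching_score(v, v_star, x, y, eta):
    """Assumed objective: eta^2 * ||g||^2 - 2 * eta * <v - v_star, g>."""
    g = grad_square_loss(v, x, y)
    diff = [a - b for a, b in zip(v, v_star)]
    return eta ** 2 * dot(g, g) - 2 * eta * dot(diff, g)


def pick_example(v, v_star, pool, eta):
    """Return the pool example with the smallest teaching score."""
    return min(pool, key=lambda ex: teaching_score(v, v_star, ex[0], ex[1], eta))


# Hypothetical toy problem in the teacher's space.
v_star = [1.0, -2.0]                       # teacher's optimal parameter
pool_x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
pool = [(x, dot(v_star, x)) for x in pool_x]  # noiseless labels
v = [0.0, 0.0]                             # current virtual learner
eta = 0.5                                  # assumed learning rate

x_best, y_best = pick_example(v, v_star, pool, eta)
print(x_best, y_best)
```

Under this assumed objective, the score equals the change in squared distance to `v_star` after one gradient step, so the selected example is the one that moves the virtual learner closest to the target.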
In particular, different types of feedback (i.e., the form of the feedback function) from learners contain different amounts of information, resulting in different levels of difficulty in recovering the parameters of the learner’s model. We discuss two general ways to recover the virtual learner for two types of feedback frequently used in practice.
Exact recovery of the virtual learner. We know that the learner returns a prediction in the form of $F(\langle w, x \rangle)$, where $F(\cdot)$ is the feedback function. In general, if $F(\cdot)$ is a one-to-one mapping, we can exactly recover the ideal virtual learner (i.e., $G^\top w$, with $G$ the feature mapping) in the teacher’s space by solving a system of linear equations. In other words, the recovery of the virtual learner can be exact as long as there is no information loss from $\langle w, x \rangle$ to $F(\langle w, x \rangle)$. Specifically, we have $\langle G^\top w, q_j \rangle = F^{-1}(F(\langle w, G(q_j) \rangle))$, where $q_j$ is the $j$-th query for the learner. Because $F(\langle w, G(q_j) \rangle)$ is given by the real learner, we only need to construct $d$ queries ($d$ is the dimension of the teacher space) and require $q_1, \dots, q_d$ to be linearly independent to estimate $G^\top w$. Without numerical error, we can exactly recover $G^\top w$. Since the recovery is exact, the virtual learner equals $G^\top w$. Note that there are cases in which we can achieve exact recovery without $F(\cdot)$ being a one-to-one mapping. For example, the hinge function is not a one-to-one mapping, but we can still achieve exact recovery.
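The exact-recovery argument can be illustrated with a small sketch. It assumes identity feedback (the learner returns the raw inner product), a linear feature mapping given as a hypothetical matrix `G`, and a small pure-Python Gauss-Jordan helper `solve`. None of these names or numbers come from the paper; exact rational arithmetic is used only to sidestep numerical error.

```python
from fractions import Fraction


def matvec(M, v):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(a * b for a, b in zip(row, v)) for row in M]


def solve(A, b):
    """Solve A z = b by Gauss-Jordan elimination in exact arithmetic."""
    n = len(A)
    M = [[Fraction(x) for x in row] + [Fraction(y)] for row, y in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if M[pivot][col] == 0:
            raise ValueError("queries are not linearly independent")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [a - factor * c for a, c in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]


# Hypothetical setup: both spaces are 2-dimensional.
G = [[2, 1], [0, 1]]   # teacher -> student mapping, unknown to the teacher
w = [3, -1]            # hidden student parameter


def student_answer(query_teacher_space):
    """The student sees G(query) and returns <w, G(query)> (identity feedback)."""
    x_student = matvec(G, query_teacher_space)
    return sum(a * b for a, b in zip(w, x_student))


queries = [[1, 0], [1, 1]]                 # d linearly independent queries
answers = [student_answer(q) for q in queries]
v = solve(queries, answers)                # recovered virtual learner
print(v)
```

In this toy instance the recovered `v` coincides with the conjugate mapping applied to `w`, which is what the ideal virtual learner is defined to be; the teacher never reads `G` or `w` directly, only the answers.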
Approximate recovery of the virtual learner. If the feedback function $F(\cdot)$ is not a one-to-one mapping (e.g., $F(z) = \mathrm{sign}(z)$, which provides 1-bit feedback), then generally we may not be able to exactly recover the student’s parameters. Therefore, we have to develop a more intelligent technique (i.e., one with lower sample complexity) to estimate the ideal virtual learner. In this paper, we use active learning (Settles, 2010) to help the teacher better estimate the student’s parameter for the virtual learner. One of the difficulties is that the active learning algorithm obtains the parameters of a model from the predicted labels, on which the norm of the weights has no effect. It becomes ambiguous which set of weights the teacher should choose. Therefore, the active teacher also needs access to the norm of the student’s weights to recover the virtual learner. In the following sections, we develop and analyze our estimation algorithm for the virtual learner based on existing active learning algorithms with guarantees on sample complexity (Balcan et al., 2009; Ailon, 2012; Hanneke, 2007; Schein & Ungar, 2007; Settles, 2010).
4.2 Least Square Regression Learner
For the LSR learner, we use the following model:
(2) 
Because the feedback is the identity function, the LSR learner belongs to the case where the active teacher can exactly recover the ideal virtual learner. When the feature mapping is unitary, the teacher only needs to perform the active exam once. It can be viewed as a “background exam” for the teacher to figure out how well the student has mastered the knowledge at the beginning, after which the teacher can track the dynamics of the student exactly. Otherwise, for a general one-to-one mapping, the teacher needs to query the student in each iteration. Still, the teacher can reuse the same set of queries in all iterations.
4.3 Logistic Regression Learner
For the LR learner, we use the following model (without loss of generality, we consider the binary classification):
(3) 
We discuss two cases separately: (1) the learner returns the probability of each class (i.e., the feedback is a sigmoid function); (2) the learner only returns the predicted label (i.e., the feedback is a sign function).
In the first case, where the feedback is a sigmoid function, we can exactly recover the ideal virtual learner. This case is essentially similar to the LSR learner: we need only one “background exam” if the feature mapping is unitary, and we can reuse the queries in each iteration for a general one-to-one mapping. In the second case, where the feedback is a sign function, we can only approximate the ideal virtual learner with some error. In this case, we use active learning to do the recovery.
4.4 Support Vector Machine Learner
For the SVM learner, we use the following model for the binary classification:
(4) 
Similarly, we have two cases: (1) the learner returns the hinge value of each class; (2) the learner only returns the label (i.e., the feedback is a sign function).
In the first case, where the feedback is a hinge function, we can still recover the ideal virtual learner. Although the hinge function is not a bijective mapping (only half of it is one-to-one), we prove that exact recovery can still be achieved with slightly more query samples. For a unitary feature mapping, we need only one “background exam”, as in the case of the LR learner. Otherwise, we still need to query the student in each iteration. In the second case, where the feedback is a sign function, we can only approximate the ideal virtual learner with some error.
5 Theoretical Results
We define an important notion of being “exponentially teachable” to characterize the teacher’s performance. Given , the loss function and feature mapping , is exponentially teachable (ET) if the number of total samples (teaching samples and query samples) is for a learner to achieve approximation, i.e., . Note that the potential dependence of on the problem dimension is omitted here and is discussed in detail below. We summarize our theoretical results in Table 1. Given a learner that is exponentially teachable by the omniscient teacher, we find that the learner is not exponentially teachable by the active teacher only when the feedback function is not a one-to-one mapping and the teacher uses rescalable pool-based teaching.








5.1 Synthesis-Based Active Teaching
We denote and ( is invertible). We first discuss the teaching algorithm when the teacher is able to exactly recover the student’s parameters. A generic theory for synthesis-based ET is provided as follows. Suppose that the teacher can recover exactly using samples at each iteration. If for any , there exists and such that and
then is ET with samples. Existence of exponential teachability via exact recovery. Different from Liu et al. (2017a), where the condition for synthesis-based exponential teaching depends only on the loss function , the condition for the cross-space teaching setting depends on both the loss function and the feature mapping . The spectral property of is involved due to the difference between the feature spaces, which leads to a mismatch between the parameters of the teacher and the student. It is easy to see that such that the commonly used loss functions, e.g., absolute loss, square loss, hinge loss, and logistic loss, are ET with exact recovery, i.e., . This can be shown by construction. For example, if the , the ET condition will be the same for both the omniscient teacher (Liu et al., 2017a) and the active teacher.
Next we present generic results on the sample complexity required to recover , which is a constant to (i.e., is ET), as follows. If is bijective, then we can exactly recover with samples. If , then we can exactly recover with samples. Lemma 5.1 and 5.1 cover , , or , where denotes the identity mapping and denotes some sigmoid function, e.g., logistic function, hyperbolic tangent, error function, etc. If the student answers the queries via these feedback functions in the exam phase, then we can exactly recover with arbitrary independent data, omitting the numerical error. Also note that the query samples in Lemma 5.1 and 5.1 can be reused in each iteration, thus the query sample complexity is , which is formalized as follows. Suppose that the student answers questions in the query phase via , , or ; then is ET with teaching samples and query samples via exact recovery. Here we emphasize that the number of query samples (i.e., active queries) does not depend on specific tasks. For both regression and classification, as long as the student feedback functions are bijective, Corollary 5.1 holds. The loss function only affects the synthesis or selection of the teaching samples.
In both regression and classification, if the feedback is a sign function, which only provides 1-bit feedback, the inverse of the feedback function no longer exists and the exact recovery of may not be obtained. In such a case, the teacher may only approximate the student’s parameter using active learning. We first present the generic result for ET via approximate recovery as follows. Suppose that the loss function is Lipschitz smooth in a compact domain containing and sample candidates are from a bounded set , where . Further suppose that at the $t$-th iteration, the teacher estimates the student with probability at least using samples. If for any , there exists and such that for , we have
then the student can achieve approximation of with samples with probability at least . If , then is ET. Existence of exponential teachability via approximate recovery. is the number of samples needed for approximately recovering in each iteration. Different from the exact recovery setting, where only depends on the feature dimension, here also depends on how accurately the teacher wants to recover in each iteration ( denotes the estimation error of ). The condition for exponential teachability with approximate recovery is related to both and the approximation level of the student parameters, i.e., the effect of . For example, if the and , the exponential teachability condition will be the same for both omniscient teaching (Liu et al., 2017a) and active teaching with exact recovery.
For , if the student provides for the query , it is unlikely to recover unless we know . This leads to the following assumption. The feedback is 1-bit, i.e., , and the norm of is known to the teacher. Assumption 5.1 is necessary because is scale invariant. We cannot distinguish between and for any only with their signs. The following theorem provides the query sample complexity in this scenario. Suppose that Assumption 5.1 holds. Then, with probability at least , we can recover with query samples. Combining Theorem 5.1 with Theorem 5.1, we have the results for the 1-bit feedback case. Suppose Assumption 5.1 holds. Then is ET with teaching samples and query samples. Tradeoff between teaching samples and query samples. There is a delicate tradeoff between query sample complexity (in the exam phase) and teaching sample complexity. Specifically, with and query samples, we can already achieve the conclusion that converges in rate , which makes the number of teaching samples . We emphasize that this rate is the same as the convergence rate of SGD minimizing strongly convex functions. Note that the teaching algorithm can achieve at least this rate for general convex loss. Compared to the number of teaching samples in Corollary 5.1, although fewer query samples are needed, this setting requires much more effort in teaching. Such a phenomenon is reasonable in practice: if the examination is not accurate, the teacher cannot accurately evaluate the student’s performance, provides the student with less effective samples, and hence has to teach for more iterations.
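The scale invariance behind the norm assumption can be checked with a short sketch. It assumes sign feedback on the inner product and treats `direction` as a hypothetical unit-norm estimate that an active learning routine might return; the vectors and queries are made up for illustration.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def sign(z):
    """1-bit feedback: +1 for non-negative inputs, -1 otherwise."""
    return 1 if z >= 0 else -1


w = [0.6, -0.8]                      # one candidate student parameter
w_scaled = [3.0 * a for a in w]      # same direction, three times the norm
queries = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-2.0, 0.5]]

labels = [sign(dot(w, q)) for q in queries]
labels_scaled = [sign(dot(w_scaled, q)) for q in queries]
same_labels = labels == labels_scaled   # signs cannot tell the two apart

# Hypothetical output of active learning: only a unit-norm direction.
direction = [0.6, -0.8]
known_norm = math.hypot(w_scaled[0], w_scaled[1])   # side information
recovered = [known_norm * d for d in direction]
print(same_labels, recovered)
```

Both parameter vectors produce identical labels on every query, so the labels alone determine at most a direction; multiplying that direction by the known norm is what pins down a single parameter vector.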
We remark that if the feature mapping $G$ is a unitary operator, i.e., $G^\top G = I$, we can show that the teacher needs only one exam. The key insight is that after the first “background exam”, the teacher can replace the subsequent exams by updating the virtual learner via the same dynamics as the real learner. This is formalized as follows. Suppose that $G$ is a unitary operator. If , then . Therefore, with a unitary feature mapping, we only need one exam in the whole teaching procedure. It follows that the query sample complexity in Theorem 5.1 will be reduced to via approximate recovery.
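The one-exam claim for unitary mappings can be illustrated numerically. The sketch below assumes a square-loss SGD learner, a 90-degree rotation as the hypothetical unitary mapping `G`, and a virtual learner initialized by a single exact exam; the lessons and the learning rate are invented for the example.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def matvec(M, v):
    """Matrix-vector product for a list-of-rows matrix."""
    return [dot(row, v) for row in M]


def transpose(M):
    """Transpose of a list-of-rows matrix."""
    return [list(col) for col in zip(*M)]


def sgd_step(param, x, y, eta):
    """One SGD step on the square loss 0.5 * (<param, x> - y)^2."""
    residual = dot(param, x) - y
    return [p - eta * residual * xi for p, xi in zip(param, x)]


G = [[0.0, -1.0], [1.0, 0.0]]     # rotation by 90 degrees, so G^T G = I
w = [1.0, 2.0]                    # real learner (student space)
v = matvec(transpose(G), w)       # virtual learner after one background exam
eta = 0.25

lessons = [([1.0, 0.0], 3.0), ([1.0, 1.0], -1.0), ([0.0, 2.0], 0.5)]
for x_teacher, y in lessons:
    w = sgd_step(w, matvec(G, x_teacher), y, eta)  # student sees G(x)
    v = sgd_step(v, x_teacher, y, eta)             # teacher simulates in its space

gap = [a - b for a, b in zip(matvec(transpose(G), w), v)]
print(gap)
```

Because the rotation preserves inner products, the residuals agree at every step and the simulated virtual learner stays aligned with the real one without any further queries. For a mapping that is not unitary, the same simulation would generally drift.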
5.2 Combination-Based Active Teaching
We discuss how the results for synthesis-based active teaching can be extended to combination-based active teaching. In this scenario, we assume both training and query samples are constructed by linear combinations of samples in . We have the following corollaries for both exact recovery and approximate recovery in the sense of
Note that with the introduced metric, for , we only consider its component in and the components in the null space will be ignored. Therefore, such that , we have for all . Then we have the result via exact recovery as follows. Suppose the learner gives feedback in the query phase by or , and . Then is ET with teaching samples and query samples for exact recovery. The result via approximate recovery holds analogously to synthesis-based active teaching, given as follows. Suppose Assumption 5.1 holds, the student answers questions in the query phase via or and . Then is ET with teaching samples and query samples via approximate recovery.
5.3 Rescaled Pool-Based Active Teaching
In this scenario, the teacher can only pick examples from a fixed sample candidate pool, , for teaching and active query. We still evaluate with the metric defined in (5.2). We first define pool volume to characterize the richness of the pool (Liu et al., 2017a). [Pool Volume] Given the training example pool , the volume of is defined as
Then the result via exact recovery is given as follows. Suppose that the student answers questions in the exam phase via or and . If , there exist and such that for , we have
then is ET with teaching samples and query samples.
For the approximate recovery case, the active learning is no longer able to achieve the desired accuracy for estimating the student’s parameter in the restricted pool scenario. Thus the active teacher may not achieve exponential teaching.
6 Discussions and Extensions
The active teacher need not know the learning rate. To estimate the learning rate, the active teacher should first estimate the student’s initial parameters , and then feed the student one random sample . Once the updated student parameter is estimated by the teacher, the learning rate can be computed by where denotes element-wise division and the sum is over all the dimensions in . The number of samples for estimating will be , where denotes the samples used in estimating the student’s parameter. Even if the learning rate is unknown, the teacher only needs more samples to estimate it. Most importantly, it does not affect exponential teachability.
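One possible reading of this procedure is sketched below. It assumes a square-loss SGD learner, exact parameter estimates before and after a single known teaching sample, and an identity feature mapping; averaging the element-wise ratios over dimensions with a nonzero gradient is an interpretation of the description, not the paper’s exact formula. The function name `estimate_learning_rate` is hypothetical.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def grad_square_loss(v, x, y):
    """Gradient of 0.5 * (<v, x> - y)^2 with respect to v."""
    residual = dot(v, x) - y
    return [residual * xi for xi in x]


def estimate_learning_rate(v_before, v_after, grad):
    """Average the element-wise ratios (v_before - v_after) / grad."""
    ratios = [(a - b) / g for a, b, g in zip(v_before, v_after, grad) if g != 0]
    if not ratios:
        raise ValueError("gradient is zero; feed a different sample")
    return sum(ratios) / len(ratios)


# Hypothetical hidden learner whose learning rate the teacher does not know.
true_eta = 0.125
v0 = [1.0, -1.0]                       # estimated initial parameter
x, y = [2.0, 1.0], 0.0                 # one teaching sample fed to the learner
g = grad_square_loss(v0, x, y)
v1 = [a - true_eta * gi for a, gi in zip(v0, g)]   # estimated updated parameter

eta_hat = estimate_learning_rate(v0, v1, g)
print(eta_hat)
```

With noisy parameter estimates (for instance under approximate recovery), the per-dimension ratios would disagree, and averaging them is one simple way to smooth the estimate.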
Teaching with forgetting. We consider the scenario where the learner may forget some knowledge that the teacher has taught, which is very common in human teaching. We model the forgetting behavior of the learner by adding a deviation to the learned parameter. Specifically, in one iteration, the learner updates its model with , but due to forgetting, its truly learned parameter is where is a random deviation vector. Based on Theorem 5.1, we can show that such a forgetting learner is not ET with a teacher that only knows the learner’s initial parameter and cannot observe the learner across iterations. However, the active teacher can make the forgetting learner ET via the active query strategy. More details and experiments are provided in Appendix D.
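The effect of forgetting on a teacher that never re-examines the learner can be sketched as follows. The sketch assumes that forgetting adds zero-mean Gaussian noise after each square-loss SGD step, that the feature mapping is the identity, and that an exam returns raw inner products with the standard basis; the noise scale, lessons, and function names are all hypothetical choices for illustration.

```python
import random


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def sgd_step(w, x, y, eta):
    """One SGD step on the square loss 0.5 * (<w, x> - y)^2."""
    residual = dot(w, x) - y
    return [wi - eta * residual * xi for wi, xi in zip(w, x)]


def forgetful_step(w, x, y, eta, rng, scale):
    """SGD step followed by a random deviation (assumed forgetting model)."""
    updated = sgd_step(w, x, y, eta)
    return [u + rng.gauss(0.0, scale) for u in updated]


def exam(w_true):
    """Query the learner with standard basis vectors (identity feedback)."""
    d = len(w_true)
    basis = [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]
    return [dot(w_true, e) for e in basis]


rng = random.Random(0)
eta, scale = 0.2, 0.05
w_real = [0.0, 0.0]          # forgetful learner
w_blind = [0.0, 0.0]         # teacher's simulation from the initial parameter only
lessons = [([1.0, 0.0], 1.0), ([0.0, 1.0], -2.0), ([1.0, 1.0], -1.0)] * 5

for x, y in lessons:
    w_real = forgetful_step(w_real, x, y, eta, rng, scale)
    w_blind = sgd_step(w_blind, x, y, eta)

w_active = exam(w_real)      # active teacher re-examines the learner
blind_error = sum((a - b) ** 2 for a, b in zip(w_blind, w_real))
active_error = sum((a - b) ** 2 for a, b in zip(w_active, w_real))
print(blind_error, active_error)
```

The simulation-only estimate accumulates the learner’s random deviations, while a fresh exam resets the teacher’s estimate to the learner’s actual state, which is the intuition behind using active queries for a forgetting learner.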
Teaching by multiple teachers. When multiple teachers sequentially teach a learner, a teacher cannot guide the learner without knowing its current parameter. It is therefore natural for the teacher to actively estimate the learner. Our active teaching can be easily extended to the multiple-teacher scenario.
7 Experiments
General settings. Detailed settings are given in Appendix B. We mainly evaluate practical pool-based teaching (without rescaling) in the experiments. Still, in the exam stage, our active teacher is able to synthesize novel query examples as needed. The active teacher works in a feature space different from the learner’s, while the omniscient teacher (Liu et al., 2017a) can fully observe the learner and works in the same feature space as the learner. The omniscient teacher serves as a baseline (possibly an upper bound) in our experiments. For active learning, we use the algorithms in (Balcan et al., 2009; Schein & Ungar, 2007).
Evaluation. For synthetic data, we use two metrics to evaluate the convergence performance: the objective value and w.r.t. the training set. For real images, we further use accuracy on the testing set for evaluation. We put the experiments of forgetting learner in Appendix D.
7.1 Teaching with Synthetic Data
We use Gaussian-distributed data to evaluate our active teacher model on linear regression and binary linear classification tasks. We study the LSR learner with identity feedback, the LR learner with sigmoid feedback, and the LR learner with sign feedback. For the first two cases, the active teacher can perform a one-time exam (“background exam”) to exactly recover the ideal virtual learner. After recovering the ideal virtual learner, active teaching can match the performance of omniscient teaching. The experimental results in Fig. 3(a) and Fig. 3(b) meet our expectations. In the initial iterations (on the order of the feature dimension), the learner does not update itself. In this stage, the active teacher provides query samples to the learner and recovers a virtual learner based on the feedback on these query samples. After the exact recovery of the virtual learner, the active teacher achieves faster convergence than the random teacher (SGD). In fact, the active teacher and the omniscient teacher should achieve the same convergence speed if numerical errors are omitted.
For the LR learner with sign feedback, the teacher can only approximate the learner with the active learning algorithm. In addition, the active teacher needs to know the norm of the student model. We use the algorithm in (Schein & Ungar, 2007) and recover the virtual learner in each iteration such that becomes small enough. From the results in Fig. 3(c), we can see that, due to the approximation error between the recovered virtual learner and the ideal virtual learner, the active teacher cannot achieve the same performance as the omniscient teacher. However, the convergence of the active teacher is very close to that of the omniscient teacher and is still much faster than SGD. Note that we remove the iterations used for exams to better compare the convergence of different approaches.
Figure 3. Panels: (a) LSR; (b) LR (feedback is sigmoid); (c) LR (feedback is sign).
7.2 Teaching with Real Image Data
We apply the active teacher to teach the LR learner on the MNIST dataset (LeCun et al., 1998) to further evaluate the performance. In this experiment, we perform binary classification on the digits 7 and 9. We use two random projections to obtain two sets of 24-dim features for each image: one for the teacher’s feature space and the other for the student’s feature space. The omniscient teacher uses the student’s space as its own space (i.e., a shared feature space), while the active teacher uses a feature space different from the student’s. For the LR learner with the sign function (i.e., 1-bit feedback), one can observe that the active teacher has comparable performance to the omniscient teacher, even doing better at the beginning. Because we evaluate the teaching performance on real image data, the omniscient teacher is not necessarily an upper bound for all teachers. Still, as the algorithms iterate, the active teacher becomes worse than the omniscient teacher due to its approximation error.
On the right side of Fig. 4, we visualize the images selected by the active teacher, the omniscient teacher, and the random teacher. The active teacher preserves the pattern of images selected by the omniscient teacher: starting from easy examples and gradually shifting to difficult ones, while the images selected by the random teacher show no pattern.
Figure 4. LR learner (feedback is sign); selected images are shown for the active teacher, the omniscient teacher, and the random teacher (SGD).
8 Conclusions and Open Problems
As a step towards the ultimate black-box machine teaching, cross-space teaching greatly relaxes the assumptions of previous teaching scenarios and bridges the gap between iterative machine teaching and the practical world. The active teaching strategy is inspired by realistic human teaching. For machine teaching to be applicable in practice, we need to gradually remove the unrealistic assumptions to obtain more realistic teaching scenarios. The benefits of more realistic machine teaching are twofold. First, it enables us to make better use of existing off-the-shelf pretrained models to teach a new model on new tasks. It is also related to transfer learning (Pan & Yang, 2010). Second, it can improve our understanding of human education and provide more effective teaching strategies for humans.
Rescalable pool-based active teaching with 1-bit feedback. The proposed algorithm may not work in the pool-based teaching setting when the student returns 1-bit feedback. We leave the possibility of achieving exponential teachability in this setting as an open problem.
Relaxation of the conditions on the feature mapping. Current constraints on the feature-mapping operator are still too strong to match more practical scenarios. How to relax these conditions is an important question.
A better alternative to approximate recovery? Is there a tool other than active learning that our teacher could use to recover the virtual learner? For example, 1-bit compressive sensing (Boufounos & Baraniuk, 2008) may help.
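To make the last question concrete, the sketch below shows one simple way a parameter direction can be estimated from sign-only responses. It is an illustration under strong assumptions (i.i.d. standard Gaussian queries, noise-free feedback, a plain linear estimator), not the algorithm of Boufounos & Baraniuk (2008) and not the proposed teacher; `recover_direction` and `w_true` are hypothetical names. Only the direction is estimated, since the sign discards the scale.

```python
import numpy as np


def recover_direction(queries, signs):
    """Estimate the direction of w from 1-bit responses sign(<w, x>).

    For i.i.d. standard Gaussian queries, the average of sign * query is
    aligned with w in expectation, so normalizing it gives a unit-norm
    direction estimate.
    """
    estimate = (signs[:, None] * queries).mean(axis=0)
    return estimate / np.linalg.norm(estimate)


rng = np.random.default_rng(0)
dim, n_queries = 24, 5000

w_true = rng.standard_normal(dim)                # hypothetical learner parameters
queries = rng.standard_normal((n_queries, dim))  # assumed Gaussian query points
signs = np.sign(queries @ w_true)                # noise-free 1-bit feedback

w_hat = recover_direction(queries, signs)
cosine = float(w_hat @ w_true / np.linalg.norm(w_true))  # alignment with w_true
```

The alignment `cosine` approaches 1 as the number of queries grows. Whether such an estimator could replace the active-learning-based recovery in the teaching setting is exactly what remains open.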
Acknowledgements
The project was supported in part by NSF IIS-1218749, NSF Award BCS-1524565, NIH BIGDATA 1R01GM108341, NSF CAREER IIS-1350983, NSF IIS-1639792 EAGER, NSF CNS-1704701, ONR N00014-15-1-2340, Intel ISTC, NVIDIA, and Amazon AWS.
References
 Ailon (2012) Ailon, Nir. An active learning algorithm for ranking from pairwise preferences with an almost optimal query complexity. Journal of Machine Learning Research, 13(Jan):137–164, 2012.
 Alfeld et al. (2016) Alfeld, Scott, Zhu, Xiaojin, and Barford, Paul. Data poisoning attacks against autoregressive models. In AAAI, pp. 1452–1458, 2016.
 Alfeld et al. (2017) Alfeld, Scott, Zhu, Xiaojin, and Barford, Paul. Explicit defense actions against test-set attacks. In AAAI, 2017.
 Balcan et al. (2009) Balcan, Maria-Florina, Beygelzimer, Alina, and Langford, John. Agnostic active learning. Journal of Computer and System Sciences, 75(1):78–89, 2009.
 Bengio et al. (2009) Bengio, Yoshua, Louradour, Jérôme, Collobert, Ronan, and Weston, Jason. Curriculum learning. In ICML, 2009.
 Boufounos & Baraniuk (2008) Boufounos, Petros T and Baraniuk, Richard G. 1-bit compressive sensing. In CISS, 2008.
 Chen et al. (2018) Chen, Yuxin, Aodha, Oisin Mac, Su, Shihan, Perona, Pietro, and Yue, Yisong. Near-optimal machine teaching via explanatory teaching sets. In AISTATS, 2018.
 Doliwa et al. (2014) Doliwa, Thorsten, Fan, Gaojian, Simon, Hans Ulrich, and Zilles, Sandra. Recursive teaching dimension, VC-dimension and sample compression. Journal of Machine Learning Research, 15(1):3107–3131, 2014.
 Friedman et al. (2001) Friedman, Jerome, Hastie, Trevor, and Tibshirani, Robert. The elements of statistical learning, volume 1. Springer Series in Statistics, New York, 2001.
 Gao et al. (2015) Gao, Ziyuan, Simon, Hans Ulrich, and Zilles, Sandra. On the teaching complexity of linear sets. In International Conference on Algorithmic Learning Theory, pp. 102–116. Springer, 2015.
 Hanneke (2007) Hanneke, Steve. A bound on the label complexity of agnostic active learning. In Proceedings of the 24th international conference on Machine learning, pp. 353–360. ACM, 2007.
 Hinton et al. (2015) Hinton, Geoffrey, Vinyals, Oriol, and Dean, Jeff. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
 Johns et al. (2015) Johns, Edward, Mac Aodha, Oisin, and Brostow, Gabriel J. Becoming the expert - interactive multi-class machine teaching. In CVPR, 2015.
 Khan et al. (2011) Khan, Faisal, Mutlu, Bilge, and Zhu, Xiaojin. How do humans teach: On curriculum learning and teaching dimension. In NIPS, 2011.
 LeCun et al. (1998) LeCun, Yann, Bottou, Léon, Bengio, Yoshua, and Haffner, Patrick. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
 Liu et al. (2016) Liu, Ji, Zhu, Xiaojin, and Ohannessian, H Gorune. The teaching dimension of linear learners. In ICML, 2016.
 Liu et al. (2017a) Liu, Weiyang, Dai, Bo, Humayun, Ahmad, Tay, Charlene, Yu, Chen, Smith, Linda B., Rehg, James M., and Song, Le. Iterative machine teaching. In ICML, 2017a.
 Liu et al. (2017b) Liu, Weiyang, Zhang, Yan-Ming, Li, Xingguo, Yu, Zhiding, Dai, Bo, Zhao, Tuo, and Song, Le. Deep hyperspherical learning. In NIPS, 2017b.
 Liu et al. (2018) Liu, Weiyang, Liu, Zhen, Yu, Zhiding, Dai, Bo, Lin, Rongmei, Wang, Yisen, Rehg, James M, and Song, Le. Decoupled networks. In CVPR, 2018.
 Meek et al. (2016) Meek, Christopher, Simard, Patrice, and Zhu, Xiaojin. Analysis of a design pattern for teaching with features and labels. arXiv preprint arXiv:1611.05950, 2016.
 Mei & Zhu (2015) Mei, Shike and Zhu, Xiaojin. Using machine teaching to identify optimal training-set attacks on machine learners. In AAAI, 2015.
 Nemirovski et al. (2009) Nemirovski, Arkadi, Juditsky, Anatoli, Lan, Guanghui, and Shapiro, Alexander. Robust stochastic approximation approach to stochastic programming. SIAM Journal on optimization, 19(4):1574–1609, 2009.
 Pan & Yang (2010) Pan, Sinno Jialin and Yang, Qiang. A survey on transfer learning. IEEE Transactions on knowledge and data engineering, 22(10):1345–1359, 2010.
 Samei et al. (2014) Samei, Rahim, Semukhin, Pavel, Yang, Boting, and Zilles, Sandra. Algebraic methods proving Sauer’s bound for teaching complexity. Theoretical Computer Science, 558:35–50, 2014.
 Schein & Ungar (2007) Schein, Andrew I and Ungar, Lyle H. Active learning for logistic regression: an evaluation. Machine Learning, 68(3):235–265, 2007.
 Settles (2010) Settles, Burr. Active learning literature survey. University of Wisconsin, Madison, 52(5566):11, 2010.
 Singla et al. (2013) Singla, Adish, Bogunovic, Ilija, Bartók, G, Karbasi, A, and Krause, A. On actively teaching the crowd to classify. In NIPS Workshop on Data Driven Education, number EPFL-POSTER-221572, 2013.
 Singla et al. (2014) Singla, Adish, Bogunovic, Ilija, Bartok, Gabor, Karbasi, Amin, and Krause, Andreas. Near-optimally teaching the crowd to classify. In ICML, pp. 154–162, 2014.
 Suh et al. (2016) Suh, Jina, Zhu, Xiaojin, and Amershi, Saleema. The label complexity of mixed-initiative classifier training. In ICML, pp. 2800–2809, 2016.
 Zhu (2013) Zhu, Xiaojin. Machine teaching for Bayesian learners in the exponential family. In NIPS, 2013.
 Zhu (2015) Zhu, Xiaojin. Machine teaching: An inverse problem to machine learning and an approach toward optimal education. In AAAI, 2015.
 Zhu et al. (2018) Zhu, Xiaojin, Singla, Adish, Zilles, Sandra, and Rafferty, Anna N. An overview of machine teaching. arXiv preprint arXiv:1801.05927, 2018.
 Zilles et al. (2008) Zilles, Sandra, Lange, Steffen, Holte, Robert, and Zinkevich, Martin. Teaching dimensions based on cooperative learning. In COLT, 2008.
Appendix A Details of the Proofs
We analyze the sample complexity by separating the teaching procedure into two stages in each iteration, i.e., the active query stage, in which the teacher conducts an examination of the student, and the teaching stage, in which the teacher provides samples to the student.
A.1 Error Decomposition
Recall that there is a mapping from the feature space of the teacher to that of the student, and we have where denotes the conjugate mapping of . We also denote the , since the operator is invertible, and . To involve the inconsistency between the student’s parameters , and the teacher’s estimator , at th iteration into the analysis, we first provide the recursion with error decomposition. For simplicity, we denote . Then, we have the update rule of the student as
where is constructed by the teacher with the estimator . Plugging into the difference, we have
Suppose the loss function is Lipschitz smooth and ,
which implies
We have the error decomposition as follows,