Adversarially robust and explainable model compression with on-device personalization for NLP applications

by   Yao Qiang, et al.
Wayne State University

On-device Deep Neural Networks (DNNs) have recently gained more attention due to the increasing computing power of mobile devices and the number of applications in Computer Vision (CV), Natural Language Processing (NLP), and the Internet of Things (IoT). Unfortunately, the existing efficient convolutional neural network (CNN) architectures designed for CV tasks are not directly applicable to NLP tasks, and the tiny Recurrent Neural Network (RNN) architectures have been designed primarily for IoT applications. In NLP applications, although model compression has seen initial success in on-device text classification, at least three major challenges are yet to be addressed: adversarial robustness, explainability, and personalization. Here we attempt to tackle these challenges by designing a new training scheme for model compression and adversarial robustness, including the optimization of an explainable feature mapping objective, a knowledge distillation objective, and an adversarial robustness objective. The resulting compressed model is personalized using on-device private training data via fine-tuning. We perform extensive experiments to compare our approach with both compact RNN (e.g., FastGRNN) and compressed RNN (e.g., PRADO) architectures in both natural and adversarial NLP test settings.




I Introduction

Mobile Artificial Intelligence (AI) finds application in a wide range of domains including image classification [51], healthcare [24], and speech recognition [29]. Thus, there is a growing interest in migrating AI tasks from the cloud directly to mobile devices. The improved hardware and increased availability of mobile apps are expected to provide a personalized user experience, protect users' privacy, and offer minimum latency. Despite this initial success, the development of on-device NLP applications has lagged behind the above-referenced domains.

Even with the rapid improvements in mobile hardware, the main obstacles to deploying standard DNNs on mobile devices are the high computational and memory requirements during both training and inference. Recently, model compression techniques have been applied to develop compact DNNs for mobile deployment. An array of lightweight CNN architectures, e.g., MobileNetV2 [43] and SqueezeNet [18], has been designed specifically for mobile CV tasks. However, compact RNN architectures, which are needed to learn from on-device sequential/temporal data, are at present mostly developed and applied in lightweight IoT applications, e.g., Human Activity Recognition (HAR) [7].

As such, few compact RNN architectures have been designed, and only a small number of compressed RNN models are available for resource-hungry NLP applications. Among the latter, projection networks, e.g., [40] and [22], have demonstrated initial success in the text classification task. To date, the vast majority of mobile AI applications focus on inference [15], since the more computationally intense on-device training has not been demonstrated to be feasible. In addition to the computational challenge, a trustworthy compressed RNN model requires seriously tackling the following major challenges: (1) information loss in model compression, (2) adversarial robustness [57, 14, 52], and (3) lack of explainability [28] and personalization [30]. Systematically addressing these challenges would significantly promote the wide adoption of compressed RNN models in NLP applications.

In this work, we design a novel training scheme (Figure 1) to address the above challenges by building robustness and explainability into our compressed RNN model during the training process on the cloud, followed by on-device personalization. Thus, our approach avoids post-hoc adversarial training [13] and/or attribution-based explanation [31]. Although we only demonstrate impressive accuracy against several adversarial attacks on text classification tasks, our training scheme is applicable to other RNN-based NLP tasks. Specifically, our major contributions are:

  • We tackle the information loss during the model compression by minimizing the layer-wise feature mapping loss and label distillation loss.

  • We employ an adversarially robust objective function in our new training scheme to encourage the predicted probabilities of the false classes to be equally distributed in both the full and the compressed models.

  • We enable the explainability of both models via a novel aspect-based feature mapping and personalize the compressed model on-device by fine-tuning it with a small amount of private training data.

II Related Work

II-A Adversarial Attacks and Defense

Adversarial attacks, both white-box and black-box [54], generate carefully crafted adversarial examples to fool DNNs into making wrong predictions. In the former, the attacker has complete access to the target model (e.g., model architecture, parameters, etc.) and generates adversarial samples exploiting the guidance of the model's gradients, e.g., [48, 13]. In the latter, the attacker is only allowed to query the target classifier without any access to the target model [26, 50]. We select three strong attack methods [42, 12, 41] covering both black-box and white-box attacks, as described in the Experiment Setup section.

Recently, an array of defense techniques has been proposed to counter these adversarial attacks. As an alternative to the traditional natural training, which is vulnerable to carefully crafted attacks, adversarial training is a technique that allows DNNs to correctly classify both clean and adversarial examples [13, 19, 58]. However, most existing adversarial training approaches are based on one specific type of adversarial attack, leading to a compromised generalization of defense capability on adversarial examples from other attacks [45]. Besides, the high computing cost in generating strong adversarial examples makes the standard adversarial training computationally prohibitive [44], especially on large-scale NLP datasets.

To tackle these problems, efficient approaches have been proposed that improve the adversarial robustness of DNNs by learning discriminative features through minimizing new loss functions [32, 34]. For example, [5] proposed a new loss objective called Complement Objective Training (COT) that achieves good robustness against single-step adversarial attacks. COT maximizes the likelihood of the ground-truth classes while neutralizing the probabilities of the complementary (false) classes using two loss objectives. However, these two loss objectives lack a coordination mechanism to work together efficiently. To reconcile the competition between them, a new approach called Guided Complement Entropy (GCE) was recently proposed [6] for CV applications. Specifically, GCE adds a “guided” term to maintain the balance between the true class and the false classes, which helps improve adversarial robustness. Without loss of generality, we employ GCE to extract adversarially robust features from large-scale NLP datasets and demonstrate the improved adversarial robustness of our models against various adversarial attacks. To our knowledge, we are among the first to leverage a new loss function to enable adversarially robust model compression for NLP applications.
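As a concrete illustration, a GCE-style loss for one sample can be sketched as follows. This is a minimal reading of the description above, not the exact formulation in [6]; the guided factor `p_true ** alpha`, the value of `alpha`, and the epsilon constants are illustrative assumptions:

```python
import numpy as np

def gce_loss(probs, label, alpha=0.2):
    """Guided-complement-entropy-style loss (sketch).

    probs : predicted class probabilities for one sample (sums to 1)
    label : index of the ground-truth class
    The 'guided' factor probs[label]**alpha scales the (negative)
    entropy of the normalized false-class probabilities, so training
    both raises the true-class probability and flattens the rest.
    """
    probs = np.asarray(probs, dtype=float)
    p_true = probs[label]
    p_false = np.delete(probs, label)
    p_hat = p_false / (1.0 - p_true + 1e-12)           # renormalize false classes
    neg_entropy = np.sum(p_hat * np.log(p_hat + 1e-12))
    return float(p_true ** alpha * neg_entropy)

# Uniform false classes yield a lower (better) loss than a peaked one.
flat = gce_loss([0.7, 0.1, 0.1, 0.1], 0)
peaked = gce_loss([0.7, 0.28, 0.01, 0.01], 0)
```

Minimizing this value both raises the true-class probability (which scales the negative-entropy term) and flattens the false-class distribution, matching the balance the guided term is meant to maintain.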

II-B Compressed RNN Models

Several RNN model compression techniques achieve compactness, e.g., designing tailor-made lightweight architectures [43] or generating compressed models via automatic Neural Architecture Search [49]. Quantization [38] and/or pruning [16] reduce the model size at the expense of reduced prediction accuracy. Knowledge distillation [17] improves the performance of lightweight models by transferring knowledge from a large teacher network to a small lightweight student network.

In text classification, [40] proposed a new architecture that jointly trains a full neural network and a simpler “projection” network, which leverages random projections to transform inputs or intermediate representations into bits. The “projection” network encodes lightweight and efficient computing operations in bit space to reduce the memory footprint. Since then, more advanced projection networks have been proposed to achieve better performance. For example, the Self-Governing Neural Network (SGNN) [39] learns compact projection vectors with locality-sensitive hashing, which obviates the need for pre-trained word embeddings with huge numbers of parameters in complex networks. [22] proposed a novel projection attention neural network named PRADO that combines trainable projections with attention and convolutions. Their compressed 200-kilobyte model has achieved good performance on multiple text classification tasks.

However, most of the above approaches overlook the important issue of adversarial robustness in NLP. Recent literature has focused on the trade-off between adversarial robustness and neural network compression with applications in CV. [57] investigated the extent to which adversarial samples are transferable between uncompressed and compressed DNNs. [14] proposed a novel Adversarially Trained Model Compression (ATMC) framework that obtains a remarkably favorable trade-off among model size, accuracy, and robustness. [52] proposed a framework of concurrent adversarial training and weight pruning that enables model compression while preserving adversarial robustness. Different from all the above state-of-the-art solutions, we tailor model compression techniques to NLP to enable adversarially robust and personalized on-device applications.

Fig. 1: Our new training scheme for model compression. The full model is pre-trained on the cloud. The compressed model is trained on the cloud with our new training scheme using public datasets to ensure adversarial robustness and explainability, followed by fine tuning and deploying on mobile devices using on-device private data.

II-C Compact RNN Models

RNN models have achieved significant success in learning complex patterns for temporal/sequential data (e.g., sensor signal, natural language). Beyond the classical LSTM and GRU architectures, more sophisticated RNNs with skip connections and residual blocks [3] and those combined with CNNs [36] have been recently developed to allow the RNN to go ‘deeper’ and give better performance. Despite the state-of-the-art prediction accuracy, these heavyweight RNN models are resource-hungry and not suitable for on-device deployment.

Recently, tiny RNNs, i.e., RNNs with small parameter sizes (e.g., 10K or less), have received increasing attention due to their high application potential for on-device deployment in IoT environments. [25] proposed FastRNN/FastGRNN by adding residual connections and gating to standard RNNs, which outperformed LSTM and GRU in prediction accuracy with fewer parameters (10K versus 30K). AntisymmetricRNN is designed based on ordinary differential equations (ODEs); this model can achieve comparable performance to both LSTM and GRU with far fewer parameters (10K). iRNN [21] is a similar ODE-based model, designed to facilitate RNN training with identity gradients. With only 7.8K parameters, it achieves better performance than GRU and comparable performance to LSTM. Despite the impressive performance in IoT applications, thus far only FastGRNN [25] has demonstrated a pilot NLP application, with a much larger model size (250K) compared to the lightweight IoT applications (10K).

III Method

III-A Training Scheme

Figure 1 illustrates our new training scheme, which is composed of a new compression technique to minimize information loss, aspect-based feature mapping to ensure explainability, and on-device fine-tuning to enable personalization. Specifically, we design feature mapping and label distillation layers to minimize information loss during model compression. Additionally, the aspect-based feature mapping method enables the explainability of the model by minimizing an interpretable loss during training. Finally, the compressed model is fine-tuned with different values of the temperature hyper-parameter in the label distillation layer, leveraging on-device training data for personalization.

We denote an input text as x = {x_1, ..., x_n}, where i is the word index and n is the number of words. The first layer in most models designed for NLP tasks applies an embedding layer with trainable parameters W_e ∈ R^(V×d) to map each word to a fixed-length d-dimensional vector, where V denotes the vocabulary size. The embedded word vectors are then processed by the remaining layers. To retain sufficient embedding information, most models use a large vocabulary size (ranging from hundreds of thousands to millions) and a high embedding dimension d (e.g., 100 or higher), leading to a huge number of parameters in W_e. As there is a minimum required vocabulary size to achieve a specific performance [8], we set the embedding layer to a low dimension (e.g., 5, 10, and 20) in the compressed model to decrease the number of parameters in W_e. To minimize the embedding information loss, we design a feature mapping method that allows the compressed model to learn embedding information from the pre-trained full model by minimizing the difference between the two embedded features. Specifically, we first employ an autoencoder to compress the high-dimensional embedding feature f_e^F of the full model into a low-dimensional embedding feature Enc(f_e^F). We then minimize the distance between Enc(f_e^F) and the compressed model's embedding feature f_e^C during training, ensuring f_e^C ≈ Enc(f_e^F), where F and C indicate the full and compressed models, respectively.
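The embedding feature mapping can be sketched in a few lines of numpy. The shapes, the linear autoencoder, and the random features are illustrative stand-ins for the trained models:

```python
import numpy as np

rng = np.random.default_rng(0)
n_words, d_full, d_small = 8, 100, 10

# Embedded features from the pre-trained full model (n_words x d_full).
f_full = rng.normal(size=(n_words, d_full))

# A linear autoencoder compresses the full embeddings to d_small dims.
W_enc = rng.normal(scale=0.1, size=(d_full, d_small))
W_dec = rng.normal(scale=0.1, size=(d_small, d_full))
encoded = f_full @ W_enc                 # target for the compressed model
reconstructed = encoded @ W_dec

# Embedded features produced by the compressed model's own small layer.
f_small = rng.normal(size=(n_words, d_small))

# Feature-mapping loss: pull the compressed features toward the encoded
# full-model features; autoencoder loss: keep the encoding faithful.
loss_map = np.mean((f_small - encoded) ** 2)
loss_ae = np.mean((reconstructed - f_full) ** 2)
```

During training both losses would be minimized jointly, so the compressed model's low-dimensional embeddings track a faithful compression of the full model's embeddings.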

As shown in Figure 1, following the embedding layer is a recurrent neural network layer, e.g., Long Short-Term Memory (LSTM). The number of parameters in an LSTM layer can be calculated as 4 × (h × (h + i) + h) [23], where h is the size of the latent feature and i is the input size. The latent feature size can therefore greatly affect the number of parameters. As such, we set a small size for the LSTM hidden layer, without using attention, short-cut connections, or other sophisticated additions, to reduce the number of parameters in the compressed model. We employ another autoencoder to encode the high-dimensional latent features f_h^F into low-dimensional features, and then minimize the distance to ensure f_h^C ≈ Enc(f_h^F). Moreover, we design an aspect-based feature mapping in this layer to enable the explainability of the model, which is explained in more detail in the next section.
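As a sanity check on the parameter counts, the standard LSTM formula 4 × (h × (h + i) + h), with hidden size h and input size i, can be computed directly:

```python
def lstm_param_count(input_size: int, hidden_size: int) -> int:
    """Parameters of one LSTM layer: four gates, each with input weights
    (hidden x input), recurrent weights (hidden x hidden), and a bias
    vector (hidden)."""
    gates = 4
    return gates * (hidden_size * (hidden_size + input_size) + hidden_size)

# A layer with 100-dim inputs and 100 hidden units:
full = lstm_param_count(100, 100)   # 80400 parameters
# A compressed layer with 10-dim inputs and 10 hidden units:
small = lstm_param_count(10, 10)    # 840 parameters
```

Shrinking both dimensions by 10x cuts the layer's parameter count by roughly 100x, which is why the latent feature size dominates the compressed model's budget.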

The third layer in our training scheme exploits label distillation, which enforces the compressed model to mimic the prediction behavior of the full model by training the former with the more informative soft labels generated by the latter [17]. Furthermore, [27] demonstrated that a good manipulation of the temperature can push the softmax scores of in- and out-of-distribution (OOD) images further apart, making OOD images more distinguishable. Inspired by this, we fine-tune the compressed model on each user's device with personalized training data and different values of the temperature hyper-parameter T to achieve on-device personalization.
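The label distillation step can be sketched with a temperature-scaled softmax. The KL direction, the example logits, and the temperature value here are illustrative assumptions:

```python
import numpy as np

def softmax_T(logits, T=1.0):
    """Softmax with temperature T; larger T gives softer distributions."""
    z = np.asarray(logits, dtype=float) / T
    z -= z.max()                      # numerical stability
    e = np.exp(z)
    return e / e.sum()

def distill_kl(teacher_logits, student_logits, T=2.0):
    """KL(teacher || student) on temperature-softened distributions."""
    p = softmax_T(teacher_logits, T)
    q = softmax_T(student_logits, T)
    return float(np.sum(p * (np.log(p) - np.log(q))))

teacher = [4.0, 1.0, 0.5]   # logits from the full (teacher) model
student = [3.0, 1.5, 0.5]   # logits from the compressed (student) model
loss = distill_kl(teacher, student, T=2.0)   # small, non-negative
```

Raising T softens both distributions, giving the student more signal about the relative ordering of the false classes; per-user tuning of T is what the personalization step above exploits.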

III-B Explainable Feature Mapping

Fig. 2: Explainable feature mapping.

Our aspect-based feature mapping (Figure 2) enables the explainability of the model compression by leveraging ubiquitous aspect information, which is an additional input feature of our models. Aspect-based explainable AI models have been developed to solve problems in collaborative filtering [56, 33] and sentiment analysis [10]. For example, food, price, service, and ambience are aspects in the Restaurant domain used to explain recommendations and the polarities of user reviews, and genres are used as aspects in the Movie domain. Thus, we first use one-hot encoding to represent a particular aspect domain, then embed these aspect one-hot vectors into an aspect-guided latent space with the same embedding dimension as the embedding layer, denoted as a_j (j = 1, ..., m), where j is the index and m is the number of aspects. Then, we map the latent features f from the LSTM layer into the same space as these aspect features, and obtain the aspect attention weights α = (α_1, ..., α_m) through a self-attention mechanism, where α_j is the attention weight for the j-th aspect. In this way, we derive a new aspect-based latent feature by combining the aspect features with the aspect attention weights: f_a = Σ_j α_j a_j. Then, f_a is passed to fully connected layers. Our goal here is to minimize the distance between the general feature f and the aspect-based feature f_a to ensure similarity between them (i.e., f ≈ f_a). Thus, our approach makes the model explanation intrinsic and can really answer “how” the compressed model generates its predictions, supervised by the novel interpretable loss objective. Furthermore, we employ a novel metric (hit-ratio) as a quantitative evaluation of our model's intrinsic explainability.

III-C Training Objective

We select a different loss function for each component in our training scheme according to its training objective. We employ the GCE loss [6] instead of the traditional CrossEntropy (CE), not only to improve the adversarial robustness of the full model and the compressed model but also to maximize the transfer of robustness to the compressed model. The text classification task loss is thus L_task = GCE(y, ŷ). We minimize the Mean Squared Error (MSE) between the (encoded) high-dimensional features from the full model and the low-dimensional features from the compressed model in the two feature mapping layers, aiming to assimilate them: L_emb = MSE(f_e^C, Enc(f_e^F)) and L_lat = MSE(f_h^C, Enc(f_h^F)), respectively. Additionally, we minimize the Kullback–Leibler (KL) divergence, denoted L_KD = KL(p^F || p^C), between the output label probability distributions in the label distillation layer. For the two autoencoders used in our training scheme, we minimize the MSE between the inputs and the reconstructed outputs to derive a better feature mapping: L_AE = MSE(f_e^F, f̃_e^F) + MSE(f_h^F, f̃_h^F), where f̃ denotes the reconstructed outputs. We also use MSE as the model interpretable loss, L_int = MSE(f, f_a), to derive the explainable feature mapping.

The general training objective combines these components: L = L_task + λ_1 L_emb + λ_2 L_lat + λ_3 L_KD + λ_4 L_AE + λ_5 L_int, where λ_1, ..., λ_5 are tuning parameters that weight the relative importance of the different loss objectives. With that, the compressed model is duly regularized by the pre-trained full model. We note that the general loss objective is sufficiently flexible that each component can be retained or dropped according to real-world need.
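The flexibility of the overall objective can be captured in a small helper where any component is dropped by zeroing its weight; the weight names and default values are illustrative:

```python
def total_loss(l_task, l_emb, l_lat, l_kd, l_ae, l_int,
               w_emb=1.0, w_lat=1.0, w_kd=1.0, w_ae=0.5, w_int=0.5):
    """Weighted sum of the six loss components; any regularization
    term can be dropped by setting its weight to zero."""
    return (l_task + w_emb * l_emb + w_lat * l_lat
            + w_kd * l_kd + w_ae * l_ae + w_int * l_int)
```

For example, training without the explainability term amounts to calling `total_loss(..., w_int=0.0)`, leaving the remaining objectives untouched.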

IV Experiment Setup

Our experiments are conducted with several text classification tasks, such as sentiment analysis (Amazon and Yelp), news categorization (AG’s News), and topic classification (Yahoo). In particular, we want to answer the following questions through the experiments: (1) Can the compressed model achieve a competitive performance with the cloud-based model through our new training scheme? (2) Does the compressed model ensure adversarial robustness against strong adversarial attacks? (3) Is the compressed model explainable for on-device deployment?

IV-A Datasets

Statistics of the text classification datasets used in our experiments are listed in the Appendix. We conduct our explainable feature mapping experiments on the Rest-2014 dataset from SemEval 2014 [37], which contains reviews from the restaurant domain together with aspect features such as food, price, service, and ambience. We select Senti140 from the LEAF project [2] to evaluate the performance of our on-cloud model compression and on-device personalization. Specifically, we randomly select 5,000 different users from the Senti140 dataset, including 18,400 positive samples and 16,200 negative samples, to train the on-cloud model and compress the on-device model. We also select the user devices with the most abundant training samples to fine-tune the on-device models using additional training data, further improving performance.

Dataset    Loss | Conventional Training Scheme                   | New Training Scheme
                | Clean      PWWS       Gradient   Replaceone    | Clean      PWWS       Gradient   Replaceone
Amazon     CE   | 58.8/45.2  30.5/25.4  49.2/34.8  37.6/31.0     | 59.0/45.3  33.1/24.9  50.1/39.8  39.4/35.0
           GCE  | 58.0/44.1  32.1/24.8  50.1/37.7  37.9/29.4     | 59.8/45.7  32.4/26.1  58.7/44.6  49.0/38.4
Yelp       CE   | 61.0/45.9  36.4/30.1  43.7/30.1  42.4/29.4     | 62.7/47.7  36.6/29.3  56.4/46.8  43.8/30.4
           GCE  | 59.8/45.4  34.2/30.6  56.8/40.1  43.7/30.1     | 61.6/46.7  35.4/28.1  62.5/48.0  45.3/32.2
Yahoo      CE   | 71.5/65.2  40.6/35.3  56.6/51.1  43.8/42.7     | 72.2/65.9  42.1/38.9  58.1/53.6  45.8/43.9
           GCE  | 71.2/65.0  39.6/36.5  61.5/57.3  45.7/43.4     | 71.5/65.1  43.7/40.4  66.6/64.6  49.6/48.3
AG's News  CE   | 89.5/67.0  52.4/46.9  88.2/66.2  68.2/51.4     | 91.0/68.3  55.4/47.1  89.6/69.1  70.3/55.3
           GCE  | 90.5/67.9  56.7/48.9  89.3/68.3  74.4/55.8     | 91.1/68.4  58.2/49.0  90.2/67.4  79.6/62.2
TABLE I: Comparison of performance and adversarial robustness of the compressed models using the conventional and new training schemes on four benchmark datasets. The conventional scheme is the traditional training process of RNN networks, which we used to train our full model. Clean indicates natural test samples and reports Acc/F1; PWWS, Gradient, and Replaceone denote three types of adversarial attacks and report AdvAcc/AdvF1. The best performance in each comparison is bold-faced in the original table.

IV-B Adversarial Attacks

Among the array of strong adversarial attacks [54], most target the pre-trained embedding matrix, e.g., [11] using GloVe [35]. Others, e.g., [53], use additional meta-data such as SentiWordNet [1]. Both kinds of methods iteratively search through the embedding matrix and require a long time to find effective perturbations, representing a substantial obstacle to attacking in real time. Furthermore, their performance on transferred attacks decreases dramatically compared to that on the original architecture and data, demonstrating poor generalization ability [54]. Therefore, we select two strong one-off attacks, Replaceone [12] and Gradient [42], which are more efficient and introduce minimal alterations, making them more suitable for real-world applications. We also select another strong attack, probability weighted word saliency (PWWS) [41], which makes the perturbations not only imperceptible but also preserves maximal semantic similarity. We used the same settings as in [12, 41] for our experiments.
Replaceone. [12] applies several different black-box scoring functions and word transformation methods to generate adversarial samples with minimum word changes to evade the classifiers. We use Replaceone, an efficient yet effective scoring function, to find the most important words and then swap two adjacent letters to generate new words in the adversarial samples.
Gradient. [42] proposed a white-box attack that uses gradients to identify salient words in the original samples and then modify these words to generate adversarial samples.
PWWS. [41] proposed a greedy algorithm based white-box attack using a new word replacement order determined by both the word saliency and the classification probability to generate adversarial samples with lexical/grammatical correctness and semantic similarity.
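The adjacent-letter swap used by Replaceone can be sketched as follows; the choice of swap position is illustrative, and the attack's scoring function for picking the target word is omitted:

```python
import random

def swap_adjacent(word: str, pos: int) -> str:
    """Swap the letters at positions pos and pos+1, if valid."""
    if len(word) < 2 or not (0 <= pos < len(word) - 1):
        return word
    chars = list(word)
    chars[pos], chars[pos + 1] = chars[pos + 1], chars[pos]
    return "".join(chars)

def perturb(sentence: str, target_index: int, rng: random.Random) -> str:
    """Perturb one (e.g., highest-scoring) word in the sentence."""
    words = sentence.split()
    w = words[target_index]
    if len(w) >= 2:
        words[target_index] = swap_adjacent(w, rng.randrange(len(w) - 1))
    return " ".join(words)

print(swap_adjacent("movie", 1))   # prints "mvoie"
```

Such a one-character swap typically maps the word to an out-of-vocabulary token while staying readable to humans, which is why it is an efficient, minimally altering attack.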

IV-C The Compared Methods

Our goal is to develop adversarially robust model compression for NLP tasks that not only achieves competitive performance in classifying both clean and adversarial samples but also satisfies on-device resource constraints. Although our compressed model is intended for on-device deployment, to demonstrate its competitiveness we also compare with several widely used on-cloud models designed to exploit the full extent of cloud resources, i.e., bag-of-words TF-IDF, N-grams TF-IDF, char-level CNN, word-level CNN, and LSTM [55]. For on-device models, we compare ours with both compressed and compact models. For the former, we compare with two existing on-device compressed neural network models for NLP applications (i.e., SGNN [39] and PRADO [22]) in terms of test performance on clean samples only, since (1) they are not designed to defend against adversarial attacks and (2) no source code is available. For the latter, we compare with FastGRNN [25], originally designed for IoT applications and recently demonstrating promising performance in NLP applications. We discuss the comparison results in Table II. We provide more implementation details and our hyper-parameter tuning in the Appendix. We will release our code in future proceedings.

IV-D Evaluation Metrics

In our experiments, we not only consider the performance of the compressed models trained in the natural way with clean texts but also their adversarial robustness towards various adversarial attacks. We evaluate the models’ performance in terms of accuracy (Acc) and macro-F1 scores (F1) for text classification tasks. We also test our model robustness on thousands of adversarial examples generated from different adversarial attacks separately. We report the adversarial accuracy (AdvAcc) and adversarial macro-F1 score (AdvF1) to evaluate the adversarial robustness of the models. In addition, we design a novel hit-ratio to evaluate our model explainability.
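For reference, accuracy and macro-F1 as used in these tables can be computed as follows (a pure-Python sketch; a library such as scikit-learn would normally be used):

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions matching the ground truth."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    classes = set(y_true) | set(y_pred)
    f1s = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)
```

AdvAcc and AdvF1 are the same two functions evaluated on adversarial rather than clean test samples; macro averaging weights every class equally regardless of class frequency.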

V Results and Discussion

V-A Performance Comparison

Table I shows the performance and adversarial robustness of the compressed models, trained either in the conventional way or with the proposed new training scheme (ours). Our compressed models achieve an overall better performance on all the evaluation metrics. As our new training scheme exploits feature mapping and label distillation to enable the compressed models to learn information from the pre-trained full models, it is capable of reducing the information loss during model compression, leading to marked improvements in clean sample accuracy and macro-F1 scores. In addition, the compressed models trained with the GCE loss improve adversarial robustness compared to the ones trained with the conventional CE loss in the vast majority of comparisons, highlighting the advantage of adversarially robust model compression.

Type       | Method        | Yelp                        | Amazon                      | Yahoo
           |               | Clean  Gradient  Replaceone | Clean  Gradient  Replaceone | Clean  Gradient  Replaceone
Compressed | Ours          | 61.7   62.5      45.4       | 59.9   58.8      49.0       | 71.5   66.6      49.6
           | PRADO         | 64.7   N/A       N/A        | 61.2   N/A       N/A        | 72.3   N/A       N/A
           | SGNN          | 35.4   N/A       N/A        | 39.1   N/A       N/A        | 36.6   N/A       N/A
Compact    | FastGRNN      | 26.73  20.74     20.68      | 30.20  20.41     19.66      | 28.34  19.74     21.67
Full       | Ours          | 62.4   56.7      45.0       | 60.2   53.5      53.3       | 72.1   64.7      45.8
           | CNN-char      | 62.0   45.7      40.8       | 59.6   47.0      42.1       | 71.2   43.5      44.9
           | CNN-word      | 60.5   44.8      37.6       | 57.6   47.6      41.1       | 71.2   51.9      46.6
           | LSTM          | 58.2   53.2      42.6       | 59.4   55.7      41.3       | 70.8   50.0      45.1
           | BoW TFIDF     | 59.9   N/A       N/A        | 55.3   N/A       N/A        | 71.0   N/A       N/A
           | N-gram TFIDF  | 54.8   N/A       N/A        | 52.4   N/A       N/A        | 68.5   N/A       N/A
  • N/A: not applicable. PRADO and SGNN results are cited from their original papers [22, 39].

TABLE II: Comparison of our compressed models with other compressed, compact, and full models. Clean columns report Acc; Gradient and Replaceone columns report AdvAcc. Our compressed and full models have 200K and 2M parameters, respectively. PRADO has 175K parameters and FastGRNN has more than 250K parameters.

We further compare the performance of our compressed models with the other compressed, compact, and full models in Table II, in both clean and adversarial example settings (wherever applicable). Note that the results of PRADO [22] and SGNN [39] are cited directly from the original papers since no source code is available. Therefore, we only compare the clean sample accuracy with these two compressed RNN models, as they are not designed for mitigating adversarial attacks and hence are not adversarially robust. In Table II, our compressed model achieves a better performance than SGNN; moreover, it maintains a comparable clean sample accuracy to PRADO yet demonstrates impressive adversarial robustness, which is one of the unique features of our model compression technique. We also compared our compressed RNN model with a manually designed compact RNN model named FastGRNN [25]. Specifically, we re-implemented the algorithm based on the package released with the original paper, and we used the same settings of the adversarial attack methods to make a fair comparison in terms of adversarial robustness between our compressed model and the compact FastGRNN. From Table II, we observe a better performance of our compressed model on both clean and adversarial examples than FastGRNN. Both clean and adversarial example accuracy are critical for on-device NLP applications. For example, in cybersecurity, detecting business email compromise (BEC) and email account compromise (EAC) scams has become a rising challenge since such scams contain no malware, only social engineering messages. As such, an on-device NLP app needs to look into emails' text contents to perform adversarial attack detection. Finally, we show that our models outperform a number of competing full models in terms of both clean and adversarial accuracy.

V-B Model Explainability

In the Method section, we described our explainable aspect-based feature mapping as an optional component of the model compression objective function. We calculate the overall attention weight for the j-th aspect by summing its per-token attention weights over all tokens in the text: α̂_j = Σ_t α_{t,j}. Then we identify the predicted aspect as the one with the highest attention weight, ĵ = argmax_j α̂_j, and compare it with the ground-truth aspect j* coded in the one-hot vector; ĵ and j* here are index values. If the two are consistent (ĵ = j*), we consider it a successful hit, demonstrating that the model is explainable in terms of this aspect. Thus, we calculate a hit-ratio for the test set (i.e., the number of successful hits divided by the number of samples). A larger hit-ratio indicates that the compressed model generates predictions based on the pivotal aspects with larger attention weights, demonstrating better explainability.
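The hit-ratio computation can be sketched as follows (array shapes are illustrative):

```python
import numpy as np

def hit_ratio(attn_weights, true_aspects):
    """attn_weights : (n_samples, n_tokens, n_aspects) per-token aspect
    attention weights; true_aspects : ground-truth aspect index per
    sample. A 'hit' means the aspect with the largest summed attention
    matches the ground truth."""
    summed = np.asarray(attn_weights, dtype=float).sum(axis=1)  # sum over tokens
    predicted = summed.argmax(axis=1)                           # highest-weight aspect
    return float(np.mean(predicted == np.asarray(true_aspects)))

# Two samples, two tokens each, two aspects:
attn = [[[0.6, 0.4], [0.7, 0.3]],   # sample 0 attends mostly to aspect 0
        [[0.2, 0.8], [0.1, 0.9]]]   # sample 1 attends mostly to aspect 1
```

With ground-truth aspects [0, 1] this yields a hit-ratio of 1.0; with [0, 0] the second sample misses and the ratio drops to 0.5.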

As shown in Figure 3 and Table II, our compressed model achieves a comparable performance with the full model in both clean and adversarial sample settings. In addition, compared to the conventionally trained compressed models (Figure 3), our model achieves a better performance in terms of both accuracy and explainability (hit-ratio). This improvement is mainly due to the fact that the features of our compressed model are derived from the full model via an interpretable feature mapping guided by the aspects, which gives rise to a higher hit-ratio (+12%) (Figure 3).

Fig. 3: Evaluation of model explainability on Rest-2014.

V-C Model Personalization

As it is hard to obtain a non-trivial number of data samples on users' devices to train a personal model, we propose a two-step strategy to achieve model personalization. We first train a global model on a global dataset collected from 5,000 users. Then we fine-tune it on each user's private training data when deploying it on-device. Additionally, we find it effective to adjust the temperature hyper-parameter T in the label distillation layer when training the global model to achieve better performance on the local data. As shown in Figure 4, each on-device model achieves its best performance with a different value of T. In more detail, on devices such as User 1's, the fine-tuned personal model achieves a better performance than the global model, probably due to the higher quality of the private training data. Otherwise, we use the global model instead of the fine-tuned personal model (e.g., User 3) and obtain improved accuracy (69.6 → 75.1) and F1 score (66.5 → 68.5).

Fig. 4: On-device personalization on Senti140.
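In standard knowledge distillation [17], the main hyper-parameter of a label distillation layer is the softmax temperature. Assuming it plays the same role here (an assumption on our part), temperature-softened teacher targets can be sketched as:

```python
import math

def distillation_targets(teacher_logits, temperature):
    """Soften teacher logits with a temperature: a higher temperature spreads
    probability mass over non-target classes, giving the student model richer
    supervision than hard labels."""
    scaled = [z / temperature for z in teacher_logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]
```
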

V-D Model Size Selection and Ablation Study

In Table III, we investigate the effect of the embedding dimension and the latent feature size on model compression performance. We set both values to 100 for our full model, resulting in a model size of 2 million (2M) parameters, whereas for the three compressed models we set both values to 5, 10, or 20, resulting in model sizes of 100K, 200K, and 400K parameters, respectively. The compressed models with larger embedding and latent dimensions achieve better performance due to lower information loss. In Table IV, we present an ablation study examining the effectiveness of the key components (i.e., the embedding feature mapping layer, the latent feature mapping layer, and the label distillation layer) on three different datasets. The compressed models with all components achieve the best performance in both clean and adversarial sample accuracy and macro-F1 score. The importance of a layer component increases as it gets closer to the output.
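Exact parameter counts depend on the vocabulary size and the feature-mapping layers, but a back-of-the-envelope count for a simplified embedding + single-layer GRU classifier (our simplifying assumption, not the paper's full architecture) illustrates how the embedding dimension and latent feature size drive model size:

```python
def gru_text_classifier_params(vocab_size, embed_dim, hidden_dim, n_classes):
    """Rough parameter count for an embedding table, a single-layer GRU, and
    a linear softmax classifier."""
    embedding = vocab_size * embed_dim
    # GRU: 3 gates, each with input-to-hidden and hidden-to-hidden weights
    # plus a bias vector.
    gru = 3 * (embed_dim * hidden_dim + hidden_dim * hidden_dim + hidden_dim)
    classifier = hidden_dim * n_classes + n_classes
    return embedding + gru + classifier
```

With a 10K-word vocabulary, the embedding table dominates: shrinking both dimensions from 100 to 10 cuts the count by roughly an order of magnitude, which is the regime the 2M-vs-100K/200K/400K comparison operates in.
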

Dataset | Dim | Clean (Acc/F1) | Replaceone (AdvAcc/AdvF1) | Random (AdvAcc/AdvF1) | Gradient (AdvAcc/AdvF1)
Yelp | 5 | 53.7/37.3 | 37.2/26.3 | 52.1/35.8 | 52.0/36.1
Yelp | 10 | 61.6/46.7 | 45.3/32.2 | 58.4/46.6 | 62.5/48.0
Yelp | 20 | 61.8/46.8 | 47.3/31.4 | 61.3/48.6 | 65.9/51.2
Yahoo | 5 | 50.1/40.6 | 35.8/28.8 | 46.8/38.0 | 41.6/33.6
Yahoo | 10 | 71.5/65.1 | 49.6/48.3 | 71.1/65.4 | 66.6/64.6
Yahoo | 20 | 72.2/65.8 | 51.9/49.3 | 71.2/67.3 | 69.0/66.0
AG's News | 5 | 90.3/67.8 | 68.1/50.8 | 89.4/66.9 | 88.1/65.8
AG's News | 10 | 91.1/68.4 | 79.6/62.0 | 92.1/69.0 | 90.2/67.4
AG's News | 20 | 90.7/68.0 | 79.9/62.5 | 89.6/67.1 | 89.6/66.5
TABLE III: Comparison of diverse sizes of compressed models.
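As one simple black-box analogue of the "Random" attack column above, a perturbation can replace a fraction of tokens with random vocabulary words; adversarial accuracy is then clean accuracy measured on the perturbed inputs. The function name and defaults are ours for illustration; the paper's actual attacks follow [12], [42]:

```python
import random

def random_replace_attack(tokens, vocab, ratio=0.2, seed=0):
    """Perturb a token sequence by replacing a fraction of positions
    with random words drawn from the given vocabulary."""
    rng = random.Random(seed)  # seeded for reproducible perturbations
    n_replace = max(1, int(len(tokens) * ratio))
    positions = rng.sample(range(len(tokens)), n_replace)
    perturbed = list(tokens)
    for pos in positions:
        perturbed[pos] = rng.choice(vocab)
    return perturbed
```
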
Dataset | Layers | Clean (Acc/F1) | Replaceone (AdvAcc/AdvF1)
Yelp | w/o Embedding | 60.3/47.3 | 37.4/26.1
Yelp | w/o Latent | 60.7/47.5 | 38.6/28.0
Yelp | w/o Label | 61.0/47.4 | 36.8/27.2
Yelp | All | 61.6/47.7 | 45.3/32.2
Yahoo | w/o Embedding | 70.0/63.4 | 45.2/41.9
Yahoo | w/o Latent | 70.4/64.1 | 45.3/42.0
Yahoo | w/o Label | 70.9/64.4 | 44.8/40.6
Yahoo | All | 71.5/65.1 | 49.6/48.3
AG's News | w/o Embedding | 89.8/66.8 | 68.5/51.4
AG's News | w/o Latent | 90.0/67.2 | 69.6/51.9
AG's News | w/o Label | 90.4/67.9 | 75.7/56.2
AG's News | All | 91.1/68.4 | 79.6/62.0
TABLE IV: Contribution of different layers of our model.

VI Conclusions and Future Work

In this work, we design a new training scheme for model compression that ensures adversarial robustness, explainability, and personalization for NLP applications. Our novel aspect-based feature mapping and label distillation minimize the information loss and maximize the explainability of model compression, and the new training objective ensures the models' adversarial robustness. The performance can be further improved via on-device personalization. Unlike adversarial training, which is computationally prohibitive and limited to known attacks, our new training scheme is sufficiently flexible, effective, and robust for on-device NLP applications. In future work, we will generalize our training scheme to other on-device NLP tasks, such as Question Answering, Reading Comprehension, and Neural Machine Translation. Furthermore, transformer-based models (e.g., BERT [9]) have recently gained a lot of attention in NLP tasks. There are several recent studies on compressing BERT models, e.g., MobileBERT [47], TinyBERT [20], and PKD [46]. These approaches mainly focus on model compression, while none of them address adversarial robustness or explainability. We will extend our training scheme to these widely used transformer-based models.


  • [1] S. Baccianella et al. (2010) Sentiwordnet 3.0: an enhanced lexical resource for sentiment analysis and opinion mining.. In Lrec, Vol. 10, pp. 2200–2204. Cited by: §IV-B.
  • [2] S. Caldas et al. (2018) Leaf: a benchmark for federated settings. arXiv preprint arXiv:1812.01097. Cited by: §IV-A.
  • [3] V. Campos et al. (2017) Skip rnn: learning to skip state updates in recurrent neural networks. arXiv preprint arXiv:1708.06834. Cited by: §II-C.
  • [4] B. Chang, M. Chen, E. Haber, and E. H. Chi (2019) AntisymmetricRNN: a dynamical system view on recurrent neural networks. arXiv preprint arXiv:1902.09689. Cited by: §II-C.
  • [5] H. Chen et al. (2019) Complement objective training. arXiv preprint arXiv:1903.01182. Cited by: §II-A.
  • [6] H. Chen et al. (2019) Improving adversarial robustness via guided complement entropy. In Proceedings of the IEEE International Conference on Computer Vision, pp. 4881–4889. Cited by: §II-A, §III-C.
  • [7] K. Chen et al. (2020) Deep learning for sensor-based human activity recognition: overview, challenges and opportunities. arXiv preprint arXiv:2001.07416. Cited by: §I.
  • [8] W. Chen et al. (2019) How large a vocabulary does text classification need? a variational approach to vocabulary selection. arXiv preprint arXiv:1902.10339. Cited by: §III-A.
  • [9] J. Devlin, M. Chang, K. Lee, and K. Toutanova (2018) Bert: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. Cited by: §VI.
  • [10] H. H. Do, P. Prasad, A. Maag, and A. Alsadoon (2019) Deep learning for aspect-based sentiment analysis: a comparative review. Expert Systems with Applications 118, pp. 272–299. Cited by: §III-B.
  • [11] J. Ebrahimi et al. (2017) Hotflip: white-box adversarial examples for text classification. arXiv preprint arXiv:1712.06751. Cited by: §IV-B.
  • [12] J. Gao, J. Lanchantin, M. L. Soffa, and Y. Qi (2018) Black-box generation of adversarial text sequences to evade deep learning classifiers. In 2018 IEEE Security and Privacy Workshops (SPW), pp. 50–56. Cited by: §II-A, §IV-B.
  • [13] I. J. Goodfellow, J. Shlens, and C. Szegedy (2014) Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572. Cited by: §I, §II-A, §II-A.
  • [14] S. Gui et al. (2019) Model compression with adversarial robustness: a unified optimization framework. In Advances in Neural Information Processing Systems, pp. 1283–1294. Cited by: §I, §II-B.
  • [15] T. Guo (2018) Cloud-based or on-device: an empirical study of mobile deep inference. In 2018 IEEE International Conference on Cloud Engineering (IC2E), pp. 184–190. Cited by: §I.
  • [16] S. Han, H. Mao, and W. J. Dally (2015) Deep compression: compressing deep neural networks with pruning, trained quantization and huffman coding. arXiv preprint arXiv:1510.00149. Cited by: §II-B.
  • [17] G. Hinton, O. Vinyals, and J. Dean (2015) Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531. Cited by: §II-B, §III-A.
  • [18] A. Howard et al. (2019) Searching for mobilenetv3. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1314–1324. Cited by: §I.
  • [19] R. Jia and P. Liang (2017) Adversarial examples for evaluating reading comprehension systems. arXiv preprint arXiv:1707.07328. Cited by: §II-A.
  • [20] X. Jiao et al. (2019) Tinybert: distilling bert for natural language understanding. arXiv preprint arXiv:1909.10351. Cited by: §VI.
  • [21] A. Kag, Z. Zhang, and V. Saligrama (2020) RNNs incrementally evolving on an equilibrium manifold: a panacea for vanishing and exploding gradients?. In International Conference on Learning Representations, Cited by: §II-C.
  • [22] K. Krishnamoorthi et al. (2019) PRADO: projection attention networks for document classification on-device. In Proceedings of the 2019 Conference on EMNLP-IJCNLP, pp. 5013–5024. Cited by: §I, §II-B, §IV-C, item 1, §V-A.
  • [23] O. Kuchaiev and B. Ginsburg (2017) Factorization tricks for lstm networks. arXiv preprint arXiv:1703.10722. Cited by: §III-A.
  • [24] P. Kuhad, A. Yassine, and S. Shimohammadi () Using distance estimation and deep learning to simplify calibration in food calorie measurement. In 2015 IEEE International Conference on CIVEMSA, pp. . Cited by: §I.
  • [25] A. Kusupati et al. (2018) Fastgrnn: a fast, accurate, stable and tiny kilobyte sized gated recurrent neural network. In Advances in Neural Information Processing Systems, pp. 9017–9028. Cited by: §II-C, §IV-C, §V-A.
  • [26] B. Liang et al. (2017) Deep text classification can be fooled. arXiv preprint arXiv:1704.08006. Cited by: §II-A.
  • [27] S. Liang, Y. Li, and R. Srikant (2017) Enhancing the reliability of out-of-distribution image detection in neural networks. arXiv preprint arXiv:1706.02690. Cited by: §III-A.
  • [28] H. Liu, Q. Yin, and W. Y. Wang (2018) Towards explainable nlp: a generative explanation framework for text classification. arXiv preprint arXiv:1811.00196. Cited by: §I.
  • [29] I. McGraw et al. (2016) Personalized speech recognition on mobile devices. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5955–5959. Cited by: §I.
  • [30] L. Melis, C. Song, E. De Cristofaro, and V. Shmatikov (2019) Exploiting unintended feature leakage in collaborative learning. In 2019 IEEE Symposium on Security and Privacy (SP), pp. 691–706. Cited by: §I.
  • [31] P. K. Mudrakarta, A. Taly, M. Sundararajan, and K. Dhamdhere (2018) Did the model understand the question?. arXiv preprint arXiv:1805.05492. Cited by: §I.
  • [32] A. Mustafa, S. Khan, M. Hayat, R. Goecke, J. Shen, and L. Shao (2019) Adversarial defense by restricting the hidden space of deep neural networks. In Proceedings of the IEEE ICCV, pp. 3385–3394. Cited by: §II-A.
  • [33] D. Pan, X. Li, X. Li, and D. Zhu (2020) Explainable recommendation via interpretable feature mapping and evaluation of explainability. arXiv preprint arXiv:2007.06133. Cited by: §III-B.
  • [34] T. Pang et al. (2019) Rethinking softmax cross-entropy loss for adversarial robustness. arXiv preprint arXiv:1905.10626. Cited by: §II-A.
  • [35] J. Pennington, R. Socher, and C. D. Manning (2014) Glove: global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pp. 1532–1543. Cited by: §IV-B.
  • [36] P. Pinheiro and R. Collobert (2014) Recurrent convolutional neural networks for scene labeling. In International conference on machine learning, pp. 82–90. Cited by: §II-C.
  • [37] M. Pontiki et al. SemEval-2014 task 4: aspect based sentiment analysis. In Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014), pp. 27–35. Cited by: §IV-A.
  • [38] M. Rastegari et al. (2016) Xnor-net: imagenet classification using binary convolutional neural networks. In European conference on computer vision, pp. 525–542. Cited by: §II-B.
  • [39] S. Ravi and Z. Kozareva (2018) Self-governing neural networks for on-device short text classification. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 887–893. Cited by: §II-B, §IV-C, §V-A.
  • [40] S. Ravi (2017) Projectionnet: learning efficient on-device deep networks using neural projections. arXiv preprint arXiv:1708.00630. Cited by: §I, §II-B.
  • [41] S. Ren et al. (2019) Generating natural language adversarial examples through probability weighted word saliency. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 1085–1097. Cited by: §II-A, §IV-B.
  • [42] S. Samanta and S. Mehta (2017) Towards crafting text adversarial samples. arXiv preprint arXiv:1707.02812. Cited by: §II-A, §IV-B.
  • [43] M. Sandler et al. (2018) Mobilenetv2: inverted residuals and linear bottlenecks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4510–4520. Cited by: §I, §II-B.
  • [44] A. Shafahi et al. (2019) Adversarial training for free!. In Advances in Neural Information Processing Systems, pp. 3353–3364. Cited by: §II-A.
  • [45] C. Song et al. (2018) Improving the generalization of adversarial training with domain adaptation. arXiv preprint arXiv:1810.00740. Cited by: §II-A.
  • [46] S. Sun et al. (2019) Patient knowledge distillation for bert model compression. arXiv preprint arXiv:1908.09355. Cited by: §VI.
  • [47] Z. Sun, H. Yu, X. Song, R. Liu, Y. Yang, and D. Zhou (2020) Mobilebert: a compact task-agnostic bert for resource-limited devices. arXiv preprint arXiv:2004.02984. Cited by: §VI.
  • [48] C. Szegedy et al. (2013) Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199. Cited by: §II-A.
  • [49] M. Tan et al. (2019) Mnasnet: platform-aware neural architecture search for mobile. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2820–2828. Cited by: §II-B.
  • [50] J. Uesato, B. O’Donoghue, A. v. d. Oord, and P. Kohli (2018) Adversarial risk and the dangers of evaluating against weak attacks. arXiv preprint arXiv:1802.05666. Cited by: §II-A.
  • [51] J. Wu et al. (2016) Quantized convolutional neural networks for mobile devices. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4820–4828. Cited by: §I.
  • [52] S. Ye et al. (2019) Adversarial robustness vs. model compression, or both. In The IEEE International Conference on Computer Vision (ICCV), Vol. 2. Cited by: §I, §II-B.
  • [53] H. Zhang, H. Zhou, N. Miao, and L. Li (2019) Generating fluent adversarial examples for natural languages. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 5564–5569. Cited by: §IV-B.
  • [54] W. E. Zhang, Q. Z. Sheng, A. Alhazmi, and C. Li (2020) Adversarial attacks on deep-learning models in natural language processing: a survey. ACM Transactions on Intelligent Systems and Technology (TIST) 11 (3), pp. 1–41. Cited by: §II-A, §IV-B.
  • [55] X. Zhang, J. Zhao, and Y. LeCun (2015) Character-level convolutional networks for text classification. In Advances in neural information processing systems, pp. 649–657. Cited by: §IV-C.
  • [56] Y. Zhang and X. Chen (2018) Explainable recommendation: a survey and new perspectives. arXiv preprint arXiv:1804.11192. Cited by: §III-B.
  • [57] Y. Zhao, I. Shumailov, R. Mullins, and R. Anderson (2018) To compress or not to compress: understanding the interactions between adversarial attacks and neural network compression. arXiv preprint arXiv:1810.00208. Cited by: §I, §II-B.
  • [58] Y. Zhou et al. (2019) Learning to discriminate perturbations for blocking adversarial attacks in text classification. arXiv preprint arXiv:1909.03084. Cited by: §II-A.