1 Introduction
Machine learning for time series data has rich applications in a variety of domains, ranging from medical diagnosis (e.g., physiological signals such as the electrocardiogram (ECG) (Kampouraki et al., 2008)) and finance/weather forecasting to industrial measurements (e.g., sensors and the Internet of Things (IoT)). One common practical challenge that prevents time series learning tasks from using modern large-scale deep learning models is data scarcity. While many efforts (Fawaz et al., 2018; Ye & Dai, 2018; Kashiparekh et al., 2019) have been made to advance transfer learning and model adaptation for time series classification, a principled approach is lacking, and its performance may not be comparable to conventional statistical learning benchmarks (Langkvist et al., 2014). To bridge this gap, we propose a novel approach, named voice to series (V2S), for time series classification by reprogramming a pretrained acoustic model (AM), such as a spoken-term recognition model. Unlike general time series tasks, modern AMs are trained on massive human voice datasets and are considered a mature technology widely deployed in intelligent electronic devices. The rationale of V2S lies in the fact that voice data can be viewed as univariate temporal signals, and therefore a well-trained AM is likely to be reprogrammable as a powerful feature extractor for solving time series classification tasks. Figure 1 shows a schematic illustration of the proposed V2S framework, including (a) a trainable reprogram layer, (b) a pretrained AM, and (c) a specified label mapping function between source (human voice) and target (time series) labels.
Model reprogramming was first introduced in (Elsayed et al., 2019). The authors show that one can learn a universal input transformation function to reprogram a pretrained ImageNet model (without changing the model weights) for solving MNIST/CIFAR-10 image classification and simple vision-based counting tasks with high accuracy. Reprogramming can be viewed as an efficient approach for transfer learning with limited data, and it has achieved state-of-the-art (SOTA) results on biomedical image classification tasks (Tsai et al., 2020). However, despite the empirical success, little is known about how and why reprogramming can be successful. Different from existing works, this paper aims to address the following three fundamental questions: (i) Can acoustic models be reprogrammed for time series classification? (ii) Can V2S outperform SOTA time-series classification results? (iii) Is there any theoretical justification for why reprogramming works?
Our main contributions in this paper provide affirmative answers to the aforementioned fundamental questions, which are summarized as follows.

We propose V2S, a novel and unified approach to reprogram large-scale pretrained acoustic models for different time series classification tasks. To the best of our knowledge, V2S is the first framework that enables reprogramming for time series tasks.

Tested on a standard UCR time series classification benchmark (Dau et al., 2019), V2S either outperforms or is tied with the best reported results on 20 out of 30 datasets and improves their average accuracy by 1.84%, suggesting that V2S is a principled and effective approach for time series classification.

In Section 4, we develop a theoretical risk analysis to characterize the performance of reprogramming on the target task via the source risk and a representation alignment loss. In Section 5, we also show how our theoretical results can be used to assess the performance of reprogramming. Moreover, we provide interpretation of V2S through auditory neural saliency maps and embedding visualization.
2 Related Works
2.1 Time Series Classification
Learning to classify time series is a standard research topic in machine learning and signal processing. A major research branch uses designed features followed by conventional classifiers, such as digital filter design together with support vector machines (SVM) (Kampouraki et al., 2008; Geurts, 2001), or kernel-based methods (Zhang et al., 2010; Lines et al., 2012). Recently, deep learning models have been utilized for time series (Fawaz et al., 2019) and demonstrated improved performance. The methods range from completely end-to-end classifiers (Zhang et al., 2010; Wang et al., 2017) to mixture models (Hong et al., 2019) that combine feature engineering and deep learning. Notably, feature engineering methods (Wang et al., 2017) can still attain SOTA results on some time series classification tasks, especially when the number of training samples is small.
2.2 Model Reprogramming
Although in the original paper (Elsayed et al., 2019) model reprogramming was phrased as “adversarial” reprogramming, it is not limited to the adversarial setting, nor does it involve any adversarial training. Elsayed et al. (2019)
showed that pretrained ImageNet models can be reprogrammed for classifying other image datasets and for solving vision-based counting tasks.
Tsai et al. (2020) demonstrated the advantage of reprogramming on label-limited data such as biomedical image classification, and used zeroth order optimization (Liu et al., 2020) to enable reprogramming of black-box machine learning systems. Beyond image data, model reprogramming has been used in natural language processing (NLP) (Neekhara et al., 2019; Hambardzumyan et al., 2020), such as machine translation and sentiment classification. Vinod et al. (2020) further showed that NLP models can be reprogrammed for molecule learning tasks in biochemistry. For reprogramming with time series, it remains unclear which source domain and pretrained models should be used. Current research works on reprogramming do not extend to time series because image and text data have fundamentally different feature characteristics than time series. One of our major contributions is the proposal of reprogramming acoustic models pretrained on abundant human voice data for time series classification. To the best of our knowledge, our V2S is the first framework on reprogramming for time series.
2.3 Deep Acoustic Modeling
Recent deep learning models have shown impressive results on predicting label(s) from acoustic information. The central idea is to use a large number of spectral features (e.g., Mel-spectrogram or log-power spectrum) as training inputs to capture the important features. Some of the latent features learned by an AM are interpretable based on physiological auditory experiments (e.g., cortex responses (Kaya & Elhilali, 2017)) or neural saliency methods. Several efforts have been put into the design of large and deep neural networks to extract features from human voice datasets. Among them, residual neural network (ResNet)
(He et al., 2016; Saon et al., 2017) and VGGish (Hershey et al., 2017) models are popular backbones for AM tasks, such as spoken-term recognition (Yang et al., 2021a; de Andrade et al., 2018) and speech enhancement (Xu et al., 2014; Yang et al., 2020). It is worth noting that standard transfer learning via parameter fine-tuning is not ideal for time series tasks with limited data, as acoustic data and time-series data are often quite diverse in scale. Our V2S addresses this issue by learning an input transformation function while fixing the pretrained AM.

Symbol — Meaning
S / T — source/target domain
X_S / X_T — the space of source/target data samples
Y_S / Y_T — the space of source/target data labels
D_S / D_T — source/target data distribution
{x, y} — data sample and one-hot coded label drawn from D
K — number of source labels
f_S — pretrained K-way source classification model
η(·) — softmax function in the neural network, with f_S(·) = η(z(·))
z(·) — logit (pre-softmax) representation, with z(·) ∈ R^K
ℓ(·, ·) — risk function based on classifier f_S
ε_S / ε_T — source/target population risk based on classifier f_S
δ — additive input transformation on target data, parameterized by θ
3 Voice2Series (V2S)
3.1 Mathematical Notation
Table 1 summarizes the major mathematical notations used in this paper for V2S reprogramming. Throughout this paper, we will denote a K-way acoustic classification model pretrained on voice data as a source model, and use the term target data to denote the univariate time-series data to be reprogrammed. The notation Prob(·) is reserved for denoting a probability function. The remaining notations will be introduced when applicable.
3.2 V2S Reprogramming on Data Inputs
Here we formulate the problem of V2S reprogramming on data inputs. Let x_T ∈ R^{d_T} denote a univariate time series input from the target domain with d_T temporal features. Our V2S aims to find a trainable input transformation function H(·) that is universal to all target data inputs, which serves the purpose of reprogramming x_T into the source data space X_S ⊂ R^{d_S}, where d_S ≥ d_T. Specifically, the reprogrammed sample x'_T = H(x_T) is formulated as:

x'_T = Pad(x_T) + M ⊙ θ,   (1)

where Pad(·) is a zero-padding function that outputs a zero-padded time series of dimension d_S. The location of the segment x_T to be placed in Pad(x_T) is a design parameter and we defer the discussion to Section 5.2. The term M ∈ {0, 1}^{d_S} is a binary mask that indicates the location of x_T in its zero-padded input, where the i-th entry of M is 0 if x_T occupies that entry (indicating the entry is non-reprogrammable), and it is 1 otherwise (indicating the entry is not occupied and thus reprogrammable). The operator ⊙ denotes element-wise product. Finally, θ ∈ R^{d_S} is a set of trainable parameters for aligning the source and target domain data distributions. One can consider a more complex function in place of H(·), but in practice we do not observe notable gains when compared to the simple form in (1). In what follows, we will use the term δ = M ⊙ θ to denote the trainable additive input transformation for V2S reprogramming. Moreover, for ease of representation we will omit the padding notation and simply use x_T + δ to denote the reprogrammed target data, by treating the "+" operation as a zero-padded broadcasting function.
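The transform in (1) can be sketched in a few lines. The NumPy version below is illustrative only (function and variable names are ours, not from the paper's released code), with the target segment placed at the beginning of the padded input:

```python
import numpy as np

def reprogram(x_t, theta):
    """Eq. (1): x'_T = Pad(x_T) + M * theta, with the target segment placed
    at the beginning of the padded input (one possible placement)."""
    d_s, d_t = theta.shape[0], x_t.shape[0]
    padded = np.zeros(d_s)
    padded[:d_t] = x_t                 # zero-padding Pad(x_T)
    mask = np.ones(d_s)
    mask[:d_t] = 0.0                   # 0 where x_T sits (non-reprogrammable)
    return padded + mask * theta

x_t = np.array([0.5, -0.2, 0.1])       # d_T = 3 target series
theta = np.full(8, 0.3)                # d_S = 8 trainable parameters
x_prime = reprogram(x_t, theta)        # [0.5, -0.2, 0.1, 0.3, ..., 0.3]
```

Only `theta` is learned; the mask and padding are fixed by the chosen placement.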
3.3 V2S Reprogramming on Acoustic Models (AMs)
We select a pretrained deep acoustic classification model as the source model f_S for model reprogramming. We assume the source model has softmax as the final layer and outputs a nonnegative confidence score (prediction probability) for each source label. With the transformed data inputs described in (1), one can obtain the class prediction of the source model on a reprogrammed target data sample x'_T, denoted by

Prob(y_s | x'_T), for all y_s ∈ Y_S.   (2)

Next, as illustrated in Figure 1, we assign a (many-to-one) label mapping function h to map source labels to target labels. For a target label y_t, its class prediction will be the averaged class predictions over the set of source labels assigned to it. We use the term Prob(y_t | x'_T) to denote the prediction probability of the target task on the associated ground-truth target label y_t. Finally, we learn the optimal parameters θ* for data input reprogramming by optimizing the following objective:

θ* = arg min_θ E_{(x_T, y_t) ∼ D_T} [ −log Prob(y_t | x'_T(θ)) ].   (3)

The optimization will be implemented by minimizing the empirical loss (the V2S loss) evaluated on all target-domain training data pairs {x_T, y_t} for solving θ*.
In practice, we find that many-to-one label mapping can improve the reprogramming accuracy when compared to one-to-one label mapping, similar to the findings in (Tsai et al., 2020). Below we give a concrete example of how many-to-one label mapping is used for V2S reprogramming. Consider the case of reprogramming a spoken-term AM for ECG classification. One can choose to map multiple (but non-overlapping) classes from the source task (e.g., 'yes', 'no', 'up', 'down' among the AM classes) to every class of the target task (e.g., 'Normal' or 'Ischemia' among the ECG classes), leading to a specified mapping function h. Let B(y_t) denote the set of source labels mapping to the target label y_t ∈ Y_T. Then, the class prediction of y_t based on V2S reprogramming is the aggregated prediction over the assigned source labels, which is defined as

Prob(y_t | x'_T) = (1 / |B(y_t)|) Σ_{y_s ∈ B(y_t)} Prob(y_s | x'_T),   (4)

where |B(y_t)| denotes the number of labels in B(y_t). In our implementation we use a random (but non-overlapping) many-to-one mapping between source and target labels. Each target label is assigned the same number of source labels.
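Equation (4) amounts to a per-target-class average of source-class probabilities. A minimal sketch (names are illustrative, not from the released code):

```python
import numpy as np

def target_prob(source_probs, label_sets):
    """Eq. (4): average the source-class probabilities assigned to each
    target label (label_sets[t] lists the source indices mapped to t)."""
    return np.array([np.mean(source_probs[np.array(s)]) for s in label_sets])

probs = np.array([0.4, 0.1, 0.3, 0.2])   # K = 4 source-class probabilities
mapping = [[0, 1], [2, 3]]               # two target labels
target_prob(probs, mapping)              # array([0.25, 0.25])
```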
3.4 V2S Algorithm
Algorithm 1 summarizes the training procedure of our proposed V2S reprogramming algorithm. The algorithm uses the Adam optimizer (Kingma & Ba, 2015) to find the optimal reprogramming parameters θ* that minimize the V2S loss defined in (3), which is evaluated over all target-domain training data. In our implementation of Algorithm 1 we use stochastic optimization with mini-batches.
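The core update of such a procedure can be illustrated on a toy example. The sketch below stands in for the pretrained AM with a frozen linear-softmax model and uses plain gradient descent (both are our simplifications; the paper uses deep attention-based AMs and the Adam optimizer). Only θ is updated; the model weights stay fixed:

```python
import numpy as np

def softmax(u):
    e = np.exp(u - u.max())
    return e / e.sum()

# A frozen toy linear model stands in for the pretrained AM f_S
# (illustrative only; the paper uses deep attention-based models).
rng = np.random.default_rng(0)
K, d_s, d_t = 4, 8, 3
W = rng.normal(size=(K, d_s))            # frozen source-model weights

x_t = rng.normal(size=d_t)               # one target training sample
y_s = 2                                  # a source label mapped to its target label
pad = np.concatenate([x_t, np.zeros(d_s - d_t)])
mask = np.concatenate([np.zeros(d_t), np.ones(d_s - d_t)])
theta = np.zeros(d_s)

lr = 0.05
for _ in range(200):                     # plain gradient descent on theta only
    p = softmax(W @ (pad + mask * theta))      # Eq. (1), then source prediction
    grad_logits = p.copy()
    grad_logits[y_s] -= 1.0              # d(-log p[y_s]) / d logits
    theta -= lr * mask * (W.T @ grad_logits)   # chain rule through Eq. (1)

p_final = softmax(W @ (pad + mask * theta))
```

After training, the probability of the mapped source label on the reprogrammed input increases relative to θ = 0, while W never changes.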
4 Population Risk via Reprogramming
To provide theoretical justification for the effectiveness of V2S, in what follows we establish a formal population risk analysis and prove that, based on V2S, the population risk of the target task is upper bounded by the sum of the source population risk and the Wasserstein-1 distance between the logit representations of the source data and the reprogrammed target data. Our analysis matches the intuition that a high-accuracy (low population risk) source model with a better source-target data alignment (small Wasserstein-1 distance) should exhibit better reprogramming performance. In Section 5.4, we show that our derived population risk bound can be used to assess the reprogramming performance of V2S for different source models and target tasks. We also note that our theoretical analysis is not limited to V2S reprogramming; it applies to generic classification tasks.
Using the mathematical notation summarized in Table 1, the source model is a pretrained K-way neural network classifier f_S(·) = η(z(·)) with a softmax layer η(·) as the final model output. We omit the notation of the model parameters in our analysis because reprogramming does not change the pretrained model parameters. The notation {x, y} is used to describe a data sample x and its one-hot coded label y. We will use the subscripts S/T to denote source/target data when applicable. For the purpose of analysis, given a neural network classifier f, we consider the root mean squared error (RMSE) as the risk function ℓ(·, ·). To put forth our analysis, we make the following assumptions based on the framework of reprogramming:

The source risk is ε_S, that is, E_{(x_S, y_S) ∼ D_S}[ℓ(f_S(x_S), y_S)] = ε_S.

The source-target label space has a specified surjective one-to-one label mapping function h: Y_S → Y_T for every target label y_t ∈ Y_T, such that h(y_s) = y_t for some y_s ∈ Y_S, and h(y_s) ≠ h(y_s') if y_s ≠ y_s'.

Based on reprogramming, the target loss function ℓ_T with an additive input transformation function δ can be represented as ℓ_T(f_S(x_T + δ), y_t) = ℓ(f_S(x_T + δ), y_s), where the target label y_t is induced by the label mapping h (Assumption 2) and the loss is induced by reprogramming the source loss with target data.
The learned input transformation function for reprogramming is denoted by δ*, which is the minimizer of the target population risk ε_T(δ) with the reprogramming loss objective.

Domain-independent drawing of source and target data: let p_S(x_S) and p_T(x_T) denote the probability density functions of the source and target data distributions over X_S and X_T, respectively. The joint probability density function is the product of their marginals, i.e., p(x_S, x_T) = p_S(x_S) · p_T(x_T).
For a given neural network classifier, the following lemma associates the expected RMSE of model predictions on two different domains with the Wasserstein-1 distance between their corresponding probability measures on the logit representations, which will play a key role in characterizing the population risk for reprogramming. The Wasserstein distance is a statistical distance between two probability measures μ_1 and μ_2, and it has been widely used for studying optimal transport problems (Peyré & Cuturi, 2018). Specifically, for any p ≥ 1, the Wasserstein-p distance is defined as

W_p(μ_1, μ_2) = ( inf_{π ∈ Π(μ_1, μ_2)} E_{(u, v) ∼ π} [ ||u − v||^p ] )^{1/p},

where Π(μ_1, μ_2) denotes all joint distributions π that have marginals μ_1 and μ_2.

Lemma 1: Given a K-way neural network classifier f(·) = η(z(·)), let μ_1 and μ_2 be the probability measures of the logit representations z_1 = z(x_1) and z_2 = z(x_2) from two data domains D_1 and D_2, where x_1 ∼ D_1 and x_2 ∼ D_2. Assume independent draws for x_1 and x_2, i.e., p(x_1, x_2) = p_1(x_1) · p_2(x_2). Then

E_{x_1 ∼ D_1}[ℓ(f(x_1), y_1)] ≤ E_{x_2 ∼ D_2}[ℓ(f(x_2), y_2)] + W_1(μ_1, μ_2),

where W_1(μ_1, μ_2) is the Wasserstein-1 distance between μ_1 and μ_2.
Proof: Please see Appendix A.
With Lemma 1, we now state the main theorem regarding an upper bound on population risk for reprogramming.
Theorem 1: Let δ* denote the learned additive input transformation for reprogramming (Assumption 4). The population risk for the target task via reprogramming a K-way source neural network classifier f_S, denoted by ε_T(δ*), is upper bounded by

ε_T(δ*) ≤ ε_S + W_1( μ(z(x_S)), μ(z(x_T + δ*)) ),

where μ(z(x_S)) and μ(z(x_T + δ*)) are the probability measures of the logit representations of the source data and the reprogrammed target data, respectively.
Proof: Please see Appendix B.
Theorem 1 shows that the target population risk via reprogramming is upper bounded by the sum of two terms: (i) the source population risk ε_S, and (ii) the representation alignment loss in the logit layer between the source data and the reprogrammed target data based on the same source neural network classifier f_S, measured by their Wasserstein-1 distance. The result suggests that reprogramming can attain better performance (lower risk) when the source model has a lower source risk and a smaller representation alignment loss.

In the extreme case, if the source and target representations can be fully aligned, the Wasserstein-1 distance becomes 0 and thus the target task via reprogramming can perform as well as the source task. On the other hand, if the representation alignment loss is large, then it may dominate the source risk and hinder the performance on the target task. In the next section we investigate how the representation alignment loss can inform the reprogramming performance of V2S. We would also like to make a final remark that our risk analysis can be extended beyond the additive input transformation setting by considering a more complex input transformation function (e.g., an affine transformation). However, in practice, we observe little gain from doing so in V2S and therefore focus on the additive input transformation setting.
5 Performance Evaluation
5.1 Acoustic Models (AMs) and Source Datasets
We start by introducing the source models and the source datasets they were trained on. The pretrained source models will be used in our V2S experiments.
Limited-vocabulary Voice Commands Dataset: To create a large-scale (100k training samples) pretrained acoustic model for our experiments, we select the Google Speech Commands V2 (Warden, 2018) (denoted as GSCv2) dataset, which contains 105,829 utterances of 35 words from 2,618 recorded speakers with a sampling rate of 16 kHz. We also provide some discussion on models trained on other acoustic datasets (e.g., AudioSet (Gemmeke et al., 2017)) in Appendix D. In general, using the same network architecture we find that AMs trained on the voice commands dataset show better V2S performance than other datasets, which could be attributed to its similarity to short-length (one-second or less) time series data.
Transformer-based AMs: For training the source model, we use a popular transformer-based single-head self-attention architecture (de Andrade et al., 2018) for V2S reprogramming, denoted as V2Sa (Figure 2 (a)). We also train a similar architecture with a U-Net (Long et al., 2015), denoted as V2Su (Figure 2 (b)), which is designed to enhance feature extraction in acoustic tasks (Yang et al., 2021a). Both pretrained V2Sa and V2Su models have a comparable number (0.2M/0.3M) of model parameters and test accuracy (96.90%/96.92%). Their data input dimension is 16k and thus the reprogramming function has 16k trainable parameters. The details of the transformer-based models and comparisons to other popular neural network architectures are given in Appendix C.
5.2 V2S Implementation and Baseline
V2S Implementation:
We use TensorFlow (Abadi et al., 2016) (v2.2) to implement our V2S framework following Algorithm 1. To enable end-to-end V2S training, we use the Kapre toolkit (Choi et al., 2017) to incorporate an on-GPU audio preprocessing layer, as shown in Figure 2. For the V2S parameters in Algorithm 1, we use a mini-batch size of 32 during training. We use maximal many-to-one random label mapping, which assigns ⌊K/|Y_T|⌋ non-overlapping source labels to every target label, where |Y_T| is the size of the target label set and ⌊·⌋ is the floor function that gives the largest integer not exceeding its argument. To stabilize the training process, we add weight decay as a regularization term to the V2S loss. Our V2S implementation is open-source and available at https://github.com/huckiyang/Voice2SeriesReprogramming.

For model tuning, we use dropout during training on the reprogramming parameters θ. Moreover, during input reprogramming we also replicate the target signal into n segments and place them starting from the beginning of the reprogrammed input with an identical interval (see Figure 4 (a) as an example). For each task, we report the best result of V2S among a set of hyperparameters (the dropout rate and the number of target signal replications n). We use 10-fold splitting on the training data to select the best-performing model based on the validation loss and report the accuracy on test data averaged over 10 runs, which follows a similar experimental setting used in (Cabello et al., 2020).
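The maximal many-to-one random label mapping described above can be sketched as follows (a hypothetical helper, not the released implementation):

```python
import numpy as np

def build_label_mapping(num_source, num_target, seed=0):
    """Random, non-overlapping many-to-one mapping: each target label
    gets floor(K / |Y_T|) distinct source labels."""
    m = num_source // num_target           # source labels per target label
    perm = np.random.default_rng(seed).permutation(num_source)
    return [sorted(perm[t * m:(t + 1) * m].tolist()) for t in range(num_target)]

# GSCv2 has K = 35 source labels; a 10-class target task gets 3 each.
mapping = build_label_mapping(num_source=35, num_target=10)
```

Leftover source labels (here 35 − 30 = 5) are simply left unassigned.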
Transfer Learning Baseline (TF): To demonstrate the effectiveness of reprogramming, we also provide a transfer learning baseline using the same pretrained V2S model. Different from V2S, this baseline (named TF) does not use input reprogramming but instead allows fine-tuning the pretrained model parameters using the zero-padded target data. An additional dense layer for task-dependent classification is also included for training.
5.3 UCR Time Series Classification Benchmark
The UCR Archive (Dau et al., 2019) is a prominent benchmark that contains a large collection of time series classification datasets with default training and testing data splits. The current state-of-the-art (SOTA) results on test accuracy are obtained from the following methods: (i) deep neural networks, including fully convolutional networks (FCN) and deep residual neural networks, as reported in (Wang et al., 2017); (ii) the bag-of-features framework (Schäfer, 2015; Cabello et al., 2020); (iii) the ensemble-based framework (Hills et al., 2014; Bagnall et al., 2015; Lines et al., 2018); (iv) the time warping framework (Ratanamahatana & Keogh, 2005).
To ensure each target label is assigned at least 3 unique source labels, we select the 30 time series datasets in the UCR Archive with the number of target labels no greater than 10 for our V2S experiments. We note that for each dataset, the algorithm that achieves the current SOTA result can vary, and therefore the comparison can be unfair to V2S. Nonetheless, the results still provide new insights into how many datasets V2S can outperform and obtain a new SOTA result on, rather than which method is best for time series classification.
In addition to comparing the standard test accuracy of each dataset as well as the mean and median accuracy over all datasets, we also report the mean per-class error (MPCE) proposed in (Wang et al., 2017), which is a single metric for performance evaluation over multiple datasets. MPCE is the mean of the per-class error (PCE) over all datasets, defined as PCE_k = e_k / c_k, which comprises the error rate e_k and the number of classes c_k for each dataset k.
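As a concrete illustration of the metric, with hypothetical error rates and class counts:

```python
import numpy as np

def mpce(error_rates, num_classes):
    """Mean per-class error over datasets: mean of PCE_k = e_k / c_k."""
    pce = np.asarray(error_rates, dtype=float) / np.asarray(num_classes)
    return float(pce.mean())

# Hypothetical example: three datasets with error rates e_k and c_k classes.
mpce([0.10, 0.04, 0.12], [2, 4, 6])    # -> 0.02666...
```

Normalizing each error rate by its class count keeps many-class datasets from dominating the average.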
Reprogramming Performance: Table 2 summarizes the performance of each method on the 30 datasets. Notably, our reprogrammed V2S model attains either better or equivalent results on 20 out of 30 time series datasets, suggesting that V2S as a single method is a competitive and promising approach for time series classification. The transfer learning baseline TF has poor performance, which can be attributed to the limited training data. V2S has higher mean/median accuracy (accuracy increases by 1.84%/2.63%) and lower MPCE (relative error decreases by about 2.87%) than the SOTA results, demonstrating the effectiveness of V2S. For most datasets, V2Su has better performance than V2Sa, which can be explained by Theorem 1 through a lower empirical target risk upper bound (see Section 5.4).
5.4 Representation Alignment Loss
According to Theorem 1, the target risk is upper bounded by the sum of a fixed source risk and a representation alignment loss between the source and reprogrammed target data. The latter is measured by the Wasserstein-1 distance of their logit representations. We use the following experiments to empirically verify the representation alignment loss during V2S training and motivate its use for reprogramming performance assessment. Specifically, for computational efficiency we use the sliced Wasserstein-2 distance (SWD) (Kolouri et al., 2018) to approximate the Wasserstein-1 distance in Theorem 1. SWD uses one-dimensional (1D) random projections (we use 1,000 runs) to compute the sliced Wasserstein-2 distance by invoking 1D optimal transport (OT), which is computationally efficient compared to higher-dimensional OT problems (Peyré & Cuturi, 2018). Moreover, the Wasserstein-1 distance is upper bounded by the Wasserstein-2 distance (Peyré & Cuturi, 2018), and therefore the SWD serves as a good approximation of the exact representation alignment loss.
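A minimal NumPy sketch of the SWD approximation (illustrative; the paper follows Kolouri et al. (2018)) exploits the fact that 1D optimal transport between equal-size empirical distributions reduces to matching sorted samples:

```python
import numpy as np

def sliced_wasserstein2(z1, z2, n_proj=1000, seed=0):
    """Approximate SWD between two sets of logit vectors via random 1D
    projections; 1D OT between equal-size empirical distributions
    reduces to coupling sorted samples."""
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(n_proj):
        v = rng.normal(size=z1.shape[1])
        v /= np.linalg.norm(v)                 # random unit direction
        a, b = np.sort(z1 @ v), np.sort(z2 @ v)
        total += np.mean((a - b) ** 2)         # squared 1D W2 cost
    return np.sqrt(total / n_proj)

rng = np.random.default_rng(1)
z_src = rng.normal(size=(256, 8))              # stand-in source logits
z_tgt = rng.normal(size=(256, 8)) + 2.0        # shifted target logits
swd = sliced_wasserstein2(z_src, z_tgt, n_proj=200)
```

For well-aligned representations the SWD approaches 0, while a distribution shift (as simulated above) yields a clearly positive value.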
Wasserstein Distance during Training: Using the DistalPhalanxTW (Davis, 2013) dataset and V2S in Table 2, Figure 3 shows the validation (test) accuracy, validation (test) loss, and SWD during V2S training. One can observe a similar trend between test loss and SWD, suggesting that V2S indeed learns to reprogram the target data representations by gradually making them closer to the source data distribution, as indicated by Theorem 1.
Model Selection: Based on Theorem 1, one can leverage our derived risk bound for V2S model selection. Comparing V2Sa and V2Su, Table 3 shows the validation loss of the source task (the GSCv2 voice dataset (Warden, 2018)) and the mean/median SWD over all 30 training sets of the target tasks in Table 2. We find that V2Su indeed has a lower sum of the source loss and SWD than V2Sa, which explains its improved performance in Table 2.
5.5 Additional Analysis on V2S Interpretation
To gain further insights on V2S, we study its acoustic saliency map and embedding visualization.
Attention and Class Activation Mapping: To interpret the predictions made by V2S, we provide a neural saliency analysis over the spectrogram of the reprogrammed features by class activation mapping (CAM) (Zhou et al., 2016), using the Worms dataset (Bagnall et al., 2015) and V2S. Activation and attention mapping methods (Wu & Lee, 2019) have been used in auditory analysis (Fritz et al., 2007), and the relationship between audio signals and brain cortex activation has been investigated in neurophysiology studies (Kaya & Elhilali, 2017; Veale et al., 2017). Interestingly, as shown in Figure 4, the corresponding attention head (b) of the pretrained AM (non-trainable during the V2S process) can still recognize the original temporal patterns from the reprogrammed input signal in (a). We also show the Mel-spectrogram of the reprogrammed input in (c), indicating the activated spatial-temporal acoustic features corresponding to the weighted output prediction. Furthermore, in reference to the V2S architecture introduced in Figure 2, we select the first and the second convolution layers for CAM visualization in (d) and (e). From this analysis, we observe different functions of these two layers. The first convolution layer tends to focus on the target signal segments themselves as well as their low-frequency acoustic features in the reprogrammed input's Mel-spectrogram in (c), whereas the second convolution layer tends to put more focus on the high-frequency components of the reprogrammed input.
Embedding Visualization: We use t-distributed stochastic neighbor embedding (t-SNE) (Van der Maaten & Hinton, 2008) to visualize the logit representations of the Strawberry training set (Holland et al., 1998) before and after reprogramming, and for the transfer learning baseline (TF). As shown in Figure 5, after reprogramming the t-SNE results show a clear separation between the embeddings of different target classes, suggesting that V2S indeed learns meaningful and discriminative data representations to reprogram the pretrained acoustic model for time series classification. On the other hand, the embedding visualization of transfer learning shows low class-wise separability.
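A typical way to produce such a visualization (assuming scikit-learn as the tooling; the paper does not specify its implementation) is to run t-SNE directly on the logit vectors. The logits below are synthetic stand-ins:

```python
import numpy as np
from sklearn.manifold import TSNE

# Hypothetical logit vectors for a two-class target task (stand-ins for
# the Strawberry embeddings; real usage feeds the model's logit layer).
rng = np.random.default_rng(0)
logits = np.vstack([rng.normal(0.0, 1.0, size=(50, 35)),
                    rng.normal(4.0, 1.0, size=(50, 35))])
labels = np.repeat([0, 1], 50)

# Project the 35-dimensional logits to 2D for plotting.
emb = TSNE(n_components=2, perplexity=20, random_state=0).fit_transform(logits)
```

Each row of `emb` is a 2D coordinate, colored by `labels` when plotted.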
5.6 Additional Discussion
In what follows, we provide additional discussion and insights for V2S reprogramming.
Many-to-one Label Mapping: The many-to-one mapping can be viewed as ensemble averaging of single-class outputs, a practical technique to improve classification for V2S reprogramming. We report ablation results (mean/median accuracy (%)) using Table 2's setup, controlled by the m-to-1 V2S mapping, as follows: m=1: 89.21/94.17; m=2: 89.58/94.52; m=3: 89.86/94.99. These results show that many-to-one label mapping yields better results in our experiments.
Significance Testing: In Section 5, our results suggest that V2S as a single method can achieve or exceed SOTA results on 20 out of 30 datasets obtained from different methods. We further run significance testing at two levels using the accuracies on the 30 datasets from Table 2: (i) V2S vs. SOTA numbers: p-value = 0.0017; (ii) V2S vs. FCN (Wang et al., 2017): p-value = 0.0011. Both indicate that our results are significant.
Effect of Source Dataset: The effect of the source dataset for V2S is captured by the source risk ε_S in Theorem 1, along with the representation alignment loss via SWD. We first evaluate different acoustic datasets to calculate their source test errors, where GSCv2 attains the smallest test error. The source test errors on GSCv2, TAU, AudioSet, and ESC are 0.1709, 0.1822, 0.1839, and 0.1765, respectively, and their SWD values are reported in Appendix Table 5. Our theorem informs the performance of V2S on different source datasets: if another dataset has a smaller source risk and SWD than GSCv2, we expect it to yield better V2S performance.
Future Works: Our future works include a wider range of performance evaluations on different acoustic and speech models (e.g., those associated with lexical information) for V2S reprogramming, model reprogramming for lowresource speech processing, and extension to multivariate time series tasks. The proposed theory would also provide insights on analyzing the success of adversarial reprogramming in vision and language processing domains.
6 Conclusion
In this work, we proposed V2S, a novel approach to reprogram a pretrained acoustic model for time series classification. We also developed a theoretical risk analysis to characterize the reprogramming performance. Experimental results on the UCR benchmark showed superior performance of V2S, achieving new (or equal) state-of-the-art accuracy on 20 out of 30 datasets. We also provided in-depth studies on the success of V2S through representation alignment, acoustic saliency maps, and embedding visualization.
Acknowledgements
The authors would like to thank the anonymous reviewers for their comments and discussion during the double-blind review process.
References
 Abadi et al. (2016) Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., Devin, M., Ghemawat, S., Irving, G., Isard, M., et al. TensorFlow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pp. 265–283, 2016.
 Bagnall et al. (2015) Bagnall, A., Lines, J., Hills, J., and Bostrom, A. Time-series classification with COTE: the collective of transformation-based ensembles. IEEE Transactions on Knowledge and Data Engineering, 27(9):2522–2535, 2015.
 Cabello et al. (2020) Cabello, N., Naghizade, E., Qi, J., and Kulik, L. Fast and accurate time series classification through supervised interval search. ICDM 2020, 2020.

 Chen et al. (2017) Chen, P.-Y., Zhang, H., Sharma, Y., Yi, J., and Hsieh, C.-J. ZOO: Zeroth order optimization based black-box attacks to deep neural networks without training substitute models. In ACM Workshop on Artificial Intelligence and Security, pp. 15–26, 2017.
 Choi et al. (2017) Choi, K., Joo, D., and Kim, J. Kapre: On-GPU audio preprocessing layers for a quick implementation of deep neural network models with Keras. In Machine Learning for Music Discovery Workshop at the 34th International Conference on Machine Learning. ICML, 2017.
 Cramer et al. (2019) Cramer, J., Wu, H.-H., Salamon, J., and Bello, J. P. Look, listen, and learn more: Design choices for deep audio embeddings. In ICASSP 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 3852–3856. IEEE, 2019.
 Dau et al. (2019) Dau, H. A., Bagnall, A., Kamgar, K., Yeh, C.-C. M., Zhu, Y., Gharghabi, S., Ratanamahatana, C. A., and Keogh, E. The UCR time series archive. IEEE/CAA Journal of Automatica Sinica, 6(6):1293–1305, 2019.
 Davis (2013) Davis, L. M. Predictive modelling of bone ageing. PhD thesis, University of East Anglia, 2013.
 de Andrade et al. (2018) de Andrade, D. C., Leo, S., Viana, M. L. D. S., and Bernkopf, C. A neural attention model for speech command recognition. arXiv preprint arXiv:1808.08929, 2018.
 Elsayed et al. (2019) Elsayed, G. F., Goodfellow, I., and SohlDickstein, J. Adversarial reprogramming of neural networks. In International Conference on Learning Representations, 2019.
 Fawaz et al. (2018) Fawaz, H. I., Forestier, G., Weber, J., Idoumghar, L., and Muller, P.A. Transfer learning for time series classification. In 2018 IEEE international conference on big data (Big Data), pp. 1367–1376. IEEE, 2018.
 Fawaz et al. (2019) Fawaz, H. I., Forestier, G., Weber, J., Idoumghar, L., and Muller, P.A. Deep learning for time series classification: a review. Data Mining and Knowledge Discovery, 33(4):917–963, 2019.
 Fritz et al. (2007) Fritz, J. B., Elhilali, M., David, S. V., and Shamma, S. A. Auditory attention—focusing the searchlight on sound. Current opinion in neurobiology, 17(4):437–455, 2007.
 Gao & Pavel (2017) Gao, B. and Pavel, L. On the properties of the softmax function with application in game theory and reinforcement learning. arXiv preprint arXiv:1704.00805, 2017.
 Gemmeke et al. (2017) Gemmeke, J. F., Ellis, D. P., Freedman, D., Jansen, A., Lawrence, W., Moore, R. C., Plakal, M., and Ritter, M. Audio set: An ontology and humanlabeled dataset for audio events. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 776–780. IEEE, 2017.
 Geurts (2001) Geurts, P. Pattern extraction for time series classification. In European conference on principles of data mining and knowledge discovery, pp. 115–127. Springer, 2001.
 Hambardzumyan et al. (2020) Hambardzumyan, K., Khachatrian, H., and May, J. WARP: Word-level adversarial reprogramming. arXiv preprint arXiv:2101.00121, 2020.

 He et al. (2016) He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778, 2016.
 Heittola et al. (2020) Heittola, T., Mesaros, A., and Virtanen, T. Acoustic scene classification in DCASE 2020 challenge: generalization across devices and low complexity solutions. arXiv preprint arXiv:2005.14623, 2020.
 Hershey et al. (2017) Hershey, S., Chaudhuri, S., Ellis, D. P., Gemmeke, J. F., Jansen, A., Moore, R. C., Plakal, M., Platt, D., Saurous, R. A., Seybold, B., et al. CNN architectures for large-scale audio classification. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 131–135. IEEE, 2017.
 Hills et al. (2014) Hills, J., Lines, J., Baranauskas, E., Mapp, J., and Bagnall, A. Classification of time series by shapelet transformation. Data mining and knowledge discovery, 28(4):851–881, 2014.

 Holland et al. (1998) Holland, J., Kemsley, E., and Wilson, R. Use of Fourier transform infrared spectroscopy and partial least squares regression for the detection of adulteration of strawberry purees. Journal of the Science of Food and Agriculture, 76(2):263–269, 1998.
 Hong et al. (2019) Hong, S., Xiao, C., Ma, T., Li, H., and Sun, J. MINA: Multilevel knowledge-guided attention for modeling electrocardiography signals. arXiv preprint arXiv:1905.11333, 2019.
 Hu et al. (2020) Hu, H., Yang, C.-H. H., Xia, X., Bai, X., Tang, X., Wang, Y., Niu, S., Chai, L., Li, J., Zhu, H., et al. Device-robust acoustic scene classification based on two-stage categorization and data augmentation. arXiv preprint arXiv:2007.08389, 2020.

 Hu et al. (2021) Hu, H., Yang, C.-H. H., Xia, X., Bai, X., Tang, X., Wang, Y., Niu, S., Chai, L., Li, J., Zhu, H., et al. A two-stage approach to device-robust acoustic scene classification. In ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 845–849. IEEE, 2021.
 Kampouraki et al. (2008) Kampouraki, A., Manis, G., and Nikou, C. Heartbeat time series classification with support vector machines. IEEE Transactions on Information Technology in Biomedicine, 13(4):512–518, 2008.
 Kantorovich & Rubinstein (1958) Kantorovich, L. and Rubinstein, G. On a space of completely additive functions. In Vestnik Leningradskogo Universiteta, volume 13 (7), pp. 52–59, 1958.

 Kashiparekh et al. (2019) Kashiparekh, K., Narwariya, J., Malhotra, P., Vig, L., and Shroff, G. ConvTimeNet: A pre-trained deep convolutional neural network for time series classification. In 2019 International Joint Conference on Neural Networks (IJCNN), pp. 1–8. IEEE, 2019.
 Kaya & Elhilali (2017) Kaya, E. M. and Elhilali, M. Modelling auditory attention. Philosophical Transactions of the Royal Society B: Biological Sciences, 372(1714):20160101, 2017.
 Kingma & Ba (2015) Kingma, D. and Ba, J. Adam: A method for stochastic optimization. International Conference on Learning Representations, 2015.
 Kloberdanz (2020) Kloberdanz, E. Reprogramming of neural networks: A new and improved machine learning technique. ISU Master Thesis, 2020.

 Kolouri et al. (2018) Kolouri, S., Rohde, G. K., and Hoffmann, H. Sliced Wasserstein distance for learning Gaussian mixture models. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3427–3436, 2018.
 Langkvist et al. (2014) Langkvist, M., Karlsson, L., and Loutfi, A. A review of unsupervised feature learning and deep learning for time-series modeling. Pattern Recognition Letters, 42:11–24, 2014.
 Lea et al. (2016) Lea, C., Vidal, R., Reiter, A., and Hager, G. D. Temporal convolutional networks: A unified approach to action segmentation. In European Conference on Computer Vision, pp. 47–54. Springer, 2016.
 Lines et al. (2012) Lines, J., Davis, L. M., Hills, J., and Bagnall, A. A shapelet transform for time series classification. In Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 289–297, 2012.
 Lines et al. (2018) Lines, J., Taylor, S., and Bagnall, A. Time series classification with HIVE-COTE: The hierarchical vote collective of transformation-based ensembles. ACM Transactions on Knowledge Discovery from Data, 12(5), 2018.
 Liu et al. (2020) Liu, S., Chen, P.Y., Kailkhura, B., Zhang, G., Hero, A., and Varshney, P. K. A primer on zerothorder optimization in signal processing and machine learning. IEEE Signal Processing Magazine, 2020.
 Long et al. (2015) Long, J., Shelhamer, E., and Darrell, T. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3431–3440, 2015.
 Neekhara et al. (2019) Neekhara, P., Hussain, S., Dubnov, S., and Koushanfar, F. Adversarial reprogramming of text classification neural networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 5219–5228, 2019.
 Peyré & Cuturi (2018) Peyré, G. and Cuturi, M. Computational optimal transport. arXiv preprint arXiv:1803.00567, 2018.
 Piczak (2015) Piczak, K. J. ESC: Dataset for Environmental Sound Classification. In Proceedings of the 23rd Annual ACM Conference on Multimedia, pp. 1015–1018. ACM Press, 2015. doi: 10.1145/2733373.2806390. URL http://dl.acm.org/citation.cfm?doid=2733373.2806390.
 Ratanamahatana & Keogh (2005) Ratanamahatana, C. A. and Keogh, E. Three myths about dynamic time warping data mining. In Proceedings of the 2005 SIAM international conference on data mining, pp. 506–510. SIAM, 2005.
 Sandler et al. (2018) Sandler, M., Howard, A., Zhu, M., Zhmoginov, A., and Chen, L.-C. MobileNetV2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4510–4520, 2018.
 Saon et al. (2017) Saon, G., Kurata, G., Sercu, T., Audhkhasi, K., Thomas, S., Dimitriadis, D., Cui, X., Ramabhadran, B., Picheny, M., Lim, L.L., et al. English conversational telephone speech recognition by humans and machines. Proc. Interspeech 2017, pp. 132–136, 2017.
 Schäfer (2015) Schäfer, P. The BOSS is concerned with time series classification in the presence of noise. Data Mining and Knowledge Discovery, 29(6):1505–1530, 2015.
 Simonyan & Zisserman (2015) Simonyan, K. and Zisserman, A. Very deep convolutional networks for large-scale image recognition. International Conference on Learning Representations, 2015.
 Tsai et al. (2020) Tsai, Y.-Y., Chen, P.-Y., and Ho, T.-Y. Transfer learning without knowing: Reprogramming black-box machine learning models with scarce data and limited resources. In International Conference on Machine Learning, pp. 9614–9624, 2020.
 Van der Maaten & Hinton (2008) Van der Maaten, L. and Hinton, G. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(11), 2008.
 Veale et al. (2017) Veale, R., Hafed, Z. M., and Yoshida, M. How is visual salience computed in the brain? insights from behaviour, neurobiology and modelling. Philosophical Transactions of the Royal Society B: Biological Sciences, 372(1714):20160113, 2017.
 Vinod et al. (2020) Vinod, R., Chen, P.Y., and Das, P. Reprogramming language models for molecular representation learning. arXiv preprint arXiv:2012.03460, 2020.
 Wang et al. (2017) Wang, Z., Yan, W., and Oates, T. Time series classification from scratch with deep neural networks: A strong baseline. In 2017 International joint conference on neural networks (IJCNN), pp. 1578–1585. IEEE, 2017.
 Warden (2018) Warden, P. Speech commands: A dataset for limited-vocabulary speech recognition. arXiv preprint arXiv:1804.03209, 2018.
 Wu & Lee (2019) Wu, Y. and Lee, T. Enhancing sound texture in CNN-based acoustic scene classification. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 815–819. IEEE, 2019.
 Xu et al. (2014) Xu, Y., Du, J., Dai, L.-R., and Lee, C.-H. A regression approach to speech enhancement based on deep neural networks. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 23(1):7–19, 2014.
 Yang et al. (2020) Yang, C.-H., Qi, J., Chen, P.-Y., Ma, X., and Lee, C.-H. Characterizing speech adversarial examples using self-attention U-Net enhancement. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 3107–3111. IEEE, 2020.

 Yang et al. (2021a) Yang, C.-H. H., Qi, J., Chen, S. Y.-C., Chen, P.-Y., Siniscalchi, S. M., Ma, X., and Lee, C.-H. Decentralizing feature extraction with quantum convolutional neural network for automatic speech recognition. In ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6523–6527. IEEE, 2021a.
 Yang et al. (2021b) Yang, C.-H. H., Siniscalchi, S. M., and Lee, C.-H. PATE-AAE: Incorporating adversarial autoencoder into private aggregation of teacher ensembles for spoken command classification. arXiv preprint arXiv:2104.01271, 2021b.
 Ye & Dai (2018) Ye, R. and Dai, Q. A novel transfer learning framework for time series forecasting. Knowledge-Based Systems, 156:74–99, 2018.
 Zhang et al. (2010) Zhang, D., Zuo, W., Zhang, D., and Zhang, H. Time series classification using support vector machine with gaussian elastic metric kernel. In 2010 20th International Conference on Pattern Recognition, pp. 29–32. IEEE, 2010.

Zhou et al. (2016)
Zhou, B., Khosla, A., Lapedriza, A., Oliva, A., and Torralba, A.
Learning deep features for discriminative localization.
In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2921–2929, 2016.
Appendix
Appendix A Proof of Lemma 1
For brevity, we use [K] to denote the integer set {1, 2, …, K}. We have
(5)  
(6)  
(7)  
(8)  
(9)  
(10) 
(a) follows the neural network model, (b) follows the definition of expectation, (c) follows the assumption of independent data drawing, and (d) follows that , and thus . (e) holds by setting and . Then by definition . We further make the following three notes: (i) and ; (ii) , and if ; (iii) there exists at least one class index whose softmax output is at least 1/K. One can show (iii) by contradiction: if every one of the K softmax outputs were strictly smaller than 1/K, then summing over all K outputs would contradict the fact that the softmax outputs sum to one. Therefore,
(11)  
(12)  
(13)  
(14) 
where is defined as , and we use the fact that is 1-Lipschitz for any (Gao & Pavel, 2017) (so is 2-Lipschitz). Finally, (f) follows from the Kantorovich-Rubinstein theorem (Kantorovich & Rubinstein, 1958), i.e., the dual representation of the Wasserstein-1 distance.
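For reference, step (f) uses the dual (Kantorovich-Rubinstein) representation of the Wasserstein-1 distance between two distributions $\mu$ and $\nu$:

```latex
W_1(\mu, \nu) \;=\; \sup_{\|f\|_{\mathrm{Lip}} \le 1}
  \; \mathbb{E}_{x \sim \mu}\!\left[f(x)\right] \;-\; \mathbb{E}_{y \sim \nu}\!\left[f(y)\right],
```

where the supremum runs over all 1-Lipschitz functions $f$. A 2-Lipschitz function $g$ can be rescaled as $g/2$ to fit this class, which is how the Lipschitz constant noted above translates into a factor of 2 in front of the $W_1$ term.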
Appendix B Proof of Theorem 1
First, we decompose the target risk function as
(15)  
(16)  
(17)  
(18) 
(a) is based on Assumption 3, (b) is based on the definition of the risk function, (c) follows by subtracting and adding the same term , and (d) is based on the triangle inequality.
Note that by Assumption 1, . Next, we proceed to bound the remaining term. Using Lemma 1, we have
(19) 
Finally, combining this bound with equation (18) completes the proof.
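Schematically, writing $\epsilon_{\mathcal{S}}$ and $\epsilon_{\mathcal{T}}$ for the source and target risks and $\mu_{\mathcal{S}}, \mu_{\mathcal{T}}$ for the corresponding representation distributions (these symbols are stand-ins for the quantities defined in the main text), the combination of (18) and (19) yields a bound of the form

```latex
\epsilon_{\mathcal{T}} \;\le\; \epsilon_{\mathcal{S}}
  \;+\; 2\, W_1\!\left(\mu_{\mathcal{S}},\, \mu_{\mathcal{T}}\right),
```

with the factor 2 matching the 2-Lipschitz constant used in the proof of Lemma 1.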
Appendix C PreTrained Model Studies
We provide additional studies of different pre-trained acoustic architectures and the associated time series classification performance. In particular, we select the models below, which have attained competitive performance on the Google Speech Commands version 2 dataset (Warden, 2018) or shown cutting-edge performance in image recognition (ResNet (He et al., 2016)), acoustic scene classification (VGGish (Hershey et al., 2017)), and time series classification (TCN (Lea et al., 2016)). These models are compared with the recurrent-attention V2S model (de Andrade et al., 2018) and its U-Net-enhanced variant (Yang et al., 2021a) used in the main paper.
ResNet: The deep residual network (ResNet) (He et al., 2016) is a popular deep architecture that alleviates vanishing-gradient issues by passing latent features through residual connections, and it has been widely used in acoustic modeling tasks. We select a 34-layer ResNet model trained from scratch for V2S, following the identical parameter settings in (Hu et al., 2020, 2021) for reproducible studies.
VGGish: VGGish (Hershey et al., 2017) is a deep and wide neural network architecture with multi-channel convolution layers, proposed for speech and acoustic modeling. VGGish is also well known from large-scale acoustic embedding studies on AudioSet (Gemmeke et al., 2017), built from 2 million YouTube audio clips. We use the same architecture and train two models for V2S: (i) one trained from scratch and (ii) one initialized from an AudioSet pre-trained VGGish and then fine-tuned on the Google Speech Commands dataset.
Temporal Convolutional Network (TCN): TCN (Lea et al., 2016) is an efficient architecture that uses temporal convolutions with causal kernels for sequence classification tasks. We select the TCN architecture and train it from scratch as a baseline for V2S.
OpenL3: OpenL3 (Cramer et al., 2019) is a more recent embedding method with a deep fusion layer for acoustic modeling. We use pre-trained OpenL3 embeddings followed by a dense layer for classification as another baseline for V2S.
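For concreteness, here is a minimal NumPy sketch of the two V2S-specific components that sit around these frozen backbones: a trainable reprogram layer (zero-pad the target series to the source input length and add a trainable perturbation on the padded, non-signal positions) and a many-to-one label mapping that aggregates source-class probabilities into target-class scores. The function names, sizes, and the two-way grouping are illustrative, not the paper's exact implementation.

```python
import numpy as np

def reprogram(x, theta, target_len):
    """Pad a target series to the source length and add a trainable
    perturbation theta only where the binary mask is 1 (the padding)."""
    padded = np.zeros_like(theta)
    padded[:len(x)] = x
    mask = np.ones_like(theta)
    mask[:target_len] = 0.0          # keep the original signal intact
    return padded + mask * theta

def many_to_one(source_probs, mapping):
    """Aggregate frozen-model class probabilities into target scores
    under a many-to-one label mapping {target_label: [source_labels]}."""
    out = np.zeros((source_probs.shape[0], len(mapping)))
    for t, src in mapping.items():
        out[:, t] = source_probs[:, src].sum(axis=1)
    return out

# Illustrative end-to-end pass with a stand-in for the frozen acoustic model
rng = np.random.default_rng(0)
theta = rng.normal(scale=1e-3, size=16000)     # trainable parameters
x = np.sin(np.linspace(0.0, 6.28, 128))        # a short target series
z = reprogram(x, theta, target_len=128)        # shape (16000,)

logits = rng.normal(size=(1, 35))              # pretend AM output (35 source classes)
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
mapping = {0: list(range(0, 17)), 1: list(range(17, 35))}
target_scores = many_to_one(probs, mapping)    # shape (1, 2); rows sum to 1
```

In actual training, only `theta` receives gradients while the acoustic model stays frozen, which is what makes the approach attractive under data scarcity.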
C.1 V2S Performance and Sliced Wasserstein Distance
Table 4 compares different neural architectures for acoustic modeling used with the proposed V2S method, where the acoustic models are pre-trained on the Google Speech Commands dataset (Warden, 2018) version 2 with 32 commands (denoted as GSCv2). From the first three rows of Table 4, we observe that the recurrent attention models and TCN perform better in mean prediction accuracy and validation loss on the source task. For the target task (same setup as Table 2), the recurrent attention models attain the best performance.

Model  V2S  V2S  V2S  V2S  V2S  V2S  V2S
Parameters  0.2M  0.3M  1M  62M  62M  1M  4.7M
Source Acc. (%)  96.90  96.92  96.40  95.40  95.19  96.93  92.34
Source Loss  0.1709  0.1734  0.1786  0.1947  0.1983  0.1756  0.2145
Mean SWD  1.829  1.873  1.892  4.713  4.956  1.901  5.305
Mean Target Acc. (%)  89.91  87.92  87.22  67.12  63.23  86.45  60.34
Target MPCE  2.03  2.10  2.23  33.4  38.3  2.34  41.34
Both VGGish-based architectures show degraded performance on target-task prediction, which can be explained by recent findings (Kloberdanz, 2020) on the degraded performance of VGG (Simonyan & Zisserman, 2015) and MobileNetV2 (Sandler et al., 2018) based “wide” and deep convolutional neural architectures for reprogramming visual models. These findings can also be explained in the sense that the wide neural architectures fail to adapt to the source domain distribution, according to the sliced Wasserstein distance (SWD) results (fourth row in Table 4), as indicated by Theorem 1. We also observe that using pre-trained models with higher source accuracy does not always guarantee higher target accuracy (compare the variants in Table 4).
Appendix D Additional Ablation Studies
Based on the discussion in Section C, we further select three efficient V2S models with different model capacities (0.2M/1M/4.7M parameters) to study the mean target accuracy under different training settings and to provide some insights into the effective design of V2S models.
D.1 Pre-Trained Models from Different Datasets
In previous model reprogramming studies (Tsai et al., 2020; Elsayed et al., 2019), little has been discussed regarding the effectiveness of using different datasets to train the pre-trained models. For a preliminary V2S study, we consider three other public acoustic classification benchmark datasets (source tasks): (1) TAU Urban Acoustic Scenes 2020 Mobile (Heittola et al., 2020) (denoted as TAU-UAC), from the annual IEEE AASP Challenge on Detection and Classification of Acoustic Scenes and Events (DCASE); (2) AudioSet, as used in (Hershey et al., 2017); and (3) ESC-50 (Piczak, 2015), a dataset for environmental sound classification. We extract acoustic features as Mel-spectrograms using the Kapre audio layer, following the same resolution and frequency setup as (Yang et al., 2020, 2021b) for a fair and reproducible comparison.
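For readers without the Kapre toolchain, a log-Mel front end of this kind can be sketched in plain NumPy; the `n_fft`/`hop`/`n_mels` values below are illustrative defaults, not the paper's exact resolution settings.

```python
import numpy as np

def log_mel_spectrogram(x, sr=16000, n_fft=512, hop=256, n_mels=40):
    """NumPy-only sketch of a log-Mel feature extractor."""
    # Short-time Fourier transform (magnitude only)
    win = np.hanning(n_fft)
    frames = [x[i:i + n_fft] * win
              for i in range(0, len(x) - n_fft + 1, hop)]
    mag = np.abs(np.fft.rfft(np.stack(frames), axis=1))   # (T, n_fft//2+1)

    # Triangular Mel filterbank between 0 Hz and sr/2
    def hz_to_mel(f):
        return 2595.0 * np.log10(1.0 + f / 700.0)
    def mel_to_hz(m):
        return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_pts = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2),
                                    n_mels + 2))
    bins = np.floor((n_fft + 1) * mel_pts / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):
            fb[m - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fb[m - 1, k] = (right - k) / max(right - center, 1)

    return np.log(mag @ fb.T + 1e-6)                      # (T, n_mels)
```

A one-second clip at 16 kHz with these settings yields a (61, 40) feature matrix, which is then fed to the frozen acoustic model.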
As shown in Table 5, the V2S models pre-trained on GSCv2 show higher mean prediction accuracy and lower SWD than the other models, which could be due to the shorter sample length (under one second) of its source acoustic inputs.
Pre-trained Dataset  GSCv2  TAU-UAC  AudioSet  ESC-50
Training Samples  105k  13.9k  2.08M  2k
Output Classes  35  10  527  50
Audio Length per Clip  1 sec.  10 sec.  10 sec.  5 sec.
Mean Target Acc. w/ V2S (%)  89.91  82.61  80.68  84.48
Mean Target Acc. w/ V2S (%)  87.22  83.57  79.96  83.05
Mean Target Acc. w/ V2S (%)  86.45  80.1  81.81  84.58
Mean SWD per Dataset  1.874  2.267  2.481  2.162
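SWD values of the kind reported in Tables 4 and 5 can be approximated with a Monte-Carlo estimator such as the sketch below, which projects the two sets of embeddings onto random directions and uses the sorted-sample formula for the 1-D Wasserstein-1 distance. It assumes equal sample counts in both sets; the function name and `n_proj` default are illustrative, not the paper's exact procedure.

```python
import numpy as np

def sliced_wasserstein(X, Y, n_proj=50, seed=0):
    """Monte-Carlo estimate of the sliced Wasserstein-1 distance between
    two empirical distributions X, Y of shape (n, d)."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    total = 0.0
    for _ in range(n_proj):
        v = rng.normal(size=d)
        v /= np.linalg.norm(v)        # random unit direction
        # 1-D W1 between projected samples = mean |sorted differences|
        total += np.mean(np.abs(np.sort(X @ v) - np.sort(Y @ v)))
    return total / n_proj
```

Identical sample sets give a distance of zero, and the estimate grows as the two representation distributions drift apart, which is the quantity the bound in Theorem 1 ties to target-task risk.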
D.2 Different V2S Mapping Settings
In (Tsai et al., 2020), frequency-based mapping techniques show improved performance for black-box (e.g., zeroth-order gradient estimation (Chen et al., 2017; Liu et al., 2020) based) adversarial reprogramming models for image classification. We also follow the setup in (Tsai et al., 2020) to compare many-to-one frequency mapping with many-to-one random mapping for time series classification. However, over 100 runs, the frequency-mapping-based V2S results show equal or slightly worse mean prediction accuracy (-0.013) and SWD (+0.0028) performance, which may be owing to the differences in dimension and scale between the image and time series reprogramming tasks.

D.3 More t-SNE Visualization
D.4 Hardware Setup and Energy Cost Discussion
We use Nvidia GPUs (2080 Ti and V100) for our experiments, with Compute Unified Device Architecture (CUDA) version 10.1. To produce the results shown in Table 2, it takes around 40 minutes to run 100 epochs (the maximum) with a batch size of 32 for each time series prediction dataset, including the hyperparameter tuning (e.g., of the dropout rate) described in Section 5.2 of the main paper. In total, the experiments presented in this paper (30 datasets and the associated ablation studies) took around 120 computing hours with a 300 W power supply. As another advantage, V2S reprogramming freezes the pre-trained neural model and trains only a reprogramming layer for new tasks. The proposed method could thus recycle well-trained models for additional tasks and alleviate extra energy costs, a step toward deploying responsible ML systems.