Recently, in-home IoT devices have shown their potential to significantly improve in-home healthcare systems by further facilitating patient-caregiver relationships. For instance, hands-free and device-free voice command features have significantly helped the elderly, people with disabilities, and patients. Voice reminders and notifications for medications or doctor appointments, and easier communication with caregivers through voice-enabled hands-free messaging services, are some examples of applications that help everyone in the healthcare cycle. In current applications of voice-enabled IoT devices, the voice data are usually converted to text, mainly because i) storing text requires significantly less capacity than storing voice and ii) data mining and extraction are much easier and more efficient on text than on voice. However, there are three main problems with speech recognition in health-related applications. First, voice is a richer data type than text, i.e. it carries useful information that is specific to voice, such as tone, volume, and pitch, and this richness cannot be preserved by speech recognition. As prior work has shown, the richness of the voice is important for healthcare applications. Second, healthcare data are a clearly sensitive and private category of data, and the home is widely regarded as the most private environment for users, so keeping the data private and secure from parties outside the trust cycle, e.g. untrusted servers and outsourcing services, has always been a challenge. Speech recognition requires substantial processing and storage resources because of its complexity. It is thus usually done on cloud servers, which limits its application when the information is sensitive and the servers are untrusted. Finally, offline speech recognition techniques are one solution to the privacy preservation problem on untrusted servers, but they are less accurate than the online solutions. Also, the training phase in speech recognition may limit its application in dynamic and continuously growing systems.
In this paper, we consider a scenario where voice data are generated by the user, and the information contained in these voice data, such as tone, background voices, and ambient sounds, is detected and utilized. The richness of the voice is important for health-related applications. The background sounds can reveal information about the environment, e.g., music being played, a show on TV, or the presence of other people. Moreover, the tone of a patient’s voice can easily and clearly reveal information about her emotional and physical condition and reflect her feelings and mood. In addition, happiness, sadness, anger, or frustration can be detected in a patient’s voice even if the patient says the same word. To preserve the richness and privacy of the voice data, we propose an efficient and privacy-preserving voice-based search scheme, which stores the patient’s voices collected through voice-enabled IoT home devices at a server and enables the caregiver to later search the server for the voices she is interested in. To preserve data privacy at the server, encrypted data storage is one popular technique. This technique is widely implemented and used for text, images, and video, but there is little prior work on voice data because speech-to-text conversion services are so popular nowadays. The original voices are not usually stored in the database, and are mostly kept in special cases where the voice itself is important, e.g. music recordings. In our scenario, these voices are encrypted and uploaded to a server, and the patient’s caregiver can later query the server for the data of interest, decrypt it, and access the original voice of the patient. We also aim to achieve higher accuracy than existing works because our scheme deals with the voice data directly and does not need to convert them to text. Specifically, the contributions of this work can be summarized as follows:
First, we study the advantages and disadvantages of voice over text as the data type to be collected from patients, from both the patient’s and the caregiver’s point of view. We find that the richness of voice can be useful to in-home healthcare applications where existing data collection and search schemes cannot be applied.
Second, we present novel schemes for collecting, encrypting, and storing the patients’ voice on the semi-trusted server using voice-enabled IoT home devices, and for voice-based search over mHealth data. Our scheme preserves both the richness and privacy of the voice data and achieves high efficiency in the voice search function.
Third, we evaluate our schemes through privacy and accuracy analysis using real data, and show that our methods successfully preserve the privacy of the data from the server and accurately detect different tones, moods, and background sounds in the collected voice data.
The remainder of this paper is organized as follows. In Section II, we present the system model by introducing the system components, design goals, and trust model. Section III covers the preliminaries of this work, followed by our proposed scheme in Section IV. We present the privacy and usability evaluation in Section V. Related works are presented in Section VI, and finally we conclude the paper in Section VII.
II System Model
We consider a scenario in which the patient has equipped her home with smart home devices and uses these devices to securely record and send her voice to her caregiver.
II-A System components
As depicted in Fig. 1, our proposed system consists of five components:
User: the primary user, e.g. the patient who is using the system to communicate with her caregiver.
Caregiver: the caregiver of the patient, who uses the system to receive the recorded voice data and to query for similar voice data in order to study the mental and physical condition of the patient.
Device: the IoT home device, which records, encrypts, and uploads the user’s voice data to a server.
Interface: the interface the caregiver uses to get and decrypt the list of recorded voice data of the user, and to make voice searches and queries on each of the samples.
Database server: the semi-trusted database server, which is in charge of keeping the encrypted voice data and returning the query results after processing the caregiver’s encrypted query.
II-B Design goals
Voice richness preservation - Voice data contain more information than text, and this extra information is usually lost in current applications. In our system, voice data are not converted to text and the actual voices are transmitted to the caregiver, so all the information contained in the data is preserved and used.
Voice privacy preservation - Health-related information is private and sensitive, and usually must be protected from untrusted servers. In our system, the server is considered semi-trusted, so no information about the contents of the transmitted data is shared with the server, and only the user and her caregiver are aware of the contents.
Search efficiency - The search and matching between the database contents and the given search voice are done by the server. This operation is done by comparing the features of the voice data. These features are extracted only once and stored along with the voice itself in the database, so the server is able to match the given voices efficiently.
II-C Trust model
Database server: The database server is semi-trusted in our system, i.e. the user information is not shared with the server, but the server is trusted to perform the voice matching tasks correctly and seamlessly.
Home IoT device: The home device is considered trusted, as it collects and processes the user’s sensitive and private voice information. It is also trusted to perform the encryption tasks on the recorded voices.
Caregiver and her interface: The caregiver of the patient is considered trusted and has the privilege to receive and use the sensitive voice information of the user. The interface device that the caregiver uses to receive and search the voice data returned from the server is also considered trusted.
III-A Homomorphic encryption
Homomorphic encryption supports addition and multiplication operations over ciphertexts, i.e. heavy operations can be performed by untrusted parties without knowing the shared secret. This method is widely used in data aggregation and computation on privacy-sensitive content. A homomorphic encryption scheme can be described as follows:
A central authority runs a generator which outputs as system public parameters:
are two primes s.t. and ;
Rings , ;
Message space ;
A discrete Gaussian error distribution
with standard deviation.
Suppose user has a public/private key pair such that , with and , and . Let and be two messages encrypted by .
Encryption : , where are samples from .
Decryption : If denoting , then .
Consider the two ciphertexts and .
Addition: Let . If , let ; If, let . Thus, we have .
Multiplication: Let be a symbolic variable and compute . Thus, we have .
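The description above has the shape of a ring-LWE-based somewhat homomorphic scheme. As a reading aid, one standard instantiation of that shape can be written as follows. Every symbol here ($q$, $t$, $n$, $\sigma$, $s$, $a_0$, $a_1$, and so on) is assumed notation chosen for illustration and is not necessarily the authors' own.

```latex
% Assumed notation: primes t < q, ring degree n, error width \sigma.
\begin{align*}
&\text{Parameters: } R = \mathbb{Z}[x]/(x^n+1),\quad R_q = R/qR,\quad R_t = R/tR,\quad \chi = D_{\mathbb{Z}^n,\sigma} \\
&\text{Keys: } s, e \leftarrow \chi,\ a_1 \leftarrow R_q \text{ (uniform)},\quad pk = (a_0, a_1) = \bigl(-(a_1 s + t e),\ a_1\bigr),\quad sk = s \\
&\text{Encryption of } m \in R_t:\quad u, f, g \leftarrow \chi,\quad c = (c_0, c_1) = (a_0 u + t g + m,\ a_1 u + t f) \\
&\text{Decryption of } c = (c_0,\dots,c_\alpha):\quad m = \Bigl(\sum_{k=0}^{\alpha} c_k s^k \bmod q\Bigr) \bmod t \\
&\text{Addition: } c + c' = (c_0 + c'_0,\ \dots,\ c_\alpha + c'_\alpha) \quad \text{(shorter ciphertext padded with zeros)} \\
&\text{Multiplication: } \Bigl(\sum_{k=0}^{\alpha_1} c_k v^k\Bigr)\Bigl(\sum_{k=0}^{\alpha_2} c'_k v^k\Bigr) = \sum_{k=0}^{\alpha_1+\alpha_2} \hat{c}_k v^k,\quad c \cdot c' = (\hat{c}_0,\dots,\hat{c}_{\alpha_1+\alpha_2})
\end{align*}
```

In this sketch, decryption of a fresh ciphertext works because $c_0 + c_1 s = m + t(g + f s - e u)$, so the noise term vanishes modulo $t$ as long as it stays small relative to $q$.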
III-B Voice feature extraction
Psychophysical studies have shown that human perception of the frequency content of speech signals does not follow a linear scale. Thus, for each tone with an actual frequency measured in Hz, a subjective pitch is measured on a scale called the “Mel” scale.
Here, the subjective pitch in Mels corresponds to a frequency in Hz, and the mapping is scaled by a constant. This leads to the definition of the “Mel Frequency Cepstral Coefficient” (MFCC). Fig. 2 depicts the block diagram of the MFCC algorithm.
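For reference, one widely used form of the Hz-to-Mel mapping is given below. The constants 2595 and 700 are the conventional values and are an assumption here, since the constant used by the authors is not stated.

```latex
F_{\mathrm{mel}} = 2595 \, \log_{10}\!\left(1 + \frac{F_{\mathrm{Hz}}}{700}\right)
```

Under this form, a 1000 Hz tone maps to roughly 1000 Mels, and the scale is close to linear at low frequencies and logarithmic at high frequencies.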
The voice signal is first pre-emphasized with a filter to spectrally flatten the signal. Then the pre-emphasized voice signal is separated into short segments called frames. There is usually an overlap between two adjacent frames to ensure stationarity between frames. A Fast Fourier Transform is then applied to the frames, after which the spectrum of each frame is filtered by a set of filters and the power of each band is calculated. Finally, the Mel-frequency cepstrum is calculated from the output power of the filter bank.
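The steps of this pipeline can be sketched in plain Python as follows. This is a minimal illustration under assumptions, not the authors' implementation: the pre-emphasis coefficient 0.97, the Hamming window, the 256-sample frames with 50% overlap, the 2595/700 Mel constants, the cosine-transform form of the last step, and all function names are chosen here for illustration. A naive DFT stands in for the FFT so that the example has no dependencies.

```python
import cmath
import math

def hz_to_mel(f_hz):
    # Conventional Mel mapping (assumed constants).
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def pre_emphasize(signal, alpha=0.97):
    # y[t] = x[t] - alpha * x[t-1] boosts high frequencies.
    if not signal:
        return []
    return [signal[0]] + [signal[t] - alpha * signal[t - 1]
                          for t in range(1, len(signal))]

def split_frames(signal, frame_len, hop):
    # Overlapping frames; hop < frame_len gives the overlap.
    return [signal[s:s + frame_len]
            for s in range(0, len(signal) - frame_len + 1, hop)]

def power_spectrum(frame):
    # Hamming window followed by a naive DFT (stand-in for an FFT).
    n = len(frame)
    windowed = [x * (0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)))
                for i, x in enumerate(frame)]
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                  for i, x in enumerate(windowed))
        spectrum.append(abs(acc) ** 2 / n)
    return spectrum

def mel_filterbank(n_filters, n_bins, sample_rate):
    # Triangular filters whose centres are equally spaced on the Mel scale.
    top = hz_to_mel(sample_rate / 2.0)
    edges_hz = [mel_to_hz(top * i / (n_filters + 1))
                for i in range(n_filters + 2)]
    pos = [hz / (sample_rate / 2.0) * (n_bins - 1) for hz in edges_hz]
    bank = []
    for j in range(1, n_filters + 1):
        lo, mid, hi = pos[j - 1], pos[j], pos[j + 1]
        row = []
        for b in range(n_bins):
            if lo < b <= mid:
                row.append((b - lo) / (mid - lo))
            elif mid < b < hi:
                row.append((hi - b) / (hi - mid))
            else:
                row.append(0.0)
        bank.append(row)
    return bank

def cepstrum(filter_powers, n_coeffs):
    # Cosine transform of the log filter-bank powers (floored to avoid log(0)).
    k_total = len(filter_powers)
    logs = [math.log(max(p, 1e-10)) for p in filter_powers]
    return [sum(logs[k] * math.cos(n * (k + 0.5) * math.pi / k_total)
                for k in range(k_total))
            for n in range(n_coeffs)]

def mfcc(signal, sample_rate, frame_len=256, hop=128, n_filters=36, n_coeffs=36):
    # Returns one list of n_coeffs values per frame.
    frames = split_frames(pre_emphasize(signal), frame_len, hop)
    bank = mel_filterbank(n_filters, frame_len // 2 + 1, sample_rate)
    features = []
    for frame in frames:
        spec = power_spectrum(frame)
        powers = [sum(w * s for w, s in zip(row, spec)) for row in bank]
        features.append(cepstrum(powers, n_coeffs))
    return features
```

Note that this sketch returns one row per frame, which is the transpose of a layout with one row per filter-bank coefficient.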
IV Proposed scheme
In this section, we introduce our proposed data collection, encryption, and storage scheme as well as our proposed keyword search mechanism for query over encrypted voice data.
IV-A Scheme overview
The overview of our scheme is shown in Fig. 1. Consider a scenario where the user has a voice-enabled IoT home device to communicate and transfer her voice data to her personal database, which is accessible by her caregiver. The IoT home device collects and encrypts the user’s voice locally and stores the encrypted voice in the database. The database is always updated with the latest user health data records. Her caregiver uses an interface containing the user’s labeled voice samples to send voice queries to the encrypted voice database and obtain all the similar voices of that user. This case cannot be addressed using speech-recognition techniques because i) the tone of the patient’s voice is also important for us, and ii) there exists extra information in the background, which can reveal important information to the caregiver. There are seven steps in the query process.
Step 1: Voice collection and encryption: The user can record her voice with minimal effort using the voice-enabled smart home device. This device collects and locally encrypts the user’s voice using a symmetric key encryption method.
Step 2: Encrypted voice upload: The in-home device then uploads the encrypted voice to the database server. Because the server is untrusted and must not learn the contents of the uploaded information, the encryption key is not shared with it.
Step 3: Query: The user’s caregiver queries the database using the labeling interface, an application that associates all the voice data in the user’s voice database with suitable text labels defined by the caregiver. The encryption key is shared with this interface, so the caregiver can listen to the stored voices of the user, categorize and label them, and then select the intended voices for the query.
Step 4: Query encryption: The selected keyword voice is then encrypted and sent to the server for keyword matching. As mentioned before, since the server is untrusted, it does not have any information about the contents of the received query.
Step 5: Encrypted voice matching: Using our proposed encrypted voice matching mechanism, the server calculates a similarity factor between the voice query and the user voice data using the method introduced in Section IV-C. The server then returns these encrypted similarity metrics to the caregiver’s interface.
Step 6: Similarity metric decryption and voice data request: The caregiver’s interface then decrypts the information received in the previous step and, based on pre-defined threshold values, detects a set of matching voices. References to these voices are then sent to the server to request the actual voice files.
Step 7: Result delivery: The voice data corresponding to the given references are returned from the server to the caregiver’s interface and, after decryption, to the caregiver.
IV-B Local voice encryption by home device
Since the voice data to be transferred are personal and sensitive, and the server is considered untrusted, the voice files need to be encrypted locally before being transferred to the server. For this purpose, we first apply the Mel Frequency Cepstral Coefficient (MFCC) algorithm, as explained in the preliminaries section, to the voice files to extract the features needed for classifying the voices.
The feature matrix is a two-dimensional array of double values with 36 rows, where the number of columns depends on the length of the input voice file and 36 is the number of filter banks we use. A sample of this matrix is given in equation (2).
The output of this algorithm is then encrypted using a homomorphic symmetric encryption method, where the encryption key is shared only between the user and the caregiver. The actual voice files are also encrypted, but not with the same encryption method, because the server performs no operations on the voice data. For these voices we use the AES encryption method. The encrypted voice samples are then associated with their encrypted feature sets and stored together in the database on the cloud server.
IV-C Encrypted voice feature matching
After the encrypted voice information from the user is stored in the database, the caregiver is able to query this information. To do so, all the voice samples are made available to the caregiver, and since the patient shares the encryption key with the caregiver, she is able to decrypt and listen to the actual voice files recorded by the user one by one. The server performs the following operations to match the received voice features against all of the voice features in the database. Note that each voice sample consists of a matrix of MFCC features, as shown in equation (2), whose size depends on the length of each voice. For simplicity, the in-home device performs a column-based averaging operation to reduce this matrix to a vector and encrypts the vector, as shown in equation (3).
Now, to calculate the similarity metric between two voices, the server calculates element-by-element distances by squaring the differences between corresponding elements, and then averages all the distance values to get a single value as the distance between the two voices, as shown in equation (4).
Note that all these operations on encrypted data are possible because of the homomorphic encryption used. The caregiver’s interface then decrypts this value and compares it with pre-defined threshold values to decide whether the two voice samples are similar enough. If they are, the caregiver requests the original voice sample using the references.
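The averaging and distance computation described above can be sketched on plaintext values as follows. This is a sketch under assumptions: it operates on unencrypted numbers to show the arithmetic only, the function names are hypothetical, and it assumes that the averaging runs across frames so that every sample reduces to one value per MFCC row. In the scheme itself, the same additions and multiplications would be carried out on ciphertexts.

```python
def average_over_frames(feature_matrix):
    # feature_matrix: list of rows (one per MFCC coefficient), each row
    # holding one value per frame. Returns one averaged value per row.
    return [sum(row) / len(row) for row in feature_matrix]

def mean_squared_distance(u, v):
    # Element-by-element squared differences, averaged into a single value.
    if len(u) != len(v):
        raise ValueError("feature vectors must have the same length")
    return sum((a - b) ** 2 for a, b in zip(u, v)) / len(u)

def voice_distance(matrix_a, matrix_b):
    # Distance between two voice samples of possibly different durations.
    return mean_squared_distance(average_over_frames(matrix_a),
                                 average_over_frames(matrix_b))
```

Each squared difference needs one subtraction and one multiplication, which is why the encryption scheme has to support both operations on ciphertexts.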
To evaluate the efficiency and privacy requirements of our proposed scheme, we have conducted experiments on real human and machine-generated voice samples. We have used the online text-to-speech converters from “https://acapela-box.com/” and have also recorded human voices using regular (i.e. not noise-canceling) cellphone and computer microphones in general environments such as homes, offices, and university campuses with different background voices (the voice samples are publicly available for research purposes). The first part of the evaluation, the privacy analysis, confirms that the system satisfies the privacy requirements, so that the server cannot infer the contents of the voice samples. The second part confirms that the scheme is usable, in the sense that the server is able to match the stored voices against the received voice queries.
V-A Privacy analysis
As mentioned in the proposed scheme, our scheme uses homomorphic encryption to preserve the privacy of the user’s sensitive voice features, and AES encryption for the voice data themselves. That means only the encrypted features and the encrypted voice data are stored on the server. The encryption keys are shared only between the user and the caregiver, and the server is not able to infer the information while processing the requests and queries. Homomorphic encryption lets us offload the computationally heavy operations to the server without revealing the actual data to it, which is a big advantage in this case, with sensitive information and computationally heavy operations. Moreover, while the matching operations are performed on the encrypted voice features, all the intermediate and final results remain encrypted, and the server cannot infer any relationship between them.
V-B Efficiency evaluation
The server must be able to accurately match the voice query with the voice data stored on the server. Specifically, we use two different thresholds, learned from training data, for two levels of similarity. First, we check whether the voices are similar enough to be detected as the same word with the same tone, or the same background noise. If this check fails, i.e. the scheme does not detect enough similarity between the tones or background noises, then the second threshold, which is naturally larger than the first one, is used to detect whether the words are the same but said in different moods. If the voices do not match even with this threshold, they are considered different words. For instance, the server should be able to match voices containing the word “happy” said in an excited tone with similar voices and tones using the first threshold, and it should be able to identify the word “happy” said in a bored mood as the same word in a different mood using the second threshold. Formally, for disparate words and tones, the computed distance between two samples of the same word and tone falls below the first threshold, the distance between samples of the same word said in different tones falls between the two thresholds, and the distance between samples of different words exceeds the second threshold.
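The two-level decision can be sketched as a small function. The names `tone_threshold` and `word_threshold`, the returned labels, and the use of `<=` at the boundaries are assumptions for illustration. The score is treated as a distance, so a smaller value means more similar voices.

```python
def classify_match(distance, tone_threshold, word_threshold):
    # The second threshold must be larger than the first one.
    if not tone_threshold < word_threshold:
        raise ValueError("tone_threshold must be smaller than word_threshold")
    if distance <= tone_threshold:
        return "same word, same tone"
    if distance <= word_threshold:
        return "same word, different tone"
    return "different word"
```

For example, with hypothetical thresholds of 1.0 and 4.0, a decrypted distance of 2.5 would be reported as the same word said in a different tone.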
Our results show that the server can distinguish the user voice data in the database for the same word with the same mood and for the same word with two different moods, and that it can also distinguish similar background noises. All the cases are shown in Table I. We use the well-known “Accuracy”, “Sensitivity” and “Specificity” statistical metrics to show our scheme’s performance. Accuracy is defined as (True Positive + True Negative) / (True Positive + False Positive + True Negative + False Negative), sensitivity is True Positive / (True Positive + False Negative), and specificity is True Negative / (True Negative + False Positive).
|Different Background Noise||1.00||0.75|
An important point to mention here is the relationship between specificity and sensitivity. Intuitively, for both thresholds, if we take the threshold too small, none of the voices will be detected as similar, i.e. the number of false negatives will grow while the number of false positives will decrease, which results in lower sensitivity and higher specificity. The opposite holds when we take the thresholds too large. This trade-off is depicted in Fig. 3 for detection of the same word as an example.
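The trade-off can be reproduced with a short sketch that computes the three metrics for a given threshold. The function names are hypothetical, and a pair is predicted to match when its distance is at or below the threshold, which is an assumption about the decision rule.

```python
def confusion_counts(distances, is_match, threshold):
    # Count true/false positives/negatives for one threshold value.
    tp = fp = tn = fn = 0
    for distance, truth in zip(distances, is_match):
        predicted = distance <= threshold
        if predicted and truth:
            tp += 1
        elif predicted and not truth:
            fp += 1
        elif truth:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn

def metrics(tp, fp, tn, fn):
    # Accuracy, sensitivity, and specificity; empty classes yield 0.0.
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "sensitivity": tp / (tp + fn) if tp + fn else 0.0,
        "specificity": tn / (tn + fp) if tn + fp else 0.0,
    }
```

Sweeping the threshold from very small to very large moves the operating point from no predicted matches (sensitivity 0, specificity 1) to all pairs predicted as matches (sensitivity 1, specificity 0).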
VI Related works
IoT provides a perfect platform for smart ubiquitous healthcare using body area sensors, with IoT as the back-end for uploading the data. As an extension to body area sensors, home monitoring systems for patients, and specifically the elderly, allow caregivers to monitor the patients closely and continuously to avoid hospitalization costs [13, 14]. However, the concept of voice-enabled home IoT devices is a fairly new topic and has not been addressed comprehensively.
With advances in speech recognition technologies, many voice-enabled smart home devices have been introduced to help users interact with devices via speech. However, there are several privacy and security concerns when dealing with microphone-enabled smart devices. One study distinguishes active and passive listening and introduces three categories: manually activated, speech activated, and always on. It studies some privacy implications of these categories and notes the ability of these devices to passively listen to all the voices in the environment to detect their wake-up words.
Several prior works address data storage on untrusted cloud servers and query processing on encrypted data. Keyword search on encrypted data has been studied for many different utilities [16, 17, 18, 19, 20, 21]. While standard methods of secrecy hide the content of the message, covert communication in wireless environments [22, 23] and computer networks [24, 25] hides the existence of the communication. More specifically, Shafagh et al. [26] proposed an encryption method for IoT data storage and query processing on untrusted cloud database servers, which applies the encryption at the origin of the data while the server is still able to process the queries. However, these studies concentrate on text as the type of stored data, and voice data types are not widely considered for this purpose.
Mel Frequency Cepstral Coefficient (MFCC) has been widely used for extracting the features of the spoken human voice, mainly for speech recognition purposes [15, 27, 28, 29, 30], but this technique can also be applied to match the features of a voice to associated voices directly, without conversion to text.
Speech recognition and voice-to-text services have been studied and implemented in a wide variety of applications as a promising technique to minimize the storage requirements and facilitate data processing and information extraction. However, the requirement of powerful processing capability has always been a drawback. Recently, local speech recognition engines have been introduced [15, 6], but the capability of such systems is low, and learning and customization of the system based on an individual’s speech is very limited. On the other hand, using cloud computing for this purpose requires disclosure of the information to the server, which may be untrusted. Our proposed scheme addresses these issues by replacing speech recognition with a novel encrypted voice matching technique.
In this paper, we proposed an efficient and privacy-preserving voice-based search scheme. We studied the importance and advantages of the extra information in voice over text for healthcare applications. We employed voice feature extraction and matching algorithms to achieve matching efficiency, and we employed the homomorphic encryption technique to achieve voice privacy. Through evaluation on real data, we showed that our scheme is able to detect the tone and background voices in the patient’s recorded voice data and categorize the voices based on them with an average accuracy of 80.8%. In future work, we will include more characteristics of voice data in our scheme design so that the caregiver can make voice-based searches with these characteristics. We will also explore other techniques for voice feature extraction to achieve higher accuracy.
This research program is supported by Joseph P. Healey Research Grant from UMass Boston, National Science Foundation award number 1618893, and National Science Foundation of Fujian province China (Grant No. 2016J01325).
[1] A. Holopainen, F. Galbiati, and K. Voutilainen, “Use of smart phone technologies to offer easy-to-use and cost-effective telemedicine services,” in Digital Society, 2007. ICDS’07. First International Conference on the. IEEE, 2007, pp. 4–4.
[2] D. Ceer, “Pervasive medical devices: less invasive, more productive,” IEEE Pervasive Computing, vol. 5, no. 2, pp. 85–87, 2006.
[3] M. F. Schober, F. G. Conrad, C. Antoun, P. Ehlen, S. Fail, A. L. Hupp, M. Johnston, L. Vickers, H. Y. Yan, and C. Zhang, “Precision and disclosure in text and voice interviews on smartphones,” PLoS ONE, vol. 10, no. 6, p. e0128337, 2015.
[4] M. Forsberg, “Why is speech recognition difficult,” Chalmers University of Technology, 2003.
[5] J. Halamka, “Early Experiences with Ambient Listening Devices (Alexa and Google Home),” http://geekdoctor.blogspot.com/2017/03/early-experiences-with-ambient.html, 2017, [Online; accessed 20-April-2017].
[6] I. McGraw, R. Prabhavalkar, R. Alvarez, M. G. Arenas, K. Rao, D. Rybach, O. Alsharif, H. Sak, A. Gruenstein, F. Beaufays et al., “Personalized speech recognition on mobile devices,” in Acoustics, Speech and Signal Processing (ICASSP), 2016 IEEE International Conference on. IEEE, 2016, pp. 5955–5959.
[7] D. Boneh, G. Di Crescenzo, R. Ostrovsky, and G. Persiano, “Public key encryption with keyword search,” in International Conference on the Theory and Applications of Cryptographic Techniques. Springer, 2004, pp. 506–522.
[8] R. Lu, X. Liang, X. Li, X. Lin, and X. Shen, “EPPA: An efficient and privacy-preserving aggregation scheme for secure smart grid communications,” IEEE Transactions on Parallel and Distributed Systems, vol. 23, no. 9, pp. 1621–1631, 2012.
[9] S. Memon, M. Lech, and L. He, “Using information theoretic vector quantization for inverted MFCC based speaker verification,” in Computer, Control and Communication, 2009. IC4 2009. 2nd International Conference on. IEEE, 2009, pp. 1–5.
[10] W. Han, C.-F. Chan, C.-S. Choy, and K.-P. Pun, “An efficient MFCC extraction method in speech recognition,” in Circuits and Systems, 2006. ISCAS 2006. Proceedings. 2006 IEEE International Symposium on. IEEE, 2006, pp. 4–pp.
[11] X. Liang, X. Li, K. Zhang, R. Lu, X. Lin, and X. S. Shen, “Fully anonymous profile matching in mobile social networks,” IEEE Journal on Selected Areas in Communications, vol. 31, no. 9, pp. 641–655, 2013.
[12] L. Atzori, A. Iera, and G. Morabito, “The internet of things: A survey,” Computer Networks, vol. 54, no. 15, pp. 2787–2805, 2010.
[13] H. Luo, S. Ci, D. Wu, N. Stergiou, and K.-C. Siu, “A remote markerless human gait tracking for e-healthcare based on content-aware wireless multimedia communications,” IEEE Wireless Communications, vol. 17, no. 1, 2010.
[14] G. Nussbaum, “People with disabilities: assistive homes and environments,” Computers Helping People with Special Needs, pp. 457–460, 2006.
[15] S. D. Dhingra, G. Nijhawan, and P. Pandit, “Isolated speech recognition using MFCC and DTW,” International Journal of Advanced Research in Electrical, Electronics and Instrumentation Engineering, vol. 2, no. 8, pp. 4085–4092, 2013.
[16] G. Fahrnberger, “SIMS: A comprehensive approach for a secure instant messaging sifter,” in Trust, Security and Privacy in Computing and Communications (TrustCom), 2014 IEEE 13th International Conference on. IEEE, 2014, pp. 164–173.
[17] C. Wang, N. Cao, J. Li, K. Ren, and W. Lou, “Secure ranked keyword search over encrypted cloud data,” in Distributed Computing Systems (ICDCS), 2010 IEEE 30th International Conference on. IEEE, 2010, pp. 253–262.
[18] H. Li, Y. Yang, T. H. Luan, X. Liang, L. Zhou, and X. S. Shen, “Enabling fine-grained multi-keyword search supporting classified sub-dictionaries over encrypted cloud data,” IEEE Transactions on Dependable and Secure Computing, vol. 13, no. 3, pp. 312–325, 2016.
[19] B. Zhang and F. Zhang, “An efficient public key encryption with conjunctive-subset keywords search,” Journal of Network and Computer Applications, vol. 34, no. 1, pp. 262–267, 2011.
[20] X. Duan, J. He, P. Cheng, Y. Mo, and J. Chen, “Privacy preserving maximum consensus,” in Decision and Control (CDC), 2015 IEEE 54th Annual Conference on. IEEE, 2015, pp. 4517–4522.
[21] M. Naveed, M. Prabhakaran, and C. A. Gunter, “Dynamic searchable encryption via blind storage,” in Security and Privacy (SP), 2014 IEEE Symposium on. IEEE, 2014, pp. 639–654.
[22] R. Soltani, B. Bash, D. Goeckel, S. Guha, and D. Towsley, “Covert single-hop communication in a wireless network with distributed artificial noise generation,” in Communication, Control, and Computing (Allerton), 2014 52nd Annual Allerton Conference on. IEEE, 2014, pp. 1078–1085.
[23] R. Soltani, D. Goeckel, D. Towsley, B. Bash, and S. Guha, “Covert wireless communication with artificial noise generation,” IEEE Transactions on Wireless Communications, pp. 1–1, 2018.
[24] R. Soltani, D. Goeckel, D. Towsley, and A. Houmansadr, “Covert communications on Poisson packet channels,” in Communication, Control, and Computing (Allerton), 2015 53rd Annual Allerton Conference on. IEEE, 2015, pp. 1046–1052.
[25] R. Soltani, D. Goeckel, D. Towsley, and A. Houmansadr, “Covert communications on renewal packet channels,” in Communication, Control, and Computing (Allerton), 2016 54th Annual Allerton Conference on. IEEE, 2016, pp. 548–555.
[26] H. Shafagh, A. Hithnawi, A. Dröscher, S. Duquennoy, and W. Hu, “Talos: Encrypted query processing for the internet of things,” in Proceedings of the 13th ACM Conference on Embedded Networked Sensor Systems. ACM, 2015, pp. 197–210.
[27] L. Muda, M. Begam, and I. Elamvazuthi, “Voice recognition algorithms using mel frequency cepstral coefficient (MFCC) and dynamic time warping (DTW) techniques,” arXiv preprint arXiv:1003.4083, 2010.
[28] A. Bala, A. Kumar, and N. Birla, “Voice command recognition system based on MFCC and DTW,” International Journal of Engineering Science and Technology, vol. 2, no. 12, pp. 7335–7342, 2010.
[29] C. Goh and K. Leon, “Robust computer voice recognition using improved MFCC algorithm,” in New Trends in Information and Service Science, 2009. NISS’09. International Conference on. IEEE, 2009, pp. 835–840.
[30] J. Martinez, H. Perez, E. Escamilla, and M. M. Suzuki, “Speaker recognition using mel frequency cepstral coefficients (MFCC) and vector quantization (VQ) techniques,” in Electrical Communications and Computers (CONIELECOMP), 2012 22nd International Conference on. IEEE, 2012, pp. 248–251.