The Internet of Things (IoT) has become a reality, with many smart devices communicating autonomously to provide users with seamless services like climate control at home, security systems in the workspace, and traffic management in public areas. To provide those services, IoT devices use multiple sensors to “perceive” their environment and act on it to make appropriate decisions.
Numerous security incidents (Kolias et al., 2017) have shown that establishing and maintaining secure communications between IoT devices is challenging. First, centralized approaches such as public key infrastructure (PKI) are impractical due to their high complexity and limited scalability (Elkhodr et al., 2016). Second, many IoT devices do not feature user interfaces, making it impossible to establish security using traditional mechanisms like passwords (Fomichev et al., 2018). To address this problem, recent research proposed using device sensor readings of the ambient environment, often called context information (Perera et al., 2014). This information is used to build context-based security schemes that operate without user interaction, such as zero-interaction pairing (ZIP) (Schürmann and Sigg, 2013; Miettinen et al., 2014; Xi et al., 2016) and zero-interaction authentication (ZIA) (Truong et al., 2014; Shrestha et al., 2014; Karapanos et al., 2015). We further refer to both as zero-interaction security (ZIS) schemes.
The security of ZIS schemes is based on the assumption that context information has high entropy, changes frequently, and is unpredictable from outside the specified environment (Sigg, 2011). Context information, obtained from the ambient environment of an IoT device, is used to derive a shared secret key between colocated devices in ZIP or to serve as a proof of physical proximity between devices in ZIA. For example, similarity in ambient audio sensed by two colocated devices was successfully used in both ZIP (Schürmann and Sigg, 2013) and ZIA (Karapanos et al., 2015), with the latter scheme becoming part of a commercial product (Futurae Technologies AG, 2017). Other research explored the applicability of different context information in ZIS schemes: temperature, humidity, pressure, and luminosity (Shrestha et al., 2014; Miettinen et al., 2014), magnetic fields, acceleration and rotation rates (Shrestha et al., 2016a; Schürmann et al., 2017), as well as observed WiFi and Bluetooth beacons (Truong et al., 2014).
ZIS schemes have three main advantages compared to traditional approaches. First, they offer high usability by minimizing user involvement in pairing and authentication procedures. Second, ZIS schemes can scale to a large number of devices, including those that do not share a common sensing modality (Han et al., 2018). Third, ZIS schemes can be built on top of devices’ sensing capabilities, reducing modification overhead and facilitating interoperability.
Despite the great potential of ZIS schemes to enable a more secure and usable IoT, prior work raised questions about their practical applicability (Shepherd et al., 2017) and security soundness (Shrestha et al., 2018; Shrestha et al., 2016c). The evaluation of the proposed schemes in realistic IoT scenarios is crucial yet mostly missing. In our work, we fill this gap by conducting the first large-scale comparative study of existing ZIS schemes. We reproduce five state-of-the-art ZIS schemes (Schürmann and Sigg, 2013; Truong et al., 2014; Miettinen et al., 2014; Shrestha et al., 2014; Karapanos et al., 2015) and evaluate their ability to distinguish authorized and unauthorized devices on comprehensive datasets of context information collected in three realistic IoT scenarios: a connected car, a smart office, and a smart office with mobile heterogeneous devices. Our evaluation reveals trade-offs between different kinds of context information and context features, and gives insights into pitfalls of the reproduced schemes in practice.
In our scenarios, we collect seven different kinds of context information, given in Table 1: audio, WiFi and Bluetooth Low Energy (BLE) beacons, barometric pressure, humidity, luminosity, and temperature, from which we compute 16 distinct context features. We implement IoT scenarios by distributing sensing devices among various spots, each reflecting a potential IoT functionality like a smart light, with people following their daily routine in these scenarios. When evaluating mobility, we additionally supply users with different sensing devices.
From our car, office and mobile scenarios, we collect context information datasets of 1.7, 214 and 23.2 GB, respectively, and annotate them with the ground truth. To our knowledge, these are the largest datasets of annotated context information collected in the ZIS domain so far. Our analysis reveals that many of the reproduced schemes are challenged by our scenarios and often cannot maintain the classification accuracy found by their original authors, reaching error rates between 0.6% and 52.8%. We also observe that many schemes have limited adaptability to difficult circumstances and frequently are not robust, with parameters optimal for one scenario leading to notably lower classification performance in another.
To facilitate future research, we publicly release the collected context information in an anonymized form, ground truth information, the computed context features, machine learning datasets and results for all reproduced schemes, as well as the full source code used in data collection and evaluation procedures, including metadata for reproducibility (cf. Section 3.5) (Fomichev et al., 2019b). We further enhance reproducibility by releasing raw audio recordings from the mobile scenario (Fomichev et al., 2019a), making our dataset the first of its kind in the domain of ZIS.
In summary, we make the following contributions:
We evaluate the scheme performance and robustness for use in different scenarios. We also provide insights into pitfalls of the reproduced schemes.
We release the first open-source toolkit, containing datasets of diverse context information, including audio, together with the source code used to collect these data and implementations of the five ZIS schemes.
In this section, we provide the terminology used in this paper, present our system and threat model, and describe the ZIS schemes that we reproduced and evaluated.
We start by clarifying the relevant terms in the domain of ZIS.
Context information. We define context information as the data collected from device sensors (e.g., microphones, light sensors, etc.), augmented with metadata like timestamps (Perera et al., 2014).
Context. We refer to a set of context information collected by a device from its ambient environment over time as the context of the device.
Colocation. We define colocation as a set of devices residing in the same physical space. In our scenarios, the spaces are different cars and offices; thus, devices within the same car or office are colocated, and all others are non-colocated. The meaning of colocation depends heavily on the use case of the ZIS scheme. In the case of wearables, colocated devices are on the same body (Brüsch et al., 2018), whereas for a smart home, colocated devices are inside a house (Han et al., 2018).
Context feature. We define a context feature as a concise context property computed from context information. Context features are based on a snapshot of context information (Schürmann and Sigg, 2013; Truong et al., 2014; Shrestha et al., 2014; Karapanos et al., 2015) or on relative changes of context information over time (Miettinen et al., 2014). They calculate a distance or similarity metric between two samples of context information (Truong et al., 2014; Shrestha et al., 2014; Karapanos et al., 2015), or derive a binary fingerprint vector from a sample of context information (Schürmann and Sigg, 2013; Miettinen et al., 2014).
2.2. System and Threat Model
We assume an IoT scenario containing a number of devices that are colocated and equipped with a set of sensors to collect context information. The goal of ZIS schemes is to have two colocated devices establish a secure connection (ZIP) or a proof of proximity (ZIA) without user interaction, utilizing context features to secure the process. We assume no established infrastructure and, in the case of ZIP, no prior trust between devices.
Our adversary is based on the models used in the reproduced ZIS schemes. The adversary is an IoT device located in an adjacent car or office. This device can be benign, accidentally trying to pair or authenticate with proximate devices in its wireless range (e.g., an IoT device in a neighbor’s car), or it can be malicious, intentionally trying to pair or authenticate with non-colocated devices. The adversary is non-colocated with benign devices; thus, it can neither observe their context information nor compromise benign devices to circumvent a ZIS scheme. However, the adversary is physically close to benign devices (i.e., adjacent car or office), equipped with the same sensing hardware to collect context information, and can communicate with them over a wireless link.
The goal of the adversary is to obtain similar enough context information to fool benign devices into believing that it is colocated with them. Compared to the threat models of the reproduced schemes, our adversary is more powerful, as it possesses two extra capabilities. First, it remains permanently in close proximity to benign devices, including times of low context activity such as during the night. Second, due to the symmetric deployment of devices in our scenarios, the adversary has much better chances of following the same trends in context information (e.g., lighting conditions) as benign devices.
We purposely make our adversary powerful to evaluate the scheme performance in challenging scenarios. This allows us to establish the worst-case baseline adversary, facilitating comparison of the reproduced ZIS schemes (discussed in Section 5), and to gain first insights into possible attack vectors for an active adversary (Shrestha et al., 2018).
2.3. Reproduced ZIS schemes
To select ZIS schemes for our study, we surveyed frequently cited schemes published at top security venues in the last five years. We selected schemes that targeted IoT scenarios and utilized different context information or context features. We excluded schemes based on behavioral biometrics, e.g., gait (Brüsch et al., 2018), gesture (Shrestha et al., 2016b) or keystroke dynamics (Mare et al., 2014), as these schemes are designed for wearable IoT scenarios. In the end, we arrived at five schemes, which we reproduced from the ground up, relying on the help of the original authors to ensure the correctness of our implementations. We briefly introduce each scheme in its respective result subsection (cf. Section 4) and refer to the Appendix for detailed descriptions.
3. Study Design
We designed our study to cover the majority of relevant context information used in current ZIS schemes. We selected three realistic IoT scenarios: in the first two, we used identical sensing devices to collect context information, minimizing the effects of hardware variations on our results. In the third scenario, we used different sensing devices to evaluate the impact of device heterogeneity. This section describes the design and conduct of our experiments, as well as ethical concerns when dealing with sensitive personal data collected in our study.
3.1. Data Collection
The goal of our experiment was to collect a comprehensive real-world dataset of context information that can serve as a baseline for comparing current and future ZIS schemes. In the first two scenarios, we collected data using a Texas Instruments SensorTag CC2650 and a Raspberry Pi 3. Audio data was collected using a Samson Go USB microphone, which recorded a mono audio stream with a 16 kHz sampling rate and encoded it using the lossless FLAC format. The Raspberry Pi also collected all visible wireless access points (APs) and BLE devices, including their signal strength, every ten seconds. The remaining context information (accelerometer, barometer, gyroscope, humidity, light intensity, magnetometer, and temperature) was recorded using the SensorTag, connected to the Raspberry Pi using BLE. (Footnote 1: While accelerometer, gyroscope and magnetometer are not used by any scheme, we collected their data for use in future schemes.) Sensor data was recorded with a sampling rate of 10 Hz.
In the third scenario, we additionally used Samsung Galaxy S6 smartphones and Samsung Gear S3 smartwatches to collect the same context information. Since those devices are not equipped with temperature and humidity sensors, we combined them with a RuuviTag+. We tried to obtain the same sampling rate on all our devices; however, the Galaxy S6 limits barometric pressure and luminosity readings to 5 Hz. The summary of used sensing devices and sampling rates is given in Table 19 in the Appendix. All events that could influence the context information (e.g., windows/doors being opened or closed, people entering or leaving the recording area, traces of mobile devices, etc.) were documented automatically or by hand in a ground truth sheet.
3.2. Scenario 1: Car
In the first scenario, we used two cars from different manufacturers. Each car was equipped with six sensing devices distributed inside the vehicle as shown in Figure 1. The devices occupied similar spots in both cars: one device was placed on top of the dashboard facing the windshield, inside the glove compartment, in between the front seats facing upwards, attached to each handhold above the two rear doors, and put in the middle of the trunk. This placement covers all prominent spots one might expect a sensor or a personal device inside a car (cf. Table 20 in the Appendix).
After setting up the cars, we drove a predefined route of three hours and () on the afternoon of an autumn day. The time was chosen to ensure that the collection began while the sun was visible and ended after sunset, to collect a variety of lighting conditions. The route included city traffic, country roads, and highway (cf. Figure 10 in the Appendix for a map). We drove both cars close to each other within a distance of (), which we varied from time to time. In addition, we took a short break, with the cars parked side by side.
The challenge for the ZIS schemes is to identify colocated devices in a single car, while excluding devices from different cars that might be nearby or just listening to the same radio station.
3.3. Scenario 2: Office
A typical application for IoT devices is the deployment in a smart home or office. To collect realistic context information in this scenario, we deployed eight sensing devices in three office rooms as shown in Figure 2. We put the devices in similar places, representing typical IoT spots: one device was attached to the main screen of a workplace (smart workstation, several spots), above the windows (smart shades), near the ceiling lights (smart lights), in a closed cupboard (smart device), near the door at around two meters height (smart room/motion sensor) and in a corner at around 2.5 meters (environmental sensor). The summary of device locations in the office scenario is given in Table 21 in the Appendix.
We collected context information for one full week, resulting in five work days with people present and two days of the weekend, when offices 1 and 2 were empty, and one person was working in office 3. Offices 1 and 2 were adjacent and connected with a door, which was closed most of the time. Office 3 was on the opposite side of the floor. All three rooms had a similar setup in terms of size and position of furniture but a different number of participants working in them (one in office 2, two in office 1 and three in office 3).
The collected dataset is intended for testing ZIS schemes designed for smart homes and offices. The challenge here is to distinguish between the three different rooms. Ideally, a scheme identifies all colocated devices in one room but excludes all others.
3.4. Scenario 3: Office with Mobile Heterogeneous Devices (Mob/het)
We extended the office scenario by including both static devices permanently residing inside offices, and mobile devices carried by users (cf. Figure 2). We added a number of appliances (i.e., vacuum robot and its station in office 1, fan in office 2, coffee machine in office 3), facilitating device mobility when users move to use them. Each office was equipped with four static devices (SensorTag), covering similar spots and the appliances: one device was attached to the main screen of a workplace (smart workstation, several spots), near a power plug (smart plug), on top of a vacuum robot station (smart robot station) and coffee machine (smart coffee maker), near a fan (smart fan). We equipped four participants with three mobile devices each: a laptop (with attached smartphone to collect context information), smartphone and smartwatch. We also placed a smartphone on top of the robot vacuum cleaner. Device locations are summarized in Table 22 in the Appendix.
We collected context information for eight hours from 9 am till 5 pm, representing a typical working day. Over the course of the day participants moved freely between the offices to get a cup of coffee, have a meal or attend a meeting, each time carrying a set of their mobile devices. We also moved the vacuum robot between the offices, letting it autonomously run a complete cleaning cycle.
Similarly to the office scenario, the challenge for ZIS schemes is to distinguish devices present in the same office, while excluding devices in others.
3.5. Reproducibility and Re-usability
In total, our dataset contains 239 GB of context information, including more than 4250 hours of audio recordings, over 1 billion sensor readings, and over 12 million WiFi and BLE beacons. Computing the context features of the reproduced schemes took over 300 000 CPU hours. The audio-based features were computed using Matlab on a high-performance cluster. The remaining features were implemented in Python on a high-performance server. After compression, the results occupy almost 1 TB of disk space. This includes the computed features, aggregated statistics, and metadata for reproduction and validation following the recommendations by Benureau and Rougier (2018).
To facilitate future reuse, we release the source code of the entire data collection and evaluation stack, as well as the collected context information in an anonymized form, all intermediate and final data files (including machine learning models) and the code used to generate the visualizations. Privacy concerns prevent us from releasing the audio data recorded in the Car and Office scenarios, but we are able to provide researchers with the audio recordings from the Mobile scenario upon request (Fomichev et al., 2019a). See (Fomichev et al., 2019b) for an index of all released data and code.
3.6. Ethical Considerations
The study was approved by our institutional ethical review board, data protection officer, and workers’ council. Participants gave informed consent for the collection, use, and release of the data. During collection, the audio data was encrypted with keys controlled by the affected participants, requiring their explicit consent and cooperation to decrypt the data for processing. In the mob/het scenario, we gave participants the chance to inspect the recordings before obtaining informed consent for their release.
|Schürmann, Sigg (Schürmann and Sigg, 2013)||4.2||Generates fingerprints with good randomness properties, but shows varying performance on subscenarios, and provides only limited robustness.||0.154||0.241||0.140|
|Truong et al. (Truong et al., 2014)||4.4||Achieves the best error rates in office and mob/het scenario, but shows low robustness and high reliance on audio feature, and struggles with heterogeneous settings.||0.104||0.069||0.123|
In this section, we report on the performance in distinguishing colocated and non-colocated devices for the five reproduced schemes (cf. Table 2). The performance evaluation of each scheme is structured as follows. We first provide a concise overview of the scheme by explaining the context features used to distinguish colocated and non-colocated devices. Then, we explain the methodology of the original scheme and provide details of our evaluation. Next, we present and interpret the performance results of the scheme for each scenario. To quantify the performance, we compute the equal error rate (EER), which is the point at which the false accept rate (FAR) equals the false reject rate (FRR). In addition, we assess how much usability the schemes can deliver if a specific security level is required by setting a number of target FARs (between 0.1% and 5%) and analyzing the resulting FRRs.
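The metrics above can be illustrated with a minimal sketch. It assumes that a pair of devices is accepted as colocated when its similarity score is at or above a threshold, and that every observed score is tried as a candidate threshold; the function names and the sweep strategy are illustrative assumptions, not the evaluation code of the study. When FAR and FRR do not coincide exactly, the sketch averages them at the closest point.

```python
def far_frr(colocated, non_colocated, threshold):
    """Return (FAR, FRR) when scores >= threshold are accepted as colocated."""
    false_accepts = sum(1 for s in non_colocated if s >= threshold)
    false_rejects = sum(1 for s in colocated if s < threshold)
    return false_accepts / len(non_colocated), false_rejects / len(colocated)


def equal_error_rate(colocated, non_colocated):
    """Sweep observed scores as thresholds; return (EER, threshold).

    The EER is taken where |FAR - FRR| is smallest, averaging the two
    rates if they do not match exactly.
    """
    best = None
    for t in sorted(set(colocated) | set(non_colocated)):
        far, frr = far_frr(colocated, non_colocated, t)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2, t)
    return best[1], best[2]


def frr_at_target_far(colocated, non_colocated, target_far):
    """Smallest FRR among thresholds whose FAR does not exceed target_far."""
    candidates = sorted(set(colocated) | set(non_colocated))
    candidates.append(max(candidates) + 1.0)  # threshold that rejects everything
    rates = [far_frr(colocated, non_colocated, t) for t in candidates]
    return min(frr for far, frr in rates if far <= target_far)
```

Fixing a target FAR and reading off the FRR, as in the last function, expresses the usability cost of a required security level.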
We evaluate the scheme robustness by analyzing the increase in error rates (either FAR or FRR) over the original EER when parameters found to be optimal in one scenario are applied to another. This simulates a scheme being used in a scenario it was not trained on, like an IoT device optimized for office use being deployed in a car. We further summarize each studied scheme by comparing our results with the original findings and providing key takeaways from our evaluation. This facilitates a direct comparison of the different schemes in our scenarios.
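As a rough illustration of this robustness check, the sketch below applies a threshold tuned on a source scenario to the scores of a target scenario and reports how far FAR and FRR move away from the target's own EER, in percentage points. The function names and the acceptance rule (score at or above the threshold) are assumptions made for the example.

```python
def far_frr(colocated, non_colocated, threshold):
    """Return (FAR, FRR) when scores >= threshold are accepted as colocated."""
    far = sum(1 for s in non_colocated if s >= threshold) / len(non_colocated)
    frr = sum(1 for s in colocated if s < threshold) / len(colocated)
    return far, frr


def error_increase_pp(target_colocated, target_non_colocated,
                      source_threshold, target_eer):
    """Change of FAR and FRR relative to the target EER, in percentage points.

    source_threshold is the EER threshold found on another scenario;
    target_eer is the EER obtained on the target scenario itself.
    """
    far, frr = far_frr(target_colocated, target_non_colocated, source_threshold)
    return (far - target_eer) * 100, (frr - target_eer) * 100
```

A scheme would count as robust between two scenarios if the larger of the two returned values stays small.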
We introduce subscenarios to investigate the impact of changes in the environment (e.g., time of day, moving vs. parked cars) on the scheme performance. A subscenario represents a subset of context information collected at a specific stage in the scenario. For the car scenario, we distinguish three subscenarios: the city and highway subscenarios contain context information of the cars driving inside city limits or on the highway, respectively, and the parked subscenario includes context information from the time the cars were parked. Similarly, we construct three subscenarios for the office scenario: the weekday subscenario contains context information collected from Monday to Friday from 8 am to 9 pm, the night includes context information for all seven days from 9 pm to 8 am, and the weekend consists of context information from Saturday and Sunday in the timeframe from 8 am to 9 pm. We omit the subscenario evaluation in the mob/het scenario as there were no specific stages in this scenario.
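The office subscenarios can be expressed as a simple labelling rule on timestamps. The sketch below is one possible reading of the boundaries: it assumes that 8 am is the first instant of the daytime window and 9 pm the first instant of the night, which the text does not specify.

```python
from datetime import datetime


def office_subscenario(ts: datetime) -> str:
    """Label a timestamp as 'night', 'weekday' or 'weekend'.

    Night covers 9 pm to 8 am on all seven days; the remaining hours
    are split by day of week (Monday to Friday vs. Saturday and Sunday).
    """
    if ts.hour >= 21 or ts.hour < 8:
        return "night"
    # datetime.weekday(): Monday is 0, Sunday is 6
    return "weekend" if ts.weekday() >= 5 else "weekday"
```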
We assess the performance of all schemes except (Truong et al., 2014) and (Shrestha et al., 2014) on time intervals of 5, 10, 15, 30, 60 and 120 seconds with the length denoted . The interval represents a timeframe over which context information is aggregated to compute a context feature, e.g., a 5 second audio snippet or a 30 second WiFi capture. (Truong et al., 2014) is evaluated on time intervals of 10 and 30 seconds, as the scheme is less well-suited to an arbitrary interval length due to the used features, while (Shrestha et al., 2014) does not use any intervals.
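To make the notion of an interval concrete, the following sketch cuts a stream of samples into consecutive windows of a given length. Non-overlapping windows and dropping a trailing partial window are assumptions of this example; the study may have aggregated its data differently.

```python
INTERVALS_S = (5, 10, 15, 30, 60, 120)  # interval lengths in seconds


def split_into_intervals(samples, sampling_rate_hz, interval_s):
    """Split samples into consecutive, non-overlapping intervals.

    A trailing chunk shorter than one full interval is dropped.
    """
    n = int(sampling_rate_hz * interval_s)
    return [samples[i:i + n] for i in range(0, len(samples) - n + 1, n)]
```

For audio sampled at 16 kHz, a 5 second interval corresponds to 80 000 samples per window.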
4.1. Karapanos et al.
Karapanos et al. (Karapanos et al., 2015) proposed using maximum cross-correlation between snippets of ambient audio from two devices to decide if they are colocated. The cross-correlation is computed on a set of one-third octave bands (ANSI/ASA S1.11, 2004) and averaged to a similarity score. One-third octave bands split the audible spectrum (20 Hz to 20 kHz) into 32 frequency ranges of different sizes. To prevent erroneous authentication when audio activity is low, a power threshold is applied to discard audio snippets with insufficient average power. The similarity score is checked against a fixed similarity threshold to decide if two devices are colocated, and can thus be authenticated. Tuning the similarity threshold allows trading usability for security and vice versa. The authors evaluated their scheme in several scenarios such as a quiet office, lecture hall, and café. The scheme details are given in Appendix A.1.
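The decision logic described above can be sketched as follows. This is a simplified illustration, not the authors' implementation: it assumes the audio has already been split into one-third octave bands (the filtering is out of scope here), computes power in dB relative to unit amplitude, and normalizes the cross-correlation by the signal energies. All function and parameter names are hypothetical.

```python
import math


def average_power_db(samples):
    """Average power of a snippet in dB (relative to unit amplitude, an assumption)."""
    mean_square = sum(s * s for s in samples) / len(samples)
    return 10 * math.log10(mean_square) if mean_square > 0 else float("-inf")


def max_cross_correlation(x, y, max_lag):
    """Maximum normalized cross-correlation over lags in [-max_lag, max_lag]."""
    norm = math.sqrt(sum(a * a for a in x) * sum(b * b for b in y))
    if norm == 0:
        return 0.0  # at least one signal is silent
    best = float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        corr = sum(x[i] * y[i + lag]
                   for i in range(len(x)) if 0 <= i + lag < len(y))
        best = max(best, corr / norm)
    return best


def similarity_score(bands_x, bands_y, max_lag):
    """Average the per-band maximum cross-correlation into one score."""
    scores = [max_cross_correlation(bx, by, max_lag)
              for bx, by in zip(bands_x, bands_y)]
    return sum(scores) / len(scores)


def decide(raw_x, raw_y, bands_x, bands_y,
           power_threshold_db, similarity_threshold, max_lag):
    """Return None if the snippet is discarded, else True/False for colocated."""
    if min(average_power_db(raw_x), average_power_db(raw_y)) < power_threshold_db:
        return None  # too quiet: no decision is made
    return similarity_score(bands_x, bands_y, max_lag) >= similarity_threshold
```

Returning `None` for quiet snippets keeps discarded comparisons separate from rejections, which mirrors the availability versus security trade-off of the power threshold.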
To investigate the scheme performance we compute similarity scores between colocated and non-colocated devices on different interval lengths. We increase the minimum length of audio snippet and maximum correlation lag to achieve a comparable level of synchronization to the original implementation. These changes have a negligible impact on the similarity score computation, as stated in Appendix A.1. To understand factors affecting the performance, we analyze the behavior of similarity scores on different octave bands.
|15||0.026||0.043||0.002||0.060||0.128||0.126||0.123||0.129||0.170|
Footnote 2: In cases where FAR and FRR do not match to three digits after the decimal, we average them and denote the result as EER.
We observe EER between 0.006 and 0.050, decreasing with rising interval length (cf. Table 3). To understand this behavior we compute the distributions of colocated and non-colocated similarity scores for each interval. Overlaps of these distributions explain the corresponding error rates: in the car scenario, the overlaps range from 1.1% to 8.5%. We observe a clearer separation between colocated and non-colocated similarity scores at longer intervals, caused by a sharper drop of non-colocated similarity scores. When targeting low FAR, the resulting FRR are below 0.2 on the intervals above , dropping rapidly with a growing FAR (cf. (a)).
Our octave band analysis shows the profound influence of lower frequencies (below 315 Hz) caused by a running car on the overall similarity score. This explains the lowest EER reaching 0.0 in the uniform sound environment of a highway (cf. Table 3). The more diverse sound environment of a city shows a severalfold increase in EER compared to the highway subscenario. Surprisingly, in a low-activity environment of parked cars, the EER are only a few percentage points above the city subscenario. Investigating this phenomenon revealed that the power threshold discards up to 90% of similarity scores in the parked subscenario, retaining only those scores that resulted from intense audio activity.
Applying office and mob/het EER thresholds to the car dataset leads to a marginal increase in error rates below 1 percentage point on the intervals to for the office, and on for the mob/het, with other intervals showing severalfold growth in error rates. Among subscenarios, we see limited robustness between quiet (parked) and active environments (city and highway) at , as well as when applying city thresholds to the highway dataset for .
In the office scenario, we observe EER between 0.098 and 0.141, decreasing with growing interval length (cf. Table 3). We attribute these EER to larger overlaps between colocated and non-colocated classes, ranging from 19% to 28%. We see a clear trend of higher similarity scores between non-colocated devices in adjacent offices (offices 1 and 2 in Figure 2). Our octave band analysis reveals close resemblance between these scores on lower frequencies below 250 Hz and on higher frequencies above 1250 Hz. Thus, both low frequencies penetrating adjacent offices and high frequency sounds like a police siren can increase non-colocated similarity scores. When targeting low FAR, the resulting FRR start around 0.9 and never drop below 0.2 (cf. (b)).
We observe that higher audio activity of weekdays results in lower EER on the intervals below . However, on longer intervals the EER of low-activity environments (i.e., night and weekend) become lower compared to the weekday. Investigating this phenomenon in more detail reveals two reasons for this behavior. First, the power threshold retains a few similarity scores that originate from intense audio activity in the night and weekend subscenarios. Second, in low-activity environments sounds are infrequent, localized and short-term, making them easier to capture on longer intervals by colocated devices and less prone to be leaked to non-colocated devices.
Applying car and mob/het EER thresholds to the office dataset results in a minor increase in error rates below 2 percentage points on the intervals for the car, and on for the mob/het, with other error rates rising a few extra percentage points. In subscenarios, we observe robustness only between low-activity environments of night and weekend, showing an increase in error rates below 2 percentage points on all intervals.
Investigating similarity scores in the mob/het scenario revealed that 75% to 100% of the scores generated by smartphones and watches were discarded by the default power threshold (40 dB) in the absence of intense audio activity (e.g., running vacuum robot). We adjusted the power thresholds for smartphones and watches to 38 and 35 dB respectively, significantly increasing the scheme availability in the cases of medium audio activity (e.g., low-voiced conversation), while still discarding the similarity scores from quiet environments.
With the new power thresholds, we observe increased EER between 0.157 and 0.183, rising with interval length, reversing the trend seen in the car and office scenarios (cf. Table 3). Once again, higher EER are explained by larger overlaps between colocated and non-colocated classes, ranging from 33% to 36%. When targeting low FAR, the resulting FRR vary almost linearly between 0.9 and 0.37 (cf. (c)).
We found that microphone diversity and device mobility are likely reasons for the reversed EER trend. The similarity scores among heterogeneous devices are generally lower, decreasing significantly towards longer intervals. Our analysis suggests that the main reason for these lowered scores is the diverse sensitivity and frequency response of heterogeneous microphones (Kardous and Shaw, 2014). We empirically observed that smartwatch microphones are optimized for human voice but rather insensitive to low frequencies, while on smartphones low frequencies cause a lot of noise in recordings, and the USB microphones show the best signal quality on a wide frequency range. On longer intervals, device mobility further increases signal variation: the probability of capturing a unique signal (e.g., a keystroke by smartwatch) or wide-band scratching noises (e.g., smartphone rubbing against a pocket) increases.
Applying car and office EER thresholds to the mob/het dataset leads to a minor increase in error rates up to 1.5 percentage points on the intervals for the car, and for the office, with other intervals showing growth in error rates of several additional percentage points.
Our results show that the scheme by Karapanos et al. can reliably distinguish colocated and non-colocated devices in the car scenario, but its performance degrades in the office and mob/het scenarios. We generally observe higher EER than the authors, who report an EER of 0.002. Possible reasons for that are the increased distance between colocated devices and sustained closeness of non-colocated devices in our scenarios.
When the scheme is used among homogeneous devices (car and office scenarios) we observe better performance with increasing interval length and more intense audio activity. The difference between car and office EER is due to a smaller distance between colocated devices in the car, and more intense audio activity, especially on lower frequencies (highway). We see that highway EER decrease marginally towards longer intervals, suggesting the use of short- to medium-sized intervals in active environments, reducing the run-time overhead of the scheme.
With heterogeneous devices (mob/het scenario) using longer intervals decreases the scheme performance, and intense audio activity is only beneficial if heterogeneous microphones can similarly record it (e.g., human voice), otherwise the performance will further decrease, especially on longer intervals. Considering that built-in microphones in mobile devices are user-interaction oriented, the scheme can benefit from shorter intervals and audio activity in the frequency range of human voice in heterogeneous settings.
The power threshold allows the scheme to cope with quiet environments, sometimes at the price of excluding a significant portion of the dataset (e.g., parked car), trading off availability for security. However, as we have seen in the mob/het scenario, the power threshold proposed by the authors severely decreases scheme availability already in the cases of medium audio activity, urging the need to carefully select this parameter, depending on the characteristics of the microphones.
The scheme consistently shows robustness on medium-sized intervals () among our scenarios, suggesting that it can potentially adapt to new environments on these intervals.
4.2. Schürmann and Sigg
Schürmann and Sigg (Schürmann and Sigg, 2013) propose encoding a snippet of ambient audio into a binary fingerprint to pair two devices. The generated fingerprint consists of 16 individual shorter fingerprints that reflect the energy changes of successive frequency bands in the audio snippet over shorter timeframes. The similarity between the fingerprints derived by two devices informs a pairing decision. These fingerprints need to exhibit good randomness in order to secure a key establishment procedure between devices via fuzzy commitments. The authors evaluated their scheme in a series of deployments, ranging from staged lab measurements to recordings in a busy canteen and near a road. A detailed description of the scheme can be found in Appendix A.2.
We evaluate the performance of the scheme by generating fingerprints using different intervals . Due to hardware constraints, we use a lower audio sampling rate, which reduces the length of the fingerprint from 512 to 496 bits. This change introduces a marginal deviation from the original implementation as detailed in Appendix A.2. To evaluate the similarity of the generated fingerprints of two devices, we calculate the similarity percentage as .
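As a minimal sketch of the fingerprint comparison, the following assumes that the similarity percentage is the share of bit positions on which two equal-length fingerprints agree (i.e., 100% minus the normalized Hamming distance). This definition and the function name are assumptions made for illustration.

```python
def similarity_percentage(fp_a, fp_b):
    """Percentage of positions on which two fingerprints agree.

    Fingerprints are sequences of bits, e.g. strings of '0'/'1' or lists of ints.
    """
    if len(fp_a) != len(fp_b):
        raise ValueError("fingerprints must have the same length")
    if len(fp_a) == 0:
        raise ValueError("fingerprints must not be empty")
    matches = sum(1 for a, b in zip(fp_a, fp_b) if a == b)
    return 100.0 * matches / len(fp_a)
```

Two independent, uniformly random fingerprints would be expected to agree on about half of their bits under this definition.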
The scheme uses a fixed similarity threshold that distinguishes colocated from non-colocated devices. In addition, we investigate the randomness of the fingerprints by interpreting them as random walks, with 1- and 0-bits representing steps in the positive and negative direction (Brüsch et al., 2018). The outcomes will follow a binomial distribution if the fingerprints are uniformly random. We also investigate bit transition probabilities by interpreting each bit of the fingerprint as a state in a Markov chain.
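Both randomness checks can be sketched in a few lines. The sketch assumes fingerprints are lists of 0/1 integers and models the Markov analysis as a first-order chain over bit values; the function names are hypothetical and the exact procedure used in the evaluation may differ.

```python
from math import comb


def walk_endpoint(bits):
    """Endpoint of the random walk: +1 for every 1-bit, -1 for every 0-bit."""
    return sum(1 if b else -1 for b in bits)


def expected_endpoint_pmf(n):
    """Endpoint distribution for n uniformly random bits.

    With k ones the endpoint is 2k - n, and k follows Binomial(n, 0.5).
    """
    return {2 * k - n: comb(n, k) / 2 ** n for k in range(n + 1)}


def transition_probabilities(bits):
    """Estimate P(next bit | current bit) from consecutive bit pairs."""
    counts = {(p, c): 0 for p in (0, 1) for c in (0, 1)}
    for prev, cur in zip(bits, bits[1:]):
        counts[(prev, cur)] += 1
    probs = {}
    for prev in (0, 1):
        total = counts[(prev, 0)] + counts[(prev, 1)]
        for cur in (0, 1):
            # None marks a state that never occurs as a predecessor.
            probs[(prev, cur)] = counts[(prev, cur)] / total if total else None
    return probs
```

For unbiased fingerprints, the histogram of `walk_endpoint` values over many fingerprints should approach `expected_endpoint_pmf`, and all four transition probabilities should be close to 0.5.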
We observe eer between 0.154 and 0.271, decreasing with increasing interval length (cf. Table 4). These error rates correspond to the observed overlaps in similarity between colocated and non-colocated devices, which range between 30% and 51%. When optimizing for a low far, the resulting frr exceed 0.8 for certain parameters and never drop significantly below 0.3 (cf. (a)). The system performs best in scenarios with diverse sound environments, like driving within city limits, showing consistently lower eer in the city subscenario (cf. Table 4), dropping as low as 0.096. Uniform sound environments, like driving on the highway, show slightly increased error rates, but still remain consistently below the error rates for the full dataset. In low-activity environments like parked cars, the scheme shows significantly increased error rates of up to 0.362, an increase of 0.134 over the city environment with the same parameters.
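The relation between a similarity threshold and the reported error rates can be sketched as follows. The sketch assumes that a pair is accepted when its similarity is at least the threshold and that candidate thresholds are taken from the observed scores; the helper names are hypothetical.

```python
def far_frr(colocated, non_colocated, threshold):
    """FAR: accepted non-colocated pairs; FRR: rejected colocated pairs."""
    far = sum(s >= threshold for s in non_colocated) / len(non_colocated)
    frr = sum(s < threshold for s in colocated) / len(colocated)
    return far, frr


def equal_error_rate(colocated, non_colocated):
    """Return (threshold, eer) where FAR and FRR are closest to each other."""
    def gap(t):
        far, frr = far_frr(colocated, non_colocated, t)
        return abs(far - frr)

    candidates = sorted(set(colocated) | set(non_colocated))
    best = min(candidates, key=gap)
    far, frr = far_frr(colocated, non_colocated, best)
    return best, (far + frr) / 2


def frr_at_target_far(colocated, non_colocated, target_far):
    """Lowest threshold whose FAR does not exceed the target, with its FRR."""
    candidates = sorted(set(colocated) | set(non_colocated))
    candidates.append(float("inf"))  # rejects everything: FAR 0, FRR 1
    for t in candidates:
        far, frr = far_frr(colocated, non_colocated, t)
        if far <= target_far:
            return t, frr
```

Because the far can only fall as the threshold rises, the first threshold that meets the target far also yields the lowest achievable frr for that target.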
The fingerprints exhibit good randomness across all devices. Their Markov property is good, with for all bits. When interpreting fingerprints as random walks, the resulting distribution of endpoints is close to the expected binomial distribution (cf. Figure 6). When splitting the 496-bit fingerprints into their constituent 31-bit fingerprints and analyzing them separately, the random walks show a more varied distribution. Some are close to the expected binomial distribution (cf. (d)), while others show a flatter distribution (cf. (c)), indicating more fingerprints contain a larger number of 1- or 0-bits than expected. Investigating these sensors in more detail, we found that their microphones were affixed to surfaces that vibrated more than average. As fingerprints are derived from variations in signal energy over time, the biased fingerprints may have been caused by periodic variations in the energy induced by the vibrations.
Applying the threshold from the office scenario to this dataset results in an increase in error rates between 3.6 and 11.2 percentage points, with the larger changes occurring for and 120. The mob/het threshold increases the error rates by 1.7 to 7.1 percentage points, with the largest changes for and 30, while shows the smallest change. In subscenarios, the most stable results are obtained between city and highway, changing between 4.1 and 9.5 percentage points in both directions. The other combinations show significantly larger error rate increases, in some cases up to 25.7 percentage points. This indicates that the scheme has limited robustness in cases where the environments are similar, but is not robust to larger changes in environmental characteristics.
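The robustness test can be sketched as applying a threshold tuned on one scenario to the scores of another and reporting the change in percentage points. How the increase is measured is an assumption here (far and frr under the foreign threshold, each compared with the native eer), and the function names are hypothetical.

```python
def rates_at(colocated, non_colocated, threshold):
    """FAR and FRR when pairs with similarity >= threshold are accepted."""
    far = sum(s >= threshold for s in non_colocated) / len(non_colocated)
    frr = sum(s < threshold for s in colocated) / len(colocated)
    return far, frr


def transfer_increase(foreign_threshold, native_eer, colocated, non_colocated):
    """Change of FAR and FRR versus the native eer, in percentage points."""
    far, frr = rates_at(colocated, non_colocated, foreign_threshold)
    return 100.0 * (far - native_eer), 100.0 * (frr - native_eer)
```

A threshold that is too low for the target scenario mainly inflates the far, while one that is too high inflates the frr.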
In the office, we observe generally increased eer, ranging from 0.241 to 0.419 and decreasing with increasing interval lengths (cf. Table 4). These error rates are explained by the higher overlaps between colocated and non-colocated classes, which lie between 48% and 79%. In particular, we observe that the computed similarities between some non-colocated devices exceeded the similarities with all of their respective colocated devices, especially using smaller interval sizes . Investigating these anomalous pairs in more detail revealed that the high similarities occur mostly at night and on the weekend, i.e., at times of very low ambient activity. However, the question why these particular devices were affected while others behaved normally remains unanswered.
When optimizing for a low far, the resulting frr for the full scenario are universally above 0.5 (cf. (b)). Once again, the system performs best in environments with high audio activity, in this case the weekdays, showing significantly reduced error rates compared to the night and weekend.
The fingerprints again show good randomness, with a strong Markov property and random walks close to the expected distribution for the full fingerprints. When investigating the sub-fingerprints, we observe a slight bias towards 0 in the lowest three bits of some devices, with . Most of the affected devices were located in office 2, but there is no discernible pattern in which devices exhibit this behavior and no obvious explanation.
Applying the car threshold to this dataset results in error rate increases of 3.9 to 10.9 percentage points, with the largest changes at and 120. Conversely, the threshold obtained in the mob/het scenario will increase error rates by 5.6 to 12.6 percentage points, with the largest changes at , 15 and 30. The error rates of the night and weekend subscenarios remain almost completely stable when exchanging their thresholds. All other combinations show larger changes, often showing swings of more than 10 percentage points.
The error rates in the mob/het scenario exhibit a larger spread than in the other scenarios, with eer ranging from 0.140 to 0.363 and decreasing with rising interval lengths (cf. Table 4). They once again correlate with the overlaps in similarity between colocated and non-colocated devices, which range from 27% to 62%. When optimizing for low far, the resulting frr range from close to 1.0 down to 0.40 (cf. (c)).
Although the Markov property is universally good, the randomness of the fingerprints shows significant variation. While the fixed sensors show decent randomness, the mobile devices (smartphones and smart watches) deviate from the expected distribution, showing similar behavior to the biased sensors in the car scenario. Part of this deviation can likely be explained by the different characteristics of the microphones (cf. Section 4.1.4). Devices that were covered (i.e., smartphones in pockets and smart watches worn under long-sleeved clothing) showed the largest deviation from the expected distribution, with strong biases towards sub-fingerprints consisting of mostly 1- or 0-bits. This is likely related to the movement of cloth over the devices generating wide-band scratching noises, in combination with sound attenuation caused by the clothing.
Applying the car threshold to this dataset results in increases in error rates between 1.5 and 7.4 percentage points, with the largest changes for and 30. The office threshold increases error rates by 5.3 to 11.8 percentage points, with the highest increases for and 30.
The scheme by Schürmann and Sigg is unable to reliably distinguish colocated from non-colocated devices in our scenarios. We also observed unexplained high similarities for specific non-colocated device pairs in the office scenario. In particular, the scheme breaks down in environments with low ambient activity (a limitation also noted by the original authors). However, even in high-activity environments like a driving car, the error rates exceed 10% for almost all parameters. Still, it may be possible to increase the overall performance of the scheme by excluding low-energy samples with a power threshold (similar to Karapanos et al.).
The fingerprints exhibit good randomness in many cases. However, the scheme struggles with noisy inputs, like vibration- or friction-induced sounds, and will in some cases generate fingerprints that consist almost entirely of 1- or 0-bits. In particular, devices carried in pockets or under long sleeves seem to cause problems.
While the scheme is robust in some pairs of subscenarios, the robustness is very limited. Interestingly, the intervals behave differently for different combinations of scenarios—while the error rates of are almost unaffected in some pairs, in others, they show very large changes. The same is true for other intervals like .
Schürmann and Sigg do not report error rates in their evaluation, so a direct comparison is impossible. However, the average separation between colocated and non-colocated fingerprints they report is larger than that observed in our scenarios. One possible explanation may be a tighter synchronization of audio signals in their experiment, as their samples were recorded by a single device with two microphones, thus avoiding any problems related to recordings not being exactly in sync. In a practical setting, such a tight synchronization between two devices will be more challenging to achieve (our synchronization method is described in Appendix A).
4.3. Miettinen et al.
The scheme proposed by Miettinen et al. (Miettinen et al., 2014) uses two context features, one based on audio and the other on luminosity. In both cases, changes over extended timeframes are recorded and encoded into a binary context fingerprint of a fixed length . The similarity of these fingerprints is then used to decide if devices can establish a connection by serving as a shared secret to bootstrap a key exchange using fuzzy commitments. Due to this usage, the randomness of the fingerprints is once again of interest. The authors evaluated their scheme in an office, a home scenario, and a mobile scenario simulating wearable devices. They also propose an optional extension to ensure sufficient fingerprint quality by discarding fingerprints with an insufficient surprisal, which measures how unexpected a fingerprint is for the current time of day. However, they did not evaluate the effect of this proposal. More details are given in Appendix A.3.
Our methodology is identical to that used for the paper by Schürmann and Sigg (cf. Section 4.2). As the fingerprints generated by the scheme span long timeframes (up to 34 hours), we omit the subscenario evaluation, as allocating fingerprints to specific subscenario timeframes is impossible.
Table notes: An empty cell denotes insufficient data to generate a fingerprint. The best value in each scenario is marked in bold. Footnote: computed on a subset.
Both luminosity- and audio-based fingerprints show relatively high error rates, with the lowest observed eer being 0.492 and 0.226, respectively (cf. Table 5). These high error rates can be explained by the high overlap of similarity percentages between the colocated and non-colocated groups, showing overlaps between 83% and 96% for the luminosity fingerprints. The overlaps are lower, but still significant for the audio fingerprints, with overlaps between 39% and 79% being observed. When aiming for a specific far, the resulting frr is universally above 0.5 for the audio fingerprint (cf. (a)). For the luminosity feature, the frr are 1.0 for all targeted far, indicating that all samples are rejected, which renders the usability of the scheme unacceptable.
Once again, the security of the scheme does not only depend on the error rates, but also on the randomness of the generated fingerprints. Here, we observe the luminosity fingerprints to be heavily biased towards zero. The audio fingerprints contain more 1-bits, but still do not show sufficient randomness. This limited randomness and high bias also explain the high overlap in the fingerprint similarity distributions. Rejecting fingerprints with insufficient surprisal excluded over 90% of the luminosity fingerprints even for the smallest specified surprisal value and consistently increased error rates for all attempted thresholds. For audio fingerprints, we evaluated a series of thresholds for different parameters and found that in many cases, the error rates do not decrease significantly and in some cases will even increase, unless over 95% of the dataset is excluded.
Applying the threshold from the office scenario increases the error rates for audio fingerprints by varying amounts, in some cases remaining stable, in others increasing by close to 25 percentage points, where higher values of and result in higher robustness. For luminosity fingerprints, increasing reduces robustness and can lead to all samples being rejected, while smaller values of with large sometimes show stable error rates. With the mob/het threshold, the system rejects all audio fingerprints. On luminosity fingerprints, it shows unpredictable behavior, being robust for certain parameters and rejecting all samples for others, with no discernible patterns.
In the office scenario, we observe lower error rates, with eer between 0.249 and 0.120 for the audio fingerprints (cf. Table 5), which can be explained by the decreased overlaps between the fingerprint similarity percentages of colocated and non-colocated devices (between 24% and 49%). For the luminosity fingerprints, the error rates remain high, with the lowest observed eer being 0.344, which can be explained by overlaps between 80% and 99%. In many cases, far and frr only become equal with thresholds close to 100% similarity, at which point the far becomes 0.0 and the frr 1.0. When aiming for a low far, the resulting frr remain large for both audio (cf. (b)) and luminosity fingerprints (where the error rate is almost universally 1.0).
The luminosity fingerprints consist overwhelmingly of 0-bits, which explains the observed overlaps in similarity percentages. This can be explained by the low variance in luminosity in offices, which are often lit by electric lighting with only very infrequent changes. Audio fingerprints show more variance but are usually biased, with the probability of obtaining a 1-bit varying between 0.4 and 0.63. The distribution within the fingerprint is also unequal, as the fingerprints are almost completely zero at night, leading to further biases (cf. Figure 6). Rejecting fingerprints with insufficient surprisal once again excluded most of the dataset in the luminosity feature, and only led to improvements of 1-2 percentage points in the audio fingerprints while excluding 10-20% of the dataset.
Applying the car threshold to the dataset again results in varying increases in the audio fingerprint error rates, following the same trends outlined for this combination in the car section. The luminosity feature accepts almost all fingerprints, with far between 0.9 and 1.0. With the threshold from the mob/het scenario, the scheme rejects all audio fingerprints, and either accepts or rejects all luminosity fingerprints, following no discernible pattern.
In the mob/het scenario, the colocation of devices changes over time, as they move between offices. This makes it impossible to perform a comprehensive evaluation of the scheme proposed by Miettinen et al., as the mobile devices often do not stay colocated with any device long enough to establish a pairing. We thus limit our evaluation to a timeframe of approx. 2.5 hours at the beginning of the recording, during which the colocation of all devices remains static.
The error rates for both luminosity and audio fingerprints are increased compared to the other scenarios, in some cases significantly so (cf. Table 5). The eer of the luminosity fingerprints are above 0.5 for all combinations of parameters, and the best observed eer of the audio fingerprint exceeds those of the car and office scenarios by more than four percentage points. Aiming for a low far will result in unacceptably high frr (cf. (c)). For audio fingerprints, this decreased performance can be attributed to the varying microphone characteristics leading to sounds being received with different amplitudes, resulting in deviating fingerprints. The luminosity fingerprints are challenged by the different positions of the mobile devices, which are in some cases carried in pockets and thus do not receive the same luminosity readings as other devices.
Luminosity fingerprints remain heavily biased towards zero, and the audio fingerprints also frequently show strong biases towards 1 or 0, following no discernible dependence on the parameters and . Using the surprisal thresholds leads to small improvements (less than 2 percentage points) in the error rates for audio fingerprints, at the cost of excluding 10-20% of the dataset. For luminosity fingerprints, even the smallest threshold excludes 96% of the dataset and does not improve the error rates significantly.
Applying the car threshold to the dataset will result in varying error rates, often rejecting all samples, and never coming close to the original error rates for the audio fingerprint. The luminosity fingerprints will occasionally reach error rates close to the original, following no particular pattern, but will often reject all fingerprints as well. The behavior of the office threshold is similar, rejecting close to all samples for both fingerprint types.
Our evaluation has shown that the scheme is unable to provide good separation between colocated and non-colocated devices, exhibiting large far and frr. Low far can only be obtained at the cost of large frr. The best performance is achieved using audio fingerprints in the office scenario, likely because of the homogeneous hardware and low level of background noise. We also investigate the impact of using the surprisal thresholds proposed by Miettinen et al. and find that they slightly improve the performance of the scheme in some cases but exclude a significant fraction of the dataset in the process, reducing availability.
The randomness of the generated fingerprints is limited, with devices often showing strong biases towards either 1 or 0, enabling adversaries to break the scheme in a practical deployment by guessing the fingerprint. This illustrates the importance of using an environmental data source with sufficient variability (unlike fixed electric lighting) and a quantization scheme that ensures a roughly equal proportion of 1- and 0-bits, e.g., the one used by Schürmann and Sigg (Schürmann and Sigg, 2013).
Miettinen et al. did not compute error rates but observed an average colocated luminosity and audio fingerprint similarity of 95% and 91.8%, respectively, using an interval of in their office scenario. For non-colocated devices, they saw similarities between 68% and 88% for luminosity and 62% to 71% for audio. We were unable to achieve this degree of similarity on our dataset.
4.4. Truong et al.
Truong et al. (Truong et al., 2014) propose combining multiple types of context information to increase the reliability and performance of zia schemes. They collect WiFi, Bluetooth, GPS, and audio data and compute a number of context features, aggregated over a time interval. Features based on the first three modalities are derived from distances between sets of observed devices and signal strengths, while the audio data is used to calculate the maximum cross-correlation and time-frequency distance between the audio snippets. Colocation is determined using a machine learning classifier, which has been trained with a labeled dataset of colocated and non-colocated features. Due to technical limitations of our hardware, we were unable to capture GPS data. However, Truong et al. found that the GPS feature contained the least amount of discriminative power in their dataset, which was obtained by having volunteers in two cities collect context information and colocation ground-truth data using smartphones and tablets in locations of their choice. The full details of the scheme are given in Section A.4.
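The first of the two audio features can be sketched as a normalized cross-correlation maximized over a range of lags. The sketch assumes mean-centered signals, normalization by the energy of the full snippets, and a caller-chosen `max_lag`; these choices and the function name are assumptions and may differ from the feature definition in Section A.4.

```python
from math import sqrt


def max_cross_correlation(x, y, max_lag):
    """Maximum normalized cross-correlation of two snippets over +/- max_lag."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    xc = [v - mean_x for v in x]
    yc = [v - mean_y for v in y]
    norm = sqrt(sum(a * a for a in xc) * sum(b * b for b in yc))
    if norm == 0:
        return 0.0  # a constant signal carries no correlation information
    best = float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        total = 0.0
        for i, a in enumerate(xc):
            j = i + lag
            if 0 <= j < len(yc):
                total += a * yc[j]
        best = max(best, total / norm)
    return best
```

Searching over lags makes the feature tolerant to small synchronization offsets between the two recordings, at the cost of additional computation.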
To investigate the performance of machine learning colocation prediction, we use the H2O framework (team, 2015) to train a set of classifiers and pick the best performers. We evaluate Gradient Boosting Machines (GBMs) (Friedman, 2001) and Random Forests (DRFs) (Breiman, 2001) as classifiers, and select the algorithm that gives the best cross-validated performance. These classifiers perform well on a wide range of datasets (Fernández-Delgado et al., 2014), they are fast, and they can handle instances with missing data directly in the model, allowing us to use instances that would otherwise have to be discarded. This is desirable, as in the real world, data may be incomplete (e.g., due to missing GPS fixes). These partial instances still provide information about the generating distribution and are therefore beneficial for the model, as shown by Tang and Ishwaran (Tang and Ishwaran, 2017). When building the cross-validation folds, H2O uses stratified sampling. This helps alleviate issues that can arise from class imbalances, such as in datasets that contain more non-colocated than colocated instances.
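The idea behind stratified fold assignment can be sketched without any framework. This is a plain illustration of the principle, not H2O's internal implementation; the function name and the round-robin assignment are assumptions.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k, seed=0):
    """Assign each instance to one of k folds, balancing every class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    fold_of = [None] * len(labels)
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the shuffled members of each class round-robin over the folds.
        for pos, idx in enumerate(indices):
            fold_of[idx] = pos % k
    return fold_of
```

With 10 colocated and 40 non-colocated instances and five folds, each fold receives two colocated and eight non-colocated instances, so the class ratio of the full dataset is preserved in every fold.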
To rank the classifiers, we use 10-fold cv and estimate the auc, which measures the quality of the predictions irrespective of the selected thresholds. A higher auc indicates a more discriminative model. Using this measure is valid in this case, as we are interested in low false accept and false reject errors across the whole range of prediction thresholds.
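The threshold independence of the auc follows from its rank interpretation: it equals the probability that a randomly chosen colocated instance receives a higher score than a randomly chosen non-colocated one. A minimal sketch of this pairwise formulation, with a hypothetical function name and ties counted as one half, is given below; it is not the estimator used by H2O.

```python
def auc(scores_colocated, scores_non_colocated):
    """Probability that a colocated score outranks a non-colocated score."""
    wins = 0.0
    for p in scores_colocated:
        for n in scores_non_colocated:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(scores_colocated) * len(scores_non_colocated))
```

A value of 1.0 means the two classes are perfectly separable by some threshold, while 0.5 corresponds to random guessing.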
For the learning, we let H2O split the data into training and validation datasets of 80% and 20% respectively. H2O will train a set of models independently from each other and automatically perform a parameter search to find optimal parameters for the specific dataset. Once we have found the top performing models, we get the cross-validated predictions . To convert those predictions to actual classes we use a threshold and classify predictions that satisfy as colocated. By optimizing the threshold, we balance the values between far and frr to obtain the eer or our target far. We also evaluate the impact of the individual features in the process using the normalized relative importance. The authors evaluated different interval lengths and came to the conclusion that increasing above 10 seconds did not significantly increase the performance of the scheme. To validate this result, we evaluate two datasets, with and 30. We present our results in Tables 6 and 7.
In this scenario, we obtain an eer of 0.111 and 0.104 for and 30, respectively. Cross-correlation and time-frequency distance of the audio recordings account for 85% of the relative feature importance. This is expected, as the route passed through many areas without WiFi access points and BLE devices.
When investigating subscenarios, we observe that the parked subscenario exhibits a significantly higher eer than the other subscenarios. In this subscenario, the models also show a lower reliance on audio features, with those features making up only 64% of importance, and a higher precedence being given to WiFi and BLE features. This can likely be explained by the low audio activity in this subscenario, leading the model to use these less reliable features and thus reducing classification performance.
The frr, given in (a), show similar trends: city and highway exhibit the lowest frr for the desired far, with the parked car significantly above them, and the full dataset somewhere in between. The frr also show a steeper drop in the beginning that tapers off later.
To test the robustness of the model, we use it to obtain predictions on the data from the other scenarios, applying the eer threshold determined before. This results in significantly increased error rates for all combinations of scenarios and intervals, with far larger than 0.44 for the office scenario, and frr in excess of 0.6 for the mob/het scenario, indicating that the model’s performance will deteriorate when used on data from a scenario it has not been trained on and thus is not robust to being operated in different environments.
The robustness of models trained on subscenario datasets shows significant variation. Combinations of the city and highway subscenarios show changes between 0 and 4 percentage points, while combinations involving the parked subscenario show changes between 25 and 82 percentage points. This indicates that the models are robust to small changes in the environment, but cannot adapt to significant deviations.
We observe a slightly improved eer of 0.084 () and 0.069 (). Surprisingly, the WiFi features are not more relevant, despite WiFi being one of the best features reported by Truong et al., and one would expect more stable signals for stationary devices compared to the mobile car scenario. However, our results show the audio features are even more relevant with a combined relative importance of 91%. To investigate if this is caused by the missing WiFi data in the dataset (cf. Section 5), we repeat the analysis, excluding instances where the WiFi data is missing due to a scan error, and obtain unchanged results. Thus, even in a dataset that contains WiFi data for all samples, the feature does not become more relevant for the classifier. The subscenarios show a much more similar behavior than in the previous scenario, with eer between 0.071 (weekend) and 0.087 (weekday). This trend is also shown in the frr evaluation in (b), where the curves are all closely matched.
When running the model on the car dataset for , we obtain an far and frr of 0.174 and 0.412, respectively, with increasing the error even further. Applying it to the mob/het dataset yields error rates of 0.022 and 0.711, respectively, once again increasing further for . This shows that the models are sufficiently different such that generalization is low and therefore robustness of the scheme suffers. Switching between the different subscenarios results in less pronounced changes, but still in some cases doubles the error rates. The scheme appears especially challenged when applying the weekend model to the other subscenarios, often doubling the error rates, while the weekday model is fairly robust, with only minor changes to most error rates. This is likely due to the higher complexity of the weekday dataset, which contains data from a more diverse set of situations.
In the mob/het scenario, we obtain eer of 0.127 and 0.123, respectively (cf. Table 8). Once again, the most important features are audio-based, although their importance is less pronounced, making up only 60% and 56% of relative feature importance for and 30, respectively. Optimizing for a low far will result in frr between 0.9 and 0.3 (cf. (c)). This lower overall performance and the reduced prominence of the audio features is likely related to the issue of heterogeneous microphone characteristics, which we previously observed in the scheme proposed by Karapanos et al. (cf. Section 4.1.4), as Truong et al. use similar audio features.
Using the model to classify the car and office datasets results in significantly increased error rates (FAR 0.7, FRR 0.22 for all combinations), showing that the model is not robust to different environments.
Our evaluation shows that the scheme can achieve a good eer in some of our scenarios, although it does not reach the error rates of the original paper, which observed a far and frr of 0.0198 and 0.0167 for . We also see that models generated in one scenario show a significant loss in accuracy when being used in another scenario, and that the scheme encounters problems when using heterogeneous microphones. The authors also performed an experiment where pairs of devices were placed in close proximity (which matches our office scenario), obtaining a far of 0.0476, but did not report the frr, which prevents a direct comparison.
Contrary to the original evaluation, the classification performance increased with larger intervals. We also saw a much higher importance of the audio feature than the original paper and a correspondingly lower importance of the WiFi feature. This is likely related to the collection strategy employed by the authors, who collected their dataset in different locations across two cities, which can be easily distinguished by their different WiFi signals.
The subscenario evaluation shows that the system does not work well in environments with little context activity, like cars parked in areas without WiFi and BLE devices. The differences between the subscenarios were less pronounced in the office scenario, where a larger number of WiFi and BLE devices were visible at all times.
Two factors limit the validity of our results. First, we did not collect GPS data, which the authors used. We assume that the impact would have been low in the office and mob/het scenarios, where devices were located close to each other and mostly static; however, it may have improved performance in the car scenario. Second, we use a different classifier than the authors, who used a Multiboost classifier (Webb, 2000), which is not supported in H2O. Still, DRFs and GBMs use ensemble methods similar to Multiboost and are unlikely to give significantly worse results.
4.5. Shrestha et al.
Shrestha et al. (Shrestha et al., 2014) propose combining readings from temperature, humidity, altitude, and precision gas sensors to decide if two devices are colocated. They compute the absolute difference between the readings of two devices and use a Multiboost classifier (Webb, 2000) trained on a labeled dataset to distinguish colocated and non-colocated devices. As our devices did not feature a precision gas sensor, we omit this feature. The sensor readings are not averaged over time intervals but used individually. Their dataset was obtained by collecting data from several locations using a pair of devices. Any data collected at different locations and times is interpreted as non-colocated. Additional details of the scheme are given in Section A.5.
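The feature construction can be sketched as follows, assuming each device reports one reading per sensor in a dictionary. The sensor keys and the function name are hypothetical, and the precision gas sensor is left out as described above.

```python
def difference_features(reading_a, reading_b,
                        sensors=("temperature", "humidity", "altitude")):
    """Absolute per-sensor differences between two devices' readings."""
    return tuple(abs(reading_a[s] - reading_b[s]) for s in sensors)
```

For example, readings of 21.5 and 23.0 degrees on two devices yield a temperature feature of 1.5, regardless of which device is warmer.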
Although the machine learning methodology is identical to that used for the paper by Truong et al. (cf. Section 4.4 for details), the characteristics of the datasets and volume of data demand different treatment. One assumption made by any classifier in machine learning is that it estimates a well-defined (single-valued) function from a vector of features to a particular class , i.e., all unique instances in the dataset map to exactly one class. However, our datasets do not fulfill this requirement, as several identical instances map from the same feature values to different classes. As the classifier has no additional data to base its decision on, it is unable to distinguish these ambiguous instances and thus can never reach a performance of 100%, i.e., the eer has a lower bound larger than 0. This indicates that more features are needed to discriminate the classes properly. We show the percentage of these ambiguous instances (Amb.) in each dataset in Table 9.
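Counting ambiguous instances can be sketched as below. The sketch assumes that instances are `(features, label)` pairs with hashable feature tuples and that the percentage is taken over all instances rather than over unique feature vectors; both the representation and this counting convention are assumptions.

```python
from collections import defaultdict


def ambiguous_percentage(instances):
    """Percentage of instances whose feature vector occurs with >1 class."""
    labels_seen = defaultdict(set)
    for features, label in instances:
        labels_seen[features].add(label)
    ambiguous = sum(1 for features, _ in instances
                    if len(labels_seen[features]) > 1)
    return 100.0 * ambiguous / len(instances)
```

Any instance counted here cannot be classified correctly together with its conflicting counterpart, which is why a non-zero percentage implies a non-zero error floor.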
At the same time, it also indicates a potential for compression. Indeed, after analyzing the original office dataset with a size of 81 GB, we observe that many instances are repeated. Therefore, we introduce a pre-processing step before training, where we group all equal instances and keep a count of how many times they appear. These counts are used as weights in the later learning stage, which acts as a lossless compression mechanism. This way, we reduce the dataset to approximately 600 MB, which allows us to train models much faster and with significantly lower computational resources without sacrificing classification performance.
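The pre-processing step can be sketched as grouping identical `(features, label)` pairs and keeping their multiplicity as a weight. The data representation and the function name are assumptions made for illustration; how the weights are passed to the learner depends on the framework.

```python
from collections import Counter


def compress(instances):
    """Collapse duplicate (features, label) pairs into weighted instances."""
    counts = Counter(instances)
    return [(features, label, weight)
            for (features, label), weight in counts.items()]
```

The weights sum to the original number of instances, so a learner that honors instance weights sees the same empirical distribution as on the uncompressed data.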
For the car scenario, we obtain an eer of 0.115, with the classifier relying almost evenly on all the features to make the predictions. The individual subscenarios achieve even lower eer, showing rates of 0.034 (parked), 0.08 (highway) and 0.081 (city). This performance correlates with the percentage of ambiguous instances, with subscenarios with more ambiguous instances obtaining higher error rates. The low error rate of the parked subscenario is likely related to the use of temperature sensors, which captured the different rates of heat dissipation of the cars after they were parked. When aiming for a specific far, the differences between the subscenarios are maintained, with the parked subscenario showing consistently lower frr (cf. (a)).
The model is not very robust, showing significantly increased error rates when applied to the office or mob/het dataset (FAR 0.342, FRR 0.503 for office, 0.276 / 0.717 for mob/het). Similarly, applying models specific to one subscenario to another increases the error rates by at least 32 percentage points.
In the office scenario, the classifier reaches an eer of 0.247, showing a significantly lower performance than in the car scenario. It relies strongly on the temperature differences to make the predictions. Such a focus on a feature with a low range of potential values may make the classifier more vulnerable to active attacks and is thus undesirable. The low performance is mirrored in the subscenarios, with error rates of 0.148 (weekend) to 0.271 (weekday), which is also borne out in the high frr when aiming for a specific far (cf. (b)). Once again, higher percentages of ambiguous instances translate to higher error rates. We also observe that the percentage of ambiguous instances grows with the size of the dataset. This is to be expected: because the range of potential values is limited, a larger dataset has a higher chance of containing such instances.
When the office model is used on the other two datasets, the error rates increase significantly (FAR 0.186, FRR 0.693 for car, 0.184 / 0.787 for mob/het), showing that the model is not robust to being used in different environments. Of the three subscenario models, the weekday model is the most robust, with error rate increases of below 6 percentage points when applied to different subscenario datasets. However, applying the night model to weekday data increases the error rate by 35 percentage points, and the weekend model increases its error rate by over 43 percentage points with the weekday dataset. This shows that the robustness is limited.
In the mob/het scenario, the laptops and robot were not included in the evaluation due to a lack of humidity and temperature sensors. The classifier reaches an eer of 0.141 (cf. Table 9), relying primarily on the altitude readings, with a lower importance given to temperature and humidity. A closer investigation revealed that the phones’ barometric pressure readings deviated from those of the watches and SensorTags, showing an offset of approximately 2 hPa. The temperature and humidity readings varied widely, with the position of the device (smartphone in pocket, sensor on screen, …) having a much larger influence than the room they operate in. The error rate is likely related to this challenging environment, as well as the number of ambiguous instances, which make up 12.2% of the dataset. When optimizing for a low far, the resulting frr are at least 0.3 (cf. (c)).
Applying the model to the car dataset results in notably increased error rates (FAR 0.302, FRR 0.672). The office dataset gives similar error rates (0.306 / 0.606), showing limited robustness of the model in different environments.
Overall, the scheme by Shrestha et al. cannot reliably separate colocated from non-colocated devices in most scenarios. This is in stark contrast to the error rates reported by the authors, who obtained an far and frr of 0.0581 and 0.0296, respectively. This deviation can partially be explained by the lack of precision gas features in our datasets, reducing the number of dimensions the models can discriminate on. Another explanation is the more challenging environment our data was collected in—the authors collected their data in widely spaced locations at different times of day, and their non-colocated class consisted of pairings between different locations. This also explains the high discriminative power of the altitude readings reported by the authors, and indicates that the scheme will likely have a much better performance if only coarse colocation is required.
The high number of ambiguous instances shows that the scheme would benefit from incorporating additional sensors to improve its discriminative power. Additionally, features tracking the change in values over time may be a more promising approach, as our dataset shows these changes to be more consistent between devices in the same room than the sensor readings themselves. The results also show that even if high classification performance can be reached, it is still highly specific to the environment it was collected in, and does not transfer well into other environments, i.e., the robustness of the scheme is limited.
In this section, we discuss the implications of the obtained results, the limitations of our method, and avenues for future work.
Our results, summarized in Table 2, show that the scheme by Truong et al. obtains the best eer in the office (0.069) and mob/het (0.123) scenarios, while the scheme by Karapanos et al. achieves a significantly lower error rate (0.006) in the car scenario. This indicates that depending on the use case, both are solid choices. We also observed a large variation in performance on different subscenarios, ranging from perfect accuracy (Karapanos et al., highway) to significantly degraded performance compared to using the full dataset (Schürmann et al., parked), illustrating the importance of fine-grained test scenarios.
Some schemes struggle to adapt to times of low ambient activity like the night, where we observed a lower separation between colocated and non-colocated devices. These times need to be taken into account when designing a scheme intended for continuous operation. Karapanos et al. and Miettinen et al. added measures to reduce the impact of these times by dynamically discarding samples with low ambient activity (Karapanos et al., 2015) or high predictability (Miettinen et al., 2014), trading off availability for security.
Even if they can operate in environments with low activity, most schemes suffer from a lack of robustness, i.e., parameters that are optimal for one scenario do not give good performance on the others. Some schemes can achieve a certain degree of robustness for specific parameters (e.g., interval sizes), most notably that by Karapanos et al., but no scheme is robust with all parameters. The same trend holds when exchanging parameters or models between different subscenarios, like day and night, especially if they have significantly different ambient activity levels. This further illustrates the importance of testing schemes in a wide variety of settings. We urge researchers to pay special attention to robustness, to facilitate use in the wide variety of different and sometimes unexpected environments the iot will be deployed in.
Even if schemes can provide good performance in settings with homogeneous devices (i.e., the same hardware), they may still fail when encountering devices with different characteristics. These challenges have also been encountered in other research fields such as participatory sensing. Examples from our dataset include microphones with varying sensitivity and frequency response (Kardous and Shaw, 2014; Maisonneuve et al., 2009; Lu et al., 2009), which leads to lower correlations, or incorrectly calibrated sensors (e.g., air pressure) measuring with a fixed offset from one another. In addition, the way a device is carried influences the observed sensor data (Miluzzo et al., 2010). Schemes intended for use cases where the hardware is not carefully controlled by a single party must still provide good results under these conditions and should be tested with heterogeneous devices.
Many schemes do not explicitly state their colocation definition. It is often unclear if they are intended to distinguish personal workspaces inside an office, different rooms, or parts of a city, making it difficult to identify which security guarantees a scheme provides in any specific situation. This hinders a fair evaluation and comparison of these schemes, and makes it hard to determine if our results impact their designated use cases. Authors should explicitly define what their scheme considers (non)colocated to allow for fair comparisons.
Technical issues during the recording led to data loss for some features, especially the WiFi captures, which stopped working on some devices. In the mob/het scenario, three devices stopped the data collection before the eight-hour countdown, resulting in partial loss of audio and sensor data. This reduces the amount of available data for the evaluation of features based on these modalities. In the same way, the SensorTag platform occasionally delivered incorrect readings for the luminosity, which we detected and excluded.
Our goal was to compare the different zis schemes in a fair manner and as specified by their original authors. While we attempted to stay as close to the published version of the scheme as possible, in some cases, minor changes had to be made to parameters such as interval length or sampling rate. These deviations are noted in the Appendix, and their influence on the results should be negligible. We did not attempt to optimize any parameters aside from interval length and the power threshold of Karapanos et al. for the mob/het scenario for our dataset, so it is possible that some schemes could perform better when they are instantiated with different parameters.
We also note that our scenarios are challenging, as they include devices in isolated positions (glove compartment, cupboard, pocket), low isolation between two offices, and two cars that traveled next to each other for extended amounts of time. This is intentional to be able to investigate the performance of the systems in challenging situations, as consumer iot deployments rarely follow best practices for a deployment facilitating zero-interaction security. We also did not include any scenarios in busy areas like shopping malls, which may show different behavior due to the higher environmental variations, as we chose to focus on the likely application domain of zero-interaction technologies, consumer applications. Additionally, obtaining approval and informed consent for a long-term data collection in a public place would have been infeasible in our jurisdiction.
Our chosen scenarios only cover a subset of interesting iot environments. Other scenarios may pose different challenges for the schemes. For example, in a smart building scenario, an ideal zis scheme would have to be able to authenticate all devices within the same building, while excluding adjacent buildings. In a café scenario, schemes would need to be able to distinguish individual tables. We also did not include any scenarios where devices operated in environments without any humans for extended amounts of time (e.g., automated factory work floors or storage units), which could pose challenges to many schemes due to the potentially low variation in the context information or different noise characteristics.
The collection of additional datasets will assist efforts to create more adaptive and robust schemes and to understand the limitations of existing ones. Another avenue for future work is the robustness to adversarial settings, where part of the context information can be controlled by an active adversary (e.g., by injecting sound).
We reproduced and evaluated five zip and zia schemes in three realistic scenarios—smart office, connected car and smart office with mobile heterogeneous devices—posing different challenges in aspects like environmental noise, context leakage, and times of low activity. Our results show that none of the reproduced schemes can perfectly separate devices in all scenarios. The schemes by Karapanos et al. (Karapanos et al., 2015) and Truong et al. (Truong et al., 2014) show promising results, but no scheme reliably outperforms all others in all scenarios. The error rates also indicate that zero-interaction security should not be used as the only access control factor, as even a false accept rate of 1% would be considered insufficient for some real-world applications. In fact, Karapanos et al. explicitly proposed their zia scheme as a more convenient second authentication factor (Karapanos et al., 2015) instead of a stand-alone solution.
We also observed that a good average-case separation of context features aggregated over the whole dataset does not imply a good authentication performance on individual samples. zip and zia schemes should thus be evaluated in terms of their error rates in both the average case as well as individual subscenarios to get a realistic impression of their performance. In the same way, the evaluation should be performed using a set of heterogeneous devices in a realistic, challenging environment, to test the limits of the scheme.
Our evaluation revealed that in many cases, features based on ambient audio performed best. However, researchers need to take the privacy implications of using audio recordings into account, as this may not be acceptable in some environments like hospitals. Additionally, the computational costs of processing have to be considered, as expensive audio processing operations may not always be possible on resource-constrained iot devices. Finally, we observed that devices with differing microphone characteristics can significantly degrade the performance of sound-based schemes. Thus, we encourage researchers to continue to investigate the possibility of using more power-efficient features based on low-power sensors. Here, we found that instead of using the absolute difference between sensor readings, trends over time may be a more reliable colocation indicator.
We also observed that the robustness and adaptiveness of many schemes vary dramatically across scenarios. Schemes should explicitly state which environments they are designed for. Additionally, they should support robustness and adaptiveness, potentially by automatically adapting their internal parameters to their environment, and should be evaluated on data from different scenarios and devices.
Finally, we release the first extensible open source toolkit (Fomichev et al., 2019b) for researching zero-interaction security, containing reference implementations of the reproduced schemes, the audio recordings of the mob/het scenario, and over 1 billion samples of labeled sensor data. We also release all data generated by our evaluation to facilitate the reproduction of our results and provide a common benchmarking baseline for future schemes.
We would like to thank Daniel Wegemer, Jiska Classen, Timm Lippert, Robin Klose, Vanessa Hahn and Santiago Aragón for assistance in conducting this research. Furthermore, we are thankful to Hien Truong, Nikolaos Karapanos, Dominik Schürmann, Markus Miettinen and Babins Shrestha for assistance in reproducing their work. This work has been co-funded by the DFG within CRC 1119 CROSSING and CRC 1053 MAKI projects, and as part of project C.1 within the RTG 2050 “Privacy and Trust for Mobile Users”. Calculations for this research were conducted on the Lichtenberg high performance computer of the TU Darmstadt.
Appendix A Reproduced ZIS schemes
In this appendix, we provide more details about the reproduced zis schemes. We give a brief overview of each scheme’s functionality and use case and describe the implementation of context features utilized by the scheme.
Before computing audio features we aligned audio recordings from different sensing devices as follows. At the beginning of data collection all devices were synchronized using the ntp. First, we performed the coarse-grained alignment using devices’ timestamps to synchronize the start of the audio recordings. Second, during the feature computation we performed a fine-grained alignment between two input audio recordings using the cross-correlation function in Matlab (The MathWorks, Inc., 2018b). Specifically, we considered the first hour of audio recordings to find a lag between them, using the xcorr function ( seconds), then we used this lag to align two audio recordings and cut them to the length of the shortest recording. These aligned recordings are then split into intervals and used to compute audio features.
In the mobile scenario, we increased the maximum lag considered by xcorr to 15 seconds to more precisely find the lag between audio recordings of heterogeneous devices. In addition, we found that heterogeneous devices have an inherent audio drift, causing desynchronization of audio recordings. We removed this drift by applying a time-stretching effect to audio recordings in the Audacity tool (Change Tempo).
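The fine-grained alignment can be sketched in plain Python as follows. This is an illustrative stand-in for the Matlab procedure, not a port of it: the function names, the sign convention of the lag, and the brute-force search are assumptions, and real recordings would require a much faster (e.g., FFT-based) cross-correlation.

```python
def best_lag(x, y, max_lag):
    """Return the lag (in samples) within [-max_lag, max_lag] that maximizes
    the cross-correlation sum_i x[i] * y[i - lag]."""
    def xcorr_at(lag):
        return sum(x[i] * y[i - lag]
                   for i in range(len(x))
                   if 0 <= i - lag < len(y))
    return max(range(-max_lag, max_lag + 1), key=xcorr_at)

def align(x, y, lag):
    """Drop the leading samples of the delayed recording, then cut both
    recordings to the length of the shorter one."""
    if lag > 0:
        x = x[lag:]      # x is delayed relative to y
    else:
        y = y[-lag:]     # y is delayed relative to x (or lag == 0)
    n = min(len(x), len(y))
    return x[:n], y[:n]

# Toy example: x is y delayed by two samples.
y = [0, 0, 1, 2, 3, 0, 0, 0]
x = [0, 0, 0, 0, 1, 2, 3, 0]
lag = best_lag(x, y, max_lag=3)
aligned_x, aligned_y = align(x, y, lag)
```

Bounding the search by `max_lag` keeps the cost manageable and reflects the fact that the coarse timestamp-based alignment has already removed large offsets.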
A.1. Karapanos et al.
The scheme by Karapanos et al. (Karapanos et al., 2015) calculates a similarity score between snippets of ambient audio from two devices to decide if these devices are colocated. The similarity score is the average of the maximum cross-correlations between two audio snippets computed on a set of one-third octave bands. To prevent erroneous authentication when audio activity is low, a power threshold is applied to discard similarity scores from audio snippets with insufficient average power. The similarity score is then checked against a fixed similarity threshold to decide if two devices are colocated. The scheme is designed to provide colocation evidence between a user’s smartphone and a computer with a running browser. This evidence is utilized as a second authentication factor when a user wants to log-in to an online service such as a bank account. In this work, we focus on computing and comparing similarity scores and do not target the specific use case of the second authentication factor.
We first provide notations adopted from the original paper in Table 10. Second, we present parameters of the sound similarity algorithm used in the original and our implementations in Table 11. Our goal was to follow the original implementation as closely as possible; however, we introduced a few changes, as we did not have tight synchronization between audio snippets. Third, we present our implementation of the sound similarity algorithm in Section A.1.1.
|input audio snippets|
|length of input audio snippets in seconds|
|max cross-correlation lag in seconds|
|sampling rate of input audio snippets in kHz|
|average power threshold in dB|
|set of considered one-third octave bands|
|number of considered one-third octave bands|
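| Implementation | Snippet length, sec | Max cross-correlation lag, sec | Sampling rate, kHz | Average power threshold, dB | One-third octave bands (number) |
| Karapanos et al. | 3 | 0.15 | 44.1 | 40 | 50Hz – 4kHz (20) |
| Ours | 5 | 1 | 16 | 40/38/35 | 50Hz – 4kHz (20) |
| Ours | 10 | 1 | 16 | 40/38/35 | 50Hz – 4kHz (20) |
| Ours | 15 | 1 | 16 | 40/38/35 | 50Hz – 4kHz (20) |
| Ours | 30 | 1 | 16 | 40/38/35 | 50Hz – 4kHz (20) |
| Ours | 60 | 1 | 16 | 40/38/35 | 50Hz – 4kHz (20) |
| Ours | 120 | 1 | 16 | 40/38/35 | 50Hz – 4kHz (20) |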
As shown in Table 11, our implementation differs from that of Karapanos et al. in several parameters. First, we increase both the length of the input audio snippets from 3 to 5 seconds and the maximum cross-correlation lag from 0.15 to 1 second to achieve a level of authorization comparable to that of the authors. We observed that even after the alignment procedure (cf. Audio preprocessing) there might be an offset within long audio recordings (24 hours), which can affect synchronization between audio snippets. We therefore set the maximum lag to 1 second to balance security (a bounded lag thwarts attackers trying to guess the audio environment) against the loose synchronization that can occur in a realistic iot scenario. The increase of the maximum lag in turn leads to the increase of the audio snippet length to 5 seconds. Second, we use a lower sampling rate for the input audio snippets (16 vs. 44.1 kHz), which does not affect the sound similarity algorithm itself but speeds up the computations, as fewer samples need to be processed. Despite the lower sampling rate and, thus, narrower audio spectrum (8 kHz), we cover the same set of octave bands as the original implementation.
As stated in Section 4, we evaluate the performance of the scheme on a number of intervals from 5 to 120 seconds.
A.1.1. Implementation of the sound similarity algorithm
As input, we have two aligned audio snippets and of equal length with a sampling rate .
Both and are split into one-third octave bands using a bank of band-pass filters:
For each and the normalized maximum cross-correlation is computed as the function of the lag (we omit indexes for simplicity):
In Equation 2 the term is a cross-correlation function between two discrete signals and :
is the number of samples in the signals (we assume signals and have the same length) and the lag is bounded within a range . The normalization term accounts for different amplitudes of signals and , with and being the auto-correlation functions. The resulting maximum cross-correlation is bounded within a range [0, 1], because we take the absolute value of the normalized cross-correlation .
The resulting similarity score between two audio snippets and is obtained by taking the average of the normalized maximum cross-correlations computed in each one-third octave band:
The similarity score is only used if the input audio snippets have sufficient average power: . Otherwise, it is discarded and no authentication is attempted.
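The core of the computation described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not our Matlab implementation: the band-pass filtering is assumed to have been done already (each snippet is passed as a list of per-band signals), the lag is searched symmetrically around zero, all names are hypothetical, and the average power check is omitted because the dB reference level is not specified in this section.

```python
import math

def max_norm_xcorr(x, y, max_lag):
    """Maximum absolute cross-correlation over lags in [-max_lag, max_lag],
    normalized by the zero-lag auto-correlations of x and y."""
    norm = math.sqrt(sum(v * v for v in x) * sum(v * v for v in y))
    if norm == 0:
        return 0.0  # a silent band carries no similarity evidence
    best = 0.0
    for lag in range(-max_lag, max_lag + 1):
        c = sum(x[i] * y[i - lag]
                for i in range(len(x))
                if 0 <= i - lag < len(y))
        best = max(best, abs(c) / norm)
    return best

def similarity_score(bands_x, bands_y, max_lag):
    """Average of the per-band maximum normalized cross-correlations."""
    scores = [max_norm_xcorr(bx, by, max_lag)
              for bx, by in zip(bands_x, bands_y)]
    return sum(scores) / len(scores)

# Toy example: b is a delayed by one sample, so a lag bound of 1 recovers it.
a = [1.0, -2.0, 3.0, 0.0]
b = [0.0, 1.0, -2.0, 3.0]
score = similarity_score([a, a], [b, a], max_lag=1)
```

By the Cauchy-Schwarz inequality each per-band value lies in [0, 1], so the averaged score does as well; a larger lag bound tolerates looser synchronization at the cost of more candidate alignments.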
|Band Number||, Hz||, Hz||, Hz|
- lower band frequency, - calculated center frequency (nominal frequency), - upper band frequency
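The band limits of a one-third octave filter bank follow from a simple formula, sketched below in Python. The sketch assumes the common base-2 convention in which band number 30 is centered at 1 kHz and assumes that the 20 bands from 50 Hz to 4 kHz correspond to band numbers 17 to 36; neither assumption is stated in this section, and the function name is hypothetical.

```python
def third_octave_band(band_number):
    """Return (lower, center, upper) frequency in Hz of a one-third octave band.

    Assumes the base-2 convention with band 30 centered at 1 kHz; the band
    edges lie one sixth of an octave below and above the center frequency.
    """
    center = 1000.0 * 2 ** ((band_number - 30) / 3)
    edge_ratio = 2 ** (1 / 6)
    return center / edge_ratio, center, center * edge_ratio

# Bands 17 (about 50 Hz) to 36 (4 kHz): 20 bands in total.
bands = [third_octave_band(n) for n in range(17, 37)]
```

Under these assumptions the upper edge of the highest band stays below the 8 kHz spectrum available at a 16 kHz sampling rate.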
A.2. Schürmann and Sigg
The scheme by Schürmann and Sigg (Schürmann and Sigg, 2013) computes a binary fingerprint from a snippet of ambient audio, based on energy differences in successive frequency bands. Two devices wishing to establish a pairing compute such fingerprints from their ambient environments. These fingerprints are used in a fuzzy commitment scheme to obtain a shared secret. One device uses its fingerprint to hide a randomly chosen secret and sends this commitment to the other device, which can only retrieve the random secret from the commitment if it has a sufficiently similar fingerprint. In this work, we focus on deriving and comparing binary fingerprints and we do not target a specific use case of establishing a shared secret key.
We first provide notations adopted from the original paper in Table 13. Second, we present parameters of the audio fingerprinting algorithm used in the original and our implementations in Table 14, where we introduce a few changes, as our audio snippets have a lower sampling rate. Third, we present our implementation of the audio fingerprinting algorithm in Section A.2.1.
|input audio snippet|
|length of the input audio snippet in seconds|
|sampling rate of the input audio snippet in kHz|
|number of frames to split the input audio snippet|
|number of frequency bands to split each frame|
|length of each frame in seconds (duration)|
|width of each frequency band in Hz|
|binary fingerprint of length in bits|
|, bits||, sec||, kHz||, frames||, bands||, sec||, Hz|
As shown in Table 14, our implementation differs with respect to some parameters from the implementation by Schürmann and Sigg. First, we use a lower sampling rate of 16 kHz instead of the original 44.1 kHz, which affects the number of frequency bands we can split our frames into. With a 16 kHz sampling rate our audio spectrum is only 8 kHz, thus we can only obtain 32 non-overlapping frequency bands, each of width 250 Hz. Having 32 frequency bands instead of 33 as in the original implementation results in shorter binary fingerprints of 496 instead of 512 bits. Second, we vary the length of the input audio snippet from 5 to 120 seconds, which also changes the length of a single frame to between 0.29 and 7.06 seconds. We note that shorter audio frames (e.g. ) are more susceptible to synchronization issues between input audio snippets, thus reducing the similarity of binary fingerprints generated from these snippets. However, starting from our frame length is bigger than in the original implementation, which makes our results comparable and allows us to assess the performance of the scheme (i.e. distinguishing between colocated and non-colocated devices) on longer audio snippets.
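The numbers in this paragraph can be checked with a few lines of arithmetic. The Python sketch below assumes 17 frames per snippet and a fingerprint length of (frames - 1) · (bands - 1) bits; both are inferred from the reported values (512 = 16 · 32, 496 = 16 · 31, 5/17 ≈ 0.29 s, 120/17 ≈ 7.06 s) rather than stated explicitly here, and the function name is hypothetical.

```python
def fingerprint_params(snippet_sec, sampling_khz, frames, bands):
    """Derive band width, frame length, and fingerprint length from the inputs."""
    nyquist_hz = sampling_khz * 1000 / 2  # usable spectrum is half the sampling rate
    return {
        "band_width_hz": nyquist_hz / bands,
        "frame_sec": snippet_sec / frames,
        # Assumed relation: one bit per pair of consecutive frames and
        # pair of successive bands.
        "fingerprint_bits": (frames - 1) * (bands - 1),
    }

ours_short = fingerprint_params(snippet_sec=5, sampling_khz=16, frames=17, bands=32)
ours_long = fingerprint_params(snippet_sec=120, sampling_khz=16, frames=17, bands=32)
```

With 33 bands instead of 32, the same relation yields the 512 bits of the original implementation.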
A.2.1. Implementation of the audio fingerprinting algorithm
As input, we have an audio snippet of length with a sampling rate (audio snippets from different devices are aligned). The number of frames and the number of frequency bands are selected to obtain the binary fingerprint of the desired length:
The width of a frequency band depends not only on the number of bands but also on the available audio spectrum which is limited by the Nyquist frequency ():
The audio snippet S is split into n successive frames of equal length in samples ( is a x1 vector).
Each frame is split into non-overlapping frequency bands of width using a bank of band-pass filters:
In our implementation the available audio spectrum is 8 kHz, thus we split it into the following 32 bands of width 250 Hz: , using a 20th-order Butterworth filter (The MathWorks, Inc., 2018a) for each band.
For each frame the energy of each frequency band is computed as (superscript denotes transpose):
The results of energy computation are stored in the energy matrix ():
The binary fingerprint is obtained by iterating over consecutive frames and frequency bands . Each bit of the fingerprint is generated by checking the energy difference between successive frequency bands of two consecutive frames ():
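The energy computation and bit generation can be sketched in Python as follows. This is an illustration under assumptions, not our Matlab code: the band-pass filtering is assumed to have been applied beforehand, the sign convention of the energy difference and the bit ordering are assumptions, and all names are hypothetical.

```python
def band_energies(frame_bands):
    """Energy of each band-filtered frame signal, i.e., the sum of its squared samples."""
    return [sum(v * v for v in band) for band in frame_bands]

def fingerprint(energy):
    """Derive fingerprint bits from an energy matrix.

    energy[i][j] is the energy of frequency band j in frame i. Each bit
    compares the energy difference between successive bands in frame i
    with the same difference in the previous frame i - 1.
    """
    bits = []
    for i in range(1, len(energy)):
        for j in range(len(energy[i]) - 1):
            diff = ((energy[i][j] - energy[i][j + 1])
                    - (energy[i - 1][j] - energy[i - 1][j + 1]))
            bits.append(1 if diff > 0 else 0)
    return bits

def hamming_similarity(f1, f2):
    """Fraction of matching bits between two fingerprints of equal length."""
    return sum(a == b for a, b in zip(f1, f2)) / len(f1)

# Toy energy matrix with 3 frames and 3 bands gives (3 - 1) * (3 - 1) = 4 bits.
toy_energy = [[1, 2, 3], [3, 2, 1], [1, 1, 1]]
toy_bits = fingerprint(toy_energy)
```

Two devices would each compute such a fingerprint from their own aligned snippet and compare the results, for example by the fraction of matching bits.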
A.3. Miettinen et al.
The scheme by Miettinen et al. (Miettinen et al., 2014) is inspired by the audio fingerprinting scheme proposed by Schürmann and Sigg (cf. A.2) but works on longer timescales. It uses noise level and luminosity measurements to derive long-term binary fingerprints, which can defend against adversaries that are colocated for short timeframes. The scheme utilizes such fingerprints in a fuzzy commitment scheme (as described in A.2) to gradually evolve a shared secret key to achieve pairing between two devices that are colocated for a sustained period of time. In this work, we focus on deriving and comparing long-term binary fingerprints and we do not target a specific use case of establishing a shared secret key.
We first provide notations adopted from the original paper in Table 15. Second, we present parameters of the context fingerprinting algorithm used in the original and our implementations in Table 16. Our goal was to follow the original implementation as closely as possible; however, we introduced a few changes, as we use audio with a higher sampling rate to generate noise levels. We discuss the effect of those changes on the parameters of the context fingerprinting algorithm. Third, we present our implementation of the context fingerprinting algorithm in Section A.3.1.
|length of the context snapshot in seconds|
|a new snapshot is recorded every seconds|
|sampling rate of recorded audio in kHz|
|measurement window in seconds|
|relative threshold for fingerprint generation|
|absolute threshold for fingerprint generation|
|, sec||, sec||, kHz||, sec|
As shown in Table 16, our implementation differs with respect to these parameters from the implementation by Miettinen et al. (Miettinen et al., 2014). First, we use audio with a higher sampling rate (16 vs. 8 kHz) to generate noise levels. The noise levels are generated by averaging absolute amplitudes of audio samples over the duration of the measurement window. Thus, for a measurement window of 1 second, we obtain one noise level from 16000 audio samples, whereas the original implementation computes one noise level from only 8000 audio samples, which makes our noise levels more fine-grained. The original implementation uses two different measurement windows: 0.1 and 1 sec. The shorter measurement window speeds up the fingerprint generation but may be susceptible to synchronization issues, thus we opt for the longer measurement window. For luminosity measurements we do not use the measurement window. We collect luminosity readings at 10 samples per second and use all samples generated during the context snapshot length to obtain the fingerprint. Second, we evaluate the context fingerprinting algorithm on context snapshots of different lengths from 5 to 120 seconds. Thus, we can assess the performance of the scheme (i.e. distinguishing between colocated and non-colocated devices) on shorter context snapshots.
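The noise level computation described above amounts to a windowed average of absolute amplitudes, sketched below in Python. The function name is hypothetical, and dropping a trailing incomplete window is an assumption.

```python
def noise_levels(samples, sampling_rate_hz, window_sec):
    """Average absolute amplitude over consecutive, non-overlapping windows.

    A trailing window with fewer samples than a full window is dropped
    (an assumption of this sketch).
    """
    window = round(sampling_rate_hz * window_sec)  # samples per noise level
    return [sum(abs(v) for v in samples[k:k + window]) / window
            for k in range(0, len(samples) - window + 1, window)]

# Toy example: 2 samples per second and a 1-second window.
levels = noise_levels([1, -1, 2, -2, 3, -3, 9], sampling_rate_hz=2, window_sec=1)
```

At 16 kHz, a 1-second window corresponds to 16000 samples per noise level.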
A.3.1. Implementation of the context fingerprinting algorithm
As input we have sets of noise level and luminosity measurements generated from context information collected in our scenarios (i.e. car and office) as stated above. The number of bits in the resulting context fingerprints is given by and , where denotes the set cardinality.
The context snapshot for a timeslot consists of all measurements taken in the timeslot of seconds, . For each context fingerprint the average value is computed as:
Each set of measurements ( or ) can be represented as a sequence of context snapshots . Then the fingerprint bit which corresponds to each snapshot is generated as:
We note that the values for and (cf. Table 16) are not given in the original paper but were provided by the authors in private communication.
The resulting fingerprint for the set of measurements ( or ) is obtained as:
To avoid using fingerprints that are exclusively zero in times of low ambient noise and light, Miettinen et al. proposed an extension to their system: they propose to compute the surprisal of a fingerprint before using it. The surprisal of a single bit of the fingerprint is defined as its self-information , measured in bits:
The surprisal of the whole fingerprint is the sum of the surprisal of its individual bits:
Calculating this surprisal requires knowledge about how often bits occur at specific positions of the fingerprint during specific times of the day, indicated as in the formula. Miettinen et al. do not state the time resolution, but it is implied that the probabilities are tracked on a per-hour basis. For the office scenario, which covers multiple days, we track the probabilities independently for the individual days, i.e., fingerprints generated on weekdays do not influence the probabilities and thus surprisals for the weekend.
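The surprisal computation can be sketched in Python as follows. The sketch uses the standard definition of self-information, the negative base-2 logarithm of the probability of the observed bit value, and assumes that the probability of observing a 1 at each fingerprint position (for the relevant hour and day) has already been estimated; the names and the handling of zero probabilities are assumptions.

```python
import math

def bit_surprisal(bit, p_one):
    """Self-information in bits of observing `bit`, given P(bit = 1) = p_one."""
    p = p_one if bit == 1 else 1.0 - p_one
    if p == 0:
        return math.inf  # an outcome never observed before is maximally surprising
    return -math.log2(p)

def fingerprint_surprisal(bits, p_ones):
    """Surprisal of a fingerprint: the sum of the surprisal of its bits."""
    return sum(bit_surprisal(b, p) for b, p in zip(bits, p_ones))

# An all-zero fingerprint at a time when ones never occur carries no surprisal,
# whereas the same fingerprint is worth 4 bits when zeros and ones are equally likely.
night = fingerprint_surprisal([0, 0, 0, 0], [0.0, 0.0, 0.0, 0.0])
day = fingerprint_surprisal([0, 0, 0, 0], [0.5, 0.5, 0.5, 0.5])
```

This illustrates why low-entropy fingerprints generated at night fall below a surprisal threshold and are rejected for pairing.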
Miettinen et al. propose to set a surprisal threshold that the surprisal of a fingerprint has to exceed in order to be considered valid for pairing. This avoids the problem of attacks by adversaries guessing the low-entropy fingerprints generated at night. The threshold is defined as
denotes the number of incorrect bits the fuzzy commitment will tolerate and denotes an extra security margin. However, the authors do not state how this margin should be chosen. Our margin choice is described in Section 4.3.
A.4. Truong et al.
The scheme by Truong et al. (Truong et al., 2014) uses WiFi, Bluetooth, GPS and ambient audio collected by two devices to compute a number of context features, which are then fed into a machine learning classifier that outputs a prediction if these devices are colocated. This scheme is designed to provide colocation evidence to thwart relay attacks on wireless channels between a user’s device and terminal, which employ zia (e.g. unlock a computer if a user’s smartphone is nearby). In this work, we focus on computing context features and obtaining classification results from the machine learning algorithms and we do not target the specific use case of thwarting relay attacks.
We first provide notations adopted from the original paper in Table 17. Second, we describe how different context features are computed. Third, we provide details of our machine learning methodology, where we discuss our datasets, the parameters of machine learning algorithms that we use and the evaluation procedure.
Due to a lack of GPS support in the used hardware, we were unable to collect GPS information. However, since our office scenario is static and the car scenario mostly considers geographically close cars, the information value of the GPS features would have been low. In addition, the original authors report that the GPS feature contains the least amount of discriminative power in their dataset.
A.4.1. Non-audio features
Table 17. Notation adopted from Truong et al.
| Description |
|---|
| Identifier of the i-th beacon observed by a device |
| Signal strength of the i-th beacon observed by a device |
| Value substituted for missing signal strengths |
| Set of records sensed by a device |
| Number of different beacons observed by a device |
| Beacons seen by both devices |
| Beacons seen by either device, with the substitute value used for missing signal strengths |
| Input audio snippets |
| Length of input audio snippets in seconds |
| Sampling rate of input audio snippets in kHz |
The features for WiFi, Bluetooth and GPS are defined over a number of sets. Individual samples for each context information are defined as a tuple of the identifier of the observed beacon (i.e., BLE MAC address, WiFi BSSID) and the received signal strength. For each of the two devices, the notation distinguishes the set of records observed by that device and the number of unique beacons it observed. The notation is also given in Table 17. Given these preconditions, two sets are defined: the beacons seen by both devices, and the beacons seen by at least one device, where the substitute value replaces missing signal strengths.
Truong et al. use these sets to define a total of six features, five of which we implement: the Jaccard distance, the mean Hamming distance, the Euclidean distance, the mean exponential of difference, and the sum of squared ranks. The sixth feature, subset count, is only used for the GPS data and thus omitted. The features are given by formulas whose inputs are specific to the respective context information. They use the set cardinality and the rank of each signal strength, i.e., its position in the respective device's set of signal strengths sorted in ascending order.
For WiFi, all features are used. The signal strength for each observed identifier is set to the average observed signal strength for that identifier over all included scans. The value substituted as signal strength for beacons that have been observed by one device but not the other is set to -100. For BLE, features 1 and 3 (the Jaccard distance and the Euclidean distance) are used, once again using the average observed signal strength for each identifier.
If neither device observes any beacons, the distances are not defined, and the original paper does not specify a behavior for this case. In private communication, the authors recommended choosing either zero (if the system should be biased towards accepting) or a very high number (if it should be biased towards rejecting). In our case, we chose to replace undefined values with a very high distance to bias the system towards rejecting when in doubt.
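As an illustration, the five implemented features can be sketched as follows. This is a minimal sketch, not the original implementation: it assumes the common textbook forms of these distances (Jaccard distance over beacon identifiers, mean absolute difference over the union with missing values substituted, and Euclidean distance, mean exponential of the absolute difference, and sum of squared rank differences over the common beacons). The names `beacon_features`, `ranks`, `THETA`, and `UNDEFINED`, the concrete "very high" value, and the handling of an empty intersection are hypothetical.

```python
import math

THETA = -100.0      # substitute signal strength (value taken from the text)
UNDEFINED = 1e6     # hypothetical "very high" distance for undefined features
FEATURES = ("jaccard", "mean_hamming", "euclidean", "mean_exp_diff", "sum_sq_ranks")


def ranks(scan):
    """Map each beacon id to the 1-based rank of its signal strength (ascending)."""
    ordered = sorted(scan, key=lambda beacon: scan[beacon])
    return {beacon: pos for pos, beacon in enumerate(ordered, start=1)}


def beacon_features(scan_a, scan_b, theta=THETA, undefined=UNDEFINED):
    """scan_a, scan_b: dicts mapping beacon id -> average signal strength."""
    union = set(scan_a) | set(scan_b)
    common = set(scan_a) & set(scan_b)
    if not union:
        # Neither device saw any beacon: bias towards rejecting.
        return dict.fromkeys(FEATURES, undefined)

    features = {
        "jaccard": 1.0 - len(common) / len(union),
        "mean_hamming": sum(
            abs(scan_a.get(b, theta) - scan_b.get(b, theta)) for b in union
        ) / len(union),
    }
    if not common:
        # Assumption: features over the common beacons are treated as undefined.
        for name in FEATURES[2:]:
            features[name] = undefined
        return features

    diffs = [abs(scan_a[b] - scan_b[b]) for b in common]
    rank_a, rank_b = ranks(scan_a), ranks(scan_b)
    features["euclidean"] = math.sqrt(sum(d * d for d in diffs))
    features["mean_exp_diff"] = sum(math.exp(d) for d in diffs) / len(diffs)
    features["sum_sq_ranks"] = sum((rank_a[b] - rank_b[b]) ** 2 for b in common)
    return features
```

In this sketch, each scan is reduced to one average signal strength per identifier before the features are computed, matching the averaging over all included scans described above.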
A.4.2. Audio features
Truong et al. use two audio features: the maximum cross-correlation and the time-frequency distance, both computed on snippets of ambient audio of a fixed length in seconds. The authors do not provide the sampling rate of their audio snippets; in our implementation, we use the sampling rate of our audio recordings (cf. Table 19). We compute these context features on audio snippets of different lengths from 5 to 120 seconds. In the end, we create and evaluate two different datasets for machine learning, each using a different snippet length.
In the following, we explain how the maximum cross-correlation and time-frequency distance are computed. We note that this information is not available in the original paper and was obtained via private communication with the authors.
As input, we have two aligned audio snippets of equal length, recorded at the same sampling rate.
Both snippets are first normalized: each snippet is divided by the square root of its energy, i.e., the square root of the inner product of the snippet with itself.
The maximum cross-correlation between the normalized audio snippets is the maximum absolute value of their cross-correlation over all considered lags, where the maximum lag is set to the default value (The MathWorks, Inc., 2018b). The resulting maximum cross-correlation is bounded within the range [0, 1], because we take the absolute value.
To compute the frequency distance between the audio snippets, an FFT weighted by a Hamming window is applied to each snippet. Since the FFT is symmetric, only half of the FFT values are taken, in absolute value, to construct one frequency vector per snippet. The frequency vectors are then normalized in the same way as the audio snippets above.
The frequency distance between the audio snippets is computed from the normalized frequency vectors using element-wise multiplication, the time distance is computed from the normalized audio snippets, and the time-frequency distance combines the two.
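The audio pipeline can be sketched end to end as below. The normalization, the absolute cross-correlation over all lags, and the Hamming-windowed half spectrum follow the description above. The three distance formulas are assumptions, since they are not reproduced in the text: the time distance is taken as one minus the maximum cross-correlation, the frequency distance as the Euclidean norm of the difference between the normalized frequency vectors, and the time-frequency distance as the square root of the sum of their squares. All function names are hypothetical, and the naive DFT is used only to keep the sketch dependency-free.

```python
import cmath
import math


def normalize(x):
    """Scale a signal to unit energy (divide by the square root of its energy)."""
    energy = sum(v * v for v in x)
    return [v / math.sqrt(energy) for v in x] if energy > 0 else list(x)


def max_cross_correlation(x, y):
    """Maximum absolute cross-correlation over lags -(N-1)..(N-1)."""
    xn, yn = normalize(x), normalize(y)
    n = len(xn)
    best = 0.0
    for lag in range(-(n - 1), n):
        acc = sum(xn[i] * yn[i - lag] for i in range(max(0, lag), min(n, n + lag)))
        best = max(best, abs(acc))
    return best


def frequency_vector(x):
    """Normalized magnitudes of the first half of the Hamming-windowed DFT."""
    n = len(x)  # requires n >= 2
    windowed = [v * (0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)))
                for i, v in enumerate(x)]
    mags = [abs(sum(w * cmath.exp(-2j * math.pi * k * i / n)
                    for i, w in enumerate(windowed)))
            for k in range(n // 2)]
    return normalize(mags)


def time_frequency_distance(x, y):
    """Combine the assumed time and frequency distances of two equal-length snippets."""
    d_time = 1.0 - max_cross_correlation(x, y)  # assumed form of the time distance
    fx, fy = frequency_vector(x), frequency_vector(y)
    # Assumed form of the frequency distance: Euclidean norm of the difference.
    d_freq = math.sqrt(sum((a - b) * (a - b) for a, b in zip(fx, fy)))
    return math.hypot(d_time, d_freq)
```

Both loops are quadratic in the snippet length, so for snippets of many seconds an FFT-based cross-correlation and a library FFT would be the practical choice; the structure of the computation stays the same.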
A.4.3. Machine learning
Truong et al. use the MultiBoost algorithm (Webb, 2000), as implemented in Weka (Hall et al., 2009), with grafted C4.5 decision trees (Webb, 1999) as weak learners in their evaluation. As Weka does not support large datasets, we chose to use the H2O framework (team, 2015) instead. For the training of the classifiers, we set the seed to 1619 and the early stopping to 5 rounds. This means that the training is repeatable when using the same seed and dataset, and that the system considers learning complete once no improvements have been made for five iterations. We let H2O train a set of independent models and perform a hyperparameter search to optimize the parameters (e.g., the number of trees in the random forest) for the dataset, maximizing the cross-validated AUC. Afterwards, we select the top-performing model and determine its EER as described in Section 4.4.
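To make the last step concrete, the following is a minimal sketch of how an equal error rate could be read off a classifier's scores. It is not the procedure of Section 4.4, which remains authoritative; it simply sweeps a decision threshold and reports the point where the false accept rate and the false reject rate are closest. The function name `equal_error_rate` and the convention that higher scores mean "colocated" are assumptions.

```python
def equal_error_rate(scores, labels):
    """Approximate EER by sweeping a threshold over the observed scores.

    scores: classifier outputs, higher = more likely colocated (assumption).
    labels: True for colocated pairs, False for non-colocated pairs.
    """
    positives = [s for s, is_pos in zip(scores, labels) if is_pos]
    negatives = [s for s, is_pos in zip(scores, labels) if not is_pos]
    if not positives or not negatives:
        raise ValueError("need at least one sample of each class")

    best_gap, eer = float("inf"), None
    for threshold in sorted(set(scores)) + [float("inf")]:
        # False accepts: non-colocated pairs scored at or above the threshold.
        far = sum(s >= threshold for s in negatives) / len(negatives)
        # False rejects: colocated pairs scored below the threshold.
        frr = sum(s < threshold for s in positives) / len(positives)
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer
```

With finitely many samples the two error rates rarely coincide exactly, so the sketch averages them at the threshold where their gap is smallest.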
A.5. Shrestha et al.
The scheme by Shrestha et al. (Shrestha et al., 2014) utilizes ambient temperature, humidity, pressure, and precision gas collected by two devices to compute a number of context features, which are then fed into a machine learning classifier that predicts whether these devices are colocated. Similarly to Truong et al., this scheme addresses relay attacks by providing colocation evidence between two devices involved in ZIA. In this work, we focus on computing context features and obtaining classification results from the machine learning algorithms; we do not target the specific use case of thwarting relay attacks.
We first provide notations adopted from the original paper in Table 18. Second, we describe how different context features are computed. Third, we provide details of our machine learning methodology, where we discuss our datasets, the parameters of machine learning algorithms that we use and the evaluation procedure.
Due to a lack of hardware support, we were unable to collect precision gas and thus omit this context feature.
Table 18. Notation adopted from Shrestha et al.
| Description |
|---|
| Sample of context information collected by a device |
| Distance between the samples of the two devices |
A.5.1. Context features
Before computing context features, the authors convert ambient pressure in millibars to altitude in meters.
For each of the considered context information (ambient temperature, humidity, and altitude), the context feature is given by the absolute difference between the two samples of context information collected by the two devices at the same point in time.
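A minimal sketch of these features is given below. The conversion formula is not reproduced in the text, so the sketch assumes the widely used NOAA pressure-altitude formula (result in feet, converted to meters); whether this matches the authors' formula exactly is an assumption. The function names and the dictionary keys are hypothetical.

```python
def pressure_to_altitude_m(pressure_mbar):
    """Pressure altitude in meters (NOAA formula; an assumption, see above)."""
    altitude_ft = (1.0 - (pressure_mbar / 1013.25) ** 0.190284) * 145366.45
    return altitude_ft * 0.3048


def context_features(sample_a, sample_b):
    """Absolute differences between two devices' samples taken at the same time.

    Each sample is a dict with the keys 'temperature', 'humidity', and
    'pressure' (pressure in millibars).
    """
    features = {
        name: abs(sample_a[name] - sample_b[name])
        for name in ("temperature", "humidity")
    }
    features["altitude"] = abs(
        pressure_to_altitude_m(sample_a["pressure"])
        - pressure_to_altitude_m(sample_b["pressure"])
    )
    return features
```

Converting pressure to altitude before taking the difference means the third feature is expressed in meters rather than in millibars.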
A.5.2. Machine learning
Appendix B Study design
Table 19 presents the hardware used to collect context information in the car, office, and mob/het scenarios.
| Sensor type | | Samsung Galaxy S6 | Samsung Gear S3 | RuuviTag+ |
|---|---|---|---|---|
| Audio | 16 kHz | 16 kHz | 16 kHz | - |
| Barometric pressure | 10 Hz | 5 Hz | 10 Hz | 10 Hz |
| Humidity | 10 Hz | - | - | 10 Hz |
| Luminosity | 10 Hz | 5 Hz | 10 Hz | - |
| Temperature | 10 Hz | - | - | 10 Hz |
| BLE beacons | 0.1 Hz | 0.1 Hz | 0.1 Hz | - |
| WiFi beacons | 0.1 Hz | 0.1 Hz | 0.1 Hz | - |
| Accelerometer | 10 Hz | 50 Hz | 50 Hz | - |
| Gyroscope | 10 Hz | 50 Hz | 50 Hz | - |
| Magnetometer | 10 Hz | 50 Hz | 50 Hz | - |

Each column is a sensing device; cells give the sampling rate. - = sensor not available
Figure 10 shows the route the cars took during the car scenario (cf. Section 3.2). The route covers city traffic, country roads and highways between the cities of Darmstadt and Frankfurt in the state of Hesse in Germany (the actual GPS traces can be found in (Fomichev et al., 2019b)).
| ID | Office 1 | ID | Office 2 | ID | Office 3 |
|---|---|---|---|---|---|
| 01 | Near WiFi access point (h) | 09 | Screen of User 2 (m) | 17 | Wall behind Users 2 and 3 (h) |
| 02 | Window sill (m) | 10 | Window sill (m) | 18 | Window sill (m) |
| 03 | Above door to Office 2 (h) | 11 | Above door to Office 1 (h) | 19 | Lamp above User 1 (h) |
| 04 | Lamp above User 1 (h) | 12 | Lamp above User 1 (h) | 20 | Screen of User 1 (m) |
| 05 | Screen of User 1 (m) | 13 | Right screen of User 1 (m) | 21 | Screen of User 2 (m) |
| 06 | Screen of User 2 (m) | 14 | Left screen of User 1 (m) | 22 | Screen of User 3 (m) |
| 07 | In the cupboard (h) | 15 | In the cupboard (l) | 23 | Shelf next to the door (m) |
| 08 | Wall next to the door (h) | 16 | Shelf left of the door (m) | 24 | In the cupboard (h) |

h = high position; m = medium position; l = low position
| ID | Office 1 | ID | Office 2 | ID | Office 3 |
|---|---|---|---|---|---|
| 01 | Screen of User 1 (m) | 11 | Screen left from User 3 (m) | 18 | Screen in front of User 4 (m) |
| 02 | Screen of User 2 (m) | 12 | Screen of User 3 (m) | 19 | Screen of User 4 (m) |
| 03 | Near a power plug (l) | 13 | Near a power plug (l) | 20 | Near a power plug (l) |
| 04 | On top of robot station (l) | 14 | Near a fan (l) | 21 | On top of coffee machine (m) |
| 05 | Smartphone of User 1 | 15 | Smartphone of User 3 | 22 | Smartphone of User 4 |
| 06 | Smartwatch of User 1 | 16 | Smartwatch of User 3 | 23 | Smartwatch of User 4 |
| 07 | Laptop of User 1 | 17 | Laptop of User 3 | 24 | Laptop of User 4 |
| 08 | Smartphone of User 2 | | | | |
| 09 | Smartwatch of User 2 | | | | |
| 10 | Laptop of User 2 | | | | |
| 25 | Smartphone on top of robot | | | | |

h = high position; m = medium position; l = low position; entries without a position marker are mobile devices
References
- ANSI/ASA S1.11 (2004) ANSI/ASA S1.11 2004. Specification for Octave-Band and Fractional-Octave-Band Analog and Digital Filters. Standard. American National Standards Institute.
- Benureau and Rougier (2018) Fabien C. Y. Benureau and Nicolas P. Rougier. 2018. Re-run, Repeat, Reproduce, Reuse, Replicate: Transforming Code into Scientific Contributions. Frontiers in Neuroinformatics 11 (2018), 69.
- Breiman (2001) Leo Breiman. 2001. Random Forests. Machine Learning 45, 1 (2001), 5–32.
- Brüsch et al. (2018) Arne Brüsch, Ngu Nguyen, Dominik Schürmann, Stephan Sigg, and Lars Wolf. 2018. On the Secrecy of Publicly Observable Biometric Features: Security Properties of Gait for Mobile Device Pairing. CoRR abs/1804.03997 (2018).
- Elkhodr et al. (2016) Mahmoud Elkhodr, Seyed Shahrestani, and Hon Cheung. 2016. The Internet of Things: New Interoperability, Management and Security Challenges. International Journal of Network Security and its Applications 8, 2 (2016), 85–102.
- Fernández-Delgado et al. (2014) Manuel Fernández-Delgado, Eva Cernadas, Senén Barro, and Dinani Amorim. 2014. Do We Need Hundreds of Classifiers to Solve Real World Classification Problems. J. Mach. Learn. Res 15, 1 (2014), 3133–3181.
- Fomichev et al. (2018) Mikhail Fomichev, Flor Álvarez, Daniel Steinmetzer, Paul Gardner-Stephen, and Matthias Hollick. 2018. Survey and Systematization of Secure Device Pairing. IEEE Communications Surveys Tutorials 20, 1 (2018), 517–550.
- Fomichev et al. (2019a) Mikhail Fomichev, Max Maass, Lars Almon, Alejandro Molina, and Matthias Hollick. 2019a. Audio Data from Mobile Scenario from “Perils of Zero-Interaction Security in the Internet of Things”. https://doi.org/10.5281/zenodo.2537984
- Fomichev et al. (2019b) Mikhail Fomichev, Max Maass, Lars Almon, Alejandro Molina, and Matthias Hollick. 2019b. Index of Supplementary Files from “Perils of Zero-Interaction Security in the Internet of Things”. https://doi.org/10.5281/zenodo.2537721
- Friedman (2001) Jerome H Friedman. 2001. Greedy Function Approximation: a Gradient Boosting Machine. Annals of statistics (2001), 1189–1232.
- Futurae Technologies AG (2017) Futurae Technologies AG. 2017. Futurae Authentication Suite. https://www.futurae.com/product/strongauth/ [Online, Accessed 2018-04-25].
- Hall et al. (2009) Mark A Hall, Eibe Frank, Geoffrey Holmes, Bernhard Pfahringer, Peter Reutemann, and Ian H Witten. 2009. The WEKA Data Mining Software: an Update. SIGKDD Explorations 11, 1 (2009), 10–18.
- Han et al. (2018) Jun Han, Albert Jin Chung, Manal Kumar Sinha, Madhumitha Harishankar, Shijia Pan, Hae Young Noh, Pei Zhang, and Patrick Tague. 2018. Do You Feel What I Hear? Enabling Autonomous IoT Device Pairing Using Different Sensor Types. In 2018 IEEE Symposium on Security and Privacy (SP). IEEE, 836–852.
- Karapanos et al. (2015) Nikolaos Karapanos, Claudio Marforio, Claudio Soriente, and Srdjan Capkun. 2015. Sound-Proof: Usable Two-Factor Authentication Based on Ambient Sound. In USENIX Security Symposium. 483–498.
- Kardous and Shaw (2014) Chucri A. Kardous and Peter B. Shaw. 2014. Evaluation of Smartphone Sound Measurement Applications. The Journal of the Acoustical Society of America 135, 4 (apr 2014), EL186–EL192.
- Kolias et al. (2017) C. Kolias, G. Kambourakis, A. Stavrou, and J. Voas. 2017. DDoS in the IoT: Mirai and Other Botnets. Computer 50, 7 (2017), 80–84. https://doi.org/10.1109/MC.2017.201
- Lu et al. (2009) Hong Lu, Wei Pan, Nicholas D. Lane, Tanzeem Choudhury, and Andrew T. Campbell. 2009. SoundSense: Scalable Sound Sensing for People-Centric Applications on Mobile Phones. In Proceedings of the 7th international conference on Mobile systems, applications, and services - Mobisys ’09. ACM Press, New York, New York, USA, 165.
- Maisonneuve et al. (2009) Nicolas Maisonneuve, Matthias Stevens, Maria E. Niessen, and Luc Steels. 2009. NoiseTube: Measuring and Mapping Noise Pollution with Mobile Phones. In Information technologies in environmental engineering. Springer, 215–228.
- Mare et al. (2014) Shrirang Mare, Andrés Molina Markham, Cory Cornelius, Ronald Peterson, and David Kotz. 2014. Zebra: Zero-effort Bilateral Recurring Authentication. In Security and Privacy (SP), 2014 IEEE Symposium on. IEEE, 705–720.
- Miettinen et al. (2014) Markus Miettinen, N Asokan, Thien Duc Nguyen, Ahmad-Reza Sadeghi, and Majid Sobhani. 2014. Context-based Zero-Interaction Pairing and Key Evolution for Advanced Personal Devices. In ACM Conference on Computer and Communications Security (CCS). ACM, 880–891.
- Miluzzo et al. (2010) Emiliano Miluzzo, Michela Papandrea, Nicholas D Lane, Hong Lu, and Andrew T Campbell. 2010. Pocket, Bag, Hand, etc. - Automatically Detecting Phone Context through Discovery. PhoneSense 2010: International Workshop on Sensing for App Phones (November 2, 2010), held at ACM SenSys ’10 (Zurich, Switzerland, November 2-5, 2010) (2010), 21–25.
- Perera et al. (2014) Charith Perera, Arkady Zaslavsky, Peter Christen, and Dimitrios Georgakopoulos. 2014. Context Aware Computing for the Internet of Things: A Survey. IEEE communications surveys & tutorials 16, 1 (2014), 414–454.
- Schürmann et al. (2017) Dominik Schürmann, Arne Brüsch, Stephan Sigg, and Lars Wolf. 2017. BANDANA - Body Area Network Device-to-device Authentication Using Natural gAit. In IEEE International Conference on Pervasive Computing and Communications (PerCom). IEEE, 190–196.
- Schürmann and Sigg (2013) Dominik Schürmann and Stephan Sigg. 2013. Secure Communication Based on Ambient Audio. IEEE Transactions on mobile computing 12 (2013), 358–370.
- Shepherd et al. (2017) Carlton Shepherd, Iakovos Gurulian, Eibe Frank, Konstantinos Markantonakis, Raja Naeem Akram, Emmanouil Panaousis, and Keith Mayes. 2017. The Applicability of Ambient Sensors as Proximity Evidence for NFC Transactions. In 2017 IEEE Security and Privacy Workshops (SPW). IEEE, 179–188.
- Shrestha et al. (2016a) Babins Shrestha, Manar Mohamed, and Nitesh Saxena. 2016a. Walk-Unlock: Zero-Interaction Authentication Protected with Multi-Modal Gait Biometrics. CoRR abs/1605.00766 (2016).
- Shrestha et al. (2016b) Babins Shrestha, Manar Mohamed, Sandeep Tamrakar, and Nitesh Saxena. 2016b. Theft-Resilient Mobile Wallets: Transparently Authenticating NFC Users with Tapping Gesture Biometrics. In Proceedings of the 32nd Annual Conference on Computer Security Applications. ACM, 265–276.
- Shrestha et al. (2014) Babins Shrestha, Nitesh Saxena, Hien Thi Thu Truong, and N Asokan. 2014. Drone to the Rescue: Relay-resilient Authentication Using Ambient Multi-Sensing. In International Conference on Financial Cryptography and Data Security (FC). Springer, 349–364.
- Shrestha et al. (2018) Babins Shrestha, Nitesh Saxena, Hien Thi Thu Truong, and N Asokan. 2018. Sensor-based Proximity Detection in the Face of Active Adversaries. IEEE Transactions on Mobile Computing (2018).
- Shrestha et al. (2016c) Babins Shrestha, Maliheh Shirvanian, Prakash Shrestha, and Nitesh Saxena. 2016c. The Sounds of the Phones: Dangers of Zero-Effort Second Factor Login based on Ambient Audio. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security. ACM, 908–919.
- Sigg (2011) Stephan Sigg. 2011. Context-based Security: State of the Art, Open Research Topics and a Case Study. In Proceedings of the 5th ACM International Workshop on Context-Awareness for Self-Managing Systems. ACM, 17–23.
- Tang and Ishwaran (2017) Fei Tang and Hemant Ishwaran. 2017. Random Forest Missing Data Algorithms. Statistical Analysis and Data Mining: The ASA Data Science Journal 10, 6 (2017), 363–377.
- team (2015) The H2O.ai team. 2015. H2O: Scalable Machine Learning. http://www.h2o.ai version 22.214.171.124999 [Online, Accessed 2018-04-25].
- The MathWorks, Inc. (2018a) The MathWorks, Inc. 2018a. Bandpass IIR Filter. https://mathworks.com/help/signal/ref/designfilt.html [Online, Accessed 2018-04-25].
- The MathWorks, Inc. (2018b) The MathWorks, Inc. 2018b. Cross-correlation. https://mathworks.com/help/signal/ref/xcorr.html#bual1fd-maxlag [Online, Accessed 2018-04-25].
- Truong et al. (2014) Hien Thi Thu Truong, Xiang Gao, Babins Shrestha, Nitesh Saxena, N Asokan, and Petteri Nurmi. 2014. Comparing and Fusing Different Sensor Modalities for Relay Attack Resistance in Zero-Interaction Authentication. In IEEE International Conference on Pervasive Computing and Communications (PerCom). IEEE, 163–171.
- Webb (1999) Geoffrey I. Webb. 1999. Decision Tree Grafting from the All-tests-but-one Partition. In Proceedings of the 16th International Joint Conference on Artificial Intelligence - Volume 2 (IJCAI’99). Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 702–707.
- Webb (2000) Geoffrey I. Webb. 2000. MultiBoosting: A Technique for Combining Boosting and Wagging. Machine Learning 40, 2 (2000), 159–196.
- Xi et al. (2016) Wei Xi, Chen Qian, Jinsong Han, Kun Zhao, Sheng Zhong, Xiang-Yang Li, and Jizhong Zhao. 2016. Instant and Robust Authentication and Key Agreement among Mobile Devices. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security. ACM, 616–627.