1 Introduction
Consider the problem of sending audio messages to different listeners in a reverberant room, while making sure that each message can only be understood by its intended recipient. Importantly, no eavesdropper anywhere in the room should be able to understand any of the messages.
This problem is related to personal audio zones and sound field reproduction [1, 2, 3, 4, 5, 6, 7, 8, 9] where the goal is to reproduce different sound streams in a few predefined zones in a room, while minimizing the sound level everywhere else. In most of these approaches, however, an eavesdropper with a sensitive microphone (or a good ear) can easily understand the messages. The reason is that the loudspeakers simply reproduce linearly filtered versions of desired messages which remain highly correlated with any residual error signal.
To address the problem of private audio communication, we propose two methods. The first approach communicates audio messages at intended focusing spots by emitting appropriately filtered white Gaussian noise signals from loudspeakers. The filters are constructed such that after passing through specific sets of paths and time delays, these filtered random signals sum up coherently as they arrive at the target focusing points. On the other hand, they yield incoherent signals at locations with different sets of signal propagation paths. This is an extension of our previous work [10], in which we chopped the intended message signals to generate audio chunks which, in place of white Gaussian noise, were filtered and emitted by loudspeakers. After writing [10], we realized that using the chopped signals is not essential and that the results can in fact be improved by simply using white noise.
In our second approach, the idea is to send random noise from loudspeakers in addition to message signals, such that the noise signals add up to zero only at the intended listening points while they continue to mask the messages everywhere else. This results in the reception of clean audio messages at the focusing spots and low intelligibility at other locations. This technique is inspired by standard methods in wireless networking on jamming eavesdroppers [11, 12]. However, to the best of our knowledge, the prior works consider fading wireless channels without explicitly considering inter-symbol interference (echoes). While this could be a fair assumption for networks like WiFi where sampling times are much larger than propagation delays of wireless signals, this is not the case in room acoustics. Hence, we adapt this jamming scheme to work with long convolutional channels.
Privacy in multizone reproduction systems was first studied in [13], where the authors also use noise to mask message signals in “quiet” zones to reduce intelligibility. While their method is applicable both in anechoic and in reverberant conditions, the performance is degraded in the presence of echoes. On the other hand, as we elaborate later, our methods critically rely on echoes and multipath propagation. In particular, our solutions exploit the spatial diversity of room impulse responses (RIRs) across different locations in a room, and the redundant degrees of freedom in signal transmission provided by multiple loudspeakers. Unlike in multizone methods, however, we can only deliver messages to a small, fixed region of space. In return, we achieve good performance using a rather small number of loudspeakers and impulse response measurements (in our experiments we use only six).
The problem of jamming eavesdroppers has been studied extensively in wireless communication. The theoretical foundation was laid by Shannon [14] and later extended by [15, 16], who showed the feasibility of secrecy if the communication channel of an eavesdropper is degraded. The methods in [11, 12, 17] use artificial noise; [18] showed the possibility of secret communication as a consequence of slow wireless fading. Prior works have also looked at a related problem of eavesdropper detection [19, 20, 21].
In this paper, we empirically show that unlike traditional multi-zone sound field reproduction which is usually degraded in reverberant environments [22, 23], both of our proposed approaches give excellent results in the presence of echoes since echoes enhance spatial diversity. We derive conditions needed to generate desired messages at the focusing spots, and demonstrate both numerically and in real experiments that with six speakers and the knowledge of RIRs at the intended listening points, private audio communication is effectively achievable. In addition, we compare the robustness of the two approaches to various system failures and uncertainties.
2 Problem Formulation
Consider a system with $K$ loudspeakers, each playing a message signal to $J$ listeners. Without loss of generality, let the desired length of the signal at the $j$-th listener be $L_m$. We also assume that the room impulse response (RIR) between the $j$-th listener and the $k$-th speaker is a sequence $h_{jk}$ which is $L_h$ long.
The signal $m_j$ received by the $j$-th listener is given as a sum of convolutions:
$$m_j = \sum_{k=1}^{K} h_{jk} * x_k, \tag{1}$$
where $x_k$ is the signal transmitted by the $k$-th speaker with length $L_x$. We define the intended message vector $m$ as a concatenation of all $m_j$: $m = [m_1^\top, \ldots, m_J^\top]^\top$. Similarly, we define channel matrices of size $J L_m \times L_x$ as $H_k = [H_{1k}^\top, \ldots, H_{Jk}^\top]^\top$, where each $H_{jk}$ is a Toeplitz convolution matrix composed using $h_{jk}$. Defining $H = [H_1, \ldots, H_K]$ and $x = [x_1^\top, \ldots, x_K^\top]^\top$, (1) can be rewritten as:
$$m = Hx. \tag{2}$$
If the matrix $H$ has full row rank, we can reconstruct any desired message signals at the listeners. A well-known solution to (2) is given by $x = H^\dagger m$, where $H^\dagger$ is the pseudoinverse of $H$. Though this solution suffices for message reconstruction at the listeners, it does not enforce unintelligibility at other locations. We could, however, exploit the additional degrees of freedom provided by the nullspace of $H$ to generate a suitable $x$ that ensures signal degradation outside the target focusing spots.
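As a rough illustration of the stacked formulation, the following sketch builds a small block-Toeplitz channel matrix and computes the pseudoinverse solution. The symbol names (`H`, `x`, `m`, `J`, `K`, `L_h`, `L_x`, `L_m`), the toy sizes, and the random stand-ins for measured RIRs are assumptions made for this example only; explicit matrices are used here purely because the problem is tiny.

```python
import numpy as np

def conv_matrix(h, n_in):
    """Toeplitz matrix T with T @ x == np.convolve(h, x) for len(x) == n_in."""
    T = np.zeros((len(h) + n_in - 1, n_in))
    for i in range(n_in):
        T[i:i + len(h), i] = h
    return T

def channel_matrix(rirs, n_in):
    """Block (j, k) convolves the signal of speaker k with the RIR to listener j."""
    return np.block([[conv_matrix(h_jk, n_in) for h_jk in row] for row in rirs])

rng = np.random.default_rng(0)
J, K, L_h, L_x = 2, 6, 8, 16               # toy sizes (hypothetical)
L_m = L_x + L_h - 1                        # length of each received signal
rirs = rng.standard_normal((J, K, L_h))    # random stand-ins for measured RIRs
H = channel_matrix(rirs, L_x)              # shape (J * L_m, K * L_x)
m = rng.standard_normal(J * L_m)           # stand-in for the stacked messages
x = np.linalg.pinv(H) @ m                  # minimum-norm loudspeaker signals
print(H.shape, np.allclose(H @ x, m))
```

Because the toy matrix is wide, the pseudoinverse picks the minimum-norm solution among infinitely many; the remaining freedom is exactly the nullspace mentioned above.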
We note that for typical audio sampling rates, RIR lengths and message lengths, $H$ is far too large to compute the pseudoinverse explicitly. That is why we solve all least-squares design problems in this paper by the conjugate gradient method. Since the involved matrices are all block-Toeplitz, the conjugate gradient method can be efficiently implemented using fast Fourier transforms.
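A minimal sketch of this idea follows, assuming the channel matrix (written `H` here) is the stack of convolutions between every speaker and every listener. The matrix is never formed: `H` and its transpose are applied with FFTs, and conjugate gradients is run on the normal equations $HH^\top u = m$ with $x = H^\top u$. The function names, the choice of normal equations, and the toy sizes are assumptions for illustration, not necessarily the solver configuration used in the paper.

```python
import numpy as np

def make_operators(rirs, L_x):
    """Return functions applying H and H^T via FFTs, never forming H explicitly."""
    J, K, L_h = rirs.shape
    L_m = L_x + L_h - 1
    n_fft = 1 << (L_m - 1).bit_length()           # power of two >= L_m
    R = np.fft.rfft(rirs, n_fft, axis=-1)         # RIR spectra, shape (J, K, F)

    def H(x):                                     # (K * L_x,) -> (J * L_m,)
        X = np.fft.rfft(x.reshape(K, L_x), n_fft, axis=-1)
        Y = np.einsum('jkf,kf->jf', R, X)         # sum over speakers
        return np.fft.irfft(Y, n_fft, axis=-1)[:, :L_m].ravel()

    def Ht(y):                                    # (J * L_m,) -> (K * L_x,)
        Y = np.fft.rfft(y.reshape(J, L_m), n_fft, axis=-1)
        X = np.einsum('jkf,jf->kf', R.conj(), Y)  # correlation, sum over listeners
        return np.fft.irfft(X, n_fft, axis=-1)[:, :L_x].ravel()

    return H, Ht

def cg_min_norm(H, Ht, m, n_iter=500, tol=1e-10):
    """Solve H H^T u = m by conjugate gradients and return x = H^T u."""
    u = np.zeros_like(m)
    r = m.copy()
    p = r.copy()
    rs = r @ r
    for _ in range(n_iter):
        Ap = H(Ht(p))
        alpha = rs / (p @ Ap)
        u += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) <= tol * np.linalg.norm(m):
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return Ht(u)

rng = np.random.default_rng(0)
J, K, L_h, L_x = 2, 6, 8, 16                      # toy sizes (hypothetical)
rirs = rng.standard_normal((J, K, L_h))           # random stand-ins for RIRs
H, Ht = make_operators(rirs, L_x)
m = rng.standard_normal(J * (L_x + L_h - 1))
x = cg_min_norm(H, Ht, m)
print(np.linalg.norm(H(x) - m) / np.linalg.norm(m))
```

Each iteration costs a handful of FFTs instead of a dense matrix product, which is what makes realistic RIR and message lengths tractable.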
3 The two approaches
As per (2), the stacked loudspeaker signal $x$ can be suitably chosen to ensure that the message signals outside the focusing spots remain unintelligible. In this section, we present two methods to achieve this task, each constructing $x$ in a different way: (i) multichannel convolutional synthesis by noise (MCCS) and (ii) noise in the nullspace approach.
3.1 Multichannel convolutional synthesis by noise
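Recall from (1) that the signal arriving at each listener is $m_j = \sum_{k=1}^{K} h_{jk} * x_k$. In this first approach, we constrain $x_k$ to be a convolution of a filter $g_k$ of length $L_g$ with a signal $w_k$ of length $L_w$, drawn from standard normal distribution. This is equivalent to
$$x_k = W_k g_k, \tag{3}$$
where $W_k$ is an $L_x \times L_g$ Toeplitz convolution matrix composed using the vector $w_k$, with $L_x = L_w + L_g - 1$. We define $g = [g_1^\top, \ldots, g_K^\top]^\top$ and a block diagonal matrix $W$ as
$$W = \mathrm{diag}(W_1, \ldots, W_K).$$
Then equations in (3) can be combined for all $k$ to give $x = Wg$ and
$$m = HWg. \tag{4}$$
Given $m$ and $HW$, $g$ can be computed using the conjugate gradient method.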
This model constrains $x$ to lie on a subspace of random vectors. To understand why, consider the signal emitted by the $k$-th loudspeaker; $x_k$ can be written as
$$x_k[n] = \sum_{i=0}^{L_w-1} w_k[i]\, g_k[n-i].$$
We can interpret $x_k$ as a sum of randomly-scaled translates of filters $g_k$, where $g_k$ have been constructed such that after convolution of $x_k$ with room impulse responses, they sum up to yield the desired messages at the listeners. Thus a specific set of RIRs $\{h_{jk}\}$, corresponding to the intended listener–speaker pairs, correctly descrambles the translates. In a room with rich spatial diversity, since a location other than the intended listening points would have a rather different set of RIRs, we cannot expect the descrambling to yield the correct output. The randomness of $w_k$ then ensures non-intelligibility of the resulting signal.
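The construction can be sketched on a toy problem as follows. Each loudspeaker signal is a per-speaker noise sequence (`w[k]`) convolved with a designed filter (a slice of `g`), and the filters are obtained by least squares so that the listeners receive the message. All names, the toy sizes, the random stand-ins for RIRs, and the use of a dense least-squares solve in place of conjugate gradients are assumptions for this illustration.

```python
import numpy as np

def conv_matrix(h, n_in):
    """Toeplitz matrix T with T @ x == np.convolve(h, x) for len(x) == n_in."""
    T = np.zeros((len(h) + n_in - 1, n_in))
    for i in range(n_in):
        T[i:i + len(h), i] = h
    return T

rng = np.random.default_rng(1)
J, K, L_h, L_w, L_g = 2, 6, 8, 6, 16       # toy sizes (hypothetical)
L_x = L_w + L_g - 1                        # length of each loudspeaker signal
L_m = L_x + L_h - 1                        # length of each received signal

def channel_matrix(rirs):
    """Stack of Toeplitz blocks, one per (position, speaker) pair."""
    return np.block([[conv_matrix(h, L_x) for h in row] for row in rirs])

H = channel_matrix(rng.standard_normal((J, K, L_h)))    # intended listeners
H_e = channel_matrix(rng.standard_normal((J, K, L_h)))  # two other positions

w = rng.standard_normal((K, L_w))          # one noise sequence per speaker
W = np.zeros((K * L_x, K * L_g))           # block-diagonal noise matrix
for k in range(K):
    W[k * L_x:(k + 1) * L_x, k * L_g:(k + 1) * L_g] = conv_matrix(w[k], L_g)

m = rng.standard_normal(J * L_m)           # stand-in for the stacked messages
g, *_ = np.linalg.lstsq(H @ W, m, rcond=None)   # filters that reproduce m
x = W @ g                                  # stacked loudspeaker signals

err_listener = np.linalg.norm(H @ x - m) / np.linalg.norm(m)
err_elsewhere = np.linalg.norm(H_e @ x - m) / np.linalg.norm(m)
print(err_listener, err_elsewhere)
```

With the intended RIRs the filtered noise adds up to the message, while the same loudspeaker signals passed through different RIRs do not reproduce it.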
3.2 Noise in the nullspace
We adapt the second approach from the wireless communications literature. Concretely, $x$ is chosen as a sum of a message-carrying vector $x_m$ and a noise-like signal $x_n$, so that
$$m = Hx = H(x_m + x_n). \tag{5}$$
We construct $x_m$ and $x_n$ to satisfy $Hx_m = m$ and $Hx_n = 0$. This is achieved by choosing $x_n$ as the projection of a random noise vector on the nullspace of the channel matrix $H$, i.e., $x_n = P_{\mathcal{N}} z$, where the entries of $z$ are iid standard Gaussian and $P_{\mathcal{N}}$ is the projector on the null space of $H$.
As mentioned in Section 2, $H$ is typically large, which makes the direct computation of its nullspace a prohibitively complex task. Instead, we first find the projection $z_r$ of $z$ on the row space of $H$ by solving
$$\hat{u} = \arg\min_{u} \|H^\top u - z\|_2^2, \qquad z_r = H^\top \hat{u}. \tag{6}$$
We again use the conjugate gradient method to solve (6) using fast Fourier transforms since $H$ is block-Toeplitz. Once $z_r$ is found, the nullspace projection is simply $x_n = z - z_r$.
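The projection step can be sketched on a toy problem as below. The noise vector `z` is projected onto the row space of the channel matrix `H` by a least-squares fit with the columns of `H.T`, and the remainder is the nullspace component. The names, toy sizes, noise gain `sigma`, random stand-ins for RIRs, and the dense `lstsq` calls (in place of conjugate gradients) are assumptions for this illustration.

```python
import numpy as np

def conv_matrix(h, n_in):
    """Toeplitz matrix T with T @ x == np.convolve(h, x) for len(x) == n_in."""
    T = np.zeros((len(h) + n_in - 1, n_in))
    for i in range(n_in):
        T[i:i + len(h), i] = h
    return T

def channel_matrix(rirs, n_in):
    """Stack of Toeplitz blocks, one per (position, speaker) pair."""
    return np.block([[conv_matrix(h, n_in) for h in row] for row in rirs])

rng = np.random.default_rng(2)
J, K, L_h, L_x = 2, 6, 8, 16               # toy sizes (hypothetical)
L_m = L_x + L_h - 1
H = channel_matrix(rng.standard_normal((J, K, L_h)), L_x)    # intended listeners
H_e = channel_matrix(rng.standard_normal((J, K, L_h)), L_x)  # two other positions

m = rng.standard_normal(J * L_m)           # stand-in for the stacked messages
z = rng.standard_normal(K * L_x)           # iid Gaussian noise vector

x_m, *_ = np.linalg.lstsq(H, m, rcond=None)    # message part (minimum norm)
u, *_ = np.linalg.lstsq(H.T, z, rcond=None)    # fit z with the rows of H
z_r = H.T @ u                                  # projection on the row space
x_n = z - z_r                                  # projection on the nullspace

sigma = 3.0                                    # hypothetical noise gain
x = x_m + sigma * x_n
print(np.linalg.norm(H @ x_n), np.linalg.norm(H_e @ x_n))
```

The noise cancels at the intended listeners regardless of `sigma`, while at positions with different RIRs it survives and masks the message.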
4 Conditions for perfect reconstruction
In this section, we present the conditions needed to ensure perfect reconstruction of any set of message signals of length $L_m$ at the listeners (or any $m$) for both approaches.
4.1 Multi-channel convolutional synthesis by noise
From (4), perfect reconstruction can be achieved if the overall channel matrix $HW$ has full row rank, $\mathrm{rank}(HW) = J L_m$.
We make an assumption that the room and the loudspeaker and listener positions are generated at random so that the resulting distribution of the nullspace of $H$ is absolutely continuous with respect to the Haar measure on the Grassmannian. Then we have the following result.
Proposition 4.1. Suppose $K L_g \geq J L_m$. Then $HW$ has full row rank with probability one.
Proof. We have that $\mathrm{rank}(HW) \leq \min(\mathrm{rank}(H), \mathrm{rank}(W))$ by rank inequalities. With the conditions of the proposition, this implies that $\mathrm{rank}(HW) \leq J L_m$. The only way to have a strict inequality is that the nullspace of $H$ intersects the range of $W$ along a subspace of dimension greater than $K L_g - J L_m$. On the other hand, because the nullspace of $H$ is continuously distributed and independent from $W$, it will intersect the range of $W$ exactly along a subspace of dimension $K L_g - J L_m$ with probability one. ∎
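The dimension-counting argument can be checked numerically on small random instances. The sketch below assumes the condition amounts to the product of the channel matrix and the block-diagonal noise matrix having at least as many columns as rows; it uses an iid Gaussian matrix as a generic stand-in for the channel matrix (it is not Toeplitz), and the function name and sizes are hypothetical.

```python
import numpy as np

def conv_matrix(h, n_in):
    """Toeplitz matrix T with T @ x == np.convolve(h, x) for len(x) == n_in."""
    T = np.zeros((len(h) + n_in - 1, n_in))
    for i in range(n_in):
        T[i:i + len(h), i] = h
    return T

def rank_HW(J, K, L_m, L_w, L_g, rng):
    """Rank of H @ W for a generic wide H and block-diagonal Toeplitz W."""
    L_x = L_w + L_g - 1
    H = rng.standard_normal((J * L_m, K * L_x))   # generic stand-in, NOT Toeplitz
    W = np.zeros((K * L_x, K * L_g))
    for k in range(K):
        W[k * L_x:(k + 1) * L_x, k * L_g:(k + 1) * L_g] = conv_matrix(
            rng.standard_normal(L_w), L_g)
    return np.linalg.matrix_rank(H @ W)

rng = np.random.default_rng(3)
J, K, L_m, L_w = 2, 6, 20, 5                      # toy sizes (hypothetical)
for L_g in (4, 6, 7, 10):
    # columns: filter length, whether K*L_g >= J*L_m, rank of HW, number of rows
    print(L_g, K * L_g >= J * L_m, rank_HW(J, K, L_m, L_w, L_g, rng), J * L_m)
```

In this toy setting the rank reaches the number of rows exactly when the filters supply enough free parameters, which matches the intersection-dimension count in the proof.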
This result implies that for most setups in sufficiently reverberant rooms, we will be able to produce the desired messages at the listener positions.
4.2 Noise in nullspace approach
From (5), $H$ needs to have full row rank for perfect reconstruction of all $m$. Similarly to the previous case, since $H$ is a function of the RIRs between the speaker-listener pairs, it is not completely in the user’s control to ensure that it has full rank, as it depends on room geometry and the spatial diversity of RIRs. In practice, if we assume a randomized setup and room as in the previous section, $H$ can be expected to have full row rank with probability one.
The following conditions are necessary for $H$ to have full row rank.
(i) The number of rows of $H$ should be at least as large as the length of $m$.
(ii) There should be at least as many columns as rows in $H$.
(iii) $L_x$ needs to be greater than the highest relative time delay among each listener-speaker pair.
Proof. (i) ensures that we have sufficient samples to generate the desired message length; (ii) is elementary linear algebra; (iii) ensures that “silent” regions do not exist within a signal generated at a listening point. ∎
It should be noted that (ii) gives a lower bound on the number of speakers, $K$, needed for reconstruction, i.e., $K \geq J L_m / L_x$. This is lower than the number of speakers needed by the MCCS approach, as per Proposition 4.1.
5 Experimental Results
We evaluate the performance of the two proposed techniques using both numerical and real experiments. The numerical experiments are performed with 6 loudspeakers randomly placed in a convex simulated room of size m m having walls with absorption coefficient 0.35. RIRs between the speakers and listeners are calculated based on the image source model, using the pyroomacoustics package [24]. We perform the real experiments in an office space of size m m using two Genelec 8030B and four Genelec 8010A loudspeakers. The RIRs are measured using the exponential sine sweep technique [25]. In all experiments, the power of signals emitted by the loudspeakers is kept fixed. The intelligibility of the generated sounds is assessed using the Short-Time Objective Intelligibility (STOI) measure [26].
5.1 Numerical experiments
5.1.1 Perfect reconstruction: A case for echoes
In order to provide an insight into the importance of echoes in our solution, we first perform an experiment in a simulated anechoic room. We randomly place two listeners inside the room and calculate STOI values of the signals arriving there using the two approaches. An additional location is randomly chosen to check how degraded the audio signals appear outside the target focusing spots. We then repeat the same experiment but in the presence of echoes. Fig. 1 (a) shows that in the anechoic setting, while the signal at the first listener has a high intelligibility with STOI values close to 1 for both approaches, the signal at the second listener does not. On the other hand, Fig. 1 (b) shows that in the presence of echoes, signal intelligibility is restored at the second listener as well. This indicates that the spatial diversity provided by echoes helps in conditioning the channel matrix $H$, which in turn supports perfect reconstruction of messages at target locations.
5.1.2 Signal degradation outside focusing spots
Both Fig. 1 (a) and (b) indicate that the nullspace-based method has a greater impact on signal degradation at the location chosen outside the focusing spots. To examine this further, we calculate STOI scores at 4200 locations in a simulated reverberant room and create heat maps as shown in Fig. 1 (c) and (d). The bright spots at the locations of focusing points indicate regions of high intelligibility in both plots, whereas the relatively dark regions in Fig. 1 (d), represent lower STOI values for the nullspace approach and, thus, reduced intelligibility as compared to the MCCS approach in Fig. 1 (c).
Both methods perform signal degradation outside the focusing spots using noise vectors. To understand how these random vectors result in unintelligibility of sound, we first investigate the role of noise variance. For 100 randomly selected speaker-listener configurations, we check the impact of increasing noise variance on STOI values for both methods. Fig. 2 (a) shows a decline in median STOI scores as the input noise power is increased for the nullspace approach, whereas they do not change much for the MCCS method.
This result is not surprising because in the nullspace approach, noise is fed into the loudspeakers with the message signals in an additive sense. Thus, a deterioration of SNR and a subsequent drop in STOI is expected with an increase in noise variance. On the other hand, from Section 3.1, the signal emitted by the $k$-th loudspeaker is $x_k = W_k g_k$. In this setting, if the variance of each sample of $w_k$ is increased, the filter $g_k$ is simply scaled to preserve the original $x_k$.
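This scale invariance can be checked on a toy instance. The sketch below assumes the filters are taken as the minimum-norm least-squares solution, so that multiplying every noise sequence by a constant divides the filters by the same constant and leaves the emitted signals unchanged. All names, sizes, and the random stand-ins for RIRs are hypothetical.

```python
import numpy as np

def conv_matrix(h, n_in):
    """Toeplitz matrix T with T @ x == np.convolve(h, x) for len(x) == n_in."""
    T = np.zeros((len(h) + n_in - 1, n_in))
    for i in range(n_in):
        T[i:i + len(h), i] = h
    return T

rng = np.random.default_rng(4)
J, K, L_h, L_w, L_g = 2, 6, 8, 6, 16       # toy sizes (hypothetical)
L_x = L_w + L_g - 1
L_m = L_x + L_h - 1
rirs = rng.standard_normal((J, K, L_h))    # random stand-ins for RIRs
H = np.block([[conv_matrix(h, L_x) for h in row] for row in rirs])
w = rng.standard_normal((K, L_w))          # unit-variance noise sequences
m = rng.standard_normal(J * L_m)           # stand-in for the stacked messages

def synthesize(noise_scale):
    """Design filters for noise scaled by noise_scale; return (signals, filters)."""
    W = np.zeros((K * L_x, K * L_g))
    for k in range(K):
        W[k * L_x:(k + 1) * L_x, k * L_g:(k + 1) * L_g] = conv_matrix(
            noise_scale * w[k], L_g)
    g, *_ = np.linalg.lstsq(H @ W, m, rcond=None)   # minimum-norm filters
    return W @ g, g

x1, g1 = synthesize(1.0)
x2, g2 = synthesize(10.0)                  # noise standard deviation x10
print(np.allclose(x1, x2), np.allclose(g2, g1 / 10.0))
```

Since the emitted signals are identical for both noise levels, the signal heard anywhere in the room is identical too, which is consistent with the flat STOI trend reported for MCCS.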
We now investigate the factors that impact the jamming capability of the MCCS approach. Recall that this method involves “scrambling” of message-carrying input filters by noise, which are then descrambled at the intended locations by the correct RIR values. Thus, we expect that longer noise vectors would have a stronger impact on signal integrity when the RIR changes. To verify this claim, we vary the length of noise vectors as a proportion of a fixed length $L_x$, and calculate the STOI scores for 100 randomly chosen speaker-listener configurations. Fig. 2 (b) verifies that increasing the length of noise vectors leads to a decrease in median intelligibility scores outside the focusing spots.
These results point towards an interesting phenomenon. For the nullspace approach, the jamming capability can be improved by increasing the input noise power, which is upper bounded by the input power constraints at the loudspeaker. On the other hand, in the MCCS approach, for a fixed message length $L_m$ and fixed $L_h$, $L_x$ is fixed. Thus, jamming can be improved by increasing $L_w$ as long as $K L_g \geq J L_m$ (from Proposition 4.1).
5.1.3 Robustness to system uncertainties
Here, we assess how the reconstruction of audio messages at the target listeners is affected by system uncertainties: in particular, the impact of malfunction of a set of speakers after the appropriate loudspeaker signals $x_k$ have been estimated, and inaccuracies in the measurement of RIR values. For this, we ran simulations over 100 random speaker-listener configurations, and checked how the STOI scores were affected. Fig. 3 (a) indicates that the STOI values for the MCCS method decline less rapidly with increasing speaker drops as compared to the nullspace method. Similarly, Fig. 3 (b) indicates that errors in the knowledge of RIRs before signal transmission by the loudspeakers lead to reduced intelligibility in the focusing spots. Again, the MCCS approach shows more robustness to system errors as compared to the nullspace approach.
5.2 Experiment in a real setting
We perform an experiment to evaluate the two approaches in a real room with 6 loudspeakers and measure the STOI scores of generated sounds at 7 locations with microphones. The experimental setup is shown in Fig. 4 (a). Two microphones are chosen to be the focusing spots, and the rest are placed at increasing distances from Spot 2. Fig. 4 (b) shows the measured STOI values at these locations. The observed intelligibility at the two spots is good with high STOI values, and the signals become considerably degraded 50 cm away from the focusing spots. As expected from simulations, the nullspace approach has a stronger impact on signal degradation outside the target locations.
6 Conclusion
In this paper, we presented two approaches to address the private audio communication problem in a reverberant room. Both approaches are based on emitting noise signals from loudspeakers and then utilizing the echoes in the room to ensure that they yield intelligible messages at selected locations, while being incoherent elsewhere. Simulated and real experiments suggest that with just six loudspeakers and a few impulse response measurements, we can deliver clear audio messages at the desired locations while ensuring unintelligibility everywhere else. The experiments further suggest that the nullspace-based method is more capable of jamming locations outside the targeted focusing spots, whereas the MCCS method is more robust to errors in system design.
References
[1] M. Poletti, “An investigation of 2-D multizone surround sound systems,” in Audio Engineering Society Convention 125. Audio Engineering Society, 2008.
[2] Y. J. Wu and T. D. Abhayapala, “Spatial multizone soundfield reproduction: Theory and design,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 6, pp. 1711–1720, 2011.
[3] T. Betlehem, W. Zhang, M. A. Poletti, and T. D. Abhayapala, “Personal sound zones: Delivering interface-free audio to multiple listeners,” IEEE Signal Processing Magazine, vol. 32, no. 2, pp. 81–91, 2015.
[4] S. J. Elliott, J. Cheer, J.-W. Choi, and Y. Kim, “Robustness and regularization of personal audio systems,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 20, no. 7, pp. 2123–2133, 2012.
[5] Y. Cai, M. Wu, and J. Yang, “Sound reproduction in personal audio systems using the least-squares approach with acoustic contrast control constraint,” The Journal of the Acoustical Society of America, vol. 135, no. 2, pp. 734–741, 2014.
[6] J.-W. Choi and Y.-H. Kim, “Generation of an acoustically bright zone with an illuminated region using multiple sources,” The Journal of the Acoustical Society of America, vol. 111, no. 4, pp. 1695–1700, 2002.
[7] A. J. Berkhout, D. de Vries, and P. Vogel, “Acoustic control by wave field synthesis,” The Journal of the Acoustical Society of America, vol. 93, no. 5, pp. 2764–2778, 1993.
[8] D. B. Ward and T. D. Abhayapala, “Reproduction of a plane-wave sound field using an array of loudspeakers,” IEEE Transactions on Speech and Audio Processing, vol. 9, no. 6, pp. 697–707, 2001.
[9] W. Jin, W. B. Kleijn, and D. Virette, “Multizone soundfield reproduction using orthogonal basis expansion,” in Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on. IEEE, 2013, pp. 311–315.
[10] Y.-J. Liu, J. Casebeer, and I. Dokmanić, “Cocktails, but no party: multipath-enabled private audio,” arXiv preprint arXiv:1809.05862, 2018.
[11] R. Negi and S. Goel, “Secret communication using artificial noise,” in IEEE Vehicular Technology Conference. Citeseer, 2005, vol. 62, p. 1906.
[12] S. Goel and R. Negi, “Guaranteeing secrecy using artificial noise,” IEEE Transactions on Wireless Communications, vol. 7, no. 6, 2008.
[13] J. Donley, C. Ritz, and W. B. Kleijn, “Improving speech privacy in personal sound zones,” in Acoustics, Speech and Signal Processing (ICASSP), 2016 IEEE International Conference on. IEEE, 2016, pp. 311–315.
[14] C. E. Shannon, “Communication theory of secrecy systems,” The Bell System Technical Journal, vol. 28, no. 4, pp. 656–715, Oct 1949.
[15] I. Csiszár and J. Körner, “Broadcast channels with confidential messages,” IEEE Transactions on Information Theory, vol. 24, no. 3, pp. 339–348, 1978.
[16] A. D. Wyner, “The wire-tap channel,” The Bell System Technical Journal, vol. 54, no. 8, pp. 1355–1387, 1975.
[17] S. Goel and R. Negi, “Secret communication in presence of colluding eavesdroppers,” in Military Communications Conference, 2005. MILCOM 2005. IEEE. IEEE, 2005, pp. 1501–1506.
[18] J. Barros and M. R. Rodrigues, “Secrecy capacity of wireless channels,” in Information Theory, 2006 IEEE International Symposium on. IEEE, 2006, pp. 356–360.
[19] A. Mukherjee and A. L. Swindlehurst, “Detecting passive eavesdroppers in the MIMO wiretap channel,” in Acoustics, Speech and Signal Processing (ICASSP), 2012 IEEE International Conference on. IEEE, 2012, pp. 2809–2812.
[20] A. Chaman, J. Wang, J. Sun, H. Hassanieh, and R. Roy Choudhury, “Ghostbuster: Detecting the presence of hidden eavesdroppers,” in Proceedings of the 24th Annual International Conference on Mobile Computing and Networking. ACM, 2018, pp. 337–351.
[21] C. Stagner, A. Conrad, C. Osterwise, D. G. Beetner, and S. Grant, “A practical superheterodyne-receiver detector using stimulated emissions,” IEEE Transactions on Instrumentation and Measurement, vol. 60, no. 4, pp. 1461–1468, 2011.
[22] W. Jin and W. B. Kleijn, “Theory and design of multizone soundfield reproduction using sparse methods,” IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP), vol. 23, no. 12, pp. 2343–2355, 2015.
[23] T. Betlehem and T. D. Abhayapala, “Theory and design of sound field reproduction in reverberant rooms,” The Journal of the Acoustical Society of America, vol. 117, no. 4, pp. 2100–2111, 2005.
[24] R. Scheibler, E. Bezzam, and I. Dokmanić, “Pyroomacoustics: A Python package for audio room simulation and array processing algorithms,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 351–355.
[25] A. Farina, “Simultaneous measurement of impulse response and distortion with a swept-sine technique,” in Audio Engineering Society Convention 108. Audio Engineering Society, 2000.
[26] C. H. Taal, R. C. Hendriks, R. Heusdens, and J. Jensen, “A short-time objective intelligibility measure for time-frequency weighted noisy speech,” in Acoustics Speech and Signal Processing (ICASSP), 2010 IEEE International Conference on. IEEE, 2010, pp. 4214–4217.