Acoustic scene analysis (ASA), which aims to identify the scene in which a sound was recorded, is now a very active research area in acoustics, and it is expected that ASA will enable many useful applications such as systems monitoring elderly people or infants [1, 2], automatic surveillance systems [3, 4, 5, 6], automatic life-logging systems [7, 8, 9], and advanced multimedia retrieval [10, 11, 12, 13].
To analyze scenes from an acoustic signal, many approaches based on machine learning techniques have been proposed. For instance, Eronen et al. and Mesaros et al. have proposed methods based on spectral features, such as mel-frequency cepstral coefficients (MFCCs), and Gaussian mixture models (GMMs). Han et al., Jallet et al. [17], Kim et al., and Imoto and co-workers [8, 19] have investigated ASA utilizing intermediate feature representations based on acoustic event histograms.
ASA based on spatial information extracted from a microphone array composed of smartphones, smart speakers, and IoT devices has also been proposed [20, 21, 22]. Many of these methods extract spatial information based on observed time differences or sound power ratios between channels, and therefore, they require that the microphones are synchronized and that the microphone locations and array geometry are known. However, since the distributed microphones in multiple smartphones, smart speakers, or IoT devices are often unsynchronized and the microphone locations and array geometry are unknown, conventional methods cannot be applied to such distributed microphone arrays. To extract spatial information using unsynchronized distributed microphones whose locations and array geometry are unknown, Imoto and Ono have proposed a spatial cepstrum that can be applied under these conditions. In this approach, log-amplitudes obtained by the multiple microphones are converted to a feature vector by principal component analysis (PCA), analogously to how the cepstrum is obtained from a log spectrum.
On the other hand, the number of smartphones, smart speakers, and IoT devices that have multiple microphones has been increasing. A microphone array composed of these microphones is often partially synchronized or closely located, as shown in Fig. 1; we refer to these synchronized or closely located microphones collectively as connected microphones. The time delay or sound power ratio between channels is a significant cue for extracting spatial information even when the microphones are only partially connected; however, the conventional spatial cepstrum does not consider whether some of the microphones are partially connected.
In this paper, we propose a novel spatial feature extraction method for a distributed microphone array that can take into account whether or not microphones are partially connected. To consider whether any pairs of microphones are connected, we utilize a graph representation of the microphone connections, where the power observations and microphone connections are represented by the weights of the nodes and edges, respectively. Then, the proposed method introduces a graph Fourier transform, which enables spatial feature extraction considering the connections between microphones.
This paper is organized as follows. In section 2, the spatial cepstrum used in conventional spatial feature extraction for a distributed microphone array is introduced. In section 3, the proposed method of extracting a spatial feature for partially connected distributed microphones and the similarity of the proposed method to the conventional cepstrum and spatial cepstrum are discussed. In section 4, experiments performed to evaluate the proposed method are reported. In section 5, we conclude this paper.
II Conventional Spatial Feature Extraction for Distributed Microphones
To extract spatial information from unsynchronized distributed microphones whose locations and array geometry are unknown, the spatial cepstrum, which is a technique similar to the cepstrum feature, has been proposed.
Suppose that a multichannel observation is recorded by $M$ microphones and that $p_m(t)$ denotes the power observed by microphone $m$ at time frame $t$. In the case of unsynchronized distributed microphones, synchronization over channels is still a challenging problem and phase information may be unreliable. Therefore, the spatial cepstrum utilizes only the log-amplitude vector

$\mathbf{x}(t) = [\log p_1(t), \log p_2(t), \ldots, \log p_M(t)]^{\top},$
which is relatively robust to a synchronization mismatch. Considering that the distributed microphones may be non-uniformly located, PCA is then applied for the basis transformation of the spatial cepstrum instead of the inverse discrete Fourier transform (IDFT). Suppose that $\mathbf{R}$ is the covariance matrix of $\mathbf{x}(t)$, given by

$\mathbf{R} = \frac{1}{T} \sum_{t=1}^{T} \bigl(\mathbf{x}(t) - \bar{\mathbf{x}}\bigr)\bigl(\mathbf{x}(t) - \bar{\mathbf{x}}\bigr)^{\top},$
where $T$ is the number of time frames and $\bar{\mathbf{x}}$ is the time average of $\mathbf{x}(t)$. Since $\mathbf{R}$ is a symmetric matrix, the eigendecomposition of $\mathbf{R}$ can be represented as

$\mathbf{R} = \mathbf{E} \boldsymbol{\Lambda} \mathbf{E}^{\top},$
where $\mathbf{E}$ and $\boldsymbol{\Lambda}$ are the eigenvector matrix and the diagonal matrix of eigenvalues, respectively. The spatial cepstrum is then defined as

$\mathbf{c}(t) = \mathbf{E}^{\top} \mathbf{x}(t).$
The spatial cepstrum can extract spatial information without the microphone locations or the array geometry, although it requires training sounds to estimate the eigenvector matrix $\mathbf{E}$ by PCA. Moreover, since the spatial cepstrum does not consider whether or not the microphones are connected, observed time differences or sound power ratios between channels cannot be utilized for spatial feature extraction.
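As a concrete reference, the PCA-based construction above can be sketched in a few lines of NumPy. This is an illustrative implementation; the function name and interface are our own, not from the original work:

```python
import numpy as np

def spatial_cepstrum(log_power, n_coef=None):
    """Spatial cepstrum via PCA of per-frame log-power vectors.

    log_power: (T, M) array of M-microphone log amplitudes over T frames.
    Returns a (T, n_coef) array of spatial cepstrum coefficients.
    """
    x = log_power - log_power.mean(axis=0)      # center over time
    cov = x.T @ x / x.shape[0]                  # (M, M) covariance matrix
    eigval, eigvec = np.linalg.eigh(cov)        # eigh: ascending eigenvalues
    E = eigvec[:, np.argsort(eigval)[::-1]]     # reorder by descending variance
    c = log_power @ E                           # project: c(t) = E^T x(t)
    return c if n_coef is None else c[:, :n_coef]
```

By construction the resulting coefficients are decorrelated, with the leading coefficient carrying the largest variance, analogously to low-quefrency cepstral coefficients.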
III Spatial Feature Extraction Based on Graph Cepstrum
III-A Graph Cepstrum
We consider the situation in which a microphone array is composed of multiple generic acoustic sensors mounted on smartphones, smart speakers, or IoT devices, where some of the microphones mounted on each device are connected. To extract spatial information while considering the microphone connections, we here propose a novel spatial feature extraction method that utilizes a graph representation of the multichannel observations and microphone connections. Specifically, to extract spatial information, the proposed method performs the graph Fourier transform instead of PCA in the spatial cepstrum. This makes it possible to take into account which pairs of microphones are connected.
Consider the logarithmic powers of the observations on the graph shown in Fig. 2, where the power observations and microphone connections are represented by the weights of the nodes and edges, respectively. Here, the adjacency matrix $\mathbf{A}$ is defined as

$A_{mn} = \begin{cases} w & (\text{microphones } m \text{ and } n \text{ are connected}) \\ 0 & (\text{otherwise}), \end{cases}$
where $w$ is an arbitrary weight of the connection within the range of 0.0–1.0. We also assume the degree matrix $\mathbf{D}$, which is a diagonal matrix whose diagonal elements are given by

$D_{mm} = \sum_{n=1}^{M} A_{mn}.$
The degree $D_{mm}$ indicates the number of microphones connected with microphone $m$ (when $w = 1$). Then, the unweighted graph Laplacian is written as

$\mathbf{L} = \mathbf{D} - \mathbf{A},$
where $\mathbf{L}$ is also a symmetric matrix since both $\mathbf{D}$ and $\mathbf{A}$ are symmetric matrices. Thus, the eigendecomposition of $\mathbf{L}$ can be expressed as

$\mathbf{L} = \mathbf{U} \boldsymbol{\Lambda} \mathbf{U}^{\top},$
where $\mathbf{U}$ and $\boldsymbol{\Lambda}$ are the eigenvector matrix and the diagonal matrix whose diagonal elements are the eigenvalues in ascending order, respectively. The eigenvector matrix $\mathbf{U}$ and its transpose $\mathbf{U}^{\top}$ are the graph Fourier transform (GFT) matrix and the inverse graph Fourier transform (IGFT) matrix, respectively, which enable basis transformations that consider the connections between microphones.
Thus, the proposed spatial feature, which can take the connections between microphones into account, is defined in terms of the IGFT of the log-amplitude vector as

$\mathbf{c}(t) = \mathbf{U}^{\top} \mathbf{x}(t).$
Because this proposed spatial feature also resembles the conventional cepstrum as well as the spatial cepstrum, we call it the graph cepstrum (GC).
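The construction above amounts to a few lines of linear algebra. The following sketch (the function name and interface are ours, assuming a symmetric 0/1 adjacency matrix, i.e. $w = 1$) builds the graph Laplacian from the adjacency matrix and projects the log-power vectors onto its eigenvectors:

```python
import numpy as np

def graph_cepstrum(log_power, adjacency):
    """Graph cepstrum: IGFT of per-frame log-power vectors.

    log_power: (T, M) array of per-microphone log amplitudes.
    adjacency: (M, M) symmetric connection matrix (0/1 or weighted).
    """
    A = np.asarray(adjacency, dtype=float)
    D = np.diag(A.sum(axis=1))       # degree matrix
    L = D - A                        # (unweighted) graph Laplacian
    eigval, U = np.linalg.eigh(L)    # eigenvalues in ascending order
    return log_power @ U             # row t is c(t) = U^T x(t)
```

For a connected graph the smallest eigenvalue is 0 with a constant eigenvector, so the first-order coefficient is proportional to the average log level over all microphones.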
III-B Graph Cepstrum on a Ring Graph
Let us consider a circularly connected condition, namely the ring graph condition shown in Fig. 3. For this condition, the graph Laplacian is represented as the circulant matrix

$\mathbf{L} = \begin{bmatrix} 2w & -w & & & -w \\ -w & 2w & -w & & \\ & \ddots & \ddots & \ddots & \\ & & -w & 2w & -w \\ -w & & & -w & 2w \end{bmatrix}.$
On the basis of the fact that a circulant matrix is diagonalized by the IDFT matrix $\mathbf{F}^{-1}$, whose $(m, n)$ entry is defined by

$[\mathbf{F}^{-1}]_{mn} = \frac{1}{\sqrt{M}}\, e^{j 2\pi mn / M},$
the IGFT is identical to the IDFT.
Thus, in the case of a ring graph, the GC is identical to the definition of the cepstrum. Moreover, it is also identical to the definition of the spatial cepstrum of circularly symmetric microphones in an isotropic sound field. This means that the ring connection in the GC domain corresponds to the circularly symmetric arrangement of microphones in an isotropic sound field in the acoustic spatial condition.
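This equivalence is easy to verify numerically. The snippet below (an illustrative check, not taken from the paper) builds the ring-graph Laplacian for $M = 6$ and $w = 1$ and confirms that the unitary DFT matrix diagonalizes it, with the expected eigenvalues $2 - 2\cos(2\pi k / M)$ on the diagonal:

```python
import numpy as np

M = 6
# Ring-graph Laplacian (w = 1): degree 2 on the diagonal,
# -1 for the two cyclic neighbours of each node.
L = 2 * np.eye(M) - np.roll(np.eye(M), 1, axis=1) - np.roll(np.eye(M), -1, axis=1)

# Unitary DFT matrix: F[m, n] = exp(-2j*pi*m*n/M) / sqrt(M).
F = np.fft.fft(np.eye(M)) / np.sqrt(M)

# Because L is circulant, F L F^H is diagonal; its diagonal is the DFT of
# the first row of L, i.e. 2 - 2*cos(2*pi*k/M).
D = F @ L @ F.conj().T
```

Since the eigenvectors of the ring-graph Laplacian are the DFT basis vectors, applying $\mathbf{U}^{\top}$ and applying the IDFT give the same coefficients up to ordering.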
IV Experiments

IV-A Experimental Conditions
To evaluate the effectiveness of the proposed method for partially synchronized microphones, we conducted classification experiments on acoustic scenes in a living room. Since most of the public datasets for acoustic scene analysis, including TUT Acoustic Scenes 2017 and AudioSet, are provided in single or stereo channels, we recorded a multichannel sound dataset with 13 synchronized microphones in a real environment. The sound dataset includes nine acoustic scenes, “vacuuming,” “cooking,” “dishwashing,” “eating,” “reading a newspaper,” “operating a PC,” “chatting,” “watching TV,” and “doing the laundry,” which frequently occur in and around a living room. The microphone arrangement and the locations of the sound sources are shown in Fig. 4. The recorded sounds consisted of 257.1 min of recordings, which were randomly separated into 5,180 sound clips for model training and 2,532 sound clips for classification evaluation; no acoustic scene overlapped with another scene in any of the sound clips. To evaluate the scene classification performance under synchronization mismatch among the microphone groups, the recorded sounds for classification evaluation were misaligned by various error times among the microphone groups shown in Fig. 4. The error times were randomly sampled from a Gaussian distribution with zero mean and various variances. The other recording and experimental conditions are listed in Table I.
TABLE I: Recording and experimental conditions

Sampling rate:                48 kHz
Quantization bit rate:        16 bits
Sound clip length:            8 s
Frame length / FFT points:    20 ms / 2,048
Network structure of CNN:     3 conv. & 3 dense layers
Pooling in CNN layers:        2
Activation function:          ReLU, softmax (output layer)
# channels of CNN layers:     32, 24, 16
# units of dense layers:      128, 64, 32
IV-B Spatial Information Extracted by Graph Cepstrum
To clarify how the GC extracts spatial information, we show the IGFT matrix $\mathbf{U}^{\top}$ in Fig. 5. The $k$th-row vector of $\mathbf{U}^{\top}$ corresponds to the $k$th eigenvector of the graph Laplacian $\mathbf{L}$. The $k$th-order GC is calculated using the $k$th-row vector of $\mathbf{U}^{\top}$ as follows:

$c_k(t) = \mathbf{u}_k^{\top} \mathbf{x}(t) = \sum_{m=1}^{M} u_{k,m}\, x_m(t),$
where $c_k(t)$, $\mathbf{u}_k^{\top}$, $u_{k,m}$, and $x_m(t)$ are the $k$th-order GC, the $k$th-row vector of $\mathbf{U}^{\top}$, the $(k, m)$ entry of $\mathbf{U}^{\top}$, and the $m$th element of $\mathbf{x}(t)$, respectively. This indicates that the $k$th-order GC is obtained by a linear combination of the log-amplitudes $x_m(t)$, where $u_{k,m}$ is the weight of the linear combination. From Fig. 5, it can be interpreted that the first-order GC represents the average sound level in the whole space because all the weights are positive. For the middle-order eigenvectors, the signs of the weights between connected microphones are similar. This indicates that the GC can capture spatial information while taking the connections of microphones into account. For the higher-order eigenvectors, only the weights within part of a connected microphone group are active, and their signs differ. These eigenvectors capture spatial information of sound sources close to the corresponding microphone groups, because if the sound sources are far from a microphone group, the terms of that group cancel out in the linear combination.
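This interpretation can be illustrated on a toy connection graph. In the hypothetical four-microphone layout below (two devices with two connected microphones each and no inter-device edges, chosen for illustration only), the low-order eigenvectors assign equal weights within each device, capturing average levels, while the high-order eigenvectors are active on a single device with opposite signs, capturing within-device level differences:

```python
import numpy as np

# Two devices, two connected microphones each, no edges between devices.
A = np.zeros((4, 4))
A[0, 1] = A[1, 0] = 1.0   # device 1: microphones 0 and 1
A[2, 3] = A[3, 2] = 1.0   # device 2: microphones 2 and 3

# Graph Laplacian and its eigendecomposition (ascending eigenvalues).
L = np.diag(A.sum(axis=1)) - A
eigval, U = np.linalg.eigh(L)
# Columns of U with eigenvalue 0 are constant within each device;
# columns with eigenvalue 2 have opposite signs within one device.
```

This mirrors the behavior read off from Fig. 5: low-order GC coefficients summarize overall levels, and high-order ones contrast microphones inside a connected group.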
IV-C Acoustic Scene Classification
Acoustic scenes were then modeled and classified for each sound clip using a Gaussian mixture model (GMM), a supervised acoustic topic model (sATM) [8, 28], and a convolutional neural network (CNN). Specifically, the GMM was applied to the acoustic feature vectors for each acoustic scene $s$. After that, the acoustic scene $\hat{s}$ of a sound clip was estimated by calculating the product of the likelihoods over the sound clip as follows:

$\hat{s} = \operatorname*{argmax}_{s} \prod_{t=1}^{T_c} p\bigl(\mathbf{z}(t) \mid s\bigr),$

where $T_c$, $\mathbf{z}(t)$, and $p(\mathbf{z}(t) \mid s)$ are the number of frames in the sound clip, an acoustic feature vector calculated frame by frame, such as the GC or the spatial cepstrum, and the likelihood of acoustic scene $s$ at time frame $t$, respectively. As other methods for acoustic scene classification utilizing a distributed microphone array, we also evaluated classifiers based on late-fusion classification methods.
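The clip-level decision rule can be sketched as follows. For brevity this example fits a single Gaussian per scene, a one-component special case of the GMM classifier described above; all function and variable names are illustrative:

```python
import numpy as np

def fit_scene_models(features_by_scene):
    """Fit one Gaussian per scene (a single-component stand-in for a GMM).

    features_by_scene: dict mapping scene name -> (T, D) training frames.
    """
    models = {}
    for scene, X in features_by_scene.items():
        mu = X.mean(axis=0)
        # Small diagonal loading keeps the covariance invertible.
        cov = np.cov(X, rowvar=False) + 1e-6 * np.eye(X.shape[1])
        models[scene] = (mu, cov)
    return models

def classify_clip(models, X):
    """Pick the scene maximizing the product of frame likelihoods
    (equivalently the sum of log-likelihoods) over clip X of shape (T, D)."""
    def clip_loglik(mu, cov):
        d = X - mu
        sign, logdet = np.linalg.slogdet(cov)
        quad = np.einsum('td,dc,tc->t', d, np.linalg.inv(cov), d)
        return np.sum(-0.5 * (quad + logdet + X.shape[1] * np.log(2 * np.pi)))
    return max(models, key=lambda s: clip_loglik(*models[s]))
```

Working in log-likelihoods avoids the numerical underflow that multiplying hundreds of per-frame likelihoods would cause.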
IV-D Experimental Results
The classification performance of acoustic scenes is shown in Fig. 6. For each experimental condition, the acoustic scene modeling and classification were conducted ten times with various synchronization error times sampled randomly. These results show that when the synchronization error between microphone groups is small, the GC and conventional spatial cepstrum effectively classify acoustic scenes. When the synchronization error between microphone groups increases, the scene classification performance for the GC slightly decreases. In contrast, the classification accuracy decreases rapidly when using conventional methods. This indicates that the proposed GC is more robust against synchronization error than conventional methods.
V Conclusion

In this paper, we proposed an effective spatial feature extraction method for acoustic scene analysis using partially synchronized or closely located distributed microphones. In the proposed method, we derived the graph cepstrum (GC), which is defined as the inverse graph Fourier transform of the logarithmic power of a multichannel observation. We then demonstrated that the GC on a ring graph is identical to the conventional cepstrum and to the spatial cepstrum for a circularly symmetric microphone arrangement in an isotropic sound field. Our experimental results using real environmental sounds showed that the GC classifies acoustic scenes more robustly than conventional spatial features even when the synchronization mismatch between partially synchronized microphone groups is large.
Part of this work was supported by the Support Center for Advanced Telecommunications Technology Research, Foundation.
-  Y. Peng, C. Lin, M. Sun, and K. Tsai, Proc. IEEE International Conference on Multimedia and Expo (ICME), pp. 1218–1221, 2009.
-  P. Guyot, J. Pinquier, and R. André-Obrecht, “Water sound recognition based on physical models,” Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 793–797, 2013.
-  A. Harma, M. F. McKinney, and J. Skowronek, “Automatic surveillance of the acoustic activity in our living environment,” Proc. IEEE International Conference on Multimedia and Expo (ICME), 2005.
-  R. Radhakrishnan, A. Divakaran, and P. Smaragdis, “Audio analysis for surveillance applications,” Proc. 2005 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp. 158–161, 2005.
-  S. Ntalampiras, I. Potamitis, and N. Fakotakis, “On acoustic surveillance of hazardous situations,” Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 165–168, 2009.
-  T. Komatsu and R. Kondo, “Detection of anomaly acoustic scenes based on a temporal dissimilarity model,” Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 376–380, 2017.
-  A. Eronen, V. T. Peltonen, J. T. Tuomi, A. P. Klapuri, S. Fagerlund, T. Sorsa, G. Lorho, and J. Huopaniemi, “Audio-based context recognition,” IEEE Trans. Audio Speech Lang. Process., vol. 14, no. 1, pp. 321–329, 2006.
-  K. Imoto and S. Shimauchi, “Acoustic scene analysis based on hierarchical generative model of acoustic event sequence,” IEICE Trans. Inf. Syst., vol. E99-D, no. 10, pp. 2539–2549, October 2016.
-  J. Schröder, J. Anemiiller, and S. Goetze, “Classification of human cough signals using spectro-temporal Gabor filterbank features,” Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6455–6459, 2016.
-  T. Zhang and C. J. Kuo, “Audio content analysis for online audiovisual data segmentation and classification,” IEEE Trans. Audio Speech Lang. Process., vol. 9, no. 4, pp. 441–457, 2001.
-  Q. Jin, P. F. Schulam, S. Rawat, S. Burger, D. Ding, and F. Metze, “Event-based video retrieval using audio,” Proc. INTERSPEECH, 2012.
-  Y. Ohishi, D. Mochihashi, T. Matsui, M. Nakano, H. Kameoka, T. Izumitani, and K. Kashino, “Bayesian semi-supervised audio event transcription based on Markov Indian buffet process,” Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 3163–3167, 2013.
-  J. Liang, L. Jiang, and A. Hauptmann, “Temporal localization of audio events for conflict monitoring in social media,” Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1597–1601, 2017.
-  A. Mesaros, T. Heittola, A. Eronen, and T. Virtanen, “Acoustic event detection in real life recordings,” Proc. 18th European Signal Processing Conference (EUSIPCO), pp. 1267–1271, 2010.
-  Y. Han, J. Park, and K. Lee, “Convolutional neural networks with binaural representations and background subtraction for acoustic scene classification,” Proc. the Detection and Classification of Acoustic Scenes and Events Workshop (DCASE), pp. 1–5, 2017.
-  H. Jallet, E. Çakır, and T. Virtanen, “Acoustic scene classification using convolutional recurrent neural networks,” Proc. the Detection and Classification of Acoustic Scenes and Events Workshop (DCASE), pp. 1–5, 2017.
-  G. Guo and S. Z. Li, “Content-based audio classification and retrieval by support vector machines,” IEEE Trans. Neural Networks, vol. 14, no. 1, pp. 209–215, 2003.
-  S. Kim, S. Narayanan, and S. Sundaram, “Acoustic topic models for audio information retrieval,” Proc. 2009 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp. 37–40, 2009.
-  K. Imoto, Y. Ohishi, H. Uematsu, and H. Ohmuro, “Acoustic scene analysis based on latent acoustic topic and event allocation,” Proc. IEEE International Workshop on Machine Learning for Signal Processing (MLSP), 2013.
-  H. Kwon, H. Krishnamoorthi, V. Berisha, and A. Spanias, “A sensor network for real-time acoustic scene analysis,” Proc. IEEE International Symposium on Circuits and Systems, pp. 169–172, 2009.
-  H. Phan, M. Maass, L. Hertel, R. Mazur, and A. Mertins, “A multi-channel fusion framework for audio event detection,” Proc. 2015 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp. 1–5, 2015.
-  P. Giannoulis, A. Brutti, M. Matassoni, A. Abad, A. Katsamanis, M. Matos, G. Potamianos, and P. Maragos, “Multi-room speech activity detection using a distributed microphone network in domestic environments,” Proc. 23rd European Signal Processing Conference (EUSIPCO), pp. 1271–1275, 2015.
-  K. Imoto and N. Ono, “Spatial cepstrum as a spatial feature using distributed microphone array for acoustic scene analysis,” IEEE/ACM Trans. Audio Speech Lang. Process., vol. 25, no. 6, pp. 1335–1343, June 2017.
-  A. Ribeiro, A. G. Marques, and S. Segarra, “Graph signal processing: Fundamentals and applications to diffusion processes,” Proc. 24th European Signal Processing Conference (EUSIPCO), 2016.
-  G. Golub and C. Van Loan, Matrix Computations, Johns Hopkins University Press, 1996.
-  A. Mesaros, T. Heittola, A. Diment, B. Elizalde, A. Shah, E. Vincent, B. Raj, and T. Virtanen, “DCASE 2017 challenge setup: Tasks, datasets and baseline system,” Proc. the Detection and Classification of Acoustic Scenes and Events Workshop (DCASE), pp. 85–92, 2017.
-  J. F. Gemmeke, D. P. Ellis, D. Freedman, A. Jansen, W. Lawrence, R. C. Moore, M. Plakal, and M. Ritter, “Audio set: An ontology and human-labeled dataset for audio events,” Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 776–780, 2017.
-  K. Imoto and N. Ono, “Acoustic scene classification based on generative model of acoustic spatial words for distributed microphone array,” Proc. European Signal Processing Conference (EUSIPCO), pp. 2343–2347, 2017.
-  J. Kürby, R. Grzeszick, A. Plinge, and G. A. Fink, “Bag-of-features acoustic event detection for sensor networks,” Proc. the Detection and Classification of Acoustic Scenes and Events Workshop (DCASE), pp. 55–59, September 2016.