Speaker diarization is defined as the task of assigning the utterances contained in a spoken document to their respective speakers. Two steps are involved, either sequentially or jointly: detecting the change points between speakers, and clustering the resulting speech segments so that all the utterances of a given speaker end up in the same cluster. The clustering step is usually achieved by an iterative hierarchical process, either by agglomerating the closest segments into the same cluster in a bottom-up strategy or by splitting the whole stream of utterances into smaller clusters in a top-down way.
This task is usually performed in an unsupervised way, in particular without any prior knowledge of the number of speakers. This lack of information makes the stopping criterion of the hierarchical clustering process quite critical.
Speaker diarization (sd) systems were first developed to process audio-only streams recorded in adverse – but controlled – acoustic conditions, such as telephone conversations, broadcast news or meetings. Some recent works applied them to videos whose production context is uncontrolled, and faced difficulties due to the variability of content and environment.
In , the authors apply standard sd tools to the audio source of various kinds of video documents. The reported results exhibit a Diarization Error Rate (der) much higher than in those classical application fields. The sharpest drop in performance is observed when the sd systems are applied to cartoons and movie trailers: among the possible reasons, the authors point out the high number of speakers involved in these kinds of streams, as well as the high variability of the acoustic environment (overlapping speech and music segments, sound effects). Moreover, as in most previous related works on audiovisual sd, the diarization problem is addressed there by applying audio-only systems to the audio channel of videos, without integrating any of the video-related features that could help the diarization system.
However, some recent works focus on multimodal approaches for performing speaker segmentation of video streams: in , the authors evaluate a method based on early fusion of audio and video gmms, and a classical bic-based agglomerative process on the resulting two-channel information stream. This technique is evaluated on the ami corpus  that consists of audiovisual recordings of four participants playing roles in a meeting scenario.
In this paper, we are interested in diarization of tv series, as a major basic part of a wider project aiming at automatic structuring of fictional videos.
Applying an sd system to tv series, where the number of speakers is generally higher than in full-length movies, may be expected to be quite challenging. Nevertheless, fictional films exhibit formal regularities at a visual level. For instance, dialogue scenes require that the “180-degree” convention be respected in order to preserve the visual fluidity of the exchange: so that both speakers seem to look at each other when they appear successively on the screen, the first one must look right and the second one must look left, which amounts to keeping the two cameras on the same side of an imaginary line connecting the two speakers. Such a rule results in a specific visual pattern made of two alternating, recurring shots and highly typical of a dialogue scene.
Relying on such patterns, we propose here to split the speaker diarization process into two steps when applied to fictional films: the first one consists in a local speaker diarization inside the boundaries of the visually detected dialogue scenes; the next one consists in clustering the local hypothesized speakers while preventing speakers locally assumed to be distinct from being merged into the same cluster and propagating this constraint at each iteration of the process.
Such a two-step clustering process is somewhat related to what is denoted in  as the “hybrid architecture” in the cross-show speaker diarization context. In cross-show sd, diarization is performed on a set of shows originating from the same source and possibly containing recurring speakers. The shows are first processed independently, before the resulting hypothesized speakers are clustered in a second stage. In , the authors use speaker diarization in conjunction with face clustering to identify the persons involved in a debate video: the best modality to identify a person is chosen and the identity information acquired for an instance is propagated to its whole cluster. Finally, speaker diarization has already been applied to tv series, but as a means, among other modalities, of segmenting the whole video stream into homogeneous narrative scenes. In , the performance of mono-modal and multi-modal approaches to the scene segmentation task is evaluated and compared.
In this paper, rather than using speaker diarization to structure the tv movie, we propose to use its structure, as hypothesized from visual patterns, to improve the speaker diarization of such contents. The way such visual patterns are extracted is described in Section 2. The two steps of our speaker diarization approach, as well as the acoustic features used, are described in Section 3. Experimental results are presented and discussed in Section 4.
2 Visual patterns detection
The whole video stream can be regarded as a finite sequence of fixed images (or video frames) displayed on the screen at a constant rate to simulate motion continuity. As mentioned in , a shot is defined as “an unbroken sequence of frames taken from one camera”.
As noticed in Section 1, because of technical narrative constraints, recurring and alternating shots frequently occur in the dialogue scenes of fictional movies, resulting in specific patterns.
In order to automatically extract such patterns, we then first need to split the whole video stream into shots and compare them to detect the recurring ones.
2.1 Shot segmentation and detection of similar shots
Defined by the continuity of the images it contains, a shot can also be defined, in a contrastive way, in opposition to the previous one. Shot segmentation is thus classically performed by detecting the transitions, either abrupt or gradual, between temporally contiguous shots (). Remaining marginal in tv series, gradual transitions are here discarded and only abrupt ones (or cuts) are considered.
A cut between two contiguous shots is hypothesized if two temporally adjacent images differ from each other beyond a given threshold . Similarly, the present shot and a past one are considered as similar if the difference between the first image of the former and the last image of the latter stays below another threshold .
Both tasks, shot cut detection as well as shot similarity detection, require that two images be compared. Three-dimensional histograms of the pixel values in the hsv color space are used to describe each image. However, two different images may share the same color histogram, resulting in a spurious similarity: spatial information about the color distribution is therefore reintroduced by splitting the whole image into 30 pixel blocks, each associated with its own histogram; a block-based comparison of the resulting local histograms, as described in , is then performed. The similarity between two color histograms is measured by their correlation.
The two thresholds, respectively needed for shot cut detection and for shot similarity detection, are estimated experimentally on a development set.
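As a minimal sketch of the block-based comparison described above, the following pure-Python code builds one hsv histogram per block and averages the per-block correlations. The 2×2 grid, the 4 bins per channel, the function names and the threshold argument are assumptions made for illustration only; the actual block layout and thresholds are those tuned on the development set.

```python
from math import sqrt


def block_histograms(image, grid=(2, 2), bins=4):
    """Return one flattened 3-D HSV histogram per block of `image`.

    `image` is a list of rows of (h, s, v) tuples with components in [0, 1).
    """
    height, width = len(image), len(image[0])
    rows, cols = grid
    hists = [[0] * bins ** 3 for _ in range(rows * cols)]
    for y, row in enumerate(image):
        for x, (h, s, v) in enumerate(row):
            block = (y * rows // height) * cols + (x * cols // width)
            idx = (int(h * bins) * bins + int(s * bins)) * bins + int(v * bins)
            hists[block][idx] += 1
    return hists


def correlation(a, b):
    """Pearson correlation between two histograms of equal length."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        # Degenerate flat histogram: fall back to an equality check.
        return 1.0 if a == b else 0.0
    return cov / sqrt(var_a * var_b)


def image_similarity(img1, img2, grid=(2, 2), bins=4):
    """Average the correlations of corresponding block histograms."""
    h1 = block_histograms(img1, grid, bins)
    h2 = block_histograms(img2, grid, bins)
    return sum(correlation(a, b) for a, b in zip(h1, h2)) / len(h1)


def is_cut(prev_frame, frame, cut_threshold):
    """Hypothesize a cut when two adjacent frames are not similar enough."""
    return image_similarity(prev_frame, frame) < cut_threshold
```

Two frames showing the same colors in swapped positions share the same global histogram (a single block), whereas the block-based score separates them, which is the motivation given above for reintroducing spatial information.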
2.2 Shot patterns extraction
Let be a finite set of shot labels, two shots sharing the same label if they are hypothesized as similar as stated in the subsection 2.1.
The whole movie can then be described by a finite string of shot labels, with each .
For any couple of shot labels , the following regular expression denotes a subset of the set of all the possible shot label sequences :
The set of strings denoted by such a regular expression corresponds to all the shot label sequences containing inserted between two occurrences of with a possible repetition of the alternation , whatever the previous and following shot labels. This regular expression formalizes the intuition of the “two-alternating-and-recurring-shots” pattern mentioned in Section 1 and typical of dialogue scenes.
For a given movie described by a sequence of shot labels, a set of shot patterns is extracted by considering all the couples of shot labels such that :
In other words, contains all the label pairs which occur as recurring subsequences of the form shown in Figure 1 in the whole movie sequence s.
The set of utterances covered by the pattern then consists of all those which occur whenever the two shots alternate with each other according to rule 1.
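The extraction rule can be sketched with Python's `re` module. This is only an illustration under stated assumptions: each shot label is encoded as a single character, the helper name `extract_patterns` is hypothetical, and the expression used, `a(ba)+`, is one reading of the description above (a second label inserted between two occurrences of a first one, with possible repetition of the alternation). `re.search` plays the role of the arbitrary preceding and following labels.

```python
import re
from itertools import permutations


def extract_patterns(shots):
    """Return the ordered label pairs (a, b) such that `shots` contains a b a.

    `shots` is a string with one character per shot label (an assumption
    made to keep the sketch short).
    """
    patterns = set()
    for a, b in permutations(sorted(set(shots)), 2):
        # a (b a)+ : b inserted between two occurrences of a, possibly repeated
        regex = re.escape(a) + "(?:" + re.escape(b) + re.escape(a) + ")+"
        if re.search(regex, shots):
            patterns.add((a, b))
    return patterns
```

For instance, on the label string `"xababyczc"`, the pairs `("a", "b")`, `("b", "a")` and `("c", "z")` are extracted, while the isolated labels `x` and `y` take part in no pattern.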
In order to increase the coverage of the patterns included in and reduce their sparsity, two extensions of condition 1 are introduced.
In addition to rule 1, isolated expressions of the two alternating shots of the form are taken into account, increasing the total amount of speech captured by the patterns.
The number of patterns is reduced while the average pattern coverage is increased by iteratively merging into a new pattern two patterns and with at least one label in common. As shown in Figure 2, such situations frequently occur during dialogues when one of the speakers (here the one appearing on the shots and ) is alternately filmed from two distinct cameras. The resulting pattern gathers all the utterances and covered by the merged patterns.
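The iterative merging of patterns sharing a label can be sketched as follows. The function name and the representation of a pattern as a set of shot labels are assumptions made for illustration; the merging of the associated utterance sets would follow the same grouping.

```python
def merge_patterns(patterns):
    """Iteratively merge label sets that share at least one label."""
    merged = []
    for pattern in patterns:
        current = set(pattern)
        remaining = []
        for group in merged:
            if group & current:
                current |= group  # absorb every group overlapping the new one
            else:
                remaining.append(group)
        remaining.append(current)
        merged = remaining
    return merged
```

With the pairs `("a", "b")`, `("c", "d")`, `("b", "c")` and `("e", "f")`, the first three end up in a single four-label pattern, since `("b", "c")` bridges the two earlier ones, and `("e", "f")` stays on its own.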
Table 1 reports the total coverage of the patterns extracted from the movies of our corpus (described in subsection 4.1), expressed as the ratio between the amount of speech covered by the patterns and the total amount of speech. The average duration of the speech covered by each pattern is also indicated, as well as the average number of speakers by pattern. These data are both computed by applying the basic version of the regular expression , as given in equation 1, and by using the extended expression of .
|coverage (%)||spch/patt (s.)||# of spk/patt|
As indicated in Table 1, the extracted visual patterns cover on average a bit more than half (51.99%) of the total amount of speech contained in the tv movies of our corpus.
69.85% of the patterns contain 2 speakers, 8.09% three and 22.06% only one. However, most of these one-speaker patterns correspond to short scenes, where the probability that one of the speakers remains silent increases.
Figure 3 shows the ability of such visual patterns to capture the main characters of a narrative movie: 97.96% of the characters speaking at least 5% of the time are involved in such patterns.
3 Speaker diarization
Speaker diarization is performed in two steps: speaker diarization is first achieved locally by clustering the set of utterances covered by the pattern ; in a second stage, the locally hypothesized speakers are clustered in order to merge recurring speakers.
3.1 Acoustic features
Easily available, the subtitles of the movie are used here as a way to estimate the boundaries of the corresponding speech segments. As an exact transcription of the speech uttered, the subtitles temporally match it, despite a slight and variable latency before they are displayed on the screen and after they disappear. When the latency was too large, the subtitle boundaries were manually adjusted.
Moreover, a subtitle generally corresponds to a spoken segment uttered by a single speaker; in the remaining ones, which cover two speech turns, the boundaries of each utterance are indicated, so that the whole subtitle can be split into two shorter ones.
The detection of change points between the possible audio sources, as a prerequisite of most of the diarization systems, is thus here avoided, allowing us to focus on the clustering process.
The acoustic parameterization of the resulting spoken segments is achieved by extracting 19 cepstral coefficients plus energy, completed by their first and second derivatives.
As a state of the art approach in the speaker verification field, i-vectors are used to retain the relevant acoustic information from each spoken segment (). I-vectors are extracted by using a 512-components gmm/ubm and a total variability matrix trained on a development set.
The initial set of instances to cluster is then made of 60-dimension normalized i-vectors, each corresponding to a speech segment uttered by a single speaker.
3.2 Agglomerative local clustering
A first step of agglomerative clustering is performed within each local dialogue scene as hypothesized by the use of the visual patterns described in subsection 2.2.
For the set of utterances covered by the pattern , the bottom-up clustering algorithm relies on the following:
The Mahalanobis distance is chosen as a dissimilarity measure between the i-vectors corresponding to the spoken segments, resulting in a distance matrix between the utterances contained in .
The covariance matrix used to compute the Mahalanobis distance is the within class covariance matrix of the training set, as mentioned in  and computed as follows:
where denotes the number of spoken segments of the training set, the number of speakers and the number of segments uttered by the speaker ; is the mean of the i-vectors corresponding to utterances of speaker and denotes the i-vector corresponding to the th utterance of speaker .
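For reference, a within-class covariance matrix and the Mahalanobis distance derived from it are commonly written as below. The symbol names ($n$, $S$, $n_s$, $\bar{w}_s$, $w_i^s$, $W$, $d_W$) are assumptions chosen to match the definitions given above; they are not necessarily the notation of the original equation.

```latex
W = \frac{1}{n} \sum_{s=1}^{S} \sum_{i=1}^{n_s}
    \left(w_i^s - \bar{w}_s\right)\left(w_i^s - \bar{w}_s\right)^{\top},
\qquad
d_W(w, w') = \sqrt{(w - w')^{\top} \, W^{-1} \, (w - w')}
```

Here $n$ is the number of spoken segments of the training set, $S$ the number of speakers, $n_s$ the number of segments uttered by speaker $s$, $\bar{w}_s$ the mean i-vector of speaker $s$ and $w_i^s$ the i-vector of the $i$th utterance of speaker $s$.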
Ward’s aggregation criterion is used during the agglomeration process to estimate the distance between the clusters and ; it is computed as follows:
where and are the respective mass of the two clusters, and their respective mass centers and the distance between the mass centers.
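The usual form of Ward's criterion, consistent with the quantities listed above, is given below as a reminder. The symbols ($C$, $C'$, $m_C$, $m_{C'}$, $g_C$, $g_{C'}$, $d$) are assumed names introduced for illustration.

```latex
\Delta(C, C') = \frac{m_C \, m_{C'}}{m_C + m_{C'}} \; d^{2}\!\left(g_C, g_{C'}\right)
```

Here $m_C$ and $m_{C'}$ are the masses of the two clusters, $g_C$ and $g_{C'}$ their mass centers and $d(g_C, g_{C'})$ the distance between these centers. The weighting term penalizes the merging of two large clusters more than the merging of a large cluster with a small one at the same distance.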
Finally, the Silhouette method is used to cut the dendrogram resulting from the clustering process and obtain the final partition of the spoken segments. Described in , the Silhouette method makes it possible to automatically choose a convenient partition of the instance set by evaluating the quality of each possible partition resulting from the clustering process. For a given partition, the quality measure tends to decrease if instances appear closer to another cluster than to their own, and to increase if the instances are appropriately assigned to their respective clusters.
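The selection of a partition by the Silhouette method can be sketched as follows, assuming a precomputed distance matrix. The function names are hypothetical, and the convention of a zero width for singleton clusters is the usual one, stated here as an assumption.

```python
def silhouette(dist, labels):
    """Mean silhouette width of a partition, given a full distance matrix."""
    clusters = {}
    for i, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(i)
    if len(clusters) < 2:
        raise ValueError("the silhouette needs at least two clusters")
    total = 0.0
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:
            continue  # singleton cluster: width taken as 0
        # a: mean distance to the other members of the instance's own cluster
        a = sum(dist[i][j] for j in own) / len(own)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(dist[i][j] for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(labels)


def best_partition(dist, partitions):
    """Pick, among candidate label lists, the one with the highest silhouette."""
    return max(partitions, key=lambda labels: silhouette(dist, labels))
```

In practice, the candidate partitions would be the successive levels of the dendrogram; the one maximizing the mean width is retained.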
3.3 Constrained global clustering
Once the speaker diarization is performed inside each dialogue scene, a second stage of clustering is performed in order to merge the recurring speakers.
The set of segments locally clustered as uttered by the same speaker are extracted in order to be modelled by a speaker normalized i-vector of 60 components.
The global clustering of the resulting set is performed in the same way as the local one, using the Mahalanobis distance based on the covariance matrix, Ward’s aggregation criterion and the Silhouette method to extract the final partition of speakers.
However, this second step is guided, at each agglomeration step, by the structural information given by the visual segmentation of the movie into dialogue scenes as described in Section 2: the global clustering step has to prevent speakers locally hypothesized to be distinct from being assigned to the same cluster during the iterative agglomeration process.
The integration of such a constraint in the bottom-up clustering algorithm is achieved in the following way:
In the initial matrix of the distances between the i-vectors corresponding to the locally hypothesized speakers, the distance between two instances and is set to infinity if the corresponding two speakers appear together in the same dialogue scene:
where denotes a dialogue pattern, , the set of utterances covered by the pattern and the set of utterances assigned to the speaker during the local clustering step.
The distance between the clusters and is set to infinity if at least one instance of the first cluster is located at an infinite distance from an instance of the second one:
where and denote i-vectors corresponding to hypothesized speakers.
Figure 4 illustrates both the application of these rules at the initial step of the agglomerative process and how this “different-speakers” property is inherited by the newly created cluster. Local dialogue scenes are surrounded by dotted rectangles; each node represents the th speaker hypothesized in the th dialogue; the edges between two nodes represent their distance; the absence of edge between two nodes corresponds to an infinite distance. Merging the two closest nodes and results in an isolated cluster that inherits both from the distinction between the two speakers of the first scene and from the distinction between those of the second one: the hypothesized recurring speaker in the two scenes has indeed to be different from both the speakers he is respectively talking to.
Such a “different-speaker” property, as propagated at each step of the agglomerative process, is expected to prevent the speakers involved in the same dialogue from being prematurely clustered: the background music of a dialogue may for instance hide the inter-speaker variability and cause such an early clustering.
Moreover, the main consequence of respecting such a constraint is to block the clustering process before all the instances are assigned to the same cluster. In the small example of Figure 4, only one more step of the agglomerative process could be achieved, by clustering and : the narrative structure (two dialogues with two speakers each) remains indeed compatible with such a clustering. The resulting dendrogram is then split into two distinct trees.
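The constrained agglomeration can be illustrated by the following sketch, which reproduces a situation with two dialogue scenes of two speakers each. Several simplifications are assumed: the Euclidean distance stands in for the Mahalanobis distance, the speaker vectors are two-dimensional toy points, and the function names and the `scene_of` argument are hypothetical.

```python
from itertools import combinations
from math import inf


def centroid(cluster, points):
    """Mass center of a cluster given as a list of point indices."""
    dim = len(points[0])
    return [sum(points[i][k] for i in cluster) / len(cluster) for k in range(dim)]


def ward(c1, c2, points):
    """Ward criterion between two clusters (squared Euclidean distance)."""
    g1, g2 = centroid(c1, points), centroid(c2, points)
    d2 = sum((x - y) ** 2 for x, y in zip(g1, g2))
    return len(c1) * len(c2) / (len(c1) + len(c2)) * d2


def constrained_clustering(points, scene_of):
    """Agglomerate local speakers without merging two speakers of one scene.

    points[i] is the vector of local speaker i, scene_of[i] its dialogue scene.
    Returns the list of merges and the clusters left when the process blocks.
    """
    clusters = [[i] for i in range(len(points))]
    merges = []
    while True:
        best, best_pair = inf, None
        for a, b in combinations(range(len(clusters)), 2):
            # "Different-speakers" rule: the distance is infinite as soon as
            # one member of each cluster comes from the same dialogue scene.
            if any(scene_of[i] == scene_of[j]
                   for i in clusters[a] for j in clusters[b]):
                continue
            d = ward(clusters[a], clusters[b], points)
            if d < best:
                best, best_pair = d, (a, b)
        if best_pair is None:  # all remaining pairs are at infinite distance
            return merges, clusters
        a, b = best_pair
        merges.append((clusters[a], clusters[b], best))
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
```

Because the constraint is checked on every pair of members, a newly created cluster automatically inherits the incompatibilities of all the speakers it contains, and the loop stops by itself once only mutually incompatible clusters remain.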
Figure 5 shows dendrograms corresponding to the agglomerative clustering of local speakers. The one shown at the top is obtained in a classical way, but may be difficult to cut automatically to extract the best partition of the instances. The bottom part of the figure, obtained from the same data by integrating the “different-speakers” property into the clustering process, shows five trees corresponding to five incompatible groups of speakers; each one is made of a group of narratively consistent speakers, with possibly many occurrences of the same one.
Each of these remaining trees of compatible speakers is finally cut using the Silhouette method described in subsection 3.2 and the final partition of the instance set is obtained by the union of the partitions obtained for each tree.
However, this constrained global clustering step remains dependent on the outputs of the local one. If a single speaker is wrongly split into two clusters during the local clustering step, the two resulting utterance groups will never be merged during a global clustering embedding the “different-speakers” property. Nevertheless, even during an unconstrained clustering process, such groups would be merged late, possibly after the best partition is reached.
4 Experiments and results
For experimental purposes, we acquired the first seasons of three tv series: Breaking Bad (abbreviated bb), Game of Thrones (got), and House of Cards (hoc). We manually annotated three episodes of each series by indicating shot cuts, similar shots, speech segments as well as the corresponding speakers.
The total amount of speech in these nine episodes represents a bit more than three hours (3:12).
A subset of six episodes (denoted dev) was used for development, the remaining three (denoted test) being used for testing.
4.2 Shot cut and shot similarity detection
The evaluation of shot cut detection relies on a classical F1-score () based on recall (% of retrieved cuts among the relevant ones) and precision (% of relevant cuts among the retrieved ones). For the shot similarity detection task, an analogous F1-score is used: for each shot, the list of shots hypothesized as similar to the current one is compared to the reference list of similar shots; if both lists intersect in a non-empty set, the shot is considered as correctly paired with its list. Results on dev and test sets are reported in Table 2.
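As a reminder of how these scores are obtained, the sketch below computes the F1-score from a set of retrieved cuts and a set of reference cuts, and checks whether a shot is correctly paired with its list of similar shots. The function names and the frame indices in the usage example are illustrative assumptions, not values from the corpus.

```python
def f1_score(retrieved, relevant):
    """F1-score: harmonic mean of precision and recall."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    if hits == 0:
        return 0.0
    precision = hits / len(retrieved)  # relevant items among the retrieved ones
    recall = hits / len(relevant)      # retrieved items among the relevant ones
    return 2 * precision * recall / (precision + recall)


def correctly_paired(hypothesized_similar, reference_similar):
    """A shot is correctly paired if both lists of similar shots intersect."""
    return bool(set(hypothesized_similar) & set(reference_similar))
```

With four hypothesized cuts, two of which appear among three reference cuts, precision is 1/2, recall is 2/3 and the F1-score is 4/7.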
|shot cut||shot sim|
The results obtained in both image processing tasks, particularly for shot similarity detection (F1-score amounting to 0.90), are thus expected to provide a firm basis for guiding speaker diarization of narrative movies. Precision is slightly higher than recall: some similarities between shots are missed, but there are fewer false positives. As a result, the dialogue patterns cover slightly less speech when based on automatic similarity detection (49.70% of the total amount of speech vs 51.99% when shot similarity is manually indicated) but appear highly reliable.
4.3 Local speaker diarization
The der used to evaluate the local clustering step is computed independently in each episode dialogue before averaging the obtained scores according to each dialogue duration (single-show der, as mentioned in ). The results are reported in Table 3, when using both the reference (denoted ref.) and the automatically detected (denoted auto.) similar shots. For the sake of comparison, agglomerative clustering (denoted ac), is compared to a “naive method” relying on a strong assumption of synchronization between the audio and video streams: clustering of local utterances is performed by assigning each spoken segment the label of the current shot, assuming the two alternating shots match exactly the speaker turns.
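Both the duration-weighted averaging and the naive baseline can be sketched in a few lines. The names are hypothetical, and assigning a segment to the shot that contains its midpoint is an assumption about how "the current shot" is resolved when a segment spans a shot boundary.

```python
def single_show_der(dialogues):
    """Average per-dialogue DER values weighted by dialogue duration.

    `dialogues` is a list of (der, duration_in_seconds) pairs.
    """
    total = sum(duration for _, duration in dialogues)
    return sum(der * duration for der, duration in dialogues) / total


def naive_labels(segments, shots):
    """Label each spoken segment with the shot active at its midpoint.

    `segments` is a list of (start, end) pairs; `shots` is a list of
    (start, end, label) triples assumed to cover the whole dialogue.
    """
    labels = []
    for seg_start, seg_end in segments:
        mid = (seg_start + seg_end) / 2
        labels.append(next(lab for s, e, lab in shots if s <= mid < e))
    return labels
```

The naive labelling is only as good as the synchronization between shots and speech turns: any utterance spoken while the camera shows the listener is attributed to the wrong speaker.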
|input auto.||input ref.|
The results obtained by performing an audio-based clustering of the utterances of each dialogue scene appear better than those obtained by applying the naive image-based method.
Moreover, automating the previous step degrades speaker diarization performance only slightly, which confirms the reliability of the visual modality.
4.4 Global speaker diarization
Table 4 reports the results obtained during the clustering of the local speakers, achieving the second step of the speaker diarization process.
|input auto.||input ref.||spch ref.|
|2S||cst. 2S||2S||cst. 2S||lia||lium|
Results are given both when taking as input the local speakers hypothesized in each dialogue scene during the previous step (input auto.) and when taking the real speakers manually annotated (denoted input ref.). In both cases, the second step of clustering is performed in an unconstrained way (denoted 2S), allowing any local speakers to be clustered during the agglomerative process, and in a constrained way (denoted cst. 2S), preventing speakers of the same dialogue scene from being merged. For the sake of comparison, the results of two standard speaker diarization tools (denoted lia, described in , and lium, described in  and ) are also reported: these tools receive as input all the spoken segments covered by the dialogue patterns.
Though still high, the der is generally reduced by integrating into the clustering process the structural information based on visual patterns. By stopping the clustering before all the instances can be gathered, the “different-speakers” property makes it possible to cut the resulting dendrogram at a suitable level, providing an early stopping condition for the process, when only a few mutually exclusive groups of instances remain. By contrast, unconstrained clustering has to face the critical issue of finding the optimal partition of the instances.
Table 5 reports the average number of speakers involved in the dialogue scenes considered, as hypothesized by the different systems.
As can be seen, two systems (unconstrained 2-step clustering and lia) tend to cut the clustering dendrogram at a high level, resulting in a small number of overly broad classes. Conversely, lium, by cutting the tree at a low level, overestimates the number of speakers in two cases. The constrained clustering approach (cst. 2S), resulting in disjoint dendrograms, offers a reasonable approximation of the number of speakers and prevents both early and late cuts of the clustering tree.
5 Conclusion and perspectives
In this paper, we proposed to achieve speaker diarization of narrative movies by relying on the structural information they carry. By detecting similar shots, some covering patterns, typical of dialogue scenes, can be extracted and a first step of speaker diarization can be locally performed inside the boundaries of each dialogue. A second step of clustering, aiming at detecting the recurring speakers, is then applied to the locally hypothesized speakers: at each iteration of this global clustering process, the constraint that speakers locally assumed to be different must not be clustered is propagated; as a result, the agglomerative process is blocked well before all the instances are clustered, allowing a more convenient partition of the initial set than when applying an unconstrained approach.
Despite the coverage of the visual patterns, there still remain some sparse spoken segments outside their boundaries (nearly half of the total amount of speech). A specific study of the shot patterns involved in the dialogue scenes could help to increase their coverage. The labelling of the remaining spoken segments could then be achieved by assigning them to the – possibly noisy – speaker models resulting from the sd process. Finally, visual information could be used during the local clustering of the dialogue utterances by exploiting the way the shots alternate with each other.
-  Pierre Clément, Thierry Bazillon, and Corinne Fredouille, “Speaker diarization of heterogeneous web video files: A preliminary study,” in Acoustics, Speech and Signal Processing (ICASSP), 2011 IEEE International Conference on. IEEE, 2011, pp. 4432–4435.
-  G. Friedland, H. Hung, and Chuohao Yeo, “Multi-modal speaker diarization of real-world meetings using compressed-domain video features,” in Acoustics, Speech and Signal Processing, 2009. ICASSP 2009. IEEE International Conference on, April 2009, pp. 4069–4072.
-  Jean Carletta, Simone Ashby, Sebastien Bourban, Mike Flynn, Mael Guillemot, Thomas Hain, Jaroslav Kadlec, Vasilis Karaiskos, Wessel Kraaij, Melissa Kronenthal, Guillaume Lathoud, Mike Lincoln, Agnes Lisowska, Iain McCowan, Wilfried Post, Dennis Reidsma, and Pierre Wellner, “The ami meeting corpus: A pre-announcement,” in Proceedings of the Second International Conference on Machine Learning for Multimodal Interaction, Berlin, Heidelberg, 2006, MLMI’05, pp. 28–39, Springer-Verlag.
-  Viet-Anh Tran, Viet Bac Le, Claude Barras, and Lori Lamel, “Comparing multi-stage approaches for cross-show speaker diarization.,” in INTERSPEECH, 2011, pp. 1053–1056.
-  Meriem Bendris, Benoit Favre, Delphine Charlet, Géraldine Damnati, Gregory Senay, Rémi Auguste, and Jean Martinet, “Unsupervised face identification in tv content using audio-visual sources,” in Content-Based Multimedia Indexing (CBMI), 2013 11th International Workshop on. IEEE, 2013, pp. 243–249.
-  Hervé Bredin, “Segmentation of tv shows into scenes using speaker diarization and speech recognition,” in Acoustics, Speech and Signal Processing (ICASSP), 2012 IEEE International Conference on. IEEE, 2012, pp. 2377–2380.
-  Irena Koprinska and Sergio Carrato, “Temporal video segmentation: A survey,” Signal processing: Image communication, vol. 16, no. 5, pp. 477–500, 2001.
-  Najim Dehak, Patrick Kenny, Réda Dehak, Pierre Dumouchel, and Pierre Ouellet, “Front-end factor analysis for speaker verification,” Audio, Speech, and Language Processing, IEEE Transactions on, vol. 19, no. 4, pp. 788–798, 2011.
-  Pierre-Michel Bousquet, Driss Matrouf, and Jean-François Bonastre, “Intersession compensation and scoring methods in the i-vectors space for speaker recognition.,” in INTERSPEECH, 2011, pp. 485–488.
-  Peter J Rousseeuw, “Silhouettes: a graphical aid to the interpretation and validation of cluster analysis,” Journal of computational and applied mathematics, vol. 20, pp. 53–65, 1987.
-  John S Boreczky and Lawrence A Rowe, “Comparison of video shot boundary detection techniques,” Journal of Electronic Imaging, vol. 5, no. 2, pp. 122–128, 1996.
-  Mickael Rouvier, Grégor Dupuy, Paul Gay, Elie Khoury, Teva Merlin, and Sylvain Meignier, “An open-source state-of-the-art toolbox for broadcast news diarization,” in INTERSPEECH, 2013, number EPFL-CONF-192762.
-  Simon Bozonnet, Nicholas WD Evans, and Corinne Fredouille, “The lia-eurecom rt’09 speaker diarization system: enhancements in speaker modelling and cluster purification,” in Acoustics Speech and Signal Processing (ICASSP), 2010 IEEE International Conference on. IEEE, 2010, pp. 4958–4961.
-  Sylvain Meignier and Teva Merlin, “Lium spkdiarization: an open source toolkit for diarization,” in CMU SPUD Workshop, 2010, vol. 2010.