Computational argumentation and debating technologies aim to automate the extraction, understanding and generation of argumentative discourse. This field has seen a surge of research in recent years, and involves a variety of tasks over various domains, including the legal domain, scientific writing and education. Much of the focus is on argumentation mining, the detection of arguments in text and their classification [Palau and Moens2009], but many other tasks are being addressed as well, including argument stance classification [Sobhani et al.2015, Bar-Haim et al.2017], the automatic generation of arguments [Bilu and Slonim2016], identification of persuasive arguments [Wei et al.2016], quality assessment [Wachsmuth et al.2017a] and more. Multiple datasets are available for such research, mostly in English, such as the Internet Argument Corpus [Walker et al.2012], which consists of numerous annotated political discussions from internet forums, ArgRewrite [Zhang et al.2017], a corpus of argumentative essay revisions, and the datasets released by IBM Research as part of the Debater Project [Rinott et al.2015, Aharoni et al.2014]. [Lippi and Torroni2016] list several additional such datasets. Further, [Wachsmuth et al.2017b] have released an argument search engine over multiple debating websites, and [Aker and Zhang2017] have initiated the projection of some datasets to languages other than English, such as Chinese.
All of the above are based on written texts, while datasets of spoken debates, outside of the political domain, are scarce. A spoken debate differs from a written essay or discussion not only in structure and content, but also in style, as in any other case of spoken vs. written language. [Zhang et al.2016] made available transcripts from the Intelligence Squared (http://www.intelligencesquaredus.org) debating television show; the data is available at http://www.cs.cornell.edu/~cristian/debates/. The transcripts of the show are available on the show's site, and while they are of high quality, they do not match the audio recordings precisely, requiring substantial additional effort if one wishes, for example, to use them as ASR training data.
With this paper we release a dataset of 60 audio speeches, recorded specifically for debating research purposes. We describe in detail the process of producing these speeches and their automatic and manual transcripts. This is a first batch of a larger set of recordings we intend to produce and release in the future.
2 Recording the Speeches
We recorded short speeches about debatable topics, with experienced speakers. This section describes the recording process.
Recruiting and training the speakers
Our team of speakers consists of litigators and debaters, all fluent or native English speakers, experienced in arguing about any given topic. The recruitment and training of the speakers included several steps. First, we interviewed potential speakers to evaluate their ability to argue about a topic when given only a short time to prepare. Then, we provided candidates with an essay to read aloud and record. Candidates were given technical guidelines to ensure high recording quality, including microphone configuration instructions and recording best practices, such as recording in a quiet environment, using an external microphone and maintaining a fixed distance from the microphone while speaking. After listening to these recordings, we provided the speakers with feedback and repeated the process until the essay recordings were of good quality to the naked ear. Next, we provided each candidate with two motions (e.g. “we should ban boxing”) and asked them to record a spontaneous speech supporting each motion, after a 10-minute preparation.
All recordings – three per candidate (one reading and two spontaneous speeches) – were processed through automatic speech recognition and sent to manual transcription, as described in the next sections. Comparing the automatic and manual transcripts, we computed the system’s Word Error Rate (WER, the sum of the substitution, deletion and insertion error rates) for each speech, and accepted candidates for whom the WER was below a pre-defined threshold of 10%. This was done to make sure that our ASR system is reasonably successful on their speeches.
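The WER computation described above amounts to a standard word-level Levenshtein alignment between the reference and the hypothesis. The following sketch is illustrative, not the system's actual scoring code:

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / #reference words,
    computed via word-level Levenshtein alignment."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j]: minimum number of edits turning hyp[:j] into ref[:i]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # only deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # only insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(substitution, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / len(ref)
```

For example, `word_error_rate("we should ban boxing", "we should bend boxing")` yields 0.25 (one substitution out of four reference words); under the acceptance criterion above, a candidate passed if this value stayed below 0.10.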
The recording process
All speakers received a list of motions, each with an ID, a short name (to be easily identified by human readers), and background information extracted from Debatabase (http://idebate.org/debatabase) or Wikipedia. The speakers were directed to spend up to 10 minutes reviewing the motion’s topic and preparing their arguments, and then to immediately start recording themselves arguing in its favor for 4-8 minutes. The speakers were instructed not to search for further information about the topic beyond the provided description. The idea is to prevent multiple debaters who record a speech about the same motion from relying on the same resources (in particular, debating websites), which could reduce the diversity of ideas presented in the speeches. Example 1 shows part of the background information for the topic “doping in sports”.
Example 1 (Topic background information)
3 Automatic Speech Processing
Every recorded speech was automatically transcribed by a speaker-independent deep neural network ASR system. The system’s acoustic model was trained on over 1,000 hours of speech from various broadband speech corpora, including broadcast news shows, TED talks (https://www.ted.com/) and Intelligence Squared debates (we semi-automatically aligned the transcripts and the audio to overcome the inconsistency problem mentioned in Section 1). We used an n-gram language model with a vocabulary of 200K words, trained on several billion words that include transcripts of the above speech corpora and various written texts, such as news articles.
We trained speaker-independent convolutional neural network (CNN) models on 40-dimensional log-mel spectra augmented with delta and double-delta features. Each frame of speech is also appended with a context of 5 frames. The first CNN layer of the model has 512 nodes with attached convolutional filters. Outputs from this layer are then processed by a second feature-extraction layer, also with 512 nodes and its own set of filters. The outputs of the second CNN layer are finally passed to 5 fully connected layers with 2,048 hidden units each, to predict scores for 7K context-dependent states. This speaker-independent ASR system performs at 8.4% WER on average on the speeches we release with this paper.
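The front-end feature preparation (40-dimensional log-mel features with delta and double-delta, spliced with neighboring frames) can be approximated as follows. The simple difference-based deltas and the symmetric ±5-frame window are assumptions; the paper does not specify the exact delta regression or padding scheme:

```python
import numpy as np

def add_deltas(logmel):
    """Append first- and second-order time differences (delta, double-delta)
    to a (frames, 40) log-mel matrix, giving (frames, 120) features."""
    delta = np.gradient(logmel, axis=0)        # simple central differences
    double_delta = np.gradient(delta, axis=0)
    return np.concatenate([logmel, delta, double_delta], axis=1)

def splice(features, context=5):
    """Append a context of `context` frames on each side of every frame
    (edge-padded), yielding (frames, (2*context+1)*dim) CNN inputs."""
    n = len(features)
    padded = np.pad(features, ((context, context), (0, 0)), mode="edge")
    return np.concatenate(
        [padded[i:i + n] for i in range(2 * context + 1)], axis=1)

cnn_input = splice(add_deltas(np.random.randn(100, 40)))
# each of the 100 frames becomes an 11 * 120 = 1320-dimensional vector
```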
Once a speech has been automatically transcribed, we obtain a text in the format shown in Example 2. Each token (including sentence boundary and silence markers such as ~SIL) is followed by the start and end time of its utterance, in seconds, relative to the beginning of the recording segment.
This format is the basis for two versions of the data that we release for each speech: an automatically processed “clean” ASR version, and a manually transcribed one. The steps for obtaining the former are described in Section 3.1; the production of manual transcripts is described in Section 4.
Example 2 (Raw ASR output)
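Based on that description, the raw output is a flat stream of token / start-time / end-time triples. The following hypothetical parser assumes simple whitespace separation between fields:

```python
def parse_raw_asr(raw):
    """Parse raw ASR output in which every token (word, ~SIL or </s>)
    is followed by its start and end time in seconds.
    Returns a list of (token, start, end) tuples."""
    fields = raw.split()
    if len(fields) % 3 != 0:
        raise ValueError("expected token/start/end triples")
    return [(fields[i], float(fields[i + 1]), float(fields[i + 2]))
            for i in range(0, len(fields), 3)]
```

For instance, `parse_raw_asr("we 0.51 0.63 ~SIL 0.63 1.20")` yields `[("we", 0.51, 0.63), ("~SIL", 0.63, 1.2)]`.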
3.1 ASR transcripts
Removal of timing information.
Removal of non-textual tokens: Silence markers, ~SIL, appear whenever a relatively long pause has been detected in the speech; sentence boundary tags, </s>, denote predicted beginnings and ends of sentences. These tags result from the fact that the ASR language model was trained not only on spoken-language transcripts, but also on written texts that contain punctuation marks. We experimentally determined that, for our data, these predictions are not reliable enough to be used for sentence splitting on their own, and employed a dedicated method for this purpose, as described below. We also remove tags such as %hes, denoting an unspecified speaker hesitation, as well as other tokens denoting hesitation that were transcribed explicitly, such as ah, um or uh.
Abbreviations reformatting: The ASR-produced underscored abbreviation (initialism) format (i_b_m) is replaced with the standard all-caps one (IBM).
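The token-removal and abbreviation-reformatting steps above can be sketched as a single pass over the token stream. The regular expression for initialisms and the exact hesitation inventory are assumptions based on the examples given in the text:

```python
import re

HESITATION_TOKENS = {"%hes", "ah", "um", "uh"}  # hesitation markers named above

def clean_asr_tokens(tokens):
    """Drop silence markers, sentence-boundary tags and hesitations, and
    rewrite underscored initialisms (i_b_m -> IBM)."""
    cleaned = []
    for token in tokens:
        if token in ("~SIL", "<s>", "</s>") or token in HESITATION_TOKENS:
            continue
        if re.fullmatch(r"(?:[a-z]_)+[a-z]", token):  # underscored initialism
            token = token.replace("_", "").upper()
        cleaned.append(token)
    return cleaned
```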
Automatic punctuation and sentence splitting:
The automatically transcribed text contains no punctuation marks. In downstream tasks, such as syntactic parsing, long texts are often difficult to handle, and we consequently split the stream of ASR output into sentences. Unlike typical sentence-splitting methods, whose main goal is to disambiguate between periods that mark end-of-sentence and those denoting abbreviations, here the text contains no periods, hence a different method is required. We employed a bidirectional Long Short-Term Memory (LSTM) network [Hochreiter and Schmidhuber1997] to predict commas and end-of-sentence periods over the ASR output; this is a simplified version of the model of [Pahuja et al.2017]. The network was trained on debate speeches, like the ones we share in this paper, and on TED talks, taken from the English side of the French-English parallel corpus from the IWSLT 2015 machine translation task [Cettolo et al.2012].
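The BiLSTM punctuator itself follows [Pahuja et al.2017]; what can be shown compactly is the sequence-labeling framing it relies on, where each word of punctuated training text is tagged with the mark that should follow it. The label names below are illustrative, not the ones used in the actual system:

```python
def to_punctuation_labels(punctuated_text):
    """Turn punctuated training text into (word, label) pairs, where each
    label is the punctuation mark that follows the word (or NONE)."""
    pairs = []
    for token in punctuated_text.split():
        if token.endswith(","):
            pairs.append((token[:-1], "COMMA"))
        elif token.endswith("."):
            pairs.append((token[:-1], "PERIOD"))
        else:
            pairs.append((token, "NONE"))
    return pairs
```

At inference time the network receives the unpunctuated ASR words, and its predicted PERIOD labels define the sentence boundaries used for splitting.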
Capitalization: We apply basic truecasing to the text: capitalizing sentences’ first letters and occurrences of “I”. We have experimented with more sophisticated truecasing tools, but abstained from applying them to the released texts due to mixed results.
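The basic truecasing step can be sketched as follows; extending the pronoun rule to contractions such as i'm is an assumption beyond the two rules stated above:

```python
def basic_truecase(sentence):
    """Capitalize the sentence-initial letter and the pronoun "i",
    including in contractions such as i'm or i've (an assumption)."""
    words = []
    for word in sentence.split():
        if word == "i" or word.startswith("i'"):
            word = "I" + word[1:]
        words.append(word)
    text = " ".join(words)
    return text[:1].upper() + text[1:]
```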
Example 3 (Clean ASR output)
4 Manual Transcription
As mentioned, the ASR process produces imperfect texts. In order to obtain a “reference” text – a precise transcript of the speech – we employ human transcribers to post-edit the automatic transcript, i.e. correct its mistakes.
Selecting and training the transcribers
We invited 15 candidates to train as transcribers, all of whom are native or fluent English speakers experienced in linguistic annotation tasks. As a first test, we asked them to transcribe the same four speeches, after carefully reading the guidelines. We used their outputs to create ground-truth transcripts: for each speech, we compared its transcripts pair-wise, listened carefully to points of difference, and created a “gold-transcript” that resolved all differences between the individual transcripts. Using these four gold-transcripts, we scored the work of the individual transcribers, and accepted the nine candidates whose transcripts were at least 98% accurate. They were further trained by transcribing ten speeches each and receiving feedback on them upon our review. Once done, we considered them “experienced transcribers”.
In our experience, starting from initial transcripts produced by ASR can halve the time necessary to produce reference transcripts, while maintaining similar transcript quality. This is particularly true if the ASR is highly accurate since it reduces the number of corrections the human transcriber has to make. One should be aware, however, that this procedure can introduce bias, depending on how conscientious the human transcriber is. An inexperienced or less conscientious transcriber may neglect to correct some ASR mistakes.
It is also easier for human transcribers to process shorter segments of speech, especially if they have to listen multiple times to unclear segments. Hence, to speed up the process of human transcription, the audio and transcript are first segmented by cutting them at silences longer than 500ms. Excessively long audio segments are then further divided at their longest silences, which must be at least 100ms long. Note that the resulting segments do not necessarily correspond to linguistic boundaries or to where punctuation marks should be placed. Indeed, in spontaneous speech, a person may pause in the middle of a sentence when faced with an increased cognitive load, e.g. when trying to recall a word. Similar methods of using ASR output as a basis for manual transcription were applied, e.g., by [Park and Zeanah2005] and [Matheson2007], for the purpose of transcribing interviews for interview-based research.
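Assuming word-level timings are available (as in the raw ASR output) and that some maximum segment duration triggers the second-stage splitting (the 20-second limit below is an assumption; the paper does not state the threshold), the two-stage segmentation can be sketched as:

```python
MAJOR_SIL, MINOR_SIL, MAX_DUR = 0.5, 0.1, 20.0  # seconds; MAX_DUR is assumed

def segment(words):
    """Split a list of (token, start, end) tuples at silences longer than
    MAJOR_SIL, then recursively split segments longer than MAX_DUR at
    their longest internal silence of at least MINOR_SIL."""
    def duration(seg):
        return seg[-1][2] - seg[0][1]

    def split_long(seg):
        if duration(seg) <= MAX_DUR or len(seg) < 2:
            return [seg]
        # find the longest internal pause and cut there
        gap, idx = max((seg[i + 1][1] - seg[i][2], i)
                       for i in range(len(seg) - 1))
        if gap < MINOR_SIL:
            return [seg]  # no usable pause; keep the segment whole
        return split_long(seg[:idx + 1]) + split_long(seg[idx + 1:])

    segments, current = [], []
    for word in words:
        if current and word[1] - current[-1][2] > MAJOR_SIL:
            segments.append(current)
            current = []
        current.append(word)
    if current:
        segments.append(current)
    return [piece for seg in segments for piece in split_long(seg)]
```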
The human transcribers used Transcriber (http://trans.sourceforge.net/en/presentation.php; we used version 1.5.1), a tool for assisting manual annotation of speech signals through a graphical user interface. The tool synchronizes the text with the audio, and allows the human transcriber to review the text while listening to the audio, and to easily pause, fix, annotate, and continue listening from a selected segment.
On average, the time needed for manual transcription by experienced transcribers was approximately five times the duration of the audio file. An example of the input to the tool – the output of the above-mentioned segmentation process – is presented in Example 4. The output of the post-editing, which uses the same format, is shown in Example 5.
The guidelines used for manual transcription explain how to deal with cases such as speaker hesitations, repetitions and utterances of incomplete words, what punctuation marks to use, how to write abbreviations, numbers, etc. (The ASR does not produce punctuation marks; it turned out that the transcribers preferred adding them, as doing so made the text more readable. Punctuation also makes the texts more accessible for analysis and annotation, and may be helpful for some automatic processing tasks.) The main principles are that the transcripts should be accurate with respect to the source, capture as much signal as possible, and maintain a uniform format that can be easily parsed in subsequent processing. The transcription guidelines are shared with the released data.
Example 4 (Input for manual transcription)
Example 5 (Output of manual transcription)
4.1 Reference Transcripts
Some of the annotations in the post-edited transcripts are mostly useful for ASR training, as in the case of word mispronunciation and its correction (e.g. “lifes/lives”), while others contain signals that may also be useful for downstream text processing.
Our approach in producing the reference transcripts was to remove all non-textual annotations, producing a text-only version of the transcription that can be used as-is, e.g. for argument extraction. From the Transcriber output, we first removed all SGML tags and merged the lines into a single stream. We then removed incomplete words and replaced mispronounced words with their correct form; similarly to the raw ASR post-processing, we removed annotations and hesitations, reformatted abbreviations and applied basic truecasing. Then, we detokenized the text, i.e. removed any unnecessary spaces between tokens, for example before a punctuation mark. Lastly, we applied automatic spell-checking to detect typos and formatting errors, and sent the flagged instances for review. Example 6 shows the text segment from Example 5 after going through this cleaning.
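The detokenization step, for instance, can be approximated with a few regular-expression rules; the specific contraction list below is an assumption, not the actual rule set used:

```python
import re

def detokenize(tokens):
    """Join tokens into running text, removing the unnecessary spaces
    before punctuation marks and inside common English contractions."""
    text = " ".join(tokens)
    text = re.sub(r"\s+([,.;:?!])", r"\1", text)                    # "boxing ." -> "boxing."
    text = re.sub(r"\s+(n't|'s|'re|'ve|'ll|'m|'d)\b", r"\1", text)  # "should n't" -> "shouldn't"
    return text
```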
Example 6 (Clean reference transcript)
Table 1: File types released for each speech
  asr     | Raw automatic transcripts
  asr.txt | Clean automatic transcripts
  trs.txt | Clean manual transcripts (references)

Table 2: Recordings included in the dataset
  Motion ID | Motion                 | # Speeches | Avg. WER (%)
  1         | Violent video games    | 6          | 7.4
  61        | Doping in sports       | 5          | 7.7
  482       | Cultivation of tobacco | 3          | 8.2
  483       | Freedom of speech      | 5          | 6.7
The dataset was generated through the process described in the previous sections. We release all file types, including raw and clean versions, to enable research based on various signals – including audio-based ones, such as prosody or speech rate – and to allow different kinds of post-processing. Table 1 summarizes the files that are obtained and released for each debatable topic.
The dataset we release includes 60 speeches for 16 motions from [Rinott et al.2015], recorded by 10 different speakers (currently, the list contains only a single female speaker; we are making an effort to recruit more female debaters). Table 2 provides details about the recordings included in the dataset.
There is a large variance in WER across different debate recordings and between different speakers. The WER of any specific debate varies depending on the degree of mismatch with the ASR acoustic and language models. Examples of mismatch include differences in speaker voice, speaking style and rate, audio capture (microphone type and placement), ambient noise, word choice and phrasing, etc. By reducing this mismatch through adaptation of speaker-dependent acoustic models, the WER can be significantly reduced. For instance, with adaptation using about 15 minutes of a speaker’s data, the WER of a speech from topic 61 was reduced from 12.9% to 8.6%, and that of a speech from topic 483 from 12.2% to 9.7%.
The dataset is freely available for research at https://www.research.ibm.com/haifa/dept/vst/mlta_data.shtml.
We wish to thank the many speakers and transcribers who took part in the effort of creating this dataset.
- [Aharoni et al.2014] Aharoni, E., Polnarov, A., Lavee, T., Hershcovich, D., Levy, R., Rinott, R., Gutfreund, D., and Slonim, N. (2014). A benchmark dataset for automatic detection of claims and evidence in the context of controversial topics. In Proceedings of the First Workshop on Argumentation Mining, pages 64–68. Association for Computational Linguistics (ACL).
- [Aker and Zhang2017] Aker, A. and Zhang, H. (2017). Projection of argumentative corpora from source to target languages. In Proceedings of the 4th Workshop on Argument Mining, pages 67–72.
- [Bar-Haim et al.2017] Bar-Haim, R., Edelstein, L., Jochim, C., and Slonim, N. (2017). Improving claim stance classification with lexical knowledge expansion and context utilization. In Proceedings of the 4th Workshop on Argument Mining, pages 32–38.
- [Bilu and Slonim2016] Bilu, Y. and Slonim, N. (2016). Claim synthesis via predicate recycling. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, page 525.
- [Cettolo et al.2012] Cettolo, M., Girardi, C., and Federico, M. (2012). WIT3: Web inventory of transcribed and translated talks. In Proceedings of EAMT. The European Association for Machine Translation (EAMT).
- [Hochreiter and Schmidhuber1997] Hochreiter, S. and Schmidhuber, J. (1997). Long short-term memory. Neural Comput., 9(8):1735–1780.
- [Lippi and Torroni2016] Lippi, M. and Torroni, P. (2016). Argumentation mining: State of the art and emerging trends. ACM Trans. Internet Technology, 16:10:1–10:25.
- [Matheson2007] Matheson, J. L. (2007). The voice transcription technique: Use of voice recognition software to transcribe digital interview data in qualitative research. Qualitative Report, 12(4):547–560.
- [Pahuja et al.2017] Pahuja, V., Laha, A., Mirkin, S., Raykar, V., Kotlerman, L., and Lev, G. (2017). Joint learning of correlated sequence labelling tasks using bidirectional recurrent neural networks. In Proceedings of INTERSPEECH, Stockholm, Sweden.
- [Palau and Moens2009] Palau, R. M. and Moens, M.-F. (2009). Argumentation mining: the detection, classification and structure of arguments in text. In Proceedings of the 12th International Conference on Artificial Intelligence and Law, pages 98–107. ACM.
- [Park and Zeanah2005] Park, J. and Zeanah, A. E. (2005). An evaluation of voice recognition software for use in interview-based research: a research note. Qualitative Research, 5(2):245–251.
- [Rinott et al.2015] Rinott, R., Dankin, L., Perez, C. A., Khapra, M. M., Aharoni, E., and Slonim, N. (2015). Show me your evidence-an automatic method for context dependent evidence detection. In Proceedings of EMNLP, pages 440–450. Association for Computational Linguistics (ACL).
- [Sobhani et al.2015] Sobhani, P., Inkpen, D., and Matwin, S. (2015). From argumentation mining to stance classification. In Proceedings of the 2nd Workshop on Argumentation Mining, ArgMining@HLT-NAACL, pages 67–77.
- [Soltau et al.2013] Soltau, H., Kuo, H., Mangu, L., Saon, G., and Beran, T. (2013). Neural network acoustic models for the DARPA RATS program. In Proceedings of Interspeech. ISCA.
- [Soltau et al.2014] Soltau, H., Saon, G., and Sainath, T. (2014). Joint training of convolutional and non-convolutional neural networks. In Proceedings of ICASSP. IEEE.
- [Wachsmuth et al.2017a] Wachsmuth, H., Naderi, N., Habernal, I., Hou, Y., Hirst, G., Gurevych, I., and Stein, B. (2017a). Argumentation quality assessment: Theory vs. practice. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), volume 2, pages 250–255.
- [Wachsmuth et al.2017b] Wachsmuth, H., Potthast, M., Al Khatib, K., Ajjour, Y., Puschmann, J., Qu, J., Dorsch, J., Morari, V., Bevendorff, J., and Stein, B. (2017b). Building an argument search engine for the web. In Proceedings of the 4th Workshop on Argument Mining, pages 49–59.
- [Walker et al.2012] Walker, M., Tree, J. F., Anand, P., Abbott, R., and King, J. (2012). A corpus for research on deliberation and debate. In Nicoletta Calzolari (Conference Chair), et al., editors, Proceedings of the Eight International Conference on Language Resources and Evaluation (LREC’12), Istanbul, Turkey, may. European Language Resources Association (ELRA).
- [Wei et al.2016] Wei, Z., Liu, Y., and Li, Y. (2016). Is this post persuasive? ranking argumentative comments in the online forum. In The 54th Annual Meeting of the Association for Computational Linguistics, page 195.
- [Zhang et al.2016] Zhang, J., Kumar, R., Ravi, S., and Danescu-Niculescu-Mizil, C. (2016). Conversational flow in Oxford-style debates. In Proceedings of HLT-NAACL.
- [Zhang et al.2017] Zhang, F., B. Hashemi, H., Hwa, R., and Litman, D. (2017). A corpus of annotated revisions for studying argumentative writing. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1568–1578, Vancouver, Canada, July. Association for Computational Linguistics.