GupShup: An Annotated Corpus for Abstractive Summarization of Open-Domain Code-Switched Conversations

04/17/2021
by Laiba Mehnaz, et al.

Code-switching is the communication phenomenon in which speakers switch between different languages during a conversation. With the widespread adoption of conversational agents and chat platforms, code-switching has become an integral part of written conversations in many multi-lingual communities worldwide. This makes it essential to develop techniques for summarizing and understanding these conversations. Towards this objective, we introduce abstractive summarization of Hindi-English code-switched conversations and develop the first code-switched conversation summarization dataset, GupShup, which contains over 6,831 conversations in Hindi-English and their corresponding human-annotated summaries in English and Hindi-English. We present a detailed account of the entire data collection and annotation process. We analyze the dataset using various code-switching statistics. We train state-of-the-art abstractive summarization models and report their performance using both automated metrics and human evaluation. Our results show that the multi-lingual mBART and multi-view seq2seq models obtain the best performance on the new dataset.
