Identifying Semantically Duplicate Questions Using Data Science Approach: A Quora Case Study

04/18/2020
by Navedanjum Ansari, et al.

Identifying semantically identical questions on question-and-answer social media platforms like Quora is essential to ensuring that content of appropriate quality and quantity is presented to users based on the intent of the question, thus enriching the overall user experience. Detecting duplicate questions is a challenging problem because natural language is very expressive, and a single intent can be conveyed using different words, phrases, and sentence structures. Machine learning and deep learning methods are known to have achieved better results than traditional natural language processing techniques in identifying similar texts. In this paper, taking Quora as our case study, we explored and applied different machine learning and deep learning techniques to the task of identifying duplicate questions in Quora's dataset. By using feature engineering, feature importance techniques, and experiments with seven selected machine learning classifiers, we demonstrated that our models outperformed previous studies on this task. An XGBoost model with character-level term frequency-inverse document frequency (TF-IDF) features is our best machine learning model, and it also outperformed a few of the deep learning baseline models. We applied deep learning techniques to build four different multi-layer deep neural networks composed of GloVe embeddings, Long Short-Term Memory (LSTM), convolution, max pooling, dense, batch normalization, and activation layers, together with model merging. Our deep learning models achieved better accuracy than our machine learning models. Three of the four proposed architectures outperformed the accuracy reported in previous machine learning and deep learning research, two of the four outperformed the accuracy of a previous deep learning study on Quora's question pair dataset, and our best model achieved accuracy of 85.82
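To make the character-level term-frequency weighting mentioned in the abstract more concrete, here is a minimal sketch that scores a question pair by the cosine similarity of character trigram vectors. It is an illustration under stated assumptions, not the authors' pipeline: the function names, the trigram size, the smoothed IDF formula, and the three sample questions are all hypothetical choices made for this example.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Return overlapping character n-grams of a lowercased, space-padded string."""
    padded = f" {text.lower().strip()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def fit_idf(corpus, n=3):
    """Compute a smoothed inverse document frequency for every n-gram in the corpus."""
    doc_freq = Counter()
    for doc in corpus:
        doc_freq.update(set(char_ngrams(doc, n)))  # count each n-gram once per document
    total = len(corpus)
    # Assumed smoothing: ln((1 + N) / (1 + df)) + 1, so no weight is ever zero.
    return {g: math.log((1 + total) / (1 + df)) + 1.0 for g, df in doc_freq.items()}


def tfidf_vector(text, idf, n=3):
    """Build an L2-normalised sparse vector of term frequency times IDF."""
    counts = Counter(char_ngrams(text, n))
    # n-grams that were not seen when fitting the IDF table are dropped.
    vec = {g: c * idf[g] for g, c in counts.items() if g in idf}
    norm = math.sqrt(sum(v * v for v in vec.values()))
    return {g: v / norm for g, v in vec.items()} if norm else {}


def cosine(u, v):
    """Cosine similarity of two L2-normalised sparse vectors (a plain dot product)."""
    if len(u) > len(v):
        u, v = v, u
    return sum(w * v.get(g, 0.0) for g, w in u.items())


# Hypothetical sample questions, not taken from the Quora dataset.
q1 = "How can I learn Python quickly?"
q2 = "What is the quickest way to learn Python?"
q3 = "Why is the sky blue?"

idf = fit_idf([q1, q2, q3])
v1, v2, v3 = (tfidf_vector(q, idf) for q in (q1, q2, q3))

sim_paraphrase = cosine(v1, v2)  # pair with the same intent
sim_unrelated = cosine(v1, v3)   # pair with different intents
print(f"paraphrase pair: {sim_paraphrase:.3f}, unrelated pair: {sim_unrelated:.3f}")
```

In a classification setting, a similarity score like this (or the sparse vectors themselves) would be one input feature among others for a classifier such as a gradient-boosted tree model. Character n-grams are a common choice here because they tolerate spelling variants and inflections ("quickly" vs. "quickest") that word-level tokens would treat as unrelated.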
