Community question answering (CQA) websites (e.g., Stack Overflow, GitHub) have become popular sources of immediate, brief answers to a given question. Software developers, architects and data scientists visit the relevant forums and websites on a day-to-day basis to consult technical content. In addition, they often adapt code snippets from the CQA websites to solve their own use cases. Hence, maintaining high-quality answers on those community websites is imperative for their continued relevance in the developers' community. A common scenario in community forums is that a given question has more than one answer [27, 15]. However, out of all the available answers, only a few are worthwhile in terms of technical quality and usefulness. Finding those quality answers manually is time-consuming, and typically requires a community support engineer (domain expert) to read the answers and record the best one (under the criteria of clarity, technical content, and structure). In addition, a standardized definition of a "high-quality" answer on CQA websites does not exist. Thus, a system that can model high-quality answers based on their technical content, without requiring an explicit definition of quality, is greatly desired in order to circumvent these challenges.
However, there is relatively little research on understanding what constitutes a good-quality question-answer pair for websites like Stack Overflow or Microsoft Developer Network (MSDN), compared to generic CQA websites like Quora or Yahoo!. One reason is the highly technical nature of the content in those forums. It may be impractical to predict question-answer quality based on language semantics alone. Hence, incorporating technical semantics and content has the potential to improve answer quality modeling for those forums.
The CQA websites hold a treasure of technical content: a database of useful technical questions and corresponding answers on various topics, which can be exploited to further improve their functionality. From a machine learning perspective, the language-intensive question-answering datasets of CQA websites are rich resources for automated Q&A modeling. Specifically, recent developments in natural language processing (NLP) on learning context-based representations hold the promise to advance the field of automated question-answer quality models. One may ask what a good-quality Q&A means: the answers accepted in the CQA forums are likely to point towards ground truths that an automated question-answer model should be able to exploit. Another point of interest is how practical the models are in a real-world production scenario. For example, if the latency during inference is more than 500 ms, the model is unlikely to produce any tangible benefit in practical applications.
In this vein, the work presented in this paper makes substantial progress along both of these dimensions of Q&A quality modeling.
First, the paper investigates whether advances in general-purpose NLP techniques bring tangible benefits to technical content modeling. The whole question-answer pair is used to predict quality, rather than modeling questions and answers separately. Transfer learning [21, 30] is applied, starting from an already pre-trained Bidirectional Encoder Representations from Transformers (BERT) model. The model is then further pre-trained and finetuned in the community support space. We call the final model support-BERT.
Second, the paper shows that a deep neural network Q&A quality model with reasonably good performance can also be served within the specified latency. The measured performance indicates that our model can be deployed in a real-time scenario in the Azure knowledge base system.
The related pre-training and finetuning code is available from https://github.com/Microsoft/AzureML-BERT.
2 Related Work and Contribution
Modeling the Q&A quality in community question answering websites is not new. A number of studies have used different research questions for solving the Q&A modeling problem.
2.1 Predict Good Quality Questions
Predicting the difficulty of questions has been studied using the theory of formal languages to assign a difficulty level to technical questions from Stack Overflow. Tian et al. proposed to address quality modeling by finding the best expert users to whom questions can be directed for answers. In addition, modeling the quality of questions and answers on CQA websites has been well studied. The quality of questions on Stack Overflow (based on the number of views and the number of upvotes a question has garnered) has been modeled using a topic modeling framework. A recommendation system has been used to find similar questions in a database. A semi-supervised coupled mutual reinforcement framework was proposed for simultaneously calculating content quality and user reputation. A number of quality metrics have been studied for finding high-quality questions and content. A whole question answering scheme using meta-features, e.g., reputations of co-answerers, relationships between reputation and answer speed, and the probability of an answer being chosen as the best one, has also been studied. In contrast to these models, support-BERT only takes the questions and answers as input, and models their quality as a pair.
2.2 Predict Good Quality Answers
There has been substantial research on understanding high-quality answers for general-purpose question answering websites like Quora or Yahoo!. Some previous works have focused extensively on understanding question quality. Quality answer prediction has also been studied using web redundancy information. In addition, classical NLP techniques such as textual entailment, syntactic features and non-textual features have been used to predict answer quality. An ensemble of features has also been tried for answer quality. The application of deep learning to answer quality modeling is also not new: attentive neural networks have been applied to answer selection from community websites. Previous studies have also shown that question quality can have a significant impact on the quality of answers received. High-quality questions can also drive the overall development of the community by attracting more users and fostering knowledge exchange that leads to efficient problem solving. There has also been work on discovering expert users in CQA sites, which has mainly focused on modeling expert answerers [29, 26]. Work on discovering expert users was often positioned in the context of routing questions to appropriate answerers [13, 12, 38]. Our model takes a question-answer pair together and outputs the quality ("accepted" or "unaccepted") without any other meta-features. In addition, the model is structured in a transfer learning framework.
We specifically test the following three hypotheses in this paper:
A general-purpose language modeling framework (one that uses only the language semantics of the Q&As themselves) can be trained to model the quality of questions and answers.
Incorporating the technical semantics of Q&A structures can model the quality better.
Although deep learning models may be more accurate at modeling Q&A quality, their huge number of parameters makes them inefficient to deploy for online (real-time) question-answer quality checks.
The contributions of this paper are as follows:
A state-of-the-art natural language model is adopted for modeling technical question-answer pairs from MSDN. To the best of our knowledge, support-BERT is the first domain-specific BERT pre-trained on MSDN community data to transfer technical semantics from programming community corpora.
The original BERT-medium architecture is utilized for training on the community dataset. We find that transfer learning works surprisingly well in modeling question-answer quality for the language intensive community websites like MSDN.
The model was deployed on Azure Kubernetes Service under sub-second latency, i.e., the inference engine is real-time.
The comparison of the proposed model with respect to traditional machine learning based context-naive answer quality model is demonstrated in Fig. 1.
The community Q&A dataset used in this paper is taken from the Microsoft Developer Network (https://msdn.microsoft.com/en-us/). The dataset contains a number of meta-features, e.g., number of upvotes, reputation of answerers, title of questions, topics that the question-answer pair belongs to, etc. It consists of almost 300,000 question-answer pairs, of which 75,000 are accepted and 225,000 are not accepted. Note, however, that our work uses only the text of the questions and answers, without any meta-features. The data was minimally preprocessed to remove stopwords, pronouns and participles.
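The experiments reported later evaluate on test sets with accepted/unaccepted ratios of 1:1 and 1:3. A minimal sketch of how such subsets could be drawn from labelled pairs is given below; the `(text, label)` tuple layout, the function name `sample_by_ratio` and the fixed seed are illustrative assumptions, not the sampling code used in the paper.

```python
import random

def sample_by_ratio(pairs, n_accepted, n_unaccepted, seed=0):
    """Draw a subset with a fixed number of accepted (label 1) and
    unaccepted (label 0) question-answer pairs."""
    rng = random.Random(seed)
    accepted = [p for p in pairs if p[1] == 1]
    unaccepted = [p for p in pairs if p[1] == 0]
    if n_accepted > len(accepted) or n_unaccepted > len(unaccepted):
        raise ValueError("not enough examples for the requested class counts")
    subset = rng.sample(accepted, n_accepted) + rng.sample(unaccepted, n_unaccepted)
    rng.shuffle(subset)
    return subset

# Toy corpus with the same 1:3 class balance as the full dataset (75K : 225K).
toy = [(f"qa-{i}", 1 if i % 4 == 0 else 0) for i in range(400)]
balanced = sample_by_ratio(toy, 50, 50)   # 1:1 scenario
skewed = sample_by_ratio(toy, 25, 75)     # 1:3 scenario
```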
In this paper, we use transfer learning to model good-quality question-answer pairs from MSDN. Specifically, the proposed model described below falls within the framework of inductive self-taught learning. In the natural language processing domain, bidirectional transformers have recently attracted a lot of attention for their expressiveness and performance on common natural language processing tasks. Transformers have been shown to be effective in many supervised learning tasks, where they are pre-trained on different tasks and the learned weights are transferred for finetuning. Motivated by their wide adoption and state-of-the-art performance, we wanted to test whether Bidirectional Encoder Representations from Transformers (BERT) models can model question-answer pair quality in the community space. Two versions of the BERT experiments are carried out: 1) finetuning of the already pre-trained model, and 2) pre-training followed by finetuning, starting from the initial checkpoint. In addition, we experiment with adding vocabulary specific to the MSDN community space and measure its effect on accuracy. The dataset used in the experiments is taken from a publicly available source, namely the Microsoft Developer Network (MSDN). The reason for choosing MSDN is the availability of rich text-based technical questions and answers. The methods are also compared with the baseline NLP word representation techniques TF-IDF and Word2Vec. The best model from the above experiments was chosen for deployment. Moreover, the model is deployed on Azure Kubernetes Service (AKS) as a stand-alone implementation.
4.1 BERT: Bidirectional Encoder Representation from Transformers
Using context-based word representations for solving natural language processing tasks, e.g., machine translation, question answering and sentence completion, has gained popularity in the last 10 years. Word representations have been obtained either by training on specific NLP tasks (e.g., language modeling), where the representations were byproducts of the task, or by directly optimizing word representations based on various hypotheses [19, 22]. However, previous studies have mainly relied on word representations that are context-naive. More recently, NLP research has argued for learning context-dependent representations. As an example, ELMo uses a bidirectional recurrent neural network language model and achieved strong performance on a number of language tasks. CoVe, on the other hand, makes use of machine translation to project words into the same embedding space based on context information. The current state of the art in machine translation and multi-task learning for NLP relies solely on attention-based neural networks such as transformers. BERT is one such model: it learns contextualized word representations by pre-training on masked language modeling as well as next sentence prediction. Note that previously, because language models could not condition on future words when modeling a word in context, bidirectional context-specific models were a combination of left-to-right and right-to-left RNN models. To alleviate the extensive computation required by bidirectional RNN models, BERT uses masked word prediction as a pre-training task, thus removing the constraints of using RNNs. In addition, the model incorporates the next sentence prediction task, which encodes context-dependent representations for words even in a Q&A framework. These training criteria, applied to a large text corpus (Wikipedia and BooksCorpus), make the BERT model well suited to strong performance on a range of natural language processing tasks.
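The masked word prediction task can be illustrated with a short sketch. It follows the masking scheme described in the original BERT paper (15% of tokens selected; of those, 80% replaced by `[MASK]`, 10% by a random token, 10% left unchanged). These rates are not restated in this paper, so treating them as the settings used for support-BERT is an assumption, and the function name and toy vocabulary are illustrative.

```python
import random

MASK = "[MASK]"
SPECIAL = ("[CLS]", "[SEP]")

def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """Return (masked_tokens, labels); labels[i] holds the original token at
    positions the model must predict and None everywhere else."""
    rng = random.Random(seed)
    masked = list(tokens)
    labels = [None] * len(tokens)
    for i, tok in enumerate(tokens):
        # Special tokens are never selected; others are selected with mask_prob.
        if tok in SPECIAL or rng.random() >= mask_prob:
            continue
        labels[i] = tok
        roll = rng.random()
        if roll < 0.8:
            masked[i] = MASK               # 80%: replace with the mask token
        elif roll < 0.9:
            masked[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: keep the original token unchanged
    return masked, labels

tokens = ["[CLS]", "how", "to", "install", "nuget", "[SEP]",
          "run", "the", "installer", "[SEP]"]
toy_vocab = ["install", "nuget", "run", "error", "build"]
masked, labels = mask_tokens(tokens, toy_vocab, mask_prob=0.5, seed=1)
```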
4.2 BERT as Feature Extractor
The pre-trained BERT model can be used in a transfer learning setting to extract features in a new domain. In this scenario, Q&As from the community support data are transformed into fixed-dimensional vectors using the first few layers of the pre-trained transformer model. A softmax layer is then trained and tested on the extracted features to model Q&A quality.
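The softmax layer on top of the extracted features can be sketched in a few lines. The example below is a minimal forward pass and loss for a two-class head over a fixed-dimensional feature vector; the toy three-dimensional features and weights are made up for illustration, and how the paper pools the transformer outputs into a vector is not specified, so that step is left out.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def classify(features, weights, biases):
    """features: list of d floats; weights: one list of d floats per class.
    Returns class probabilities [P(unaccepted), P(accepted)]."""
    logits = [sum(w * x for w, x in zip(row, features)) + b
              for row, b in zip(weights, biases)]
    return softmax(logits)

def cross_entropy(probs, label):
    """Negative log-likelihood of the true class index."""
    return -math.log(probs[label])

# Toy example: a 3-dimensional feature vector and hand-picked weights.
features = [0.5, -1.0, 2.0]
weights = [[0.0, 0.0, 0.0],   # class 0: unaccepted
           [1.0, 0.0, 0.5]]   # class 1: accepted
biases = [0.0, 0.0]
probs = classify(features, weights, biases)
loss = cross_entropy(probs, label=1)
```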
4.3 Finetuning Support-BERT
In this experiment, the pre-trained BERT model available from TensorFlow Hub is finetuned without any further pre-training on the community support domain. A softmax layer is appended to the BERT encoder, and the model is finetuned for 3 epochs on the Q&A task using the MSDN dataset.
4.4 Pre-Training Support-BERT
Pre-training a BERT model from scratch is a very slow process. The MSDN dataset, containing technical questions and answers, has around 300K question-answer pairs. In order to fully leverage both technical and language semantics, we started the pre-training from the checkpoint available from the original BERT model. The network was then trained for another 100K iterations on the MSDN questions and answers data. In this scenario, masked language modeling was used. Model checkpoints were saved every 20K iterations, from 20K to 100K. The pre-trained network is then further finetuned on the Q&A task using the MSDN dataset.
5 Results and Discussion
This section describes the results of running the experiments on support-BERT with different configurations. We illustrate and tabulate important results. In addition, we also discuss key observations on the results.
5.1 BERT as Feature Extractor
Using BERT as a generic feature extractor did not significantly improve the identification of Q&A quality. With a 50K/50K training set and a 50K/50K test set (accepted/unaccepted), the maximum accuracy achieved on the test set was 0.5340. With a 50K/50K training set and a 25K/75K test set, the maximum accuracy achieved on the test set was 0.5890.
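Because the two test sets have different class ratios, their accuracies are not directly comparable. A minimal sketch of the usual metrics is given below, together with the accuracy obtained by always predicting the majority class, which serves as a reference point for each ratio. The function names are illustrative and not taken from the paper's code.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision and recall with 'accepted' (1) as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }

def majority_baseline_accuracy(n_accepted, n_unaccepted):
    """Accuracy of a classifier that always predicts the larger class."""
    return max(n_accepted, n_unaccepted) / (n_accepted + n_unaccepted)

toy_metrics = binary_metrics([1, 1, 0, 0], [1, 0, 0, 1])
reference_1_to_1 = majority_baseline_accuracy(50_000, 50_000)
reference_1_to_3 = majority_baseline_accuracy(25_000, 75_000)
```

For the 50K/50K and 25K/75K test sets, the majority-class reference accuracies are 0.50 and 0.75, respectively.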
5.2 Improvement Using Finetuning Support-BERT
Starting from the checkpoint of the BERT model, support-BERT was finetuned for 3 epochs. The finetuning was carried out in a supervised learning framework, using the next sentence prediction formulation. Finetuning for 3 epochs took 3 hours on our machine. There is a visible improvement in model performance on the prediction task, as shown in Table I. In the 1:1 test scenario, accuracy increases to 0.6966. In the more realistic 1:3 test scenario, accuracy increases to 0.7228.
| Number of accepted/unaccepted | Number of accepted/unaccepted |
5.3 Comparison with Baseline Answer Quality Model
The proposed support-BERT model with transfer learning was compared with two baseline models built on context-naive language features, namely TF-IDF and Word2Vec. Both baselines performed poorly compared with support-BERT on Q&A quality prediction. The results are shown in Fig. 3.
5.4 Adding Domain Specific Words
To test whether the performance of support-BERT is hindered by the absence of MSDN domain-specific words, we added the top 200 TF-IDF words from the MSDN corpora to the BERT vocabulary. The model was then further finetuned using the extended vocabulary. However, the performance did not improve in this experiment: the accuracy, precision and recall were 0.6865, 0.6957 and 0.6650, respectively. The distribution of top words in the MSDN corpora is shown in Fig. 4.
5.5 Experiment Regarding Accuracy Drop vs. Number of Layers
Support-BERT has 12 transformer layers, which translates to roughly 110M parameters. In a real-time deployment setting using AzureML, the model may take a long time to infer the quality of Q&As. To understand the behavior of support-BERT with respect to the number of layers used, we removed the trained layers one at a time, starting from the last hidden layer. This experiment was carried out on the finetuned support-BERT described in Sec. 4.3. The result is illustrated in Fig. 5. The results demonstrate that removing one layer causes a drastic drop in accuracy of almost 9%. After that, the accuracy drop is smaller (about 2% per layer).
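The figure of roughly 110M parameters can be reproduced with a short calculation. The sketch below assumes the hyperparameters of the publicly released 12-layer BERT-base uncased checkpoint (hidden size 768, feed-forward size 3072, a vocabulary of 30,522 WordPiece tokens, 512 positions, 2 segment types); the paper does not list these values, so they are assumptions. It counts the encoder and pooler only, not any task-specific head.

```python
def bert_parameter_count(num_layers, hidden=768, intermediate=3072,
                         vocab_size=30522, max_positions=512, type_vocab=2):
    """Count weights and biases of a BERT encoder with `num_layers` layers."""
    layer_norm = 2 * hidden  # scale and shift vectors
    # Token, position and segment embedding tables plus one LayerNorm.
    embeddings = (vocab_size + max_positions + type_vocab) * hidden + layer_norm
    # Query, key, value and output projections (with biases) plus LayerNorm.
    attention = 4 * (hidden * hidden + hidden) + layer_norm
    # Two dense layers of the feed-forward block (with biases) plus LayerNorm.
    feed_forward = ((hidden * intermediate + intermediate)
                    + (intermediate * hidden + hidden) + layer_norm)
    pooler = hidden * hidden + hidden
    return embeddings + num_layers * (attention + feed_forward) + pooler

full_model = bert_parameter_count(12)
per_layer = bert_parameter_count(12) - bert_parameter_count(11)
```

Under these assumptions the 12-layer model has about 109.5M parameters, and each transformer layer accounts for about 7.1M of them (roughly 6.5%), while the embedding tables are unaffected by layer removal.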
However, the inference time does not change much when layers are removed. To investigate how inference time depends on the number of support-BERT layers, we designed two experiments. The first experiment, with a test set containing 5K samples, measures the total time taken to infer the quality for the whole batch, as shown in Fig. 6(a). This includes retrieving the stored model from disk, initializing the network graph, and inference. As the test set is large enough, the effect of initialization per sample is very small. The inference time does not drop much with fewer layers in the network.
The second experiment involves testing with fewer samples (300 samples in the test set). The result is shown in Fig. 6(b). In this scenario, the effect of initialization during inference is very prominent. Including the initialization time, the per-sample inference time is almost 300 ms. However, if we exclude the initialization time, the per-sample inference time is similar to that of the previous experiment.
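The difference between the two experiments comes down to how a one-off initialization cost is spread over the batch. A minimal sketch of that arithmetic follows; the initialization and per-sample times are hypothetical values chosen for illustration, not measurements from the paper.

```python
def per_sample_latency_ms(init_ms, inference_ms_per_sample, n_samples):
    """Average wall-clock time per sample when a one-off initialization
    cost is amortized over a batch of n_samples."""
    if n_samples <= 0:
        raise ValueError("n_samples must be positive")
    return init_ms / n_samples + inference_ms_per_sample

# Hypothetical numbers for illustration only.
INIT_MS = 10_000.0      # model loading and graph initialization
PER_SAMPLE_MS = 50.0    # pure inference time per sample

small_batch = per_sample_latency_ms(INIT_MS, PER_SAMPLE_MS, 300)
large_batch = per_sample_latency_ms(INIT_MS, PER_SAMPLE_MS, 5000)
```

With the same per-sample inference time, the small batch shows a much higher average because each sample carries a larger share of the initialization cost.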
| Number of accepted/unaccepted | Number of accepted/unaccepted |
5.6 Improvement Using Pre-training and Finetuning Support-BERT
Starting from the checkpoint of the BERT model trained on Wikipedia in an unsupervised setting, support-BERT was pre-trained on the MSDN community support data for 10 epochs. During pre-training, both the masked language modeling and next sentence prediction frameworks were used. Note that during pre-training, next sentence prediction uses sentences defined as the words between two full stops. The pre-training for 10 epochs took 48 hours on our machine. Finetuning the model using Q&As drastically improved the performance. In this case, the next sentence prediction model uses the question as a paragraph (possibly containing more than one actual sentence) as the first sentence, and the answer as a paragraph (possibly containing more than one actual sentence) as the next sentence. The results are tabulated in Table II. For the test set with an accepted-to-unaccepted ratio of 1:1, the model was able to identify quality Q&As 82% of the time, whereas for the test set with an accepted-to-unaccepted ratio of 1:3, the accuracy was 0.7741.
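Packing a question and an answer as the two "sentences" of a BERT input can be sketched as follows. The special tokens and segment IDs follow the standard BERT input format; the maximum length of 512, the longest-first truncation rule and the function name are assumptions, since the paper does not state how long pairs were handled.

```python
def build_pair_input(question_tokens, answer_tokens, max_len=512):
    """Pack a question (segment 0) and an answer (segment 1) into one
    BERT input sequence with [CLS] and [SEP] markers."""
    budget = max_len - 3  # room for [CLS] and two [SEP] tokens
    q, a = list(question_tokens), list(answer_tokens)
    # Trim the longer segment one token at a time until the pair fits.
    while len(q) + len(a) > budget:
        if len(q) > len(a):
            q.pop()
        else:
            a.pop()
    tokens = ["[CLS]"] + q + ["[SEP]"] + a + ["[SEP]"]
    segment_ids = [0] * (len(q) + 2) + [1] * (len(a) + 1)
    return tokens, segment_ids

pair_tokens, pair_segments = build_pair_input(
    ["how", "to", "install"], ["run", "setup"])
```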
5.7 Deployment of Support-BERT
In-house Azure Machine Learning (AzureML) services were used for the model deployment process. AzureML is a cloud-based environment that can be used to train, deploy, automate, manage, and track ML models, and that interoperates with popular open-source tools such as PyTorch, TensorFlow, and scikit-learn. During our training process, the TensorFlow API for AzureML was used. Support-BERT was trained on an Azure NC-6 Virtual Machine (VM) with 1 NVIDIA Tesla K80 GPU, 6 vCPUs, 56 GB of memory and 12 GB of GPU memory. The GPUs available in Azure NC VMs are given in Table III. We expect that the "final" deployment will be done on a more advanced GPU, where inference is likely to be much faster.
The winning model, after MSDN domain pre-training and Q&A-specific finetuning, was deployed as Azure Container Instances (ACI) on Azure Kubernetes Service (AKS). We briefly describe the deployment process, following https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-deploy-inferencing-gpus. An Azure Machine Learning workspace was created, along with a Python development environment with the Azure Machine Learning SDK installed. The trained model was registered to the workspace. A registered model is a logical container for one or more files that make up the model. For example, a model that is stored in multiple files can be registered as a single model in the workspace. After registration, the model can be downloaded or deployed, and all the files that were registered are retrieved. An Azure Kubernetes cluster with a GPU instance (NC_6 GPU VM) was created for the real-time deployment. The deployment itself followed the procedure given in the Azure documentation cited above.
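AzureML real-time deployments are driven by an entry script that exposes an `init()` function, run once when the container starts, and a `run()` function, run for every request. A minimal sketch of such a script is given below. The `StubQualityModel` class, the `question`/`answer` request fields and the 0.5 decision threshold are hypothetical placeholders so that the sketch runs without model weights; they are not the paper's scoring code.

```python
import json

model = None  # populated once by init()

class StubQualityModel:
    """Hypothetical stand-in for the finetuned support-BERT model."""

    def predict(self, question, answer):
        # Placeholder rule: a real model would return the softmax
        # probability of the "accepted" class for the pair.
        return 0.9 if answer.strip() else 0.1

def init():
    """Called once at service start-up; load the model here so that the
    loading cost is not paid on every request."""
    global model
    model = StubQualityModel()

def run(raw_data):
    """Called per request with the JSON request body as a string."""
    try:
        payload = json.loads(raw_data)
        score = model.predict(payload["question"], payload["answer"])
        label = "accepted" if score >= 0.5 else "unaccepted"
        return {"label": label, "score": score}
    except Exception as exc:
        # Return the error message instead of failing the request.
        return {"error": str(exc)}

init()
response = run(json.dumps({"question": "How do I restore packages?",
                           "answer": "Run the restore command."}))
```

Keeping model loading in `init()` is what separates the one-off initialization cost from the per-request inference time.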
After deployment, the model's performance on the test data was checked for any degradation. The latency of new sample queries was then measured over multiple instances, and the average latency was found to be 110 ms. A sample question and answer from one run of inference on the AzureML deployment is shown in Fig. 7.
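A latency check of this kind can be sketched with a small timing harness. The example below times repeated calls to a scoring function and compares the mean against the 500 ms budget mentioned in the introduction. The `score_fn` callable stands in for a call to the deployed service; it and the function names are assumptions, not part of the paper's code.

```python
import statistics
import time

def measure_latency_ms(score_fn, payloads, warmup=1):
    """Time each call to score_fn and return summary statistics in ms."""
    for payload in payloads[:warmup]:
        score_fn(payload)  # warm-up calls, excluded from the timings
    timings = []
    for payload in payloads:
        start = time.perf_counter()
        score_fn(payload)
        timings.append((time.perf_counter() - start) * 1000.0)
    return {
        "mean_ms": statistics.mean(timings),
        "max_ms": max(timings),
        "n": len(timings),
    }

def within_budget(stats, budget_ms=500.0):
    """True if the mean latency meets the real-time budget."""
    return stats["mean_ms"] <= budget_ms

# Dummy scoring function used only to exercise the harness.
stats = measure_latency_ms(lambda payload: len(payload), ["a", "bb", "ccc"])
```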
| Size | vCPU | Memory: GiB | Temp storage (SSD) GiB | GPU | GPU memory: GiB | Max data disks | Max NICs |
In this brief paper, we presented a successful application of the BERT model in the CQA support space for modeling good-quality questions and answers. After domain-specific pre-training and finetuning, the proposed support-BERT model is an excellent candidate for fast automated assessment of Q&A quality when a new answer is proposed for a given question. We show that although the quality of community-based Q&As is not well defined, it is possible to "mimic" expert-validated rules for quality control. In addition, the model proposed in this paper runs in real time, thus expediting the path from data analysis to machine learning model implementation in a traditional data science pipeline. Future work will be directed towards validating the models on other CQA websites such as Stack Overflow and GitHub. In addition, distilling the model into simpler models for inference on a CPU is also of interest. The current finetuned support-BERT model is being evaluated in integration with the Azure knowledge base initiative (providing high-quality, relevant answers to questions) to enable support engineers to be more efficient.
- (2008) Finding high-quality content in social media. In Proceedings of the 2008 International Conference on Web Search and Data Mining, pp. 183–194.
- (2012) Discovering value from community activity on focused question answering sites: a case study of stack overflow. In Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 850–858.
- (2014) Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.
- (2015) Predicting the quality of questions on stackoverflow. In Proceedings of the International Conference Recent Advances in Natural Language Processing, pp. 32–40.
- (2003) A neural probabilistic language model. Journal of Machine Learning Research 3 (Feb), pp. 1137–1155.
- (2009) Learning to recognize reliable users and content in social media with coupled mutual reinforcement. In Proceedings of the 18th International Conference on World Wide Web, pp. 51–60.
- (2018) BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
- (2014) Word2vec explained: deriving Mikolov et al.'s negative-sampling word-embedding method. arXiv preprint arXiv:1402.3722.
- Using syntactic features in answer reranking. In Workshops at the Twenty-Eighth AAAI Conference on Artificial Intelligence.
- (2012) Modeling problem difficulty and expertise in stackoverflow. In Proceedings of the ACM 2012 Conference on Computer Supported Cooperative Work Companion, pp. 91–94.
- (2005) Finding similar questions in large question and answer archives. In Proceedings of the 14th ACM International Conference on Information and Knowledge Management, pp. 84–90.
- (2011) Question routing in community question answering: putting category in its place. In Proceedings of the 20th ACM International Conference on Information and Knowledge Management, pp. 2041–2044.
- (2010) Routing questions to appropriate answerers in community question answering services. In Proceedings of the 19th ACM International Conference on Information and Knowledge Management, pp. 1585–1588.
- (2011) Improving question recommendation by exploiting information need. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies-Volume 1, pp. 1425–1434.
- (2008) Understanding and summarizing answers in community-based question answering services. In Proceedings of the 22nd International Conference on Computational Linguistics-Volume 1, pp. 497–504.
- Deploy a deep learning model for inference with GPU. https://docs.microsoft.com/en-us/azure/machine-learning/how-to-deploy-inferencing-gpus. Accessed: 2020-02-13.
- (2002) Is it the right answer?: exploiting web redundancy for answer validation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, pp. 425–432.
- (2017) Learned in translation: contextualized word vectors. In Advances in Neural Information Processing Systems, pp. 6294–6305.
- (2013) Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pp. 3111–3119.
- (2012) Evolution of experts in question answering communities. In Sixth International AAAI Conference on Weblogs and Social Media.
- (2009) A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering 22 (10), pp. 1345–1359.
- (2014) Glove: global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543.
- (2018) Deep contextualized word representations. arXiv preprint arXiv:1802.05365.
- (2003) Using tf-idf to determine word relevance in document queries. In Proceedings of the First Instructional Conference on Machine Learning, Vol. 242, pp. 133–142.
- (2014) Great question! question quality in community Q&A. In Eighth International AAAI Conference on Weblogs and Social Media.
- (2012) Finding expert users in community question answering. In Proceedings of the 21st International Conference on World Wide Web, pp. 791–798.
- (2010) Evaluating and predicting answer quality in community QA. In Proceedings of the 33rd International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 411–418.
- (2015) Question/answer matching for CQA system via combining lexical and sequential information. In Twenty-Ninth AAAI Conference on Artificial Intelligence.
- (2013) Booming up the long tails: discovering potentially contributive users in community-based question answering services. In Seventh International AAAI Conference on Weblogs and Social Media.
- (2018) A survey on deep transfer learning. In International Conference on Artificial Neural Networks, pp. 270–279.
- (2013) Towards predicting the best answers in community-based question-answering services. In Seventh International AAAI Conference on Weblogs and Social Media.
- (2013) Predicting best answerers for new questions: An approach leveraging topic modeling and collaborative voting. In International Conference on Social Informatics, pp. 55–68.
- (2015) JAIST: Combining multiple features for answer selection in community question answering. In Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015), pp. 215–219.
- (2013) Stackoverflow and github: associations between software development and crowdsourced knowledge. In 2013 IEEE International Conference on Social Computing, pp. 188–195.
- (2017) Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998–6008.
- (2007) Recognizing textual entailment using a subsequence kernel method.
- (2017) Attentive interactive neural networks for answer selection in community question answering. In Thirty-First AAAI Conference on Artificial Intelligence.
- (2012) A classification-based approach to question routing in community question answering. In Proceedings of the 21st International Conference on World Wide Web, pp. 783–790.