Functionality is a fundamental concern for customers when they decide to buy a new product. From the customers’ perspective, before purchasing a product it is natural to ask what the to-be-purchased product can and cannot do. From the sellers’ perspective, selling fully functional products can increase sales, whereas selling products with missing functions can lead to catastrophic customer dissatisfaction. From the manufacturers’ perspective, missing functions reported by customers can help improve their products. In marketing, the term product is defined as “anything that can be offered to a market for attention, acquisition, use or consumption that might satisfy a want or need” [1]. It is crucial to ensure that the functions of a product can satisfy customers’ needs. Therefore, successfully conveying information about functions to customers is important for both manufacturers and sellers.
|Q:||Can I use this for video editing?|
|A:||No, it does not support Google Play.|
|Q:||Can I make video calls to other non-Apple computers?|
|A:||Yes you can if they have Skype, Tango, or oovoo.|
|Q:||Will it be useful for music production?|
|Q:||Can I use Microsoft Office on this MacBook Pro?|
In e-commerce platforms, one issue in conveying such information is that products cannot be physically presented to customers before purchasing. To overcome this limitation, many alternative approaches are deployed, e.g., descriptions, pictures, and videos. However, detailed functionality information may not be readily available, for the following reasons. 1) The cost of testing functions, multiplied by a large number of products, can be extremely high; for example, it is infeasible to test whether every PC can run a specific high-performance PC game. 2) Some missing functions are deliberately hidden from descriptions by sellers to avoid hurting sales.
Fortunately, functionality information is exchanged between customers and sellers via online platforms, such as forums and community QA. This allows us to adopt an NLP-based approach to automatically sense and harvest product functions on a large scale. We formulate a novel text mining task called Function Need Recognition (FNR for short). A function need is defined as a sequence of words that indicates a function expression (e.g., “make video calls”). In this paper, we focus only on product function needs and leave satisfiability issues (e.g., whether a product can “make video calls”) to future work; a comprehensive study of product function satisfiability can be found in [2].
This task is non-trivial, and the following challenges have to be addressed. First, to ensure extraction quality, corpora that are dense and accurate in product functionality information are preferred; to the best of our knowledge, no existing study provides a corpus meeting these requirements. Second, the number of possible function needs is unlimited, so it is important to ensure that unexpected function needs can still be detected.
We address these challenges by first identifying and annotating a high-quality corpus. In particular, Amazon.com allows potential consumers to communicate with existing product owners or sellers regarding product functions via Product Community Question Answering (PCQA for short). Four QA pairs about a laptop sold on Amazon are shown in Table I. Observe that the name of the target (to-be-purchased) product can be identified from the metadata of the target product, but the 4 function needs (“use for video editing”, “make video calls”, “useful for music production”, and “use Microsoft Office”) must be identified from the questions themselves.
Given the corpus, we then formulate the problem as a sequence labeling task on questions. We propose a deep sequence labeling model called Semi-supervised Attention Network (SAN) to solve this problem. The key property of SAN is to use the attention mechanism to summarize unlabeled data as side information for short labeled questions. For example, assume only the 1st question in Table I is in the labeled data and the other 3 questions are in the unlabeled data. Then words like “use” or “video” in the other 3 questions can serve as side information to help identify that “use for video editing” is a function. Another advantage of using unlabeled data is that the embeddings of words that do not appear in the labeled data can still be tuned during training. To the best of our knowledge, this is the first attempt to use the attention mechanism in a semi-supervised setting.
II Model and Preliminaries

II-A Model Overview
We briefly introduce the proposed Semi-supervised Attention Network (SAN) in this section. The idea is to couple an RNN-based sequence labeling network with attention over unlabeled data. The proposed network is illustrated in Fig. 1. The left side can be viewed as a supervised sequence labeling model: it reads in a (labeled) question and outputs its label sequence. The right side is the semi-supervised part. A few unlabeled questions are fed into a bank of BLSTMs (Bidirectional Long Short-Term Memory [3, 4]), one for each unlabeled question, with attentions (called bank attention). The attended results serve as side information for the (labeled) question. The key point is that, given a labeled question, we need to learn the weights on how to attend to (or read) the unlabeled questions. Note that the supervised and semi-supervised parts share the same embedding layer. This also gives the opportunity to tune embeddings of words that do not appear in the labeled questions; such tuning is impossible in purely supervised settings. All unlabeled questions share the same weights for their BLSTM layers (not shown in the figure). After each word in the labeled question obtains its side information, we feed the augmented labeled question into another BLSTM layer and generate the label sequence via a softmax layer. Overall, the labeled question can leverage unlabeled questions to decide the output labels in an end-to-end manner.
Embedding Layer We pair each labeled question with a few unlabeled questions (for both the training data and the test data). Unlabeled questions are similar questions from the same category as the labeled question, returned by a search engine. When a question contains multiple sentences, we concatenate them into a single sequence and separate the sentences by a special token EOS. We set the maximum question length to 40 words, which covers 99.5% of the labeled questions in full; longer questions are truncated and shorter ones are padded with zeros. Each question can then be viewed as a matrix of one-hot column vectors, which is transformed into an embedded representation. We pre-train the word embeddings via the skip-gram model [5] and fine-tune them when optimizing the proposed model.
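As an illustration, the padding/truncation and vocabulary-lookup step of the embedding layer can be sketched as follows. The vocabulary, token ids, and the reserved padding/UNK ids are illustrative, not the paper's exact preprocessing code:

```python
import numpy as np

def encode_question(tokens, vocab, max_len=40):
    """Map a (possibly multi-sentence) question to a fixed-length id sequence.

    Sentences are assumed to be pre-joined with the EOS token. Questions
    longer than max_len are truncated; shorter ones are zero-padded
    (id 0 is reserved for padding, id 1 for unknown words).
    """
    ids = [vocab.get(tok, vocab.get("UNK", 1)) for tok in tokens][:max_len]
    ids += [0] * (max_len - len(ids))  # zero-pad to max_len
    return np.array(ids, dtype=np.int64)

# Hypothetical toy vocabulary for illustration.
vocab = {"UNK": 1, "can": 2, "i": 3, "use": 4, "this": 5, "for": 6,
         "video": 7, "editing": 8, "EOS": 9}
x = encode_question("can i use this for video editing".split(), vocab)
```

The resulting id matrix is then looked up in the (pre-trained, fine-tunable) embedding matrix to produce the embedded representation.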
BLSTM Layer The embedded question sequences are fed into the labeled BLSTM and the unlabeled BLSTMs, respectively. We denote the outputs of these BLSTM layers for the labeled question and for the unlabeled questions accordingly. The important notations are shown in Table II and are used in the next section.
III Semi-supervised Attention Network
|The i-th word in the labeled question|
|The j-th word in an unlabeled question|
III-A Bank Attention
The key idea of SAN is to leverage the attention mechanism for semi-supervised learning. We utilize attention to synthesize side information from unlabeled data for each word in a labeled question. The intuition is that words in unlabeled data may carry useful information for sequence labeling when they talk about similar products. We introduce a hierarchical attention mechanism. As in traditional attention mechanisms, we let each word in a labeled question attend to the words in an unlabeled question; this is the level 1 attention. On the higher level, we pair a labeled question with multiple related unlabeled questions. Note that different questions may not contribute side information to the labeled question equally, so we allow one word in the labeled question to attend over the results of the level 1 attention on multiple questions. We use the term bank attention to refer to one word in a labeled question hierarchically attending to a bank of unlabeled questions. The details are shown in Fig. 2.
To compute the side information for the i-th word in the labeled question, we first transform the word representations of the labeled question and of each unlabeled question via respective fully connected layers with trainable weights, followed by a non-linear activation. The i-th word in the labeled question then obtains an attention weight for the j-th word in the k-th unlabeled question via a dot product, and the weights are normalized by a softmax function. These are the level 1 attention weights. The side information of the i-th word in the labeled question with respect to the k-th unlabeled question (its representation after the first-level attention) is the weighted sum over all words in the k-th unlabeled question. We then apply a level 2 attention over the different unlabeled questions: we first transform the side information of the i-th word for each unlabeled question, obtain the level 2 attention weights again via dot products normalized by a softmax function, and finally compute the side information vector for the i-th word in the labeled question (its representation after the level 2 attention) as the weighted sum over the unlabeled questions.
Lastly, we concatenate the side information vector with the BLSTM output as the final representation of the i-th word in the labeled question.
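The two-level bank attention described above can be sketched in NumPy as follows. This is a minimal sketch under stated assumptions: tanh activations, dot-product scoring, and the weight names `W_l`, `W_u`, `W_q` are all illustrative rather than the paper's exact parameterization.

```python
import numpy as np

def softmax(z, axis=-1):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def bank_attention(h_lab, h_unl, W_l, W_u, W_q):
    """Hierarchical (bank) attention sketch.

    h_lab : (n, d)    BLSTM outputs of the labeled question
    h_unl : (K, m, d) BLSTM outputs of K unlabeled questions
    Returns (n, d) side-information vectors, one per labeled word.
    """
    q = np.tanh(h_lab @ W_l)                        # transformed labeled words
    k = np.tanh(h_unl @ W_u)                        # transformed unlabeled words
    # Level 1: each labeled word attends over words of each unlabeled question.
    scores1 = np.einsum("nd,kmd->knm", q, k)        # (K, n, m) dot products
    alpha1 = softmax(scores1, axis=-1)              # normalize over words
    c1 = np.einsum("knm,kmd->knd", alpha1, h_unl)   # per-question summaries
    # Level 2: each labeled word attends over the K question summaries.
    c1t = np.tanh(c1 @ W_q)                         # transformed summaries
    scores2 = np.einsum("nd,knd->nk", q, c1t)       # (n, K) dot products
    alpha2 = softmax(scores2, axis=-1)              # normalize over questions
    side = np.einsum("nk,knd->nd", alpha2, c1)      # (n, d) side information
    return side
```

The returned side-information matrix is what gets concatenated, word by word, with the labeled question's BLSTM outputs.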
III-B Sequence Labeling
After obtaining the representation of the labeled question with side information, we feed it into another BLSTM layer. So we have two BLSTM layers for the labeled question, similar to a stacked BLSTM [6] (S-BLSTM); we use the S-BLSTM to obtain a better sequence representation. We then reduce the dimension of each word representation to the size of the label set via a fully connected layer, and output the probability distribution over labels for the i-th question word via a softmax function.
The trainable parameters include the parameters in the LSTM cells and the word embeddings. Finally, we optimize the cross-entropy loss function over the training dataset, summing over all training examples and all question words, where the ground truth is the gold label of the i-th question word in each training example. We leverage the Adam optimizer [7] to optimize the whole network. We set the learning rate to 0.001 and keep the other parameters the same as in the original paper. We set the dropout rate to 0.2 and the batch size to 256.
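The per-token softmax and cross-entropy objective described above can be sketched as follows. This is a minimal NumPy version; masking out padding tokens is our assumption about how padded positions are excluded from the loss:

```python
import numpy as np

def token_cross_entropy(logits, labels, pad_mask):
    """Mean cross-entropy over non-padding tokens.

    logits   : (n, L) per-token scores over L labels
               (output of the final fully connected layer)
    labels   : (n,)   gold label ids
    pad_mask : (n,)   1.0 for real tokens, 0.0 for padding
    """
    # Numerically stable log-softmax.
    z = logits - logits.max(axis=-1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    # Negative log-likelihood of the gold label at each position.
    nll = -log_probs[np.arange(len(labels)), labels]
    # Average over real (non-padding) tokens only.
    return (nll * pad_mask).sum() / pad_mask.sum()
```

In the full model this loss would be minimized with Adam over mini-batches of 256 examples, as stated above.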
IV Experimental Results
|Product||QA||% of QAs with Functions|
|Micro SD Card||283||81.27|
IV-A Corpus Annotation, Analysis, and Preprocessing
We crawled about 1 million QA pairs from the pages of products in the electronics department of Amazon as the training corpus for the skip-gram model [5] to obtain the word embedding matrix.
We further annotated a subset of 4999 QA pairs from 18 products for model training and testing. The basic statistics of the corpus are shown in Table III. The corpus is labeled by 3 annotators independently. The general annotation guidelines are as follows:
only yes/no QAs should be labeled;
a function expression is labeled as a function target with an optional function verb;
a function target can be specific entities (e.g., “iPhone”), general entities like “video” or service providers like “AT&T”;
a function target should be labeled as token spans containing nouns, adjectives, or model numbers (e.g., “Samsung micro SD EVO”);
expressions about specific aspects or accessories are not considered as function expressions. This is because aspects or accessories are not closely related to the functionality of the product as a whole;
nouns that are subjective are not regarded as function targets (e.g., the word “need” in “Can it fit my need?”);
the optional function word can be a verb (e.g., “produce” in “produce music”) or its noun form (e.g., “production” in “music production”); we also include the adjunct word (e.g., “with” in “work with iPhone”) for extrinsic function expression;
some function expressions do not have a function word, e.g., “Does Skype ok on this?”.
All annotators initially agreed on their annotations (same function targets and function words) on 81% of all QA pairs. Disagreements are then resolved to reach final consensus annotations.
We observe that accessories (the last 5 products) have a higher percentage of function-need-related questions than the main products (the first 13 products). This is expected, since one accessory may work with multiple devices and thus have more functions.
The annotated corpus is preprocessed using Stanford CoreNLP (http://stanfordnlp.github.io/CoreNLP/) with the following steps: sentence segmentation, tokenization, POS tagging, lemmatization, and dependency parsing. The last 3 steps provide features for the Conditional Random Fields (CRF) [8] baseline.
We also select the 5 most similar unlabeled questions under the same category as the labeled question, returned by ElasticSearch (www.elastic.co), as the question bank.
We only perform sentence segmentation and tokenization on these unlabeled questions to save preprocessing time. Lastly, multiple sentences in both labeled and unlabeled questions are concatenated together. We set the maximum length of a question to 40 words, which covers 99.5% of the labeled questions in full length.
After preprocessing, one example contains a labeled question, 5 unlabeled questions, and one labeled answer. We shuffle all examples and select 70% for training, 10% for validation, and 20% for testing. The validation set is used to avoid overfitting the training data.
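The shuffle-and-split step can be sketched as follows; the fixed random seed is illustrative, not something the paper specifies:

```python
import random

def split_examples(examples, seed=0):
    """Shuffle and split into 70% train / 10% validation / 20% test,
    matching the proportions stated above."""
    examples = list(examples)
    random.Random(seed).shuffle(examples)  # deterministic shuffle for the sketch
    n = len(examples)
    n_train, n_val = int(0.7 * n), int(0.1 * n)
    return (examples[:n_train],
            examples[n_train:n_train + n_val],
            examples[n_train + n_val:])
```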
|Method||Precision||Recall||F1|
|SAN (-) BLSTM2||0.83||0.70||0.759|
We compare the following baselines with SAN:
CRF: We use Mallet (http://mallet.cs.umass.edu/) as the CRF implementation. We train a CRF model using exactly the same training data as the proposed method, with the following manually created features:
the words within a 5-word window;
the POS tags within a 5-word window;
the number of characters;
binary indicators (camel case, digits, dashes, slashes and periods);
dependency relations for the current word obtained via dependency parsing.
We use CRF as a baseline to show the performance of a non-deep learning method.
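The hand-crafted feature template for the CRF baseline can be sketched as follows. Dependency-relation features are omitted here, and all feature names are illustrative rather than the exact Mallet configuration:

```python
def crf_features(tokens, pos_tags, index, window=2):
    """Feature dict for one token: words and POS tags in a 5-word window
    (window=2 on each side), character count, and binary shape indicators."""
    feats = {}
    for off in range(-window, window + 1):
        j = index + off
        if 0 <= j < len(tokens):
            feats[f"w[{off}]"] = tokens[j].lower()   # word in window
            feats[f"pos[{off}]"] = pos_tags[j]       # POS tag in window
    w = tokens[index]
    feats["num_chars"] = len(w)                      # number of characters
    # Binary indicators: camel case, digits, dashes, slashes, periods.
    feats["has_camel"] = w != w.lower() and w != w.upper()
    feats["has_digit"] = any(c.isdigit() for c in w)
    feats["has_dash"] = "-" in w
    feats["has_slash"] = "/" in w
    feats["has_period"] = "." in w
    return feats
```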
S-BLSTM: This baseline is a traditional stacked BLSTM with 2 layers (SAN with the bank attention removed). It is a supervised baseline, which we use to show that purely supervised data is not good enough and that unlabeled data can help improve performance.
SAN (-) BLSTM2: This baseline removes the second BLSTM layer for the labeled question. We use it to show that the stacked BLSTM works better for our problem. We use 5 unlabeled questions in both this baseline and SAN.
Result Analysis From Table IV, we can see that the proposed SAN framework performs best on F1-score. Although CRF is a non-deep-learning model, its precision is decent since we use dependency relations as features. However, the recall of CRF is very low since it can only train weights on words that appear in the training data. All deep learning models have better recall than CRF. S-BLSTM has the best precision, as it is trained using only the training data; however, its recall is relatively low, since training cannot further tune the embeddings of words that do not appear in the training data. SAN (-) BLSTM2 shows that the additional BLSTM layer is effective in learning better representations. Lastly, SAN significantly improves the recall by further adjusting the weights for different unlabeled questions, while losing only 0.5% in precision compared with S-BLSTM.
V Related Work
Although CNNs [19, 20] and Long Short-Term Memory (LSTM) are both used in NLP tasks, LSTM is more commonly used in sequence labeling [21, 22]. The attention mechanism became popular in image recognition [23, 24] and was later applied to natural language processing [25, 26]. However, attention has so far been used only in supervised settings; we adapt it to a semi-supervised setting. Traditional semi-supervised learning uses unlabeled data directly as training examples [27, 28]; instead, we use unlabeled data as side information for labeled examples.
VI Conclusion

In this paper, we proposed the task of Function Need Recognition (FNR), which is to identify function needs queried by customers. We leveraged a Semi-supervised Attention Network (SAN) to solve this problem by using unlabeled data as attended side information. Experiments demonstrate that SAN outperforms a number of baselines.
Acknowledgments

This work is supported in part by NSF through grants IIS-1526499 and CNS-1626432, and by NSFC 61672313.
[1] P. Kotler and G. Armstrong, Principles of Marketing. Pearson Education, 2010.
[2] H. Xu, S. Xie, L. Shu, and P. S. Yu, “Dual attention network for product compatibility and function satisfiability analysis,” in Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), 2018.
[3] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[4] M. Schuster and K. K. Paliwal, “Bidirectional recurrent neural networks,” IEEE Transactions on Signal Processing, vol. 45, no. 11, pp. 2673–2681, 1997.
[5] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean, “Distributed representations of words and phrases and their compositionality,” in Advances in Neural Information Processing Systems, 2013, pp. 3111–3119.
[6] S. El Hihi and Y. Bengio, “Hierarchical recurrent neural networks for long-term dependencies,” in NIPS, vol. 400, 1995, p. 409.
[7] D. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
[8] J. Lafferty, A. McCallum, and F. C. N. Pereira, “Conditional random fields: Probabilistic models for segmenting and labeling sequence data,” in Proceedings of the Eighteenth International Conference on Machine Learning (ICML 2001), Williamstown, MA, USA, 2001, pp. 282–289.
[9] M. Hu and B. Liu, “Mining and summarizing customer reviews,” in Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2004, pp. 168–177.
[10] J. McAuley and J. Leskovec, “Hidden factors and hidden topics: Understanding rating dimensions with review text,” in Proceedings of the 7th ACM Conference on Recommender Systems, 2013, pp. 165–172.
[11] J. J. McAuley, R. Pandey, and J. Leskovec, “Inferring networks of substitutable and complementary products,” in KDD, 2015.
[12] B. Liu, Sentiment Analysis: Mining Opinions, Sentiments, and Emotions. Cambridge University Press, 2015.
[13] L. Shu, H. Xu, and B. Liu, “Lifelong learning CRF for supervised aspect extraction,” in Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), 2017, pp. 148–154. [Online]. Available: http://aclweb.org/anthology/P17-2023
[14] J. McAuley and A. Yang, “Addressing complex and subjective product-related queries with customer reviews,” in World Wide Web, 2016.
[15] M. Liu, Y. Fang, D. H. Park, X. Hu, and Z. Yu, “Retrieving non-redundant questions to summarize a product review,” 2016, pp. 385–394.
[16] H. Xu, S. Xie, L. Shu, and P. S. Yu, “CER: Complementary entity recognition via knowledge expansion on large unlabeled product reviews,” in Proceedings of the IEEE International Conference on Big Data, 2016.
[17] H. Xu, L. Shu, J. Zhang, and P. S. Yu, “Mining compatible/incompatible entities from question and answering via yes/no answer classification using distant label expansion,” arXiv preprint arXiv:1612.04499, 2016.
[18] H. Xu, L. Shu, and P. S. Yu, “Supervised complementary entity recognition with augmented key-value pairs of knowledge,” arXiv preprint arXiv:1705.10030, 2017.
[19] Y. Kim, “Convolutional neural networks for sentence classification,” arXiv preprint arXiv:1408.5882, 2014.
[20] L. Shu, H. Xu, and B. Liu, “DOC: Deep open classification of text documents,” in Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, 2017, pp. 2901–2906. [Online]. Available: https://www.aclweb.org/anthology/D17-1313
[21] K. Greff, R. K. Srivastava, J. Koutník, B. R. Steunebrink, and J. Schmidhuber, “LSTM: A search space odyssey,” arXiv preprint arXiv:1503.04069, 2015.
[22] G. Lample, M. Ballesteros, S. Subramanian, K. Kawakami, and C. Dyer, “Neural architectures for named entity recognition,” arXiv preprint arXiv:1603.01360, 2016.
[23] H. Larochelle and G. E. Hinton, “Learning to combine foveal glimpses with a third-order Boltzmann machine,” in Advances in Neural Information Processing Systems 23, 2010, pp. 1243–1251.
[24] M. Denil, L. Bazzani, H. Larochelle, and N. de Freitas, “Learning where to attend with deep architectures for image tracking,” Neural Computation, vol. 24, no. 8, pp. 2151–2184, 2012.
[25] K. M. Hermann, T. Kocisky, E. Grefenstette, L. Espeholt, W. Kay, M. Suleyman, and P. Blunsom, “Teaching machines to read and comprehend,” in Advances in Neural Information Processing Systems, 2015, pp. 1693–1701.
[26] K. Xu, J. Ba, R. Kiros, K. Cho, A. C. Courville, R. Salakhutdinov, R. S. Zemel, and Y. Bengio, “Show, attend and tell: Neural image caption generation with visual attention,” in ICML, 2015, pp. 77–81.
[27] X. Zhu, “Semi-supervised learning literature survey,” 2005.
[28] A. Rasmus, M. Berglund, M. Honkala, H. Valpola, and T. Raiko, “Semi-supervised learning with ladder networks,” in Advances in Neural Information Processing Systems, 2015, pp. 3546–3554.