The amount of content online in different languages is greatly increasing, and the early days of English-language dominance on the Web have given way to language pluralism online. On many large user-generated content platforms, less than half the content is in English and many users do not speak English as a native language. As Internet-penetration rates are already high in many English-speaking countries, future user growth (and the content contributed by these users) will be predominantly in non-English languages.
The multilingual Internet we have today fractures users and content across language divides. This is especially apparent on user-generated content and social media platforms where the content is contributed by end users. The majority of articles on Wikipedia, for example, exist in only one language edition. Although the English edition of Wikipedia is the largest edition of the encyclopedia, there is still a large amount of information not available in English that is available in other language editions. Users of smaller language editions experience the inequality in information across languages even more acutely.
Most research in cross-language information retrieval has focused on helping individuals locate information in different languages, and the performance of machine translation has greatly increased in recent years. However, large questions remain about when and how to present content in a language different from the presumed native language of the user (henceforth “foreign-language content”) in user-generated content platforms. Within this article we consider the context of online user reviews and test the impact of foreign-language reviews on the perceived helpfulness of all reviews as a whole.
User product reviews are an excellent context in which to test the perceived usefulness of foreign-language content given the high importance that consumers place on the number and quality of reviews and the well-established link between user reviews and purchasing decisions. The findings have practical relevance not only for how travel, restaurant, shopping, and other websites present foreign-language reviews to their users, but also for the general acceptability of foreign-language content and translations on social media and other user-generated content platforms. This is especially relevant when user reviews or user-generated content in general is not available in the preferred language of a user.
Using an experimental approach, we first perform a within-subjects comparison to examine the effects of including foreign-language reviews. We then perform a between-subjects comparison to understand how adding a user-interface affordance to see translations of foreign-language reviews moderates this effect. Finally, we run a further iteration of the experiment in which reviews are “pretranslated” (i.e., the user does not need to click to see the translation).
Based on the literature examined below we expect that more reviews are more helpful, but it is not clear how helpful the inclusion of foreign-language reviews is (even with the option to see translations of the reviews). Indeed, one possibility is that foreign-language reviews could confuse or distract users and thus lower the overall usefulness that consumers receive from all the product reviews. Under this thinking it would be better to hide all foreign-language reviews from users even though this means fewer overall reviews are available.
Alternatively, we may find that foreign-language reviews have a positive, helpful effect (particularly when the number of native-language reviews is low) given the findings that more reviews are generally received more positively. Even without translations, users may still be able to pick out a few words or cognates and derive useful information from the number of stars given by each reviewer. This leads to our first hypothesis (H1): experimental conditions with more reviews will be rated more highly.
While to the best of our knowledge no research has looked specifically at the helpfulness of foreign-language reviews, previous studies have established that the readability of reviews correlates with how helpful the reviews are perceived to be. This implies our second hypothesis (H2): where the number of reviews is the same between two conditions, the condition with more reviews in the user’s first language will be rated more highly.
Additional factors that increase the ability of consumers to understand the reviews should also further increase the perceived helpfulness. We specifically test the addition of an interface affordance to see translations of the reviews, testing our third hypothesis (H3): the option to see translations of foreign-language reviews will increase the overall usefulness that users get from the reviews.
3 Background and Case Selection
Online user reviews affect product sales, but many factors contribute to the influence that reviews have on the decision-making processes of consumers. Example factors include the trust associated with reviewers, the quantity of reviews, the quantity of negative reviews, and the timing of reviews as well as the quality of reviews. The quantity of reviews needed for making a decision about a product likely varies from context to context, but has not been well studied. We follow the decision of Park et al. to use six reviews in the high-review condition of their experiment based on the results of a small focus group.
Despite the fact that many platforms have reviews and other content in multiple languages, research is lacking on how to best present reviews in multiple languages (if at all) and many different approaches are currently in use on major websites. The travel review website TripAdvisor, for example, generally shows reviews in reverse chronological order (most recent reviews first), but demotes foreign-language reviews so that they appear after all reviews in the language of the locale selected by the user. Google Play, a mobile app store, in contrast, hides foreign-language reviews making them completely inaccessible (although reviews from all languages are used to calculate the average rating of an app). Twitter, Facebook, and Google Plus all provide the option to see machine translations of foreign-language posts, and Facebook has experimented with showing machine translations in place of foreign-language posts.
3.1 Case selection
In order to understand the effects of displaying foreign-language reviews to users, we designed an online experiment in which subjects compared three London bicycle tours. Tours are the third largest category of attractions in London on TripAdvisor behind Nightlife and Shopping, and they are also a good example of an experience good where the quality and utility of the product can only be determined upon consuming it (i.e., upon taking the tour), and user reviews are especially important in such cases.
As a large, international city with tourists from all over the world, London is an appropriate setting. Non-English content has grown quickly online and now accounts for a sizable portion of user-generated content online [18, 32]. On TripAdvisor 25% of all reviews for London attractions are not in English, and 6% of tourist attractions in London have more non-English than English-language reviews.
We specifically compare the effects of showing Spanish-language reviews to supplement a smaller number of English-language reviews. The selection of languages was driven by three principal reasons. First, prior work has shown a reasonable correlation between Spanish and English language reviews. Second, Spanish is the second most spoken language in the United States where our subject pool is based and understanding the best way to handle Spanish and English reviews is important as an increasing amount of commerce in the United States is conducted in Spanish. Third, and most importantly, speakers of smaller languages are more willing (or, perhaps more forced) to engage with foreign-language content, and thus, the helpfulness that English-speakers derive from foreign-language reviews can be seen as a lower bound. We should expect speakers of other languages to derive as much or more helpfulness from foreign-language content.
4 Study I
4.1 Method and approach
In our first experiment, subjects were presented with three bicycle tours in London. The experimental interface showed a name, description, small photo, and three or six user reviews for each tour. There were a total of three different names, photos, and descriptions that were randomly grouped together for each subject and shown in a random order. (Footnote 1: The exact wording of all content, the code used to run the experiment, and the anonymous data from the experiment itself are available at http://scott.hale.us/pubs/?chi2017.)
In our first iterations of the experiment, we recruited subjects using Amazon Mechanical Turk. We described the study as being about evaluating tourist attractions, and presented it similarly to the search engine optimization (SEO) tasks that are very common on Mechanical Turk. We are aware that Mechanical Turk users are not representative of the entire US population, but they are more diverse than traditional subject pools and have been used to replicate well-established experimental findings [4, 6, 12, 28, 33]. Furthermore, appropriate randomization procedures ensure subjects within the treatment and control conditions have similar demographics, something that is very different from annotating “gold-standard” datasets. This research constitutes a first step in understanding the potential effects that the inclusion of foreign-language reviews has on the user experience of a site, and we hope publication of this study will generate interest and allow us to work with a platform operator to test our findings in the wild.
To form our pool of reviews, we selected 15 reviews at random from London bicycle tours on TripAdvisor and removed tour-specific references such as the names of tour guides. The reviews on TripAdvisor of bicycle tours were overwhelmingly positive, and our pool consisted of only 4 and 5 star reviews. We (human) translated all reviews to Spanish to create a pool of 15 reviews with text in both Spanish and English. (Footnote 2: One author is a native Spanish speaker and the other author is highly proficient.) We assigned the reviews to each tour for each subject at random such that we showed one tour with 3 English reviews (the en3 condition), one tour with 3 English and 3 Spanish reviews (the en3es3 condition), and one tour with 6 English reviews (the en6 condition). Within the en3es3 condition, half of the subjects were given the option to click to see English translations of the Spanish reviews and half were not. The en3es3 condition with the option to see English translations is shown in Figure 1.
Each review consisted of a star rating (1–5 stars), a title, and review text. We manipulated the star ratings so that each tour had a mean 4.5 rating overall (i.e., the tour in the en3 condition had one 5-star, one 4.5-star, and one 4-star review while the tours in the other two conditions had two reviews with each of these ratings). Review length has previously been shown to influence consumer decisions, so we controlled for this by truncating all reviews at 250 characters and appending a link reading “…More” to read the review further. We did not present the names, locations, or other information about the authors of the reviews as such information has also been shown to affect the trustworthiness that users ascribe to reviews.
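The per-subject assignment described above can be sketched in a few lines of Python. This is only an illustration and not the authors' experiment code: the function names and the seed are hypothetical, and it assumes that no review is reused across tours for the same subject (the three conditions together use exactly 15 reviews, which matches the pool size, but the source does not state this explicitly).

```python
import random

REVIEW_POOL_SIZE = 15  # the source describes a pool of 15 reviews


def assign_reviews(rng):
    """Partition the 15 review ids across the three conditions for one subject.

    Assumption: a review is never shown twice to the same subject.
    Each entry is (review_id, language_shown).
    """
    ids = list(range(REVIEW_POOL_SIZE))
    rng.shuffle(ids)
    return {
        "en3": [(i, "en") for i in ids[0:3]],
        "en3es3": [(i, "en") for i in ids[3:6]] + [(i, "es") for i in ids[6:9]],
        "en6": [(i, "en") for i in ids[9:15]],
    }


def has_translation_option(rng):
    """Give the translation affordance with probability 0.5 (assumed independent per subject)."""
    return rng.random() < 0.5


rng = random.Random(2017)  # hypothetical seed, used only to make the sketch reproducible
assignment = assign_reviews(rng)
show_translate_button = has_translation_option(rng)
```

Because every review exists in both languages, the language shown is decided at assignment time rather than being a property of the review itself.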
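The two controls in this paragraph (balanced star ratings and truncated review text) can be expressed as a short Python sketch. The helper names are hypothetical, and the sketch assumes that reviews at or under the 250-character limit are shown unchanged without the “…More” link, which the source does not specify.

```python
from statistics import mean

BASE_RATINGS = [5.0, 4.5, 4.0]  # one of each rating per block of three reviews


def star_ratings(n_reviews):
    """Return ratings with a mean of 4.5 for a tour with 3 or 6 reviews."""
    if n_reviews <= 0 or n_reviews % 3 != 0:
        raise ValueError("n_reviews must be a positive multiple of 3")
    return BASE_RATINGS * (n_reviews // 3)


def truncate(text, limit=250):
    """Cut review text at `limit` characters and append the '…More' link label.

    Assumption: text at or under the limit is returned unchanged.
    """
    if len(text) <= limit:
        return text
    return text[:limit] + "…More"


three_review_mean = mean(star_ratings(3))
six_review_mean = mean(star_ratings(6))
```

Holding the mean rating and the visible length constant means that differences between conditions can be attributed to the number and language of the reviews rather than to their ratings or length.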
We measured three outcome variables: the proportion of time spent evaluating each tour, an individual rating of the likelihood of booking each tour, and, at the end of the experiment, the selection of the one tour that each subject would be most likely to book.
The first outcome was measured without any action by the subject. We tracked the amount of time spent evaluating the three tours (ignoring time spent on the consent form and answering ending questions) and calculated the proportion of time given to each condition. We looked specifically at the proportion of time that subjects spent in each condition as we expected the distribution of raw times to be skewed and long tailed as some subjects would read faster and complete the study more quickly than others.
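As a minimal sketch of this first outcome measure, the following Python function normalizes one subject's evaluation times into proportions. The function name and the example timings are hypothetical and are not data from the study.

```python
def time_proportions(seconds_by_condition):
    """Convert one subject's raw evaluation times into proportions that sum to 1."""
    total = sum(seconds_by_condition.values())
    if total <= 0:
        raise ValueError("total evaluation time must be positive")
    return {cond: secs / total for cond, secs in seconds_by_condition.items()}


# Hypothetical timings (in seconds) for a single subject
example = time_proportions({"en3": 60, "en3es3": 90, "en6": 90})
```

Normalizing within each subject removes differences in overall reading speed, so a fast reader and a slow reader who divide their attention in the same way yield the same proportions.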
For each tour, we asked subjects a self-reported likelihood of booking the particular tour using a slider or visual analogue scale (VAS) with the end points labeled as “very unlikely” and “very likely”. This VAS resulted in an integer value between 0 and 100 inclusive, which we treated as continuous following common statistical practices. The value was not shown to subjects, who had to simply place themselves visually along the continuum. Such VASs are very common for pain, depression, mood, and other subjective measures [29, 34] having started in psychology in the early 1920s [19, 11] and become more widespread at the end of the 1960s. Numerous studies have shown they are as accurate and reliable as Likert scales but have better sensitivity. In particular, VASs are better suited for repeated measurements than Likert scales. Along with the likelihood to book, we asked subjects for the importance of the tour title, photo, description, and user reviews in reaching their decision. These were measured using VASs with endpoints labeled “very unimportant” and “very important.”
After the subjects had evaluated all three tours, we asked them to indicate which one of the three tours they would be most likely to book if they were to book only one. We further asked the importance of the title, photo, description, and reviews in reaching this decision.
Once subjects completed this overall selection, we informed subjects about our interest in the effects of foreign-language reviews and asked for their self-reported abilities to read English and Spanish. The exact wording of all questions is given in the appendix.
Per our hypotheses, we predicted higher ratings for the en3es3 condition than the en3 condition (more reviews are more helpful, H1), but that the en6 condition would be rated even higher than the en3es3 condition (when the number of reviews is the same, users will prefer more reviews in their first language, H2). In our between-subjects comparisons, we expected that subjects given the option to see translations of the Spanish-language text would rate the en3es3 condition more highly than subjects not given the ability to see translations (H3).
We first conducted a pilot of the experiment with 110 subjects before running the main experiment. Free-text comments from subjects indicated that our initial names for the tours were too distinctive and were influencing subjects’ choices. We changed the names from “Easy Peddle Bike Tours,” “London Bicycle Tours,” and “Vintage Cycle Tours” in the pilot to “City Bike Tours,” “London Bicycle Tours,” and “Capital Cycle Tours” for the main experiment. We further decided to randomize the matching of tour names, tour descriptions, and tour photos for each subject. We checked all reported results using control variables for tour name, description, and photo as well as control variables for the order in which tours were presented. The pilot also identified an issue with Google Chrome automatically translating foreign-language content, which was addressed in the full experiment.
We paid subjects $0.50 USD for this 5–7 minute study. We had 533 subjects complete the study, but discarded 53 subjects who either answered attention check questions incorrectly (e.g., what type of tour is this?) or indicated that they suspected the study was about foreign-language content before we informed them. We required subjects to be US-based and have a 95% or higher acceptance rate to participate.
Most subjects indicated they did not read Spanish at all (N=283) or that they read Spanish at only a basic level (N=143). We avoided asking about language abilities in advance in order to avoid biasing our subjects as to the purpose of our study. As a result, we had too few subjects with intermediate (N=36), advanced (N=14), or native (N=4) Spanish skills to conduct robust analysis with these groups. We set aside these subjects as well as 10 non-native speakers of English and proceeded with only the 416 subjects who were native English speakers with no or basic Spanish skills.
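The subject counts reported above can be checked with a few lines of arithmetic. All numbers below come from the text. The one assumption is that the 10 non-native English speakers do not overlap with the 54 subjects set aside for intermediate or higher Spanish skills, which is what the reported total of 416 implies.

```python
completed = 533
discarded = 53  # failed attention checks or guessed the purpose of the study
retained = completed - discarded  # 480

# Self-reported Spanish reading ability among retained subjects
spanish_level = {"none": 283, "basic": 143, "intermediate": 36, "advanced": 14, "native": 4}
assert sum(spanish_level.values()) == retained

# Subjects set aside before analysis
set_aside_spanish = sum(spanish_level[k] for k in ("intermediate", "advanced", "native"))  # 54
non_native_english = 10  # assumed disjoint from the 54 above
analysed = retained - set_aside_spanish - non_native_english
assert analysed == 416
```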
Subjects on average spent four minutes (standard deviation 5.5) evaluating all three tours. In accordance with our first hypothesis, subjects spent a significantly smaller proportion of their time evaluating the tour in the condition with only three reviews (en3, 29% of subjects’ time on average) than in either the en3es3 condition (36% of subjects’ time) or the en6 condition (35% of subjects’ time), as shown in Figure 2a. Two-tailed t-tests showed that the differences between the en3 and en3es3 conditions and between the en3 and en6 conditions were statistically significant, while the difference between the en3es3 and en6 conditions was not. The same outcomes were produced with multivariate linear regressions that also controlled for the order, titles, photos, and descriptions of the tours as well as the importance given to titles, photos, and descriptions.
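The text does not say whether the t-tests were paired or independent. Because every subject saw all three conditions, one plausible reading is a paired test on each subject's time proportions, and the following Python sketch computes that statistic using only the standard library. The function name and the data are hypothetical. A two-tailed p-value would additionally require the t distribution with the returned degrees of freedom, which is not computed here.

```python
import math
from statistics import mean, stdev


def paired_t(first, second):
    """Return (t statistic, degrees of freedom) for paired samples.

    Differences are taken as second - first, so a positive t means
    the second condition has the larger mean.
    """
    if len(first) != len(second) or len(first) < 2:
        raise ValueError("need two equally sized samples with at least 2 pairs")
    diffs = [b - a for a, b in zip(first, second)]
    sd = stdev(diffs)
    if sd == 0:
        raise ValueError("all differences are identical; t is undefined")
    t = mean(diffs) / (sd / math.sqrt(len(diffs)))
    return t, len(diffs) - 1


# Hypothetical time proportions for five subjects (not data from the study)
en3 = [0.30, 0.25, 0.33, 0.28, 0.31]
en3es3 = [0.36, 0.40, 0.33, 0.37, 0.34]
t_stat, dof = paired_t(en3, en3es3)
```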
In contrast to the amount of time spent evaluating each tour and contrary to our first hypothesis, when asked about their individual likelihood of booking each tour, subjects were no more likely to book the en3es3 tour than the en3 tour. The en6 tour was significantly more likely to be booked than either other condition (Figure 2b). Two-tailed t-tests found both comparisons to be statistically significant, as did multivariate regressions with the control variables previously mentioned.
After evaluating all three tours (one in each condition), subjects were asked to choose the one tour that they would be most likely to book. The results for this overall selection of one tour mirror those for the booking likelihood of each individual tour. Subjects were significantly more likely to select the tour presented with 6 English reviews (en6, selected by 45% of subjects) than either other tour (Figure 2c). 28% of subjects selected the en3 tour and 27% selected the en3es3 tour. Both of these values were significantly lower than the 45% selecting the en6 tour, established either with two-tailed t-tests for both comparisons or with multivariate logistic regression using the control variables previously mentioned.
These results only partially support our first hypothesis that the conditions with more reviews would be rated more favorably (H1). While the en6 condition was rated most favorably in accordance with the hypothesis, we observed no difference between the individual or overall booking likelihoods of the tours in the en3 and en3es3 conditions despite the additional number of reviews in the en3es3 condition. At the same time, it is important to note that we also did not observe an overall negative effect created by the Spanish-language reviews, as some interface designers might fear.
The differences between the en3es3 and en6 conditions strongly support the second hypothesis that when the number of reviews was the same, the condition with more reviews in subjects’ first languages would be more highly rated (H2).
Within the multilingual en3es3 condition, 218 subjects in the treatment group were given the option to see English translations of the Spanish-language reviews while another 198 subjects in the control group were shown the Spanish-language reviews with no option to see translations. The third hypothesis (H3) predicted that the subjects in the treatment group would rate the en3es3 tour more highly than the subjects in the control group.
Contrary to this hypothesis, there was no overall statistical difference between subjects in treatment and control. Subjects in the treatment group spent a slightly larger proportion of their time evaluating the tour in the en3es3 condition (37% vs. 34%, significant only when control variables were included), but were no more likely to book the tour. (Footnote 3: There was no significant difference between control and treatment either in the individual ratings (69.1 vs. 69.2 points) or in the overall selection of the en3es3 tour as the most preferred (29% vs. 26%).) Thus, the addition of the option to see translations of the Spanish-language reviews to the interface had no overall effect (Figure 3).
These top-line figures are the result of two important underlying facts. First, most subjects ignored the option to see translations of the Spanish-language reviews: 72% of the subjects in the treatment group did not click to see any translations. Among the 28% of subjects who did click to see one or more translations, nearly all clicked to see all three translations (88% of subjects who viewed one translation viewed all three translations).
Second, the subjects who chose to view one or more translations behaved very differently from the subjects who chose not to view any translations (Figure 4). Subjects clicking to view at least one translation rated the tour in the en3es3 condition significantly higher than subjects who did not click to view any translations. Subjects who clicked at least one translation rated the tour a mean 76 points while those not clicking a translation rated it a mean 66 points. This difference was statistically significant with both a two-tailed t-test and a multivariate regression with the control variables.
The mean rating in the en3es3 condition for subjects clicking to see at least one translation (76 points) did not differ significantly from either their ratings in the en3 condition (mean 70) or in the en6 condition (mean 75). In contrast, the mean rating in the en3es3 condition for subjects not clicking any translation (66 points) was significantly lower than the en6 condition (mean 75) and similar to their ratings in the en3 condition (mean 70).
The selection by subjects of one tour at the end of the experiment as the tour they would be most likely to book showed a similar pattern. Overall, 29% of subjects in the control condition without translation buttons picked the en3es3 tour as the one they would be most likely to book. In contrast, only 26% of subjects in the treatment condition picked the en3es3 tour. This again obscures differences in behavior between those who clicked or did not click to see translations. 22% of the subjects in the treatment group who did not click to see any translations selected the en3es3 tour as the tour they would most likely book, while 34% of subjects who clicked to see at least one translation did so (compared with 29% in the control group). The difference in the overall tour choice by subjects who clicked or not was significant when tested with logistic regression including the control variables previously mentioned but was not significant with a t-test alone. The opposite direction of these effects suggests that adding the translation buttons (perhaps further drawing attention to the fact that some of the reviews were in a foreign language) can actually have a negative effect among some people. The free-text comments made by subjects support the idea that the mere presence of foreign-language reviews can have an effect, a point we return to in the conclusions.
Overall, adding Spanish-language reviews had no effect on subjects’ individual ratings of tours nor on their overall selection of one tour. The addition of buttons to show translations of the Spanish reviews also had no overall effect.
However, subjects who used the translation buttons behaved very differently from those who did not use the translation buttons. The use of the translation buttons was very divisive: generally subjects either clicked none of the translation buttons or clicked all three translation buttons. Those who clicked rated the tour more highly and were more likely to select it overall compared to those who did not click any of the translation buttons.
We have found no published figures as to how much translation options are used in the wild (e.g., on Facebook, Twitter, TripAdvisor, etc.). Our uptake rate of 28% is broadly in line with the figure an employee of one social media platform mentioned to us confidentially.
We did not hypothesize such a divergent effect for the Spanish-language content nor that the use of the translation buttons would distinguish subjects so clearly. As a result, we did not ask detailed questions about individual-level characteristics (e.g., demographics and personality), which might explain how people respond to foreign-language content. We address this point in the next study.
5 Study II
The results of the first study led us back to the literature to identify what individual-level characteristics might explain who benefits from foreign-language content.
The most relevant literature we identified came from management science where scholars have studied what factors predict successful overseas placements for employees at multinational companies [3, 7, 8]. In addition to practical steps such as orientation programs, personality has been found to be important in explaining which employees terminate their overseas placements early. The Big Five personality dimensions are broad and form a good starting place for researching and theorizing the role of personality. All five dimensions have been linked to successful overseas placements with particular importance ascribed to extraversion and openness [3, 8].
Of these, extraversion is thought to be important in helping expatriates to make friends in the new culture and thus contribute to understanding and resilience. In contrast, openness is thought to be more directly related to the attitude of the expatriate toward new experiences, and thus openness should be the most relevant personality dimension in this experiment. Based on this literature, we hypothesized that individuals with a high level of openness would be more willing to engage with foreign-language content. Beyond standard measures of openness, Caligiuri developed a more detailed scale of attitudinal and behavioral openness building on the work of Budner examining tolerance of ambiguity. This scale was designed specifically to better understand the role of openness in helping expatriates adapt to new cultures.
5.1 Method and approach
We followed the same approach as Study I, but added questions drawn from the literature to examine personality. We specifically investigated the Big Five using the Ten Item Personality Inventory in addition to a separate, more in-depth measure of attitudinal openness from Caligiuri.
We also introduced a third, “pretranslation” condition into our between-subjects comparison. This condition showed six English reviews, but stated that three of the reviews had been translated from Spanish. It further offered a button to “See original” that, if clicked, showed the Spanish version of the review. We randomly assigned subjects between the no translation, translation, and pretranslation conditions. We set the proportions to be 20%, 40%, and 40% for each condition respectively in order to have more subjects in the translation and pretranslation conditions.
We further changed crowdsourcing platforms and ran this second study using Prolific.ac. Prolific is designed specifically for academic studies and a number of demographic variables are available about subjects without requiring these questions to be asked in the study itself. This allowed us to keep our experiment short while still investigating demographic variables in depth. In particular, we examined the following variables from Prolific: gender, age, highest education level, religious affiliation, political affiliation, and Caucasian/non-Caucasian ethnicity. We added to these the Big Five personality characteristics and the attitudinal openness scale.
Although Prolific has a large international pool of workers, we restricted responses to people in the United States in order to test/replicate the findings of Study I. We also required Prolific subjects to have a 90% or higher acceptance rate (the highest rate that can be required). Payments on Prolific are denominated in British Pounds, and we paid our subjects £0.60 GBP (approximately $0.80 USD at the time of the experiment) for their time. This payment is higher than Study I due to the additional time required to answer the personality questions.
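The weighted random assignment can be sketched as follows. This is a hypothetical illustration rather than the authors' code: the condition labels and seed are invented, and the sketch assumes each subject is assigned independently using `random.choices` from the Python standard library.

```python
import random
from collections import Counter

CONDITIONS = ["no_translation", "translation", "pretranslation"]  # hypothetical labels
WEIGHTS = [0.2, 0.4, 0.4]  # proportions stated in the text


def assign_condition(rng):
    """Draw one between-subjects condition with the stated 20/40/40 weights."""
    return rng.choices(CONDITIONS, weights=WEIGHTS, k=1)[0]


# Simulate many assignments to see that the realized shares approach the weights
rng = random.Random(42)  # hypothetical seed
counts = Counter(assign_condition(rng) for _ in range(10_000))
shares = {cond: counts[cond] / 10_000 for cond in CONDITIONS}
```

With independent draws, the realized group sizes in a sample of a few hundred subjects will vary around the target proportions, and later exclusions shift them further, so the analysed group sizes need not match 20/40/40 exactly.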
300 subjects completed this second experiment, but we discarded 7 subjects who answered attention check questions incorrectly, 28 subjects who suspected the study was about language, and two subjects who had Internet connectivity issues. Once again most subjects had either no Spanish proficiency (N=125) or basic proficiency (N=113). We discarded 25 subjects with intermediate or higher levels of Spanish and two subjects who were not native speakers of English. We analyzed the remaining 236 subjects.
Per our randomization, 40 subjects saw the en3es3 condition with no translation option, 92 saw this condition with reviews in Spanish and the option to translate to English, and 104 saw this condition with reviews in English (and were told they had been translated from Spanish).
Our Prolific subjects were all born in the United States and were native speakers of English. They ranged in age from 18 to 73 (mean 34), and a slightly larger proportion were male than female (58% male). Most (79%, 184 subjects) had Caucasian ethnicity, 14 had African ethnicity, 10 had East Asian ethnicity, nine had mixed ethnicity, and the remainder were of other ethnicities. Statistical tests showed similar age, gender, and Caucasian/non-Caucasian distributions across the three treatment conditions.
5.2.1 Translation conditions
The data from the Prolific subject pool showed greater similarity between the three conditions than did the data from Mechanical Turk subjects. Subjects spent a larger proportion of time in en6 than en3 (34% vs. 33%, significant only with controls) while the proportion of time spent in en3es3 (33%) was not significantly different from the other conditions (Figure 5a).
Due to the greater similarity and smaller sample size, there were no statistically significant differences between the individual booking likelihoods or the overall booking choices. The booking likelihoods in the condition with reviews pretranslated showed high variance (sd 23.2), and the mean value was not significantly different from that of the en6 condition (71.6 vs. 73.2). Neither of these values differed significantly from the en3es3 condition with the option to see translations (mean 73.6, Figure 5b).
Importantly, however, the key observation from Study I was replicated: subjects who clicked to see at least one translation of a Spanish review rated the tour in the en3es3 condition more highly than the subjects who did not click to see any translations.
A similar proportion of subjects (26%) clicked to view one or more translations when given the option. The subjects who clicked to see translations indicated they were more likely to book the en3es3 tour than those who did not click to see any translations. Subjects not clicking a translation button rated the tour a mean 70 points while those clicking to see one or more translations rated the tour a mean 84 points. This difference was statistically significant with both a t-test and multivariate regression with control variables.
In fact, subjects who clicked to see at least one translation in the en3es3 condition also rated the en3es3 tour more highly than the en6 tour (mean ratings 84 and 73, tested with both a t-test and a multivariate regression with control variables). In contrast, there was no significant difference between the ratings of the en3es3 tour for those not clicking to see translations and the en6 condition (mean ratings 70 and 73, with either a t-test or a regression with control variables). The choice of one tour as the most preferred tour showed a similar pattern, but none of the differences were statistically significant.
5.2.2 Individual differences
We now seek to understand what individual-level characteristics might explain which subjects found the Spanish reviews more useful as well as which subjects clicked the translation buttons. For this subsection we restricted our analysis to the 92 subjects in the translation condition.
As mentioned, subjects who clicked to see one or more translations rated the en3es3 tour significantly higher than those who did not click to see any translations. Surprisingly, however, no demographic or personality variables were associated with higher ratings of the tours.
Given that those who viewed translations and those who did not behaved so differently, we turned to analyze which subjects clicked to see one or more translations. Contrary to expectations again, however, no personality or demographic variables were significantly associated with clicking to see translations.
Perhaps unsurprisingly there was a strong relationship between the importance given to reviews and clicking to see one or more translations (84 vs. 94 points). The direction of this relationship could go either way: people who clicked on the translation buttons could have found the reviews more helpful or, alternatively, people who feel reviews are very important in general could have clicked to see the translations. More notably, the mean importance given to reviews in all three conditions was significantly higher among subjects clicking to see translations (86 vs. 94 points). This suggests the latter interpretation is more likely: the people most likely to click to see translations are people who place great importance on user reviews.
Study II reproduced the key result of Study I using a different pool of subjects. Namely, subjects who clicked to see a translation of a Spanish review rated the tour in the en3es3 condition more highly than did the subjects who did not click to see any translations. We did not collect individual demographic or personality information from Mechanical Turk subjects, but both subject pools are generally more educated than the US population as a whole. While Prolific respondents in this study were slightly more likely to be male, US-based users of Mechanical Turk are, in general, more likely to be female. (Ross et al. found that US-based Mechanical Turk workers were 37% male and that 55% had a Bachelor’s degree or higher.)
Despite the additional demographic and personality variables analyzed in Study II, our understanding of who benefits most from foreign-language content remains incomplete. The action that most distinguished participants was whether they clicked to see translations. Those who did rated the tour most highly. None of the personality variables drawn from the literature had a significant effect, but the subjects clicking to see translations were those subjects who placed great importance on user reviews.
Pretranslating user-generated content to the user’s language might seem like a way for more people to benefit from foreign-language content. However, the results showed high variance in this condition. As in Study I, there were likely subjects who were influenced simply by the presence of foreign-language content. The translation buttons provide a simple way of identifying those who find the foreign-language content most useful. More generally, the results indicate that platform operators should consider tracking all types of interaction with foreign-language content and personalizing the display of that content on a per-user basis. Some people may not wish to see this content, others will want to see it but need the option of seeing translations, and still others will want to read the content in its original languages. This study concentrated on people with no or limited proficiency in Spanish as a starting point, but people with higher foreign-language proficiencies are an obvious next step. A large proportion of the human population is bilingual and the needs of these users must be considered [15, 38].
The increasing amount of content in different languages online has resulted in large differences in the information available in different languages. Within this article, we have specifically focused on user reviews online, but the findings are also relevant to how foreign-language content is presented on social media and other platforms. Our experimental findings indicate the helpfulness that people derive from foreign-language reviews is a complex topic.
At the broadest, most general level, displaying foreign-language reviews is largely neutral. Our subjects in both studies found having three English and three Spanish reviews very similar to having only three English reviews, which was contrary to our hypothesis that more reviews (even in a foreign language) would be more helpful. At the same time, displaying foreign-language content, even in the absence of translations, did not create the overall negative effects that some interface designers fear (and hence seek to separate content in different languages).
The finding in Study I that the tour with six English reviews was most preferred confirms the second hypothesis that consumers most value reviews in their first language. In general this finding would suggest interfaces should prioritize reviews in a user’s first language over other reviews. However, our interface did not provide any additional information about the reviews such as their dates/recency or about the reviewers such as their locations. Future work is needed to understand how such factors intersect with language preferences.
The context of the product being evaluated is also an area for future work. London is in an English-speaking country and therefore our subjects might have been less likely to expect or tolerate non-English reviews (despite findings that a quarter of all London reviews on TripAdvisor are non-English). In future work, we would like to repeat the experiment with a non-English destination. For example, the same experimental setup and conditions could be used with bicycle tours in Madrid rather than London. In such a case, having Spanish as a local language might result in Spanish-language reviews being more expected, tolerated, or even considered to have prestige or trustworthiness as local content, and thereby yield different experimental results. Future work should also consider different products beyond tours or tourist attractions. In particular, the fact that a tour is experienced in a group setting may result in foreign-language reviews being viewed differently than when a good is individually consumed, such as a hotel room.
Native English speakers display the lowest levels of bilingualism online [10, 16, 17], and native English speakers were selected as a starting point for this research precisely so that the findings here would likely provide a lower bound on the amount of helpfulness people derive from content in other languages. It is expected that native speakers of smaller-sized languages will be more tolerant of and derive more value from foreign-language reviews (in line with findings that speakers of smaller-sized languages show higher levels of bilingualism online). This is something that future work should examine empirically. Foreign-language content can be especially important for languages with fewer speakers online and/or where many speakers access the Internet from mobile devices and may find it more difficult to contribute longer-form content. The successful use of foreign-language content could help overcome the initial “cold start” problems of acquiring users in a new or smaller-sized language on a user-generated content platform if the foreign-language content can be used to fill gaps in content.
The most important outcome of both studies relates to subjects’ use of the option to see translations and the different and opposite value of foreign-language reviews for subjects who clicked versus subjects who did not click to see translations. Overall, 26–28% of subjects given the option to see translations of the Spanish-language reviews clicked to view at least one translation. For these subjects, the Spanish-language reviews were positively received, and the subjects were more likely to indicate they would book the tour with these reviews. In contrast, subjects who did not click to see any translations despite being given the option viewed the tour more negatively and were less likely to book the tour with Spanish-language reviews. This behavior was consistent in both Study I using Mechanical Turk and Study II using Prolific.
It is possible that the addition of the translation buttons themselves created this effect by drawing more attention to the fact that the reviews were not in English, something that would need to be tested with an eye-tracking study. However, it is likely that the translation buttons simply separated subjects with different pre-existing attitudes toward foreign-language reviews. That is to say, the control group contained subjects with a mixture of attitudes toward foreign-language reviews, but it was not possible to observe these attitudes. The introduction of the translation buttons could simply have created the opportunity for subjects to differentiate themselves according to their attitudes toward foreign-language reviews: those with pre-existing positive attitudes could have been more likely to click the translation buttons while those with pre-existing negative attitudes could have been less likely to click them, thus producing (at least some of) the experimental effects observed.
In Study II we tested individual-level characteristics that could explain this behavior and the hypothesized pre-existing attitudes. Openness was expected to be a key individual trait, but in fact was found not to play a significant role. Indeed, none of the personality variables examined in Study II were significant. The largest predictor of using the translation buttons was placing high importance on reviews.
The finding that the mere presence of foreign-language content can alter perceptions fits with branding and country-of-origin studies for products in marketing research [2, 24, 27]. For example, Leclerc et al. found that when fictitious company names were pronounced in French rather than English, the attitudes of US subjects toward the fictitious brand and the hedonism of its products changed.
Preliminary analysis of the comments left in the optional free text questions of Study I supports the idea that the mere presence of foreign-language reviews had an impact for a small number of subjects. Some subjects saw the foreign-language reviews as a positive, “because it indicates the company is flexible and multinational,” while other subjects saw the foreign-language reviews negatively. In the most negative comment one subject stated “…I get enough Spanish here in the US. I’d rather not be around Spanish speaking people when I visit London.” The majority of comments left within the en3es3 condition mentioned the reviews positively without any reference to their language or made a neutral statement such as “last couple of reviews were in a foreign language.”
This study assessed the overall effects of foreign-language content and the impact of including a translation option. It took a first step toward analyzing how individual-level characteristics (demographics, personality, etc.) affect people’s attitudes toward foreign-language content, but the results were not fully satisfactory, and this remains an important area of investigation that is severely lacking in academic research. Perhaps the best recommendation for platform designers and operators at this point is to personalize the showing of foreign-language content and translations. If users interact with foreign-language content and/or click to view translations, more foreign-language content and translations should be shown to these users if available. However, if users ignore the foreign-language content and translation buttons they are offered, it might be appropriate to hide further foreign-language content from these users. Special attention should be paid, however, to bilingual users as these users are likely to appreciate content in their multiple languages.
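One way to read the recommendation above is as a simple per-user display rule. The sketch below is an illustration under stated assumptions rather than a tested design: the class and function names, the `min_shown` threshold of 10, and the three display modes are all hypothetical and would need tuning and evaluation on a real platform.

```python
from dataclasses import dataclass
from enum import Enum


class Display(Enum):
    HIDE = "hide further foreign-language reviews"
    SHOW_WITH_BUTTON = "show foreign-language reviews with a translation button"
    SHOW_MORE = "show more foreign-language reviews and translations"


@dataclass
class Interaction:
    """Hypothetical per-user interaction log for foreign-language reviews."""
    shown: int = 0               # foreign-language reviews displayed so far
    translation_clicks: int = 0  # clicks on a "see translation" button
    other_engagement: int = 0    # any other interaction with such reviews
    is_bilingual: bool = False   # user is known to read the other language


def choose_display(u: Interaction, min_shown: int = 10) -> Display:
    """Pick a display mode from observed behaviour (threshold is an assumption)."""
    # Users who engage, or who read the language, get more of this content.
    if u.is_bilingual or u.translation_clicks > 0 or u.other_engagement > 0:
        return Display.SHOW_MORE
    # Users who have consistently ignored the content may prefer not to see it.
    if u.shown >= min_shown:
        return Display.HIDE
    # Too little evidence so far: keep offering the option.
    return Display.SHOW_WITH_BUTTON


print(choose_display(Interaction(shown=3, translation_clicks=1)).value)
```

The rule hides content only after a user has had repeated opportunities to engage, so that a single ignored review does not remove the option to see translations.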
Future work should also explore alternative designs and evaluation measures such as adding “this review is helpful” buttons and/or having reviews of different polarities in different languages (e.g., positive reviews in Spanish but negative in English or vice versa). The roles that cultural (dis)similarity or affinity play are also unclear.
This study, as a first step in assessing the impact of including foreign-language reviews, has indicated that how people react to foreign-language reviews, and how much help they derive from them, is a complex topic. We hope that this study will inspire future work on this important topic. As the quantity of reviews and other content in all languages increases, much further research is needed in order to understand when to use and how to present foreign-language content in different contexts to users with different linguistic backgrounds and attitudes.
We would like to thank our academic colleagues, friends at Meedan, and anonymous reviewers who provided helpful feedback on this research. This publication was supported by the John Fell Oxford University Press (OUP) Research Fund, The Alan Turing Institute under the EPSRC grant EP/N510129/1, the University of Oxford’s Economic and Social Research Council (ESRC) Impact Acceleration Account and Higher Education Innovation Fund (HEIF) allocation, and the European Union’s Horizon 2020 research and innovation programme under the Marie Sklodowska-Curie grant agreement No. 656439.
-  Khalid I. Al‐Sulaiti and Michael J. Baker. 1998. Country of origin effects: A literature review. Marketing Intelligence & Planning 16, 3 (1998), 150–199. DOI:http://dx.doi.org/10.1108/02634509810217309
-  Winfred Arthur and Winston Bennett. 1995. The International Assignee: The Relative Importance Of Factors Perceived To Contribute To Success. Personnel Psychology 48, 1 (1995), 99–114. DOI:http://dx.doi.org/10.1111/j.1744-6570.1995.tb01748.x
-  Adam J. Berinsky, Gregory A. Huber, and Gabriel S. Lenz. 2012. Evaluating Online Labor Markets for Experimental Research: Amazon.com’s Mechanical Turk. Political Analysis 20, 3 (2012), 351–368. DOI:http://dx.doi.org/10.1093/pan/mpr057
-  Stanley Budner. 1962. Intolerance of ambiguity as a personality variable. Journal of Personality 30, 1 (1962), 29–50. DOI:http://dx.doi.org/10.1111/j.1467-6494.1962.tb02303.x
-  Michael Buhrmester, Tracy Kwang, and Samuel D. Gosling. 2011. Amazon’s Mechanical Turk: A New Source of Inexpensive, Yet High-Quality, Data? Perspectives on Psychological Science 6, 1 (2011), 3–5. DOI:http://dx.doi.org/10.1177/1745691610393980
-  Paula M. Caligiuri. 2000a. The Big Five Personality Characteristics As Predictors Of Expatriate’s Desire To Terminate The Assignment And Supervisor-Rated Performance. Personnel Psychology 53, 1 (2000), 67–88. DOI:http://dx.doi.org/10.1111/j.1744-6570.2000.tb00194.x
-  Paula M. Caligiuri. 2000b. Selecting Expatriates for Personality Characteristics: A Moderating Effect of Personality on the Relationship Between Host National Contact and Cross-cultural Adjustment. MIR: Management International Review 40, 1 (2000), 61–80. http://www.jstor.org/stable/40835867
-  Paula M Caligiuri, Rick R Jacobs, and James L Farr. 2000. The Attitudinal and Behavioral Openness Scale: Scale development and construct validation. International Journal of Intercultural Relations 24, 1 (2000), 27–46. DOI:http://dx.doi.org/10.1016/S0147-1767(99)00021-8
-  Eurobarometer. 2011. User language preferences online: Analytical report. Technical Report. Survey conducted by The Gallup Organization, Hungary upon the request of Directorate-General Information Society and Media, European Commission.
-  M. Freyd. 1923. The Graphic Rating Scale. Journal of Educational Psychology 14, 2 (1923), 83–102. DOI:http://dx.doi.org/10.1037/h0074329
-  Joseph K. Goodman, Cynthia E. Cryder, and Amar Cheema. 2013. Data Collection in a Flat World: The Strengths and Weaknesses of Mechanical Turk Samples. Journal of Behavioral Decision Making 26, 3 (2013), 213–224. DOI:http://dx.doi.org/10.1002/bdm.1753
-  Samuel D Gosling, Peter J Rentfrow, and William B Swann Jr. 2003. A very brief measure of the Big-Five personality domains. Journal of Research in Personality 37, 6 (2003), 504–528. DOI:http://dx.doi.org/10.1016/S0092-6566(03)00046-1
-  Mark Graham, Scott A. Hale, and Monica Stephens. 2012. Featured graphic: Digital Divide: The Geography of Internet Access. Environment and Planning A 44, 5 (2012), 1009–1010. http://www.envplan.com/epa/fulltext/a44/a44497.pdf
-  François Grosjean. 2010. Bilingual: Life and Reality. Harvard University Press.
-  Scott A. Hale. 2014a. Global Connectivity and Multilinguals in the Twitter Network. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI ’14). ACM, New York, 833–842. DOI:http://dx.doi.org/10.1145/2556288.2557203
-  Scott A. Hale. 2014b. Multilinguals and Wikipedia Editing. In Proceedings of the 6th Annual ACM Web Science Conference (WebSci’14). ACM, New York. http://www.scotthale.net/pubs/?websci2014
-  Scott A. Hale. 2016. User Reviews and Language: How Language Influences Ratings. In Proceedings of the 2016 CHI Conference Extended Abstracts on Human Factors in Computing Systems (CHI EA ’16). ACM, New York, NY, USA, 1208–1214.
-  M. H. Hayes and D. G. Patterson. 1921. Experimental development of the graphic rating method. Psychological Bulletin 18 (1921), 98–99.
-  Brent Hecht and Darren Gergle. 2010. The Tower of Babel meets Web 2.0: User-generated content and its applications in a multilingual context. In Proceedings of the 28th International Conference on Human Factors in Computing Systems (CHI ’10). ACM, New York, NY, USA, 291–300. DOI:http://dx.doi.org/10.1145/1753326.1753370
-  Nan Hu, Ling Liu, and Jie Jennifer Zhang. 2008. Do online reviews affect product sales? The role of reviewer characteristics and temporal effects. Information Technology and Management 9, 3 (2008), 201–214. DOI:http://dx.doi.org/10.1007/s10799-008-0041-2
-  Oliver P John and Sanjay Srivastava. 1999. The Big Five trait taxonomy: History, measurement, and theoretical perspectives. In Handbook of Personality: Theory and Research (2nd ed.), Lawrence A. Pervin and Oliver P. John (Eds.). Number 1999. Guilford Press, 102–138.
-  C. R. B. Joyce, D. W. Zutshi, V. Hrubes, and R. M. Mason. 1975. Comparison of fixed interval and visual analogue scales for rating chronic pain. European Journal of Clinical Pharmacology 8, 6 (1975), 415–420. DOI:http://dx.doi.org/10.1007/BF00562315
-  Jill Gabrielle Klein. 2002. Us versus Them, or Us versus Everyone? Delineating Consumer Aversion to Foreign Goods. Journal of International Business Studies 33, 2 (2002), 345–363. http://www.jstor.org/stable/3069548
-  Lisa R. Klein. 1998. Evaluating the Potential of Interactive Media through a New Lens: Search versus Experience Goods. Journal of Business Research 41, 3 (1998), 195–203. DOI:http://dx.doi.org/10.1016/S0148-2963(97)00062-3
-  Nikolaos Korfiatis, Elena García-Bariocanal, and Salvador Sánchez-Alonso. 2012. Evaluating content quality and helpfulness of online product reviews: The interplay of review helpfulness vs. review content. Electronic Commerce Research and Applications 11, 3 (2012), 205–217. DOI:http://dx.doi.org/10.1016/j.elerap.2011.10.003
-  France Leclerc, Bernd H. Schmitt, and Laurette Dubé. 1994. Foreign Branding and Its Effects on Product Perceptions and Attitudes. Journal of Marketing Research 31, 2 (1994), 263–270. http://www.jstor.org/stable/3152198
-  Winter Mason and Siddharth Suri. 2012. Conducting behavioral research on Amazon’s Mechanical Turk. Behavior Research Methods 44, 1 (2012), 1–23. DOI:http://dx.doi.org/10.3758/s13428-011-0124-6
-  Heather M. McCormack, David J. de L. Horne, and Simon Sheather. 2009. Clinical applications of visual analogue scales: A critical review. Psychological Medicine 18, 4 (2009), 1007–1019. DOI:http://dx.doi.org/10.1017/S0033291700009934
-  Phillip Nelson. 1970. Information and Consumer Behavior. Journal of Political Economy 78, 2 (1970), 311–329. http://www.jstor.org/stable/1830691
-  Do-Hyung Park, Jumin Lee, and Ingoo Han. 2007. The Effect of On-Line Consumer Reviews on Consumer Purchasing Intention: The Moderating Role of Involvement. International Journal of Electronic Commerce 11, 4 (2007), 125–148. DOI:http://dx.doi.org/10.2753/JEC1086-4415110405
-  Daniel Pimienta, Daniel Prado, and Álvaro Blanco. 2009. Twelve years of measuring linguistic diversity in the Internet: Balance and perspectives. (2009). http://unesdoc.unesco.org/ulis/cgi-bin/ulis.pl?catno=187016
-  Ulf-Dietrich Reips. 2000. The Web Experiment Method: Advantages, Disadvantages, and Solutions. In Psychology Experiments on the Internet, M. H. Birnbaum (Ed.). Academic Press, San Diego, CA.
-  Ulf-Dietrich Reips and Frederik Funke. 2008. Interval-level measurement with visual analogue scales in Internet-based research: VAS Generator. Behavior Research Methods 40, 3 (2008), 699–704. DOI:http://dx.doi.org/10.3758/BRM.40.3.699
-  Joel Ross, Lilly Irani, M. Six Silberman, Andrew Zaldivar, and Bill Tomlinson. 2010. Who Are the Crowdworkers?: Shifting Demographics in Mechanical Turk. In CHI ’10 Extended Abstracts on Human Factors in Computing Systems (CHI EA ’10). ACM, New York, NY, USA, 2863–2872. DOI:http://dx.doi.org/10.1145/1753846.1753873
-  S.W. Sen, H. Ford, D.R. Musicant, M. Graham, Oliver S.B. Keyes, and Brent Hecht. 2015a. Barriers to the Localness of Volunteered Geographic Information. In Proceedings of the 29th International Conference on Human Factors in Computing Systems (CHI 2015).
-  Shilad Sen, Margaret E. Giesel, Rebecca Gold, Benjamin Hillmann, Matt Lesicko, Samuel Naden, Jesse Russell, Zixiao (Ken) Wang, and Brent Hecht. 2015b. Turkers, Scholars, “Arafat” and “Peace”: Cultural Communities and Algorithmic Gold Standards. In Proceedings of the 18th ACM Conference on Computer Supported Cooperative Work & Social Computing (CSCW ’15). ACM, New York, NY, USA, 826–838. DOI:http://dx.doi.org/10.1145/2675133.2675285
-  Ben Steichen, M. Rami Ghorab, Alexander O’Connor, Séamus Lawless, and Vincent Wade. 2014. Towards Personalized Multilingual Information Access - Exploring the Browsing and Search Behavior of Multilingual Users. In User Modeling, Adaptation, and Personalization, Vania Dimitrova, Tsvi Kuflik, David Chin, Francesco Ricci, Peter Dolog, and Geert-Jan Houben (Eds.). Lecture Notes in Computer Science, Vol. 8538. Springer International Publishing, 435–446. DOI:http://dx.doi.org/10.1007/978-3-319-08786-3_39
-  Qiang Ye, Rob Law, and Bin Gu. 2009. The impact of online user reviews on hotel room sales. International Journal of Hospitality Management 28, 1 (2009), 180–182. DOI:http://dx.doi.org/10.1016/j.ijhm.2008.06.011
Appendix A Question Wordings
Questions asked of each tour:
What type of tour is this? Walking, Bus, Bike/Cycle (random order)
If you were to book a tour of this type, how likely would you be to book this particular tour? Visual analogue scale with end points “Very unlikely” and “Very likely” mapped to integers 0–100
In reaching this decision how important were the following factors: Title, Photo, Description, Visitor reviews. Visual analogue scales with end points “Very unimportant” and “Very important” mapped to integers 0–100
Any specific comments? (optional)
Overall selection of one tour
If you were to book one of the three tours that you have just examined, which tour would you be most likely to book? (You may click on a tour to quickly see its listing again if you like.)
How confident are you in your choice? Visual analogue scale with end points “Not at all confident” and “Very confident” mapped to integers 0–100
In making the decision to select the tour that you did, how important was each of the following elements? Title, Photo, Description, Visitor reviews. Visual analogue scales with end points “Very unimportant” and “Very important” mapped to integers 0–100
Do you have any previous experience or knowledge of any of the tours that you examined?
I have had previous experience with this tour
I have not had previous experience with this tour
Big Five (Study II only)
All questions were asked using visual analogue scales with end points “Strongly disagree” and “Strongly agree” mapped to integers 0–100. Question order was randomized for each subject. Details are given in Gosling et al.
Openness (Study II only)
All questions were asked using visual analogue scales with end points “Strongly disagree” and “Strongly agree” mapped to integers 0–100. Question order was randomized for each subject.
I like being with unpredictable people
I like parties where I know most of the people more than ones where all or most of the people are complete strangers
What we are used to is always preferable to what is unfamiliar
I would like to live in a foreign country for a while
A person who leads an even, regular life in which few surprises or unexpected happenings arise, really has a lot to be grateful for
Other cultures fascinate me
I would prefer a foreign movie be subtitled rather than dubbed
Foreign language skills should be taught in (as early as) elementary school
Language ability questions
Prior to reading the information above, did you suspect that this study was about foreign-language reviews? Yes, No.
How would you rate your ability to read English?
I do not read English at all
I read English at a basic or beginning level
I read English at an intermediate level
I read English at an advanced level
English is my native language
How would you rate your ability to read Spanish? (same options as for English)
Foreign experience (Study II only)
Question order randomized; Answer choices: Yes, No, Prefer not to say
I am fluent in a language besides English
I have spent time living in another country
I have moved or been relocated a substantial distance (e.g., state to state or overseas)
I have studied a foreign language
Is there anything else you would like to share with us? (optional)
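As an illustration of how responses of this kind can be recorded, the sketch below maps a position on a visual analogue scale to an integer from 0 to 100 and shuffles question order for each subject. It is not the survey instrument used in the studies; the function names, the pixel-based input, and the seeding scheme are assumptions made for the example.

```python
import random


def vas_to_score(x_px: float, width_px: float) -> int:
    """Map a click position on a slider track to an integer from 0 to 100.

    Assumption: the left end of the track is 0 and the right end is 100.
    """
    if width_px <= 0:
        raise ValueError("width_px must be positive")
    fraction = min(max(x_px / width_px, 0.0), 1.0)  # clamp clicks to the track
    return round(fraction * 100)


def randomized_order(items, seed=None):
    """Return a shuffled copy of the items; a seed makes the order repeatable."""
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    return shuffled


# Three of the openness items listed in this appendix.
openness_items = [
    "I like being with unpredictable people",
    "I would like to live in a foreign country for a while",
    "Other cultures fascinate me",
]

print(vas_to_score(300, 400))
print(randomized_order(openness_items, seed=42))
```

Seeding the shuffle with a per-subject identifier would allow the presented order to be reconstructed later during analysis.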