1 Extract Users’ Emotions from Feedback
Initially developed for marketing and political opinion mining, sentiment analysis has become popular in many domains, including software engineering. Mining emotional sentiment has been used, e.g., to guide developer discussions  or to summarize users’ opinions on app features .
Sentiment analysis tools use natural language processing to extract emotions from text messages. Simple lexicon-based approaches match each text token against dictionaries of negative and positive words. The dictionaries define sentiment scores for specific tokens, such as “I hate[-4] that u need wifi but overall the app is great[+3]”. Depending on the tool, the scores are combined, e.g., into a single value reflecting the overall emotion expressed.
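The lexicon-based approach can be sketched in a few lines. The toy dictionary below and the combination rule (reporting the strongest positive and strongest negative score found, similar in spirit to SentiStrength’s dual output) are illustrative assumptions, not any tool’s actual implementation:

```python
# Minimal lexicon-based sentiment scorer (illustrative; not SentiStrength).
# The lexicon maps lower-cased tokens to sentiment scores on a -5..+5 scale.
LEXICON = {"hate": -4, "great": +3, "love": +4, "buggy": -3}

def score_text(text):
    """Return (strongest positive, strongest negative) score found in the text."""
    tokens = [t.strip(".,!?").lower() for t in text.split()]
    scores = [LEXICON.get(t, 0) for t in tokens]
    positive = max([s for s in scores if s > 0], default=0)
    negative = min([s for s in scores if s < 0], default=0)
    return positive, negative

print(score_text("I hate that u need wifi but overall the app is great"))
# -> (3, -4)
```

Real tools additionally handle negation, intensifiers (“very”), and emoticons, which a plain token match misses.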
Table I describes state-of-the-art sentiment analysis tools, including their approaches, scores, supported languages, technology, and licenses. For this investigation we selected SentiStrength , a common baseline for emotion classification used in many studies (e.g., ). Its results can be improved with domain-specific lexicons, in which certain terms carry different emotions than in general language, such as the term ‘bug’ in software engineering. SentiStrengthSE uses a manually adjusted version of the SentiStrength lexicon. Senti4SD is trained on manually labelled questions, answers, and comments from StackOverflow; hence, its training data includes more software-engineering-specific terms. Most sentiment analysis tools only support English, while apps are offered, and allow users to submit feedback, in more than 40 languages. Analyzing these languages requires creating additional dictionaries.
Table I. State-of-the-art sentiment analysis tools.

| Tool | Approach | Scores | Supported Languages | Technology | License |
|---|---|---|---|---|---|
| SentiStrength | Lexicon-based | Estimates strength of positive and negative sentiment | English (partly tested: Dutch, Finnish, German, Italian, Russian, Spanish, Turkish) | Command-line (Java) | Commercial, closed source (free for academic research) |
| SentiStrengthSE | Lexicon-based, ad-hoc heuristics | Estimates strength of positive and negative sentiment | English | Command-line (Java) | inherits from SentiStrength |
| Senti4SD | Supervised machine-learning | Estimates strength of positive and negative sentiment | English | Command-line (R) | Open source (MIT license) |
| Vader NLTK | Lexicon-based | Estimates strength of positive, neutral, negative, and compound sentiment | English | Python package | Open source (Apache license) |
| Google Natural Language API | Machine-learning (not further specified) | Classifies emotion into positive, neutral, and negative | English, Chinese, French, German, Italian, Japanese, Korean, Portuguese, Spanish | Remote API | Commercial, closed source |
| IBM Watson Tone Analyzer | Machine-learning (not further specified) | Identifies presence and estimates strength of analytical, anger, confident, fear, and tentative tones | English, French | Remote API | Commercial, closed source |
2 Emotional Patterns in App Reviews
We applied SentiStrength to 7 million app reviews , corresponding to the 25 top free and paid apps from each of the 23 categories of the US Apple App Store (December 2016). As apps can be listed in several categories, we removed duplicates and considered only each app’s main category. In total we analyzed 245 apps. We found that users’ sentiments are most negative within app reviews of the categories “photo & video” (mean: 0.4), “entertainment” (0.6), and “sports” (0.9). In contrast, we observed the most positive average sentiments within the categories “reference” (2.7), “education” (2.4), and “health & fitness” (2.3).
Reviews in app stores include a 1-5 star rating. We found a moderate correlation between rating and sentiment (Pearson correlation coefficient 0.57, Spearman rank correlation 0.56). Compared to star ratings, which are often skewed, sentiment scores are more fine-grained (e.g., on a range from -5 to +5) and can be calibrated to specific information needs, e.g., by fine-tuning the dictionaries. Further, sentiment scores enable comparisons across different platforms. Particularly in channels without explicit user ratings, such as social media (e.g., Twitter), automated sentiment analysis tools can help quickly assess users’ overall opinion. These channels are increasingly used by app vendors to gather feedback from users (e.g., @SpotifyCares, see https://twitter.com/spotifycares), due to their popularity and their ability to exchange information in the form of screenshots or video recordings.
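The rating-sentiment correlation can be checked on one’s own data without external libraries. The ratings and sentiment scores below are made-up toy values, not the study’s data:

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Toy data: star ratings (1-5) and sentiment scores (-5..+5) per review.
ratings = [5, 4, 1, 2, 5, 3]
sentiments = [4, 2, -3, -1, 3, 0]
print(round(pearson(ratings, sentiments), 2))  # -> 0.99
```

On real review data the coefficient is naturally lower (0.57 in our analysis), since many five-star reviews contain mixed or neutral wording.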
Analyzing the sentiment of different apps over time reveals four recurring emotional patterns , as shown in Figure 1. Each pattern can be related to specific issues or changes within the apps. For example, new app features might increase the average sentiment. The pattern consistent emotion is characterized by a stable negative, neutral, or positive sentiment over time. It can be observed in 15% of the analyzed apps, such as Spotify or Duolingo. The pattern inconsistent emotion is characterized neither by a constant sentiment nor by a clear positive or negative trend. This pattern can be observed for 62% of the analyzed apps, including WhatsApp. The pattern steady decrease/increase is characterized by a constant negative or positive sentiment trend. A constant negative trend can be observed in 10% of the analyzed apps, such as CNN or Microsoft Outlook; a positive trend in only 3% of the apps, such as AccuWeather. The pattern emotion drop/jump is characterized by a sudden change of the sentiment from negative to positive or vice versa. A change from negative to positive can be observed in 15% of the analyzed apps; a change from positive to negative in 9% of the apps, such as Google Mail or OverDrive.
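A rough heuristic for assigning a monthly sentiment series to one of the four patterns could be sketched as follows. The slope, jump, and variability thresholds are arbitrary illustrative choices, not the criteria used in the study:

```python
def classify_pattern(monthly_sentiment, slope_eps=0.05, jump_eps=1.5, var_eps=0.5):
    """Heuristically map a monthly average-sentiment series to one of four patterns."""
    n = len(monthly_sentiment)
    # Least-squares slope over month indices 0..n-1.
    mx = (n - 1) / 2
    my = sum(monthly_sentiment) / n
    slope = (sum((i - mx) * (y - my) for i, y in enumerate(monthly_sentiment))
             / sum((i - mx) ** 2 for i in range(n)))
    # Month-to-month changes; a large one indicates a drop or jump.
    deltas = [b - a for a, b in zip(monthly_sentiment, monthly_sentiment[1:])]
    if any(abs(d) >= jump_eps for d in deltas):
        return "emotion drop/jump"
    if abs(slope) >= slope_eps:
        return "steady decrease/increase"
    spread = max(monthly_sentiment) - min(monthly_sentiment)
    return "consistent emotion" if spread <= var_eps else "inconsistent emotion"

print(classify_pattern([2.0, 2.1, 2.0, 1.9, 2.0, 2.1]))  # -> consistent emotion
```

In practice the thresholds would be calibrated against manually labeled series, and smoothing would be applied before classification.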
3 Release Lessons
Regularly monitoring users’ emotions and identifying the corresponding patterns is a first step towards understanding an app’s health. We additionally present five release lessons that software practitioners can apply to improve users’ emotions and prevent overall negative feedback, which can lead to the fall of apps .
We derived the lessons by examining the release history, the content of user reviews, official vendor presentations, and technical blogs of several apps corresponding to each pattern. For each lesson we observed at least two indications (e.g., two example apps). Some of the lessons are also supported by recent studies. However, since our lessons are not the result of an in-depth empirical study, we refrain from claiming their generalizability or completeness. The lessons and their actionable recommendations should encourage and inspire practitioners to consider users’ emotions together with release frequency and complexity when fine-tuning their release processes.
3.1 Continuously Analyze User Feedback to Identify and React to Bug Reports and Feature Requests
Software practitioners should analyze user feedback of released app versions and react to frequently mentioned bug reports and feature requests, especially when similar features are already offered by competitors.
The majority of apps follow the inconsistent emotion pattern. These apps are affected by temporary bugs that are quickly fixed by developers. Figure 1 shows that the sentiment of WhatsApp (similar to Pandora) strongly decreases and then recovers for single periods of time. In the third month, relative to the start of our analysis, nearly all users reported storage issues, e.g., “Major Storage Issues”. With the release of an update that fixed the bug after a week, the issue was reported less often. In month 6, users frequently mentioned crashes after app start. Although these issues only appear for a short period of time, their impact on the overall sentiment is notable, since they affect the majority of users who install the update.
Apps that do not react to issues reported by their users are associated with the pattern steady decrease. Microsoft Outlook’s reviews included diverse emotions until month 6, as shown in Figure 1. Reviews with positive sentiments such as “Best email app” exceeded negative reviews, leading to a stable average sentiment value around 2. The majority of negative reviews are related to issues within the app, e.g., “Trash won’t empty” or “Bug when adding accounts”. As many competing apps exist, users apparently began to explore alternatives. One user wrote: “In iOS mail, you can copy an attached excel spreadsheets within body of email but outlook doesn’t format correctly.” Similarly, another user reported: “Although Microsoft has addressed several issue, it’s still buggy at times. I just started looking for a replacement app.”
Recommendation #1: Software practitioners should use tools (e.g., https://openreq.eu/) to classify and extract bug reports and feature requests from user feedback. Even when automated crash reporting tools are already in use, user feedback might include additional non-crashing bugs that cannot be captured automatically. These bugs should be clustered to determine their severity. Frequently reported bugs should be fixed quickly, before users explore alternative apps. Martin et al.  provide a broad overview of the app store analysis research area and existing approaches.
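In the absence of a dedicated tool, a first-cut keyword heuristic for separating bug reports from feature requests could look like the sketch below. The keyword lists are illustrative assumptions; production classifiers use supervised learning on labeled reviews instead:

```python
import re
from collections import Counter

# Illustrative keyword lists; a real classifier would be trained on labeled data.
BUG_TERMS = {"crash", "crashes", "bug", "buggy", "freeze", "error", "broken"}
FEATURE_TERMS = {"add", "wish", "should", "please", "feature", "bring"}

def classify_review(text):
    """Classify a review as 'bug report', 'feature request', or 'other'."""
    tokens = set(re.findall(r"[a-z']+", text.lower()))
    if tokens & BUG_TERMS:
        return "bug report"
    if tokens & FEATURE_TERMS:
        return "feature request"
    return "other"

reviews = ["App crashes with last update",
           "Please bring back Mark as Unread",
           "Love it"]
print(Counter(classify_review(r) for r in reviews))
```

Counting classified reviews per category over time gives a cheap proxy for the clustering and severity ranking described above.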
3.2 Frequently Release Small Changes
If possible, software practitioners should frequently introduce small changes to their apps instead of releasing fundamental changes at once, such as a major redesign of the user interface or the removal of app features.
For apps introducing major changes at once, we observed emotion drops. For Google Mail, users provided reviews with positive sentiments until the tenth month of our analysis, such as “Love it more than iPhone mail”. With an update that applied Android’s material design, which iOS users are unfamiliar with, the sentiment suddenly turned negative. Reviews with negative sentiments were often related to usability issues (e.g., “New update makes you click on each individual email to delete them.”) or to removed features (e.g., “Bring back Mark as Unread.”). A similar emotion drop can be observed when OverDrive introduced major changes within a single app update. The update included multiple bugs, as several users reported: “Can’t download books to device”, “Buggy, buggy, annoying”, or “App crashes with last update”.
In the case of Google Mail, the vendor reacted with weekly updates integrating features requested by users in the reviews, such as “Select multiple messages […]” and “Mark as read/unread […]”. With the release of those updates, the sentiment shows a positive trend. For OverDrive, the sentiment only recovered after a longer period of time, when most bugs were fixed at once in a single app update.
The appropriate frequency of app updates is discussed controversially in the literature. A recent study reports that frequently updated apps receive a significantly lower percentage of negative ratings . Another study, considering both negative and positive ratings, found only a weak correlation . However, that study shows that the types of released changes have a varying impact on app ratings: terms and topics around bug fixes and features occur frequently in the descriptions of impactful releases .
We recommend considering frequent, small updates to avoid surprising users with unexpected (i.e., too many or too major) changes [8, 9, 10]. Further, we consider frequent releases beneficial, since bugs get fixed faster in apps with shorter release cycles . Also, studies show that app releases lead to an increased amount of ratings and reviews [5, 9, 12], allowing developers to get more feedback and better understand user needs in highly competitive and dynamic markets .
Recommendation #2: High code churn in releases, i.e., the rate at which an app’s code evolves, correlates with lower ratings  and sentiment scores. Software practitioners should use the built-in functionality of issue trackers and version control systems, or external tools, to visualize the amount of change introduced (e.g., number of user stories resolved, number of bugs fixed, lines of code changed). Based on these measures, the severity of changes can be determined to decide whether they should be introduced in separate smaller, more frequent releases.
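Such change measures can feed a simple release gate. The weights and the threshold below are arbitrary placeholders that a team would calibrate against its own release history, not values from the study:

```python
def release_severity(stories_resolved, bugs_fixed, lines_changed):
    """Combine change measures into one severity score (illustrative weights)."""
    return stories_resolved * 3 + bugs_fixed * 1 + lines_changed / 100

def should_split_release(stories_resolved, bugs_fixed, lines_changed, threshold=50):
    """Suggest splitting the release when the severity score exceeds the threshold."""
    return release_severity(stories_resolved, bugs_fixed, lines_changed) > threshold

# A large, feature-heavy release trips the gate; a small patch release does not.
print(should_split_release(stories_resolved=12, bugs_fixed=8, lines_changed=4000))
print(should_split_release(stories_resolved=1, bugs_fixed=2, lines_changed=300))
```

The point of the gate is not the exact formula but making the split-or-ship decision explicit and repeatable.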
3.3 Pre-Release Changes to Subsets of App Users
Changes should be pre-released to subsets of app users before making them available to everyone. Spotify and Duolingo apply this lesson and are able to maintain a consistent positive emotion among their users.
For initial tests, software practitioners should provide access to alpha and beta app versions to voluntary users, as Spotify does (https://bit.ly/2T00cj1). The alpha version is updated almost daily and may be affected by stability issues. The beta version is updated one week before official app releases, to discover final issues. Feedback regarding these versions cannot be provided in the form of app reviews. Instead, testers email their feedback directly to the development teams, as indicated during sign-up for the programs. This approach aims to decouple testing (i.e., identifying and reporting bugs) from actually using and assessing the app.
Further, software practitioners should select individual users to participate in A/B-tests (e.g., as both Duolingo and Spotify do, see https://bit.ly/2FA7U0o, https://bit.ly/2FHntT3, and https://bit.ly/2VYpf8h). One group of the users temporarily receives access to new or modified app features. Duolingo states: “Every week, we test at least 10 things on a portion of our users.” (https://bit.ly/2HiB6KN).
Last, whenever possible new features should be gradually rolled out to all users so that app vendors can assess the overall impact on the emotional trend and react to unforeseen issues, e.g., by deactivating the functionality until the next app update.
Recommendation #3: Software practitioners should explore app stores’ functionality to distribute alpha and beta versions. The Apple App Store allows distributing these versions via TestFlight, using email invites or public links (https://apple.co/1kxr08D). On Google Play, developers can similarly release their apps using the Play Console (https://bit.ly/1gLkkv2). Google Play also allows advertising alpha and beta versions on the official app description page, visible to all users. After testing, changes should be gradually rolled out to assess their overall impact on users’ emotions and to be able to react to unforeseen issues.
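Gradual rollouts and A/B group assignment are commonly implemented with deterministic hash bucketing, so that each user consistently sees the same variant across sessions. A minimal sketch, in which the salt and rollout percentage are illustrative choices:

```python
import hashlib

def bucket(user_id, salt="rollout-experiment-1"):
    """Deterministically map a user to a bucket in [0, 100)."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % 100

def feature_enabled(user_id, rollout_percent):
    """Enable the feature for roughly rollout_percent of all users."""
    return bucket(user_id) < rollout_percent

# Sanity check: about 20% of a simulated user base sees the new feature.
enabled = sum(feature_enabled(f"user-{i}", 20) for i in range(10000))
print(f"{enabled / 100:.1f}% of users see the new feature")
```

Raising `rollout_percent` step by step (e.g., 1%, 5%, 20%, 100%) only ever adds users, and a problematic feature can be deactivated by setting it back to 0 without an app update.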
3.4 Explain Changes to Users
Major changes, such as an increase of the minimum required system version or the removal of app features, should be announced and explained to users. Studies show that users pay little attention to release notes . Instead, software practitioners should engage in conversations with users  or directly explain the changes within the app itself, e.g., using tutorials and tooltips.
We observed that apps that did not follow this lesson were affected by the pattern of steady decrease. For example, for the CNN app a user reported: “What happened to local news. I checked that every day […] please bring it back”. For Microsoft Outlook the sentiment decreased significantly when users were affected by incompatibilities with new and old iOS versions, such as “Update […] broke the app on iOS 8. Went back to using the stock mail app on my iPhone. Uninstalled it.”.
Recommendation #4: App changes should be transparent and understandable to users. Major changes that software practitioners do not want to release in smaller parts should therefore be explained to users, e.g., using built-in app tutorials. Further, users with legacy devices and system versions should be redirected to alternatives (e.g., a web version) through announcements before support is stopped.
3.5 Capture Implicit Feedback to Support Decisions
Software practitioners should capture implicit feedback to empirically determine whether experimental app features should be integrated. At the beginning, the measures taken should reflect a basic overall goal (e.g., maximizing the number of tracks listened to for music apps, see https://bit.ly/2Mka18R). Then, more complex measures can be developed.
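A minimal implicit-feedback counter for such a goal metric could be sketched as follows. The class, event granularity, and per-variant aggregation are illustrative assumptions, not a specific logging framework’s API:

```python
from collections import defaultdict

class GoalMetric:
    """Track a goal metric (e.g., tracks listened) per experiment variant."""
    def __init__(self):
        self.counts = defaultdict(int)   # variant -> total events logged
        self.users = defaultdict(set)    # variant -> distinct users seen

    def log(self, variant, user_id, events=1):
        """Record that a user in a variant produced one or more goal events."""
        self.counts[variant] += events
        self.users[variant].add(user_id)

    def per_user(self, variant):
        """Average events per user for a variant -- the comparable criterion."""
        return self.counts[variant] / max(len(self.users[variant]), 1)

m = GoalMetric()
m.log("control", "u1", 5)
m.log("control", "u2", 3)
m.log("new_player", "u3", 6)
print(m.per_user("control"), m.per_user("new_player"))  # -> 4.0 6.0
```

Normalizing by distinct users, rather than comparing raw totals, keeps the criterion comparable between variants of different sizes.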
For example, Spotify performs A/B-tests even for unfinished features to decide whether these should be further developed. For testing, changes are split into atomic parts. When changing, e.g., the navigation, one test looks into the UI while another test focuses on the content, i.e., order of menu items (https://bit.ly/2Wb0NQw).
While implicit measures help software practitioners evaluate and optimize features against comparable criteria, explicit feedback provides additional information on why the measures change . Where explicit feedback carries no ratings, the sentiment can be calculated to quickly assess users’ opinion. Further, explicit feedback offers a broader understanding of the impact of changes; users of devices that are no longer supported are, for example, still able to express their opinion in explicit feedback.
Recommendation #5: Software practitioners should take further steps towards data-driven requirements engineering  by integrating logging frameworks, such as AppSee or Google Analytics, into their apps. Beginning with simple measures that relate to the app’s overall goal, software practitioners should develop more complex ones by testing the impact of changes in atomic parts. The implicit measures complement explicit user feedback to support decisions about which experimental features to integrate into apps.
This research was partially supported by the European Union Horizon 2020 project OpenReq under grant agreement no. 732463.
Daniel Martens received the B.Sc. and M.Sc. degree in Computer Science from the University of Hamburg. He is a Ph.D. candidate in the Applied Software Technology group at the University of Hamburg. His research interests include user feedback, data-driven software engineering, context-aware adaptive systems, crowd-sourcing, and mobile computing. Besides his academic career, Daniel Martens also worked as a software engineer where he developed more than 30 top-rated iOS applications. He is a student member of the IEEE, ACM, and German Computer Science Society (GI). The photo of Daniel Martens was taken by UHH/Sukhina.
Walid Maalej is a professor of informatics at the University of Hamburg where he leads the Applied Software Technology group. His research interests include user feedback, data-driven software engineering, context-aware adaptive systems, e-participation and crowd-sourcing, and software engineering’s impact on society. He received his Ph.D. in software engineering from the Technical University of Munich. He is currently a steering committee member of the IEEE Requirements Engineering conference and a Junior Fellow of the German Computer Science Society (GI). The photo of Prof. Walid Maalej was taken by Koepcke-Fotografie.
-  F. Calefato, F. Lanubile, F. Maiorano, and N. Novielli, “Sentiment polarity detection for software development,” Empirical Softw. Engg., vol. 23, no. 3, pp. 1352–1382, Jun. 2018. [Online]. Available: https://doi.org/10.1007/s10664-017-9546-9
-  E. Guzman and W. Maalej, “How do users like this feature? a fine grained sentiment analysis of app reviews,” in 2014 IEEE 22nd International Requirements Engineering Conference (RE), Aug 2014, pp. 153–162.
-  M. Thelwall, K. Buckley, G. Paltoglou, D. Cai, and A. Kappas, “Sentiment strength detection in short informal text,” J. Am. Soc. Inf. Sci. Technol., vol. 61, no. 12, pp. 2544–2558, Dec. 2010. [Online]. Available: https://doi.org/10.1002/asi.v61:12
-  D. Martens and T. Johann, “On the emotion of users in app reviews,” in 2017 IEEE/ACM 2nd International Workshop on Emotion Awareness in Software Engineering (SEmotion), May 2017, pp. 8–14.
-  D. Pagano and W. Maalej, “User feedback in the appstore: An empirical study,” in 2013 21st IEEE International Requirements Engineering Conference (RE), July 2013, pp. 125–134.
-  G. Williams and A. Mahmoud, “Modeling user concerns in the app store: a case study on the rise and fall of yik yak,” in 2018 IEEE 26th International Requirements Engineering Conference (RE). IEEE, 2018, pp. 64–75.
-  W. Martin, F. Sarro, Y. Jia, Y. Zhang, and M. Harman, “A survey of app store analysis for software engineering,” IEEE Transactions on Software Engineering, vol. 43, no. 9, pp. 817–847, Sep. 2017.
-  S. McIlroy, N. Ali, and A. E. Hassan, “Fresh apps: an empirical study of frequently-updated mobile apps in the google play store,” Empirical Software Engineering, vol. 21, no. 3, pp. 1346–1370, Jun 2016. [Online]. Available: https://doi.org/10.1007/s10664-015-9388-2
-  W. Martin, F. Sarro, and M. Harman, “Causal impact analysis for app releases in google play,” in Proceedings of the 2016 24th ACM SIGSOFT International Symposium on Foundations of Software Engineering, ser. FSE 2016. New York, NY, USA: ACM, 2016, pp. 435–446. [Online]. Available: http://doi.acm.org/10.1145/2950290.2950320
-  L. Guerrouj, S. Azad, and P. C. Rigby, “The influence of app churn on app success and stackoverflow discussions,” in 2015 IEEE 22nd International Conference on Software Analysis, Evolution, and Reengineering (SANER), March 2015, pp. 321–330.
-  F. Khomh, T. Dhaliwal, Y. Zou, and B. Adams, “Do faster releases improve software quality? an empirical case study of mozilla firefox,” in 2012 9th IEEE Working Conference on Mining Software Repositories (MSR), June 2012, pp. 179–188.
-  D. Martens and W. Maalej, “Towards understanding and detecting fake reviews in app stores,” Empirical Software Engineering, May 2019. [Online]. Available: https://doi.org/10.1007/s10664-019-09706-9
-  K. Bailey, M. Nagappan, and D. Dig, “Examining user-developer feedback loops in the ios app store,” in Proceedings of the 52nd Hawaii International Conference on System Sciences, 2019.
-  J. Garcia-Gathright, C. Hosey, B. S. Thomas, B. Carterette, and F. Diaz, “Mixed methods for evaluating user satisfaction,” in Proceedings of the 12th ACM Conference on Recommender Systems, ser. RecSys ’18. New York, NY, USA: ACM, 2018, pp. 541–542. [Online]. Available: http://doi.acm.org/10.1145/3240323.3241622
-  W. Maalej, M. Nayebi, T. Johann, and G. Ruhe, “Toward data-driven requirements engineering,” IEEE Software, vol. 33, no. 1, pp. 48–54, Jan 2016.