GUIs are ubiquitous in modern mobile apps, making them practical and easy to use. However, as the visual effects of GUIs grow richer, five categories of UI display issues (Zhe et al., 2020), namely component occlusion, text overlap, missing image, null value, and blurred screen, frequently occur during the UI display process, especially across different mobile devices. Detecting these issues is hard because most of them are caused by many factors, especially on Android, such as different Android OS versions, device models, and screen resolutions (Wei et al., 2016). Practical automated testing tools such as Monkey (Developers, 2012; Wetzlmaier and Ramler, 2017) and Dynodroid (Machiry et al., 2013) are widely used in industry. However, these automated tools can only spot critical crash bugs, not UI display issues. Inspired by the fact that display bugs can be easily spotted by human eyes, we develop an automated online tool, OwlEyes-Online¹, which provides quick detection and localization of UI display issues from apps or GUI screenshots.
¹ OwlEyes-Online is so named because our approach is like an owl's eyes in effectively spotting UI display issues. Our model (nocturnal like an owl) can complement conventional automated GUI testing (diurnal like an eagle) in ensuring the robustness of the UI.
OwlEyes-Online is a user-friendly web app. Developers can upload GUI screenshots or apps and receive UI display issue detection results. When a developer uploads an APK, the tool automatically runs the app and captures its screenshots, and then uses computer vision techniques to detect UI display issues. OwlEyes-Online builds on a CNN to identify the screenshots with issues and on Grad-CAM to localize the regions with UI display issues in those screenshots as a reminder for developers. Finally, it summarizes the detection and localization results, automatically generates a test report, and sends it to the user. Because the CNN needs a large amount of training data, we adopt a heuristic data generation method to produce it.
OwlEyes-Online provides a dashboard for users to upload the screenshots or apps. After analyzing an uploaded screenshot, it displays detection results in real-time. As for an app, it automatically generates a test report (issue screenshots, localization, etc.) and sends the report to the user in an email.
This paper makes the following contributions:
- We implement a CNN-based issue detection method and a Grad-CAM-based issue localization method to detect UI display issues from GUI screenshots.
- We develop a fully automated web app. Users only need to upload an APK file, and OwlEyes-Online will automatically generate test reports and send them to users. We release the implementation of OwlEyes-Online on GitHub (11).
- An empirical study among professionals demonstrates the value of our UI display issue detection method and of OwlEyes-Online.
2. Our Fully Automated Approach
Based on the characteristics of UI display issues, we propose a fully automated UI display issue detection and localization approach. It consists of four parts: heuristic-based data generation, CNN-based issues detection, Grad CAM-based issues localization, and online inference of GUI issues. As shown in Figure 1, to improve the accuracy of our model, we use the heuristic-based data generation method to generate a large amount of training data. Given an APK, OwlEyes-Online automatically runs it and collects screenshots. The CNN-based model then classifies, via visual understanding, whether each screenshot contains an issue. Once an issue is confirmed, the Grad CAM-based model localizes its position on the UI screenshot to remind the developers.
2.1. Heuristic-based Data Generation
Training our proposed CNN for issues detection requires an abundance of screenshots (He et al., 2016) with UI display issues. However, no such open dataset exists so far, and collecting buggy screenshots is time- and effort-consuming. Therefore, we develop a heuristic-based data generation method that generates UI screenshots with display issues from bug-free UI images, as shown in Figure 1(a). The data generation is based on the Rico (Deka et al., 2017) dataset, which contains more than 66K unique screenshots and their JSON files (i.e., the detailed run-time view hierarchy of each screenshot). Given an input screenshot and its associated JSON file, the algorithm first locates all TextView and ImageView components, then randomly chooses one TextView/ImageView depending on the augmented category. Based on the coordinates and size of the chosen TextView/ImageView, the algorithm then makes a copy of it and adjusts its location or size according to specific rules to generate a screenshot with the corresponding UI display issue.
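The generation step can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a simplified hierarchy in which each node has a `class` string, a `bounds` list `[left, top, right, bottom]`, and an optional `children` list, it represents the screenshot as a nested list of pixel values, and it applies a single hypothetical rule (copy the chosen view and paste it half its height lower) to imitate a text overlap or component occlusion. The function names are illustrative only.

```python
import copy
import random

TARGET_CLASSES = ("TextView", "ImageView")


def collect_views(node, found=None):
    """Walk the view hierarchy depth-first and return TextView/ImageView nodes."""
    if found is None:
        found = []
    if node.get("class", "").endswith(TARGET_CLASSES):
        found.append(node)
    for child in node.get("children") or []:
        collect_views(child, found)
    return found


def inject_overlap(image, bounds, shift):
    """Copy the pixels inside `bounds` and paste them `shift` rows lower."""
    left, top, right, bottom = bounds
    out = copy.deepcopy(image)  # leave the bug-free screenshot untouched
    for y in range(top, bottom):
        target = y + shift
        if 0 <= target < len(image):
            out[target][left:right] = image[y][left:right]
    return out


def generate_buggy(image, hierarchy, rng=random):
    """Pick a random TextView/ImageView and return (buggy_image, chosen_view)."""
    views = collect_views(hierarchy)
    if not views:
        return image, None
    view = rng.choice(views)
    _, top, _, bottom = view["bounds"]
    shift = max(1, (bottom - top) // 2)  # hypothetical rule: shift by half the height
    return inject_overlap(image, view["bounds"], shift), view
```

Other issue categories would need their own rules (for example, resizing the copy or replacing an image region), which this sketch does not attempt.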
2.2. CNN-based Issues Detection
As UI display issues can only be spotted via visual information, we adopt the convolutional neural network (CNN) (LeCun et al., 1998; Krizhevsky et al., 2012), which has proven effective in image classification and recognition in computer vision (Simonyan and Zisserman, 2015; Szegedy et al., 2016; He et al., 2016). Figure 1(b) shows the structure of our model, which links convolutional layers, batch normalization layers, pooling layers, and fully connected layers. Given an input screenshot, we convert it to a fixed width and height, since the convolutional layers' parameters consist of a set of learnable filters. After the convolutional layers, the screenshot is abstracted as a feature map. To improve the performance and stability of the CNN, we add Batch Normalization (BN) (Ioffe and Szegedy, 2015) layers after the convolutional layers, which standardize the layer inputs by adjusting and scaling the activations. After each BN layer, we apply the Rectified Linear Unit (ReLU) as the activation function of the network. The last several layers are fully connected (FC) layers, which combine the features extracted by the previous layers to form the final output. Finally, we obtain the detection result through softmax (Bishop, 2006).
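For reference, batch normalization and softmax have standard textbook forms, written out below. The notation is conventional and not taken from the original text: $\mu_B$ and $\sigma_B^2$ are the mini-batch mean and variance, $\gamma$ and $\beta$ are learnable scale and shift parameters, $\epsilon$ is a small constant for numerical stability, and $z_1, z_2$ are the two outputs of the last fully connected layer.

```latex
\begin{align}
\hat{x}_i &= \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}, &
y_i &= \gamma \hat{x}_i + \beta, \\
p(c \mid x) &= \frac{\exp(z_c)}{\sum_{j=1}^{2} \exp(z_j)}, & c &\in \{1, 2\}.
\end{align}
```

A screenshot is then assigned to the class with the larger probability $p(c \mid x)$.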
2.3. Grad CAM-based Issues Localization
As shown in Figure 1(c), we adopt a feature visualization method to localize the detailed position of the issues to remind the developers. We apply the Grad-CAM model for the localization of UI display issues. Gradient-weighted Class Activation Mapping (Grad-CAM) is a technique for visualizing the regions of the input that are "important" for the predictions of CNN-based models (Selvaraju et al., 2017). First, a screenshot with a UI display issue is fed into the trained CNN model, and the supervision signal for the category to which the image belongs is set to 1, while the rest are set to 0. Then this signal is propagated back to the convolutional feature maps of interest to obtain the Grad-CAM localization. By global average pooling of the gradients, we obtain a weight expressing the importance of each neuron. This weight captures the importance of the feature map for the target category (Bug). By performing a weighted combination of the forward activation maps, we obtain the class-discriminative localization map. Finally, point-wise multiplication with the backpropagation result yields the Grad-CAM output used as the issue localization result.
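The weighting step can be made concrete with a small sketch of the Grad-CAM formulation from Selvaraju et al. (2017): the weight of each feature map is the global average of the gradient of the target class score with respect to that map, and the localization map is the ReLU of the weighted sum of the maps. The sketch assumes the activations and gradients have already been extracted from the network and are given as plain nested lists; the function name and data layout are illustrative, not the tool's code.

```python
def grad_cam(activations, gradients):
    """Compute a Grad-CAM localization map.

    activations: list of K feature maps, each an H x W nested list.
    gradients:   same shape; gradients[k][i][j] is the gradient of the
                 target class score with respect to activations[k][i][j].
    """
    height, width = len(activations[0]), len(activations[0][0])
    heatmap = [[0.0] * width for _ in range(height)]
    for fmap, grad in zip(activations, gradients):
        cells = [g for row in grad for g in row]
        weight = sum(cells) / len(cells)  # global average pooling of the gradient
        for i in range(height):
            for j in range(width):
                heatmap[i][j] += weight * fmap[i][j]
    # ReLU keeps only regions that push the score of the target class up
    return [[max(0.0, v) for v in row] for row in heatmap]
```

In practice the resulting map is upsampled to the screenshot size and overlaid as a heat map; that step is omitted here.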
2.4. Online Inference of GUI Issues
We use 20,000 screenshots generated in section 2.1 to train our issue detection and localization model. Before the issue detection, we need to preprocess the APK submitted by the user online. As shown in Figure 1(d), the user provides an Android APK, and we use the dynamic analysis method to run the app automatically to obtain the screenshots. In detail, by leveraging the idea of dynamic app GUI testing (Li et al., 2017; Cai et al., 2020; Developers, 2012; Su et al., 2017), we adopt an app explorer (Li et al., 2017) to automatically explore the pages within an application through interacting with apps using random actions, e.g., clicking, scrolling, and filling in text. We also provide three testing strategies for users to choose from: Depth-First-Search (DFS) (Shwail et al., 2013), Breadth-First-Search (BFS) (Beamer et al., 2012), and random exploration (Developers, 2012).
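The three exploration strategies differ only in which discovered page is visited next. The sketch below illustrates this on a toy page-transition graph; it is an abstraction under stated assumptions, not the explorer used by the tool. A real explorer discovers transitions by interacting with a running app, whereas here the graph (a dict from page name to reachable pages) and the `explore` function are hypothetical.

```python
import random
from collections import deque


def explore(graph, start, strategy="bfs", max_pages=None, rng=random):
    """Return the pages of `graph` in the order the chosen strategy visits them."""
    frontier = deque([start])
    seen = {start}
    visited = []
    while frontier and (max_pages is None or len(visited) < max_pages):
        if strategy == "bfs":
            page = frontier.popleft()  # oldest discovered page first
        elif strategy == "dfs":
            page = frontier.pop()  # most recently discovered page first
        elif strategy == "random":
            index = rng.randrange(len(frontier))
            page = frontier[index]
            del frontier[index]
        else:
            raise ValueError(f"unknown strategy: {strategy}")
        visited.append(page)  # a screenshot would be captured here
        for nxt in graph.get(page, []):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return visited
```

The `max_pages` limit stands in for the exploration budget that a user could configure.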
3. Tool Implementation And Usage
OwlEyes-Online is a web app, which provides a convenient tool for users to detect and localize the UI display issues in the GUI screenshots.
3.1. Web Implementation
OwlEyes-Online can automatically run applications and generate test reports for users. We implemented the deep learning model in PyTorch. The web app consists of two parts: running the application automatically, and feeding back the test results in real time.
Running the app automatically: Figure 2(a) shows an example of the page for running an app automatically. Users can upload the APK or provide its download link. In addition, users can customize the exploration strategy, select the appropriate device, and adjust some personalization settings for a friendlier interactive experience.
Real-time feedback of issue detection results: The page in Figure 2(b) gives real-time feedback on test results while the application is running automatically. On this page, we implement several functions to provide a friendlier interactive experience, including:
Click to view the localization details: In Figure 2(c), clicking the screenshot of a UI display issue shows the localization of the issue in the screenshot (in the form of a heat map).
Export test report: In Figure 2(d), users fill in their e-mail information, and we automatically generate the test report and send it to them. The test report includes the number of issues found in the application, the screenshots of the issues, and the XML corresponding to those screenshots.
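The report export step can be sketched with Python's standard `email` package. This is an illustration only: the `detections` schema (dicts with `screenshot` and `has_issue` keys), the subject line, and the `build_report` function are assumptions, and the sketch only composes the message without attaching files or sending anything.

```python
from email.message import EmailMessage


def build_report(app_name, detections, recipient):
    """Compose a plain-text test report e-mail from per-screenshot detection results."""
    issues = [d for d in detections if d["has_issue"]]
    lines = [
        f"App: {app_name}",
        f"Screenshots analyzed: {len(detections)}",
        f"UI display issues found: {len(issues)}",
        "",
    ]
    lines += [f"- {d['screenshot']}" for d in issues]  # list only the issue screenshots
    message = EmailMessage()
    message["To"] = recipient
    message["Subject"] = f"OwlEyes-Online test report for {app_name}"
    message.set_content("\n".join(lines))
    return message
```

A full report would additionally attach the issue screenshots, their localization heat maps, and the corresponding XML files.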
3.2. Model Implementation
Our CNN model is composed of 12 convolutional layers with batch normalization, 6 pooling layers, and 4 fully connected layers for classifying UI screenshots with display issues. The size of the convolutional kernel in each convolutional layer is 3 × 3. We set the number of convolutional kernels to 16 for convolutional layers 1-4, 32 for layers 5-6, 64 for layers 7-8, and 128 for layers 9-12. For the pooling layers, we use the most commonly used max-pooling settings (Simard et al., 2003), i.e., pooling units of size 2 × 2 applied with a stride (Simonyan and Zisserman, 2015). We set the number of neurons in the fully connected layers to 4096, 1024, 128, and 2, respectively. For data preprocessing, we rotate horizontal screens to vertical and resize the screens to 768 × 448. We implement our model with the PyTorch (3) framework. The model is trained on an NVIDIA GeForce RTX 2060 GPU (16G memory) for 100 epochs, which takes about 8 hours.
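As a sanity check on these dimensions, the following sketch computes how many features would reach the first fully connected layer. It rests on assumptions the text does not state: each 3 × 3 convolution uses stride 1 and padding 1 (so it preserves spatial size), and each 2 × 2 pooling layer uses stride 2 (so it halves each dimension). Under those assumptions the placement of the pooling layers among the convolutions does not affect the result.

```python
CONV_CHANNELS = [16] * 4 + [32] * 2 + [64] * 2 + [128] * 4  # 12 convolutional layers
NUM_POOL_LAYERS = 6
FC_SIZES = [4096, 1024, 128, 2]


def conv_out(size, kernel=3, stride=1, padding=1):
    """Output size of a convolution along one dimension (assumed stride/padding)."""
    return (size + 2 * padding - kernel) // stride + 1


def pool_out(size, kernel=2, stride=2):
    """Output size of max pooling along one dimension (assumed stride 2)."""
    return (size - kernel) // stride + 1


def flattened_features(height, width):
    """Number of values fed to the first fully connected layer."""
    for _ in CONV_CHANNELS:
        height, width = conv_out(height), conv_out(width)
    for _ in range(NUM_POOL_LAYERS):
        height, width = pool_out(height), pool_out(width)
    return CONV_CHANNELS[-1] * height * width


print(flattened_features(768, 448))  # 128 channels * 12 * 7 = 10752
```

With a 768 × 448 input, six halvings give a 12 × 7 map per channel, so the first fully connected layer would map 10,752 inputs to 4,096 neurons under these assumptions.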
3.3. Usage Scenarios
We present several examples to illustrate how developers would interact with OwlEyes-Online. In some cases, developers collect a large number of application screenshots (e.g., from crowdtesting platforms or automated testing). However, common automated tools can only spot critical crash bugs, not UI display issues. Developers can upload application screenshots to OwlEyes-Online directly, and OwlEyes-Online will analyze the screenshots and detect the UI display issues in them.
To test whether UI display issues exist in an application, developers can directly upload an APK to OwlEyes-Online, which will automatically explore the application and detect UI display issues. Developers can also customize the exploration method and duration and submit their e-mail information. After the issue detection, OwlEyes-Online will automatically generate the issue report and send it to the developer's e-mail. To avoid network delay, developers can also provide an application's download link, and OwlEyes-Online will automatically download the APK in the background for testing.
The goal of our study is to evaluate the usefulness of our platform OwlEyes-Online in terms of (i) its effectiveness in detecting and localizing UI display issues, and (ii) its usability.
4.1. Effectiveness Measurement
To evaluate the effectiveness of OwlEyes-Online for UI display issue detection, we conduct experiments on 8K Android mobile GUI screenshots collected by one of the largest crowd-testing platforms (4). This part is also published in our previous work (Zhe et al., 2020).
Table 1 shows the performance comparison with the baselines. With OwlEyes-Online, the precision is 0.85 and the recall is 0.84. We can see that OwlEyes-Online is much better than the baselines, i.e., 58% higher in precision and 17% higher in recall compared with the best baseline, Multilayer Perceptron (MLP). This further indicates the effectiveness of OwlEyes-Online. Besides, it also implies that OwlEyes-Online is especially good at hunting for the buggy screenshots from candidate ones, i.e., significant improvement in recall.
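For readers less familiar with the two metrics, the sketch below shows how precision and recall are computed from per-screenshot predictions, treating "has a UI display issue" as the positive class. The function name and the sample labels are illustrative and unrelated to the reported results.

```python
def precision_recall(predicted, actual):
    """Precision and recall for binary labels (truthy = screenshot has an issue)."""
    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if p and a)  # issues correctly flagged
    fp = sum(1 for p, a in pairs if p and not a)  # clean screenshots flagged
    fn = sum(1 for p, a in pairs if not p and a)  # issues missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```

Precision answers how many flagged screenshots are truly buggy, while recall answers how many buggy screenshots were found.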
4.2. Usefulness Measurement
To further assess the usefulness of our approach, we randomly sample 2,000 Android applications from F-Droid (1) and 1,000 applications from Google Play (2). Note that none of these apps appears in our training dataset. Among the 3,000 collected applications, 59% (1,756/3,000) can be run successfully with OwlEyes-Online. For these 1,756 applications, an average of 8 screenshots is obtained per application. We then feed those screenshots to OwlEyes-Online to detect whether there are any display issues. Once a display issue is spotted, we create a bug report describing the issue, with the buggy UI screenshot attached. Finally, we report the issues to the app development teams through issue reports or emails. OwlEyes-Online has detected 113 UI display issues, among which 35 have been confirmed and 29 have been fixed. These fixed or confirmed bug reports further demonstrate the effectiveness and usefulness of our proposed approach in detecting UI display issues.
Regarding the user experience of OwlEyes-Online, we conduct an online survey with 20 professional developers, testers, and researchers, all of whom majored in computer science and have more than 3 years of app testing or development experience. 10 of them are from industry with practical working experience². We ask them to use OwlEyes-Online and then ask about its usefulness for their work, as well as its potential and scalability in the future. In the end, participants fill in the System Usability Scale (SUS) questionnaire (Brooke and others, 1996), which uses a 5-point Likert scale (Oda et al., 2015) from 1 (strongly disagree) to 5 (strongly agree). The questionnaire also asks participants to select the OwlEyes-Online system features that they deem most useful or least useful for the tasks.
² Some testers are from NVIDIA, Citibank, Sony, Baidu, Alibaba, Three Fast Online, and ByteDance.
Figure 3 summarizes the participants' ratings of the 10 system design and usability questions in the System Usability Scale questionnaire. The upper half of Figure 3 shows that participants agree or strongly agree that our system is easy to use and that the features of the OwlEyes-Online system are well devised. The lower half of Figure 3 further confirms the simplicity and consistency of the OwlEyes-Online system. Furthermore, the average helpfulness rating of the OwlEyes-Online system for the tasks is 4.42, which indicates that participants appreciate its help in the tasks. All participants indicated that OwlEyes-Online detects UI display issues well. Among these professionals, 10 work on app testing; they think OwlEyes-Online can help them localize UI display issues more quickly. 7 Android developers said that our issue localization model helps them better localize the issue on the UI so that they can repair it more easily later. Among them, 4 developers hope we can go further and suggest possible repair methods and causes of these issues. The other 3 participants, who study GUI testing, also indicated that they hope we can analyze the causes of issues in the next stage. They consider using the visual information of application screenshots to be very helpful and engaging work.
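The paper reports per-question ratings rather than an aggregate score, but for context, the conventional SUS scoring rule from Brooke's questionnaire can be sketched as follows. Odd-numbered items are positively worded and contribute (rating - 1); even-numbered items are negatively worded and contribute (5 - rating); the sum is multiplied by 2.5 to give a score between 0 and 100. The function name is illustrative.

```python
def sus_score(ratings):
    """Standard SUS score (0-100) from ten ratings on a 1-5 scale."""
    if len(ratings) != 10:
        raise ValueError("SUS expects exactly 10 ratings")
    total = 0
    for item, rating in enumerate(ratings, start=1):
        # odd items are positively worded, even items negatively worded
        total += (rating - 1) if item % 2 == 1 else (5 - rating)
    return total * 2.5
```

For example, a participant who answers 3 (neutral) to every question would receive a score of 50.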
Improving the quality of mobile applications, especially in a proactive way, is of great value and always encouraged. In this demo, we present OwlEyes-Online, a fully automated UI display issue detection and localization tool. We use dynamic analysis to explore the application automatically and capture its screenshots; users can customize the exploration time, the exploration strategy, and so on. We then detect and localize UI display issues with a CNN and Grad-CAM. Finally, we automatically generate test reports and send them to users. We evaluate the tool in terms of detection accuracy and practicality. OwlEyes-Online has proven effective in real-world practice, i.e., 64 previously undetected UI display issues from popular Android apps have been confirmed or fixed. It also achieves boosts of more than 17% and 23% in recall and precision compared with the best baseline. The evaluation shows that OwlEyes-Online is a good starting point for UI display issue detection.
In the future, we will further study the root causes of UI display issues. Then, according to the issue category, we will devise a set of tools that recommend patches to developers for fixing the UI display issues.
This work is supported by the National Key Research and Development Program of China under Grant No. 2018YFB1403400 and the National Natural Science Foundation of China under Grants No. 62072442 and No. 62002348.
- (1) F-Droid (2021). http://f-droid.org/
- (2) Google Play (2021). http://play.google.com/store/apps/
- (3) PyTorch (2021). https://pytorch.org/
- (4) Baidu crowdsourcing test platform (2021). Baidu (baidu.com) is the largest Chinese search service provider; its crowdsourcing test platform (test.baidu.com) is also one of the largest in China. http://test.baidu.com
- Beamer et al. (2012). Direction-optimizing breadth-first search. In SC'12: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, pp. 1–10.
- Bishop (2006). Pattern recognition and machine learning. Springer.
- Brooke and others (1996). SUS-a quick and dirty usability scale. Usability evaluation in industry 189 (194), pp. 4–7.
- Cai et al. (2020). Fastbot: a multi-agent model-based test generation system. In Proceedings of the IEEE/ACM 1st International Conference on Automation of Software Test, pp. 93–96.
- Deka et al. (2017). Rico: a mobile app dataset for building data-driven design applications. In Proceedings of the 30th Annual Symposium on User Interface Software and Technology, UIST '17.
- Developers (2012). UI/Application Exerciser Monkey.
- (11) GitHub link (2021). https://github.com/franklinbill/owleyes/
- He et al. (2016). Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Los Alamitos, CA, USA, pp. 770–778.
- Ioffe and Szegedy (2015). Batch normalization: accelerating deep network training by reducing internal covariate shift. In Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, Lille, France, 6-11 July 2015, JMLR Workshop and Conference Proceedings, Vol. 37, pp. 448–456.
- Krizhevsky et al. (2012). ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 1097–1105.
- LeCun et al. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE 86 (11), pp. 2278–2324.
- Li et al. (2017). DroidBot: a lightweight UI-guided test input generator for Android. In 2017 IEEE/ACM 39th International Conference on Software Engineering Companion (ICSE-C), pp. 23–26.
- Machiry et al. (2013). Dynodroid: an input generation system for Android apps. In Proceedings of the 2013 9th Joint Meeting on Foundations of Software Engineering, ESEC/FSE 2013, New York, NY, USA, pp. 224–234.
- Oda et al. (2015). Learning to generate pseudo-code from source code using statistical machine translation (T). In 2015 30th IEEE/ACM International Conference on Automated Software Engineering (ASE), pp. 574–584.
- Selvaraju et al. (2017). Grad-CAM: visual explanations from deep networks via gradient-based localization. In The IEEE International Conference on Computer Vision (ICCV).
- Shwail et al. (2013). Probabilistic multi robot path planning in dynamic environments: a comparison between A* and DFS. International Journal of Computer Applications 975, pp. 8887.
- Simard et al. (2003). Best practices for convolutional neural networks applied to visual document analysis. In Proceedings of the Seventh International Conference on Document Analysis and Recognition - Volume 2, ICDAR '03, USA, pp. 958.
- Simonyan and Zisserman (2015). Very deep convolutional networks for large-scale image recognition. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, Y. Bengio and Y. LeCun (Eds.).
- Su et al. (2017). Guided, stochastic model-based GUI testing of Android apps. In Proceedings of the 2017 11th Joint Meeting on Foundations of Software Engineering, pp. 245–256.
- Szegedy et al. (2016). Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2818–2826.
- Wei et al. (2016). Taming Android fragmentation: characterizing and detecting compatibility issues for Android apps. In Proceedings of the 31st IEEE/ACM International Conference on Automated Software Engineering, pp. 226–237.
- Wetzlmaier and Ramler (2017). Hybrid monkey testing: enhancing automated GUI tests with random test generation. In Proceedings of the 8th ACM SIGSOFT International Workshop on Automated Software Testing, A-TEST 2017, New York, NY, USA, pp. 5–10.
- Zhe et al. (2020). Owl eyes: spotting UI display issues via visual understanding. In 2020 35th IEEE/ACM International Conference on Automated Software Engineering (ASE).