AppsPred: Predicting Context-Aware Smartphone Apps using Random Forest Learning

Due to the popularity of context-awareness in the Internet of Things (IoT) and the recent advanced features of the most popular IoT device, i.e., the smartphone, modeling and predicting personalized usage behavior based on relevant contexts can be highly useful in assisting users to carry out their daily routines and activities. Usage patterns of different categories of smartphone apps, such as social networking, communication, entertainment, or daily life services related apps, usually vary greatly between individuals. People use these apps differently in different contexts, such as temporal context, spatial context, individual mood and preference, work status, Internet connectivity like WiFi status, or device related status like phone profile, battery level, etc. Thus, we consider individuals' apps usage as a multi-class context-aware problem for personalized modeling and prediction. Random forest learning is one of the most popular machine learning techniques for building a multi-class prediction model. Therefore, in this paper, we present an effective context-aware smartphone apps prediction model, named "AppsPred", using the random forest machine learning technique, which takes into account an optimal number of trees based on such multi-dimensional contexts to build the resultant forest. The effectiveness of this model is examined by conducting experiments on smartphone apps usage datasets collected from individual users. The experimental results show that our AppsPred significantly outperforms other popular machine learning classification approaches like ZeroR, Naive Bayes, Decision Tree, Support Vector Machines, and Logistic Regression while predicting smartphone apps in various context-aware test cases.

1 Introduction

Nowadays, smartphones are considered one of the most popular IoT devices and have become an essential part of our everyday life. In the real world, users’ interest in “Mobile Phones” has grown more than their interest in other platforms like “Desktop Computer” or “Tablet Computer” over time Sarker (2018b). According to the ITU (International Telecommunication Union), cellular network coverage has reached 96.8% of the world population, and this number even reaches 100% of the population in developed countries like the USA, Australia, Canada, UK, etc. Union (2015). People use smartphones not only for voice communication between individuals but also for a variety of applications, apps in short, for different purposes like social networking, instant messaging, location tracking and transportation management, online shopping, medical appointment or eHealth services, sports and entertainment, IoT services, or real life emergency services, etc. Usage patterns of such categories of apps usually vary greatly between individuals in the real world. Individual users may behave differently in the different contexts in which that usage occurs, such as temporal context including their work status like workday or holiday, spatial context that represents the user’s particular location, e.g., office, their emotional state or mood, e.g., happy, Internet connectivity like WiFi status, or device related status like phone profile or battery level. Thus, it is important to study such contextual data in order to build an effective context-aware apps prediction model.

Developing personalized context-aware methods to model different categories of smartphone apps, particularly Gmail, Microsoft Outlook, Facebook, LinkedIn, Twitter, Youtube, Whatsapp, Skype, eHealth, Uber, Browser, Google Maps, etc., utilizing contextual data is the key. Thus, we consider this issue as a multi-class context-aware prediction problem, where each individual app is represented as a particular usage class. An effective machine learning based context-aware model that analyzes individuals’ usage patterns in the multi-dimensional contexts mentioned above, utilizing smartphone data, can eventually predict future usage according to their current contexts Sarker (2018c). Such a context-aware model can be used for building various data-driven intelligent systems, such as an intelligent mobile recommendation system, context-aware smart apps management system, context-aware smart app searching, intelligent app notification management system, etc., that intelligently assist the end mobile phone users in their daily activities Sarker (2018a). Therefore, in order to achieve our goal, in this paper, we mainly focus on modeling and predicting personalized smartphone apps usage based on relevant multi-dimensional contexts related to the corresponding users’ preferences and their own devices’ characteristics.

Let’s consider a real-world motivational example. Say Alice, a smartphone user, is a postgraduate research student. She has installed a large number of mobile applications on her smartphone. Homescreens of smartphones make apps easy to find without additional effort in searching, which is useful to the end mobile phone users in their various day-to-day situations. However, the homescreen of her smartphone is unaware of her current contexts. As a result, the phone is unable to manage the useful apps intelligently according to her needs, as her current contexts, e.g., location, are not static and may change over time. An effective context-aware app usage model may predict her future usage based on her current contexts, allowing the particular app she currently needs to be easily accessible from the mobile homescreen. Such a personalized model could be used to build a smart mobile app management system that can predict her future usage according to her current contextual information and intelligently assist her in using different categories of smartphone apps according to her needs.

In the area of contextual smartphone data analytics, association learning Agrawal et al. (1994) and classification learning Quinlan (1993) are the most common and popular techniques to build a data-driven prediction model. However, an association learning technique, e.g., Apriori Agrawal et al. (1994), produces a large number of redundant rules, which makes the context-aware prediction model more complex and ineffective Fournier-Viger and Tseng (2012) Sarker and Salim (2018). Thus, in this paper, we focus on classification techniques that can play an important role in building an effective context-aware prediction model for individual mobile phone users utilizing their smartphone apps usage data. In the area of machine learning and predictive analytics, ZeroR, Naive Bayes, Decision Tree, Support Vector Machines, Logistic Regression, and Random Forest are the most popular classification algorithms that can be used to build data-driven context-aware models Sarker et al. (2019c) Han et al. (2011). Among these techniques, a tree based context-aware model is more effective at intelligently predicting mobile user activity in different contexts Sarker et al. (2019c). In particular, a number of researchers Hong et al. (2009) Lee (2007) Zulkernain et al. (2010) Sarker (2019) have used decision tree classifiers to model mobile phone users’ behavior. Since we treat our apps prediction model as a multi-class problem that includes a variety of usage classes in a number of multi-dimensional contexts, a single decision tree may cause an over-fitting problem while selecting the root node based on contexts. As a result, it may decrease the prediction accuracy of the resultant context-aware model. Thus, the research question is -

How to build an effective context-aware smartphone apps usage prediction model for personalized services?

In this paper, we present a random forest machine learning based context-aware smartphone apps prediction model, “AppsPred”, that takes into account a number of trees rather than a single decision tree. In our model, we first extract the contextual features from the training dataset and prepare the contexts to fit the machine learning techniques. Once the contexts have been processed, we construct a random forest on the processed training dataset to achieve our goal. The reason for constructing a random forest is that it averages the output of several separate learners, such as single decision trees, thereby reducing the variance in predicting an individual’s usage. However, different numbers of trees may give different prediction results in a random forest based model. Thus, in order to build an effective context-aware model, we take into account an optimal number of trees that gives higher accuracy while predicting different categories of smartphone apps in different context-aware test cases. The effectiveness of this model is examined by considering real mobile phone datasets consisting of individuals’ various app usage and the corresponding contextual information.

The contributions of this work can be summarized as follows.

  • We first highlight the significance of personalized smartphone apps usage prediction modeling based on machine learning techniques. In our model, we take into account different categories of apps usages like social networking, communication, entertainment and so on in different multi-dimensional contexts related to the corresponding users’ day-to-day situations and preferences, and their own devices’ characteristics.

  • We have collected contextual apps usage datasets from individual smartphone users and present a data-driven context-aware smartphone apps prediction model, “AppsPred”, using random forest learning that takes into account an optimal number of trees based on relevant multi-dimensional contexts to make the model effective.

  • Finally, we conduct experiments on the real-world collected datasets and evaluate the effectiveness of our AppsPred model for various context-aware test cases. The experimental results show that our AppsPred significantly outperforms other popular machine learning classification approaches.

The rest of the paper is organized as follows. Section 2 provides background and related work. In Section 3, we define and formulate the problem addressed in this paper. In Section 4, we present our context-aware smartphone apps usage prediction model using random forest machine learning. We report the experimental results in Section 5. We summarize a number of key points in Section 6, and finally Section 7 concludes this paper and highlights future work.

2 Background and Related Work

In the area of contextual smartphone data analytics, association learning and classification learning are the most common and popular machine learning techniques to build a prediction model. Association learning is the discovery of rules or patterns among a set of available items in a given dataset. An association learning technique, e.g., Apriori, is well defined in terms of reliability and flexibility, as it has its own parameters: the support and the confidence Agrawal et al. (1994). It discovers association rules that satisfy the predefined minimum support and confidence constraints from a given dataset. The support of a rule represents the percentage of records in the dataset that carry all the items or contexts in the rule, and the confidence represents the percentage of records that carry all the items or contexts in the rule among those records that carry the items in the antecedent of the rule. A number of researchers Mehrotra et al. (2016); Srinivasan et al. (2014); Zhu et al. (2014); Sarker et al. (2019a) have used association learning techniques to mine rules capturing mobile phone users’ behavior. However, it is well known that association learning produces a huge number of redundant rules Fournier-Viger and Tseng (2012), which makes the rule-set unnecessarily large. In Sarker and Salim (2018), Sarker et al. have shown that the traditional association learning technique produces unnecessary rules when applied to contextual smartphone data. Thus, it is very difficult for the decision making agents to determine the most interesting rules, which consequently makes the decision making process ineffective and more complex Bouker et al. (2012).
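The support and confidence definitions above can be made concrete with a small sketch. The records, context names, and the rule below are made up for illustration and are not taken from the paper's datasets; the helper names (`matches`, `support`, `confidence`) are hypothetical as well.

```python
from typing import Dict, List

Record = Dict[str, str]


def matches(record: Record, items: Record) -> bool:
    """True if the record carries every (name, value) pair in `items`."""
    return all(record.get(name) == value for name, value in items.items())


def support(records: List[Record], antecedent: Record, consequent: Record) -> float:
    """Fraction of all records that carry every item of the rule."""
    rule_items = {**antecedent, **consequent}
    return sum(matches(r, rule_items) for r in records) / len(records)


def confidence(records: List[Record], antecedent: Record, consequent: Record) -> float:
    """Among records carrying the antecedent, the fraction carrying the whole rule."""
    with_antecedent = sum(matches(r, antecedent) for r in records)
    if with_antecedent == 0:
        return 0.0
    with_rule = sum(matches(r, {**antecedent, **consequent}) for r in records)
    return with_rule / with_antecedent


# Illustrative (made-up) records and rule: {location=home, mood=happy} => {app=Music}
records = [
    {"location": "home", "mood": "happy", "app": "Music"},
    {"location": "home", "mood": "happy", "app": "Music"},
    {"location": "home", "mood": "sad", "app": "Whatsapp"},
    {"location": "office", "mood": "happy", "app": "Gmail"},
]
rule_if = {"location": "home", "mood": "happy"}
rule_then = {"app": "Music"}

rule_support = support(records, rule_if, rule_then)
rule_confidence = confidence(records, rule_if, rule_then)
```

In this toy data the rule holds in two of four records, and in both records that match the antecedent.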

Classification is another method that is frequently used in the area of machine learning and data science for solving prediction problems. In general, classification is defined as a learning method that maps or classifies a data instance into the corresponding class labels that are predefined in the dataset. According to Han et al. (2011), data classification is a two-step process. The first is the learning step, where a classification model is constructed from a given dataset; the data from which a classification function or model is learned is known as the training set. The second is the classification step, where the model is used to test or predict the class labels for separate unseen data; the dataset that is used to test the classifying ability of the learned model or function is known as the testing set. Several popular classification algorithms, such as ZeroR, Naive Bayes, Support Vector Machines, K-Nearest Neighbors, Logistic Regression, Artificial Neural Network, Decision Tree, and Random Forest, have been proposed to build prediction models Han et al. (2011).

Among these classification techniques, a tree based context-aware model is more effective at predicting mobile user activity in different contexts Sarker et al. (2019c). A very well-known and widely discussed tree based technique for prediction is the decision tree Quinlan (1986). The core algorithm for building decision trees, called ID3, was proposed by J. R. Quinlan Quinlan (1986). The ID3 algorithm constructs a decision tree with a top-down approach in which a greedy search through the given training dataset is used to test each attribute or context at every node. It calculates the entropy and the information gain, a statistical property that is used to select which attribute to test at each node in the tree Quinlan (1986). Based on the ID3 algorithm, Quinlan proposed a modified algorithm, namely C4.5 Quinlan (1993), which builds decision trees from a training dataset in a similar way to ID3, using the concept of information gain. In particular, a number of researchers Hong et al. (2009) Lee (2007) Zulkernain et al. (2010) Sarker (2019) Sarker et al. (2017a) have used decision tree classifiers to model mobile phone users’ behavior. Since we treat our context-aware model as a multi-class problem that includes a variety of usage classes in a number of multi-dimensional contexts, a single decision tree based model may not be effective while predicting various categories of smartphone apps in different contexts. The reason is that a single decision tree may cause an over-fitting problem while selecting the root node based on contexts, and consequently it may decrease the prediction accuracy of the resultant context-aware apps usage model Sarker et al. (2019b).
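As a rough illustration of the attribute selection step described above, the following sketch computes entropy and information gain for categorical contexts. It is a minimal version of the textbook ID3 quantities, not the authors' code; the tiny dataset and context names are invented for the example.

```python
import math
from collections import Counter
from typing import Dict, List


def entropy(labels: List[str]) -> float:
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum(
        (count / total) * math.log2(count / total)
        for count in Counter(labels).values()
    )


def information_gain(rows: List[Dict[str, str]], labels: List[str], attribute: str) -> float:
    """Entropy reduction obtained by partitioning the rows on one attribute."""
    total = len(labels)
    partitions: Dict[str, List[str]] = {}
    for row, label in zip(rows, labels):
        partitions.setdefault(row[attribute], []).append(label)
    remainder = sum(len(part) / total * entropy(part) for part in partitions.values())
    return entropy(labels) - remainder


# Made-up toy data: WiFi status separates the two apps perfectly, mood does not.
rows = [
    {"wifi": "on", "mood": "happy"},
    {"wifi": "on", "mood": "sad"},
    {"wifi": "off", "mood": "happy"},
    {"wifi": "off", "mood": "sad"},
]
labels = ["Youtube", "Youtube", "Gmail", "Gmail"]

gain_wifi = information_gain(rows, labels, "wifi")
gain_mood = information_gain(rows, labels, "mood")
```

A greedy tree builder in the ID3 style would test `wifi` at the root here, since it yields the larger gain.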

Unlike the above approaches and context-aware models, in this work, we present a data-driven context-aware smartphone apps prediction model “AppsPred” using random forest machine learning technique that takes into account an optimal number of trees rather than a single decision tree based on relevant multi-dimensional contexts in order to make the model effective.

3 Definitions and Problem Statement

This section introduces main notions concerning individual’s smartphone apps usage based on multi-dimensional contexts. In the following, the notion of smartphone apps usage dataset with relevant contexts, is formally stated.

Definition 1. Let $Con = \{con_1, con_2, \ldots, con_n\}$ be a set of contexts and $Dom = \{dom_1, dom_2, \ldots, dom_n\}$ the set of corresponding domains. A contextual smartphone apps usage dataset $DS$ is a collection of records, where -

  1. each record $r$ is a set of pairs $(con_i, v_i)$, where $con_i \in Con$ and $v_i \in dom_i$. For example, if $con_i$ represents the context ‘user mood’, then an example of $v_i$ is ‘happy’.

  2. each context $con_i$, also called an attribute or contextual feature, may occur at most once in any record, and

  3. each record has a particular app usage of an individual user, e.g., using Microsoft Outlook.

Definition 2. Let $A = \{a_1, a_2, \ldots, a_m\}$ be a set of smartphone applications that are used by an individual user $u$. Each app $a_j \in A$ represents a particular usage class for that user, which is taken into account in our multi-class context-aware problem.
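One possible in-memory representation of Definitions 1 and 2 is sketched below. The class name `UsageRecord`, the context names, and the example values are hypothetical; they only illustrate that each record pairs a set of context values with one app usage class.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class UsageRecord:
    """One dataset record: context/value pairs plus the app that was used."""
    contexts: Dict[str, str]  # dict keys are unique, so a context occurs at most once
    app: str                  # the usage class label for this record


# Hypothetical example records for a single user (values are illustrative only).
dataset: List[UsageRecord] = [
    UsageRecord({"hour": "09", "holiday": "no", "location": "workplace",
                 "mood": "normal", "battery": "full", "wifi": "on"},
                "Microsoft Outlook"),
    UsageRecord({"hour": "21", "holiday": "no", "location": "home",
                 "mood": "happy", "battery": "medium", "wifi": "on"},
                "Youtube"),
    UsageRecord({"hour": "13", "holiday": "yes", "location": "canteen",
                 "mood": "sad", "battery": "low", "wifi": "off"},
                "Whatsapp"),
]

# The set of usage classes for this user is the set of distinct apps in the records.
usage_classes = sorted({record.app for record in dataset})
```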

In this work, individuals’ different categories of smartphones’ apps such as social networking, communication, entertainment, or other daily life services related apps are taken into account in order to build an effective context-aware apps usage prediction model using machine learning techniques.

Definition 3. Let $Con = \{con_1, con_2, \ldots, con_n\}$ be a set of contexts having influence on individuals to use the different categories of smartphone apps mentioned above, according to their day-to-day situations in the real world, and $Dom = \{dom_1, dom_2, \ldots, dom_n\}$ the set of corresponding usage domains of an individual user $u$. Each context $con_i$ and corresponding value $v_i \in dom_i$ can be used as a part of the multi-dimensional contexts in our study, which can play a role in building a context-aware model according to its relevancy in apps usage.

Different contexts might have an influence on individuals in their daily life apps usages. For instance, an individual’s apps usage behavior in her ‘happy’ emotional state may be well different from her behavior when she is in a ‘sad’ emotional state, which represents an example of user mood context. Similarly, other relevant contexts may also have the influence on their usages. As such, in this work, we take into account various types of relevant contexts, such as temporal context, spatial context, individual’s mood and preference, work status, Internet connectivity, or device characteristics or status that might have influence on individuals’ usage.

Problem Statement. With the above definitions, the main problem we are addressing in this paper is stated as follows:

Given a smartphone dataset containing different categories of apps usage history and the corresponding contextual information of an individual mobile phone user, our goal is to build an effective personalized context-aware apps prediction model based on the relevant multi-dimensional contexts related to individuals’ day-to-day situations and preferences, and their devices’ characteristics or status, in order to predict individuals’ future usage for unseen test cases. In this paper, we present a data-driven context-aware apps prediction model, “AppsPred”, using the random forest machine learning technique for solving this problem.

4 Materials and Methods

In this section, we present our contextual smartphone apps usage datasets and the methodology for modeling personalized context-aware apps usage using machine learning techniques.

4.1 Contextual Data Collection and Description

In general, a context is defined as anything that can be used to characterize the situation of an entity Dey (2001). In this work, we take into account a number of contexts that have an influence on individuals to make a decision for using various categories of smartphone apps in their real-world life. These are:

Temporal context: It represents time related information. This is one of the primary contexts having influence on the smartphone usage of an individual user Sarker et al. (2018). For instance, the smartphone apps usage of an individual in the morning might not be similar to her usage at night. Moreover, one’s usage behavior may differ in different time periods or hours in the real world. Thus, we consider individuals’ usage in each hour of a day while collecting the dataset.

Work status: In general, an individual’s work status in the real world depends on the day status, like work day or holiday. Work status heavily impacts apps usage for many individuals. For instance, one’s apps usage behavior on Saturday, say a holiday, might not be similar to her usage on other work days.

Spatial context: It represents users’ spatial information like location, which can be treated as another significant context for modeling and predicting an individual’s smartphone apps usage. The reason is that the phone usage of an individual can also be treated as a location based service Sarker et al. (2017b). Thus, understanding user mobility through a corresponding context-aware model makes it possible to provide location based services for the benefit of individual users.

User mood: Typically, user mood represents the emotional state of a human, which is particularly important in sentiment and emotion analysis. As the emotional state of a human being is not static in the real world and may change over time, it could be another significant context that impacts individuals and helps to model personalized apps usage behavior. For instance, an individual may typically like to listen to songs when she is in a happy mood, while she likes online messaging when she is in a sad mood.

Device status: In addition to the above contexts related to users’ day-to-day situations and preferences, an individual’s device related contextual information like phone profile, phone battery level or charging status might have an influence on individuals’ use of smartphone apps. For instance, if the phone battery of an individual’s device gives a low power signal, she is typically not interested in connecting to the Internet to use an entertainment app.

Internet connectivity: This also represents a device related context, one that connects the device with the world. As such, Internet connectivity and speed might have an impact on individuals’ smartphone usage. For instance, an individual may like to play video songs if WiFi (wireless fidelity, which mainly refers to certain kinds of wireless local area networks) is available, and otherwise not.

| Contexts | Type | Example values |
|---|---|---|
| Temporal | Continuous | Time [24-hours-a-day]; Day [7-days-a-week] |
| Work status | Categorical (binary) | Holiday [yes, no] |
| Spatial | Categorical | User location [home, workplace, canteen, playground, on the way, etc.] |
| User mood | Categorical | Emotional state [happy, sad, normal] |
| Device status | Categorical | Battery level [full, medium, low] |
| Phone profile | Categorical | Notification [general, silent, vibration] |
| Internet connectivity | Categorical (binary) | WiFi status [on, off] |

Table 1: An overview of contexts in our context-aware apps usage model

Table 1 gives a detailed picture of the contexts that are used in our context-aware model to predict personalized smartphone apps. We have collected smartphone apps usage datasets that include this contextual information from different individual users. All the participants in our study were university students who have their own smartphones. Data was collected from these participants from June 2018 to October 2018. The students created a web interface for collecting these synthetic data from different users. The multi-dimensional contextual information discussed above and its interrelated patterns or relationships are of high interest to be discovered from the collected data, for the purpose of building a data-driven context-aware smartphone apps usage prediction model.

4.2 Preprocessing of Contextual Data

As we aim to build a machine learning based context-aware apps usage prediction model, we need exploratory data analysis to observe the characteristics of the contexts. The first task in modeling is to make the apps usage dataset with its multi-dimensional contexts suitable as input to our target machine learning classification technique. For this reason, we first remove the records with missing data caused by anomalies in contextual data collection. In order to build the context-aware model, we take into account as features all the relevant contexts that have an influence on individuals’ usage, which is required in building a machine learning based model. Once the features have been identified, it is necessary to determine the data type of each feature in the dataset. According to Table 1, all the contexts are categorical except the temporal context. Thus, we take into account temporal segments with a one hour interval in our analysis. To fit the machine learning techniques, all the categorical contextual features need to be converted into vectors. The most popular approaches for this task are “Label Encoding” and “One Hot Encoding”. In one hot encoding, the number of features increases significantly, and the resulting dataset has many dimensions. In label encoding, on the other hand, each feature value is converted into a particular number and the number of features remains the same. As we have taken into account multi-dimensional contexts, one hot encoded features might produce sparse data that are difficult to fit in our target machine learning algorithm. Moreover, it takes a lot of processing time because of the increased number of data dimensions. Thus, in this work, we use the label encoding technique for converting the categorical contextual data into a feature vector to fit the contexts into the model. For instance, label encoding can turn the user mood values [happy, sad, happy, sad, normal] into [0, 1, 0, 1, 2].
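The difference between the two encodings can be sketched in a few lines. This is a minimal pure-Python illustration, assuming codes are assigned in order of first appearance so that it reproduces the mood example above; library encoders may assign codes differently (for example, scikit-learn's `LabelEncoder` orders classes by sorted value), so the exact integers are an assumption of this sketch.

```python
from typing import Dict, List, Tuple


def label_encode(values: List[str]) -> Tuple[List[int], Dict[str, int]]:
    """Map each distinct value to an integer, in order of first appearance."""
    mapping: Dict[str, int] = {}
    codes: List[int] = []
    for value in values:
        if value not in mapping:
            mapping[value] = len(mapping)
        codes.append(mapping[value])
    return codes, mapping


def one_hot_encode(values: List[str]) -> List[List[int]]:
    """Expand one categorical feature into one binary column per distinct value."""
    codes, mapping = label_encode(values)
    width = len(mapping)
    return [[1 if column == code else 0 for column in range(width)] for code in codes]


moods = ["happy", "sad", "happy", "sad", "normal"]

mood_codes, mood_mapping = label_encode(moods)  # one column, integer codes
mood_one_hot = one_hot_encode(moods)            # three columns for three moods
```

Label encoding keeps a single column per context, whereas one hot encoding adds one column per distinct value, which is the dimensionality growth the text refers to. Note that integer codes impose an arbitrary order on the values; tree based learners split on thresholds and are usually tolerant of this, which is a general observation rather than a claim made in the paper.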

4.3 Contextual Random Forest Generation

Once the preprocessing of contexts has been completed, we use machine learning techniques to build our context-aware apps usage prediction model. In order to build an effective context-aware model, we use random forest learning, which is one of the most popular and powerful machine learning algorithms. A random forest Breiman (2001) is an ensemble classifier that mainly combines random feature selection Amit and Geman (1997) and bootstrap aggregation Breiman (1996) in order to construct a collection of decision trees exhibiting controlled variation.

To generate the random forest, we take into account all the relevant contextual features discussed above. The generated random forest consists of a number of decision trees, each of which can be used as a separate classifier like a single decision tree based model. At each node in a tree, a subset of features is randomly selected from the D available features in the dataset, and the node is partitioned according to the Gini index Breiman et al. (1984). For a binary split, the Gini index of a node $n$ can be expressed as -

$$G(n) = 1 - \sum_{j} p_j^2 \tag{1}$$

where $p_j$ is the relative proportion of examples belonging to class $j$ present in node $n$. The best possible binary split is the one which maximizes the improvement in the Gini index,

$$\Delta G = G(n_p) - p_l \, G(n_l) - p_r \, G(n_r) \tag{2}$$

where $p_l$ and $p_r$ are the proportions of examples in the parent node $n_p$ that are assigned to the child nodes $n_l$ and $n_r$, respectively.
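The node splitting criterion can be illustrated with a short sketch. It assumes the standard CART form of the Gini index, one minus the sum of squared class proportions, and defines the improvement of a binary split as the parent's Gini index minus the size-weighted Gini indices of the two children. The function names and the toy labels are hypothetical.

```python
from collections import Counter
from typing import List


def gini(labels: List[str]) -> float:
    """Gini index of a node: 1 - sum_j p_j^2 over the classes present."""
    total = len(labels)
    if total == 0:
        return 0.0
    return 1.0 - sum((count / total) ** 2 for count in Counter(labels).values())


def gini_improvement(parent: List[str], left: List[str], right: List[str]) -> float:
    """Decrease in Gini index when `parent` is split into `left` and `right`."""
    p_left = len(left) / len(parent)
    p_right = len(right) / len(parent)
    return gini(parent) - p_left * gini(left) - p_right * gini(right)


# Made-up node with two app classes, and two candidate binary splits.
parent = ["Gmail", "Gmail", "Music", "Music"]

clean_split = gini_improvement(parent, ["Gmail", "Gmail"], ["Music", "Music"])
mixed_split = gini_improvement(parent, ["Gmail", "Music"], ["Gmail", "Music"])
```

The split that separates the classes completely removes all impurity, while the split that leaves both children mixed gives no improvement, so a tree builder would prefer the first.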

Finally, we combine the generated trees to form a single learner that produces the final outcome. For a particular test case, the model counts the votes for each outcome predicted by the separate decision trees and takes the outcome with the most votes, i.e., majority voting, as the final prediction result. For instance, suppose we generate $N$ random decision trees utilizing the given training dataset. The terminal nodes of each decision tree represent apps usage behavior classes, and the edges are associated with the corresponding contexts that are used for similarity matching in prediction. For a particular test case, each random decision tree generated in the model may predict a different outcome according to the contexts in the tree. In order to make the final prediction for those contexts, we calculate and store the predicted output of each tree and perform majority voting among the trees.
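The voting step can be sketched as follows. The three "trees" are hand-written stand-ins (plain functions over a context dictionary), not trees learned from data, and all names and context values are illustrative assumptions.

```python
from collections import Counter
from typing import Callable, Dict, List

Contexts = Dict[str, str]
Tree = Callable[[Contexts], str]


def majority_vote(votes: List[str]) -> str:
    """Return the most frequent prediction among the individual trees."""
    return Counter(votes).most_common(1)[0][0]


def forest_predict(trees: List[Tree], contexts: Contexts) -> str:
    """Collect one vote per tree for the test case and return the majority class."""
    votes = [tree(contexts) for tree in trees]
    return majority_vote(votes)


# Hypothetical stand-ins for learned decision trees.
def tree_a(c: Contexts) -> str:
    return "Youtube" if c["wifi"] == "on" else "Gmail"


def tree_b(c: Contexts) -> str:
    return "Music" if c["mood"] == "happy" else "Whatsapp"


def tree_c(c: Contexts) -> str:
    return "Youtube" if c["location"] == "home" else "Gmail"


test_case = {"wifi": "on", "mood": "happy", "location": "home"}
prediction = forest_predict([tree_a, tree_b, tree_c], test_case)
```

For this test case the votes are Youtube, Music, and Youtube, so the forest predicts Youtube. How ties are broken is a design choice this sketch leaves to `Counter.most_common`.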

Since different values of the number of trees $N$ in random forest learning may give different prediction results, we determine an optimal value of $N$ based on low error rate, i.e., higher prediction results, to build our target model. As a larger number of trees increases the overall computational cost, we take the lowest value of $N$ that gives the highest prediction results in terms of F1 score for a given training dataset as the optimal value of $N$. Thus it needs to satisfy these two functions: $\min(N)$ and $\max(F_1)$, where $\min$ and $\max$ represent the minimum and the maximum, respectively. Rather than arbitrarily assuming the number of trees in building the forest, we determine the optimal value of $N$ by iteratively varying the number. A random forest learning based context-aware model generated with this optimal number of trees is then considered an effective apps usage model in terms of computational cost and prediction accuracy.
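The selection rule described above, the smallest forest that still reaches the best score, can be sketched as below. The scores are invented placeholders, not results from the paper; in practice each score would come from training and validating a forest with that many trees. The `tolerance` parameter is an added assumption that lets near-best scores count as ties.

```python
from typing import Dict


def optimal_tree_count(scores: Dict[int, float], tolerance: float = 0.0) -> int:
    """Smallest number of trees whose score is within `tolerance` of the best score."""
    best = max(scores.values())
    return min(n for n, score in scores.items() if score >= best - tolerance)


# Hypothetical validation scores, keyed by the number of trees tried.
candidate_scores = {10: 0.81, 50: 0.86, 100: 0.88, 200: 0.88, 500: 0.879}

strict_choice = optimal_tree_count(candidate_scores)                    # exact best score
relaxed_choice = optimal_tree_count(candidate_scores, tolerance=0.025)  # near-best allowed
```

With these placeholder values, 100 and 200 trees tie for the best score, so the strict rule keeps the cheaper forest of 100 trees; allowing a small tolerance trades a little score for an even smaller forest.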

5 Evaluation and Experimental Results

To evaluate the effectiveness of our context-aware apps usage prediction model, we have conducted a range of experiments on the real mobile phone datasets collected by us. We have described these datasets, including the relevant contexts, above. In the following, we report the experimental results utilizing these datasets and illustrate our AppsPred model with the detailed experimental results of two randomly selected users.

5.1 Evaluation Metric

In order to measure the effectiveness of our AppsPred model in terms of prediction accuracy, we compare the predicted response with the actual response, i.e., the ground truth, and compute the accuracy in terms of:

  • Precision: It measures the ratio between the number of apps usage behaviors that are correctly predicted and the total number of apps that are predicted. If TP and FP denote true positives and false positives, then the formal definition of precision is Witten et al. (1999):

    $$\text{Precision} = \frac{TP}{TP + FP} \tag{3}$$

  • Recall: It measures the ratio between the number of apps usage behaviors that are correctly predicted and the total number of apps that are relevant. If TP and FN denote true positives and false negatives, then the formal definition of recall is Witten et al. (1999):

    $$\text{Recall} = \frac{TP}{TP + FN} \tag{4}$$

  • F1 score: It is a measure that combines both the precision and recall defined above. It represents the harmonic mean of precision and recall. The formal definition of the F1 score is Witten et al. (1999):

    $$F_1 = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \tag{5}$$

  • ROC value: ROC stands for Receiver Operating Characteristic. It is another evaluation metric, based on the true positive rate and the false positive rate, for evaluating the output quality of a machine learning classifier. It summarizes the trade-off between the true positive rate and the false positive rate for a particular predictive model Witten et al. (1999).
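For a multi-class problem such as app prediction, precision, recall, and F1 are typically computed per class by treating that class as the positive one. The sketch below does this for one class; the actual and predicted labels are made up for illustration and the function name is hypothetical.

```python
from typing import List, Tuple


def per_class_metrics(actual: List[str], predicted: List[str], cls: str) -> Tuple[float, float, float]:
    """Precision, recall, and F1 for one class, treated as the positive class."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == cls and p == cls for a, p in pairs)  # correctly predicted as cls
    fp = sum(a != cls and p == cls for a, p in pairs)  # wrongly predicted as cls
    fn = sum(a == cls and p != cls for a, p in pairs)  # cls instances that were missed
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1


# Made-up ground truth and predictions for six test cases.
actual = ["Gmail", "Gmail", "Gmail", "Music", "Music", "Skype"]
predicted = ["Gmail", "Gmail", "Music", "Music", "Gmail", "Skype"]

gmail_precision, gmail_recall, gmail_f1 = per_class_metrics(actual, predicted, "Gmail")
```

Here Gmail has two true positives, one false positive, and one false negative, so precision, recall, and F1 all equal 2/3.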

5.2 Experimental Results and Analysis

To evaluate our AppsPred model, we employ the popular K-fold cross validation technique in machine learning Han et al. (2011), where we use K = 10, to measure the prediction accuracy. The 10-fold cross validation breaks the given dataset into 10 sets. It trains the prediction model on 9 sets and tests it using the remaining set. This is repeated 10 times, and we take the mean accuracy rate. To show the effectiveness of each machine learning classification based model, we compare the accuracy in terms of the precision, recall, F1 score and ROC value defined above. In the following, we report the overall experimental results in different dimensions.
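The fold construction behind K-fold cross validation can be sketched without any library. This is a minimal illustration under the assumption that records are shuffled once and dealt into K folds of nearly equal size; the function name, the seed, and the sample count are arbitrary choices for the example.

```python
import random
from typing import Iterator, List, Tuple


def k_fold_indices(n_samples: int, k: int = 10, seed: int = 0) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, test_indices) pairs, one pair per fold."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # deal indices into k folds
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# Each record is used for testing exactly once and for training k - 1 times.
splits = list(k_fold_indices(n_samples=25, k=10))
```

A model would be trained on each `train` list and scored on the matching `test` list, and the ten scores would then be averaged.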

5.2.1 Personalized Prediction Results of our AppsPred Model

In this experiment, we show the prediction results of our context-aware smartphone apps usage model for different individual users. Table 2 and Table 3 show the prediction results in terms of precision, recall, F-score and ROC value for each individual app as a usage class, utilizing the datasets DS-01 and DS-02 respectively. As our AppsPred model is personalized, we show these results for individuals' datasets. If we observe Table 2 and Table 3, we see that for each app usage class the values of precision, recall, F-score and ROC are high, close to the maximum value 1, which indicates good prediction capability. These results also indicate that the FP rate, which represents the instances falsely classified as a given class, is negligible. Thus, the overall experimental results in Table 2 and Table 3 show that our machine learning based context-aware model AppsPred is capable of effectively predicting each app usage class of individual users according to their usage patterns in different contexts.

| Apps (Class) | Precision | Recall | F-score | ROC value |
|--------------|-----------|--------|---------|-----------|
| Gmail        | 0.871     | 0.883  | 0.877   | 0.993     |
| Whatsapp     | 0.851     | 0.841  | 0.846   | 0.985     |
| Readnews     | 0.866     | 0.921  | 0.892   | 0.996     |
| LinkedIn     | 0.838     | 0.862  | 0.851   | 0.981     |
| Music        | 0.852     | 0.848  | 0.851   | 0.991     |
| Youtube      | 0.878     | 0.878  | 0.878   | 0.992     |
| Facebook     | 0.832     | 0.839  | 0.835   | 0.987     |
| Skype        | 0.881     | 0.864  | 0.872   | 0.991     |

Table 2: The prediction results for various apps of a sample user using our context-aware model AppsPred utilizing dataset DS-01.

| Apps (Class) | Precision | Recall | F-score | ROC value |
|--------------|-----------|--------|---------|-----------|
| Facebook     | 0.863     | 0.865  | 0.864   | 0.988     |
| Youtube      | 0.903     | 0.895  | 0.899   | 0.993     |
| Browser      | 0.855     | 0.875  | 0.865   | 0.991     |
| Gmail        | 0.887     | 0.913  | 0.902   | 0.994     |
| Whatsapp     | 0.858     | 0.832  | 0.845   | 0.991     |
| Movie        | 0.901     | 0.857  | 0.878   | 0.987     |
| Games        | 0.862     | 0.881  | 0.871   | 0.991     |
| Live sports  | 0.887     | 0.902  | 0.894   | 0.993     |
| Skype        | 0.911     | 0.883  | 0.897   | 0.991     |
| Instagram    | 0.902     | 0.909  | 0.905   | 0.994     |
| Read News    | 0.901     | 0.864  | 0.882   | 0.989     |
| LinkedIn     | 0.866     | 0.887  | 0.876   | 0.989     |
| Music        | 0.905     | 0.905  | 0.905   | 0.994     |

Table 3: The prediction results for various apps of a sample user using our context-aware model AppsPred utilizing dataset DS-02.

5.2.2 Effect of the Number of Trees

In this experiment, we first show the effect of the number of trees on prediction accuracy, utilizing individuals' apps usage datasets. To do so, we vary the number of generated trees for each individual's dataset and report the detailed outcomes. We start with a single decision tree and measure the corresponding prediction results in terms of the precision and recall defined above. Figure 1 presents the impact of the number of trees (up to 200 decision trees) on prediction accuracy for the datasets DS-01 and DS-02 respectively. The x-axis of the figure represents the number of trees, and the y-axis represents the corresponding prediction accuracy in terms of precision and recall.

(a) Prediction results (dataset DS-01).
(b) Prediction results (dataset DS-02).
Figure 1: Prediction results in terms of precision and recall by varying the number of trees utilizing individual’s datasets.

If we observe Figure 1, we see that the prediction results are not static; they vary with the number of trees. As different numbers of trees give different prediction results, we determine the optimal number of trees based on the F-score, which combines precision and recall. To do this, Figure 2 shows the effect of varying the number of trees on the F-score for these datasets. The x-axis of the figure represents the number of trees, and the y-axis represents the corresponding prediction accuracy in terms of F-score.

Figure 2: Effect of the number of generated trees on prediction results in terms of F-score for selecting the optimal number of trees.

If we observe Figure 2, we see that the F-score is initially low and increases up to a certain number of trees. The reason is that a single decision tree may over-fit and therefore gives a lower F-score. In contrast, the multiple trees in random forest learning are generated from different subsets of the data in our context-aware model, which controls over-fitting and increases the F-score. Figure 2 also shows that different numbers of trees give different F-scores. Because a larger number of trees increases the computational cost and makes the model more complex, we take as optimal the number of trees that gives a significant F-score at the lowest computational cost of tree generation. Thus, from Figure 2, we find that only 15 decision trees produce a significant result for dataset DS-01. Similarly, for dataset DS-02, the optimal number of decision trees is 10. These optimal numbers of trees make our AppsPred model simple and effective.
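The paper does not state an exact selection rule, so the following is only one plausible way to formalize "a significant F-score at the lowest computational cost": pick the smallest number of trees whose F-score is within a tolerance of the best observed F-score. The function name `select_num_trees`, the `tolerance` parameter, and the F-score values are all hypothetical; the numbers are invented for illustration and are not read from Figure 2.

```python
def select_num_trees(f_scores, tolerance=0.005):
    """Return the smallest tree count whose F-score is within
    `tolerance` of the best F-score in `f_scores` ({n_trees: f_score})."""
    best = max(f_scores.values())
    for n_trees in sorted(f_scores):
        if best - f_scores[n_trees] <= tolerance:
            return n_trees

# Invented F-scores per forest size, NOT results from the paper.
f_by_trees = {1: 0.780, 5: 0.842, 10: 0.861, 20: 0.868,
              50: 0.870, 100: 0.869, 200: 0.870}
chosen = select_num_trees(f_by_trees, tolerance=0.005)
print("selected number of trees:", chosen)
```

A larger tolerance favors smaller forests and lower training cost; a tolerance of zero simply returns the smallest forest that attains the maximum F-score.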

5.2.3 Effect on the Execution Time

In this experiment, we show the effect of the number of trees on the execution time, utilizing individuals' apps usage datasets. To do so, we vary the number of generated trees for each individual's dataset, starting with a single decision tree, and measure the corresponding execution time.

Figure 3: Effect of the number of generated trees on execution time.

Figure 3 shows the execution time taken by the context-aware model for different numbers of generated trees, from a single tree up to 200 trees, utilizing the datasets DS-01 and DS-02 respectively. The x-axis of the figure represents the number of trees, and the y-axis represents the corresponding execution time. From Figure 3, we see that as the number of trees increases, the execution time also increases, which makes the model more complex. On the other hand, with a small number of trees the model performs efficiently.

5.2.4 Effectiveness Comparison

In this experiment, we show the effectiveness of our AppsPred model in terms of precision, recall, F-score, and ROC value by comparing it with other popular classification techniques in machine learning. The baseline methods are as follows:

ZeroR: In the area of machine learning, this is the simplest approach for predictive analytics among the classification techniques Witten and Frank (2005). According to Sarker et al. (2019c), it can be used to establish a baseline performance as a benchmark for other classification techniques. For comparison purposes, we denote the ZeroR learning based model as BM1.

Naive Bayes (NB): This is one of the most popular classification algorithms in the area of machine learning. A naive Bayes classifier John and Langley (1995) is a basic probabilistic technique that can predict class membership probabilities. For comparison purposes, we denote the naive Bayes learning based model as BM2.

Support Vector Machines (SVM): This is another popular classification technique used widely for various predictive analytics. In SVM Keerthi et al. (2001), a hyperplane is chosen that partitions the input variable space so as to separate the instances by class. For comparison purposes, we denote the support vector machines based model as BM3.

Logistic Regression (LR): This is another popular probabilistic statistical model used to solve classification problems. Typically, a logistic regression classifier Le Cessie and Van Houwelingen (1992) estimates the probabilities using a logistic function, which is also referred to as the sigmoid function. For comparison purposes, we denote the logistic regression learning based model as BM4.

Decision Tree (DT): This is a very well-known and widely discussed technique for classification and predictive analytics. DT Quinlan (1986) constructs a decision tree by calculating the entropy and the information gain, a statistical property used to select which attribute to test at each node in the tree. For comparison purposes, we denote the decision tree learning based model as BM5.
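The entropy and information gain computation mentioned for decision trees can be sketched as follows. This is a minimal illustration of the standard definitions, not the implementation used in the experiments; the attribute names (`wifi`, `period`) and the four toy records are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def information_gain(rows, attribute, target):
    """Entropy of `target` minus its weighted entropy after splitting on `attribute`."""
    base = entropy([r[target] for r in rows])
    groups = {}
    for r in rows:
        groups.setdefault(r[attribute], []).append(r[target])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return base - remainder

# Hypothetical context records: the app used depends only on Wifi status here.
rows = [
    {"wifi": "on", "period": "morning", "app": "Youtube"},
    {"wifi": "on", "period": "evening", "app": "Youtube"},
    {"wifi": "off", "period": "morning", "app": "Gmail"},
    {"wifi": "off", "period": "evening", "app": "Gmail"},
]
gain_wifi = information_gain(rows, "wifi", "app")
gain_period = information_gain(rows, "period", "app")
print(gain_wifi, gain_period)
```

In this toy data the attribute with the higher gain (`wifi`) would be tested at the root node, since splitting on it separates the two app classes completely.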

Figure 4: Effectiveness comparison with different classification based context-aware models utilizing a collection of datasets.

To illustrate the effectiveness of our AppsPred model utilizing individuals' datasets, we show the relative comparison of precision, recall, F-score and ROC value. Figure 4 shows the comparison results in terms of average precision, recall, F-score and ROC value over a collection of datasets. For each classifier based approach, we use the same training and testing data for the purpose of fair evaluation and comparison. If we observe Figure 4, we find that our AppsPred model consistently outperforms the other classification based methods for different datasets. In particular, the AppsPred model gives the highest prediction results in terms of precision, recall, F-score and ROC value. The reason for the better results is that we take into account an optimal number of trees, each built on a subset of the data, and average their results for the final outcome. This reduces the variance through averaging over learners, and the randomized stages decrease the correlation between individual learners in the model. As a result, AppsPred is more effective than the other classifier based approaches when applied to mobile phone data consisting of individuals' usage of a variety of smartphone apps and the corresponding multi-dimensional contexts.
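The variance argument can be made explicit under a simplifying assumption that is not stated in the paper: suppose the forest averages $B$ identically distributed learners $T_1, \dots, T_B$, each with variance $\sigma^2$ and pairwise correlation $\rho$. The symbols $B$, $\sigma^2$ and $\rho$ are introduced here only for illustration. The variance of the average is then:

```latex
\operatorname{Var}\!\left(\frac{1}{B}\sum_{b=1}^{B} T_b\right)
  = \rho\,\sigma^{2} + \frac{1-\rho}{B}\,\sigma^{2}
```

The second term shrinks as the number of trees $B$ grows, while the first term is reduced only by lowering the correlation $\rho$ between trees, which is the role of the randomization. This is also consistent with the observation that adding trees beyond a certain number brings little further improvement.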

6 Discussion

The experimental results in Section 5 have shown that our random forest learning based context-aware apps prediction model “AppsPred” is fully personalized and adaptive to individuals’ usage behavior. Compared to the other popular classifier based approaches, the prediction accuracy in terms of precision, recall, F-score and ROC value improves when this model is used, as shown in Figure 4. Although it requires a number of iterations to determine the optimal number of decision trees for predicting future usage in a particular context, it is effective in terms of computational cost and prediction accuracy. The following are a few key findings from our study.

  • To predict individuals’ apps usage behavior in a particular context, a random forest learning based apps prediction model with an optimal number of decision trees is more effective than a single decision tree based model. In our experiments, we have shown the corresponding results in terms of precision, recall, F-score and ROC value utilizing individuals’ datasets.

  • Another important finding of our study is that a large number of trees in a random forest learning based context-aware model is not always effective in terms of prediction results and computational cost. According to Figure 1 and Figure 2, the prediction accuracy in terms of precision, recall and F-score is low for a single tree and increases up to a certain number of trees. Beyond that point, generating more trees increases the computational cost without any significant improvement in the prediction results.

  • We have observed a significantly lower prediction accuracy when using other classification based approaches compared to our AppsPred context-aware model. The reason is that the other models cannot properly capture the different categories of apps usage patterns in multi-dimensional contexts. Consequently, these approaches have lower prediction accuracy than the AppsPred model, which uses random forest learning with an optimal number of trees to capture the usage patterns more accurately.

    Overall, our context-aware model AppsPred is effective in terms of its prediction results and has a low computational cost. Although it takes more training time than a single decision tree based model, it achieves better prediction accuracy than a decision tree based model. Typically, a machine learning algorithm searches a space of hypotheses to find the best hypothesis for a particular problem. For a small amount of training data, a learning algorithm could identify several hypotheses that give similar accuracy on the testing data. An ensemble of single classifiers can average their prediction results and thus avoid the risk of selecting an inaccurate classifier. Thus, the random forest learning based context-aware apps usage model AppsPred is very helpful for predicting individuals’ future usage in a particular context-aware test case.
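For class labels such as app names, combining the outputs of the individual trees is commonly done by majority vote rather than by numeric averaging. The sketch below assumes that combination rule, which the paper does not spell out; the function name `majority_vote`, the tie-breaking choice, and the example votes are hypothetical.

```python
from collections import Counter

def majority_vote(tree_predictions):
    """Return the label predicted by the most trees.
    Ties are broken in favor of the label that appears first in the list."""
    counts = Counter(tree_predictions)
    top = max(counts.values())
    for label in tree_predictions:
        if counts[label] == top:
            return label

# Hypothetical outputs of five trees for one context-aware test case.
votes = ["Gmail", "Skype", "Gmail", "Music", "Gmail"]
predicted_app = majority_vote(votes)
print(predicted_app)
```

A single tree that predicts "Skype" or "Music" is outvoted here, which illustrates how the ensemble avoids depending on one inaccurate classifier.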

7 Conclusion

In this paper, we have presented a data-driven context-aware smartphone apps usage model, AppsPred, that utilizes the random forest machine learning technique by taking into account an optimal number of decision trees. In order to build this personalized model, we have collected apps usage datasets from individual users and taken into account the relevant contextual features that influence individuals' use of various categories of apps. No assumption or prior knowledge is needed to employ our model, as we select the optimal number of trees dynamically according to the data patterns, which may vary from user to user. Experimental results on the collected contextual smartphone datasets indicate that our model outperforms popular classifier based models for predicting individuals’ smartphone apps. We have demonstrated that our AppsPred model is in general highly effective in terms of accuracy and computational cost in predicting future usage with multi-dimensional contexts. We believe that this model will help application developers build corresponding real-life applications that provide context-aware personalized services to end users according to their needs. Assessing the effectiveness of the AppsPred model at the application level through a user survey could be future work.

Acknowledgment

The authors would like to thank all the participants involved in this study for providing their smartphone apps usage datasets, which consist of various categories of apps and the corresponding contextual information.

References


  • [1] R. Agrawal, R. Srikant, et al. (1994) Fast algorithms for mining association rules. In Proc. 20th int. conf. very large data bases, VLDB, Vol. 1215, pp. 487–499.
  • [2] Y. Amit and D. Geman (1997) Shape quantization and recognition with randomized trees. Neural computation 9 (7), pp. 1545–1588.
  • [3] S. Bouker, R. Saidi, S. B. Yahia, and E. M. Nguifo (2012) Ranking and selecting association rules based on dominance relationship. In Tools with Artificial Intelligence (ICTAI), 2012 IEEE 24th International Conference on, Vol. 1, pp. 658–665.
  • [4] L. Breiman, J. Friedman, R. Olshen, and C. Stone (1984) Classification and regression trees. Chapman & Hall, New York, USA.
  • [5] L. Breiman (1996) Bagging predictors. Machine learning 24 (2), pp. 123–140.
  • [6] L. Breiman (2001) Random forests. Machine learning 45 (1), pp. 5–32.
  • [7] A. K. Dey (2001) Understanding and using context. Personal and ubiquitous computing 5 (1), pp. 4–7.
  • [8] P. Fournier-Viger and V. S. Tseng (2012) Mining top-k non-redundant association rules. In International Symposium on Methodologies for Intelligent Systems, pp. 31–40.
  • [9] J. Han, M. Kamber, and J. Pei (2011) Data mining: concepts and techniques. Elsevier.
  • [10] J. Hong, E. Suh, J. Kim, and S. Kim (2009) Context-aware system for proactive personalized service based on context history. Expert Systems with Applications 36 (4), pp. 7448–7457.
  • [11] G. H. John and P. Langley (1995) Estimating continuous distributions in bayesian classifiers. In Proceedings of the Eleventh conference on Uncertainty in artificial intelligence, pp. 338–345.
  • [12] S. S. Keerthi, S. K. Shevade, C. Bhattacharyya, and K. R. K. Murthy (2001) Improvements to platt’s smo algorithm for svm classifier design. Neural computation 13 (3), pp. 637–649.
  • [13] S. Le Cessie and J. C. Van Houwelingen (1992) Ridge estimators in logistic regression. Journal of the Royal Statistical Society: Series C (Applied Statistics) 41 (1), pp. 191–201.
  • [14] W. Lee (2007) Deploying personalized mobile services in an agent-based environment. Expert Systems with Applications 32 (4), pp. 1194–1207.
  • [15] A. Mehrotra, R. Hendley, and M. Musolesi (2016) PrefMiner: mining user’s preferences for intelligent mobile notification management. In UbiComp.
  • [16] J. R. Quinlan (1986) Induction of decision trees. Machine learning 1 (1), pp. 81–106.
  • [17] J. R. Quinlan (1993) C4.5: programs for machine learning. Machine Learning.
  • [18] I. H. Sarker, A. Colman, and J. Han (2019) RecencyMiner: mining recency-based personalized behavior from contextual smartphone data. Journal of Big Data 6 (1), pp. 1–21.
  • [19] I. H. Sarker, A. Colman, M. A. Kabir, and J. Han (2018) Individualized time-series segmentation for mining mobile phone user behavior. The Computer Journal, Oxford University, UK 61 (3), pp. 349–368.
  • [20] I. H. Sarker, M. A. Kabir, A. Colman, and J. Han (2017) An improved naive bayes classifier-based noise detection technique for classifying user phone call behavior. In Proceedings of the 2017 Australian Data Mining Conference (AusDM 2017), Melbourne, Australia.
  • [21] I. H. Sarker, M. A. Kabir, A. Colman, and J. Han (2017) Designing architecture of a rule-based system for managing phone call interruptions. In Proceedings of the 2017 ACM International Joint Conference on Pervasive and Ubiquitous Computing and Proceedings of the 2017 ACM International Symposium on Wearable Computers, USA, pp. 898–903.
  • [22] I. H. Sarker, A. Kayes, M. H. Furhad, M. M. Islam, and M. S. Islam (2019) E-miim: an ensemble-learning-based context-aware mobile telephony model for intelligent interruption management. AI & SOCIETY, pp. 1–9.
  • [23] I. H. Sarker, A. Kayes, and P. Watters (2019) Effectiveness analysis of machine learning classification models for predicting personalized context-aware smartphone usage. Journal of Big Data 6 (1), pp. 1–28.
  • [24] I. H. Sarker and F. D. Salim (2018) Mining user behavioral rules from smartphone data through association analysis. In Proceedings of the 22nd Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), Melbourne, Australia, pp. 450–461.
  • [25] I. H. Sarker (2018) BehavMiner: mining user behaviors from mobile phone data for personalized services. In Proceedings of the 2018 IEEE International Conference on Pervasive Computing and Communications (PerCom 2018), Athens, Greece.
  • [26] I. H. Sarker (2018) Mobile data science: towards understanding data-driven intelligent mobile applications. EAI Endorsed Transactions on Scalable Information Systems 5 (19).
  • [27] I. H. Sarker (2018) Research issues in mining user behavioral rules for context-aware intelligent mobile applications. Iran Journal of Computer Science, pp. 1–11.
  • [28] I. H. Sarker (2019) A machine learning based robust prediction model for real-life mobile phone data. Internet of Things 5, pp. 180–193.
  • [29] V. Srinivasan, S. Moghaddam, A. Mukherji, K. K. Rachuri, C. Xu, and E. M. Tapia (2014) Mobileminer: mining your frequent patterns on your phone. In Proceedings of the International Joint Conference on Pervasive and Ubiquitous Computing, pp. 389–400.
  • [30] International Telecommunication Union (2015) Measuring the information society. Technical report, http://www.itu.int/en/ITU-D/Statistics/Documents/publications/misr2015/MISR2015-w5.pdf.
  • [31] I. H. Witten, E. Frank, L. E. Trigg, M. A. Hall, G. Holmes, and S. J. Cunningham (1999) Weka: practical machine learning tools and techniques with java implementations.
  • [32] I. H. Witten and E. Frank (2005) Data mining: practical machine learning tools and techniques. Morgan Kaufmann.
  • [33] H. Zhu, E. Chen, H. Xiong, K. Yu, H. Cao, and J. Tian (2014) Mining mobile user preferences for personalized context-aware recommendation. ACM Transactions on Intelligent Systems and Technology (TIST) 5 (4), pp. 58.
  • [34] S. Zulkernain, P. Madiraju, S. I. Ahamed, and K. Stamm (2010) A mobile intelligent interruption management system. J. UCS 16 (15), pp. 2060–2080.