PANDA: Facilitating Usable AI Development

Recent advances in artificial intelligence (AI) and machine learning have created a general perception that AI can be used to solve complex problems, and in some situations AI is over-hyped as a tool that can be used with little effort. Unfortunately, the barrier to mass adoption of AI in various business domains is too high because most domain experts have no background in AI. Developing AI applications involves multiple phases, namely data preparation, application modeling, and product deployment. AI research effort has been spent mostly on new AI models (in the model training stage) to improve the performance on benchmark tasks such as image recognition. Many other factors such as usability, efficiency and security of AI have not been well addressed, and therefore form a barrier to democratizing AI. Further, for many real-world applications such as healthcare and autonomous driving, learning via extensive exploration of possibilities is not feasible since humans are involved. In many complex applications such as healthcare, subject matter experts (e.g. clinicians) are the ones who appreciate the importance of features that affect health, and their knowledge together with existing knowledge bases is critical to the end results. In this paper, we take a new perspective on developing AI solutions, and present a solution for making AI usable. We hope that this solution will enable all subject matter experts (e.g. clinicians) to exploit AI like data scientists.




1 Introduction

Recent advances in artificial intelligence (AI) and machine learning provide many opportunities for improving various applications, business practices and models. For example, AI-based solutions driven by Big Data have achieved human-level performance in computer vision and speech processing benchmarks. The availability of data has caused the rapid development of new models, whose success further fuels the interest in exploiting data in decision making. It is therefore not surprising that we see an increasing desire to exploit AI in application areas such as finance and healthcare.

However, there is a significant barrier to realizing the mass adoption of AI applications. Developing an AI application involves multiple phases, namely data preparation, application modeling, and product deployment. In fact, the effort of AI researchers has been spent mostly on new AI models to improve the performance on benchmark tasks, e.g. the ImageNet competition [39]. Many other factors such as usability, efficiency and security of AI have not been well addressed, and therefore form a barrier to democratizing AI.

We have been involved in several such applications, from basic research and the understanding and interpretation of requirements to deployment and validation. One example is the healthcare EMR (electronic medical record) application, where we worked with the clinicians, developed the model [55], validated the model, and integrated the application into the production system after validation. Figure 1 shows the development pipeline of a healthcare AI system such as disease progression modelling. Compared to devising a new algorithm on a standard benchmark problem, we face the following challenges: 1) There is no standard dataset for any application based on EMR. Data are biased, irregular and too noisy to be directly used as input for any ready-made model. Exploration for new features could be very helpful for the specified application. 2) The choice of model and parameter settings strongly depends on the details of the application. Finding a suitable solution requires both strong domain knowledge and a machine learning background. 3) High-stakes applications require strong reliability for the deployed product.

Figure 1: AI Development pipeline for Healthcare.

Based on our experience and observations, we examine the life-cycle of an AI application to locate the specific research topics related to the barrier from the perspective of AI and Big Data researchers, and application developers and data scientists.

Figure 2: Development life-cycle of an AI application.

Succinctly, the development process of an AI application consists of three main stages (Figure 2): 1) data preparation including acquisition, cleaning, labeling, integration and analytics; 2) specified model design for the given application; 3) product deployment which provides reliable service efficiently. Like the development of other applications, e.g. database applications, we need support from algorithms, models, tools and systems to ease the processing, reduce the cost, improve the performance and ensure the security of each stage of the development. In the remainder of this paper, we shall discuss each stage to analyze the challenges of realizing these goals while keeping in mind our goals: ease of use, effectiveness, efficiency, scalability and security.

2 Data Preparation

Industrial AI applications are often based on simple yet effective standard learning models. However, the development procedure is not trivial and often painful because of the quantity and quality of data. Unlike well-studied benchmark problems which come with a pre-defined training set, the training data of a real AI application is often not well pre-defined, cleaned or integrated. Data is so important for AI applications that it is referred to as the new oil. Currently, most datasets are created manually by domain experts or via crowdsourcing. The cleaning, integration, labeling and analyzing procedures are tedious and expensive for a large dataset.

2.1 Visualization and Interaction

Tools with good usability can improve both efficiency and effectiveness. However, an easy-to-use system requires a lot of engineering work on the interface and functionalities.

We highlight the role of data visualization. A well-designed data visualization tool greatly assists the domain experts in reviewing the data, exploring the existing large-scale data, performing collaborative annotations, and effectively offering their expertise. Some new types of data visualization tools have been developed for this purpose, e.g. interactive visualization tools with collaborative annotation and recommendation functionalities. In addition, some crowdsourcing platforms allow embedded Hypertext Markup Language (HTML) to visualize the data in the micro-tasks. Other related research topics, including approximate visualization, auto-ranked visualization, collaborative exploration, resolution reduction, explore-by-example and result recommendation, have been extensively studied to make it easier for domain experts to contribute their knowledge. Efficiency of the visualization tools is also worthy of research, especially for big datasets, as it greatly affects the user experience.

Intelligent questioning modules must be designed to ask the right questions to extract useful information from domain experts who may not be conversant in AI techniques. Such modules will not only serve to bridge the knowledge gap between domain experts and AI practitioners, but will also serve as gatekeepers to ensure the validity of the data. The statistical distribution and bias of the data will have a significant impact on the success of AI systems as most of them are data-driven.

2.2 Cost-sensitive Acquisition

We factorize the efficiency of data preparation into two parts following the efficiency definition of a training system [14]: hardware efficiency, which is the speed of processing a single sample (or batch); and statistical efficiency, which is related to the total number of samples to process. The effectiveness refers to the quality of the pre-processed data and the performance of the model trained using the data.

Figure 3: Collaborative analysis by integrating data from multiple hospitals.

To reduce the time spent for each sample, only those samples that require domain knowledge and can reduce the uncertainty of the learned model should be presented to the domain experts. Interaction with domain experts may be needed in any process in the pipeline where uncertainties exist. The choice of the step where interaction should be invoked, and how much effort should be spent on it, have to be optimized in a quantitative manner.
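One common way to decide which samples are worth an expert's time is uncertainty sampling. The sketch below is only an illustration under that assumption, not the selection scheme used by the authors: it ranks samples by the entropy of the model's predicted class distribution and returns the most uncertain ones. The function names and the example probabilities are hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of one predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def select_for_expert(predictions, k):
    """Return the indices of the k samples with the most uncertain predictions."""
    ranked = sorted(
        range(len(predictions)),
        key=lambda i: entropy(predictions[i]),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical model outputs for four unlabeled samples (two classes each).
predictions = [
    [0.98, 0.02],  # confident, no need to ask an expert
    [0.55, 0.45],  # uncertain
    [0.70, 0.30],
    [0.50, 0.50],  # maximally uncertain
]
to_review = select_for_expert(predictions, k=2)
print(to_review)  # indices of the samples to show to the domain expert
```

Entropy is only one possible uncertainty measure; margin-based or committee-based scores could be substituted without changing the structure.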

To reduce the total number of samples to process, we need to maximize the quantity of information obtained per sample. Then we can stop the processing early once we have enough knowledge about the data. The accuracy of the returned result is a critical issue. One way to increase the accuracy is to collect answers from different domain experts. However, in this case, some algorithm (e.g. majority voting) has to be designed to reconcile the answers.
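As a minimal sketch of reconciling answers by majority voting, the following function returns the most frequent label for one sample together with the fraction of experts who agreed with it. The labels are hypothetical, and real systems often weight experts by estimated reliability, which this sketch does not do.

```python
from collections import Counter


def majority_vote(labels):
    """Return (winning_label, agreement) for the answers given to one sample.

    Ties are broken in favour of the label that was seen first.
    """
    counts = Counter(labels)
    label, votes = counts.most_common(1)[0]
    return label, votes / len(labels)


# Hypothetical answers from three domain experts for the same sample.
label, agreement = majority_vote(["pneumonia", "normal", "pneumonia"])
print(label, agreement)
```

A low agreement value can be used as a signal to ask additional experts before the label is accepted.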

2.3 Data Privacy

With the rapidly growing complexity of AI tasks, AI systems require more extensive cooperation among data providers and end-users. Data from multiple sources are required to be integrated and managed as a data ecosystem. We need a storage system to support collaborative analytics where different organizations having similar applications could share the common data processing flow while maintaining the confidentiality of the data. For example, to train an accurate model for medical image analysis, e.g. thoracic disease identification based on X-ray images [29], hospitals have to collaborate to construct a large labeled image dataset (Figure 3). In such a scenario, part of the data and its processing flow are required to be shared while the security of some other data and their relevant application model should be strictly protected. We have developed a rich semantic data management and storage system called ForkBase [46] based on the principles of immutability, sharing and security. Immutability ensures the traceability of data provenance. Sharing and security properties can facilitate the development of collaborative analytics.

3 Application Modelling

There is plenty of research on model architectures and training algorithms. However, implementing those ideas requires expert knowledge about AI. Moreover, model selection and training configuration are typically done by experts with years of experience. All these together create a big barrier for AI application developers.

3.1 Model Selection

There are three different levels of AI developers, namely, AI researchers, AI beginners and domain experts. To enable all developers to train models efficiently and effectively, research on programming abstraction, resource management and user-system interaction is necessary.

Figure 4: Drag-and-drop interface for model construction.

AI researchers are able to construct their own models using open-source libraries like TensorFlow [1]. However, it is still tedious for them to tune many hyper-parameters of the training algorithms, including learning rate, total number of training iterations, etc. In addition, they have to manage many intermediate models and results. In fact, the checkpoint files for model parameters generated during the training are large for big models such as VGG [41]. A tool with distributed hyper-parameter search, e.g. based on Bayesian optimization [42] or random search [4], is desired. A model management database with model compression would save a lot of space and time for developers.
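The random search idea mentioned above can be sketched in a few lines. This is a single-process illustration, not a distributed tool: the search space, the helper names and the toy objective (a stand-in for validation accuracy after training) are all assumptions made for the example.

```python
import math
import random


def random_search(objective, space, n_trials, seed=0):
    """Sample n_trials configurations at random and keep the best-scoring one."""
    rng = random.Random(seed)
    best_cfg, best_score = None, float("-inf")
    for _ in range(n_trials):
        cfg = {name: sample(rng) for name, sample in space.items()}
        score = objective(cfg)
        if score > best_score:
            best_cfg, best_score = cfg, score
    return best_cfg, best_score


# Hypothetical search space: learning rate is sampled on a log scale.
space = {
    "learning_rate": lambda rng: 10 ** rng.uniform(-4, -1),
    "num_iterations": lambda rng: rng.randint(100, 1000),
}


def toy_objective(cfg):
    """Stand-in for 'train a model and return validation accuracy'.

    It peaks (at 0) when the learning rate is 1e-2.
    """
    return -abs(math.log10(cfg["learning_rate"]) + 2)


best_cfg, best_score = random_search(toy_objective, space, n_trials=50)
print(best_cfg, best_score)
```

Because each trial is independent, the loop body is the natural unit to distribute across workers; in a real setting `toy_objective` would be replaced by an actual training run.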

For AI beginners, a simple, flexible, and extensible interface or programming abstraction is vital for them to get started. Many open source libraries with good programming abstractions have been developed, including Keras and scikit-learn, which are widely used by students to learn data science and deep learning. A more convenient interface for beginners would be like drag-and-drop or plug-and-play on web pages as shown in Figure 4, which sends the models back to the servers for training and tuning automatically.

Domain experts know the data well. However, they may have little knowledge about the AI models and training algorithms. Therefore, it would be better to just let them prepare the data and specify the task. To implement such a system, we need to provide built-in models and model selection algorithms. In fact, many AI applications share similar models. For example, convolutional neural networks (CNN) [25] are the backbone models for image classification tasks, including vehicle classification, flower classification, food classification, etc. We can also implement other popular models (like LSTM [19] and CapsuleNet [40]) as built-in models and share them across different applications. There are also multiple models for the same task. For example, InceptionNet [44], ResNet [16] and SqueezeNet [20] are all CNN models for image classification. However, they have different characteristics, where some models are more accurate but more resource hungry. Model selection is a research problem [28], which trades off between efficiency (i.e. speed and memory) and effectiveness (i.e. accuracy).
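One simple way to frame the efficiency and effectiveness trade-off is as constrained selection: pick the most accurate built-in model that fits a latency and memory budget. The sketch below assumes that formulation; the candidate names and all numbers are made up for illustration and are not measurements of any real network.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    accuracy: float    # validation accuracy in [0, 1]
    latency_ms: float  # inference time per request
    memory_mb: float   # memory footprint of the model


def select_model(candidates, max_latency_ms, max_memory_mb):
    """Return the most accurate candidate within the budget, or None."""
    feasible = [
        c for c in candidates
        if c.latency_ms <= max_latency_ms and c.memory_mb <= max_memory_mb
    ]
    if not feasible:
        return None
    return max(feasible, key=lambda c: c.accuracy)


# Hypothetical profile of three built-in image classifiers.
candidates = [
    Candidate("large_cnn", accuracy=0.95, latency_ms=80.0, memory_mb=500.0),
    Candidate("medium_cnn", accuracy=0.93, latency_ms=30.0, memory_mb=100.0),
    Candidate("small_cnn", accuracy=0.88, latency_ms=5.0, memory_mb=5.0),
]

chosen = select_model(candidates, max_latency_ms=50.0, max_memory_mb=200.0)
print(chosen.name if chosen else "no model fits the budget")
```

A domain expert would then only need to state the task and the deployment budget; the profiling numbers would come from the platform.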

The features related to usability for the three types of AI developers are summarized in Figure 5. The features from the inner circles benefit developers in the same circle and in the outer circles.

Figure 5: Optimization scope for different AI developers.

3.2 Cost-sensitive Modelling

The recent resurgence of AI is mainly driven by deep learning, which extends traditional machine learning models with more complex structures to increase their capability of modeling data. According to the statistics of a famous visual recognition challenge, ILSVRC, the number of layers of the annual winning model increased from 8 layers in 2012 to 152 layers in 2015. The number of layers of deep convolutional neural networks (CNN) has even reached one thousand [16]. The DeepForest [57] model stacks multiple random forests together. Models for text comprehension, including question answering, typically combine many recurrent neural networks with attention modeling [49]. New models, like CapsuleNet [40], are also very complex in terms of the operations and number of parameters. At the same time, the training dataset size is also increasing sharply. On the one hand, big datasets are required by complex models to avoid over-fitting. On the other hand, big datasets need large models to capture the complex data regularities. As a result, datasets and models affect each other, and both grow in size and complexity. We do see better performance (i.e. accuracy) as a consequence. However, we also notice the efficiency cost in terms of computation, memory and disk usage. Model compression [20] replaces some complex structures in the model architecture with simple ones. For instance, fully connected layers in CNNs are replaced with fully convolutional layers [33]. Square convolution filters are factorized into 1-dimensional convolution filters [44]. Bottleneck convolution layers are also widely used [44]. Architecture optimization (or search) for efficiency (without deteriorating the performance) is now mainly done based on experience and trial-and-error.

To train such a cost-sensitive model selected by the above process, reducing the high cost of processing training data is also an active research problem. There are three directions in general: reducing the processing cost per visit of a training sample; reducing the number of training samples; and reducing the number of visits per training sample (i.e. number of iterations). Feature hashing and embedding methods [13, 5] could be used to reduce the cost per training sample. Few-shot learning [11], meta-learning and transfer learning [38] are major solutions towards reducing the number of samples needed for training. Adaptive and importance-based sampling methods [12, 2] lead to faster convergence and fewer visits per training sample. In some extreme cases, samples which are not informative could even be removed from the training process without sacrificing model accuracy.

Genetic algorithms, reinforcement learning and model-based optimization [31] may help make some progress on architecture optimization. Researchers have been constructing more and more complex models in recent years. However, simple models usually have the advantage of good interpretability. Hence, new models with simple structures and comparable performance are also worth investigating. Convergence acceleration that reduces the total number of training iterations is a difficult problem. Approaches for some special models have been proposed, e.g. increasing the batch size [53]. It remains challenging for (asynchronous) distributed training due to gradient staleness [51] caused by communication delay.

Trading off between efficiency and effectiveness has never been easy. In practice, what to optimize depends on the expectation of the users or the requirement of the applications. It is also related to fairness and resource saving. Typically, the improvement at the final stage of training is very minor, e.g. accuracy rising from 99% to 99.2%. When the hardware resource is shared by multiple tenants as investigated by [28], the cluster administrator can stop such instances to release the GPU to train other users' models. When the model is running on cloud platforms, e.g. Amazon EC2, the running time is directly related to the fees. User expectation, application requirement, cost and fairness are metrics to consider for the stopping criteria.
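A stopping criterion of this kind can be as simple as checking whether recent evaluations still bring a meaningful gain. The sketch below is one possible rule under assumed thresholds (`min_gain` and `patience` are hypothetical parameters); it does not account for monetary cost or multi-tenant fairness, which would need additional inputs.

```python
def should_stop(history, min_gain=0.002, patience=3):
    """Return True when accuracy gained less than min_gain over the last
    `patience` evaluations, i.e. further training is unlikely to pay off.

    history: validation accuracies, one per evaluation, oldest first.
    """
    if len(history) <= patience:
        return False  # not enough evidence yet
    return history[-1] - history[-1 - patience] < min_gain


# Hypothetical accuracy curve: fast early progress, then a plateau near 99%.
history = [0.90, 0.95, 0.98, 0.990, 0.9905, 0.991, 0.9912]
print(should_stop(history))
```

A cluster administrator or cloud user could tighten or relax `min_gain` depending on how expensive the remaining training time is.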

3.3 Auto-tuning Models based on Knowledge-bases

Building a domain-specific knowledge base has been widely accepted as the foundation of conducting domain-specific analytics. However, there is no gold standard as to what kind of knowledge base should be constructed and how it should be utilized to improve the analytic model. Currently, a knowledge base is mostly used in simple tasks such as manual analytics and visualization. There is no doubt that a domain-specific knowledge base should be a valuable resource for all kinds of applications. However, it is still not clear how it can directly benefit applications based on complex models such as deep learning. Intuitively, using a domain-specific knowledge base to improve a machine learning model is a paradox: a knowledge base records how entities/features are related (usually in a qualitative manner), while machine learning models tend to learn those relations from the training data (usually in a quantitative manner). The main challenge of applying a knowledge base is how to balance the qualitative relations and quantitative relations. We believe that using the qualitative relations from a knowledge base as a prior distribution (i.e. a regularization term) could be a simple, general and feasible solution. Nowadays, typical regularization methods mostly act in a quantitative manner. For the healthcare system mentioned in Section 1, we designed a regularization term based on healthcare domain knowledge. However, the domain knowledge used there is limited to ontology knowledge, and the regularization method designed is limited to a certain kind of classification task. Logically, using knowledge for regularization requires research in two areas. One is to build a knowledge base that can clearly describe qualitative relations among features/samples. The other is to design regularization methods that can work in a qualitative manner.
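To make the idea of a knowledge-based regularization term concrete, the sketch below contrasts a plain L2 penalty with a penalty that pulls the weights of related features towards each other. This is an assumed, simplified form chosen for illustration and is not the regularizer designed by the authors; the feature names and the single "related" pair are hypothetical.

```python
import math


def l2_penalty(weights, lam=0.01):
    """Knowledge-free regularizer: penalize large weights."""
    return lam * sum(w * w for w in weights.values())


def knowledge_penalty(weights, related_pairs, lam=0.01):
    """Penalize differences between weights of features that the
    knowledge base marks as related (a qualitative relation turned
    into a quantitative term)."""
    return lam * sum((weights[a] - weights[b]) ** 2 for a, b in related_pairs)


# Hypothetical model weights, keyed by feature name.
weights = {"hba1c": 0.8, "glucose": 0.6, "age": 0.1}
# Hypothetical relation extracted from a knowledge base.
related_pairs = [("hba1c", "glucose")]

total_loss_term = l2_penalty(weights) + knowledge_penalty(weights, related_pairs)
print(total_loss_term)
assert math.isclose(knowledge_penalty(weights, related_pairs), 0.0004)
```

Either term would be added to the training loss; the knowledge-based term only requires the relation structure, not any numeric strength, which matches the qualitative nature of most knowledge bases.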

4 Model Deployment

Models are trained offline and then deployed on cloud platforms, dedicated servers or edge devices for online predictions. Most research focuses on model training. In fact, the deployment process is not any simpler than training. It involves much engineering work, e.g. fault tolerance and load balance. These are also interesting research topics.

4.1 Reliability and Interpretability

AI applications must go through a sequence of checks and validations before deployment. Once an application is deployed, we still need to monitor the performance, scale the throughput according to the demands, keep the load balance and recover nodes from failures. A one-step deployment service that combines and automates all these operations together is helpful. Besides automation, we highlight the importance of reliability and interpretability of models for the usability of model deployment.

Vertical domains like healthcare and finance have demanding requirements on the reliability of deployed applications. A simple solution is to monitor the performance and switch the working mode to human mode when the AI is uncertain about some requests. Most machine learning models are soft-margin based and provide a self-evaluation of their accuracy (e.g. the Softmax outputs in logistic regression and most deep neural networks). However, this self-evaluation is only accurate when there is no concept drift and the data characteristics exactly match the model assumptions. For example, for an application that is based on a Naive Bayes model, when the input features are highly correlated, the real accuracy will drop significantly while its self-evaluation will have almost 100% confidence about its prediction. Therefore, the self-evaluation may not be reliable. Designing a robust model to monitor the system performance is thus necessary. For example, we may continuously check the data distribution to see if its characteristics match the model assumptions. We can also collect feedback from users to evaluate the performance of the deployed model.
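The "switch to human mode" idea can be sketched as a confidence threshold on the Softmax output. The threshold value and the function names below are assumptions for illustration. As the paragraph above notes, this self-reported confidence can be misleading, so such a rule should be combined with independent monitoring rather than trusted on its own.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route(logits, threshold=0.9):
    """Return ("model", label, confidence) when the model is confident,
    otherwise ("human", None, confidence) to defer to a domain expert."""
    probs = softmax(logits)
    confidence = max(probs)
    if confidence < threshold:
        return ("human", None, confidence)
    return ("model", probs.index(confidence), confidence)


print(route([4.0, 0.0, 0.0]))  # one class clearly dominates
print(route([1.0, 0.9, 0.0]))  # two classes are close, so a human decides
```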

In addition to reliability, interpretation is also important. For example, doctors often ask the question: "How is the prediction generated?" Explanations are essential for the democratization of AI in critical applications. Most complex machine learning models work like a black box. Even for their designers, it is difficult to know the exact reason for every decision or prediction. Using black-box systems to make critical decisions or predictions could bring users a sense of distrust, violate regulatory requirements and put the domain practitioners in a competitive relationship with AI solutions. All these factors are harmful for the success of AI in these valuable applications. Explanations could significantly reduce outside resistance and hence ease the usage of AI systems. In healthcare applications, researchers working on computational phenotyping [56, 18, 45] are trying to find explainable risk factors from the models for healthcare problems. There is a trend of research on model interpretation [3, 43, 23, 27].

We aim to design a set of general mechanisms to make AI solutions more understandable for model designers, domain experts, regulators and end-users. For model designers, the explanation could help them to refine the model architecture and training process. For domain experts, the explanation could bring more insights and hence enhance the cooperative relationship. For regulators, the explanations could help them solve legal issues and build accountability systems. For end-users, explanations increase the quality of service and promote trust.

Working towards this direction, we use a neural network as a research prototype, evaluate the importance and meaning of each neuron and analyze how the neurons interact. Without loss of generality, this evaluation framework can be extended to any machine learning model whose data transformation process can be described as a graph (e.g. PGM and topic modeling). We conduct the evaluation via a novel concept called neuron saliency, which measures neuron efficiency in neural networks. By estimating neuron saliency, we are able to find out whether the basic unit of neural networks, namely the neuron, is contributing to the success of the model or of other neurons. We first unify the neural networks in neuron representation and introduce dropout optimization for neural networks. Then two methods are proposed to estimate neuron saliency efficiently, using dropout and gradient information respectively. Based on the neuron saliency, algorithms for optimizing the training of neural networks are developed, and in the meantime, a novel algorithm for model compression by dropping low-saliency neurons is introduced.
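The text does not give the exact definition of neuron saliency, so the following is only a rough ablation-style illustration of the dropout-based idea: a neuron's score is taken to be the increase in loss when that neuron alone is zeroed out. The tiny network, its weights and the data are hypothetical, and the scoring rule is an assumption rather than the authors' estimator.

```python
def forward(x, w_hidden, w_out, dropped=None):
    """One hidden ReLU layer with a linear output.

    `dropped` is the index of a hidden neuron to zero out, or None.
    """
    hidden = [max(0.0, sum(wi * xi for wi, xi in zip(w, x))) for w in w_hidden]
    if dropped is not None:
        hidden[dropped] = 0.0
    return sum(wo * h for wo, h in zip(w_out, hidden))


def mse(data, w_hidden, w_out, dropped=None):
    """Mean squared error of the network on (input, target) pairs."""
    errors = [(forward(x, w_hidden, w_out, dropped) - y) ** 2 for x, y in data]
    return sum(errors) / len(errors)


def dropout_saliency(data, w_hidden, w_out):
    """Score each hidden neuron by the loss increase caused by dropping it."""
    base = mse(data, w_hidden, w_out)
    return [mse(data, w_hidden, w_out, dropped=j) - base
            for j in range(len(w_hidden))]


# Hypothetical network computing y = x0 + x1; the third neuron is unused.
w_hidden = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
w_out = [1.0, 1.0, 0.0]
data = [([1.0, 2.0], 3.0), ([2.0, 1.0], 3.0), ([0.0, 1.0], 1.0)]

saliency = dropout_saliency(data, w_hidden, w_out)
print(saliency)
```

In this toy case the third neuron receives a score of zero, which marks it as a candidate for removal when compressing the model.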

4.2 Cost-sensitive Deployment

Efficiency or latency is more critical for the deployment stage than for other stages as this stage is online. For example, because of the strict latency requirement (less than 1ms) during database querying, the learned index [24] has to replace the inference code from a TensorFlow implementation with hand-crafted but well-optimized code, even for a simple neural network model. Optimization should be conducted at every level [3, 50], including hardware, compiler, code, algorithm and models. GPU is excellent for training, but it costs extra time to transfer data from CPU to GPU. Hence, FPGA [15] has been applied as a replacement. To make it usable for non-FPGA programmers, we need a library with optimized operations (e.g. convolution for deep learning models) on FPGA, and a tool to convert the model trained on GPU to work on FPGA. Compilers like XLA and Weld [37] are designed to optimize AI and data analytic operations. Model compression that reduces the memory and computation cost for deploying models on small devices is a hot research topic [7]. The Tensor Train line of work is an example of such a model reduction effort [36]. The challenge is how to compress the model without sacrificing the accuracy. A more challenging but preferred solution is to design a new and simpler model directly to replace current big models. For example, it is desirable to replace CNNs with a new model with good interpretability and lower computation and memory cost.

Figure 6: Illustration of data distribution drift.

In terms of effectiveness, the most obvious issue is the change of data distribution as illustrated by Figure 6. When the data distribution is evolving, the model should adjust to keep its performance.
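A very simple way to notice such a change is to compare a recent window of one input feature against a reference sample from training time. The sketch below uses a z-test on the window mean; this is an assumed heuristic that only catches shifts in the mean of a single feature, and the data and threshold are hypothetical.

```python
import math
import statistics


def mean_shift_detected(reference, recent, z_threshold=3.0):
    """Return True when the mean of `recent` deviates from the mean of
    `reference` by more than z_threshold standard errors."""
    mu = statistics.mean(reference)
    sigma = statistics.stdev(reference)
    if sigma == 0:
        return statistics.mean(recent) != mu
    standard_error = sigma / math.sqrt(len(recent))
    z = abs(statistics.mean(recent) - mu) / standard_error
    return z > z_threshold


# Hypothetical values of one feature at training time and after deployment.
reference = [1.0, 2.0, 3.0, 4.0, 5.0] * 4
print(mean_shift_detected(reference, [3.0, 2.0, 4.0, 3.0]))  # similar data
print(mean_shift_detected(reference, [6.0, 7.0, 6.0, 7.0]))  # shifted data
```

A detected shift could then trigger retraining, re-weighting of an ensemble, or a switch to human review.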

Continuous learning in the current deep learning context could be very challenging. If a single model is used for the prediction, the only choice here is to design a transfer learning or online learning model that can leverage the online data to refine itself. It is still not clear how a model trained using stochastic gradient descent can be efficiently updated or trained with stream data with a performance guarantee. An alternative solution that is commonly adopted in practice is ensemble modeling. Instead of using a single prediction model that may suffer from over-fitting and change of data distribution, ensemble modeling is a more robust solution since the final prediction is based on the output of several different models. However, simply averaging the results of multiple models can only result in a static robust model (i.e. less sensitive to over-fitting), which still cannot adapt to changes of data distribution such as concept drift. In many cases the best model that should be trusted depends on the data distribution of the online incoming data. To get the best performance from multiple models, inference based on real-time feedback is strongly required. For both solutions, we have to optimize the cost in terms of power consumption and storage as the target devices may be mobile phones or IoT devices.
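One standard way to let an ensemble follow real-time feedback is a multiplicative weight update, where each model's weight shrinks in proportion to its recent error. The sketch below uses that technique as an illustration; it is not claimed to be the mechanism used in PANDA, and the learning rate `eta` and the example predictions are hypothetical.

```python
import math


def weighted_predict(weights, predictions):
    """Combine the models' predictions using normalized weights."""
    total = sum(weights)
    return sum(w * p for w, p in zip(weights, predictions)) / total


def update_weights(weights, predictions, truth, eta=2.0):
    """Shrink each model's weight according to its squared error on the
    sample whose true outcome has just been observed."""
    return [w * math.exp(-eta * (p - truth) ** 2)
            for w, p in zip(weights, predictions)]


# Two hypothetical models; after drift, only the first one is still accurate.
weights = [1.0, 1.0]
predictions = [0.9, 0.1]  # predicted probability of the positive class
print(weighted_predict(weights, predictions))  # plain average before feedback

weights = update_weights(weights, predictions, truth=1.0)
print(weighted_predict(weights, predictions))  # leans towards the better model
```

Unlike static averaging, the combined prediction moves towards whichever model currently matches the incoming data.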

4.3 Security

Nowadays, many applications are deployed on cloud platforms, e.g. Amazon EC2. Users submit their requests to the cloud platforms for processing and then receive the prediction results. For such cases, we need to protect both the request (or query) data and the model to avoid leaking training data. To protect the request data, we have to encrypt it. Therefore, the models must accept encrypted data as input and generate encrypted predictions. Similar to the approaches for training over encrypted data, inference [52, 17] over encrypted data is mainly based on homomorphic encryption. Considering that the efficiency problem is more critical for inference than for training, approaches with fast inference speed are necessary. To protect the model, we typically add noise to prevent users from inferring some properties of the training data. For example, users can infer the membership of a certain data sample based on the prediction accuracy and confidence. In particular, if the model is over-fitting on the training data and is very confident about a test data sample, it is likely that this sample is included in the training dataset. However, adding noise into the prediction results would affect the accuracy of the model from the user's perspective. A research direction is to train a model with good generalization ability such that it performs equally well on both training and testing data. Then, we cannot infer the membership of the test data. In fact, it is a shared research goal from the perspective of security and machine learning training.
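The link between over-fitting and membership leakage can be illustrated with a confidence-threshold guess: a sample is guessed to be a training member when the model is very confident about it. The sketch below measures how often that guess is right. All confidence values are invented for illustration, and the threshold rule is a deliberately simple stand-in for real membership inference attacks.

```python
def attack_accuracy(member_conf, nonmember_conf, threshold=0.9):
    """Fraction of samples whose membership is guessed correctly by the
    rule 'confidence >= threshold means the sample was in the training set'."""
    correct = sum(c >= threshold for c in member_conf)
    correct += sum(c < threshold for c in nonmember_conf)
    return correct / (len(member_conf) + len(nonmember_conf))


# Hypothetical over-fitted model: much more confident on training members.
overfit = attack_accuracy([0.99, 0.98, 0.97], [0.60, 0.70, 0.95])

# Hypothetical well-generalized model: similar confidence on both groups.
generalized = attack_accuracy([0.92, 0.85, 0.95], [0.93, 0.84, 0.96])

print(overfit, generalized)
```

When the two confidence distributions overlap, the guess is no better than chance on this balanced example, which is the sense in which good generalization also protects membership.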

5 PANDA Solution

We have been developing systems towards resolving the issues discussed in this paper. We now describe the PANDA architecture, which we believe can address these issues.

5.1 Basic End-to-end Analytics Stack

Figure 7 shows the current stack of our systems. Healthcare is one of our primary applications. We are collaborating with multiple local hospitals, which give us the data and help to validate our results. CDAS [32] is a crowdsourcing system used by doctors to add their knowledge into the data, e.g. by labeling. DICE is a system for data integration that cleans raw EMR data based on expert-defined rules. epiC [22] is our large-scale batch processing engine. ForkBase [46] is our data storage engine designed with rich semantics and three key properties, namely immutability, sharing and security. (ForkBase is the second version of UStore [9], which has evolved substantially since the first implementation.) After labeling, cleaning and pre-processing, the data is fed into training or analytics engines. Apache SINGA [47] is a deep learning platform initiated by us, which focuses on memory and speed efficiency optimization. On top of SINGA, we have a platform called Rafiki [48], which provides training and deployment services for analytics tasks. CohAna [21] is a cohort analysis engine designed for tasks like customer churn analysis. iDat is a visualization tool for presenting the analytics results and for users to explore them. Other applications include finance data analytics and cyber-security.

Figure 7: An end-to-end analytics system stack.

5.2 Specific Challenges as Plug-ins

In the PANDA system, we aim to ease the development of AI applications when facing the aforementioned challenges. However, not all of them are shared by every application. Different applications could benefit from different workflows. Therefore, instead of building the solution for each challenge as a fixed step in the pipeline, we propose to build them as optional plug-ins.

Each plug-in consists of three basic components. We always build one simple default solution which is general and data insensitive. To resolve the specified challenges, a detector is developed to examine whether the inputs follow certain data characteristics. If the answer is true, a specific solution, which is typically data-driven, will be applied to replace the default solution.

We use the knowledge-based regularization module in the pipeline as an illustrative example. A detector examines whether a parameter set is correlated with a concept set. Once such a connection is verified, the regularization term is constructed based on the relations among the concepts in the knowledge base. The default solution, in contrast, is a simple L2-norm, which is applied to the many parameter sets that have no meaningful relation to existing concepts.
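The three components of a plug-in can be sketched as follows. This is a minimal illustration, not the PANDA implementation: the `PlugIn` class, the toy `RELATED` knowledge base and the pairwise-difference penalty are all hypothetical names and choices introduced here, since the text does not specify how the knowledge-based term is constructed.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Generic, TypeVar

X = TypeVar("X")  # plug-in input type
Y = TypeVar("Y")  # plug-in output type


@dataclass
class PlugIn(Generic[X, Y]):
    """Hypothetical plug-in: default solution, detector, specific solution."""
    default: Callable[[X], Y]
    detector: Callable[[X], bool]
    specific: Callable[[X], Y]

    def apply(self, inputs: X) -> Y:
        # The data-driven solution replaces the default only if the detector fires.
        if self.detector(inputs):
            return self.specific(inputs)
        return self.default(inputs)


# Toy knowledge base (assumption): pairs of concepts known to be related.
RELATED = {("diabetes", "hba1c")}


def l2_penalty(params: Dict[str, float]) -> float:
    # Default solution: plain squared L2-norm of the parameters.
    return sum(w * w for w in params.values())


def has_concept_links(params: Dict[str, float]) -> bool:
    # Detector: do any two parameters map to related concepts?
    return any(a in params and b in params for a, b in RELATED)


def knowledge_penalty(params: Dict[str, float]) -> float:
    # Specific solution (assumed form): pull weights of related concepts together.
    return sum((params[a] - params[b]) ** 2
               for a, b in RELATED if a in params and b in params)


regularizer = PlugIn(l2_penalty, has_concept_links, knowledge_penalty)
linked = regularizer.apply({"diabetes": 1.0, "hba1c": 0.5})  # specific solution
unlinked = regularizer.apply({"u": 3.0, "v": 4.0})           # default L2-norm
```

Because the detector and the specific solution are optional callables, a workflow that does not face a given challenge simply never triggers the corresponding plug-in.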

5.3 Key Modules

5.3.1 Application Driven Data Exploration

We propose to build an automatic feature and sample exploration model. Given a budget and a pricing model for the dataset, the goal of this model is to find an iterative and explorative data acquisition strategy that obtains the subset of data whose price is within the budget and which leads to the best application performance. This problem can be viewed as a generalization of active learning, which optimizes the application performance only by selecting data samples. If the data acquisition model can also decide which features to query for the data samples, there are more opportunities to reduce the data acquisition cost than with active learning alone. This process also differs from feature engineering, which aims to filter out noisy features: in our exploration model the filter is not only cost-sensitive, but is also designed to filter out as many unnecessary features as possible.
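To make the budgeted acquisition problem concrete, the sketch below covers only the feature side of it, using a greedy gain-per-cost heuristic. The heuristic, the function name `greedy_acquire`, the toy prices and the additive `toy_score` function are assumptions made for illustration; the text does not prescribe a particular strategy, and sample selection is left out.

```python
from typing import Callable, Dict, FrozenSet, List


def greedy_acquire(prices: Dict[str, float],
                   budget: float,
                   score: Callable[[FrozenSet[str]], float]) -> List[str]:
    """Iteratively buy the affordable feature with the best score gain per unit price.

    Assumes every price is strictly positive.
    """
    chosen: List[str] = []
    spent = 0.0
    current = score(frozenset())
    while True:
        best, best_ratio, best_score = None, 0.0, current
        for feature, price in prices.items():
            if feature in chosen or spent + price > budget:
                continue  # already bought, or it would exceed the budget
            candidate = score(frozenset(chosen + [feature]))
            ratio = (candidate - current) / price
            if ratio > best_ratio:
                best, best_ratio, best_score = feature, ratio, candidate
        if best is None:
            return chosen  # nothing affordable improves the score
        chosen.append(best)
        spent += prices[best]
        current = best_score


# Toy pricing model and application-performance estimate (both hypothetical).
PRICES = {"lab_test": 4.0, "age": 1.0, "mri": 10.0}
UTILITY = {"lab_test": 0.2, "age": 0.1, "mri": 0.25}


def toy_score(features: FrozenSet[str]) -> float:
    return 0.5 + sum(UTILITY[f] for f in features)


selected = greedy_acquire(PRICES, budget=6.0, score=toy_score)
```

In a real setting, `score` would be an estimate of application performance obtained by training on the data acquired so far, which is what makes the process iterative and explorative.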

Data quality management is also necessary, but it acts as a fundamental module that supports the data exploration process. Data exploration can be viewed as a set of cost-sensitive schemes for data acquisition in which data quality is managed proactively. Typical data quality management consists of evaluating data quality (or system performance) and optimizing it (e.g. by data cleaning) on a given dataset. However, what it can achieve is typically limited if the given dataset has fundamental imperfections. Instead of passively recovering noisy features or labels and inferring them from uninformative samples, data exploration looks for indicative features and learns only from representative and valuable samples.

5.3.2 Data Driven Model Selection

Data quality affects model performance directly. However, even when the data is well pre-processed, the model may still perform poorly. This is because, when applying a machine learning model to a dataset, we are implicitly making a set of assumptions about the characteristics of that dataset. For example, using a Naive Bayes model means assuming that the features of a data sample are conditionally independent given the label, and using logistic regression means assuming that the labels are generated from a linear combination of the features. In terms of regularization, using L2-norm regularization means assuming a Gaussian prior over the model parameters learned from the dataset. If there is a mismatch between the real characteristics of the dataset and the model assumptions, the performance suffers. This is the critical reason why we design or select different machine learning models for different applications. Based on this intuition, avoiding the mismatch between data characteristics and model assumptions could be an efficient way to automate model design and tuning.
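The link between L2-norm regularization and a Gaussian prior can be made explicit with a standard maximum a posteriori (MAP) argument. The symbols below (parameters w, dataset D, prior variance σ², regularization weight λ) are introduced here for illustration, and a zero-mean isotropic Gaussian prior is assumed.

```latex
\begin{aligned}
\hat{w}_{\mathrm{MAP}}
  &= \arg\max_{w} \; \log p(D \mid w) + \log p(w),
  \qquad p(w) = \mathcal{N}(w \mid 0, \sigma^{2} I) \\
  &= \arg\min_{w} \; -\log p(D \mid w) + \frac{1}{2\sigma^{2}} \lVert w \rVert_2^2 .
\end{aligned}
```

Under this assumption, the L2 penalty with weight λ = 1/(2σ²) is exactly the negative log-prior up to a constant, so a parameter set that is poorly described by such a prior is a case of model-data mismatch.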

For each model or processing step, we propose to build a matching evaluation that tests whether the input data follow the model's assumptions. If there is a mismatch, data transformations need to be applied. For example, if the model assumes that the data are linearly separable (e.g. a linear SVM) but they are not (this can be easily checked from the optimization result of the SVM), two alternatives should be recommended instead of directly applying the SVM: using feature engineering to pre-process the data so that it becomes linearly separable, or applying a kernel SVM, which does not assume that the data are linearly separable in the original feature space. Examining the match between data characteristics and model assumptions could significantly ease the model design and tuning process, or even automate it, since achieving such a match is one of the core principles of model design.
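A matching evaluation for the linear separability example might look like the sketch below. Note that it substitutes a perceptron convergence check for the SVM optimization result mentioned in the text, purely to keep the example dependency-free; the function names and the epoch budget are assumptions. Failing to converge within the budget only suggests, rather than proves, that the data are not linearly separable.

```python
from typing import List, Sequence


def perceptron_separates(X: Sequence[Sequence[float]],
                         y: Sequence[int],
                         max_epochs: int = 1000) -> bool:
    """Return True if a perceptron finds a separating hyperplane (labels in {-1, +1})."""
    w: List[float] = [0.0] * len(X[0])
    b = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin <= 0:
                # Misclassified point: apply the standard perceptron update.
                w = [wj + yi * xj for wj, xj in zip(w, xi)]
                b += yi
                mistakes += 1
        if mistakes == 0:
            return True  # a full pass with no mistakes: data are separable
    return False


def recommend(X: Sequence[Sequence[float]], y: Sequence[int]) -> str:
    # Matching evaluation: compare the data with the linear-separability assumption.
    if perceptron_separates(X, y):
        return "linear SVM"
    return "feature engineering or kernel SVM"


POINTS = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]
and_choice = recommend(POINTS, [-1, -1, -1, 1])  # AND labels: separable
xor_choice = recommend(POINTS, [-1, 1, 1, -1])   # XOR labels: not separable
```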

5.3.3 Reliable Answers for High Stake Applications

For most complex data analytics and decision making problems, such as financial investment [8], medical treatment [30] and self-driving systems [6], learning algorithms, while still under active development, are widely believed to be able to surpass human performance in the near future. Such emerging “high stakes” applications of AI place exacting demands on the reliability of deployed solutions. However, most of these applications rely on prevailing deep neural network models [26]. While these models may provide high prediction accuracy in the general case, they may be vulnerable to unexpected egregious errors [35, 34, 10], particularly when applied to data points that are not well represented in the training set. In some cases, deep learning models are no better than random guessing in regions lacking training points, and yet they predict with high confidence. For high stakes applications, every decision matters and such overconfident errors are unacceptable. Unfortunately, most deep learning models act like a black box without much explanation, and are hard to understand even for domain experts [54]. It is not practical to prevent such failures by manually examining the logic inside a deep learning solution. Consequently, developing deep learning solutions with reliable behaviour has attracted a great deal of interest.

Instead of answering all problems, we propose that a deep learning model should only be responsible for answering the problems it has been trained to answer, i.e. problems that lie in the reliable region. The reliable region is defined as a data distribution generalized from the training set on which the deep learning model can achieve performance as good as when tested on the training set. Reliable deep learning solutions are useful for many high stakes applications. For example, a model for CT image classification with 90% accuracy cannot be deployed in a clinical system where the typical minimum requirement is 95%. As an alternative, by applying a model that provides 99% accuracy inside a reliable region covering half of the patient images, the workload of radiologists could effectively be reduced by half. Similarly, developing an all-weather strategy for financial investment that significantly outperforms the market is usually not practical. However, if we can build a model which significantly outperforms the market within a small reliable region, safe arbitrage can be done when the opportunity arises (i.e. when the market state falls within the reliable region).
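The idea of answering only inside the reliable region can be sketched as a wrapper that abstains on inputs far from the training data. The text does not say how the reliable region is detected, so the nearest-neighbour distance test, the `radius` threshold and the name `make_reliable` are assumptions used only to illustrate the abstain-or-answer behaviour.

```python
import math
from typing import Callable, Optional, Sequence

Point = Sequence[float]


def make_reliable(predict: Callable[[Point], int],
                  train_points: Sequence[Point],
                  radius: float) -> Callable[[Point], Optional[int]]:
    """Wrap a predictor so that it abstains (returns None) outside the assumed region."""
    def guarded(x: Point) -> Optional[int]:
        # Proxy for the reliable region: within `radius` of some training point.
        nearest = min(math.dist(x, t) for t in train_points)
        if nearest > radius:
            return None  # defer to a human expert instead of guessing
        return predict(x)
    return guarded


# Toy model and training set (hypothetical).
TRAIN = [(0.0, 0.0), (1.0, 1.0)]
model = make_reliable(lambda x: int(x[0] + x[1] > 1.0), TRAIN, radius=0.5)

inside = model((0.1, 0.1))   # close to a training point: the model answers
outside = model((5.0, 5.0))  # far from all training points: the model abstains
```

In the CT example above, the cases on which the wrapper abstains would be the half of the images still routed to radiologists.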

6 Conclusions

There is no doubt that AI technologies will have great success in many vertical domains in the next few years. However, the mass production of AI poses many challenges for the current data analytics pipeline and its supporting system infrastructure, especially for critical decision making in domain specific problems. In this paper, we review some of the challenges concerning usability, efficiency, effectiveness and security in the data preparation, training and product delivery phases of an AI application. Compared with the great success recently achieved in modeling benchmark problems (e.g. in CV and NLP), these challenges are not well addressed by current AI research, yet they play a vital role in the development of practical domain specific AI solutions. We summarize several research directions and discuss some preliminary methods. We are developing an AI platform called PANDA to resolve the aforementioned issues and support fast development of domain specific applications. We hope to make AI more usable, explainable, and scalable.