1. Introduction
Railway points are mechanical installations that allow trains to be guided from one track to another. They are among the key components of railway infrastructure.
A railway junction is controlled jointly by one or more ends of points, which work together to set the routes of trains. In this paper, we use the term “a set of railway points” to refer to the entire mechanism at a railway junction.
Apart from delays and cancellations of trains, failures of points can also cause severe economic loss and casualties. Railway points account for almost half of all train derailments in the UK (Ishak et al., 2016). On the morning of 12 December 1988, the Clapham Junction rail crash (https://en.wikipedia.org/wiki/Clapham_Junction_rail_crash) killed 35 people and injured 484. More than 20% of incidents in the Sydney Trains rail network were caused by points failures. Keeping railway points safe and forecasting upcoming failures are vital tasks for reliable rail transportation.
Routine maintenance is usually performed on railway points to ensure their correct and reliable operation. Field engineers inspect and test the equipment at a fixed time interval. However, this strategy cannot catch rapid changes in equipment status. For example, points often degrade faster than usual in extreme weather and are therefore more likely to fail soon. Instead of relying on passive routine maintenance, we could benefit more from predictive maintenance, which flexibly arranges maintenance work according to the running condition of the equipment.
Forecasting failures is a critical step in predictive maintenance. Some research has been conducted on this topic (Camci et al., 2016; García Márquez et al., 2010; Oyebande and Renfrew, 2002; Tao and Zhao, 2015; Yilboga et al., 2010). Related work usually relies on delicate sensors to collect voltages, currents and forces. Installing sensors incurs costly labour and material expenses, and the sensors themselves may malfunction. Adding sensors to in-service equipment would also disrupt traffic, which is especially unacceptable for a large and busy rail network. These factors make prediction from sensor data expensive, or even infeasible. In contrast, one can easily collect heterogeneous data from other sources such as weather, movement logs and equipment details without any hardware upgrade.
Gathering available data from multiple sources enriches our knowledge of the working status of points, but it also brings extra problems. Firstly, data collected from different sources often come in incompatible formats, and they play different roles in revealing the condition of equipment. Secondly, there is no guarantee that the data are always intact, even for a single source. In most cases, we can only feed incomplete data into our model. Besides, our data were collected from 350 sets of railway points. They may be located in rural areas or city centres and, from another point of view, on bridges or in tunnels. They can also be of various types and made by different manufacturers. These factors add to the difficulty of designing models. To summarise, we are faced with three main challenges:

How to combine information from multiple sources efficiently and effectively?

How to deal with missing data?

How to consider the distinct and shared properties between different sets of railway points simultaneously?
To address these challenges, we propose a novel multiple kernel learning algorithm. Our method is based on the multiple kernel learning framework (Gönen and Alpaydın, 2011). Multiple kernel learning has attracted much attention over the last decade. It has been regarded as a promising technique for combining multiple data channels or feature subsets (Xu et al., 2010), which exactly meets our requirements. We apply different kernel mapping functions to data from different sources. We also concatenate all the data to form an additional kernel so that inter-source correlations can be found. An adaptive kernel weight, determined by both the properties of an individual set of railway points and the missing pattern of the data, makes our model robust and effective. The contributions of this paper are as follows:

We provide a universal framework to predict points’ failures with multi-source data. Our data are easy to obtain for most rail networks around the world without a hardware upgrade, so the framework could be used in many other rail networks.

Our work is the first to introduce a missing pattern adaptive kernel weight into the existing multiple kernel learning framework.

With a sample adaptive kernel weight, our model can capture the distinct and shared properties of different railway points.

We develop an optimisation algorithm for the proposed model. Through random feature approximation together with mini-batch gradient descent, the proposed method can be applied to large datasets.

We conduct experiments on a real-world dataset collected from a wide range of railway points over three years. The results show the effectiveness of our model.
The rest of this paper is organised as follows. Section 2 presents the related work. In Section 3, we describe our data and application. The proposed adaptive multiple kernel learning is detailed in Section 4. The experiment results are shown in Section 5. Finally, we conclude our work in Section 6.
2. Related Work
We give a brief introduction to failure prediction of railway points and the multiple kernel learning (MKL) algorithm.
2.1. Failure Prediction of Railway Points
Because railway points directly affect the capacity and reliability of rail transport, some research has been conducted on failure prediction of railway points (Camci et al., 2016; García Márquez et al., 2010; Oyebande and Renfrew, 2002; Tao and Zhao, 2015; Yilboga et al., 2010). Sensor data such as voltages, currents and forces were widely used in these works, collected either in laboratories or from on-site sensors. Such data require a high sampling rate, which leads to difficulties in both transmission and storage. Despite the success shown by these methods, they are impractical in real applications.
Few works have explored the prediction task with data from other sources. Weather plays a significant role in the probability of failure (Hassankiadeh, 2011), and has been used to predict the total number of failed turnout systems in a railway network (Wang et al., 2017). Note that this work cannot locate the exact faulty railway points; it only estimates the total number of failures in a large system. Apart from weather data, equipment logs are also valuable for foreseeing the failures of related equipment (Sipos et al., 2014). Logs can be generated by sensors, software applications and even maintenance records (Li et al., 2018), reflecting the working condition of a piece of equipment from a different view. In (Li et al., 2018), maintenance logs are used to forecast failures between two scheduled maintenance visits.
Many of the above-mentioned methods used support vector machines (SVM) (Chang and Lin, 2011) for their models. They mainly focused on data from one source. A natural extension is to use multiple kernel learning to formulate our multi-source problem and improve the performance.
2.2. Multiple Kernel Learning
Similar to deep neural networks, functions defined in a reproducing kernel Hilbert space (RKHS) can model highly nonlinear relationships. MKL further takes advantage of such functions by combining them wisely. Compared to deep neural networks, MKL enjoys better interpretability while requiring less training data, which is more in line with our fundamental requirements.
MKL searches for an optimal combination of kernel functions to maximise a generalised performance measure. It has been widely used in various regression and classification tasks (Bucak et al., 2014; Althloothi et al., 2014; Yeh et al., 2011; Liu et al., 2014b; Yang et al., 2012).
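For a sample $x_i$ that consists of $m$ feature subsets $x_i^{(1)}, \dots, x_i^{(m)}$, applying a mapping function to each subset gives:
$$\phi(x_i) = \big[\phi_1^\top(x_i^{(1)}), \dots, \phi_m^\top(x_i^{(m)})\big]^\top \tag{1}$$
where $\{\phi_p(\cdot)\}_{p=1}^{m}$ denote the feature mappings associated with $m$ predefined base kernels $\{\kappa_p(\cdot,\cdot)\}_{p=1}^{m}$. Given $n$ samples $\{x_i\}_{i=1}^{n}$ with the label $y_i \in \{-1, +1\}$ for $x_i$, the commonly used MKL can be formulated as the following convex optimisation problem (Rakotomamonjy et al., 2008):
$$\min_{\omega, b, \xi, \mu}\ \frac{1}{2}\sum_{p=1}^{m}\frac{\|\omega_p\|^2}{\mu_p} + C\sum_{i=1}^{n}\xi_i \quad \text{s.t.}\ \ y_i\Big(\sum_{p=1}^{m}\omega_p^\top\phi_p(x_i^{(p)}) + b\Big) \ge 1 - \xi_i,\ \ \xi_i \ge 0,\ \ \|\mu\|_1 = 1,\ \ \mu \ge 0 \tag{2}$$
where $\|\cdot\|$ is the Euclidean norm for vectors, $\omega = \{\omega_p\}_{p=1}^{m}$ are the weight vectors for the mapped features $\phi_p(\cdot)$, and $\mu = [\mu_1, \dots, \mu_m]^\top$ contains the weights for the combination of base kernels. For the $\ell_1$ norm of the kernel weights, $\|\mu\|_1 = \sum_{p=1}^{m}|\mu_p|$. $b$ is the bias term and $C$ is a regularisation parameter for $\xi = [\xi_1, \dots, \xi_n]^\top$, which consists of slack variables. The decision score of the classifier on a sample $x$ is given by:
$$f(x) = \sum_{p=1}^{m}\omega_p^\top\phi_p(x^{(p)}) + b \tag{3}$$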
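As a rough illustration of how MKL compares two samples, the sketch below forms a weighted sum of per-channel base kernels. The names `dot` and `combined_kernel`, the two toy channels and the weights are assumptions made for illustration only; they are not taken from the paper.

```python
def dot(x, y):
    """Linear base kernel: inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, y))


def combined_kernel(x_channels, y_channels, base_kernels, weights):
    """Weighted sum of base kernels, one kernel per feature subset (channel)."""
    return sum(
        w * k(xc, yc)
        for w, k, xc, yc in zip(weights, base_kernels, x_channels, y_channels)
    )


# Two toy samples, each split into two channels (hypothetical values).
x = [[1.0, 2.0], [3.0]]
y = [[1.0, 0.0], [2.0]]
value = combined_kernel(x, y, [dot, dot], [0.5, 0.5])
```

Learning the weights jointly with the classifier is what distinguishes MKL from using a fixed sum of kernels.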
Many variants of MKL have been proposed to improve its accuracy. A natural extension is to change the $\ell_1$-norm constraint on the kernel weights to an $\ell_p$-norm as in (Kloft et al., 2009). Algorithms in (Kloft et al., 2011) further simplified the optimisation procedure by adopting a closed-form solution for the kernel weights. In (Liu et al., 2014a), a binary vector was introduced for every sample to switch base kernels on and off; the resulting optimisation problem was an integer linear programming problem. The work in (Gönen and Alpaydin, 2008) put forward a localised MKL algorithm, which utilised a gating model for selecting the appropriate kernel function locally. A convex variant was presented in (Lei et al., 2016) together with corresponding generalisation error bounds.
Another branch of studies focuses on improving the efficiency and scalability of MKL. The work in (Sonnenburg et al., 2006) addressed the special scenario in which feature maps are sparse and can be explicitly computed. Combined with chunking optimisation, it was able to deal with large volumes of data. The work in (Rakotomamonjy and Chanda, 2014) improved the scalability of MKL by using the Nyström method to approximate the kernel matrices and a proximal gradient algorithm for optimisation. Some research has also addressed the situation in which the number of kernels to be combined is very large (Afkanpour et al., 2013). Besides, many online methods for MKL were proposed recently (Shen et al., 2018; Shen and Chen, 2018; Li et al., 2017; Sahoo et al., 2014). Random feature approximation (Rahimi and Recht, 2008) is popular among these methods.
Except for the work in (Liu et al., 2015), most research on multiple kernel classification assumes that all kernels are complete, which is not true in our problem. The method in (Liu et al., 2015) cannot be scaled up to fit our dataset, and it treats different missing patterns equally at test time, which we argue is insufficient. These observations inspire us to design a new algorithm that can handle a large dataset, deal with each missing pattern accordingly, and treat each group of samples adaptively.
3. Problem Description
In this section, we describe our data and application. Figure 1 shows the workflow of our method.
3.1. Data Description
We collected railway points’ equipment details, maintenance logs, movement logs and failure history from the Sydney Trains database for the period from 01/01/2014 to 30/06/2017. These data were collected from 350 sets of railway points spread over a large area. We also downloaded weather data for the same time span from the Australian Bureau of Meteorology (www.bom.gov.au/climate/data/). Below we introduce their formats and features.
3.1.1. Infrastructure Failure Management System Database
Infrastructure Failure Management System (IFMS) Database stores failures of assets in Sydney Trains with timestamps. We extracted points’ failures as part of our ground truth.
3.1.2. Equipment Details
Equipment details record the detailed parameters of every set of railway points, including Points ID, Manufacturer, Type and so on. A piece of data is presented in Figure 1(a). We use ”” to denote missing values. With the help of domain experts, we selected a subset of features from these columns, all of which are categorical variables, and applied one-hot encoding to them.
3.1.3. Maintenance Logs
Maintenance logs contain formatted records of historical maintenance on railway points. A subset of categorical features was extracted from them on the advice of domain experts. A piece of data is presented in Figure 1(b).
3.1.4. Movement Logs
Movement logs were automatically generated by the Sydney Trains control system in real time. This system recorded state changes of the railway points with timestamps in seconds. A piece of data is shown in Figure 1(c); only some of the event types are listed. Failures are reported in the logs as well. Some failures that appeared in the movement logs did not appear in the IFMS database because the points recovered soon and no significant incident resulted. They were still real failures, so we included them in our ground truth. Sometimes workers tested the points for preventative maintenance, which also generated failure logs. We ignored these failures to keep the ground truth clean.
3.1.5. Weather
Weather data were retrieved from the Australian Bureau of Meteorology. Our data were gathered from railway points spread over a large area, so weather conditions may vary among them. Our strategy was to download data from the nearest weather station according to the longitudes and latitudes provided by the equipment details. Some weather stations were closed for a while, and in some situations we could not find another station to substitute for them. Some points also lack geocoordinates in the Sydney Trains system. Both cases lead to missing weather data. Figure 1(d) shows a piece of weather data.
3.2. Problem Formulation
With the data described above, we now formulate the prediction task, which is essentially a classification task. Since our data were generated from multiple sources, they come in different formats and sampling frequencies. The two most important questions are how to aggregate the data from multiple sources and how to label them according to the failure records.
Grouping and labelling data by day is an intuitive choice. However, our data are highly imbalanced in label distribution. Failures occurred on about 4200 days, while our data cover 454237 days summed over all railway points. Labelling failure days as “1” would therefore produce a dataset with only 0.9% positive samples. Such an imbalanced dataset would degrade the performance of the classifier.
Sydney Trains’ timetable shows cyclic patterns following calendar weeks (Gong et al., 2018), which imposes a periodic effect on our data as well. Therefore, we grouped our data by calendar week. We gave label “1” to a week if any failure was recorded in the IFMS database or the movement logs during that week. As a result, our task is to predict whether any failure will occur at any time in the next week, based on the weather conditions and movement logs of the current week and the maintenance logs from the 35 days before the next week. We extended the time range for maintenance logs to 35 days because maintenance was often performed at monthly intervals. We also incorporated equipment details, which are in general independent of time. Figure 3 illustrates our data aggregation and labelling strategy. After some data cleaning, we finally generated 58833 samples, including 3900 positive samples.
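The weekly grouping and labelling described above can be sketched as follows. This is a minimal illustration under stated assumptions: weeks are keyed by their Monday, failures are given as plain dates for a single set of points, and the function names (`week_start`, `label_weeks`, `build_samples`) and the toy dates are hypothetical rather than part of the original pipeline.

```python
from datetime import date, timedelta


def week_start(d):
    """Return the Monday of the calendar week that contains date d."""
    return d - timedelta(days=d.weekday())


def label_weeks(failure_dates, first_day, last_day):
    """Map each calendar week (keyed by its Monday) to 1 if any failure fell in it."""
    labels = {}
    monday = week_start(first_day)
    while monday <= last_day:
        labels[monday] = 0
        monday += timedelta(days=7)
    for d in failure_dates:
        wk = week_start(d)
        if wk in labels:
            labels[wk] = 1
    return labels


def build_samples(labels):
    """Pair the features of each week with the label of the following week."""
    weeks = sorted(labels)
    samples = []
    for this_week, next_week in zip(weeks, weeks[1:]):
        samples.append({
            "feature_week": this_week,  # weather and movement logs come from here
            "maintenance_from": next_week - timedelta(days=35),  # 35-day window
            "label": labels[next_week],
        })
    return samples


# Toy failure history for one set of points (hypothetical dates).
failures = [date(2014, 1, 15), date(2014, 1, 16)]
labels = label_weeks(failures, date(2014, 1, 6), date(2014, 1, 26))
samples = build_samples(labels)
```

With the figures reported above, 3900 of 58833 weekly samples are positive, about 6.6%, compared with roughly 0.9% under daily labelling.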
Notice that movement logs are lost in some cases, for example due to maintenance work. In these cases, upon agreement with the domain experts, we referred only to the IFMS database for failure indicators.
4. Methodology
4.1. Feature Extraction and Partition
Although we have grouped our data according to the above criterion, we need to flatten them further to form feature vectors. For equipment details and maintenance logs, we selected some columns following the advice of domain experts and then applied one-hot encoding. If a sample had more than one maintenance record, we summed the encoded features. For movement logs, we extracted statistical features for every day, such as the mean, variance and count of movements. Because there are 7 days per week, we have 7 subsets of features for movement logs. Similarly, we have 7 subsets per week for weather data. This strategy is shown in Figure 1. Such a partition lets us easily handle the missing pattern in a daily format, as we will describe in detail in the next section. Table 1 summarises the missing percentages of our data after this feature partition.
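The flattening step can be pictured with the short sketch below. It assumes that each day of movement logs has already been reduced to a list of numeric values (for example movement durations) and that a missing day is marked with `None`; the helper names, the maintenance vocabulary and the toy values are all hypothetical.

```python
from statistics import mean, pvariance


def one_hot_sum(records, vocabulary):
    """Sum the one-hot vectors of all records, e.g. maintenance types in the window."""
    counts = [0] * len(vocabulary)
    for r in records:
        counts[vocabulary.index(r)] += 1
    return counts


def daily_channel(values):
    """Mean, variance and count for one day of logs; None marks a missing channel."""
    if values is None:
        return None
    return [mean(values), pvariance(values), len(values)]


def weekly_channels(week):
    """week holds 7 entries (Monday to Sunday); each is a non-empty list or None."""
    return [daily_channel(day) for day in week]


maintenance = one_hot_sum(
    ["inspect", "adjust", "inspect"], ["inspect", "lubricate", "adjust"]
)
week = [[2.0, 4.0], None, [3.0], [1.0, 1.0], None, [5.0], [2.0, 2.0, 2.0]]
channels = weekly_channels(week)
```

Keeping each day as its own channel means that a lost day only removes one feature subset instead of invalidating the whole week.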
There are 16 feature subsets in total. By applying different kernel functions to different subsets, we can formulate our task as a multiple kernel learning problem for binary classification. In order to learn the interaction among feature subsets, we also concatenated all feature subsets to form a long vector and applied a kernel function on it. Finally, we would get 17 kernels as our inputs. We term these feature subsets channels.
The missing rate of each individual channel is not very high, but 44% of our samples are missing at least one channel. It is therefore imperative to build a model that is suitable for such data.
4.2. Select Kernel Functions
After one-hot encoding, the features generated from equipment details and maintenance logs are often very sparse. We thus directly used a linear kernel for these two data channels, as recommended in the literature (Li et al., 2015; Fan et al., 2008). For the remaining data channels, which consist of the weather and movement logs of 7 days, we applied the commonly used radial basis function (RBF) kernel. In rare cases, some channels of a sample were only partially missing; we then filled the missing part with mean values.
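A minimal sketch of these kernel choices is given below. The RBF parametrisation `exp(-||x - y||^2 / (2 * sigma^2))`, the function names and the toy vectors are assumptions made for illustration; the paper does not spell out these details at this point.

```python
import math


def linear_kernel(x, y):
    """Linear kernel for sparse one-hot channels."""
    return sum(a * b for a, b in zip(x, y))


def rbf_kernel(x, y, sigma):
    """RBF kernel with bandwidth sigma (assumed parametrisation)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def fill_partial(x, channel_means):
    """Replace missing entries (None) of a partially observed channel with the means."""
    return [m if v is None else v for v, m in zip(x, channel_means)]


# Hypothetical one-hot equipment features and 4-dimensional daily weather features.
sparse_a = [1, 0, 0, 1]
sparse_b = [1, 0, 1, 1]
weather_a = fill_partial([20.0, None, 3.5, 0.0], [18.0, 60.0, 4.0, 1.2])
weather_b = [20.0, 60.0, 3.5, 0.0]
k_lin = linear_kernel(sparse_a, sparse_b)
k_rbf = rbf_kernel(weather_a, weather_b, sigma=2.0)
```

Mean filling is applied only inside a partially observed channel; a channel that is missing entirely is left to the missing-pattern mechanism.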
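Table 1. Missing rate and feature dimension of each data channel.
Data | Channel | Missing Rate | Feature Dimension
Equipment Details | | 0% | 450
Maintenance Logs | | 13% | 365
Movement Logs | Monday | 5% | 30
Movement Logs | Tuesday | 6% | 30
Movement Logs | Wednesday | 5% | 30
Movement Logs | Thursday | 5% | 30
Movement Logs | Friday | 7% | 30
Movement Logs | Saturday | 8% | 30
Movement Logs | Sunday | 10% | 30
Weather | Monday | 26% | 4
Weather | Tuesday | 26% | 4
Weather | Wednesday | 26% | 4
Weather | Thursday | 25% | 4
Weather | Friday | 25% | 4
Weather | Saturday | 25% | 4
Weather | Sunday | 25% | 4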
4.3. Missing Pattern Adaptive Multiple Kernel Learning
To work with missing channels, a straightforward way is to learn separate kernel weights for each missing pattern. However, the number of possible missing patterns grows as $2^m$ with $m$ channels, so the data may not cover every pattern. Besides, the data for one pattern can be scarce and may contain only one type of label. Such a strategy also ignores the relationship between missing patterns. A more suitable choice is to adjust the kernel weights according to the missing patterns.
In order to allow adaptive kernel combination, we first modify the decision function for a sample $x_i$ with $m$ channels into the following form:
$$f(x_i) = \sum_{p=1}^{m}\mu_p(s_i)\,\langle\omega_p, \phi_p(x_i^{(p)})\rangle + b \tag{4}$$
where $\langle\cdot,\cdot\rangle$ denotes the inner product of vectors and
$$\mu_p(s_i) = \sum_{q=1}^{2m}\langle v_p, v_q\rangle\, s_{ip}\, s_{iq} \tag{5}$$
where $s_i \in \{0,1\}^{2m}$ is a binary vector generated by one-hot encoding of the missing pattern of sample $x_i$. We introduce $V = [v_1, \dots, v_{2m}] \in \mathbb{R}^{k\times 2m}$ with latent dimension $k$ as the embedding matrix for missing patterns. By Eq. (5), we express the kernel weights as a second-order polynomial mapping of the missing pattern, with coefficients given by the inner products of the related vectors in $V$. We give a simple example to explain how $s_i$ is generated. Assume we have 3 data channels and, for a sample $x_i$, the second one is missing; then:
$$s_i = [1, 0, 1, 0, 1, 0]^\top \tag{6}$$
The first and third “1” mean that the first and third feature subsets are available for this sample. The fifth “1” serves as a complementary feature for the missing channel 2. In this way, the absence of a channel makes its own kernel weight zero and also influences the kernel weights of the other, present channels.
The motivation behind this is that we want to collect information from the missing pattern of each sample. Eq. (5) also indicates that the kernel weight for a channel is decided by “seeing” whether the data of the other channels exist.
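The encoding of a missing pattern and the resulting kernel weights can be sketched as below. The summation in `kernel_weights` is one reading of the second-order mapping described above (each weight is the presence indicator of its channel times a sum of inner products over the active pattern entries), so it should be treated as an assumption; the embedding values in `V_columns` are made up for illustration.

```python
def encode_pattern(present):
    """Binary pattern of length 2m: presence flags followed by their complements."""
    return [1 if p else 0 for p in present] + [0 if p else 1 for p in present]


def kernel_weights(V_columns, s):
    """Weight of channel p: s[p] * sum over q of <v_p, v_q> * s[q] (assumed form)."""
    m = len(s) // 2
    weights = []
    for p in range(m):
        total = 0
        for q in range(len(s)):
            inner = sum(a * b for a, b in zip(V_columns[p], V_columns[q]))
            total += inner * s[p] * s[q]
        weights.append(total)
    return weights


# Three channels with the second one missing, as in the example in the text.
s = encode_pattern([True, False, True])

# Hypothetical embedding with latent dimension 2: one column per pattern entry.
V_columns = [[1, 0], [0, 1], [1, 1], [0, 0], [1, 0], [0, 0]]
mu = kernel_weights(V_columns, s)
```

Here the missing second channel receives weight zero, while the complementary fifth entry still contributes to the weights of the first and third channels.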
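With notation similar to Eq. (2), the optimisation problem after introducing the adaptive kernel weight can be expressed as:
$$\min_{\omega, b, \xi, V}\ \frac{1}{2}\sum_{p=1}^{m}\|\omega_p\|^2 + C\sum_{i=1}^{n}\xi_i + \frac{\lambda}{2}\|V\|_F^2 \quad \text{s.t.}\ \ y_i f(x_i) \ge 1 - \xi_i,\ \ \xi_i \ge 0,\ \ i = 1, \dots, n \tag{7}$$
where $f(x_i)$ is the decision function in Eq. (4), $C$ and $\lambda$ are two regularisation parameters, and $\|\cdot\|_F$ denotes the Frobenius norm. We add a regularisation term for $V$ to prevent it from being arbitrarily scaled up due to the norm constraint on $\omega$.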
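Theorem 4.1.
Adopting the adaptive kernel weight in Eq. (5) guarantees a positive semidefinite kernel for MKL.
Proof.
For fixed $V$, one can obtain the dual form of Eq. (7):
$$\max_{\alpha}\ \mathbf{1}^\top\alpha - \frac{1}{2}(\alpha\circ y)^\top K(\alpha\circ y) \quad \text{s.t.}\ \ \alpha^\top y = 0,\ \ 0 \le \alpha \le C\mathbf{1} \tag{8}$$
where $\circ$ denotes the elementwise product of vectors, $\mathbf{1}$ is a vector of all ones and $y = [y_1, \dots, y_n]^\top$. $K$ is given by:
$$K = \sum_{p=1}^{m}(\mu_p\mu_p^\top)\circ K_p, \qquad \mu_p = (S^\top e_p)\circ(S^\top V^\top V e_p) \tag{9}$$
where $\circ$ between matrices stands for the Hadamard product, $S = [s_1, \dots, s_n] \in \{0,1\}^{2m\times n}$ with each column vector $s_i$ denoting the missing pattern of sample $x_i$, $e_p$ is a length-$2m$ indicator vector with only the $p$th element equal to 1, and $K_p$ is the kernel matrix related to the mapping $\phi_p$. Each $\mu_p\mu_p^\top$ is a rank-one positive semidefinite matrix, so by the Schur product theorem (Zhang, 2006) every term $(\mu_p\mu_p^\top)\circ K_p$, and hence their sum $K$, is positive semidefinite. ∎
Theorem 4.1 shows that our adaptive kernel weight is sound in theory, but the problem is hard to solve in the dual form because of the complicated form of $K$ in Eq. (9).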
4.4. Sample Adaptive Multiple Kernel Learning
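If we train a unified model for all sets of railway points, we may ignore some of their peculiarities even though we have included equipment details as features. Training a separate model for each set of railway points performed even worse in our initial experiments. These observations motivated us to modify our model so that it can be adjusted to fit each set of railway points. We revise the kernel weight in Eq. (5) into the following form for a sample $x_i$:
$$\mu_p(s_i) = \sum_{q=1}^{2m}\langle v_p\circ a_i, v_q\rangle\, s_{ip}\, s_{iq} \tag{10}$$
where we add a new vector $a_i \in \mathbb{R}^{k}$ to represent the unique features of the set of railway points that generated sample $x_i$.
Relating Eq. (10) to Eq. (4), we observe that the term $s_{ip}$ can be omitted from Eq. (10) if we set the mapping $\phi_p(x_i^{(p)})$ to a zero vector for missing channels. Thus we omit $s_{ip}$ for simplicity of notation. If we have $r$ sets of railway points, we introduce $A \in \mathbb{R}^{k\times r}$, with $r$ the total number of sets of railway points. Each column vector in $A$ stands for the features of one set of railway points. Let $g(\cdot)$ be the mapping that maps $x_i$ to the index of the set of railway points that generated $x_i$, so that $a_i = A e_{g(x_i)}$ with $e_{g(x_i)}$ a length-$r$ indicator vector. Eq. (10) can then be written in matrix form for sample $x_i$:
$$\mu(s_i) = V_{1:m}^\top\big((A e_{g(x_i)})\circ(V s_i)\big) \tag{11}$$
where $V_{1:m}$ holds the first $m$ columns of $V$ and $\mu(s_i) = [\mu_1(s_i), \dots, \mu_m(s_i)]^\top$.
With $\mu(s_i)$ given in Eq. (11), the corresponding optimisation problem becomes:
$$\min_{\omega, b, \xi, V, A}\ \frac{1}{2}\sum_{p=1}^{m}\|\omega_p\|^2 + C\sum_{i=1}^{n}\xi_i + \frac{\lambda}{2}\|V\|_F^2 + \frac{\gamma}{2}\|A - \mathbf{1}\|_F^2 \quad \text{s.t.}\ \ y_i f(x_i) \ge 1 - \xi_i,\ \ \xi_i \ge 0,\ \ i = 1, \dots, n \tag{12}$$
where $\gamma$ is a regularisation parameter and $\mathbf{1}$ is a matrix of shape $k\times r$ containing all ones. Notice that when $A$ is a matrix of all ones, Eq. (10) reduces to Eq. (5). In other words, when $\gamma$ is large enough, the two models are equivalent. This regularisation term keeps the variation of the models among different sets of railway points at an appropriate level. One can also prove that such adaptive weights retain a positive semidefinite kernel.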
4.5. Optimisation
As mentioned before, Eq. (7) and Eq. (12) are hard to optimise in the dual form. Moreover, precomputing the 17 kernel matrices for such a large dataset would not fit into memory. Thanks to the random feature (RF) approximation (Rahimi and Recht, 2008), we can take an explicit form of the mapped features and hence avoid calculating the kernel matrices. This also facilitates optimisation in the primal, which is much simpler. Given $x \in \mathbb{R}^{d}$ and a predefined parameter $D$, the mapped features associated with an RBF kernel can be approximated by:
$$\phi(x) \approx \frac{1}{\sqrt{D}}\big[\cos(Zx)^\top, \sin(Zx)^\top\big]^\top \tag{13}$$
where the entries of $Z \in \mathbb{R}^{D\times d}$ are drawn i.i.d. from a Gaussian distribution $\mathcal{N}(0, \sigma^{-2})$, with $\sigma$ the bandwidth of the RBF kernel. Many variants of RF approximation have been proposed in the literature. Here we implement Fastfood (Le et al., 2013) for its simplicity and efficiency in memory usage.
Our optimisation problem can be rewritten in the following form with the hinge loss $\ell(y, f) = \max(0, 1 - yf)$:
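$$\min\ J = \frac{1}{2}\sum_{p=1}^{m}\|\omega_p\|^2 + C\sum_{i=1}^{n}\ell\big(y_i, f(x_i)\big) + \frac{\lambda}{2}\|V\|_F^2 + \frac{\gamma}{2}\|A - \mathbf{1}\|_F^2 \quad \text{w.r.t.}\ \ \omega, b, V, A \tag{14}$$
With $\mu(s_i)$ defined in Eq. (11), we can calculate the subgradients with respect to these variables and get:
$$\frac{\partial J}{\partial\omega_p} = \omega_p - C\sum_{i\in\mathcal{S}} y_i\,\mu_p(s_i)\,\phi_p(x_i^{(p)}) \tag{15}$$
$$\frac{\partial J}{\partial b} = -C\sum_{i\in\mathcal{S}} y_i \tag{16}$$
$$\frac{\partial J}{\partial V} = \lambda V - C\sum_{i\in\mathcal{S}} y_i\,\mathrm{diag}(a_{g(x_i)})\,V\,(s_i h_i^\top + h_i s_i^\top) \tag{17}$$
$$\frac{\partial J}{\partial a_j} = \gamma(a_j - \mathbf{1}) - C\sum_{i\in\mathcal{S}\cap\mathcal{I}_j} y_i\,(V h_i)\circ(V s_i) \tag{18}$$
where $\mathcal{S}$ is the index set of support vectors, i.e. the samples whose margin $y_i f(x_i)$ is below 1, and $\mathcal{I}_j$ is the index set of samples generated by railway points $j$. Here $a_j$ is the $j$th column of $A$, and $h_i \in \mathbb{R}^{2m}$ collects the channel scores $h_{ip} = \langle\omega_p, \phi_p(x_i^{(p)})\rangle$ for $p \le m$, with zeros in the remaining entries.
With the gradients calculated as in Eq. (15) to Eq. (18), we adopted mini-batch gradient descent for optimisation. We trained the models for 50 epochs with a constant learning rate and a batch size of 256. The cost of computing the gradients depends on the dimensions of the random features of the kernel mappings and grows linearly with the batch size, so each update can be computed efficiently. We summarise the training process in Algorithm 1.
Table 2. Statistics of our datasets.
Dataset | #instances | #failures | #railway points | #incomplete instances
Points_All | 58833 | 3900 | 350 | 25942
Points_Subset | 905 | 183 | 5 | 98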
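To make the optimisation recipe more concrete, the sketch below combines an explicit random feature map with mini-batch subgradient descent on a hinge loss. It is deliberately simplified and rests on several assumptions: it uses plain random Fourier features rather than Fastfood, it trains a single linear model on one channel instead of the full adaptive MKL objective, and all names (`make_random_features`, `train_hinge`), hyperparameters and toy data are hypothetical.

```python
import math
import random


def make_random_features(dim_in, n_freq, sigma, rng):
    """Return phi(x) = [cos(Zx), sin(Zx)] / sqrt(n_freq) with Z_ij ~ N(0, 1/sigma^2)."""
    Z = [[rng.gauss(0.0, 1.0 / sigma) for _ in range(dim_in)] for _ in range(n_freq)]
    scale = 1.0 / math.sqrt(n_freq)

    def phi(x):
        proj = [sum(z * v for z, v in zip(row, x)) for row in Z]
        return [scale * math.cos(t) for t in proj] + [scale * math.sin(t) for t in proj]

    return phi


def train_hinge(features, labels, C=1.0, lr=0.1, epochs=50, batch_size=256, seed=0):
    """Mini-batch subgradient descent on 0.5 * ||w||^2 + C * sum of hinge losses."""
    rng = random.Random(seed)
    n, d = len(features), len(features[0])
    w, b = [0.0] * d, 0.0
    order = list(range(n))
    for _ in range(epochs):
        rng.shuffle(order)
        for start in range(0, n, batch_size):
            batch = order[start:start + batch_size]
            # Each batch carries its share of the regulariser's gradient.
            grad_w = [wj * len(batch) / n for wj in w]
            grad_b = 0.0
            for i in batch:
                score = sum(wj * fj for wj, fj in zip(w, features[i])) + b
                if labels[i] * score < 1.0:  # margin violated: hinge loss is active
                    for j in range(d):
                        grad_w[j] -= C * labels[i] * features[i][j]
                    grad_b -= C * labels[i]
            w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
            b -= lr * grad_b
    return w, b


# Toy usage with hypothetical 2-dimensional inputs.
rng = random.Random(0)
phi = make_random_features(dim_in=2, n_freq=64, sigma=1.0, rng=rng)
X = [[0.0, 0.0], [0.2, 0.1], [2.0, 2.0], [2.1, 1.9]]
y = [-1, -1, 1, 1]
w, b = train_hinge([phi(x) for x in X], y, batch_size=2)
```

Because the features are explicit, memory use scales with the feature dimension rather than with the square of the number of samples, which is the property the primal approach relies on.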
5. Experiments
Our data were collected from 350 sets of railway points from 01/01/2014 to 30/06/2017, together with the corresponding weather data downloaded from the Australian Bureau of Meteorology. There are 58833 samples, including 3900 failures. We named this dataset Points_All. We also built a subset consisting of data from the 5 most “vulnerable” sets of railway points, i.e. those with the most failure samples, and named it Points_Subset. These datasets are imbalanced in label distribution. We tried weighting the classes in training but saw no performance gains, so we did not adopt this strategy. Table 2 summarises the statistics of our datasets.
5.1. Baselines, Evaluation Metrics and Parameter Setting
To show the effectiveness of our approach, we conducted experiments on the following methods.

MKLZF is the $\ell_p$-norm MKL method solved by the algorithm in (Kloft et al., 2011), with absent channels filled by zeros. We ran experiments over a range of values of $p$.

MKLMF is similar to MKLZF but with absent channels filled by the averages.

Absent Multiple Kernel Learning (AMKL) (Liu et al., 2015) is a state-of-the-art method for MKL with missing kernels. We only compared with AMKL on Points_Subset because it cannot be scaled up to fit our Points_All dataset.

Single Source Classifiers (SSC) are classifiers applied to single-source data. For weather and movement logs, there are still 7 data channels for each source, so we use our method MAMKL as the classifier. For maintenance logs, equipment details and the data channel formed by concatenating all features, each source consists of only one channel, so we filled the missing channels with means and then used a kernel SVM (Chang and Lin, 2011) for classification.

Missing Pattern Adaptive MKL (MAMKL) is the method proposed in this paper with kernel weights given by Eq. (5).

Sample Adaptive MKL (SAMKL) is the method proposed in this paper with kernel weights determined by Eq. (10).
For a fair comparison, we used RF approximation for the RBF kernels in all methods and fixed the random seed to make the features deterministic. As such, $\ell_p$-norm MKL could also be applied to our Points_All dataset without precomputed kernels.
We used the Area Under the Receiver Operating Characteristic Curve (AUROC) and the Area Under the Precision-Recall Curve (AUPRC) as performance metrics for all the methods. For all non-convex methods, we repeated the experiments 10 times and report the means and standard deviations.
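For readers unfamiliar with the two metrics, the following sketch computes them from scratch. It is an illustrative implementation under stated assumptions: AUROC is computed by direct pairwise comparison, which is quadratic in the number of samples, and AUPRC is estimated by average precision, one common step-wise estimator; the paper does not state which estimator was used, and the function names and toy scores are hypothetical.

```python
def auroc(labels, scores):
    """Chance that a random positive outranks a random negative (ties count half)."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Mean of the precision values measured at the rank of each positive sample."""
    ranked = sorted(zip(scores, labels), key=lambda t: t[0], reverse=True)
    hits, total = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank
    return total / hits


# Toy predictions: 1 marks a failure week, 0 a normal week.
toy_labels = [1, 0, 1, 0]
toy_scores = [0.9, 0.8, 0.7, 0.1]
toy_auroc = auroc(toy_labels, toy_scores)
toy_ap = average_precision(toy_labels, toy_scores)
```

On imbalanced data such as ours, the precision-recall view is the more demanding of the two because it is sensitive to false alarms among the few positive predictions.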
Table 3. Experiment results on the Points_Subset dataset.
Methods | | AUROC | AUPRC
MKLZF | | 0.737 | 0.436
 | | 0.921 | 0.791
 | | 0.902 | 0.784
 | | 0.920 | 0.789
 | | 0.921 | 0.790
MKLMF | | 0.646 | 0.289
 | | 0.923 | 0.800
 | | 0.887 | 0.770
 | | 0.887 | 0.767
 | | 0.906 | 0.780
MVLMKL | | 0.655±0.002 | 0.292±0.002
 | | 0.852±0.008 | 0.783±0.005
 | | 0.898±0.010 | 0.788±0.015
 | | 0.873±0.006 | 0.788±0.005
 | | 0.873±0.006 | 0.788±0.004
SSC | Movement Logs | 0.663±0.001 | 0.380±0.001
 | Weather | 0.864±0.035 | 0.781±0.036
 | Maintenance Logs | 0.667 | 0.301
 | Equipment Details | 0.516 | 0.217
 | All Concatenated | 0.669 | 0.376
AMKL | | 0.736 | 0.463
MAMKL | | 0.942±0.005 | 0.831±0.016
SAMKL | | 0.947±0.007 | 0.840±0.011
Table 4. Experiment results on the Points_All dataset.
Methods | | AUROC | AUPRC
MKLZF | | 0.699 | 0.218
 | | 0.691 | 0.199
 | | 0.696 | 0.205
 | | 0.690 | 0.196
 | | 0.692 | 0.197
MKLMF | | 0.698 | 0.223
 | | 0.684 | 0.204
 | | 0.687 | 0.204
 | | 0.682 | 0.198
 | | 0.668 | 0.176
MVLMKL | | 0.678±0.001 | 0.168±0.002
 | | 0.671±0.001 | 0.159±0.001
 | | 0.670±0.001 | 0.159±0.001
 | | 0.672±0.002 | 0.158±0.001
 | | 0.674±0.002 | 0.159±0.003
SSC | Movement Logs | 0.546±0.010 | 0.093±0.001
 | Weather | 0.677±0.003 | 0.197±0.008
 | Maintenance Logs | 0.567 | 0.098
 | Equipment Details | 0.517 | 0.085
 | All Concatenated | 0.622 | 0.133
MAMKL | | 0.721±0.002 | 0.261±0.009
SAMKL | | 0.734±0.002 | 0.270±0.002
For the Points_All dataset, we split the data into 60% training data, 20% validation data and 20% test data. The linear kernel was used for the data channels from equipment details and maintenance logs. We set the same bandwidth for the RBF kernels on the 7 data channels from weather data. The bandwidth was chosen from a candidate set defined relative to the standard deviation of the weather data, according to the AUROC on validation data obtained by an SVM with the sum of these 7 kernels as input. The same criterion was adopted to select the parameter of the RBF kernels for the 7 data channels from movement logs and the 1 data channel from concatenated features. The dimensions of the RFs for approximating the RBF kernels were set to 1024, 2048 and 2048 for movement logs, weather and concatenated features respectively. All other parameters were chosen from appropriately large ranges based on the AUROC of the related methods on validation data. For Points_Subset, we randomly selected 80% of the data as the training set and the remaining 20% as the test set. Parameters were decided by 5-fold cross-validation on the training set.
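The 60/20/20 protocol can be sketched as below. Whether the original split was random or chronological is not stated for the full dataset, so the shuffling here is an assumption, as are the function name `split_indices` and the seed.

```python
import random


def split_indices(n, percents=(60, 20, 20), seed=0):
    """Shuffle sample indices and cut them into train, validation and test parts."""
    order = list(range(n))
    random.Random(seed).shuffle(order)
    n_train = n * percents[0] // 100
    n_val = n * percents[1] // 100
    return (
        order[:n_train],
        order[n_train:n_train + n_val],
        order[n_train + n_val:],  # the remainder goes to the test set
    )


train_idx, val_idx, test_idx = split_indices(58833)
```

Hyperparameters such as the RBF bandwidth would then be compared by their AUROC on the validation indices only, leaving the test indices untouched until the final evaluation.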
5.2. Results on Points_Subset Dataset
Table 3 shows the experiment results on the Points_Subset dataset. $\ell_p$-norm MKL got inferior results when $p = 1$ because it tends to find a sparse combination of kernels. This means our data channels carry complementary information, so using only some of them cannot produce a good result. The results of SSC verify our argument that using data from only one source is not enough. The pre-filling methods did not perform best, because filling in the missing data in advance and using them in training may introduce another source of error. Although AMKL appropriately takes the missing pattern into account in training, it keeps a fixed kernel weight in testing. Besides, it is designed for $\ell_1$-norm MKL, so it did not perform well in our experiments. Our method clearly outperforms the other baselines in terms of both AUROC and AUPRC. We attribute the improvement to the combination of multi-source data and the sample adaptive kernel weights.
5.3. Results on Points_All Dataset
Table 4 shows the experiment results on the Points_All dataset. By training on all data, we also included some sets of railway points with only a few failure cases. The proportion of incomplete samples is also higher than in Points_Subset. These factors add to the difficulty of predicting the failures. As shown in Table 4, results with $p = 1$ are often better. This means traditional MKL cannot fully exploit the merits of multiple kernels. Our method still beats the other baselines on both AUROC and AUPRC, and improves on SSC. Notice that SAMKL is clearly better than MAMKL on this dataset, which verifies the effectiveness of the sample adaptive kernel weight and supports reliable failure warnings from our model.
For each set of railway points, the number of samples is usually less than 180, and only a few failures are observed for some points. We also trained a separate classifier for each set of railway points, but the results were unsatisfactory, so we do not list them here.
6. Conclusion
We have designed a novel approach for combining incomplete multi-source data to predict the failure of railway points. It is based on the multiple kernel learning framework but goes a step further by exploiting missing patterns and sample-specific features. With the involvement of domain experts, we grouped our data weekly and split each week into a daily format to form 17 data channels, from which we built 17 kernels. In this format, the missing patterns of samples can be expressed clearly. We then put forward a missing pattern adaptive MKL to leverage the information carried by missing patterns. We also considered the distinct properties of each set of railway points and further improved the prediction results with our SAMKL algorithm. Experiments show that our model can output reliable warnings for railway points and predicts failures well for points that fail frequently.
In the future, we plan to apply more kernel functions to each data channel and to reduce the resulting extra optimisation time through parallel computing on GPUs.
Acknowledgements.
The authors greatly appreciate the financial support from the Rail Manufacturing Cooperative Research Centre (funded jointly by participating rail organisations and the Australian Federal Government’s Business Cooperative Research Centres Program) through Project R3.7.2: Big data analytics for condition-based monitoring and maintenance.
References

 Afkanpour et al. (2013) Arash Afkanpour, András György, Csaba Szepesvári, and Michael Bowling. 2013. A randomized mirror descent algorithm for large scale multiple kernel learning. In Proc. 30th International Conference on Machine Learning. 374–382.
 Althloothi et al. (2014) Salah Althloothi, Mohammad H Mahoor, Xiao Zhang, and Richard M Voyles. 2014. Human activity recognition using multi-features and multiple kernel learning. Pattern Recognition 47, 5 (2014), 1800–1812.
 Bucak et al. (2014) Serhat S Bucak, Rong Jin, and Anil K Jain. 2014. Multiple kernel learning for visual object recognition: A review. IEEE Transactions on Pattern Analysis and Machine Intelligence 36, 7 (2014), 1354–1369.
 Camci et al. (2016) Fatih Camci, Omer Faruk Eker, Saim Başkan, and Savas Konur. 2016. Comparison of sensors and methodologies for effective prognostics on railway turnout systems. Proc. Institution of Mechanical Engineers, Part F: Journal of Rail and Rapid Transit 230, 1 (2016), 24–42.
 Chang and Lin (2011) Chih-Chung Chang and Chih-Jen Lin. 2011. LIBSVM: a library for support vector machines. ACM TIST 2, 3 (2011), 27.
 Fan et al. (2008) Rong-En Fan, Kai-Wei Chang, Cho-Jui Hsieh, Xiang-Rui Wang, and Chih-Jen Lin. 2008. LIBLINEAR: A library for large linear classification. JMLR 9 (2008), 1871–1874.
 García Márquez et al. (2010) Fausto Pedro García Márquez, Clive Roberts, and Andrew M Tobias. 2010. Railway point mechanisms: condition monitoring and fault detection. Proc. Institution of Mechanical Engineers, Part F: Journal of Rail and Rapid Transit 224, 1 (2010), 35–44.
 Gönen and Alpaydin (2008) Mehmet Gönen and Ethem Alpaydin. 2008. Localized multiple kernel learning. In Proc. 25th International Conference on Machine Learning. ACM, 352–359.
 Gönen and Alpaydın (2011) Mehmet Gönen and Ethem Alpaydın. 2011. Multiple kernel learning algorithms. JMLR 12, Jul (2011), 2211–2268.
 Gong et al. (2018) Yongshun Gong, Zhibin Li, Jian Zhang, Wei Liu, Yu Zheng, and Christina Kirsch. 2018. Network-wide Crowd Flow Prediction of Sydney Trains via Customized Online Non-negative Matrix Factorization. In Proc. 27th ACM International Conference on Information and Knowledge Management. ACM, 1243–1252.
 Hassankiadeh (2011) Seyedahmad Jalili Hassankiadeh. 2011. Failure analysis of railway switches and crossings for the purpose of preventive maintenance. Transport Science (2011).
 Ishak et al. (2016) Muhammad Fitri Ishak, Serdar Dindar, and Sakdirat Kaewunruen. 2016. Safety-based maintenance for geometry restoration of railway turnout systems in various operational environments. In Proc. 21st National Convention on Civil Engineering.
 Kloft et al. (2009) Marius Kloft, Ulf Brefeld, Pavel Laskov, Klaus-Robert Müller, Alexander Zien, and Sören Sonnenburg. 2009. Efficient and accurate lp-norm multiple kernel learning. In Advances in Neural Information Processing Systems. 997–1005.
 Kloft et al. (2011) Marius Kloft, Ulf Brefeld, Sören Sonnenburg, and Alexander Zien. 2011. Lp-norm multiple kernel learning. JMLR 12, Mar (2011), 953–997.
 Le et al. (2013) Quoc Le, Tamás Sarlós, and Alex Smola. 2013. Fastfood - approximating kernel expansions in loglinear time. In Proc. 30th International Conference on Machine Learning, Vol. 85.
 Lei et al. (2016) Yunwen Lei, Alexander Binder, Urun Dogan, and Marius Kloft. 2016. Localized multiple kernel learning: a convex approach. In Proc. 8th Asian Conference on Machine Learning. 81–96.

 Li et al. (2017) Xiang Li, Bin Gu, Shuang Ao, Huaimin Wang, and Charles X Ling. 2017. Triply stochastic gradients on multiple kernel learning. In Proc. 33rd Conference on Uncertainty in Artificial Intelligence.
 Li et al. (2015) Xiang Li, Huaimin Wang, Bin Gu, and Charles X Ling. 2015. Data sparseness in linear SVM. In Proc. 24th International Joint Conference on Artificial Intelligence. 3628–3634.
 Li et al. (2018) Zhibin Li, Jian Zhang, Qiang Wu, and Christina Kirsch. 2018. Field-regularised factorization machines for mining the maintenance logs of equipment. In Australasian Joint Conference on Artificial Intelligence. Springer, 172–183.
 Liu et al. (2014b) Fayao Liu, Luping Zhou, Chunhua Shen, and Jianping Yin. 2014b. Multiple kernel learning in the primal for multimodal Alzheimer’s disease classification. IEEE Journal of Biomedical and Health Informatics 18, 3 (2014), 984–990.
 Liu et al. (2015) Xinwang Liu, Lei Wang, Jianping Yin, Yong Dou, and Jian Zhang. 2015. Absent multiple kernel learning. In Proc. 29th AAAI Conference on Artificial Intelligence. 2807–2813.
 Liu et al. (2014a) Xinwang Liu, Lei Wang, Jian Zhang, and Jianping Yin. 2014a. Sample-Adaptive Multiple Kernel Learning. In Proc. 28th AAAI Conference on Artificial Intelligence. 1975–1981.
 Oyebande and Renfrew (2002) BO Oyebande and AC Renfrew. 2002. Condition monitoring of railway electric point machines. IEE Proc. Electric Power Applications 149, 6 (2002), 465–473.
 Rahimi and Recht (2008) Ali Rahimi and Benjamin Recht. 2008. Random features for large-scale kernel machines. In Advances in Neural Information Processing Systems. 1177–1184.
 Rakotomamonjy et al. (2008) Alain Rakotomamonjy, Francis R Bach, Stéphane Canu, and Yves Grandvalet. 2008. SimpleMKL. JMLR 9, Nov (2008), 2491–2521.
 Rakotomamonjy and Chanda (2014) Alain Rakotomamonjy and Sukalpa Chanda. 2014. Lp-norm multiple kernel learning with low-rank kernels. Neurocomputing 143 (2014), 68–79.
 Sahoo et al. (2014) Doyen Sahoo, Steven CH Hoi, and Bin Li. 2014. Online multiple kernel regression. In Proc. 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 293–302.
 Shen and Chen (2018) Yanning Shen and Tianyi Chen. 2018. Online ensemble multi-kernel learning adaptive to non-stationary and adversarial environments. In Proc. 21st International Conference on Artificial Intelligence and Statistics, Vol. 84.
 Shen et al. (2018) Yanning Shen, Tianyi Chen, and Georgios B Giannakis. 2018. Online multi-kernel learning with orthogonal random features. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE, 6289–6293.
 Sipos et al. (2014) Ruben Sipos, Dmitriy Fradkin, Fabian Moerchen, and Zhuang Wang. 2014. Log-based predictive maintenance. In Proc. 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 1867–1876.
 Sonnenburg et al. (2006) Sören Sonnenburg, Gunnar Rätsch, Christin Schäfer, and Bernhard Schölkopf. 2006. Large scale multiple kernel learning. JMLR 7, Jul (2006), 1531–1565.
 Tao and Zhao (2015) Hanqing Tao and Yang Zhao. 2015. Intelligent fault prediction of railway switch based on improved least squares support vector machine. Metallurgical and Mining Industry 7, 10 (2015), 69–75.

 Wang et al. (2017) Guang Wang, Tianhua Xu, Tao Tang, Tangming Yuan, and Haifeng Wang. 2017. A Bayesian network model for prediction of weather-related failures in railway turnout systems. Expert Systems with Applications 69 (2017), 247–256.
 Xu et al. (2015) Chang Xu, Dacheng Tao, and Chao Xu. 2015. Multi-view learning with incomplete views. IEEE Transactions on Image Processing 24, 12 (2015), 5812–5825.
 Xu et al. (2010) Zenglin Xu, Rong Jin, Haiqin Yang, Irwin King, and Michael R Lyu. 2010. Simple and efficient multiple kernel learning by group lasso. In Proc. 27th International Conference on Machine Learning. Omnipress, 1175–1182.
 Yang et al. (2012) Jingjing Yang, Yonghong Tian, Ling-Yu Duan, Tiejun Huang, and Wen Gao. 2012. Group-sensitive multiple kernel learning for object recognition. IEEE Transactions on Image Processing 21, 5 (2012), 2838–2852.
 Yeh et al. (2011) Chi-Yuan Yeh, Chi-Wei Huang, and Shie-Jue Lee. 2011. A multiple-kernel support vector regression approach for stock market price forecasting. Expert Systems with Applications 38, 3 (2011), 2177–2186.
 Yilboga et al. (2010) Halis Yilboga, Ömer Faruk Eker, Adem Güçlü, and Fatih Camci. 2010. Failure prediction on railway turnouts using time delay neural networks. In 2010 IEEE International Conference on Computational Intelligence for Measurement Systems and Applications. IEEE, 134–137.
 Zhang (2006) Fuzhen Zhang. 2006. The Schur complement and its applications. Vol. 4. Springer Science & Business Media.