Significant progress has been made in natural language processing (NLP) and supervised machine learning (ML) algorithms over the past two decades. NLP successes include machine translation, speech/emotion/sentiment recognition, machine reading, and social media mining. Hence, NLP is becoming widely used in real-world applications that involve either speech or text. Supervised ML algorithms excel at modeling the data-label relationship while maximizing performance and minimizing energy consumption and latency.
Supervised ML algorithms train on data (features) and label pairs to model the application of interest and predict labels. The labels carry semantic information. Palatucci et al. use this information through vector representations of words to find a novel class within the dataset. Karpathy and Fei-Fei generate image captions based on the collective use of image datasets and word embeddings. Such studies indicate that data features and semantic relationships correlate well. However, current supervised ML algorithms do not utilize such correlations in the decisionmaking (prediction) process. Their decisions are based only on the feature-label relationship and neglect significant information hidden in the labels, i.e., meaning-based (semantic) relationships among labels. Thus, they are not able to exploit synergies between the feature and semantic spaces.
In this article, we show the above synergies can be exploited to improve the prediction performance of ML algorithms. Our method, called SECRET, combines vector representations of labels in the semantic space with available data in the feature space within various operations (e.g., ML hyperparameter optimization and confidence score computation) to make the final decisions (assign labels to datapoints). Since SECRET does not target any particular ML algorithm or data structure, it is widely applicable.
The main contributions of this article are as follows:
We introduce a dual-space ML decision process called SECRET. It combines the new dimension (semantic space) with the traditional (single-space) classifiers that operate in the feature space. Thus, SECRET not only utilizes available data-label pairs, but also takes advantage of meaning-based (semantic) relationships among labels to perform classification for a given real-world task.
We demonstrate the general applicability of SECRET on various supervised ML algorithms and a wide range of datasets for various real-world tasks.
We demonstrate the advantages of SECRET’s new dimension (semantic space) through detailed comparisons with traditional ML approaches that have the same processing and information (except semantic) resources.
We compare the semantic space ML model with traditional approaches. We shed light on how SECRET builds the semantic space component and its impact on overall classification performance.
The remainder of the article is organized as follows. Section 2 provides background information on supervised ML algorithms, Bayesian optimization, and semantic vector representation of words. Section 3 provides the motivation behind SECRET’s dual-space ML decision process. Section 4 introduces the methodologies underpinning the SECRET architecture, data processing, hyperparameter tuning, ML algorithm training in the feature space and semantic space, confidence score calculation, and decision process. Section 5 presents experimental results and provides comparisons with traditional ML approaches. Section 6 presents related work from the literature and points out the novelty of SECRET. Finally, Section 7 discusses future research directions and concludes the article.
In this section, we discuss background material that will help with understanding of the rest of the article. We first discuss supervised ML classifiers. Then we introduce Bayesian optimization for hyperparameter tuning and semantic vector representation of words.
2.1 Supervised Learning Algorithms
Supervised learning algorithms model the relationship between given data and corresponding labels. They take the data (features) as input and map them to available labels while satisfying an objective (e.g., minimizing error, maximizing within-class similarity, etc.). Then the model offers generalizability by predicting labels for unseen data.
Supervised learning algorithms target two main problems: classification and regression. Classification problems have categorical outputs, such as ‘normal find’ and ‘metastases.’ Regression problems have continuous (numerical) outputs. In this study, due to their widespread success in modeling data-label relationships, we focus on two supervised learning algorithms: the multilayer perceptron (MLP) and random forest (RF).
An MLP is a fully-connected feedforward neural network. It has input, hidden, and output layers of neurons. Neurons are the smallest computational units that feed input data to a nonlinear activation function. Neurons in one layer are connected to neurons in the next layer through links that have weights. The backpropagation algorithm is used to update the weights in the training phase. RF is an ensemble method whose constituents are decision trees. The decision trees are built by splitting each node using the most informative feature chosen from a random subset of features. RF makes a decision based on majority vote (classification) or weighted average (regression) of the outputs of the constituent decision trees.
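The two RF aggregation rules described above can be illustrated with a minimal sketch. The function names and toy tree outputs are hypothetical and exist only for illustration; a real RF library would also handle tree construction and feature subsampling.

```python
from collections import Counter


def majority_vote(tree_predictions):
    """Classification: return the label predicted by the most trees.

    Ties are broken in favor of the label that was seen first.
    """
    return Counter(tree_predictions).most_common(1)[0][0]


def weighted_average(tree_outputs, weights=None):
    """Regression: return the weighted mean of the tree outputs.

    With no weights given, every tree contributes equally.
    """
    if weights is None:
        weights = [1.0] * len(tree_outputs)
    total = sum(weights)
    return sum(w * y for w, y in zip(weights, tree_outputs)) / total


# Toy outputs of three hypothetical decision trees.
votes = ["metastases", "normal find", "metastases"]
print(majority_vote(votes))
print(weighted_average([1.0, 2.0, 4.0], weights=[1.0, 1.0, 2.0]))
```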
2.2 Bayesian Optimization for Hyperparameter Tuning
The selected set of hyperparameter values has a direct impact on classification/regression performance. Hand-tuning, random search, grid search, and Bayesian optimization are commonly used methods for finding the best set of hyperparameter values. In this work, we adopt Bayesian optimization as it is known, in general, to provide an unbiased analysis and higher classification/regression performance, while requiring a small number of iterations due to the utilization of results from past iterations.
Bayesian optimization integrates exploration and exploitation. It starts with a prior belief over the unknown objective function. It then evaluates the optimization goal function with available data (target hyperparameter values chosen for the iteration). Based on input data and the corresponding optimization goal outputs, it updates the beliefs and selects the next set of hyperparameter values to be evaluated. The process is repeated until a maximum number of iterations is reached.
2.3 Semantic Vector Models of Words
Semantic vector models assign a compact real-valued vector to each word in a dictionary. The vector captures the word’s semantic relationships with the remaining words in the dictionary. Words with close meanings are represented by closely-spaced vectors in the semantic space. Some of the algorithms that derive semantic vector word representations are the Skip-gram and Continuous Bag-of-Words (CBOW) architectures of word2vec, GloVe, vLBL, ivLBL, and Hellinger PCA [12].
GloVe is an unsupervised method. It uses the co-occurrence ratio of words within a pre-specified window length to obtain the word vectors. Use of this ratio enhances the distinction between two relevant words or between a relevant word and an irrelevant one. The GloVe algorithm is based on weighted least squares regression. As shown in Eq. 1, it aims to minimize the difference between the scalar product of the two word vectors and the logarithm of their co-occurrence value. Weights are used to avoid dominance (overweighting) by both very frequent and rare co-occurrences. The corresponding weighting function is shown in Eq. 2. In the original GloVe work, a particular setting of the weighting function’s parameters has been found to yield good results.
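For reference, the objective and weighting function published in the original GloVe paper take the form below. We assume that Eq. 1 and Eq. 2 correspond to these expressions; the symbols follow the GloVe paper and may differ from the notation used in this article. Here $V$ is the vocabulary size, $X_{ij}$ the co-occurrence count of words $i$ and $j$, $w_i$ and $\tilde{w}_j$ the word and context vectors, and $b_i$ and $\tilde{b}_j$ the bias terms.

```latex
J = \sum_{i,j=1}^{V} f(X_{ij}) \left( w_i^{\top} \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^{2}

f(x) =
\begin{cases}
  (x / x_{\max})^{\alpha} & \text{if } x < x_{\max}, \\
  1 & \text{otherwise.}
\end{cases}
```

The GloVe paper reports using $x_{\max} = 100$ and $\alpha = 3/4$.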
ML algorithms are widely used for making real-world decisions. Based on their objective, ML algorithms can be grouped into four categories: information-based, similarity-based, probability-based, and error-based. The choice of algorithm depends on the application of interest and dataset characteristics.
Speech and text data involve semantic relationships between data instances. These relationships collectively account for the semantic space. NLP targets the semantic space and corresponding classification tasks with word embeddings. Word embeddings (semantic vector representations) are used to model meaning-based relationships among words in a compact vector form. These relationships are captured through distances between real-valued word vectors. Words with close meanings are located nearby in the semantic space. Semantic vector representations are used in a wide range of NLP applications, such as document indexing, text classification, question answering, and speech recognition.
Various features (characteristics) can be extracted from numerical, categorical, and graph-based data and correlated with corresponding circumstances (labels) of interest in supervised ML. The features constitute the feature space. As an example, in healthcare, the labels may be disease names, therapy methods, or health states, whereas in the chemical industry, the labels may be chemical names, model simulation states, or stability test results. Although labels differ from one application to another, they all lead to some action being taken based on the assigned label. The action can be reporting an anomaly, continuing the process, switching states, scaling parameters, etc. Since the assigned labels impact future actions, they need to be interpretable by either humans or machines. This means that the labels also carry semantic information. However, current supervised classifiers do not take advantage of this semantic information. Therefore, as shown in Fig. 1, the feature and semantic spaces stay far apart in the galaxy representing all information sources for the classification task. Consider a dataset that has ‘calm sleep,’ ‘REM sleep,’ and ‘stress situation’ as labels. As depicted in Fig. 2, current supervised ML algorithms will result in the same data-label model even if we replace the labels with ‘class 1,’ ‘class 2,’ and ‘class 3.’ However, ‘calm sleep’ and ‘REM sleep’ are semantically closer to each other than either is to ‘stress situation.’ It would be advantageous to exploit this semantic relationship during classification.
SECRET addresses the above problem through a dual-space classification approach. As shown in Fig. 3(a), traditional supervised learning operates in the feature space. SECRET, on the other hand, also incorporates class affinity and dissimilarity information into the decision process, as shown by the ‘Semantic space’ block in Fig. 3(b). This property enables SECRET to make informed decisions on class labels, thus enhancing its overall classification performance. As an example, consider the UCI Contraceptive Method Choice Dataset. It has three classes: ‘no_use,’ ‘long_term_methods,’ and ‘short_term_methods.’ As shown in Fig. 4, SECRET, built with MLP and RF, is able to detect the ‘long_term_methods’ class (Fig. 4(c)), whereas the traditional supervised learning (feature space) approach, which uses MLP (Fig. 4(b)), is unable to do so in six out of the ten folds. One of the reasons behind this improvement is the use of a different ML algorithm. However, as demonstrated later through experimental results, SECRET outperforms not only traditional classifiers but also ensemble methods. This result indicates that the dataset is heterogeneous in terms of data characteristics corresponding to different classes. While ‘no_use’ is easily distinguishable (Fig. 4(b)), ‘long_term_methods’ and ‘short_term_methods’ have a large semantic affinity, as indicated by the squared Euclidean distances in Fig. 5. Since the traditional feature space classifier does not adjust its decisionmaking process based on class affinity/dissimilarity, it assumes a comparable difference between data characteristics corresponding to different classes. As a result, it fails to notice the ‘long_term_methods’ class. On the other hand, SECRET discovers the affinity between ‘long_term_methods’ and ‘short_term_methods’ through the semantic space (Fig. 5) and focuses on distinguishing between these two classes.
SECRET jointly optimizes the hyperparameters and makes the final decision (labeling) by integrating information from both the feature and semantic spaces. It is able to deliver higher classification performance relative to traditional approaches because of its reliance on a richer semantic+feature space. Although this example only exploits the semantic+feature space, there may be other as-yet-undiscovered spaces that could also be integrated into SECRET in a similar fashion. This is depicted by the third spacecraft in Fig. 1.
In this section, we describe SECRET’s data processing and dual-space classification procedure in detail.
4.1 The SECRET Architecture
SECRET integrates information from two sources: feature space and semantic space.
The feature space includes data, extracted features (if available), and the corresponding labels.
The semantic space includes meaning-based relationships among labels in the form of real-valued word vectors.
As shown in Fig. 3(a), traditional supervised learning operates in the feature space. It uses the features to model the data-label relationship. On the other hand, SECRET not only uses data available in the feature space, but also integrates meaning-based relationships among labels (semantic space) into the decision process, as shown in Fig. 3(b). SECRET takes training data, training labels, and their vector representations to develop a model that is used to predict the label for the test data. Thus, it requires vector representations of the training labels as an additional input, relative to the traditional supervised learning approach. Vector representations are obtained using semantic vector generation algorithms (see Section 2.3) that are trained with a large number of documents. Depending on the available computational resources, SECRET can be implemented with either pre-trained semantic vectors that are available on the web or specially-trained semantic vectors obtained from a given corpus. Neither implementation needs the involvement of an expert, unlike the case of labeling data in supervised learning.
The novelty of SECRET is that it enables interaction between the two spaces while constructing the classifiers and regressors. The hyperparameter-tuning stage of SECRET includes the corresponding classifier and regressor from the feature space and semantic space, respectively. In Fig. 3(b), the interaction is depicted by the arrow between the feature space and semantic space. Hence, the hyperparameter values of the semantic (feature) space are not aimed at maximizing the classification (regression) performance of the semantic space regressor (classifier), but that of the overall SECRET architecture. However, the interaction does not only take place during hyperparameter tuning. Unlike the traditional approaches, the classifier and regressor do not make individual decisions. Both provide confidence scores for each label. This information is used by SECRET to predict the label for a new query data instance. We explain each block in detail next.
4.2 Data Processing
Data processing is an important part of any ML decision process. Data in the raw form require:
denoising,
outlier elimination,
feature extraction,
feature encoding, and
normalization or standardization.
SECRET targets these operations in the ‘Data Processing’ block in Fig. 3(b).
Denoising depends on the application of interest, signal properties (e.g., sampling frequency, range, etc.), and noise source (e.g., sensor artifacts, environmental conditions, and user faults). It needs to be implemented separately for each application and signal. For more details, we refer the readers to surveys in the literature.
Outlier elimination is aimed at removing or replacing the data that are out of signal range. Outliers can be detected either manually (through expert knowledge) or using statistical properties of the data. Surveys of outlier detection methods can be found in the literature.
The feature extraction stage extracts informative values from the data to enhance the decision process. It depends on the chosen ML algorithm, application of interest, and the available data. In general, ML algorithms benefit from extracted features in terms of classification performance. Although neural network based algorithms take raw data as input, they extract features in multiple layers that are hardcoded in the design stage. Though these feature extraction layers may need less processing relative to other ML algorithms (e.g., random forest, support vector machine, AdaBoost, k-nearest neighbors), they do not eliminate the need for feature extraction.
Many ML algorithms require numerical data for training and decisionmaking. However, datasets might also include categorical features. Feature encoding is targeted at such features and replaces them with numerical values. One-hot encoding is a widely-used method for categorical feature transformation. It adds a column (feature) for each categorical state and assigns ‘1’ to the column corresponding to the state of the feature of interest and ‘0’ to the rest.
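One-hot encoding as described above can be sketched in a few lines. The function name and the toy categorical feature are hypothetical; the column order (alphabetical) is an arbitrary choice made here for reproducibility.

```python
def one_hot_encode(values):
    """Replace a categorical feature with one 0/1 column per state.

    Returns the ordered list of states (the new column names) and one
    encoded row per input value.
    """
    states = sorted(set(values))
    column_of = {state: i for i, state in enumerate(states)}
    rows = []
    for value in values:
        row = [0] * len(states)
        row[column_of[value]] = 1  # '1' only in the column of this state
        rows.append(row)
    return states, rows


# Toy categorical feature with three states.
states, rows = one_hot_encode(["low", "high", "medium", "low"])
print(states)
for row in rows:
    print(row)
```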
The data normalization or standardization stage is targeted at features with different scales. It has a significant impact on classification/regression performance. Normalization or standardization is done before hyperparameter tuning. Normalization brings feature values to within a specific range, whereas standardization transforms them to have zero mean and unit variance. Whether to use normalization or standardization depends on the feature characteristics.
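The difference between the two transformations can be made concrete with a small sketch. The function names are hypothetical, and min-max scaling is assumed here as the normalization method; other range-based schemes are possible. In practice, the statistics would typically be computed on the training data and reused for validation and test data.

```python
from statistics import mean, pstdev


def min_max_normalize(column, low=0.0, high=1.0):
    """Normalization: rescale feature values into the range [low, high]."""
    smallest, largest = min(column), max(column)
    if largest == smallest:  # constant feature: map everything to `low`
        return [low for _ in column]
    scale = (high - low) / (largest - smallest)
    return [low + (x - smallest) * scale for x in column]


def standardize(column):
    """Standardization: shift and scale to zero mean and unit variance."""
    mu = mean(column)
    sigma = pstdev(column)  # population standard deviation
    if sigma == 0:  # constant feature: all values equal the mean
        return [0.0 for _ in column]
    return [(x - mu) / sigma for x in column]


feature = [2.0, 4.0, 6.0]
print(min_max_normalize(feature))
print(standardize(feature))
```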
4.3 Hyperparameter Tuning
SECRET performs hyperparameter tuning through Bayesian optimization. By integrating exploration and exploitation, Bayesian optimization outputs the set of hyperparameter values that maximizes the optimization goal function. This function indicates the overall performance of the chosen supervised ML algorithm. Therefore, it guides Bayesian optimization to find the right set of hyperparameter values in order to enhance the performance of real-world decision processes.
The pseudocode for the hyperparameter tuning stage is shown in Algorithm 1. Following preprocessing of training and validation data, the Gaussian Process (GP) of Bayesian optimization is initialized. Bayesian optimization takes hyperparameters (as variables, not their values), their ranges, and the optimization goal function as input. The hyperparameters depend on the chosen ML algorithm. For example, whereas the total number of trees may be a hyperparameter for the RF algorithm, the number of layers and neurons in each layer may be hyperparameters for the MLP algorithm. The optimization goal function reflects the purpose of the task being performed. Depending on whether SECRET is implemented on top of a traditional supervised (feature space) classifier or built from the ground up, the optimization goal function either uses the already available feature space hyperparameter values and tunes only the semantic space hyperparameter values, or tunes both the semantic and feature space hyperparameter values. The function outputs performance metrics, such as accuracy, F1 score, etc., based on training and validation data. Throughout the article, we show implementation of SECRET on top of a traditional supervised classifier to be able to compare SECRET with the classification algorithms in the literature. Therefore, in Algorithm 1, the feature space and semantic space algorithms are trained with already assigned hyperparameter values and acquisition function outputs (semantic space hyperparameter values), respectively. Then the feature space and semantic space confidence scores are calculated and labels are assigned (see Section 4.5 for details). At the end of the operation, the optimization goal function outputs the performance metric that needs to be maximized. This output is used to update the beliefs and obtain the next set of semantic space hyperparameter values. The above process is repeated with this new set.
When the maximum number of iterations (BOiter) is reached, the process is stopped and the set of hyperparameter values (SSHyp) that leads to the highest validation set performance is selected for application to the test set.
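The outer loop of this tuning stage can be sketched as follows. This is not Algorithm 1 itself: to stay self-contained, the sketch swaps the GP-based acquisition step for plain random sampling, so it illustrates only the propose, evaluate, and keep-the-best structure. The function names, the `n_trees` hyperparameter, and the toy goal function are hypothetical; only the roles of BOiter and SSHyp come from the text.

```python
import random


def tune_semantic_hyperparameters(goal_function, ranges, bo_iter, seed=0):
    """Run `bo_iter` propose/evaluate rounds and keep the best candidate.

    NOTE: candidates are drawn uniformly at random here. Real Bayesian
    optimization would instead use `history` to update its beliefs and
    choose the next candidate through an acquisition function.
    """
    rng = random.Random(seed)
    best_score, best_hyp = float("-inf"), None
    history = []
    for _ in range(bo_iter):
        candidate = {name: rng.randint(lo, hi) for name, (lo, hi) in ranges.items()}
        score = goal_function(candidate)  # validation-set performance
        history.append((candidate, score))
        if score > best_score:
            best_score, best_hyp = score, candidate
    return best_hyp, best_score, history


def toy_goal(hyp):
    """Hypothetical stand-in for the optimization goal function.

    In SECRET, this step would train both models, fuse their confidence
    scores, and return a validation metric such as accuracy or F1 score.
    """
    return 1.0 - abs(hyp["n_trees"] - 60) / 100.0


ss_hyp, score, history = tune_semantic_hyperparameters(
    toy_goal, {"n_trees": (10, 100)}, bo_iter=25
)
print(ss_hyp, score)
```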
4.4 Training of ML Models and Inference
The data-label relationships exist in different forms in the feature and semantic spaces and are captured through ML algorithms. The feature space does not take into account the meaning-based relationships among labels. However, the semantic space takes into account the affinity and dissimilarity information between labels that is captured in a vector form. Hence, whereas the feature space decision process maps data to the label with the help of a classifier, the semantic space decision process relies on a regressor. The choice of the regressor has a direct impact on SECRET’s performance. Thus, for a fixed feature space classifier, SECRET carries out performance analyses with various regressors on the training and validation data and selects the one that maps data to the labels the best. After finding the best set of hyperparameter values in both spaces through joint optimization, SECRET trains the ML algorithms. The feature space classifier is trained with the selected hyperparameter values, training data, and training labels. The semantic space regressor is trained with the selected hyperparameter values, training data, and vector representations of the labels. Following the training stage, inference is performed on the test data and confidence scores are obtained. The operations corresponding to this stage are shown on lines 3 through 6 in Algorithm 2.
4.5 Confidence Score Computation and Decision
The inference stage outputs the confidence scores for each data instance for both spaces. Computation of the feature space confidence score depends on the chosen ML algorithm. For example, the confidence score of the RF classifier is the vector of class probabilities. In other words, for each class, it is the ratio of the number of decision trees that assign the class of interest to the total number of trees. The confidence score of the MLP classifier, in contrast, is the output of the activation function in the outermost (final) layer. The semantic space confidence score, on the other hand, is based on distance, in line with the main motivation behind semantic vector representations. It is computed through the inverse ratio of the distance between the assigned vector and the label vector, normalized by the total distance from the assigned vector to all label vectors. The corresponding operations are shown on lines 7 through 9 in Algorithm 2, which uses the dimension of the semantic word vector, the semantic vector of each class label, the total number of classes, and an additive shift. In Algorithm 2, the additive shift is used to avoid divergence of the algorithm when the assigned vector (regressor output) overlaps with the vector of the class label. Its value needs to be smaller than the minimum difference between vectors of the assigned label and the class label. However, in the hyperparameter tuning stage (Algorithm 1), the additive shift is set to zero and divergence is permitted to avoid overfitting. Since divergence blocks label assignment, overlap between the assigned label and class label degrades classification performance on the validation set significantly. Therefore, the hyperparameters corresponding to this case are not selected in Algorithm 1 and overfitting is avoided. In summary, while setting the additive shift to zero in the hyperparameter tuning stage is beneficial for preventing overfitting, a nonzero value in the decisionmaking process is needed to avoid divergence.
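A minimal sketch of the two confidence scores follows. The RF score is the vote fraction described above. For the semantic space score, the exact formula of Algorithm 2 is not reproduced here; the sketch assumes one plausible reading in which inverse squared Euclidean distances (plus the additive shift) are normalized to sum to one, so that both scores lie on the same scale. The function names, the default shift value, and the two-dimensional label vectors are hypothetical toy choices, not real word embeddings.

```python
def rf_confidence(tree_votes, classes):
    """Feature space (RF): fraction of trees voting for each class."""
    n_trees = len(tree_votes)
    return {c: tree_votes.count(c) / n_trees for c in classes}


def squared_euclidean(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


def semantic_confidence(predicted_vector, label_vectors, shift=1e-9):
    """Semantic space: closer label vectors receive higher confidence.

    ASSUMPTION: inverse distances normalized to sum to one. `shift` is
    the additive shift that prevents division by zero when the regressor
    output coincides with a label vector.
    """
    inverse = {
        label: 1.0 / (squared_euclidean(predicted_vector, vec) + shift)
        for label, vec in label_vectors.items()
    }
    total = sum(inverse.values())
    return {label: value / total for label, value in inverse.items()}


# Toy 2-D label vectors (hypothetical, for illustration only).
label_vectors = {
    "calm sleep": [0.0, 0.0],
    "REM sleep": [1.0, 0.0],
    "stress situation": [4.0, 3.0],
}
scores = semantic_confidence([0.2, 0.0], label_vectors)
print(max(scores, key=scores.get))
print(rf_confidence(["a", "a", "b"], ["a", "b"]))
```

With `shift=0.0` and a regressor output that coincides exactly with a label vector, the division fails with a `ZeroDivisionError`, which mirrors the divergence case discussed above.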
| Dataset | Abbreviation | # Instances | # Features | # Classes | Class Labels |
| --- | --- | --- | --- | --- | --- |
| UCI Connectionist Bench (Sonar, Mines vs. Rocks) Dataset | sonar | 208 | 60 | 2 | Rock, Metal cylinder |
| UCI Chess (King-Rook vs. King) Dataset | chess | 28056 | 6 | 18 | Draw, Zero, One, Two, Three, Four, Five, Six, Seven, Eight, Nine, Ten, Eleven, Twelve, Thirteen, Fourteen, Fifteen, Sixteen |
| UCI Cardiotocography Dataset | cardio | 2126 | 21 | 10 | Calm sleep, REM sleep, Calm vigilance, Active vigilance, Shift pattern, Stress situation, Vagal stimulation, Largely vagal stimulation, Pathological state, Suspect pattern |
| UCI ILPD (Indian Liver Patient Dataset) | liver | 583 | 10 | 2 | Liver patient, Not liver patient |
| UCI Nursery Dataset | nursery | 12960 | 8 | 5 | Not recommended, Recommended, Very recommended, Priority, Special Priority |
| UCI Breast Cancer Wisconsin (Diagnostic) Dataset | wdbc | 569 | 30 | 2 | Benign, Malignant |
| UCI Contraceptive Method Choice Dataset | cmc | 1473 | 9 | 3 | No use, Short-term methods, Long-term methods |
| UCI Letter Recognition Dataset | letter | 20000 | 16 | 26 | A, B, C, D, E, F, G, H, I, J, K, L, M, N, O, P, Q, R, S, T, U, V, W, X, Y, Z |
| UCI Lymphography Dataset | lymph | 148 | 18 | 4 | Normal find, Metastases, Malign lymph, Fibrosis |
| UCI Statlog (Heart) Dataset | heart | 270 | 13 | 2 | Absence, Presence |
The overall confidence score is computed by taking the average of the feature space and semantic space confidence scores for each class. In the decision stage, each data instance in the test set is assigned the label of the class that has the highest overall confidence score (see Algorithm 2). Following the labeling stage, SECRET’s classification performance is assessed through the accuracy and F1 score metrics computed using Eq. 3 and Eq. 4, respectively. Accuracy is the ratio of the number of correctly classified instances to the total number of instances. The F1 score, in contrast, reflects how well the instances of each class within the dataset are classified, through per-class precision and recall.
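The decision rule and the two metrics can be sketched as follows. The fusion step follows the averaging described above. The metric functions use the standard textbook definitions of accuracy and per-class F1 score with averaging over classes; we assume, but cannot confirm from the text, that Eq. 3 and Eq. 4 have this form. All function names and toy values are hypothetical.

```python
def fuse_and_decide(feature_scores, semantic_scores):
    """Average the two per-class scores and pick the highest one."""
    overall = {
        c: (feature_scores[c] + semantic_scores[c]) / 2.0 for c in feature_scores
    }
    return max(overall, key=overall.get), overall


def accuracy(y_true, y_pred):
    """Ratio of correctly classified instances to all instances."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Per-class F1 = 2TP / (2TP + FP + FN), averaged over classes."""
    classes = sorted(set(y_true) | set(y_pred))
    f1_values = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denominator = 2 * tp + fp + fn
        f1_values.append(2 * tp / denominator if denominator else 0.0)
    return sum(f1_values) / len(f1_values)


# The semantic space overturns the feature space decision in this toy case.
label, overall = fuse_and_decide({"a": 0.6, "b": 0.4}, {"a": 0.2, "b": 0.8})
print(label)

y_true = ["a", "a", "b", "b"]
y_pred = ["a", "b", "b", "b"]
print(accuracy(y_true, y_pred), macro_f1(y_true, y_pred))
```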
5 Experimental Results and Discussion
In this section, we present the experimental results for SECRET and provide comparisons with traditional supervised classifiers and ensemble methods. Then, we analyze the effect of feature-semantic space variations on SECRET’s classification performance.
SECRET’s flexible design ensures applicability to a broad spectrum of real-world classification tasks. We analyze its performance on datasets for ten different applications, ranging from biomedical disease diagnosis to sonar-based object detection. Table I describes these datasets and their characteristics. The datasets are taken from the UCI Machine Learning Repository. They focus on the classification task.
The UCI Connectionist Bench Dataset is based on sonar signals that are reflected from a rock or metal cylinder. It discriminates between these two obstacles. The UCI Chess Dataset is built using the king-rook and king positions on a chessboard. It is targeted at predicting the depth of a win. The UCI Cardiotocography Dataset is formed using cardiotocography features that are based on fetal heart rate and uterine contraction. It focuses on classification of ten different fetal morphologic patterns. The UCI Indian Liver Patient Dataset is composed of patient records, such as age, gender, total bilirubin, total protein, albumin, etc. It is aimed at diagnosing liver disease. The UCI Nursery Dataset includes parental occupation, child’s nursery condition, family structure, and the family’s social, health, and financial status as features. It is targeted at ranking of nursery school applications. The UCI Breast Cancer Wisconsin Dataset is built using images of a fine needle aspiration of the breast. It focuses on classifying the cell nucleus as malignant or benign. The UCI Contraceptive Method Choice Dataset is composed of demographic and socio-economic information of married women. It is aimed at identifying their contraceptive method choices. The UCI Letter Recognition Dataset is formed using black-and-white image pixels of letters of the English alphabet. It is targeted at classifying each of the 26 letters. The UCI Lymphography Dataset includes features extracted from lymphography images. It is aimed at classifying different types of lymph nodes. The UCI Statlog (Heart) Dataset is composed of physiologic signal and demographic information of patients. It is aimed at predicting the absence or presence of heart disease.
5.2 Supervised Classifier vs. SECRET
We hypothesize that feature space information is not the only source of information that can be used for classification. In order to test this hypothesis, we compare the classification performance of the supervised classifier with that of SECRET. The supervised classifier uses feature space information to model the data-label relationship and predict labels of unlabeled data instances. On the other hand, SECRET fuses the feature and semantic space information to predict labels. By analyzing the two approaches, we aim to identify the impact of semantic space information on classification performance. First, in order to minimize dependency on an ML algorithm, we use two classifiers (RF and MLP) and their regressor versions. Decision trees in RF are information-based, whereas MLP is error-based. Second, in order to avoid a biased evaluation of classification performance, we compute both accuracy and F1 scores by comparing the predicted labels with actual ones in the test set. Accuracy reflects the percentage of correctly classified samples within the test set. The F1 score, in contrast, incorporates precision and recall values, which are computed from the false positive and false negative values for each class and then averaged. Therefore, accuracy and F1 score assess the classification performance from different perspectives.
We implement SECRET with a Bayesian optimization framework (used for determining the number of neurons in an MLP with a single hidden layer and the number of trees in RF), scikit-learn, and 50-dimensional GloVe vectors (pretrained with Wikipedia 2014 + Gigaword 5). We calculate the semantic space confidence score through the equations shown in Algorithm 1 and Algorithm 2. The value of the additive shift used in the experiments is assigned based on our analyses (see the conditions mentioned in Section 4.5). For classification performance analyses, we apply stratified k-fold sampling to each dataset and report the average accuracy and F1 score values.
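Stratified fold assignment can be sketched as below. The sketch deals the indices of each class out to the folds in round-robin order, so every fold keeps roughly the class proportions of the full dataset. The function name is hypothetical, and the sketch omits shuffling and assumes nothing about the number of folds used in the experiments; libraries such as scikit-learn provide a more complete implementation.

```python
from collections import defaultdict


def stratified_folds(labels, k):
    """Split instance indices into k folds with similar class proportions."""
    indices_by_class = defaultdict(list)
    for index, label in enumerate(labels):
        indices_by_class[label].append(index)

    folds = [[] for _ in range(k)]
    for indices in indices_by_class.values():
        # Deal the instances of this class out to the folds in turn.
        for position, index in enumerate(indices):
            folds[position % k].append(index)
    return folds


# Toy label list: six instances of one class and three of another.
labels = ["x"] * 6 + ["y"] * 3
folds = stratified_folds(labels, k=3)
for fold in folds:
    print(fold, [labels[i] for i in fold])
```

Each fold serves once as the test set while the remaining folds form the training set, and the reported metric is the average over the folds.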
Fig. 7 shows the legend for the plots that depict classification performance of the traditional feature space approach and SECRET. The starting points of the arrows indicate accuracy and F1 scores of the feature-space classifiers. The ending points indicate the impact of including semantic space information on classification. The numbers above/below the arrows show the percentage improvement. The arrow sizes are scaled accordingly. Dataset names are ordered based on the amount of change in classification performance using SECRET.
Fig. 8 shows the accuracy and F1 scores of the traditional MLP classifier and SECRET with the format shown in Fig. 7. In this case, SECRET integrates semantic information into the MLP classifier with the help of either an RF or MLP regressor. It chooses the type of regressor based on validation set performance, builds the overall classifier using both the semantic and feature spaces, and makes the final decision on the test labels. The color of arrows in Fig. 8 indicates the chosen regressor type. If both black and grey colored arrows are shown for a dataset, then the validation set classification performance is inconclusive in determining the better regressor. For the lymph dataset, the size of the training set is not sufficient to train the model and test on the validation set. We present results with both regressors. For the wdbc dataset, both regressors performed equally well on the validation set with only a 0.1% difference in accuracy and F1 score values. Therefore, we again present both. The chess, cmc, and cardio datasets show over 4.4% and 6.7% improvements in accuracy and F1 scores, respectively. While the liver dataset has only a 0.3% improvement in accuracy, it has the highest F1 score improvement of 13.5% among all datasets. This shows the importance of analyzing classification performance from different perspectives. Although accuracy is not able to capture the increase in precision and recall values (decrease in false positive and false negative rates), the F1 score shines light on this information. Moreover, except for the nursery and heart datasets, we observe that the arrows point to the right, indicating that SECRET improves accuracy as well as the F1 score. The amount of improvement depends on dataset characteristics, feature space classifier, and chosen semantic space regressor.
Fig. 9 shows the classification performance of the traditional RF classifier and SECRET implemented with an MLP or RF regressor. For the letter dataset, both regressors performed equally well on the validation set. Therefore, we present results with both regressors. In this experimental setup, SECRET shows over 4.4% accuracy and 4.5% F1 score improvements for the chess and sonar datasets. Furthermore, although the size of the lymph dataset leads to an inconclusive choice for the regressor type, we observe a 2.8% accuracy and 7.9% F1 score improvement with the MLP regressor. For the liver dataset, as in the case of Fig. 8, a 4.2% increase in the F1 score points to the positive impact of the semantic space information on decreasing the false positive and false negative rates, thus increasing the precision and recall values. Overall, the arrows in Fig. 8 and Fig. 9 point to the right, thus demonstrating classification performance enhancement with SECRET.
5.3 Ensemble Method vs. SECRET
We saw in the previous section that SECRET outperforms traditional supervised ML classifiers. However, a traditional classifier can also be made more robust by using an ensemble method. In this section, we compare ensemble methods with SECRET to show that the semantic space offers a different type of information source that pays rich dividends. To ensure a fair comparison between traditional ensemble methods and SECRET, we replace the red ‘Semantic Space’ block in Fig. (b) with a ‘Feature Space’ block. The corresponding block diagram for the ensemble method is shown in Fig. 10. The ensemble method is composed of only feature space classifiers. In the experiments, we provide the same amount of processing, hyperparameter tuning, and decisionmaking resources to the two approaches. The only difference is that the ensemble method uses only feature space information, whereas SECRET uses both feature and semantic space information. We analyze the ensembles (formed with MLP and RF algorithms) and compare them with SECRET next.
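The structural difference between the two approaches can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the combination rule used in the experiments: the label vectors are hypothetical two-dimensional stand-ins for word embeddings, the `exp(-distance)` scoring and the equal-weight averaging in `fuse` are assumed for simplicity, and all scores are made up.

```python
import math


def semantic_scores(predicted_vec, label_vectors):
    """Convert Euclidean distances to label vectors into scores summing to 1."""
    # Closer label vectors get larger weights (assumed exp(-distance) form).
    weights = {label: math.exp(-math.dist(predicted_vec, vec))
               for label, vec in label_vectors.items()}
    total = sum(weights.values())
    return {label: w / total for label, w in weights.items()}


def fuse(scores_a, scores_b):
    """Average two per-label score dictionaries and return the top label."""
    combined = {label: 0.5 * (scores_a[label] + scores_b[label])
                for label in scores_a}
    return max(combined, key=combined.get)


# Hypothetical 2-D label vectors; real word embeddings have far more dimensions.
label_vectors = {
    "no use": (0.0, 0.0),
    "long-term methods": (3.0, 0.0),
    "short-term methods": (3.0, 1.0),
}
# Hypothetical confidence scores from two feature-space classifiers.
first_block = {"no use": 0.20, "long-term methods": 0.41,
               "short-term methods": 0.39}
second_block = {"no use": 0.20, "long-term methods": 0.42,
                "short-term methods": 0.38}
# Hypothetical output of a semantic-space regressor for the same datapoint.
predicted_vec = (3.0, 0.9)

# Ensemble: two feature-space blocks. SECRET-like: feature + semantic blocks.
ensemble_label = fuse(first_block, second_block)
secret_like_label = fuse(first_block,
                         semantic_scores(predicted_vec, label_vectors))
print(ensemble_label, "|", secret_like_label)
```

In this toy setting the second feature-space block repeats nearly the same evidence as the first, whereas the semantic-space block contributes a different signal. Whether that signal helps on real data is an empirical question, which the experiments in this section address.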
Fig. 11 shows the accuracy and F1 scores of the traditional ensemble method and SECRET on the ten datasets. The ensemble is built using an MLP classifier whose performance is maximized with the best set of hyperparameter values. Then this classification performance is enhanced by combining the classifier with another MLP with hyperparameter values that maximize the overall performance of the ensemble. SECRET is built in the same way. However, the feature space classifier is replaced with a regressor that models the data and semantic vector relationship. For nine datasets, SECRET achieves a 0.4 to 12.6% higher accuracy and a 0.4 to 13.8% higher F1 score relative to the ensemble method. As in the experiments described in Section 5.2, while the liver dataset has a 1.0% increase in accuracy with SECRET, it obtains the highest F1 score improvement of 13.8%. For the nursery dataset, both approaches show comparable classification performance. Overall, as indicated by rightward-pointing arrows, SECRET can be seen to outperform the ensemble method.
Fig. 12 shows individual and relative classification performance of the MLP-RF ensemble and SECRET. In six of the datasets (liver, chess, sonar, cmc, heart, and cardio), SECRET improves the classification performance, whereas in the rest, SECRET either obtains the same or less than 0.7% lower performance relative to the ensemble method. While SECRET improves the F1 score by 1.6 to 9.6% for the six datasets, the ensemble method only outperforms SECRET by 0.3 to 0.7% in three datasets.
Fig. 13 presents accuracy and F1 scores of the RF-MLP ensemble and SECRET. If we had not implemented SECRET on top of a traditional supervised (feature space) classifier, but built it from the ground up, the MLP-RF ensemble would yield the same results as RF-MLP. However, since we would like to compare SECRET with the traditional approach, the hyperparameter values are determined by also taking into account the assigned hyperparameter values of the feature space block. Since SECRET determines the semantic space hyperparameters using joint information from the two spaces, for a fair comparison, we provide the same opportunity to the ensemble method while determining the hyperparameter values of the second feature space block. Therefore, while RF hyperparameter values take advantage of the knowledge of MLP hyperparameter values in Fig. 12, MLP hyperparameter values take advantage of the knowledge of RF hyperparameter values in Fig. 13. Due to the small size of the lymph dataset, we were not able to determine the regressor type for it. Therefore, we present SECRET results with both the RF and MLP regressors. However, use of one regressor or the other leads to a significant classification performance improvement or degradation. Due to this instability, we do not use the lymph dataset to draw a conclusion. For the wdbc and liver datasets, the ensemble method has a higher accuracy and F1 score by 0.1 to 0.3%, whereas SECRET has a 0.4 to 7.6% accuracy and 1.2 to 7.5% F1 score improvement on the remaining six datasets. It is remarkable that the maximum amount of performance improvement with the ensemble method is still smaller than the minimum performance improvement with SECRET. For the heart dataset, both approaches have the same result.
Fig. 14 shows the experimental results for the RF-RF ensemble and SECRET. As in the case of Fig. 13, the lymph dataset leads to an inconclusive result due to its size (lymph is the smallest among all the studied datasets) when comparing the classification performance of the ensemble method and SECRET. However, except for the lymph and heart datasets, SECRET provides up to 12.0% accuracy and F1 score improvements on the remaining eight datasets.
From the above experiments, we can conclude that SECRET leads to either significantly higher or comparable classification performance with respect to the ensemble method.
Table II: RF decision node depth variance and classification performance on the cmc dataset
|Approach||Average Variance of RF Node Depth: no use||Average Variance of RF Node Depth: long-term methods||Average Variance of RF Node Depth: short-term methods||Overall Variance of RF Node Depth||Accuracy (%)||F1 Score (%)|
|(built on top of MLP)|
|(built on top of RF)|
|(built on top of MLP)|
|(built on top of RF)|
5.4 RF Decision Node Depth
The data features and semantic relationships among labels correlate well. SECRET takes advantage of this correlation with the help of its ‘semantic space’ component. On the other hand, traditional approaches only utilize data features to maximize class separability, confidence score, etc., of the classification task. Although a dataset includes semantically similar and dissimilar class labels, and therefore ‘easy-to-classify’ and ‘difficult-to-classify’ samples with respect to each other, traditional supervised learning approaches do not utilize this information while building the classifiers. They utilize data features, not the semantic relationships. SECRET, however, benefits from both data features and semantic relationships among labels through joint use of the feature and semantic spaces. If the labels are semantically similar, the data features are also expected to be similar. This leads to ‘difficult-to-classify’ data instances and requires more focused (deeper) distinguishing between the classes. In Sections 5.2 and 5.3, we verified SECRET’s superiority over the traditional approaches through detailed classification performance analyses. In this section, we provide insight into how SECRET’s semantic space RF models differ from the traditional feature space ones. Since SECRET uses meaning-based relationships among labels, it is able to assess ‘easy-to-classify’ and ‘difficult-to-classify’ classes. We expect SECRET to adjust the RF decision node depths according to both semantic relationships among labels and data characteristics, and traditional approaches to adjust them only according to data characteristics. Therefore, we hypothesize that the decision node depth for different classes varies more in SECRET than in the traditional approaches, as SECRET is able to divide the classes into ‘easy-to-classify’ and ‘difficult-to-classify’ groups and focus on ‘difficult-to-classify’ classes in deeper nodes with the help of its semantic space component.
We carry out RF decision node depth experiments on six datasets (cmc, chess, lymph, cardio, nursery, and letter) to validate our hypothesis. The remaining four datasets (sonar, liver, wdbc, and heart) have two classes. When one class is assigned, the other class also gets distinguished. Therefore, the standard deviation of the decision node depth for these four datasets tends to zero, which is not informative. For the six datasets that include three or more classes, we take each decision tree in the RF model and assess the decision nodes, their depth, and assigned classes. Within a tree, we calculate the average decision node depth for each class. We repeat this process for each tree to assess the overall average decision node depth for the RF model.
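The per-class depth computation described above can be sketched as follows. This is a simplified illustration, not the code used in the experiments: trees are represented as hypothetical nested dictionaries, the nodes that assign a class are taken to be the leaves, and the spread across classes is measured with the population variance. All three choices are assumptions made for the example.

```python
from collections import defaultdict
from statistics import mean, pvariance


def leaf_depths(node, depth=0, out=None):
    """Collect the depth of every leaf, grouped by the class it assigns."""
    if out is None:
        out = defaultdict(list)
    if "label" in node:
        out[node["label"]].append(depth)
    else:
        leaf_depths(node["left"], depth + 1, out)
        leaf_depths(node["right"], depth + 1, out)
    return out


def forest_class_depths(trees):
    """Average per-class leaf depth within each tree, then across trees."""
    per_class = defaultdict(list)
    for tree in trees:
        for label, depths in leaf_depths(tree).items():
            per_class[label].append(mean(depths))
    return {label: mean(values) for label, values in per_class.items()}


# Two hypothetical trees over the three cmc class labels.
tree_a = {"left": {"label": "no use"},
          "right": {"left": {"label": "long-term methods"},
                    "right": {"label": "short-term methods"}}}
tree_b = {"left": {"label": "no use"},
          "right": {"left": {"left": {"label": "long-term methods"},
                             "right": {"label": "short-term methods"}},
                    "right": {"label": "short-term methods"}}}

class_depths = forest_class_depths([tree_a, tree_b])
depth_variance = pvariance(class_depths.values())
print(class_depths, depth_variance)
```

A larger value of `depth_variance` means that some classes are resolved near the root while others are resolved only deep in the trees, which is the quantity the hypothesis is concerned with.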
As an example, Table II shows RF decision node depth variance and classification performance on the cmc dataset for both the traditional approaches and SECRET. While the traditional approaches assign ‘no use,’ ‘long-term methods,’ and ‘short-term methods’ at closer node depths by taking data characteristics into account, SECRET uses both the data characteristics and semantic relationships among labels (Fig. 5). As the ‘no use’ class is located farther away (in Euclidean distance) from the ‘long-term methods’ and ‘short-term methods’ classes, SECRET assigns ‘no use’ to shallower depths and focuses on details to distinguish ‘long-term methods’ and ‘short-term methods’ at deeper nodes. As a result of SECRET’s directed attention to ‘easy-to-classify’ and ‘difficult-to-classify’ classes, it outperforms the traditional approaches, as shown in the right column of Table II. For the remaining datasets, we carried out the same analyses. Fig. 15 shows the overall standard deviation of RF decision node depth for ‘Traditional classifier,’ ‘Traditional ensemble,’ and ‘SECRET.’ In Fig. 15(a) and Fig. 15(b), ‘Traditional classifier’ represents the variance of the RF model’s decision node depth. In Fig. 15(a), ‘Traditional ensemble’ and ‘SECRET’ represent the variance of decision node depth of RF models that are built on top of the MLP model, as shown in Fig. 10 and Fig. (b), respectively. As opposed to Fig. 15(a), in Fig. 15(b), ‘Traditional ensemble’ and ‘SECRET’ are built on top of an RF model. In five of the datasets (all except lymph), we observe a larger variance in the overall decision node depth of SECRET compared to the traditional approaches. In line with this observation, SECRET obtains up to 11.5% and 13.5% accuracy and F1 score improvements, respectively, over the traditional classifier and up to 7.3% improvement in both accuracy and F1 score over the traditional ensemble method depicted in Fig. 15(a). For the other case shown in Fig. 15(b), SECRET obtains up to and accuracy and F1 score improvements, respectively, over the traditional classifier and up to improvement in both accuracy and F1 score over the traditional ensemble method. For the letter dataset, we observe comparable performance (maximum decrease in accuracy/F1 score) with the traditional approaches. For the lymph dataset, while RF node depth variance is smaller for SECRET, we observe to improvement over the traditional approaches. This is inconclusive. Because the lymph dataset also yields inconclusive results throughout Section 5 due to its size, we do not discuss it further.
Overall, a larger variance in RF node depth indicates that SECRET is distinguishing ‘easy-to-classify’ and ‘difficult-to-classify’ cases more clearly than the traditional approaches and focusing on detailed properties at deeper nodes to separate the ‘difficult-to-classify’ cases further. As a result, we observe an enhancement in classification performance with SECRET. This is commensurate with our hypothesis.
6 Related Work
Feature space approaches map data (image, text, audio, physiological signal, etc.) to discrete labels without considering the label relationships. The semantic space, however, maps data to vector representations of labels to capture the meaning information within the labels. We focus on related studies in both spaces next.
Enhancing the classification performance of ML algorithms has been a well-targeted area of research for decades. Various approaches have been proposed. These include data augmentation, data generation, boosting, ensemble learning, and dimensionality reduction. In addition to these promising techniques, various ML algorithms (information-based, similarity-based, probability-based, and error-based) and architectures have been designed. Specifically, for big data, neural network models have revolutionized the classification task due to their ability to model complex data-label relationships. Although these algorithms and techniques have made significant contributions to enhancing classification performance, they all operate in the feature space. SECRET closes the gap between the feature and semantic spaces.
If we change our perspective and look at the related work in the semantic space, we observe that word representations have been widely used in NLP applications. Liu et al. proposed a novel task-oriented word embedding method to assess the salient words for the text classification task. All analyses are carried out in the semantic space. Kusner et al. introduced a novel distance metric (Word Mover’s Distance) to effectively model text documents with a set of word vectors. Vector representations act as features in a traditional classification task and are mapped to a pre-defined set of labels with the k-Nearest Neighbor algorithm. This is a feature space approach since word vectors are used as features and mapped to a specific set of labels, without considering the meaning-based relationships among labels. Bordes et al. targeted question answering by representing the question in vector form in the semantic space and mapping it to the answer, again in the semantic space. Bengio and Heigold moved far away from the semantic space by training vector representations of words without considering their meaning relationships, targeting instead how similar the words sound. Vectors of sound-alike (not semantically similar) words have a smaller Euclidean distance between them. Palatucci et al. and Socher et al. carried out zero-shot learning by mapping real-world data to semantic vector representations of words. Karpathy and Fei-Fei obtained figure captions using image datasets and word embeddings. These zero-shot learning and captioning approaches are limited to the semantic space. They only targeted correlations between data features and semantic relationships.
Overall, the above-mentioned approaches have had a significant influence on the development of NLP applications; however, they exploit either the feature space or the semantic space when performing classification. SECRET integrates these two spaces. Thus, SECRET is differentiated from previous work and looks at real-world classification tasks in a new way.
7 Conclusion and Future Work
In this article, we introduced a new dimension (semantic space) to the feature space based decisionmaking employed in ML algorithms and encapsulated it in a dual-space classification approach called SECRET. As opposed to traditional approaches, SECRET maps data to labels while integrating meaning-based relationships among labels. We analyzed SECRET’s classification performance on ten datasets representing different real-world applications. Compared to traditional supervised learning, SECRET achieved up to 13.9% accuracy and 13.5% F1 score improvements. Compared to ensemble methods, SECRET achieved up to 12.6% accuracy and 13.8% F1 score improvements. We also took a step toward understanding how SECRET builds the semantic space component and its impact on overall classification performance. We posit that, in future work, further improvements in SECRET’s overall classification performance and feature/semantic space characteristics can be made as follows. First, further analyses of different datasets are needed to support the broad applicability of SECRET. Second, although MLP and RF are well-known supervised ML algorithms, other ML algorithms need to be analyzed in this context. Third, semantic vectors could be trained specifically for SECRET and the corresponding application of interest, as done in the case of intrinsic and extrinsic analyses in NLP. Finally, in addition to the feature and semantic spaces, other information sources for classification should be explored.
-  J. Hirschberg and C. D. Manning, “Advances in natural language processing,” Science, vol. 349, no. 6245, pp. 261–266, 2015.
-  M. Palatucci, D. Pomerleau, G. E. Hinton, and T. M. Mitchell, “Zero-shot learning with semantic output codes,” in Proc. Advances in Neural Inf. Process. Syst., 2009, pp. 1410–1418.
-  A. Karpathy and L. Fei-Fei, “Deep visual-semantic alignments for generating image descriptions,” in
-  K. Hornik, M. Stinchcombe, and H. White, “Multilayer feedforward networks are universal approximators,” Neural Networks, vol. 2, no. 5, pp. 359–366, 1989.
-  L. Breiman, “Random forests,” Machine Learning, vol. 45, no. 1, pp. 5–32, 2001.
-  J. Bergstra, D. Yamins, and D. D. Cox, “Making a science of model search: Hyperparameter optimization in hundreds of dimensions for vision architectures,” J. Machine Learning Research, vol. 28, 2013.
-  B. Shahriari, K. Swersky, Z. Wang, R. P. Adams, and N. De Freitas, “Taking the human out of the loop: A review of Bayesian optimization,” Proc. IEEE, vol. 104, no. 1, pp. 148–175, 2016.
-  T. Mikolov, K. Chen, G. Corrado, and J. Dean, “Efficient estimation of word representations in vector space,” arXiv preprint arXiv:1301.3781, 2013.
-  J. Pennington, R. Socher, and C. Manning, “GloVe: Global vectors for word representation,” in Proc. Conf. Empirical Methods in Natural Language Process., 2014, pp. 1532–1543.
-  A. Mnih and K. Kavukcuoglu, “Learning word embeddings efficiently with noise-contrastive estimation,” in Proc. Advances in Neural Inf. Process. Syst., 2013, pp. 2265–2273.
-  R. Lebret and R. Collobert, “Word emdeddings through Hellinger PCA,” arXiv preprint arXiv:1312.5542, 2013.
-  T. Mikolov, M. Karafiát, L. Burget, J. Černockỳ, and S. Khudanpur, “Recurrent neural network based language model,” in Proc. Eleventh Annual Conf. Int. Speech Commun. Association, 2010.
-  J. D. Kelleher, B. Mac Namee, and A. D’Arcy, Fundamentals of Machine Learning for Predictive Data Analytics. MIT Press, 2015.
-  M. Kusner, Y. Sun, N. Kolkin, and K. Weinberger, “From word embeddings to document distances,” in Proc. Int. Conf. Machine Learning, 2015, pp. 957–966.
-  Q. Liu, H. Huang, Y. Gao, X. Wei, Y. Tian, and L. Liu, “Task-oriented word embedding for text classification,” in Proc. Int. Conf. Computational Linguistics, 2018, pp. 2023–2032.
-  F. Sebastiani, “Machine learning in automated text categorization,” ACM Computing Surveys, vol. 34, no. 1, pp. 1–47, 2002.
-  A. Bordes, J. Weston, and N. Usunier, “Open question answering with weakly supervised embedding models,” in Proc. Joint European Conf. Machine Learning and Knowledge Discovery in Databases. Springer, 2014, pp. 165–180.
-  S. Bengio and G. Heigold, “Word embeddings for speech recognition,” in Proc. Fifteenth Annual Conf. Int. Speech Commun. Association, 2014, pp. 1053–1057.
-  D. Dua and E. Karra Taniskidou, “UCI machine learning repository,” 2017. [Online]. Available: http://archive.ics.uci.edu/ml
-  “GloVe pre-trained word vectors,” https://nlp.stanford.edu/projects/glove/, accessed: 02-10-2019.
-  “word2vec pre-trained word vectors,” https://code.google.com/archive/p/word2vec/, accessed: 02-10-2019.
-  “word2vec pre-trained word vectors for biomedical applications,” http://bio.nlplab.org, accessed: 02-10-2019.
-  M. C. Motwani, M. C. Gadiya, R. C. Motwani, and F. C. Harris, “Survey of image denoising techniques,” in Proc. Int. Pervasive Signal Process. Conf. Exhibition, vol. 2004, 2004, pp. 27–30.
-  S. L. Joshi, R. A. Vatti, and R. V. Tornekar, “A survey on ECG signal denoising techniques,” in Proc. IEEE Int. Conf. Commun. Syst. Network Technologies, 2013, pp. 60–64.
-  A. Kandaswamy, V. Krishnaveni, S. Jayaraman, N. Malmurugan, and K. Ramadoss, “Removal of ocular artifacts from EEG - A survey,” Institution of Electron. Telecommunication Eng. J. Research, vol. 51, no. 2, pp. 121–130, 2005.
-  J. Mohan, V. Krishnaveni, and Y. Guo, “A survey on the magnetic resonance image denoising methods,” Biomedical Signal Process. Control, vol. 9, pp. 56–69, 2014.
-  V. Hodge and J. Austin, “A survey of outlier detection methodologies,” Artificial Intelligence Review, vol. 22, no. 2, pp. 85–126, 2004.
-  “Feature engineering for deep learning,” https://www.ibm.com/developerworks/community/blogs/jfp/entry/Feature_Engineering_For_Deep_Learning?lang=en, accessed: 02-11-2019.
-  “Bayesian optimization framework,” https://github.com/fmfn/BayesianOptimization, accessed: 03-06-2019.
-  F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay, “Scikit-learn: Machine learning in Python,” J. Machine Learning Research, vol. 12, pp. 2825–2830, 2011.
-  D. A. Van Dyk and X.-L. Meng, “The art of data augmentation,” J. Computational and Graphical Statistics, vol. 10, no. 1, pp. 1–50, 2001.
-  H. He, Y. Bai, E. A. Garcia, and S. Li, “ADASYN: Adaptive synthetic sampling approach for imbalanced learning,” in Proc. IEEE Int. Joint Conf. on Neural Networks, 2008, pp. 1322–1328.
-  Y. Freund, R. Schapire, and N. Abe, “A short introduction to boosting,” Journal-Japanese Society For Artificial Intelligence, vol. 14, no. 771-780, p. 1612, 1999.
-  T. G. Dietterich et al., “Ensemble learning,” The Handbook of Brain Theory and Neural Networks, vol. 2, pp. 110–125, 2002.
-  L. Van Der Maaten, E. Postma, and J. Van den Herik, “Dimensionality reduction: A comparative,” J. Machine Learning Research, vol. 10, no. 66-71, p. 13, 2009.
-  M. Z. Alom, T. M. Taha, C. Yakopcic, S. Westberg, P. Sidike, M. S. Nasrin, M. Hasan, B. C. Van Essen, A. A. Awwal, and V. K. Asari, “A state-of-the-art survey on deep learning theory and architectures,” Electronics, vol. 8, no. 3, p. 292, 2019.
-  R. Socher, M. Ganjoo, C. D. Manning, and A. Ng, “Zero-shot learning through cross-modal transfer,” in Proc. Advances in Neural Information Processing Syst., 2013, pp. 935–943.
-  P. Resnik and J. Lin, “Evaluation of NLP systems,” The Handbook of Computational Linguistics and Natural Language Processing, vol. 57, pp. 271–295, 2010.
-  M. Zhai, J. Tan, and J. D. Choi, “Intrinsic and extrinsic evaluations of word embeddings.” in Proc. Association for the Advancement of Artificial Intell., 2016, pp. 4282–4283.