## I Introduction

Wireless networks are rapidly growing in size, are becoming increasingly distributed, and are being granted access to an increasingly wide frequency spectrum. In the next generation of wireless cellular networks, namely 5G, tens of small cells, hundreds of mobile users demanding ultra-high data rates, and thousands of Internet-of-Things (IoT) and machine-type communication (MTC) devices will all be operating within the coverage of a single cell [3gpp-5g-2]. Furthermore, 5G systems will be deployed across the extensive available frequency spectrum, including the bands below 6 GHz, the radio frequency (RF) band, as well as bands around 30 GHz, the millimeter wave (mm-wave) band [3gpp-5g-3]. However, one of the grand challenges in deploying 5G systems is to ensure coexistence with current 4G systems in the foreseeable future. Tight interworking between 4G and 5G and dynamic spectrum sharing between these two systems are key to a smooth migration towards 5G [lin2019debunking]. This necessitates decentralized, real-time, data-driven techniques for user management in cellular networks, including cell association, that can scale with the massive number of users in future systems.

In general, in heterogeneous networks, including coexisting 4G/5G networks in sub-6 GHz bands, classical methods for cell association based only on received signal power are not interference aware and hence may lead to severe traffic load imbalance [hossain2014evolution]. To this end, various solutions have been proposed in the literature in order to address interference awareness [sangiamwong2011investigation], traffic load awareness [guvenc2011capacity], and resource awareness [oh2012cell]. However, such network-controlled centralized algorithms for user management, which would also need to be coordinated between the 4G and 5G networks, introduce extra delay to the system and may not guarantee quality of service (QoS) constraints, especially for cell-edge users. Instead of taking such a network-controlled approach, we model the cell association problem as a boundary detection problem and propose a scalable data-driven solution in the physical layer, utilizing data that can be collected by a large number of low-cost spectrum sensors deployed in the field. Deploying such spectrum sensors has been considered in various scenarios such as cognitive radio networks [cabric2004implementation, ghasemi2008spectrum, sepidband2015cmos].

The problem of coverage detection has been modeled and studied as a boundary detection problem in various contexts, including sensor networks [wang2006boundary] and cognitive radio networks [yang2010cooperative, ding2013kernel]. However, in order to make this model applicable to cellular networks, and in particular to coexisting 4G/5G networks, in this paper we extend the model in several important respects. We consider two base stations (BSs) which can have different powers and coverage areas. Also, each sensor in the field declares one of three possibilities: it either declares the stronger of the two BSs when there is coverage, or it declares that there is no sufficiently strong coverage from either BS. The analysis extends in a straightforward way to cases with more than two BSs and, consequently, more than three possible declared labels.

Of particular relevance to our paper is the work of [ding2013kernel], where the boundary detection problem has been tackled using kernel-based methods. While kernel methods are popular modeling tools in machine learning and statistics [friedman2001elements], they suffer from scalability issues, i.e., they scale poorly with respect to the number of sensors. We will use randomized features [rahimi2007random, rahimi2009weighted] to reduce the computational cost of kernel methods. In particular, we adapt the data-driven random feature method of [wang2019general] to the considered boundary detection problem. We show in numerical experiments that the data-driven solution consistently outperforms data-independent methods (in prediction accuracy) and enjoys a much lower training cost compared to kernel methods.

## II Problem Statement

In this section, we describe the boundary detection problem for the case of two base stations (BSs). Note that we refer to the cellular transmitter, which is often called eNB in 4G networks and gNB in 5G networks, as the BS regardless of whether it belongs to the 4G or 5G network. The idea described here generalizes to more than two BSs; the simplification is only for presentation clarity. Let us now consider Fig. 1 as an example of the boundary detection problem with two BSs (namely, BS1 and BS2). Each BS has a ground-truth coverage boundary between its corresponding covered area and the rest of the plane. As illustrated in Fig. 1, the ground-truth boundaries may be highly nonlinear (irregular). An irregular radio coverage model (for the case of one BS) was proposed by [ding2013kernel] to generalize the setup of [yang2010cooperative]. This was motivated by the fact that signal attenuation due to obstructions (e.g., buildings) can affect the shape of the boundaries. In the considered cases with more than one BS, the interference from neighboring cells, especially in the areas in between (also referred to as the cell edge), depends on the power and the traffic in these cells. Consequently, this becomes another factor contributing to the nonlinearity of the coverage boundaries.

Consider now $n$ spectrum sensors, randomly distributed in a 2D area, where sensor $i$ is located at $\mathbf{x}_i \in \mathbb{R}^2$. Each sensor is equipped with an energy detector, identifying the coverage from the transmitters. In this example, this identification at a particular sensor can lead to three possibilities: (i) the sensor senses strong coverage from BS1 (blue/plus points); (ii) the sensor senses strong coverage from BS2 (green/circle points); (iii) the sensor does not declare a sufficiently strong coverage from either BS (red/square points). As illustrated in Fig. 1, some of the declarations are prone to errors due to the hardware constraints of the spectrum sensors and radio channel randomness. Indeed, no knowledge is assumed about the accuracy of these identifications.
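As a concrete illustration of this setup, the sketch below generates synthetic sensor data of the kind described above. The coverage shapes, power model, detection threshold, and 5% error rate are illustrative assumptions, not the exact setup of Fig. 1.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500                                   # number of spectrum sensors
X = rng.uniform(-1.0, 1.0, size=(n, 2))   # sensor locations in a 2D area

# Illustrative (assumed) received powers from two BSs with irregular coverage:
# a distance-based term plus a smooth angular perturbation mimics non-circular
# boundaries caused by obstructions.
def power(X, center, scale):
    d = np.linalg.norm(X - center, axis=1)
    theta = np.arctan2(X[:, 1] - center[1], X[:, 0] - center[0])
    return scale * (1.0 + 0.3 * np.sin(3 * theta)) - d

p1 = power(X, np.array([-0.5, 0.0]), 0.6)   # BS1
p2 = power(X, np.array([0.5, 0.0]), 0.5)    # BS2 (weaker BS)

# Three-way declaration: 1 = BS1 stronger, 2 = BS2 stronger, 3 = no coverage.
threshold = 0.0
y = np.where(np.maximum(p1, p2) < threshold, 3, np.where(p1 >= p2, 1, 2))

# A small fraction of declarations is flipped to model erroneous sensing.
flip = rng.random(n) < 0.05
y[flip] = rng.integers(1, 4, size=flip.sum())
```

The flipped labels model the hardware constraints and channel randomness mentioned above; any detection method must be robust to them.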

###### Remark 1

Note that our approach is independent of how one defines the term coverage. For instance, it can be defined in terms of a threshold on the signal-to-interference-plus-noise ratio (SINR) averaged over a certain amount of time in a certain band or averaged over a collection of different bands. However, this is not the focus of our paper. In other words, we model the cell association problem as a nonlinear boundary detection problem and take a machine-learning based approach to solve this problem. This approach is robust with respect to erroneous declaration by individual sensor nodes and the irregular shape of cell boundaries regardless of the specific criteria for these declarations.

Let us now denote by $y_i$ the declaration of sensor $i$, assumed to have three possibilities, $y_i \in \{1, 2, 3\}$. In practice, the boundaries are unknown and the objective is to find boundary candidates resulting in minimum detection errors. In machine learning (ML), this problem is equivalent to solving a $3$-class classification such that the mis-classification error

$$\frac{1}{n}\sum_{i=1}^{n} \mathbb{1}\left\{\hat{y}(\mathbf{x}_i) \neq y_i\right\}$$

is minimized, where $\mathbb{1}\{\cdot\}$ is the indicator function and $\hat{y}(\cdot)$ is the predicted coverage by the classifier based on the training data provided by the sensors, i.e., the set $\{(\mathbf{x}_i, y_i)\}_{i=1}^{n}$. The above objective function is non-smooth, and the problem is re-formulated as minimizing a risk functional $R(f)$, defined as [friedman2001elements]

$$R(f) \triangleq \mathbb{E}\left[\ell\big(f(\mathbf{x}), y\big)\right], \tag{1}$$

where $\ell(\cdot, \cdot)$ is a specific loss function (e.g., the hinge loss $\ell\big(f(\mathbf{x}), y\big) = \max\{0, 1 - y f(\mathbf{x})\}$ for the Support Vector Machine (SVM) when $y$ is binary), and the expectation is taken with respect to the data distribution $P_{\mathbf{x}, y}$. As $P_{\mathbf{x}, y}$ is unknown, we can only minimize the empirical risk $\hat{R}(f)$ instead of the true risk $R(f)$, and calculate the gap between the two using standard arguments from measures of function space complexity (e.g., Vapnik–Chervonenkis (VC) dimension [friedman2001elements], Rademacher complexity [bartlett2002rademacher], etc.). To minimize the risk functional, we need to assume a function class for $f$. Although the current ML literature has focused on deep learning methods for such modeling, they often involve a huge number of parameters. This is an unnecessary complication for the boundary detection problem, which already lives on a low-dimensional manifold (in this case, sensors are located in the 2D plane). Another popular approach in ML and statistics is the kernel method [hofmann2008kernel], where

$$f(\mathbf{x}) = \sum_{i=1}^{n} \alpha_i\, k(\mathbf{x}, \mathbf{x}_i), \tag{2}$$

and $k(\cdot, \cdot)$ is a symmetric positive-definite function called a kernel. The coefficients $\{\alpha_i\}_{i=1}^{n}$ are unknown and will be learned via empirical risk minimization. Kernel-based boundary detection was proposed by [ding2013kernel] for the case of one BS. However, off-the-shelf kernel methods are not suitable for our problem since they scale prohibitively with respect to the number of sensors $n$. In particular, minimizing the empirical risk (i.e., optimizing over $\{\alpha_i\}_{i=1}^{n}$) with kernel methods requires $O(n^2)$ in space and $O(n^3)$ in time [friedman2001elements]. Also, the choice of the kernel to use for modeling is a key decision, which naturally depends on the data. In this paper, we are interested in the following problem:
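The quadratic space cost can be seen directly from the Gram matrix that any kernel method must (implicitly or explicitly) work with. The sketch below, with an assumed Gaussian kernel and an arbitrary width, forms this dense $n \times n$ matrix:

```python
import numpy as np

def gaussian_gram(X, sigma=0.3):
    """Dense n-by-n Gram matrix K[i, j] = exp(-||x_i - x_j||^2 / (2 sigma^2)).

    Storage is O(n^2); solving the resulting linear systems (e.g., Newton
    steps of kernel logistic regression) costs O(n^3) time.
    """
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T   # squared pairwise distances
    return np.exp(-np.maximum(d2, 0.0) / (2.0 * sigma**2))

X = np.random.default_rng(1).uniform(-1, 1, size=(200, 2))
K = gaussian_gram(X)
# The diagonal is exactly 1 and K is symmetric positive semi-definite.
```

Doubling the number of sensors quadruples the memory for `K`, which is what motivates the randomized approximation of the next section.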

###### Problem 1

Propose a data-driven solution for kernel selection that improves the boundary detection, i.e., one that dominates data-independent kernel methods for the task of boundary detection.

## III Data-Driven Random Features

Despite the popularity of kernel methods for approximation, their poor scalability with respect to the size of the data has limited their application in large-scale learning. To improve the computational efficiency, we use randomized approximation [rahimi2007random], focusing on kernels of the form

$$k(\mathbf{x}, \mathbf{x}') = \mathbb{E}_{p(\boldsymbol{\omega})}\left[\phi(\mathbf{x}; \boldsymbol{\omega})\,\phi(\mathbf{x}'; \boldsymbol{\omega})\right] \approx \frac{1}{M}\sum_{m=1}^{M} \phi(\mathbf{x}; \boldsymbol{\omega}_m)\,\phi(\mathbf{x}'; \boldsymbol{\omega}_m), \tag{3}$$

where $\phi(\cdot\,; \boldsymbol{\omega})$ is an activation function (also called a basis), and $\{\boldsymbol{\omega}_m\}_{m=1}^{M}$ are independent samples (called random features) from a given distribution $p(\boldsymbol{\omega})$ (Monte-Carlo sampling).

A wide variety of kernels can be approximated via (3) (see, e.g., [yang2014random]). Table I presents a number of common kernel functions and their corresponding sampling distributions $p(\boldsymbol{\omega})$. Observe that these kernels are generally defined for $\mathbf{x} \in \mathbb{R}^d$, but in this paper $d = 2$ as the sensors are located in a 2D area. $x_j$ (respectively, $\omega_j$) denotes the $j$-th element of the vector $\mathbf{x}$ (respectively, $\boldsymbol{\omega}$). Unbiased kernel estimators are formed with random features sampled from these distributions and evaluated on a cosine feature map, except for the linear kernel, where $\phi(\mathbf{x}; \boldsymbol{\omega}) = \boldsymbol{\omega}^\top \mathbf{x}$.

| Kernel | $k(\mathbf{x}, \mathbf{x}')$ | $p(\boldsymbol{\omega})$ |
|---|---|---|
| Gaussian | $e^{-\frac{1}{2}\Vert\mathbf{x}-\mathbf{x}'\Vert_2^2}$ | $(2\pi)^{-d/2}\, e^{-\frac{1}{2}\Vert\boldsymbol{\omega}\Vert_2^2}$ |
| Linear | $\mathbf{x}^\top \mathbf{x}'$ | $\mathcal{N}(\mathbf{0}, \mathbf{I}_d)$ |
| Laplacian | $e^{-\Vert\mathbf{x}-\mathbf{x}'\Vert_1}$ | $\prod_{j=1}^{d} \frac{1}{\pi(1+\omega_j^2)}$ |
| Cauchy | $\prod_{j=1}^{d} \frac{2}{1+(x_j-x'_j)^2}$ | $e^{-\Vert\boldsymbol{\omega}\Vert_1}$ |
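To make the Monte-Carlo approximation in (3) concrete, the following sketch checks numerically that the cosine feature map with Gaussian-sampled frequencies reproduces the Gaussian kernel. The two sample points and the feature count are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
d, M = 2, 20000                        # input dimension, number of random features
omega = rng.standard_normal((M, d))    # omega ~ N(0, I_d): Gaussian kernel, width 1
b = rng.uniform(0.0, 2.0 * np.pi, M)   # b ~ U[0, 2*pi]

def features(x):
    # Cosine feature map: phi(x; omega_m, b_m) = sqrt(2) * cos(omega_m . x + b_m)
    return np.sqrt(2.0) * np.cos(omega @ x + b)

x1 = np.array([0.3, -0.2])
x2 = np.array([-0.1, 0.4])
approx = features(x1) @ features(x2) / M              # Monte-Carlo estimate as in (3)
exact = np.exp(-np.linalg.norm(x1 - x2) ** 2 / 2.0)   # exact Gaussian kernel value
# As M grows, the estimate concentrates around the exact kernel value at rate
# O(1/sqrt(M)), which is the standard Monte-Carlo behavior.
```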

Since the main focus of this paper is on the Gaussian kernel, note that we can use the following approximation

$$k(\mathbf{x}, \mathbf{x}') = e^{-\frac{\Vert\mathbf{x}-\mathbf{x}'\Vert_2^2}{2\sigma^2}} \approx \frac{1}{M}\sum_{m=1}^{M} 2\cos(\boldsymbol{\omega}_m^\top \mathbf{x} + b_m)\cos(\boldsymbol{\omega}_m^\top \mathbf{x}' + b_m),$$

where $\{\boldsymbol{\omega}_m\}_{m=1}^{M}$ come from a multi-variate Gaussian distribution $\mathcal{N}(\mathbf{0}, \sigma^{-2}\mathbf{I})$ and $\{b_m\}_{m=1}^{M}$ come from a uniform distribution on $[0, 2\pi]$. In this case, the function class will take the form

$$f(\mathbf{x}) = \sum_{m=1}^{M} \theta_m \sqrt{2}\cos(\boldsymbol{\omega}_m^\top \mathbf{x} + b_m), \tag{4}$$

where the unknown parameters $\{\theta_m\}_{m=1}^{M}$ will be learned by minimizing the empirical risk (1). The above approximation is also called a shallow network [rahimi2009weighted], provably far more efficient to train compared to (2) [rudi2016generalization]. More specifically, the training would now require $O(nM^2)$ in time, which is linear with respect to the number of sensors $n$ and significantly smaller than the $O(n^3)$ of kernel methods when $M \ll n$.
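A minimal end-to-end sketch of training this shallow network is given below. The toy labels, kernel width, step size, and iteration count are all illustrative assumptions; each gradient step costs only $O(nM)$, linear in the number of sensors.

```python
import numpy as np

rng = np.random.default_rng(3)
n, M = 400, 100
sigma = 0.5                                    # kernel width (assumed)

# Toy binary task (an illustrative stand-in for one sensor class): +1 inside a disc.
X = rng.uniform(-1, 1, size=(n, 2))
y = np.where(np.linalg.norm(X, axis=1) < 0.8, 1.0, -1.0)

# Random Fourier features approximating a Gaussian kernel of width sigma, as in (4).
omega = rng.standard_normal((2, M)) / sigma
b = rng.uniform(0, 2 * np.pi, M)
Z = np.sqrt(2.0 / M) * np.cos(X @ omega + b)   # n x M feature matrix

# Empirical-risk minimization of the logistic loss over theta by gradient descent.
theta = np.zeros(M)
for _ in range(2000):
    margins = y * (Z @ theta)
    grad = -(Z * (y / (1.0 + np.exp(margins)))[:, None]).mean(axis=0)
    theta -= 4.0 * grad

train_accuracy = np.mean(np.sign(Z @ theta) == y)
```

Note that the fitted boundary is nonlinear in the sensor coordinates even though the model is linear in the random features, which is exactly what the boundary detection problem requires.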

Algorithm 1 (DDRF): data-driven random features.

Input: sensor data $\{(\mathbf{x}_i, y_i)\}_{i=1}^{n}$, an integer $M_0$ (pool size), an integer $M \le M_0$ (number of retained features), variance $\sigma^2$.

1. Draw $M_0$ independent samples $\{(\boldsymbol{\omega}_m, b_m)\}_{m=1}^{M_0}$ with $\boldsymbol{\omega}_m \sim \mathcal{N}(\mathbf{0}, \sigma^{-2}\mathbf{I})$ and $b_m$ uniform on $[0, 2\pi]$.
2. Compute the score $q_m = \big|\frac{1}{n}\sum_{i=1}^{n} y_i \sqrt{2}\cos(\boldsymbol{\omega}_m^\top \mathbf{x}_i + b_m)\big|$ for each $m \in \{1, \ldots, M_0\}$.
3. Retain the $M$ random features with the largest scores $q_m$.
4. Form $\mathbf{Z} \in \mathbb{R}^{n \times M}$ with entries $Z_{im} = \sqrt{2/M}\,\cos(\boldsymbol{\omega}_m^\top \mathbf{x}_i + b_m)$ over the retained features.

Output: The transformed sensor data matrix $\mathbf{Z}$.
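A compact sketch of this procedure is given below. It reproduces Algorithm 1 only approximately: we assume the EERF score $\big|\frac{1}{n}\sum_i y_i\,\phi(\mathbf{x}_i; \boldsymbol{\omega})\big|$ with binary labels and top-$M$ selection, and the pool size, feature count, and toy labels are illustrative.

```python
import numpy as np

def eerf_features(X, y, M0=500, M=50, sigma=0.3, seed=0):
    """EERF-style data-driven random features (a sketch of Algorithm 1):
    draw a pool of M0 candidate features, score each candidate by its
    empirical alignment |1/n sum_i y_i phi(x_i; w)| with the binary labels
    y in {-1, +1}, and keep the M highest-scoring features."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    omega = rng.standard_normal((d, M0)) / sigma     # pool from N(0, I / sigma^2)
    b = rng.uniform(0, 2 * np.pi, M0)
    Phi = np.sqrt(2.0) * np.cos(X @ omega + b)       # n x M0 candidate features
    scores = np.abs(Phi.T @ y) / n                   # alignment score per feature
    top = np.argsort(scores)[-M:]                    # indices of the M best features
    return Phi[:, top] / np.sqrt(M)                  # transformed data, n x M

X = np.random.default_rng(4).uniform(-1, 1, size=(300, 2))
y = np.sign(X[:, 0])                                 # toy binary labels (assumed)
Z = eerf_features(X, y)
```

The output matrix `Z` plays the role of the transformed sensor data fed into the downstream classifier.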

### III-A Risk Minimization with 3 Classes

We use the logistic loss $\ell\big(f(\mathbf{x}), y\big) = \log\big(1 + e^{-y f(\mathbf{x})}\big)$ in (1), designed for binary classification. We follow the one-versus-all principle for the multi-class classification problem at hand. Since we have three classes, we decompose the problem into three binary classification problems (though two binary classifiers would be enough). Each binary classifier detects one class against the other two. Returning to Fig. 1 as an example, if one binary classifier declares that a point is not red, and another binary classifier declares that the point is not green, we can easily assign that point to the blue category.
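The one-versus-all decomposition can be sketched as follows. The toy three-class data, learning rate, and iteration count are assumptions for illustration only.

```python
import numpy as np

def fit_binary(Z, t, steps=500, lr=1.0):
    """One binary logistic classifier (targets t in {-1, +1}), gradient descent."""
    theta = np.zeros(Z.shape[1])
    for _ in range(steps):
        m = t * (Z @ theta)
        theta -= lr * -(Z * (t / (1.0 + np.exp(m)))[:, None]).mean(axis=0)
    return theta

def one_vs_all(Z, y, classes=(1, 2, 3)):
    # One classifier per class (class c against the other two); predict the
    # class whose classifier is most confident.
    thetas = [fit_binary(Z, np.where(y == c, 1.0, -1.0)) for c in classes]
    scores = np.column_stack([Z @ th for th in thetas])
    return np.asarray(classes)[np.argmax(scores, axis=1)]

# Toy three-class feature data (assumed for illustration): the label is the
# index of the largest of three features, so each class is linearly scorable.
rng = np.random.default_rng(5)
Z = rng.standard_normal((300, 3))
y = 1 + np.argmax(Z, axis=1)

y_pred = one_vs_all(Z, y)
train_accuracy = np.mean(y_pred == y)
```

Breaking ties by the most confident classifier (rather than hard not-red/not-green rules) handles the case where two classifiers both claim a point.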

Data-driven random features: We will build on model (4) for each binary classifier. A major impediment in exploiting the model is the fact that it implicitly transforms the inputs (locations of sensors) to a space of higher dimension, which can result in detection algorithms that are not computationally viable. In other words, the dimension $M$ of the new space is much larger than that of the original space ($d = 2$). The underlying mathematical intuition is that Monte-Carlo sampling of the random features $\{\boldsymbol{\omega}_m\}$ is purely random, and as a result one only approximates the kernel well when $M$ is large, translating into risk minimization over a high-dimensional space. We will use data-driven randomization to improve the modeling quality.

In particular, we employ the general data-driven score for sampling random features introduced in [wang2019general], where the initial distribution is re-weighted according to a score function parameterized by a task-specific positive semi-definite matrix. Here, we will focus on the choice of this matrix that recovers the energy-based exploration of random features (EERF) [shahrampour2018data], whose score is the empirical alignment $\big|\frac{1}{n}\sum_{i=1}^{n} y_i\, \phi(\mathbf{x}_i; \boldsymbol{\omega})\big|$ of each candidate feature with the labels. The method is outlined in Algorithm 1. By running this algorithm, we transform (hypothetically) each sensor location from $\mathbb{R}^2$ to $\mathbb{R}^M$ using the matrix that is the algorithm output. The transformed data will then be fed into a logistic regression model as discussed in Section III-A. The modification is *data-driven* in the sense that it explicitly takes into account the sensors' data in the sampling stage as well as in the detection stage. This is in contrast to classical methods (e.g., [rahimi2009weighted]) that are data-independent in the sampling phase.

## IV Numerical Experiments

We create an artificial dataset according to Fig. 1. We randomly distributed the sensors in a 2D grid for training and considered two ground-truth boundaries as presented in the figure. As we can see, the training data (provided by the sensors) include a number of false declarations with respect to the ground-truth coverage boundaries; therefore, the prediction (on new data) cannot be 100% accurate, i.e., we can only approximate the true boundaries up to some error.

Benchmark Algorithms: Using a multi-class logistic regression with the transformed feature vector, we compared two scenarios for sampling the random features: data-independent vs. data-dependent. The data-independent methods include random kitchen sinks (RKS) [rahimi2009weighted] and orthogonal random features (ORF) [felix2016orthogonal], which has been proved to be superior to plain random features. The data-dependent case is simply DDRF in Algorithm 1.

1) RKS [rahimi2009weighted] with $\boldsymbol{\omega}$ sampled from the Gaussian distribution $\mathcal{N}(\mathbf{0}, \sigma^{-2}\mathbf{I})$ and $b$ sampled from the uniform distribution on $[0, 2\pi]$ approximates a Gaussian kernel with kernel width $\sigma$. We use this set of random features to transform each sensor location from $\mathbb{R}^2$ to $\mathbb{R}^M$. The feature map we use for RKS is $\sqrt{2}\cos(\boldsymbol{\omega}^\top\mathbf{x} + b)$.

2) ORF [felix2016orthogonal] with $\boldsymbol{\omega}$ sampled from a Gaussian distribution and modified through QR decomposition approximates a Gaussian kernel with kernel width $\sigma$. We use this set of orthogonal random features to transform each sensor location. The dimension difference of orthogonal random features results from the two-dimensional feature map we use, which is $\big[\cos(\boldsymbol{\omega}^\top\mathbf{x}), \sin(\boldsymbol{\omega}^\top\mathbf{x})\big]$.

Practical considerations: The variance of the random features is set to the inverse of the mean distance to the $k$-th nearest neighbour (in Euclidean distance), following [felix2016orthogonal]; this rule determines the kernel width used in this simulation. The pool size $M_0$ for data-dependent sampling is set as a function of the number $M$ of random features used in the classification.

Performance: The detection accuracy in Fig. 2 and the standard errors in Table II are averaged over 30 simulations. As we can observe in Fig. 2, the data-dependent method dominates the plain random features in terms of the detection accuracy on the test sensor data. Although ORF can perform on par with DDRF in the saturated regime, DDRF still shows a significant boost in accuracy in the sparse regime; for example, in the first row of Table II, the accuracy of DDRF (0.82) is significantly better than that of RKS (0.60).

| Algorithm | DDRF | RKS | ORF |
|---|---|---|---|
| | 0.82 (0.011) | 0.60 (0.027) | 0.68 (0.019) |
| | 0.90 (0.002) | 0.71 (0.027) | 0.85 (0.010) |
| | 0.90 (0.0008) | 0.85 (0.011) | 0.90 (0.002) |
| | 0.91 (0.0006) | 0.88 (0.010) | 0.91 (0.0005) |

Time cost: The training times of the three random-feature based algorithms are tabulated in Table III. The table also includes the time cost of kernel logistic regression. The training time of the random-feature based methods is substantially lower than that of kernel logistic regression. In particular, the training time of DDRF is roughly 15.6% of that of the kernel algorithm. Note that this simulation involves a relatively small number of sensors, and the difference will be much more significant for larger values of $n$.

| Algorithm | DDRF | RKS | ORF | Kernel |
|---|---|---|---|---|
| Time Cost | 0.036 | 0.032 | 0.043 | 0.23 |

## V Conclusion

In this paper, we considered the problem of cell association for cellular users and showed how this problem can be modeled as a nonlinear boundary detection problem. We then proposed a scalable solution using randomized shallow networks, which utilize data that can be collected by a large number of low-cost spectrum sensors deployed in the field. We also showed how to exploit the power of data-driven modeling in order to reduce the computational cost of training in the proposed solution.

The solution to the boundary detection problem discussed in this paper essentially splits the users into two categories, namely cell-edge users and cell-center users. Eventually, the cell-edge users need to be associated with one of the BSs, and this association can be dynamic (real-time): depending on the network traffic and, more importantly, the density of nearby users assigned to each of the BSs, the cell association can change. Developing real-time classification algorithms to assign cell-edge users to the BSs dynamically is an interesting direction for future work. Furthermore, from a practical point of view, building a test-bed consisting of radio transmitters and spectrum sensors deployed in the field, in order to collect data for training and testing the proposed ML algorithms, is another direction for future work.