I Introduction
Recently, device-to-device (D2D) communication has been extensively studied as a way to provide a better user experience. To implement this technology, one of the key issues is how to share licensed spectrum efficiently without greatly degrading the performance of cellular users (CUs). We consider a cooperative D2D communication scheme, which exploits the advantages of both cooperative relaying and D2D communication [1]. The basic idea is that D2D transmitters (DTs) act as relays for CUs in exchange for transmission opportunities on the CUs’ channels. Thus, a win-win situation is achieved, which motivates CUs to share their spectrum with D2D pairs even if they have no surplus resources.
Most existing works [1, 2, 3, 4] assume complete information, such as channel state information (CSI). However, collecting global information incurs heavy overhead and thus may not be practical in large-scale networks. Besides, some information may be difficult to acquire, such as the CSI between CUs and DTs. Moreover, the latency requirement of some applications is stringent, such as D2D-based vehicle-to-vehicle communications. These facts motivate us to study a distributed resource allocation scheme with incomplete information, where agents make decisions independently based on local information.
Game theory provides a framework to study the interactions of autonomous agents. There have been many game-theoretic solutions in D2D networks [5]. In our context, CUs have preferences over D2D pairs and vice versa. Matching theory offers a suitable tool to study the cooperation between competitive CUs and competitive D2D pairs. There have been some matching-based resource allocation schemes for D2D communication [6, 7, 8]. In this paper, we formulate the problem of pairing CUs with D2D pairs as a one-to-one matching game and seek a stable matching.
In the literature, the authors of [9] have considered the incomplete information scenario, but do not investigate the pairing problem. Besides, a similar cooperative scheme has recently been studied in cognitive radio networks [10, 11, 12, 13, 14, 15], where secondary users (SUs) relay primary users’ (PUs) traffic in return for transmission opportunities. Some works adopt auctions [10], a dynamic Bayesian game [11], or a Stackelberg game [12] to tackle incomplete information. Moreover, the authors of [13] consider incomplete information in the matching game model. However, the above works [10, 11, 12, 13] assume that the PU knows the relay rates, which depend on the SUs’ local information. In practice, such information is usually not known globally. In this paper, we consider a stronger incomplete information scenario, where CUs have no knowledge of the relay rates provided by the D2D pairs. The authors of [14, 15] make a similar information assumption, but only consider the single-PU case. Instead, we consider the case with multiple CUs and multiple D2D pairs.
This paper focuses on uplink resource sharing with incomplete information, because mobile devices are more likely to need help due to their limited power budget. We formulate the pairing problem as a one-to-one matching game, based on the interaction between each CU and each D2D pair. This interaction is described by the Nash bargaining solution (NBS). Because the relay rates are unknown, CUs cannot establish preferences over D2D pairs. Thus, traditional matching algorithms, such as the Gale-Shapley (GS) algorithm, are not suitable for our scenario. To the best of our knowledge, this is the first attempt to address the matching game with unknown preferences. To overcome the difficulty, we convert the matching game to an equivalent noncooperative game. At each period, each CU selects a D2D pair and a corresponding time allocation, and obtains a payoff as feedback. Based on the feedback, we propose a learning algorithm, which is proven to converge to a stable matching in probability. Moreover, the corresponding time allocation converges to the result of NBS with probability 1.
II System Model and Problem Formulation
II-A System Model
We consider uplink resource sharing of a single cell with a base station (BS) denoted by and CUs. The set of CUs is denoted by . Besides, there are D2D pairs, and the set of them is denoted by . Each D2D pair contains one DT and one D2D receiver (DR). In this paper, we assume . However, the proposed algorithm can be applied to the case where . CU has been assigned to one cellular channel, namely channel . There is no dedicated channel for D2D pairs. Therefore, D2D pairs relay the uplink traffic in exchange for access to the cellular channels.
We assume that each CU is assisted by at most one D2D pair, and each D2D pair can relay for at most one CU due to limited battery capacity [2]. Similar to [13, 14, 15], we take the decode-and-forward protocol with parallel channel coding [16] as an example. When CU cooperates with D2D pair , the normalized frame consists of three phases, as shown in Fig. 1. The first two phases both last and are used for the relay transmission for the CU. Specifically, CU first broadcasts its data with power to the BS and DT . Then, DT forwards the received signal to the BS with power . The third phase lasts and is used by DT to transmit its own data with power to DR . We refer to as the time allocation.
The expected rate of CU in direct link is
(1) 
where is the channel gain from CU to the BS and denotes the noise power.
For simplicity, we assume every DT can decode all the CUs’ data in the first phase. Thus, cooperating with D2D pair , the rate of CU in the first two phases is
(2) 
where is the channel gain from DT to the BS on channel . Let , and thus with time allocation , the expected rate of CU during the entire frame is . Moreover, the expected rate of D2D pair during the entire frame is given by
(3) 
where is the channel gain of D2D pair on channel . Assume that for each D2D link, the channel gains are i.i.d. across all the channels. Thus, we have , and the value of is denoted by .
Information Assumption: CU only knows and has no knowledge of and , and D2D pair only knows . After cooperating with D2D pair at period , CU gets a sample following a fixed unknown distribution.
II-B Matching-Based Framework
II-B1 Bargaining Game for CU-D2D Pair
To incentivize CU and D2D pair to cooperate mutually, a bargaining game is used to characterize the interaction between them. If CU cooperates with D2D pair , the CU’s utility and the D2D pair’s utility are defined as
(4)  
(5) 
We use the NBS as the bargaining outcome to determine the time allocation, so that the cooperation satisfies some useful properties and is beneficial for both sides. Hence, based on the concept of NBS [17], the time allocation is given by the following problem
(6a)  
s.t.  (6b) 
where and are the CU’s and the D2D pair’s utilities, respectively, if they fail to reach an agreement. It is natural to set . Thus, problem (6) coincides with the proportional fairness scheme. Constraint (6b) guarantees that both sides have an incentive to participate in the cooperation. Solving problem (6), the optimal time allocation is given by
(7) 
where . Based on (7), a D2D pair with a higher relay rate obtains a larger transmission time. Moreover, it is easy to verify that is an increasing function of , which reflects the fact that the CU prefers to cooperate with the D2D pair offering the higher relay rate. We will use and interchangeably afterwards. When problem (6) is infeasible, for convenience, we still let be the associated time allocation, and thus have in this case.
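As an illustration of the bargaining step, the following is a minimal sketch under assumed notation, since the symbols of (4)-(7) are not reproduced here. It assumes the time allocation is the share of the frame granted to the D2D pair, that the CU's utility is its rate gain over the direct link, that the D2D pair's utility is its own rate, and that both disagreement utilities are zero. The names `r_direct`, `r_relay`, `r_d2d`, and `d2d_share` are hypothetical. Under these assumptions the sketch reproduces the two properties stated above: a higher relay rate yields a larger share for the D2D pair, and the CU's utility at the outcome increases with the relay rate.

```python
def nash_product(d2d_share, r_direct, r_relay, r_d2d):
    """Product of both sides' gains over disagreement (assumed zero for both)."""
    cu_gain = (1.0 - d2d_share) * r_relay - r_direct  # CU gain over its direct link
    d2d_gain = d2d_share * r_d2d                      # D2D pair has no channel otherwise
    return cu_gain * d2d_gain


def nbs_d2d_share(r_direct, r_relay):
    """Closed-form maximizer of nash_product under the assumed utilities.

    Setting the derivative of ((1 - t) * r_relay - r_direct) * t to zero gives
    t = (1 - r_direct / r_relay) / 2, which does not depend on r_d2d.
    A negative value means cooperation cannot benefit the CU.
    """
    if r_relay <= 0:
        raise ValueError("relay rate must be positive")
    return 0.5 * (1.0 - r_direct / r_relay)


def cu_utility_at_nbs(r_direct, r_relay):
    """CU gain at the bargaining outcome; simplifies to (r_relay - r_direct) / 2."""
    share = nbs_d2d_share(r_direct, r_relay)
    return (1.0 - share) * r_relay - r_direct


if __name__ == "__main__":
    # Hypothetical rates, chosen only for illustration.
    for relay_rate in (2.0, 4.0, 8.0):
        print(relay_rate, nbs_d2d_share(1.0, relay_rate), cu_utility_at_nbs(1.0, relay_rate))
```

Because the maximizer does not depend on the D2D pair's own rate in this sketch, a CU could compute its proposal from the direct rate and the relay rate alone, which is in line with the information assumption that the CU does not know the D2D link quality.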
II-B2 Matching Game Model
CU and D2D pair can only be paired when they agree to cooperate mutually. Therefore, it is reasonable to model the pairing problem between the set of CUs and the set of D2D pairs as a one-to-one matching game under two-sided preferences. CU prefers D2D pair to D2D pair (i.e., ), if . Similarly, D2D pair prefers CU to CU (i.e., ), if , which is equivalent to . Besides, if , D2D pair is acceptable to CU , which is denoted by .
Mathematically, a matching is a function , such that if and only if , and , , for . Note that implies that user is unmatched. We aim to seek a stable matching (SM), which is the major solution concept in matching games and is defined as follows [18].
Definition 1
Let be a matching. A CU-D2D pair is a blocking pair if , and . is individually rational if . Thus, is stable if it is individually rational and there is no blocking pair.
SM captures the preferences of both sides and CUs will only be matched with acceptable D2D pairs in SM. The existence of SM is guaranteed [18]. The challenge is that each CU cannot establish its preference due to the unavailability of . Thus, the traditional GS algorithm [18] cannot be used to seek SMs.
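Definition 1 can be made concrete with a small stability checker. The following is a sketch assuming complete utility tables, which the CUs do not have in the incomplete-information setting; it only illustrates what the learning algorithm is meant to reach. The dictionaries `cu_util` and `d2d_util` and the function names are hypothetical, and an unmatched agent is assumed to have utility zero.

```python
def find_blocking_pairs(matching, cu_util, d2d_util):
    """Return all (CU, D2D pair) couples that both prefer each other to their current partner.

    matching: dict mapping each CU to its D2D pair, or None if unmatched.
    cu_util[i][j]: utility of CU i when paired with D2D pair j.
    d2d_util[j][i]: utility of D2D pair j when paired with CU i.
    """
    holder = {j: i for i, j in matching.items() if j is not None}
    blocking = []
    for i in cu_util:
        j_now = matching.get(i)
        cu_now = cu_util[i][j_now] if j_now is not None else 0.0
        for j in d2d_util:
            if j == j_now:
                continue
            i_now = holder.get(j)
            d2d_now = d2d_util[j][i_now] if i_now is not None else 0.0
            if cu_util[i][j] > cu_now and d2d_util[j][i] > d2d_now:
                blocking.append((i, j))
    return blocking


def is_stable(matching, cu_util, d2d_util):
    """Stable = individually rational (every matched agent gains) and no blocking pair."""
    rational = all(
        j is None or (cu_util[i][j] > 0 and d2d_util[j][i] > 0)
        for i, j in matching.items()
    )
    return rational and not find_blocking_pairs(matching, cu_util, d2d_util)


if __name__ == "__main__":
    # Hypothetical 2x2 instance, for illustration only.
    cu_util = {1: {1: 1.0, 2: 2.0}, 2: {1: 1.5, 2: 3.0}}
    d2d_util = {1: {1: 0.2, 2: 0.4}, 2: {1: 0.5, 2: 0.3}}
    print(is_stable({1: 2, 2: 1}, cu_util, d2d_util))
    print(find_blocking_pairs({1: 1, 2: 2}, cu_util, d2d_util))
```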
III Learning for Matching with Incomplete Information
To overcome the difficulty, CU has to learn its preference from the interactions with D2D pairs. To this end, we convert the above matching game to an equivalent noncooperative game, which enables us to exploit the rich learning techniques designed for noncooperative games.
III-A Equivalent Noncooperative Game Model
We convert the matching game to a noncooperative game . Due to the priority of CUs on licensed spectrum, we let CUs be the players who propose to D2D pairs. The action of CU is to select a D2D pair , which means CU proposes to cooperate with D2D pair with time allocation . Each CU can refuse to cooperate with any D2D pair, which is denoted by action . Hence, the action set of CU is . Given an action profile , each D2D pair selects the CU offering the maximal time allocation among the CUs proposing to it and rejects the others. If more than one CU offers the maximal time allocation, the D2D pair will choose one of them based on a predefined rule. The CU chosen by D2D pair is denoted by , which can reflect the preference of D2D pair . (Footnote 1: Mathematically, the choice function of D2D pair can be represented as , where is the set of CUs proposing to D2D pair and is the bias assigned to CU . The bias is determined by the predefined rule, and satisfies that if , must hold.) Thus, the utility of CU is:
(8) 
where is the action profile of all the CUs except CU , and is an arbitrarily small number that denotes the negotiation cost. Assume is sufficiently small so that if . In the first case, makes sure that CUs only select acceptable D2D pairs at equilibria. The first two cases correspond to acceptance and rejection of the CU’s proposal, respectively. The third case means that the CU refuses to cooperate with any D2D pair.
Given an action profile , its associated matching is obtained as follows: for , and if and only if . Hence, the relationship between the pure Nash equilibrium (PNE) of and the SM can be stated as follows, which implies that an SM can be found via finding a PNE of .
Theorem 1
If action profile is a PNE, is an SM. Conversely, if is an SM, there is a PNE such that .
Proof:
On the one hand, let be a PNE. We will prove the stability of by contradiction. The individual rationality is easy to verify. Suppose there is a blocking pair in . Thus, CU can take action to improve its utility, which violates our assumption. Therefore, is stable.
On the other hand, let be an SM. We construct an action profile as follows: for CU , if , it takes action and action otherwise. We will prove that is a PNE by contradiction. Suppose is not a PNE, so there exists a CU that can improve its utility by deviating to action . If , is not individually rational. Besides, if , there is a blocking pair in . Thus, is not stable, which violates our assumption. Therefore, is a PNE.
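The game and the equilibrium condition used in Theorem 1 can be sketched as follows. This is an illustration under assumptions, not the exact form of (8): an accepted CU is assumed to receive its bargaining utility minus a small negotiation cost, a rejected CU receives minus that cost, and a CU that abstains receives zero. Ties are broken lexicographically by a hypothetical `priority` table, which is one way to satisfy the bias condition in Footnote 1. All function and variable names are hypothetical.

```python
def choose_cu(actions, offers, priority):
    """Each D2D pair keeps the proposer offering the largest time share (ties by priority)."""
    proposers = {}
    for cu, pair in actions.items():
        if pair is not None:
            proposers.setdefault(pair, []).append(cu)
    return {
        pair: max(cus, key=lambda cu: (offers[cu][pair], priority[cu]))
        for pair, cus in proposers.items()
    }


def utilities(actions, offers, cu_util, priority, cost=1e-6):
    """Assumed three-case utility: accepted, rejected, or abstaining (action None)."""
    chosen = choose_cu(actions, offers, priority)
    result = {}
    for cu, pair in actions.items():
        if pair is None:
            result[cu] = 0.0
        elif chosen[pair] == cu:
            result[cu] = cu_util[cu][pair] - cost
        else:
            result[cu] = -cost
    return result


def is_pne(actions, offers, cu_util, priority, cost=1e-6):
    """True if no CU can strictly gain by unilaterally changing its action."""
    base = utilities(actions, offers, cu_util, priority, cost)
    for cu in actions:
        for alt in list(offers[cu]) + [None]:
            if alt == actions[cu]:
                continue
            trial = dict(actions)
            trial[cu] = alt
            if utilities(trial, offers, cu_util, priority, cost)[cu] > base[cu]:
                return False
    return True


def associated_matching(actions, offers, priority):
    """Matching induced by an action profile: a CU is matched only if it was accepted."""
    chosen = choose_cu(actions, offers, priority)
    return {
        cu: (pair if pair is not None and chosen[pair] == cu else None)
        for cu, pair in actions.items()
    }


if __name__ == "__main__":
    # Hypothetical 2x2 instance: offers[cu][pair] is the share CU offers to the pair.
    offers = {1: {1: 1 / 3, 2: 0.4}, 2: {1: 0.375, 2: 0.25}}
    cu_util = {1: {1: 1.0, 2: 2.0}, 2: {1: 1.5, 2: 0.5}}
    priority = {1: 0, 2: 1}
    print(is_pne({1: 2, 2: 1}, offers, cu_util, priority))
    print(associated_matching({1: 2, 2: 2}, offers, priority))
```

In this instance the profile where CU 1 proposes to pair 2 and CU 2 proposes to pair 1 is an equilibrium, and its associated matching has no blocking pair, which is the correspondence the theorem describes.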
To develop the learning algorithm, we show that is a weakly acyclic under better-replies game (WABRG), which enables us to adopt the better-reply with inertia (BRI) learning algorithm [19] to find the PNE of . WABRG means that from any action profile, there is a better-reply path that terminates in a PNE in a finite number of steps. A better-reply path is a sequence of action profiles , where for each , there is a CU such that , and . In other words, in successive action profiles, only one CU changes its action to improve its utility.
Theorem 2
The proposed game is a WABRG.
Proof:
Suppose is not a PNE. We will construct a betterreply path that ends at a PNE to prove the theorem.
If there are any rejected CUs, we let them take action successively to obtain , such that the CUs unmatched in take action . Furthermore, according to Theorem 2.33 in [18], there exists a finite sequence of matchings , where , is stable, and there is a blocking pair for such that is obtained from by satisfying the blocking pair . Thus, we let CU select D2D pair to obtain . Similarly, we let rejected CUs take action successively to obtain . Note that the above process will not change the associated matching, i.e., . Repeating the above process, we can obtain an action profile such that and the unmatched CUs in take action . Note that is exactly the constructed action profile in the proof of Theorem 1, so it must be a PNE. Besides, it is easy to find that the sequence is a better-reply path. Hence, we can verify Theorem 2.
III-B Learning Algorithm
Because each CU’s utility is related to , each CU has to learn its utility from the interactions with D2D pairs. Furthermore, the action of CU can be redefined as a proposal , where the time allocation is unknown in the case of incomplete information. Therefore, CUs have to make proposals explicitly to help D2D pairs establish their preferences. Specifically, at each period , based on history information, CU makes proposal , where is calculated using the estimation of . Based on the proposal profile , D2D pair selects the CU offering the maximal time allocation, and the selected CU is denoted by . After cooperation with D2D pair , CU can update its estimation using observation . Besides, to facilitate the learning process, CU can also choose the time allocation to make sure it has enough chances to cooperate with every D2D pair to obtain information, where is an arbitrarily small number. Hence, with D2D pair selected, CU can choose for exploration.
Combining BRI and Q-learning, we propose a novel learning algorithm. The entire algorithm is depicted in Algorithm 1 for some CU . In step 3, CU randomly selects D2D pair for exploration with probability , where step 3a is used to announce time allocation to help other CUs estimate their utilities. In step 4, with probability , CU adopts BRI to learn a PNE of using estimated utility . In step 6, based on observations, CU updates its estimation of in a Q-learning manner. Then, CU uses this updated estimation to calculate the associated time allocation in step 7. In step 9, CU uses other CUs’ announced time allocation and to estimate the utility function.

With probability , choose the time allocation as .

With probability , choose the time allocation as .

With probability , select D2D pair .

With probability , select D2D pair according to the distribution, which is over the D2D pair selections that are better replies to CU’s full memory of length than with respect to .
(9) 
(10) 
(11) 
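The estimation part of the algorithm (steps 6 and 7) can be illustrated with a small sketch. Since (9)-(11) are not reproduced here, the update below is an assumption: it uses a sample-average rule with step size 1/n, one common Q-learning-style choice, and it computes the proposed time share from the estimate with the assumed bargaining formula share = (1 - direct rate / estimated relay rate) / 2. The class `RelayRateEstimator` and its members are hypothetical names.

```python
class RelayRateEstimator:
    """Per-CU running estimate of the relay rate offered by each D2D pair (assumed update rule)."""

    def __init__(self, r_direct, pairs):
        self.r_direct = r_direct                       # known locally by the CU
        self.estimate = {pair: 0.0 for pair in pairs}  # estimated relay rates
        self.count = {pair: 0 for pair in pairs}       # number of cooperations so far

    def update(self, pair, sample):
        """Move the estimate toward the new rate sample with step size 1/n."""
        self.count[pair] += 1
        step = 1.0 / self.count[pair]
        self.estimate[pair] += step * (sample - self.estimate[pair])
        return self.estimate[pair]

    def proposed_share(self, pair):
        """Time share offered to the D2D pair, computed from the current estimate.

        Returns 0.0 when there is no information yet or when the estimated relay
        rate does not exceed the direct rate (cooperation would not help the CU).
        """
        est = self.estimate[pair]
        if est <= 0.0:
            return 0.0
        return max(0.0, 0.5 * (1.0 - self.r_direct / est))


if __name__ == "__main__":
    estimator = RelayRateEstimator(r_direct=1.0, pairs=["pair_a", "pair_b"])
    for sample in (2.0, 4.0, 6.0):  # hypothetical observed relay-rate samples
        estimator.update("pair_a", sample)
    print(estimator.estimate["pair_a"], estimator.proposed_share("pair_a"))
```

With a 1/n step the estimate equals the sample mean, so it converges to the expected relay rate as long as each D2D pair is visited infinitely often, which is the role of the exploration step.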
Theorem 3
With , the sequence converges to the true value with probability 1. Moreover, the algorithm converges to an SM in probability. Specifically, , where .
Proof:
Let denote the probability of CU cooperating with D2D pair at period . Thus,
So CU will cooperate with D2D pair infinitely often with probability 1. Based on [19], converges to with probability 1. Since is a continuous function of , we conclude that converges to with probability 1.
On the one hand, if we replace the estimated utility with the true utility in EBRIQ, the D2D pair selection process is exactly the stochastic BRI (SBRI) in [19]. Since is a WABRG, using Lemma 5.17 in [19], we have that in this case. On the other hand, due to step 3a in EBRIQ, the event that CU announces its time allocation will happen infinitely often with probability 1. So converges to with probability 1. Moreover, considering the convergence of and , the estimated utility will be sufficiently close to the true utility after an almost surely finite time. Thus, EBRIQ will select D2D pairs with exactly the same probabilities as SBRI. Hence, based on Theorem 1, the convergence of is verified.
Remark 1
On the one hand, a larger memory length improves the robustness to the exploration behavior of other D2D pairs, which may speed up the convergence of the algorithm. On the other hand, Theorem 3 implies that the exploration probability decays more slowly with larger , which leads to a slower convergence rate.
III-C Implementation Issues
At the beginning of each frame, CUs send their proposals to the BS. Then, the BS broadcasts a proposal list containing all the CUs’ proposals on a dedicated channel. Meanwhile, all the D2D pairs listen to this channel. After receiving the CUs’ proposals, each D2D pair accepts one of them. Then, each matched D2D pair sends feedback to the BS using the channel occupied by its matched CU. Based on this feedback, the BS obtains the final matching and informs the CUs of the result. Thus, each CU knows its partner and can begin its data transmission.
Apart from the above handshaking procedure, no extra overhead is needed in the proposed algorithm. Thus, each iteration has low signaling overhead. Note that (9)-(11) can be calculated in constant time. Moreover, the estimated utility is only needed in BRI. Thus, according to BRI, the algorithm only needs to estimate . Therefore, the computational complexity of each iteration is .
IV Simulation Results
Simulation results are presented to evaluate the performance of the proposed algorithm. The channel gain is , where is the distance between receiver and transmitter, is the path loss exponent and is fast fading with exponential distribution. The cell radius is 400 m. CUs are randomly distributed in the area at least 300 m away from the BS. The distance between the DT and the BS is uniformly distributed between 150 and 250 m. The length of the D2D link is uniformly distributed between 10 and 60 m. Besides, we set dBm, mW, , , and , and the length of memory in BRI is set to 4.
First, we investigate the convergence behavior of the proposed learning algorithm. For illustration purposes, we consider a small network with 2 CUs and 2 D2D pairs. There is only one SM, where CU 1 is matched with D2D pair 2 and CU 2 is matched with D2D pair 1. The results are given in Fig. 2 and Fig. 3. The results are averaged over 1000 simulations with the same topology. Fig. 2 presents the convergence of the time allocation estimation, where the estimation is normalized by the true value . It is observed that the sequence converges to asymptotically, which is consistent with Theorem 3. The convergence of CUs’ behaviors is given in Fig. 3. It can be seen that CU 1 and CU 2 acquire their correct partners. This result implies that a PNE, and hence an SM, will be achieved eventually.
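The channel model described at the start of this section can be sketched as follows. This is a minimal illustration with assumed details: the fading term is taken to be a unit-mean exponential variable, the expected rate is taken to be the average of log2(1 + SNR), and all numeric values in the usage example except the D2D link lengths (10 and 60 m, from the setup above) are hypothetical placeholders, since the original parameter values are not reproduced here.

```python
import math
import random


def channel_gain(distance_m, path_loss_exponent, rng):
    """One realization of the gain: distance-based path loss times exponential fading."""
    fading = rng.expovariate(1.0)  # unit-mean exponential (assumed mean)
    return fading * distance_m ** (-path_loss_exponent)


def expected_rate(distance_m, path_loss_exponent, tx_power, noise_power, rng, samples=10000):
    """Monte Carlo estimate of E[log2(1 + P * g / N0)] over the fast fading."""
    total = 0.0
    for _ in range(samples):
        snr = tx_power * channel_gain(distance_m, path_loss_exponent, rng) / noise_power
        total += math.log2(1.0 + snr)
    return total / samples


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical exponent, power, and noise values in linear units.
    for d in (10.0, 60.0):
        print(d, expected_rate(d, 3.0, 1.0, 1e-6, rng, samples=2000))
```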
Next, we compare the proposed algorithm with other distributed algorithms in a larger network with 4 CUs and 5 D2D pairs. Fig. 4 shows the achieved system throughput over time for different algorithms. The results are averaged over 1000 simulations with different topologies. In the classical exploration-exploitation greedy algorithm, at each period, every CU selects the best D2D pair so far with probability , and some random D2D pair with probability . Besides, the time allocation estimations are updated similarly to our algorithm. We take in the simulation. In the random algorithm, each CU selects a D2D pair randomly and proposes as time allocation to guarantee its performance. We present the noncooperative scheme as well, where every CU takes action . It can be observed that our algorithm yields a significant gain over the other learning algorithms. Besides, the performance loss due to incomplete information is small. It is also worth mentioning that the cooperative scheme achieves much better performance than the noncooperative scheme, which verifies the efficiency of the cooperative scheme.
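For reference, the selection rule of the greedy baseline can be sketched in a few lines. This is the textbook epsilon-greedy rule and is assumed to match the baseline described above; the exploration probability `epsilon` is left as a parameter because its value is not reproduced here, and the names `epsilon_greedy_choice` and `estimated_utility` are hypothetical.

```python
import random


def epsilon_greedy_choice(estimated_utility, epsilon, rng):
    """Pick a uniformly random D2D pair with probability epsilon, else the best one so far.

    estimated_utility: dict mapping each D2D pair to the CU's current utility estimate.
    """
    pairs = list(estimated_utility)
    if rng.random() < epsilon:
        return rng.choice(pairs)                                    # explore
    return max(pairs, key=lambda pair: estimated_utility[pair])     # exploit


if __name__ == "__main__":
    rng = random.Random(1)
    estimates = {"pair_a": 0.4, "pair_b": 1.2, "pair_c": 0.9}  # hypothetical estimates
    print([epsilon_greedy_choice(estimates, 0.1, rng) for _ in range(10)])
```

Unlike the better-reply rule, this baseline ignores whether the chosen D2D pair would accept the proposal, so several CUs can keep colliding on the same attractive pair.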
V Conclusion
This paper considers a cooperative D2D communication system with incomplete information. We model the pairing problem between multiple CUs and multiple D2D pairs as a one-to-one matching game and propose a novel learning algorithm, which converges to a stable matching. The simulation results verify our analysis and show that the proposed algorithm outperforms the classical greedy algorithm. In future work, location information will be considered to divide CUs and D2D pairs into small groups to speed up the learning process. Moreover, learning algorithms with a faster convergence rate will also be investigated.
References
 [1] Y. Cao, T. Jiang, and C. Wang, “Cooperative device-to-device communications in cellular networks,” IEEE Wirel. Commun., vol. 22, no. 3, pp. 124–129, Jun. 2015.
 [2] Q. Wu et al., “Energy-efficient D2D overlaying communications with spectrum-power trading,” IEEE Trans. Wireless Commun., vol. 16, no. 7, pp. 4404–4419, Jul. 2017.
 [3] S. Shalmashi and S. B. Slimane, “Cooperative device-to-device communications in the downlink of cellular networks,” in Proc. IEEE Wireless Commun. Netw. Conf. (WCNC), Apr. 2014, pp. 2265–2270.
 [4] M. Seif et al., “Cooperative D2D communication in downlink cellular networks with energy harvesting capability,” in Proc. IEEE Int. Wireless Commun. Mobile Comput. Conf. (IWCMC), Jun. 2017, pp. 183–189.
 [5] L. Song et al., “Game-theoretic resource allocation methods for device-to-device communication,” IEEE Wirel. Commun., vol. 21, no. 3, pp. 136–144, Jun. 2014.
 [6] Z. Zhou et al., “Energy-efficient matching for resource allocation in D2D enabled cellular networks,” IEEE Transactions on Vehicular Technology, vol. 66, no. 6, pp. 5256–5268, June 2017.
 [7] S. Bayat et al., “Matching theory: Applications in wireless communications,” IEEE Signal Processing Magazine, vol. 33, no. 6, pp. 103–122, Nov 2016.
 [8] Z. Zhou et al., “Social big-data-based content dissemination in internet of vehicles,” IEEE Transactions on Industrial Informatics, vol. 14, no. 2, pp. 768–777, Feb 2018.
 [9] C. Ma et al., “Cooperative spectrum sharing in D2D-enabled cellular networks,” IEEE Trans. Commun., vol. 64, no. 10, pp. 4394–4408, Oct. 2016.
 [10] S. K. Jayaweera, M. Bkassiny, and K. A. Avery, “Asymmetric cooperative communications based spectrum leasing via auctions in cognitive radio networks,” IEEE Transactions on Wireless Communications, vol. 10, no. 8, pp. 2716–2724, August 2011.
 [11] Y. Yan, J. Huang, and J. Wang, “Dynamic bargaining for relay-based cooperative spectrum sharing,” IEEE Journal on Selected Areas in Communications, vol. 31, no. 8, pp. 1480–1493, August 2013.
 [12] X. Feng, H. Wang, and X. Wang, “A game approach for cooperative spectrum sharing in cognitive radio networks,” Wireless Communications and Mobile Computing, vol. 15, no. 3, pp. 538–551, 2015.
 [13] X. Feng et al., “Cooperative spectrum sharing in cognitive radio networks: A distributed matching approach,” IEEE Trans. Commun., vol. 62, no. 8, pp. 2651–2664, Aug. 2014.
 [14] L. Duan, L. Gao, and J. Huang, “Cooperative spectrum sharing: A contract-based approach,” IEEE Trans. Mob. Comput., vol. 13, no. 1, pp. 174–187, Jan. 2014.
 [15] M. Lopez-Martinez et al., “A superprocess with upper confidence bounds for cooperative spectrum sharing,” IEEE Trans. Mob. Comput., vol. 15, no. 12, pp. 2939–2953, Dec. 2016.
 [16] J. N. Laneman and G. W. Wornell, “Distributed space-time-coded protocols for exploiting cooperative diversity in wireless networks,” IEEE Trans. Inf. Theory, vol. 49, no. 10, pp. 2415–2425, Oct. 2003.
 [17] Z. Han et al., Game Theory in Wireless and Communication Networks: Theory, Models, and Applications. Cambridge University Press, 2011.
 [18] A. Roth and M. A. O. Sotomayor, Two-Sided Matching: A Study in Game-Theoretic Modeling and Analysis. Cambridge Univ. Press, 1992.
 [19] A. C. Chapman et al., “Convergent learning algorithms for unknown reward games,” SIAM Journal on Control and Optimization, vol. 51, no. 4, pp. 3154–3180, 2013.