# QoS-Aware Joint Policies in Cognitive Radio Networks

###### Abstract

One of the most challenging problems in Opportunistic Spectrum Access (OSA) is the design of channel sensing-based protocols in networks with multiple secondary users (SUs). Quality of Service (QoS) requirements for SUs have significant implications for this protocol design. In this paper, we propose a new method to find joint policies for SUs that not only guarantee QoS requirements but also maximize network throughput. We use the Decentralized Partially Observable Markov Decision Process (Dec-POMDP) framework to formulate the interactions between SUs, and employ a tractable Dec-POMDP approach to extract sub-optimum joint policies for large horizons. Among these policies, the joint policy that guarantees the QoS requirements is selected as the joint sensing strategy for the SUs. To show the efficiency of the proposed method, we consider two SUs trying to access a two-channel primary users' (PUs') network modeled by discrete Markov chains. Simulations demonstrate three interesting findings: 1) optimum joint policies for large horizons can be obtained using the proposed method; 2) there exists a joint policy for the assumed QoS constraints; and 3) our method outperforms other related works in terms of network throughput.

## I Introduction

With the advent of new applications in wireless data networks, bandwidth demand has increased intensively. The majority of the usable frequency spectrum for wireless networks has already been assigned to licensed users. In contrast to this apparent spectrum scarcity, extensive measurements indicate that a large portion of the licensed spectrum lies unused [1]. Thus, there is an intensive research effort to develop new techniques that utilize the unoccupied resources efficiently [2, 3, 4]. To achieve higher frequency reuse efficiency, SUs should dynamically access PUs' channels. This concept is known in the literature as Opportunistic Spectrum Access (OSA) [5]. In cognitive radio networks, channel occupation can be caused by two effects [6]: one is the disturbance due to PUs' activities, which can be modeled by a finite-state Markov chain [7]; the other is the impact of other SUs' transmissions [8].

Zhao et al. considered the Partially Observable Markov Decision Process (POMDP) framework for spectrum access [5]. They used the POMDP approach to find an optimum policy for the single-SU case; to generalize this solution to multiple SUs, the simple Carrier Sense Multiple Access/Collision Avoidance (CSMA/CA) protocol was employed [5]. In another related work [9], it is assumed that SUs obtain similar observations of the PUs' channels and therefore converge to the same opportunity assessment if they employ the single-user strategy. The results demonstrate that applying the optimal single-user strategy to the multi-user setting causes a significant degradation in network throughput [9]. To overcome this problem, it was shown that the network performance can be improved using a randomized policy selection. However, this policy does not guarantee QoS requirements among competing SUs; besides, the SUs' collision history is not considered in deriving the belief vector. In a further related work, Liu et al. considered two interfering SUs in a two-channel primary network [10]. Each SU observes different spectrum opportunities on each channel because of the assumed structure of the PUs' network. They proposed a myopic policy in which both SUs exchange their beliefs in each time slot; simulations illustrate that this myopic policy achieves near-optimal network throughput.

Considering time-invariant spectrum opportunities, several works have applied game theory [11, 12]. Recently, Fu and van der Schaar utilized stochastic games to present a solution for the dynamic interaction among competing SUs [6]. A Central Spectrum Moderator (CSM) is required in this model, whose task is to announce the state of all channels to the SUs in each time slot. However, a centralized moderator is not practical in some cases, and SUs cannot sense all channels within the limited time of a single slot. In [13], Pham et al. proposed a game-theoretic approach to QoS-aware channel selection for SUs that maximizes network throughput; they assumed that the SUs know the spectrum availability before selecting an appropriate channel.

The Partially Observable Stochastic Game (POSG) is a general framework for multi-agent decision processes [14]. In a POSG, the state changes according to a discrete Markov model and is only partially observable to the agents, and each agent tries to maximize its own reward function in a repeated game. Hansen et al. proposed a Dynamic Programming (DP) approach to solving POSGs [14]. As a special case of POSG, the Dec-POMDP framework, in which all agents try to maximize a common reward function, was investigated in [14] and [15] using the DP algorithm. Solving a Dec-POMDP with the DP algorithm becomes intractable as the horizon length of the decision process increases; for instance, the DP algorithm runs out of memory even for a small horizon length in a trivial example [16]. Seuken and Zilberstein developed Memory-Bounded Dynamic Programming (MBDP) to overcome the complexity of the existing DP algorithm [17].

In this paper, our goal is to design QoS-aware joint policies for the sensing decisions of SUs in order to maximize network throughput. It is assumed that the PUs' network is slotted and that all SUs have the same spectrum opportunities in each time slot (see Fig. 1: the SUs are in the transmission range of all PUs, and different SUs have the same observations of a given channel). At first sight, to maximize the network throughput, each SU should be assigned one channel on which to exploit spectrum opportunities, so that the SUs avoid collisions with each other. This scheme is called the partitioning strategy [10]. However, this strategy does not guarantee QoS requirements for the SUs; for instance, when the probability of the idle state is not the same for all channels, it does not satisfy fairness. We formulate the joint policies of multiple SUs as a Dec-POMDP. In the proposed method, the MBDP algorithm is employed to find optimum or sub-optimum joint policies for large horizon lengths, and a joint policy that guarantees the QoS requirements is selected as the sensing strategy for the SUs. To the best of our knowledge, the problem of synchronizing SUs in a multi-user setting in the presence of collisions has received little attention; the proposed method ensures transceiver SU synchronization. To demonstrate the efficiency of the proposed method, we consider two SUs trying to access a two-channel PUs' network modeled by discrete Markov chains, assuming that the SUs have perfect sensing capability. Simulations yield the following findings. First, the MBDP algorithm obtains the optimum solution for the considered scenario; this is interesting because the algorithm is an approximate solution for Dec-POMDPs and does not guarantee an optimum joint policy. Second, there exists a joint policy that satisfies the QoS constraints considered in the simulations. Finally, comparing with two other related works [9, 10], we find that the proposed method outperforms both in terms of network throughput.

This paper is organized as follows. In Section II, we give an overview of Dec-POMDPs and review the MBDP algorithm. In Section III, the system model and the Dec-POMDP formulation of the cognitive radio network are described. In Section IV, we propose our method to extract QoS-aware joint policies. In Section V, as an example of our Dec-POMDP formulation, we define a scenario with two SUs trying to access a two-channel PUs' network, and provide numerical simulations and results. Finally, the conclusion is presented.

## II Definitions and Preliminaries

In this section, we briefly review the finite-horizon Dec-POMDP framework and the MBDP solution proposed for handling the intractability problem of the DP algorithm. More details on the DP and MBDP algorithms can be found in [14, 15, 17].

### II-A Decentralized Partially Observable Markov Decision Process

A Dec-POMDP is a tuple $\langle I, S, b^0, \{A_i\}, \{O_i\}, P, R \rangle$, where:

- $I$ is a finite set of SUs indexed $1, \dots, n$.

- $S$ is a finite set of states.

- $b^0 \in \Delta(S)$ represents the initial state distribution.

- $A_i$ is a finite set of actions available to SU $i$, and $\vec{A} = \times_i A_i$ is the set of joint actions, where $\vec{a} = \langle a_1, \dots, a_n \rangle$ denotes a joint action.

- $O_i$ is a finite set of observations for SU $i$, and $\vec{O} = \times_i O_i$ is the set of joint observations, where $\vec{o} = \langle o_1, \dots, o_n \rangle$ denotes a joint observation.

- $P$ is the set of Markovian state transition and observation probabilities, where $P(s', \vec{o} \mid s, \vec{a})$ denotes the probability that choosing joint action $\vec{a}$ in state $s$ yields a transition to state $s'$ and the joint observation $\vec{o}$.

- $R: \vec{A} \times S \to \mathbb{R}$ is a reward function that depends on the joint action and the current state.

A Dec-POMDP may be defined over a finite or an infinite sequence of stages; in this paper, we focus on the finite-horizon case. At each stage, all SUs simultaneously select an action and then receive an observation. The reward for the SUs is computed based on their joint action and the state of the channels. The goal is to maximize the expected sum of rewards:

$$E\left[\sum_{t=1}^{T} R\!\left(\vec{a}^{\,t}, s^{t}\right)\right]. \tag{1}$$

### II-B Memory Bounded Dynamic Programming for Dec-POMDP

Solving a Dec-POMDP means finding a joint policy that maximizes the expected total reward. A policy for a single agent $i$ can be represented by a decision tree $q_i$, where nodes are labeled with actions and arcs are labeled with observations (a so-called policy tree). If $Q_i^t$ denotes a set of horizon-$t$ policy trees for agent $i$, a solution to a Dec-POMDP with horizon $T$ can then be seen as a vector of horizon-$T$ policy trees $\langle q_1^T, \dots, q_n^T \rangle$ (a so-called joint policy tree), where $q_i^T \in Q_i^T$. These policy trees can be constructed in two different ways: top-down or bottom-up.
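As a concrete illustration, a policy tree can be encoded as a small recursive structure. The following Python sketch is our own illustration; the action and observation names are hypothetical, not from the paper:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PolicyTree:
    """A horizon-t policy tree: a root action plus one subtree per observation."""
    action: str
    children: tuple = ()  # pairs (observation, PolicyTree); empty at depth 1

    def next(self, observation):
        # Follow the arc labeled with the received observation.
        for obs, subtree in self.children:
            if obs == observation:
                return subtree
        raise KeyError(observation)

# A depth-2 tree: sense channel 1 first; if it is busy, sense channel 2 next.
leaf_a = PolicyTree("sense_ch2")
leaf_b = PolicyTree("sense_ch1")
tree = PolicyTree("sense_ch1", children=(("busy", leaf_a), ("idle", leaf_b)))
```

After receiving the observation `busy`, the agent follows the corresponding arc, so `tree.next("busy").action` yields the next sensing action.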

The first algorithm for solving Dec-POMDPs used a bottom-up approach [14]. Policy trees are constructed incrementally, which means that the algorithm starts at the frontiers and works its way up to the roots using the DP algorithm. The DP algorithm updates in two steps. In the first step, the DP operator is given a set $Q_i^t$ of depth-$t$ policy trees. A set of depth-$(t+1)$ policy trees, $Q_i^{t+1}$, is generated by considering every depth-$(t+1)$ policy tree that makes a transition, after an action and observation, to the root node of a depth-$t$ policy tree in $Q_i^t$. This step is called exhaustive backup [14]. In exhaustive backup, $|A_i|\,|Q_i^t|^{|O_i|}$ depth-$(t+1)$ policy trees are created, so the total number of constructed trees increases exponentially with each step. To alleviate this problem, unnecessary trees are pruned in a second step [14]. However, even this modified DP algorithm runs out of memory for simple problems [17].
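The exhaustive backup step can be sketched in Python as follows; the enumeration below illustrates the exponential growth (for each agent, one new tree per combination of root action and observation-to-subtree mapping), with purely illustrative action and observation names:

```python
from itertools import product

def exhaustive_backup(actions, observations, trees):
    """Generate every depth-(t+1) policy tree from a set of depth-t trees.

    A new tree pairs a root action with one depth-t subtree per observation,
    so len(actions) * len(trees) ** len(observations) trees are produced.
    """
    backed_up = []
    for action in actions:
        for subtrees in product(trees, repeat=len(observations)):
            children = tuple(zip(observations, subtrees))
            backed_up.append((action, children))
    return backed_up

# Two actions, two observations, two depth-1 trees -> 2 * 2**2 = 8 trees.
depth1 = [("sense_ch1", ()), ("sense_ch2", ())]
depth2 = exhaustive_backup(["sense_ch1", "sense_ch2"], ["busy", "idle"], depth1)
```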

**Table I:** The MBDP Algorithm

```
begin
 1   maxTrees <- max number of trees before backup
 2   T <- horizon of the Dec-POMDP
 3   pre-compute relevant beliefs for each horizon t
 4   Q_1^1, Q_2^1 <- initialize 1-step policy trees for each SU
 5   for t = 1 to T - 1 do
 6       Q_1^{t+1} <- Backup(Q_1^t),  Q_2^{t+1} <- Backup(Q_2^t)
 7       Sel_1, Sel_2 <- empty sets
 8       for k = 1 to maxTrees do
 9           choose relevant belief b for horizon T - t
10           for each q_1 in Q_1^{t+1}, q_2 in Q_2^{t+1} do
11               evaluate each pair (q_1, q_2) with respect to b
12           end
13           add the best policy trees to Sel_1 and Sel_2
14           delete these policy trees from Q_1^{t+1} and Q_2^{t+1}
15       end
16       Q_1^{t+1} <- Sel_1,  Q_2^{t+1} <- Sel_2
17   end
18   select the best joint policy tree (q_1^T, q_2^T) from Q_1^T x Q_2^T
19   return (q_1^T, q_2^T)
end
```
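Line 3 of the algorithm pre-computes relevant beliefs top-down. In our setting the channel state is not affected by the SUs' actions, so these beliefs follow from repeatedly applying the channel's Markov transition matrix to the initial belief. A minimal sketch, with hypothetical transition probabilities:

```python
def propagate_belief(belief, transition):
    """One-step belief update b' = b P for an action-independent Markov chain."""
    n = len(belief)
    return [sum(belief[x] * transition[x][y] for x in range(n)) for y in range(n)]

def relevant_beliefs(initial_belief, transition, horizon):
    """Top-down heuristic: the most probable channel belief for each stage."""
    beliefs = [initial_belief]
    for _ in range(horizon - 1):
        beliefs.append(propagate_belief(beliefs[-1], transition))
    return beliefs

# Hypothetical two-state channel (busy=0, idle=1): P(0->1)=0.3, P(1->0)=0.2.
P = [[0.7, 0.3], [0.2, 0.8]]
bs = relevant_beliefs([1.0, 0.0], P, 3)  # start known busy
```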

One drawback of the pruning process is that it cannot predict which beliefs about the state and about the other SUs' policies will eventually be useful before reaching the roots of the policy trees. The MBDP algorithm combines the bottom-up and top-down approaches: using top-down heuristics to identify relevant belief states, the DP algorithm compares the bottom-up policy trees and selects the best joint policy. Several top-down heuristic policies are proposed in [17]. In our setting, the state of a channel is not affected by the actions of the SUs; therefore, we can easily compute the most probable beliefs using the initial belief and the Markov models of the channels. The MBDP algorithm used in this paper is shown in Table I. The algorithm is written for two SUs and can be rewritten for any number of SUs. The parameter maxTrees denotes the number of policy trees that are used in the exhaustive backup for constructing the next stage; in other words, the size of the set $Q_i^t$ is maxTrees. To evaluate each pair of policy trees with respect to a belief vector $b$, the concept of the value vector in POMDPs is employed [18]. The expected sum of rewards with respect to the belief $b$, for a joint policy tree $q^t$, is computed by the dot product of the value vector and the assumed belief:

$$V(q^t, b) = b \cdot V(q^t), \tag{2}$$

where $V(q^t)$ is an $|S|$-dimensional vector whose entries are $V(q^t, s)$ for $s \in S$. For a depth-$t$ joint policy tree $q^t$ with root joint action $\vec{a}$, the value vector is computed recursively:

$$V(q^t, s) = R(\vec{a}, s) + \sum_{s' \in S} \sum_{\vec{o} \in \vec{O}} P(s', \vec{o} \mid s, \vec{a}) \, V\!\left(q^{t-1}_{\vec{o}}, s'\right), \tag{3}$$

where $q^{t-1}_{\vec{o}}$ denotes the joint policy of the subtrees selected by the SUs after the observation vector $\vec{o}$. In [17], it is proved that the MBDP algorithm has a linear time complexity with respect to the horizon length.
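Equations (2) and (3) amount to a recursive evaluation of a policy tree against a belief, sketched below. The `model` interface and the toy persistent-channel example are our own simplifications of the joint model, for illustration only:

```python
def value_vector(tree, states, model):
    """V(q, s) for every state s of a policy tree, in the spirit of Eq. (3).

    `tree` is (action, {observation: subtree}); `model(s, a)` returns
    (reward, [(prob, next_state, observation), ...]).
    """
    action, children = tree
    values = {}
    for s in states:
        reward, transitions = model(s, action)
        v = reward
        if children:
            for prob, s_next, obs in transitions:
                sub = value_vector(children[obs], states, model)
                v += prob * sub[s_next]
        values[s] = v
    return values

def evaluate(tree, belief, states, model):
    """Expected reward under belief b: the dot product of Eq. (2)."""
    vv = value_vector(tree, states, model)
    return sum(b * vv[s] for b, s in zip(belief, states))

# Toy model: a persistent channel whose state is revealed by sensing;
# sensing an idle channel is worth 1, sensing a busy one is worth 0.
states = ["busy", "idle"]
def model(s, a):
    return (1.0 if s == "idle" else 0.0), [(1.0, s, s)]

leaf = ("sense", {})
tree = ("sense", {"busy": leaf, "idle": leaf})
```

With a uniform belief, the depth-2 tree collects the idle-channel reward twice along the idle branch and nothing along the busy branch, averaging the two.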

## III Problem Definition

### III-A System Model

Our model consists of: 1) a spectrum of $N$ channels assigned to the PUs; 2) a set of PUs and $n$ SUs. It is assumed that all PUs and SUs communicate in a synchronous slot structure [5] and that all SUs have the same spectrum opportunities in each time slot (see Fig. 1). Each SU uses the beginning of each time slot to sense one of the channels. Based on the obtained observation, an SU can choose either to transmit on one of the channels or not to transmit at all. At the end of the time slot, each SU receives an ACK from its corresponding receiver indicating whether its transmission was successful (see Fig. 2).

### III-B Dec-POMDP Formulation

For each SU $i$, we have the following set of actions:

$$A_i = \{a_s^1, \dots, a_s^N\} \cup \{a_t^1, \dots, a_t^N\} \cup \{a_w\}, \tag{4}$$

where $a_s^j$ represents the action of sensing channel $j$ by SU $i$. The $a_s^j$s are used only in the sensing level, as shown in Fig. 3. $a_t^j$ denotes the action of accessing channel $j$ by SU $i$; it should be noted that the $a_t^j$s are used only in the transmission level, as depicted in Fig. 3. $a_w$ is the action in which the SU does not send its data and stays silent during the current time slot. Each channel is assumed to have a two-state Markov model and to be independent of the other channels. Thus, we have:

$$S = S^1 \times S^2 \times \dots \times S^N, \qquad S^j = \{0\ (\text{busy}),\ 1\ (\text{idle})\}, \tag{5}$$

where $S^j$ denotes the set of states for channel $j$. Channel $j$ is available to the SUs if $s^j = 1$. The state transition probabilities of the Markov model of channel $j$ are represented by $P^j_{xy}$, where $x, y \in \{0, 1\}$.
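For such a two-state chain, the stationary distribution, which is used later as the default initial belief, follows directly from the two transition probabilities. A minimal sketch with hypothetical values:

```python
def steady_state(p_busy_to_idle, p_idle_to_busy):
    """Stationary distribution of a two-state (busy/idle) Markov channel."""
    pi_idle = p_busy_to_idle / (p_busy_to_idle + p_idle_to_busy)
    return {"busy": 1.0 - pi_idle, "idle": pi_idle}

# Hypothetical channel: leaves busy w.p. 0.2 and leaves idle w.p. 0.3,
# so it is idle 0.2 / (0.2 + 0.3) = 40% of the time in steady state.
pi = steady_state(0.2, 0.3)
```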

For SU $i$, the following set of observations is considered:

$$O_i = \left\{\, o_{\mathrm{b}}^j,\ o_{\mathrm{i}}^j,\ o_{\mathrm{c}}^j,\ o_{\mathrm{n}}^j \mid j = 1, \dots, N \,\right\}, \tag{6}$$

where $o_{\mathrm{b}}^j$ shows that channel $j$ is occupied by PUs and $o_{\mathrm{i}}^j$ indicates that channel $j$ is idle after sensing. In addition, $o_{\mathrm{c}}^j$ shows that a collision occurred after transmission on channel $j$, and $o_{\mathrm{n}}^j$ denotes that no collision occurred after action $a_t^j$. $o_{\mathrm{c}}^j$ and $o_{\mathrm{n}}^j$ are obtained through the ACK signal sent by the receiver. $o_{\mathrm{b}}^j$ and $o_{\mathrm{i}}^j$ are observations in the sensing level, while $o_{\mathrm{c}}^j$ and $o_{\mathrm{n}}^j$ are observations in the transmission level.

We define $R_i^t$, the reward function for SU $i$ in time slot $t$, as follows:

$$R_i^t = \begin{cases} 1, & \text{if SU } i \text{ transmits on an idle channel and no other SU transmits on that channel,} \\ 0, & \text{otherwise.} \end{cases} \tag{7}$$

It is clear that the reward function depends on the joint action of the SUs and the states of the channels. The reward function for the Dec-POMDP formulation in time slot $t$ is obtained as the sum of all SUs' rewards:

$$R^t = \sum_{i=1}^{n} R_i^t. \tag{8}$$
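A per-slot evaluation of this reward structure can be sketched as follows; the unit reward for a solo transmission on an idle channel is our reading of Eq. (7):

```python
def joint_reward(access_choices, channel_idle):
    """Sum of per-SU rewards: one unit for a solo transmission on an idle channel.

    access_choices[i] is the channel SU i transmits on (None = stay silent);
    channel_idle maps a channel index to True when no PU occupies it.
    """
    total = 0
    for i, ch in enumerate(access_choices):
        if ch is None:
            continue
        # The transmission succeeds only if no other SU picked the same channel.
        solo = all(j == i or other != ch for j, other in enumerate(access_choices))
        if channel_idle[ch] and solo:
            total += 1
    return total

# Both SUs transmit on idle channel 0: they collide and nobody is rewarded.
collision = joint_reward([0, 0], {0: True, 1: True})
# The SUs split across the two idle channels: both succeed.
split = joint_reward([0, 1], {0: True, 1: True})
```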

## IV QoS-Aware Joint Policies

In this section, we propose a method to find QoS-aware joint policies. The proposed method consists of two phases: 1) First phase: a set of optimum or sub-optimum joint policies is obtained using the MBDP algorithm. 2) Second phase: among the joint policies found in the first phase, the one that satisfies the QoS constraints is selected.

### IV-A First Phase

A set of optimum or sub-optimum joint policies is found in this phase by the MBDP algorithm. The MBDP algorithm has three inputs (see Fig. 4): 1) the time horizon ($T$): the duration of the decision process; 2) the initial belief state ($b^0$): the initial probability distribution of the PUs' channel occupation; 3) the set of transition probabilities of the channels: the transition probabilities of the Markov chains ($P^j_{xy}$). The MBDP algorithm is executed for a specified number of trials.

### IV-B Second Phase

In this phase, the joint policy that meets the QoS constraints is selected. We define the QoS constraints in terms of the expected throughput vector of the SUs:

$$\mathbf{C}(\pi) = (C_1, C_2, \dots, C_n), \tag{9}$$

where $C_i$ is:

$$C_i = E\left[\sum_{t=1}^{T} R_i^t \,\middle|\, \pi, b^0\right], \tag{10}$$

and $\boldsymbol{\rho} = (\rho_1, \dots, \rho_n)$ is a set of QoS parameters.

**Definition.** We say that a joint policy $\pi$ satisfies the QoS constraints if the $C_i$s meet the following inequalities for an assumed parameter $\epsilon$:

$$|C_i - \rho_i C_{\max}| \le \epsilon, \quad i = 1, \dots, n, \tag{11}$$

where $C_{\max}$ is the maximum achievable throughput of the cognitive radio network.
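The check of Eq. (11) is straightforward; the function below is a sketch, and the sample numbers (resembling a Table II entry, with an assumed $\rho = (0.5, 0.5)$ and $C_{\max} = 4$) are illustrative:

```python
def satisfies_qos(throughputs, rho, c_max, eps):
    """Return True when |C_i - rho_i * C_max| <= eps holds for every SU."""
    return all(abs(c - r * c_max) <= eps for c, r in zip(throughputs, rho))

# A fair split of a maximum throughput of 4 passes; a lopsided one fails.
fair = satisfies_qos([2.0, 2.0], rho=[0.5, 0.5], c_max=4.0, eps=0.25)
unfair = satisfies_qos([3.0, 1.0], rho=[0.5, 0.5], c_max=4.0, eps=0.25)
```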

### IV-C Notes on Implementation Issues

For different QoS constraints, optimum or sub-optimum joint policy trees are computed off-line and saved in the SUs' memory. Each joint policy tree has an identity number. Before the SUs start sending, the initial belief is set to the steady-state distribution of the channels. Moreover, the SUs can determine the initial belief precisely if they are allowed to sense all channels in the first slot. For the predefined horizon length ($T$) and the assumed initial belief, the SUs select from their memory an appropriate joint policy tree that guarantees the QoS constraints. Afterwards, the SUs send the identity number of the selected joint policy to their corresponding receivers in the first slot. Each receiver tracks its transmitter's sensing action from the decision tree and observes the channel that the transmitter is currently sensing. Because of the perfect sensing capability and the identical spectrum opportunities for all SUs, the transceiver SUs are synchronized by knowing the selected joint policy tree.

As the horizon length is incremented by one, the number of leaves in the decision tree increases exponentially. Therefore, saving trees requires a large amount of memory. We assume that the SUs transmit for a small number of slots, in which case saving the decision trees is efficient. This assumption is reasonable when the statistical behavior of the channels (e.g., the transition probabilities of the Markov models) changes frequently and using a joint policy for a very large horizon is not rational.
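The growth is easy to quantify: a full depth-$T$ decision tree with $|O|$ observation outcomes per node has $1 + |O| + \dots + |O|^{T-1}$ nodes. A quick sketch, using the three sensing outcomes of the simulation example:

```python
def tree_size(num_observations, horizon):
    """Number of nodes in a full depth-T policy tree: sum of |O|^t for t < T."""
    return sum(num_observations ** t for t in range(horizon))

# Three observation outcomes per node, horizon 5: 1 + 3 + 9 + 27 + 81 nodes.
nodes = tree_size(3, 5)
```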

## V Simulation

To evaluate the performance of the proposed method, we consider two SUs in a two-channel PUs' network. Each SU senses one of the two channels in each time slot. The transition probabilities of the channels are chosen such that the two channels do not have the same steady-state probabilities; one of the channels is busy most of the time. It is also assumed that the states of both channels are known in the first slot, which determines the initial belief for the MBDP algorithm. Furthermore, an SU does not send and waits for the next time slot if the observed channel is occupied by PUs; if the channel is idle, the SU sends. Besides, both SUs have perfect sensing capability. Under these assumptions, we obtain a decision tree; an example of such a tree is shown in Fig. 5(a), where each node is labeled with the action of sensing one of the channels, and the three outgoing arcs correspond to the three possible outcomes: the observed channel is free and the transmission is successful; the observed channel is free but a collision with the other SU happens during transmission; or the observed channel is busy and the SU waits for the next time slot. An example of a joint policy tree for horizon 5 is also illustrated in Fig. 5. The observations on the arcs at the roots of the subtrees are the same as at the original root and are omitted for simplicity of illustration.

The proposed method is compared with two related works on the multi-SU scheme [9, 10]. A multiuser heuristic (MH) policy for sensing channels is proposed in [9]: for an SU with belief vector $\boldsymbol{\omega} = (\omega_1, \dots, \omega_N)$ on the availability of the channels, the probability of choosing channel $j$ in a slot is given by:

$$\Pr\{a = a_s^j\} = \frac{\omega_j}{\sum_{k=1}^{N} \omega_k}. \tag{12}$$
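One plausible implementation of such a randomized selection draws a channel with probability proportional to its availability belief; the function below is our sketch, not code from [9]:

```python
import random

def mh_choose_channel(omega, rng=random):
    """Sample a channel index with probability omega_j / sum(omega)."""
    total = sum(omega)
    r = rng.random() * total  # uniform in [0, total)
    acc = 0.0
    for ch, w in enumerate(omega):
        acc += w
        if r < acc:
            return ch
    return len(omega) - 1  # guard against floating-point round-off

# With all belief mass on one channel, that channel is always chosen.
choice = mh_choose_channel([1.0, 0.0])
```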

In the other work [10], it is assumed that the SUs have different spectrum opportunities. A cooperation strategy that achieves near-optimal throughput is given in [10]. In this strategy, the SUs exchange their belief vectors and use this information to take an action. If we rewrite the cooperative approach for the case where the SUs have the same spectrum opportunities, the sensing strategy is as follows:

$$a_1 = a_s^{j_1},\quad j_1 = \arg\max_{j} \omega_j; \qquad a_2 = a_s^{j_2},\quad j_2 = \arg\max_{j \ne j_1} \omega_j. \tag{13}$$
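A sketch of this best/second-best assignment (our reading of the rewritten rule, not code from [10]):

```python
def cooperative_assignment(omega):
    """Assign SU 1 the most promising channel and SU 2 the runner-up."""
    order = sorted(range(len(omega)), key=lambda ch: omega[ch], reverse=True)
    return order[0], order[1]

# With beliefs (0.3, 0.8), SU 1 senses channel 1 and SU 2 senses channel 0.
a1, a2 = cooperative_assignment([0.3, 0.8])
```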

To find optimum or sub-optimum policies, we run the MBDP algorithm 30 times for each horizon length ($T$). The parameter maxTrees is set to three. The resulting network throughput is shown in Fig. 6; the results are normalized to the maximum achievable throughput of the network. This figure demonstrates that the proposed method outperforms [9, 10]. We also consider a fixed QoS constraint, with the parameter $\epsilon$ set to 0.25 in this simulation. The results for this setting are shown in Fig. 7 for both the proposed method and the cooperative approach. In the cooperative approach, the second SU's throughput starves; in the proposed method, however, the SUs achieve approximately fair throughputs. The expected throughputs of the two SUs for different QoS constraints are presented in Table II, with $\epsilon$ set to 0.25. The pair in each entry denotes the throughputs of SU 1 and SU 2, respectively. The numerical results show that the proposed method guarantees the QoS constraints.

**Table II:** Expected throughputs of the two SUs under three QoS settings.

| Horizon length (T) | QoS setting 1 | QoS setting 2 | QoS setting 3 |
|---|---|---|---|
| 4 | 2, 2 | 2.55, 1.77 | 2.66, 1.34 |
| 5 | 2.6150, 2.385 | 3, 2 | 3.5, 1.5 |
| 6 | 3.03, 2.97 | 3.7275, 2.39 | 4, 2 |
| 7 | 3.645, 3.355 | 4.32, 2.68 | 4.56, 2.44 |
| 8 | 4.02, 3.98 | 4.9, 3.1 | 5.44, 2.65 |
| 9 | 4.6, 4.4 | 5.5, 3.5 | 5.86, 3.14 |
| 10 | 5.03, 4.97 | 5.92, 4.08 | 6.62, 3.38 |

## VI Conclusion

We proposed a new method to guarantee QoS requirements for sensing-based protocols in multi-SU networks. By employing the MBDP algorithm, a set of optimum or sub-optimum joint policies is found, and the one that satisfies the QoS constraints is selected as the joint sensing strategy. Besides, the proposed method ensures transceiver SU synchronization. For two SUs in a two-channel PUs' network, simulations demonstrated that the proposed method achieves the maximum throughput. Moreover, the results show that the proposed method guarantees different QoS constraints.

## References

- [1] M. McHenry, “Spectrum White Space Measurements,” June 2003, Presented to New America Foundation Broadband Forum.
- [2] Q. Zhao and B. M. Sadler, “A Survey of Dynamic Spectrum Access: Signal Processing, Networking, and Regulatory policy,” IEEE Signal Processing Mag., vol. 24, no. 3, pp. 79-89, May 2007.
- [3] R. V. Prasad, P. Pawelczak, J. Hoffmeyer, and S. Berger, “Cognitive Functionality in Next Generation Wireless Networks: Standardization Efforts,” IEEE Comm. Mag., vol. 46, no. 4, pp. 72-78, Apr. 2008.
- [4] S. Pollin, E. Hossain , and V. K. Bhargava, “Coexistence and Dynamic Sharing in Cognitive Radio Networks,” in Cognitive Wireless Communication Networks, Eds. New York, NY: Springer, 2007.
- [5] Q. Zhao, L. Tong, A. Swami, and Y. Chen, “Decentralized Cognitive MAC for Opportunistic Spectrum Access in Ad Hoc Networks: A POMDP Framework,” IEEE JSAC, vol. 25, no. 3, pp. 589-600, April 2007.
- [6] F. Fu and M. van der Schaar, “Learning to Compete for Resources in Wireless Stochastic Games,” IEEE Trans. on Vehicular Technology, vol. 58, no. 4, pp. 1904-1919, May 2009.
- [7] Q. Zhang and S. A. Kassam , “Finite-state Markov model for Rayleigh fading channels,” IEEE Trans. Comm., vol. 47, no. 11, pp. 1688-1692, Nov. 1999.
- [8] L. Lai, H. El Gamal, H. Jiang, and H. V. Poor, “Cognitive Medium Access: Exploration, Exploitation and Competition,” IEEE/ACM Trans. on Networking, 2007.
- [9] K. Liu, Q. Zhao and Y. Chen, “Distributed sensing and access in cognitive radio networks,” Proc. of 10th International Symposium on Spread Spectrum Techniques and Applications (ISSSTA), 2008.
- [10] H. Liu, B. Krishnamachari, and Q. Zhao, “Cooperation and learning in multiuser opportunistic spectrum access,” IEEE International Conference on Communications Workshops (ICC Workshops ’08), 2008.
- [11] Z. Ji and K. J. R. Liu, “Dynamic spectrum sharing: A game theoretical overview,” IEEE Communications Magazine, vol. 45, pp. 88-94, May 2007.
- [12] J. E. Suris, L. A. DaSilva, Z. Han, and A. B. MacKenzie, “Cooperative game theory for distributed spectrum sharing,” in Proceedings of IEEE International Conference Communications (ICC), June 2007.
- [13] H.M. Pham, J. Xiang, Y. Zhang and T. Skeie, “Qos-aware channel selection in cognitive radio networks: A game-theoretic approach,” IEEE Global Telecommunications Conference, 2008.
- [14] E. Hansen, D. S. Bernstein, and S. Zilberstein, “Dynamic Programming for Partially Observable Stochastic Games,” 19th National Conference on Artificial Intelligence, July 2004.
- [15] D. Szer and F. Charpillet, “Point-based Dynamic Programming for Dec-POMDP,” AAAI-06, 2006.
- [16] S. Salehkaleybar, A. Majd and M. Pakravan, “A New Framework for Cognitive Medium Access Control: POSG Approach,” arXiv:1003.2813v1.
- [17] S. Seuken and S. Zilberstein, “Improved memory-bounded dynamic programming for decentralized POMDPs,” Proceedings of the twenty-third conference on uncertainty in artificial intelligence, pp. 344-351, 2007.
- [18] R. Smallwood and E. Sondik, “The Optimal Control of Partially Observable Markov Processes over a Finite Horizon,” Operations Research, vol. 21, pp. 1071-1088, 1973.