JI Hong, ZHANG Tianxiang, ZHANG Kai, WANG Wanyuan, WU Weiwei
(1. School of Computer Science and Engineering, Southeast University, Nanjing 211189, China; 2. ZTE Corporation, Shenzhen 518057, China)
Abstract: With the rapid development of wireless network technologies and the growing demand for a high quality of service (QoS), the effective management of network resources has attracted a lot of attention. For example, in a practical scenario, when a network shock occurs, a batch of affected flows needs to be rerouted to bring the entire network deployment back to the optimal state, and the rerouting of such a batch of flows should take as little time as possible. Specifically, we reduce the time consumed for routing by slicing, but the routing success rate after slicing is reduced compared with the unsliced case. In this context, we propose a two-stage dynamic network resource allocation framework that first decides the slice to which each flow is assigned, and then coordinates resources among slices to ensure a routing success rate comparable to the unsliced case, while taking advantage of the time efficiency gains from slicing.
Keywords: network slicing; dynamic resource allocation; reinforcement learning
With the fast growth of network technologies, such as 5G and data center networks, and the ever-increasing demand for a high quality of service (QoS)[1], efficient use of the existing network infrastructure becomes a challenging task. Network slicing provides an effective method that introduces flexibility and faster resource deployment into network resource management[2]. A slice is a horizontal subset of the entire network that is set up to satisfy resource requests, for example, bandwidth for flows. A flow is a certain amount of bandwidth requirement on the links it traverses[3].
Many studies have been devoted to improving QoS and network resource utilization through traffic scheduling, for which queue management and scheduling are widely used. Generally, the flows in a queue may have different priorities; for example, the preceding flows may have a higher priority than the other flows[4]. In floodlight-based software defined networks (SDN)[5], queue-based scheduling technology is widely used to implement QoS support. In the above research, the traffic to be transmitted is divided into QoS flows and optimal flows, and assigned to queues according to their priorities[6].
However, the queuing nature of the above resource allocation or scheduling strategies has shortcomings. It is necessary to calculate a feasible route for each flow/request in real time by checking the remaining available bandwidth on the network links. However, it is time-consuming to calculate feasible routes for the flows in the queue, which need to be processed in sequence. Inspired by network slicing, which has been shown to accelerate routing[7], we introduce a network slicing model to parallelize and speed up routing calculations. In the proposed network slicing model, different slices (e.g., horizontal slices) share the same topology, but have different bandwidth resources on each link. Flows are allocated to different slices and routed independently in each slice. Finally, the routing process can be calculated in parallel, thus speeding up the routing process.
The main challenge of network slicing is to determine to which slice a request should be deployed. The objective of resource allocation is to ensure a high deployment success rate, although some flows may fail to be deployed due to scarce resources. Moreover, using a checking mechanism to test whether a route is feasible or not will degrade the efficiency of parallel routing. Against this background, the first contribution of this paper is a two-stage resource allocation algorithm, which consists of flow allocation to slices and resource adjustment among slices. In the first stage of flow allocation, given the currently arriving flows, a reinforcement learning (RL) based mechanism is proposed to deploy these flows. In the second stage of resource adjustment, a dynamic and parallel network resource reallocation among slices is proposed. Another contribution is that we conduct an experimental evaluation in a hierarchical ring 5G network similar to a real large-scale network. Simulation results show that this method can reduce the routing calculation time while maintaining the deployment success rate when dealing with large-scale networks.
The rest of this paper is organized as follows. Section 2 reviews the related work on resource allocation in network technologies. Section 3 describes the problem and formulates the model. We propose the resource allocation algorithm in Section 4. Section 5 presents the results of the experiments. Finally, we conclude this paper in Section 6.
In recent years, network slicing and the effective use of network resources have received a lot of attention. SDN is the candidate technology for implementing network slicing on common network infrastructure to deploy a number of services with different requirements. Deployed slices guarantee isolation in terms of connectivity and performance[8].
Network resource virtualization (slicing) has been regarded as one of the main development trends of 5G cellular networks and data center networks, which can improve QoS, quality of experience (QoE), and network resource utilization[9]. Through network slicing, the whole network can easily accommodate the different needs of divergent service types, applications, and services in support of vertical industries[10]. A virtualized infrastructure is provided in Ref. [11], which is a network infrastructure in which some or all of the elements, such as servers, links, switches and topology, are virtualized by the data center network. A cloud computing platform consists of one or multiple virtualized infrastructures, which rely on virtualization technology to divide available resources and share them among users.
Given the limitation of network resources, a resource allocation plan can be implemented to improve communication reliability and network utilization. However, slicing may cause severe network performance degradation. ZHANG et al.[12] use a supply-demand model to quantify slicing interference. In order to maximize the total throughput of accepted requests, an adaptive interference awareness (AIA) heuristic method is proposed to automatically place network slices customized for 5G services. CHEN et al.[13] also develop a dynamic network slice resource scheduling and management method based on SDN to meet service requirements with time-varying characteristics. A resource combination and allocation mechanism is proposed to maximize the total rate of the virtualized network based on its channel state information[14]. An algorithm based on iterative slice provision has been proposed, which adjusts the minimum slice requirement based on channel state information, but does not consider the global resource utilization of the network or slice priority. A centralized joint power and resource allocation scheme for priority multi-layer cellular networks has been proposed[15], which works by allowing users with higher priority to maximize the number of served users. Priority is only considered at the user level, and different priorities between slices are ignored.
In this paper, we formulate the network slicing model, which enables flow requests to be dispatched among different slices, thereby speeding up routing computations. Moreover, the proposed algorithm is validated to have the advantage of maximizing the success rate of flow deployment.
We model the wide area network (WAN) as an undirected graph G = {V, E, A}, where V = {v_1, v_2, ..., v_n} is a set of nodes which can represent switches or routers in the WAN, and E = {(v_s, v_d)_i | i = 1, 2, ..., m} is a set of edges in the graph which can represent link connections between WAN nodes. Let A = {a_1, a_2, ..., a_m} represent the available bandwidth per link. We use the abbreviation e_sd, or the edge-number abbreviation e_i, i = 1, 2, ..., m, to denote an edge between v_s and v_d. We use a_sd or a_i to denote the available bandwidth of a specific link. The notations used in this section can be found in Table 1.
Define a flow f_j as a user's request, indicating that at each time period, a user requests certain channels of a pre-defined transmission rate (bandwidth demand). Let F = {(v_s, v_d)_j | j = 1, 2, ..., h} denote a flow set, and d_j denote the pre-defined transmission rate of a flow. Each flow can be represented by a path that consists of a series of links, which is generated by the shortest path (SP) routing scheme. Note that a feasible SP routing scheme depends on the current network state, especially each link's available bandwidth. Let the path calculated by a certain routing scheme for f_j be P_j. After determining a routing scheme, we can get all links in P_j, so P_j can also be represented as a set of links. These links should satisfy the constraint: d_j ≤ a_i, ∀ e_i ∈ P_j.
Assuming we have a real network and several flows, after allocating some bandwidth resources to a flow, the network state changes, so the next flow's routing scheme must depend on the changed network state. If the flows are planned together, conflicts may happen; to deploy a batch of flows, the shortest paths should therefore be calculated sequentially to avoid deployment failures of the following flows. For example, consider two flows, namely f_1 and f_2, established on the network in Fig. 1, where f_1 = {v_1, v_3, 80} and f_2 = {v_1, v_3, 60}. If their shortest paths are calculated at the same time, both of them will have the same path v_1 → v_2 → v_3, but the link between v_1 and v_2 cannot hold these two flows, so we need to calculate paths sequentially.

▼Table 1. Notation overview

▲Figure 1. A network with 3 nodes and 3 edges
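To make the sequential computation concrete, the following Python sketch (using networkx) routes flows one by one on the residual bandwidth. The link capacities and the helper name deploy() are illustrative assumptions echoing the example of Fig. 1, not the implementation used in this paper.

import networkx as nx

# Toy network in the spirit of Fig. 1: three nodes, three links with available bandwidth.
G = nx.Graph()
G.add_edge("v1", "v2", bw=100)
G.add_edge("v2", "v3", bw=100)
G.add_edge("v1", "v3", bw=50)

flows = [("v1", "v3", 80), ("v1", "v3", 60)]   # (source, destination, demand d_j)

def deploy(G, src, dst, demand):
    """Route one flow on the residual graph; return its path or None on failure."""
    # Keep only links whose remaining bandwidth can still carry the demand.
    feasible = nx.Graph()
    feasible.add_nodes_from(G.nodes)
    feasible.add_edges_from((u, v) for u, v, bw in G.edges(data="bw") if bw >= demand)
    try:
        path = nx.shortest_path(feasible, src, dst)
    except nx.NetworkXNoPath:
        return None
    # Reserve bandwidth on the chosen path so later flows see the updated network state.
    for u, v in zip(path, path[1:]):
        G[u][v]["bw"] -= demand
    return path

for src, dst, demand in flows:     # the flows have to be processed in sequence
    print(deploy(G, src, dst, demand))

Because deploy() mutates the residual bandwidth, the second flow is planned against the state left by the first one, which is exactly why the computation cannot simply be done for all flows at once.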
Due to this, the total time for planning all flows is the sum of the planning time for each flow. In a real production environment, we always expect a minimum deployment time. To achieve this goal, we need to establish some resource isolation, so that parallel computing can be applied to the separate resource parts. This method is called slicing, which we introduce in the following. First, we introduce the slice set S = {s_1, ..., s_k, ..., s_K}, which represents a set of slices. Each slice has the same topology as G, and each link's capacity is a part of that of the original network. We call the process of dividing bandwidth into parts slicing. Assuming that the original network is divided into K slices, the total time of calculating route paths is the maximum time consumed among all slices. With K slices, the time consumed can be approximately reduced to 1/K of the original.
All slices can be seen as virtual networks based on the same physical network. In detail, we can also formulate a slice as an undirected graph s_k = {V, E, A_k}, which has the same topology as the original network. Additionally, the delay of an edge is the same across slices.
Different slices have the same network topology and delay, while their bandwidths and available bandwidths can be different, which depends on the slicing method. Fig. 2 illustrates the slicing method in our slicing model.

▲Figure 2. Network slices in our slicing model
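As a minimal illustration of this horizontal slicing, the sketch below (reusing the graph G and attribute names from the earlier sketch; the even 1/K split is only one possible slicing policy, not necessarily the one used later) creates K slices that share the topology but each hold a fraction of every link's available bandwidth.

import copy

def make_slices(G, K):
    """Create K slices with the same topology and delays; each slice receives 1/K
    of every link's available bandwidth (an even split, for illustration only)."""
    slices = []
    for _ in range(K):
        sk = copy.deepcopy(G)                    # same nodes and edges as the original
        for u, v in sk.edges:
            sk[u][v]["bw"] = G[u][v]["bw"] / K   # horizontal share of the bandwidth
        slices.append(sk)
    return slices

slices = make_slices(G, K=3)   # routing can now run on the three slices in parallel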
Given the network slicing, by randomly allocating users' flow requests to slices, the flows in different slices can be routed by SP in parallel, which in turn reduces the total routing time. However, compared with routing on the original network, this network slicing-based routing mechanism will lead to a higher failure rate, since the network resources might be sliced into unusable fragments. Therefore, directly slicing in this way reduces the flexibility of routing compared with routing on the original network without slicing. In other words, this method improves the calculation time at the expense of the routing success rate.
To improve the flow deployment success rate, we design an appropriate method to determine which slice each flow should be deployed in. We write f_j → s_k to denote that f_j will be deployed in s_k, and let F_k denote the set of flows deployed in s_k; with a slight abuse of notation, we also use F_k to denote the set of flows successfully deployed in s_k.
Our objective is to design a flow allocation algorithm that maximizes the success rate of flow deployment on slices, which can be formulated as max Σ_k |F_k| / h, where |F_k| counts the flows successfully deployed in s_k, the sum is over all K slices, and h is the total number of flows. In particular, we hope that the success rate after slicing can be close to the success rate of deployment on the original topology.
In this section, we propose a two-stage network resource allocation algorithm. In the first stage, once the flow requests are received, we propose an RL-based mechanism to allocate these flows to slices in sequence. In the second stage, aware of the imbalance of resources among slices, we propose a real-time resource adjustment mechanism to balance the resources dynamically.
We use reinforcement learning to train a general policy for a specific topology and flow pattern, which is used to determine the slice on which each flow is deployed so as to maximize the success rate.
RL is a field of machine learning that emphasizes how to act based on the environment so as to maximize the expected benefits[16]. The basic reinforcement learning model is defined by the tuple <S, A, R, P>, where S is the set of states, A is the set of actions, R is the reward function, and P is the state transition probability. We formulate the flow deployment with network slicing problem as a Markov decision process (MDP), as follows:
• Observation state: The observation obtained by the agent from the simulation environment is composed of two parts. One is the current state of all slices, and the other is the information of the flow to be deployed. We use ob_s and ob_f to represent these two parts, so the state s ∈ S can be denoted by s = {ob_s, ob_f}. More specifically, ob_s is represented by the adjacency matrices of the available bandwidth of each slice, and ob_f includes the source node, the target node, and the size of the current flow f_j.
• Action: The action space is a discrete space of slice indices. Specifically, our RL agent selects the slice number on which to deploy each flow, so we use A = {1, 2, ..., K} to represent the action space, where K is the number of slices.
• Reward: The reward is the numerical feedback obtained by the agent after taking an action according to the current state and applying it in the environment. The magnitude of the value reflects the quality of the action, and the reinforcement learning objective is to maximize the reward over the long term. In our setting, the specific form of the reward is shown in Eq. (1) below.

in which a binary variable in {-1, 1} indicates whether the flow f_j to be deployed is successfully deployed on the slice s_k: its value is 1 on success and -1 otherwise. I(·) is an indicator function, which equals 1 when its condition holds and 0 otherwise.
The reward function designed in this way properly reflects the correctness of the decision made by the RL agent. When most of the slices can correctly deploy the flow, the decision is easy to make, so only a small reward is generated. Moreover, if most of the slices cannot deploy the flow correctly and the RL agent still makes a correct decision, a large positive reward is produced. Conversely, a wrong decision always produces a negative reward, whose absolute value is related to the difficulty of making a wrong decision.
• Transition probability: The MDP transition probability p(s_{t+1} | s_t, a_t) is deterministic and depends on the flow routing method and on the resource adaptation between slices after the flow is deployed, which will be introduced in the next section. A schematic environment sketch capturing these MDP elements is given below.
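The sketch below casts the MDP above into a small gym-style Python environment and reuses deploy(), slices, and flows from the earlier sketches. It is only schematic: the observation follows the description (flattened available-bandwidth adjacency matrices of all slices plus the source, destination, and size of the current flow), whereas Eq. (1) is not reproduced here, so the reward below is an assumed surrogate that merely mimics the qualitative behaviour described (small reward for easy decisions, large positive reward for correct decisions on hard flows, negative reward for failures).

import copy
import numpy as np
import networkx as nx

class SliceEnv:
    """Schematic flow-to-slice assignment environment (not the paper's exact MDP)."""

    def __init__(self, slices, flows):
        self.init_slices, self.flows = slices, flows
        self.nodes = sorted(slices[0].nodes)

    def reset(self):
        self.slices = [copy.deepcopy(sk) for sk in self.init_slices]
        self.t = 0
        return self._obs()

    def _obs(self):
        # ob_s: available-bandwidth adjacency matrix of every slice, flattened ...
        mats = [nx.to_numpy_array(sk, nodelist=self.nodes, weight="bw").ravel()
                for sk in self.slices]
        # ... ob_f: source, destination and size of the flow to be deployed.
        src, dst, size = self.flows[self.t]
        ob_f = np.array([self.nodes.index(src), self.nodes.index(dst), size])
        return np.concatenate(mats + [ob_f])

    def step(self, action):
        src, dst, size = self.flows[self.t]
        # Count the slices that could have carried this flow (a proxy for difficulty).
        feasible = sum(deploy(copy.deepcopy(sk), src, dst, size) is not None
                       for sk in self.slices)
        success = deploy(self.slices[action], src, dst, size) is not None
        # Assumed reward: correct choices on hard flows pay more; failures are
        # penalised in proportion to how many slices would have succeeded.
        hard = 1.0 - feasible / len(self.slices)
        reward = hard if success else -(1.0 - hard)
        self.t += 1
        done = self.t >= len(self.flows)
        return (None if done else self._obs()), reward, done, {}

A real training setup would additionally wrap this class with explicit observation and action spaces so that an off-the-shelf RL library can consume it.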
We want to train an agent that decides which slice to assign each flow to based on the state of each slice. The interaction between the RL policy and the environment, and the specific elements of the learning environment, can be seen in Fig. 3.
1) RL training mechanism: We use an asynchronous proximal policy optimization (APPO) algorithm[17] based on the importance weighted actor-learner architecture (IMPALA)[18] to train our agent. Compared with synchronous proximal policy optimization (PPO), APPO is more efficient in wall-clock time due to its use of asynchronous sampling. Using a clipped loss also allows for multiple stochastic gradient descent (SGD) passes, and therefore potentially better sample efficiency compared with IMPALA. Meanwhile, V-trace can also be enabled to correct for off-policy samples. The PPO-based flow allocation algorithm is described in Ref. [17]. It repeatedly uses the data obtained by sampling to update the policy parameters with gradient ascent until the policy is no longer updated. The specific proof of convergence can be found in the original paper[17].
The learning process of PPO is more stable than that of Q-learning. This training framework can use distributed learning to handle actual large-scale network topology scenarios.
2) RL training architecture: We use a framework called Ray[20–22] for our reinforcement learning training, which provides simple primitives for building and running distributed applications. With the help of Ray, we can build a distributed framework to accelerate the training process of reinforcement learning. The specific framework is shown in Fig. 4. Each rollout worker contains a simulation environment and the latest policy, and each policy exchanges weights with the central SGD learning process at regular intervals to obtain the latest learned policy.
In this distributed reinforcement learning architecture, the central learner runs SGD in a tight loop while asynchronously pulling sample batches from many rollout worker processes; it also supports multiple GPU learners and experience replay.
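Training such a policy with APPO on Ray can be launched roughly as follows. This is a sketch under several assumptions: SliceEnv is additionally wrapped to expose the gym.Env interface (observation_space, action_space, reset, step), the configuration keys follow an older ray[rllib] release (newer releases use AlgorithmConfig objects), and every hyperparameter other than the worker and GPU counts, which echo the setup used in our experiments, is illustrative.

import ray
from ray import tune
from ray.tune.registry import register_env

# Assumes SliceEnv has been wrapped as a gym.Env; slices/flows come from the sketches above.
register_env("slice_env", lambda env_config: SliceEnv(slices, flows))

ray.init()
tune.run(
    "APPO",                           # asynchronous PPO built on the IMPALA architecture
    stop={"training_iteration": 5},   # convergence is observed within a few iterations
    config={
        "env": "slice_env",
        "num_workers": 120,           # parallel rollout workers sampling the environment
        "num_gpus": 1,                # central SGD learner on one GPU
        "vtrace": True,               # V-trace correction for off-policy samples
    },
)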

▲Figure 3. Interaction between policy and environment

▲Figure 4. Reinforcement learning (RL) training architecture
We first introduce the traditional resource allocation in the slicing scenario. In a network that already carries some flows, each link has a different remaining bandwidth. At this point, we start the slicing operation. The common method is to evenly allocate the remaining bandwidth of each link to each slice. This relatively fair initialization strategy is intuitive; however, the resources of the slices remain unchanged during deployment. Therefore, the flexibility of flow routing on each slice is greatly reduced compared with the unsliced case.
We now propose a dynamic resource adjustment strategy to increase the flexibility of the resource allocation method. Compared with the traditional (static) method, our proposed method dynamically adjusts resources among slices in a real-time manner, aiming at balancing the types of flows deployed in each slice.
The specific algorithm is described in Algorithm 1, and the deployment in each slice can be processed in a separate process. At the beginning of each slice process, we expand the link bandwidth of the slice in step 3; this step applies the resources transferred from other slices to the current slice before the flow deployment. Then, in step 4, we calculate the path with the specified routing strategy; in the algorithm we use the simple SP strategy as an example, but it can be any routing strategy. After this, in step 5, we check the bandwidth requests from other slices: if the available bandwidth is greater than twice the maximum flow size, we transfer the required bandwidth and reduce it on the link. Finally, in step 6, we send requests to other slices to increase the bandwidth resources of each link in P_n.

The specific structure of Algorithm 1 can be seen in Fig. 5, where we use a shared memory stack to asynchronously exchange bandwidth resources between different slice processes. This is an example with three slices, which can be adjusted to any number of slices. Each slice first obtains the bandwidth resources transferred by other slices and then deploys the flow. After that, the slice determines whether its current bandwidth resources can meet the requests of other slices and, if so, transfers the bandwidth to them. The slice finally sends bandwidth transfer requests to other slices according to the path taken by the currently deployed flow. A schematic sketch of this per-slice procedure is given below.
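A schematic Python version of one slice's loop follows. It traces steps 3-6 as described above, but message queues (one inbox per slice, e.g. multiprocessing.Queue objects) stand in for the shared-memory stack of Fig. 5, process creation is omitted, and names such as slice_worker and the message fields are assumptions rather than the actual implementation. deploy() is reused from the earlier sketch.

def slice_worker(k, sk, flows, inboxes, max_flow_size):
    """Deployment loop of slice k in the spirit of Algorithm 1 (schematic only)."""
    others = [i for i in range(len(inboxes)) if i != k]
    pending = []                                   # bandwidth requests from other slices
    for src, dst, size in flows:
        # Step 3: absorb bandwidth granted by other slices; remember their requests.
        while not inboxes[k].empty():
            msg = inboxes[k].get()
            u, v = msg["link"]
            if msg["type"] == "grant":
                sk[u][v]["bw"] += msg["bw"]
            else:
                pending.append(msg)
        # Step 4: route the flow on this slice (SP here, but any routing strategy works).
        path = deploy(sk, src, dst, size)
        # Step 5: grant a request only if the link keeps more than 2x the max flow size.
        for msg in pending:
            u, v = msg["link"]
            if sk[u][v]["bw"] > 2 * max_flow_size:
                sk[u][v]["bw"] -= msg["bw"]
                inboxes[msg["from"]].put({"type": "grant", "link": (u, v), "bw": msg["bw"]})
        pending.clear()
        # Step 6: ask the other slices for extra bandwidth on the links of the path just used.
        if path is not None:
            for u, v in zip(path, path[1:]):
                for i in others:
                    inboxes[i].put({"type": "request", "from": k,
                                    "link": (u, v), "bw": size})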
In the flow classification process of RL training, the environment that the RL agent learns from also contains the resource adaptation strategy between slices. With the cooperation of the two methods, RL can learn a policy that allocates flows repeatedly passing through certain links to the same slice. One slice then focuses on transmitting the same type of flows, and the bandwidth resources of infrequently used links are automatically adjusted to other slices that need them more.

▲Figure 5. Dynamic resource adjustment
In this section, we conduct experiments to demonstrate the effectiveness of our algorithm.
We first introduce the network topology and flow information.
Network topology: We use a ring network similar to a real 5G network as the experimental environment. This ring network consists of three kinds of rings, namely, the core layer, the convergence layer, and the access layer. The bandwidth and delay of links differ across the three layers. The core layer, a ring network composed of the core equipment, has the largest bandwidth in the network, and it is the destination of most services in the network. The convergence layer, whose bandwidth is smaller than that of the core layer, connects the core layer and the access layer, and aggregates each access layer network. The access layer, which is composed of users and terminals, has the smallest bandwidth in the network. An example of the network topology of the ring network is shown in Fig. 6. In this experiment, we use a ring network with 6 245 nodes and 8 135 links.

▲Figure 6. An example of the network topology of the ring network
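For reference, a hierarchical ring topology of this kind can be generated along the following lines; the ring sizes, bandwidth values, and naming scheme are placeholders (the topology used in our experiments has 6 245 nodes and 8 135 links), so this sketch only conveys the core/convergence/access structure.

import networkx as nx

def add_ring(G, prefix, n, bw):
    """Add an n-node ring whose links all carry the given bandwidth; return its nodes."""
    nodes = [f"{prefix}{i}" for i in range(n)]
    for a, b in zip(nodes, nodes[1:] + nodes[:1]):
        G.add_edge(a, b, bw=bw)
    return nodes

topo = nx.Graph()
core = add_ring(topo, "core", 6, bw=10_000)             # core layer: largest bandwidth
for c in core:                                          # one convergence ring per core node
    conv = add_ring(topo, f"{c}-conv", 6, bw=1_000)
    topo.add_edge(c, conv[0], bw=1_000)
    for a in conv:                                      # one access ring per convergence node
        acc = add_ring(topo, f"{a}-acc", 4, bw=100)
        topo.add_edge(a, acc[0], bw=100)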
Flows: The source and destination nodes of each flow are randomly generated, and the size of the flow is U(0.1, 1) * max bandwidth_access, where U(0.1, 1) is a uniform distribution and max bandwidth_access denotes the maximum link bandwidth of the access layer.
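A corresponding flow generator could look like the sketch below; max_bw_access is an assumed variable name standing for the maximum access-layer link bandwidth, and topo is reused from the topology sketch above.

import random

max_bw_access = 100   # maximum access-layer link bandwidth (placeholder value)

def random_flow(nodes):
    """Draw a flow with random endpoints and size U(0.1, 1) * max_bw_access."""
    src, dst = random.sample(list(nodes), 2)
    return src, dst, random.uniform(0.1, 1.0) * max_bw_access

flows = [random_flow(topo.nodes) for _ in range(1000)]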
In the training process, thanks to the Ray framework and careful hyperparameter tuning, our algorithm tends to converge within five iterations. The specific training loss is shown in Fig. 7. The critic loss reflects the accuracy of the RL agent's value estimate of the observation, and the policy loss is close to zero, which indicates that the strategy is close to the optimal strategy.

▲Figure 7. Loss during training, based on a 2080 Ti graphics processing unit (GPU) and 120 sampling processes (2 Xeon CPUs)

▲Figure 8. Total reward of different methods
In the RL test, we use accumulated rewards to show whether the classification effect is good, and we also convert the other classification methods into the same measurement indicator. From Fig. 8, we can see that the cumulative reward of the converged RL scheme fluctuates around -30 (the larger the better), while the cumulative reward of the classic pooling method and the random choice method used for comparison is around -70; thus RL can effectively choose the correct slice for each flow.
The above is a unified measurement in the reinforcement learning environment; what we actually care about is whether the success rate of deployment can be improved with the help of RL, which relates to whether our reward design is reasonable.
We therefore test the success rate of flow classification using the RL agent, as well as the success rates of the classic method and the random method for comparison; the results are shown in Fig. 9. We can see that the strategy learned with RL is significantly better than the classic method while maintaining the same trend as in Fig. 8, which shows that the reward we designed for RL is reasonable.
On the topology described above, we continuously increase the number of flows to compare the effects of different strategies in terms of the number of failed deployments. The first is the non-slicing case. After slicing, due to the decrease of routing flexibility, the failure rate will inevitably be greater than in the case without slicing, so we regard the non-slicing case as a benchmark. We then test the static slicing strategy and compare it with the dynamically adapted slicing strategy we propose. The results can be seen in Fig. 10.
From Fig. 10, we can see that the dynamic adjustment strategy provides a failure rate that is close to that of the unsliced case, and its effect is much better than that of the static solution.
The purpose of slicing is to save calculation time through parallel calculation, but the dynamic adjustment strategy will obviously add some time consumption, so we test the average routing time per flow; the results are shown in Fig. 11. We can see that the dynamic adjustment brings almost no additional time consumption.

▲Figure 9. Success rate of different methods

▲Figure 10. Relative failure number of dynamic resource adaptation vs. static slicing

▲Figure 11. Comparison of calculation time of different methods
Although we have confirmed the effectiveness of the dynamic resource adjustment strategy, we still have uncertainty about the combined effect of the two algorithms. Therefore, we conduct the following experiments to compare the performance of the RL agent flow classification and the classic method with and without the dynamic resource adjustment strategy. The result is shown in Fig. 12. It can be seen that no matter which method is used, the dynamic adjustment strategy can indeed further improve the success rate of flow deployment.

▲Figure 12. Comparison of the effectiveness of combined methods
To solve the time-consuming problem of path calculation, a slicing operation is used to separate resources so that parallel path calculation can be performed to shorten the time. At the same time, although slicing shortens the calculation time, a more serious problem occurs: the decline in the success rate of flow deployment. In response to this problem, we propose an efficient network slicing scheme with a dynamic resource allocation algorithm that keeps the success rate of flow deployment close to the level of the unsliced case. It combines reinforcement-learning-based flow classification with resource adaptation between slices. The dynamic resource adaptation strategy enables slice resources to be exchanged during flow deployment and to gradually adapt to the characteristics of the flows deployed in each slice. In this way, parallel path calculation shortens the computation time without sacrificing the success rate. Besides, we have also tested our method on a large-scale realistic network structure, and its effect exceeds that of previous classic methods.