(College of Information Science and Electronic Engineering, Zhejiang University, Hangzhou, Zhejiang 310027, China)
Abstract: By periodically aggregating local learning updates from edge users, federated edge learning (FEEL) is envisioned as a promising means to reap the benefit of rich local data and protect users' privacy. However, the scarce wireless communication resource greatly limits the number of participating users and is regarded as the main bottleneck hindering the development of FEEL. To tackle this issue, we propose a user selection policy based on data importance for the FEEL system. In order to quantify the data importance of each user, we first analyze the relationship between the loss decay and the squared norm of the gradient. Then, we formulate a combinatorial optimization problem to maximize the learning efficiency by jointly considering user selection and communication resource allocation. By problem transformation and relaxation, the optimal user selection policy and resource allocation are derived, and a polynomial-time optimal algorithm is developed. Finally, we deploy two commonly used deep neural network (DNN) models for simulation. The results validate that our proposed algorithm has strong generalization ability and can attain higher learning efficiency compared with other traditional algorithms.
Keywords: data importance; federated edge learning; learning accuracy; learning efficiency; resource allocation; user selection
With the explosive growth of data generated by mobile devices and the remarkable breakthroughs made in artificial intelligence (AI) in recent years, the combination of AI and wireless networks is attracting increasing interest[1]. To leverage the abundant data, which are unevenly distributed over a large number of edge devices, and to train a high-quality prediction model, the traditional scheme is to perform centralized learning by transmitting the raw data to a data center. However, this scheme has two drawbacks. On the one hand, the privacy of users may be divulged when the data center suffers from malicious attacks. On the other hand, the communication latency is long since the volume of data is large and the communication resource is limited. To overcome these two issues, a new framework, namely federated edge learning (FEEL), has recently been proposed in Ref. [2]. This framework combines the distributed learning framework named federated learning (FL)[3] with mobile edge computing (MEC)[4], which not only ensures users' privacy but also exploits the computing resources of both edge devices and edge servers.
In the FEEL system, edge devices need to interact with the edge server constantly to train a global model. Thus, communication cost is one of the major constraints of model training since the wireless communication resource is limited. Recently, several works have investigated accelerating the training task by reducing the communication overhead[5-6]. To achieve a low-latency FEEL system, the authors in Ref. [5] propose a broadband analog aggregation scheme by exploiting over-the-air computation and derive two communication-and-learning tradeoffs. In Ref. [6], the authors propose a new protocol to reduce the communication overhead and improve the training speed by selecting as many devices as possible based on their channel state information (CSI). Besides, energy-efficient FL over wireless networks has been investigated in Refs. [7] and [8]. In Ref. [7], energy-efficient strategies are proposed for joint bandwidth allocation and energy-and-learning aware scheduling with less energy consumption. The authors in Ref. [8] propose an iterative algorithm to achieve the tradeoff between latency and energy consumption for FL. Moreover, several recent works focus on the problem of user selection for FL over wireless networks[9-12]. In Ref. [9], the authors derive a tradeoff between the number of scheduled users and the subchannel bandwidth under a fixed amount of available spectrum. To improve the running efficiency of FL, the authors in Ref. [10] propose a scheduling policy by exploiting the CSI, i.e., the instantaneous channel qualities. In Ref. [11], the authors consider a user selection problem based on packet errors and the availability of wireless resources, and a probabilistic user selection scheme is proposed to reduce the convergence time of FL in Ref. [12].
However, the aforementioned works ignore the fact that the process of model training is time-consuming as well. According to Ref. [13], different training samples are not equally important in a training task. Therefore, faced with massive data, the topic of selecting important data to further accelerate the training task deserves to be studied. Several recent works have studied this topic. In Refs. [14] and [15], data importance is quantified by the signal-to-noise ratio (SNR) and data uncertainty measured by the distance to the decision boundary. Based on this, the authors propose a data-importance-aware retransmission protocol and a user scheduling algorithm, respectively.
As mentioned before, some works have already investigated the acceleration of the training task based on data importance. However, this topic has not yet been investigated in the FEEL system, which is a distributed edge learning system. Inspired by this, we consider an FEEL system where the learning efficiency is improved by user selection based on data importance. First, we analyze the relationship between the loss decay and the learning update information (LUI), i.e., the squared norm of the gradient, and derive an indicator to quantify the data importance. Then, an optimization problem to maximize the learning efficiency of the FEEL system is formulated by jointly considering user selection and communication resource allocation. The closed-form solution for the optimal user selection policy and communication resource allocation is derived by problem transformation and relaxation. Based on this, we develop a polynomial-time algorithm to solve this mixed-integer programming problem. Finally, we verify the generalization ability and the performance improvement of our proposed algorithm by extensive simulation.
The rest of this paper is organized as follows. In Section 2, we introduce the FEEL system and establish the deep neural network (DNN) model and the communication model. In Section 3, we propose an indicator to quantify the data importance, analyze the end-to-end latency in each communication round, and formulate the optimization problem to maximize the learning efficiency. The optimal solution and the optimal algorithm are developed in Section 4. Simulation results are presented in Section 5, and the whole paper is concluded in Section 6.
In this section, we first introduce the FEEL system model. Then, both the DNN model and the communication model are introduced.
We consider an FEEL system as shown in Fig. 1, which comprises an edge server and K distributed users, denoted by K = {1, 2, ..., K}. Each user utilizes its local dataset to train the local DNN model. Let D_k = {(x_1, y_1), ..., (x_{N_k}, y_{N_k})} denote the local dataset of user k, where x_i is the training sample, y_i is the corresponding ground-truth label, and N_k is the size of the dataset. During each communication round, users first upload their gradients to the edge server. Then, the edge server collects the local gradients from users and aggregates them as the global gradient. Users update their local models with the global gradient broadcast by the edge server. Ultimately, users are supposed to collaborate with each other in training a shared global model. Therefore, users' privacy is protected since the raw data are not transmitted to the edge server. However, due to the limited wireless communication resource, the number of users participating in the training task is restricted. To tackle this issue, we propose a user selection policy that jointly considers the LUI and CSI of each user. During each communication round, users' data are not of equal importance. Hence, we only select a subset of users to upload their local gradients based on data importance and channel data rate. The following seven steps are defined as a communication round.
1) Calculate local gradient. In the n-th communication round, each user utilizes its local dataset to train its local model. Denote θ as the parameter set of the DNN model. The local gradient vector g_k[n] can be calculated by the backpropagation algorithm. Note that the local model and the local gradient differ among users since different users may have different datasets.
2) Upload the squared norm of the local gradient. After obtaining the local gradient vector g_k[n], each user calculates the squared norm of the local gradient ‖g_k[n]‖_2^2 and transmits it to the edge server. Here, ‖·‖_2 is the L2 norm.
3) Select users. The edge server receives the squared norms of the local gradients from all users. Based on data importance and channel data rate, the edge server determines which users are selected to participate in the training task.
4) Upload local gradients of selected users. In this step, the selected users upload their local gradients to the edge server via the time division multiple access (TDMA) method, without loss of generality.
5) Aggregate global gradient. The edge server receives the local gradients of all selected users and then aggregates them as the global gradient, which can be expressed as
g[n] = Σ_{k=1}^{K} a_k N_k g_k[n] / Σ_{k=1}^{K} a_k N_k,
where a_k ∈ {0, 1} indicates whether user k is selected, i.e., a_k = 1 if user k is selected and a_k = 0 otherwise.
6) Broadcast global gradient. After finishing the global gradient aggregation, the base station (BS) broadcasts the global gradient to all the users.
7) Update local model. After the global gradient is received, each user updates its local model, as
θ[n+1] = θ[n] − ξ[n] g[n],
where ξ[n] is the learning rate of the n-th communication round.
The above seven steps are performed periodically until the global model converges. During the training process, the local gradients and the CSI of users differ in each communication round. Therefore, the edge server should run the optimal algorithm to select users in each communication round.
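The seven steps above can be sketched end to end as follows. This is an illustrative toy, not the paper's implementation: the linear model, dataset sizes, learning rate, and the top-2 selection rule are placeholders (the actual selection policy is derived later from the optimization problem).

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup (hypothetical sizes): K users, a linear model with d parameters.
K, d = 4, 8
theta = rng.normal(size=d)                       # shared global model
datasets = [rng.normal(size=(50, d)) for _ in range(K)]
labels = [rng.normal(size=50) for _ in range(K)]

def local_gradient(theta, X, y):
    """Least-squares gradient as a stand-in for DNN backpropagation."""
    return X.T @ (X @ theta - y) / len(y)

# Steps 1-2: each user computes its gradient and the squared L2 norm.
grads = [local_gradient(theta, X, y) for X, y in zip(datasets, labels)]
norms = [float(np.sum(g**2)) for g in grads]

# Step 3: the server selects users (here simply the two largest norms).
a = np.zeros(K, dtype=int)
a[np.argsort(norms)[-2:]] = 1

# Steps 4-5: selected users upload; the server aggregates, weighted by N_k.
N = np.array([len(y) for y in labels])
global_grad = sum(a[k] * N[k] * grads[k] for k in range(K)) / np.sum(a * N)

# Steps 6-7: broadcast and local update with learning rate xi.
xi = 0.1
theta = theta - xi * global_grad
```

One such pass corresponds to one communication round; repeating it until the loss plateaus mirrors the training loop described above.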
In this work, all users adopt the same DNN model for training. To evaluate the error between the learning output and the ground-truth label y_i, we define the loss function of training samples as l(θ, x_i, y_i). Thus, the local loss function of user k and the global loss function can be represented as
L_k(θ) = (1/N_k) Σ_{i=1}^{N_k} l(θ, x_i, y_i)

L(θ) = Σ_{k=1}^{K} N_k L_k(θ) / Σ_{k=1}^{K} N_k,
▲Figure 1. Seven steps in each communication round.

respectively. In the course of training, the global loss function L(θ) is the objective function to be minimized. In our scheme, we aim to accelerate the training task and train a high-quality global model. Without loss of generality, we utilize stochastic gradient descent (SGD) as the optimization algorithm. Then, the local gradient vector of user k is given by
g_k[n] = ∇L_k(θ[n]),
where ∇ denotes the gradient operator.
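As a minimal sketch of these definitions, the following uses a squared-error loss in place of the paper's DNN loss l(θ, x_i, y_i); the function names and the loss choice are our assumptions.

```python
import numpy as np

def local_loss(theta, X, y):
    """L_k: average per-sample squared-error loss over user k's dataset
    (a stand-in for the DNN loss l(theta, x_i, y_i))."""
    return float(np.mean((X @ theta - y) ** 2) / 2)

def local_gradient(theta, X, y):
    """g_k: gradient of the local loss with respect to theta, i.e. the
    SGD direction each user computes by backpropagation in the paper."""
    return X.T @ (X @ theta - y) / len(y)
```

For example, with `X = np.eye(2)`, `y = [1, 2]` and `theta = 0`, the local loss is 1.25 and the gradient is `[-0.5, -1.0]`.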
As described above, the distributed users and the edge server need to exchange data with each other in each communication round. In our scheme, two frequently used approaches of data transmission are adopted, namely TDMA and broadcasting.
First, the selected users upload their local gradients to the edge server via the TDMA method. Specifically, a time frame is divided into time slots, and each user transmits its data in its own time slot. According to Ref. [16], the length of each time frame in the LTE standards is 10 ms. In fact, the transmission delay of the gradients is on the scale of seconds, which is far larger than the length of a time frame[17]. Therefore, we can use the average uplink channel capacity, rather than the instantaneous channel capacity, to evaluate the data rate of user k[18], which can be expressed as
R_k^U = B E[log_2(1 + p^U h_k / (N_0 B))].
After the global gradient aggregation is finished, the BS broadcasts the global gradient to all users. In this way, all users are able to receive the global gradient synchronously. Let h_k^D denote the downlink channel power gain of user k and p^D denote the transmission power for all users. Thus, the downlink data rate is given by
R^D = B log_2(1 + p^D min_{k∈K} h_k^D / (N_0 B)).
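The rates can be evaluated numerically as below. The Shannon-capacity form and the worst-user broadcast rate are our assumptions; the bandwidth, noise power spectral density, and transmit power values are taken from the simulation setup in Section 5.

```python
import math

B = 5e6                               # system bandwidth [Hz]
N0 = 10 ** ((-174 - 30) / 10)         # noise PSD: -174 dBm/Hz -> W/Hz
p_tx = 10 ** ((24 - 30) / 10)         # 24 dBm -> W

def rate(channel_gain, p=p_tx, bandwidth=B, n0=N0):
    """Flat-fading channel capacity in bit/s (assumed Shannon form)."""
    return bandwidth * math.log2(1 + p * channel_gain / (n0 * bandwidth))

def broadcast_rate(gains):
    """Broadcast reaches all users at the rate of the worst channel
    (our assumption for the synchronous downlink)."""
    return rate(min(gains))
```

A stronger channel gain yields a strictly higher rate, so `rate` is monotone in `channel_gain`, which is what the user selection later exploits.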
In this section, we first propose an indicator to quantify the data importance of users. Then, we analyze the end-to-end latency in each communication round and formulate the optimization problem to maximize the lower bound of the system learning efficiency.
In each communication round, only part of the users are selected to participate in the training task because of the limited wireless communication resource. According to Ref. [13], different training samples do not contribute equally to the model training. Consequently, we intend to select users based on the level of data importance as well as the channel data rate. To quantify the data importance, we define the loss decay function as

ΔL[n] = L(θ[n]) − L(θ[n+1]).
The loss decay function ΔL[n] indicates the decrease of the loss in the n-th communication round. From Eq. (8), in the same period of time, the larger the loss decay is, the faster the training speed is. In other words, the loss decay reflects the data importance to some extent.
According to Ref. [19], the loss decay is proportional to the squared norm of the gradient. Thus, the lower bound of the loss decay in the n-th communication round is given as

ΔL[n] ≥ β ‖g[n]‖_2^2,
where β is a constant determined by the learning rate and the specific DNN model. Therefore, we can further link the data importance with the squared norm of the gradient vector. With the above discussion, we can quantify the data importance of user k by the squared norm of its local gradient, which can be represented as
ρ_k = ‖g_k[n]‖_2^2.
Therefore, the lower bound of the global loss decay in a communication round can be expressed as

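The importance indicator ρ_k itself is a one-liner; the function name is ours.

```python
import numpy as np

def data_importance(local_grad):
    """rho_k: squared L2 norm of user k's local gradient, the indicator
    used here to rank the importance of each user's data."""
    return float(np.sum(np.square(local_grad)))
```

For example, a gradient of `[3, 4]` has importance 3² + 4² = 25.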
As mentioned before, our goal is to improve the learning efficiency of the FEEL system. Thus, the end-to-end latency of one communication round should be optimized. The detailed analysis of the latency in one communication round is given as follows.
1) Calculate local gradient. The latency of local training for user k is denoted by t_k^{cal}.
2) Upload local gradients of selected users. As mentioned before, only the selected users upload their local gradients to the edge server via TDMA. Thus, the average transmission delay of user k can be expressed as
t_k^U = V / (τ_k R_k^U),
where τ_k is the proportion of the time slot for user k in a time frame and V is the volume of the gradient, which is a constant for all users.
3) Broadcast global gradient. For all users, the latency of downloading the global gradient is given by
t^D = V / R^D.
4) Update local model. Let t_k^{ud} denote the delay of model updating for user k.
Since the squared norm of the local gradient and the value of a_k are small enough, the corresponding transmission delay can be neglected. Besides, the edge server generally has a powerful computing capacity. Therefore, the aggregation delay can also be neglected.
We then provide further analysis to obtain the overall latency of one communication round. Note that all users receive the global gradient and start to update the local model synchronously. However, the delay of model updating and training varies since users may have different computing power. Hence, users do not upload the squared norms of the local gradients to the edge server until they have all finished model updating and training. In addition, the edge server does not begin to aggregate the global gradient until the selected users have uploaded their local gradients. Based on the above analysis, the end-to-end latency of the FEEL system in one communication round is given by

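Under the timing just described, the per-round latency can be sketched as follows. This is a hedged approximation: it assumes that the selected uploads proceed concurrently in their interleaved TDMA slot fractions, so the slowest upload dominates; the paper's exact expression may compose the terms differently.

```python
def round_latency(t_train, t_update, t_upload, t_broadcast, selected):
    """End-to-end latency of one round (sketch):
    1. every user finishes model updating then local training,
    2. selected users upload concurrently in their TDMA slot fractions,
    3. the server broadcasts the aggregated gradient."""
    t_local = max(tr + tu for tr, tu in zip(t_train, t_update))
    uploads = [t for t, a in zip(t_upload, selected) if a]
    return t_local + (max(uploads) if uploads else 0.0) + t_broadcast
```

For instance, with training delays (1.0, 2.0) s, update delays (0.1, 0.2) s, upload delays (3.0, 5.0) s, a 0.5 s broadcast, and only the first user selected, the round takes 2.2 + 3.0 + 0.5 = 5.7 s.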
In this work, we aim to improve the learning efficiency of the FEEL system by jointly considering user selection and communication resource allocation. According to Ref. [20], we adopt the following criterion to evaluate the training performance of the FEEL system.
Definition 1: The learning efficiency of the FEEL system can be defined as

Remark 1: The definition of the learning efficiency implies the decay rate of the global loss in a given time period T. An improvement of the learning efficiency means an acceleration of the training task. Therefore, it is appropriate to evaluate the training performance of the FEEL system by the learning efficiency. In our work, we aim to reduce the communication delay of each communication round. Besides, we maximize the lower bound of the system learning efficiency. Consequently, the learning efficiency of the FEEL system can be improved.
Based on the above analysis, the optimization problem can be mathematically formulated as

where the constraint (16b) indicates that the end-to-end latency of each user in one communication round is no more than the end-to-end latency of the FEEL system, and the constraint (16c) represents the uplink communication resource limitation. For convenience of description, we rewrite the latency expression as T_C in the following sections.
It is evident that the optimization problem P1 is a mixed-integer programming problem. Since the objective function of P1 is non-convex, it is rather challenging to solve it directly. Combining Eqs. (12) and (16b), we notice that T is relevant to a_k and τ_k. When a_k and τ_k are fixed, the variable T must be minimized to maximize the learning efficiency. Therefore, the optimal solution to problem P1 can be obtained when "≤" in the constraint (16b) is set to "=", which determines τ_k accordingly.
However, problem P1 is still hard to solve due to the integer constraint (16d). Therefore, we relax the integer constraint a_k ∈ {0, 1} to the real-value constraint a_k ∈ [0, 1]. Problem P1 can then be relaxed into problem P2, which is given by


In the following sections, we first obtain the optimal solution to problem P2 with fixed T. Then, we continue to solve problem P2 with varying T, and the optimal solution to problem P1 is finally derived.
We now solve problem P2. When T is given, problem P2 can be converted to a standard convex optimization problem since the objective function is concave and all constraints are convex. Thus, we can derive the optimal solution to P2 with fixed T.
Theorem 1: The optimal solution to problem P2 with fixed T is given as follows.
a_k^* = 1, if ρ_k R_k^U > λ^*;  a_k^* ∈ [0, 1], if ρ_k R_k^U = λ^*;  a_k^* = 0, if ρ_k R_k^U < λ^*,
where λ^* is the optimal value of the Lagrange multiplier satisfying the constraint (17b). Particularly, the real value of a_k^* depends on the constraint (17b) if ρ_k R_k^U = λ^*.
Proof: See Appendix A.
Remark 2 (Optimal user selection policy): According to Theorem 1, λ^* can be regarded as the threshold which determines whether to select a user. Besides, the selection priority of user k depends on the product of its data importance ρ_k and the uplink data rate R_k^U. On the one hand, a user with more important data contributes more to the global model training. On the other hand, the transmission delay can be shortened by selecting users with higher uplink data rates. Thus, the system prefers to select users with larger values of ρ_k R_k^U. By doing so, the learning efficiency of the FEEL system can be improved.
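Assuming the threshold λ* is already known, the rule of Theorem 1 reduces to a one-pass comparison; in the paper λ* is derived from constraint (17b), whereas here it is simply a given argument.

```python
def select_users(rho, rate_up, lam):
    """Threshold rule sketch: user k is selected (a_k = 1) iff the product
    of its data importance rho_k and uplink rate R_k exceeds lambda*."""
    return [1 if r * ru > lam else 0 for r, ru in zip(rho, rate_up)]
```

With importances (1, 2, 3), equal unit rates, and a threshold of 1.5, only the last two users are selected.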
In this part, we proceed to obtain the optimal system latency and develop the optimal communication resource allocation to further improve the learning efficiency of the FEEL system. So far, we have obtained the optimal user selection strategy when the system latency is invariant. Based on this, the optimal system latency must be obtained when "≤" in the constraint (17b) is set to "=". In order to develop the optimal T and τ_k, we introduce the following theorem.
Theorem 2: The optimal solutions to problem P2 and problem P1 are exactly the same.
Proof: See Appendix B.
Remark 3 (Optimal system latency and communication resource allocation): Theorem 2 indicates that the optimal solution of a_k to problem P2 must be an integer solution. Based on this, the range of feasible solutions to problem P2 can be reduced greatly. Thus, we only need to compare the learning efficiency of the FEEL system as the total number of selected users varies. Here, users in the system are selected by the optimal user selection policy as aforementioned. So the T^* that achieves the maximum learning efficiency is the optimal system latency for both problems P2 and P1, which can be expressed as

As we have indicated before, when "≤" in the constraint (16b) is set to "=", the solution must be the optimal solution of problem P1. Consequently, we can obtain the optimal communication resource allocation by simple mathematical calculation, as
τ_k^* = (a_k^* / R_k^U) / Σ_{j=1}^{K} (a_j^* / R_j^U).
The result in Eq. (19) shows that a smaller time slot is allocated to a user with a higher uplink data rate.
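A sketch consistent with this observation allocates slot fractions inversely proportional to the selected users' uplink rates, so that uploads of the same gradient volume V finish simultaneously; the exact closed form of Eq. (19) is our assumption.

```python
def slot_fractions(rates, selected):
    """Slot fraction tau_k for each user: zero for unselected users, and
    inversely proportional to R_k among selected ones (normalized to 1)."""
    inv = [a / r for r, a in zip(rates, selected)]
    total = sum(inv)
    return [x / total if total > 0 else 0.0 for x in inv]
```

With uplink rates (1, 2) and both users selected, the fractions are (2/3, 1/3), and both uploads of a unit gradient take the same time V / (τ_k R_k) = 1.5 V.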
Thus far, we have obtained the optimal solution to problem P1. In this part, we intend to develop an optimal algorithm for problem P1 based on the above analysis. As mentioned before, in order to obtain the optimal solution to problem P1, all selection cases should be compared. However, this becomes very time-consuming as the number of users increases. Therefore, an algorithm with low computational complexity is required. We define E_M, M ∈ {1, 2, ..., K}, as the learning efficiency of the FEEL system when M users are selected. To better fit practical systems, we have the following theorem.
Theorem 3: E_M first increases and then decreases with the increase of M.
Proof: See Appendix C.
Remark 4: Theorem 3 indicates that the learning efficiency E_M has only one global optimum. Therefore, we can select users successively by the optimal user selection policy until the learning efficiency of the FEEL system begins to decrease. By doing so, we are able to find the optimal solution to problem P1. According to the above analysis, the optimal algorithm for problem P1 is shown in Algorithm 1. We can easily find that the computational complexity of this algorithm is determined by the sort operation. Therefore, the computational complexity is O(K log K). With regard to mixed-integer programming problems, it is acceptable to find the optimal solution with polynomial-time complexity, indicating that this algorithm can be applied to practical systems.
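Algorithm 1 can be sketched along the lines of Remark 4. Here `efficiency` is an assumed callable that evaluates E_M for a given selected set (in the paper it follows from the latency and loss-decay expressions); the sort dominates the cost, giving O(K log K).

```python
def optimal_selection(rho, rates, efficiency):
    """Sort users by rho_k * R_k in descending order, then enlarge the
    selected set one user at a time until the learning efficiency E_M
    stops improving (valid because E_M is unimodal in M by Theorem 3)."""
    order = sorted(range(len(rho)), key=lambda k: rho[k] * rates[k], reverse=True)
    best, best_eff = [], float("-inf")
    for m in range(1, len(order) + 1):
        eff = efficiency(order[:m])
        if eff <= best_eff:
            break  # unimodality: efficiency will only keep decreasing
        best, best_eff = order[:m], eff
    return best, best_eff
```

For a toy unimodal efficiency that peaks at two selected users, the routine stops after probing three and returns the top-2 users by ρ_k R_k.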

In this section, we test the performance of the proposed algorithm by simulation and validate the performance improvement by comparing it with other traditional algorithms.
In the FEEL system, K users are stochastically distributed over the coverage of the BS. The coverage area of the BS is a circle with a radius of 500 m. All users are connected to the BS by wireless channels. The channel gains are generated by the path loss model, 128.1 + 37.6 log10(d [km]), while the small-scale fading obeys the Rayleigh distribution with unit variance. The noise power spectral density is -174 dBm/Hz and the system bandwidth is 5 MHz. The uplink and downlink transmit powers are both 24 dBm.
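The channel model above can be reproduced as follows; modeling the Rayleigh fading as a unit-mean exponentially distributed channel power is our assumption.

```python
import numpy as np

rng = np.random.default_rng(1)

def channel_power_gain(d_km):
    """Channel power gain from the stated setup: path loss
    128.1 + 37.6*log10(d [km]) in dB, multiplied by Rayleigh small-scale
    fading (exponentially distributed power with unit mean, an assumption)."""
    path_loss_db = 128.1 + 37.6 * np.log10(d_km)
    return 10 ** (-path_loss_db / 10) * rng.exponential(1.0)
```

At the 500 m cell edge the path loss alone is about 117 dB, so the resulting gains are on the order of 1e-12.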
We utilize the CIFAR-10 dataset as the local dataset of all users to train the model. The dataset is composed of 60 000 32×32 color images in 10 classes, which include 50 000 training images and 10 000 test images. We first shuffle all training samples, divide them into K equal parts, and then distribute them to all users, respectively. Two common DNN models, MobileNetV2 and ResNet18, are deployed for image classification. Since training from scratch is time-consuming, we utilize pre-trained models to reduce the model convergence time.
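The shuffle-and-split data partition can be sketched as below; with K = 14 the 50 000 CIFAR-10 training images do not divide evenly, so `array_split` yields nearly equal parts.

```python
import numpy as np

rng = np.random.default_rng(2)

def partition(num_samples, K):
    """Shuffle sample indices and split them into K (nearly) equal,
    disjoint parts, one per user."""
    idx = rng.permutation(num_samples)
    return np.array_split(idx, K)

parts = partition(50_000, 14)
```

Each user then indexes its own slice of the training set with its assigned indices.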

▲Figure 2. The test accuracy versus communication round.

▲Figure 3. The global training loss versus communication round.
The generalization ability refers to the adaptability of algorithms to different DNN models. To test the generalization ability of our proposed algorithm, we implement it on the two aforementioned DNN models when there are K = 14 users in the FEEL system. Meanwhile, we compare the performance of the proposed algorithm with the baseline algorithm, in which all users are selected with equal communication resource allocation. The simulation results of the test accuracy and the global training loss are shown in Figs. 2 and 3, respectively. From the figures, the proposed algorithm can achieve a high learning accuracy and a fast convergence rate for different DNN models. The result shows that our proposed algorithm has excellent generalization ability and can be widely implemented in practical systems. Moreover, the performance of our proposed algorithm is similar to that of the baseline algorithm with respect to the number of communication rounds rather than the training time. This is reasonable since our proposed algorithm aims to reduce the communication delay in each communication round, rather than the number of communication rounds. Besides, this result demonstrates that our proposed algorithm can achieve a similar training speed while selecting only a subset of the users in the FEEL system.
In this part, we compare the performance of our proposed algorithm with other conventional algorithms to verify its superiority. The two benchmark algorithms are described as follows. · Baseline algorithm: In each communication round, all users in the FEEL system participate in the training task with equal communication resource allocation, i.e., τ_k = 1/K.
· All selected algorithm: In each communication round, all users in the FEEL system participate in the training task with optimal communication resource allocation based on Eq. (19).
Here we use the pre-trained ResNet18 model to test the performance of the three algorithms in an FEEL system with K = 14 users. The test accuracy versus training time with different algorithms is shown in Fig. 4. From the figure, it can be seen that our proposed algorithm achieves the highest test accuracy among all algorithms. The reason is that our proposed algorithm not only selects users based on data importance but also performs the optimal communication resource allocation. By doing so, only users with more important data and higher uplink data rates participate in the training task. Thus, the communication latency is reduced and the global loss decay rate increases, which eventually improves the learning efficiency of the system. The gap between the baseline algorithm and the all selected algorithm demonstrates the gain obtained by the optimal communication resource allocation. The gap between the all selected algorithm and the proposed algorithm demonstrates the gain obtained by the optimal user selection. In conclusion, our proposed algorithm accelerates the training task and improves the learning efficiency of the FEEL system by jointly considering user selection and communication resource allocation.
To further verify the applicability and effectiveness of our proposed algorithm, we randomly select one communication round to obtain more simulation results. Figs. 5 and 6 illustrate the results of user selection and communication resource allocation for our proposed algorithm in the selected communication round, respectively. From Fig. 5, we can observe that user k is selected only when the product of its data importance and uplink data rate, i.e., ρ_k R_k^U, is no less than the selection threshold, which is consistent with Theorem 1. Moreover, in order to clearly present the relationship between the communication resource allocation and the uplink data rate, we plot the corresponding uplink data rate for all users in Fig. 7. Combining Fig. 6 with Fig. 7, it can be observed that a selected user with a higher uplink data rate is allocated less communication resource, which is consistent with Eq. (19).
In the end, we further study how the number of users impacts the training performance of the FEEL system. The test accuracy versus training time with different numbers of users is shown in Fig. 8. From the figure, it can be seen that our proposed algorithm achieves the highest system learning efficiency when K = 6. The reasons can be explained as follows. The time slot allocated to each selected user is large when the number of users is small. Consequently, the communication latency is greatly reduced, and the learning efficiency of the FEEL system significantly improves in this scenario. Moreover, the number of selected users is limited by the scarce wireless communication resource when the number of users is too large. Therefore, the learning efficiency of the FEEL system does not improve with the user number when there are too many users in the system.

▲Figure 4. The test accuracy versus training time with different algorithms.

▲Figure 5. User selection for the proposed algorithm.

▲Figure 6. Communication resource allocation for the proposed algorithm.

▲Figure 7. Corresponding uplink data rate.

▲Figure 8. Test accuracy versus training time with different numbers of users.
In this paper, we aim to accelerate the training task and improve the learning efficiency of the FEEL system by proposing an optimal user selection policy based on data importance and CSI. After analyzing the data importance of users and the end-to-end latency of the FEEL system, we formulate an optimization problem to maximize the learning efficiency of the FEEL system. By problem transformation and relaxation, we first develop the optimal user selection policy. Based on this, the optimal communication resource allocation is developed in closed form. We further develop a polynomial-time algorithm to solve this mixed-integer programming problem and prove its optimality. Finally, the simulation results show that our proposed algorithm has strong generalization ability and can significantly improve the learning efficiency of the FEEL system.
Our work has demonstrated that the learning efficiency of the FEEL system can be further improved by user selection based on data importance and wireless resource allocation. However, some assumptions have been made to gain insightful results. In the future, we will carry out further investigation to better fit practical systems. First, we have assumed that there is no inter-cell interference in the uplink; the FEEL system with inter-cell interference deserves further investigation. Second, the local gradients received by the edge server may contain data errors, which may affect the training performance of the FEEL system. Therefore, our future work can further study the impact of those errors. Last but not least, it is meaningful to extend our proposed algorithm to an FEEL system where orthogonal frequency-division multiple access (OFDMA) is adopted for data transmission.
We apply the Lagrangian method to obtain the optimal solution to problem P2 with fixed T since it is a convex optimization problem. The Lagrangian function is defined as

where λ is the Lagrange multiplier related to the constraint (17b). By applying the Karush-Kuhn-Tucker (KKT) conditions and simple calculation, we can draw the following necessary and sufficient conditions, as

With simple mathematical calculation, we can derive the optimal user selection policy as shown in Theorem 1, which ends the proof.
According to Theorem 1, users are selected in the descending order of ρ_k R_k^U. Hence, we can assume that a_k = 1 when k = 1, 2, ..., M and a_k = 0 when k = M+2, M+3, ..., K. Moreover, it is not clear whether a_{M+1} = 0 or a_{M+1} = 1. Then, we denote E^(1) as the objective function of problem P2, which can be expressed as

So the derivative of E^(1) with respect to a_{M+1} is given by

It shows that the sign of the derivative is consistent with the sign of the numerator of Eq. (24). However, the value of the numerator of Eq. (24) is independent of a_{M+1}. Therefore, E^(1) is monotone when a_{M+1} ∈ [0, 1]. That is, the maximum value of E^(1) must be obtained either when a_{M+1} = 0 or when a_{M+1} = 1. In conclusion, the optimal solution of a_k to P2 must be an integer solution. Hence, this solution must also be a feasible solution to problem P1. Moreover, after relaxation, the maximum value of the objective function is non-decreasing. Thus, the optimal solutions to problems P2 and P1 are exactly the same, which ends the proof.
According to Theorem 2, we know that the optimal solutions to problems P2 and P1 are exactly the same. Thus, we only consider the integer solutions here. When no user is selected, the learning efficiency is obviously zero. The learning efficiency must first increase with the number of selected users; in other words, at least one user is selected. Then we consider the following condition.

From Eqs. (25) and (26), we can obtain the following inequalities.

According to Eq. (27), we can derive the recurrence formula as

which implies E_{M-2} < E_{M-1}. Then we can obtain the conclusion recursively, as

Similar to the above analysis, we have the following conclusion, as

Based on the above analysis, E_M first increases and then decreases with M, which ends the proof.