(Department of Electronic Engineering, Tsinghua University, Beijing 100084, China)
Abstract: Due to the increasing need for massive data analysis and machine learning model training at the network edge, as well as rising concerns about data privacy, a new distributed training framework called federated learning (FL) has emerged and attracted much attention from both academia and industry. In FL, participating devices iteratively update their local models based on their own data and contribute to the global training by uploading model updates until the training converges. In this way, the computation capabilities of mobile devices can be utilized and data privacy can be preserved. However, deploying FL in resource-constrained wireless networks encounters several challenges, including the limited energy of mobile devices, weak onboard computing capability, and scarce wireless bandwidth. To address these challenges, recent solutions have been proposed to maximize the convergence rate or minimize the energy consumption under heterogeneous constraints. In this overview, we first introduce the background and fundamentals of FL. Then, the key challenges in deploying FL in wireless networks are discussed, and several existing solutions are reviewed. Finally, we highlight the open issues and future research directions in FL scheduling.
Keywords: federated learning; wireless network; edge computing; scheduling
With the deployment of deep learning algorithms on Internet-of-Things (IoT) devices at the network edge [1] and the explosive growth of mobile data [2], technologies such as edge learning [3] have emerged, which focus on running deep learning algorithms at the wireless access network. To ensure the performance of deep learning in practical scenarios, such as autonomous driving and user preference prediction, the learning model must be trained efficiently with the data generated at the network edge. However, transmitting massive training data from edge devices to servers is challenging due to limited wireless communication resources, as well as privacy requirements, which makes it difficult to rely on centralized training for updating the learning model. To solve this problem, federated learning (FL) [4] has been proposed, which exchanges learning models rather than raw data between edge devices and edge servers by deploying the training algorithms on the edge devices. Since mobile devices consume their limited computation and communication resources when participating in FL, they may not be willing to contribute. Therefore, incentives have been introduced, such as access to the high-quality models trained by FL or payments for participating in the training.
In a typical FL system, an edge server and several edge devices collaboratively train a learning model. The architecture of an FL system is shown in Fig. 1. In each iteration (also known as a communication round), the edge server aggregates the local models from the edge devices to update the global model, and then broadcasts the newest model to the edge devices for training in the next round. After receiving the newest model, each edge device improves this model based on its own data to obtain a new local model. This process continues until the global model converges. For aggregation, each device can send either the gradient of the local model or the whole local model back to the edge server. Compared with sending the whole model, sending gradients reduces the information loss under a signal-to-noise ratio (SNR) constraint and thus performs better in analog aggregation (refer to Section 2.2 for details), because the norm of the gradient is generally smaller than that of the model. Apart from analog aggregation, aggregating gradients and aggregating models are equivalent from the scheduling point of view; thus we assume that edge devices upload their updated local models rather than model gradients in the rest of this paper unless otherwise specified. Fig. 1 shows two different model aggregation schemes: analog aggregation and digital aggregation. In the analog aggregation scheme, edge devices send their local models to the server simultaneously and the aggregation is performed in the wireless channel according to the waveform-superposition property. In this way, the transmission latency is reduced, since it does not scale linearly with the number of devices. However, stringent synchronization between devices is required during model uploading, and the aggregation is vulnerable to attacks from third-party devices. In digital aggregation, the model can be encoded for compression, encryption, and other purposes; such encoding prevents the models from being aggregated directly in the wireless channel and is therefore not suitable for analog aggregation. Although digital aggregation is more convenient than analog aggregation, it introduces long transmission latency when the number of devices is large.
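To make the iteration above concrete, the following is a minimal sketch of one digital-aggregation FL training loop on a toy linear least-squares task, written in Python with NumPy. The function names (local_update, aggregate), the toy data, and the hyperparameters are illustrative assumptions, not part of any surveyed scheme.

```python
import numpy as np

rng = np.random.default_rng(0)

def local_update(w, X, y, lr=0.1, local_steps=5):
    """Device-side update: a few gradient steps of linear least squares on local data."""
    w = w.copy()
    for _ in range(local_steps):
        grad = X.T @ (X @ w - y) / len(y)
        w -= lr * grad
    return w

def aggregate(local_models, sizes):
    """Server-side aggregation: average of local models weighted by local dataset size."""
    weights = np.array(sizes) / sum(sizes)
    return sum(wk * mk for wk, mk in zip(weights, local_models))

# Toy federated setup: 4 devices, each with its own local dataset.
d = 3
devices = [(rng.normal(size=(20, d)), rng.normal(size=20)) for _ in range(4)]
w_global = np.zeros(d)

for rnd in range(10):                      # communication rounds
    local = [local_update(w_global, X, y) for X, y in devices]   # local training
    w_global = aggregate(local, [len(y) for _, y in devices])    # global aggregation
```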
By distributing model training to the edge devices, FL mitigates the problem of privacy leakage caused by sending raw training data from devices to the server. With this advantage of protecting data privacy, FL has been applied in data-sensitive scenarios, such as health artificial intelligence (AI) [5]. However, some studies show that the learning model itself can still leak private information [6]. To solve this problem, differential privacy-based methods [7-8], collaborative training-based methods [9-10], and encryption-based methods [11-12] have been proposed, which protect the privacy of the learning model parameters.

▲Figure 1. Architecture of the federated learning system.
Another advantage of FL is saving the communication cost of transmitting a large amount of training data. However, FL faces some new challenges. The training of the learning model is distributed to edge devices that may have non-independent and identically distributed (non-i.i.d.) training datasets [13], which results in poor performance (such as low accuracy) of the learning model. Also, due to the different computation capabilities of devices, the FL system should consider the synchronization of model updates from devices and address the straggler issue. In practical scenarios, the wireless resources of the FL system are usually limited, and thus the edge server may not be able to receive the local models from all edge devices. To solve this problem, one research direction is to reduce the cost of transmitting the local model for every edge device, including model compression by quantization [14] and updating the model to the edge server only when the model has improved significantly [15]. Another research direction is device scheduling, where the edge server schedules a subset of edge devices to send their model updates. Device scheduling can reduce the communication cost but may slow down the convergence of model training. Given the constrained wireless resources, scheduling policies for FL have been proposed to maximize the convergence rate [16] of the learning model or to minimize the energy consumption [17] of the whole system.
There are some existing surveys on FL and edge machine learning [18-21]. In Ref. [18], the authors provide a general overview of FL and its implementation challenges, but do not consider the specific issues of deploying FL in wireless networks. The architecture of deep learning and the process of training and inference in the context of edge computing are studied in Ref. [19]. However, the authors of Ref. [19] place more emphasis on optimizing the FL algorithm itself rather than on scheduling policies for FL. The authors of Ref. [20] focus on communication-efficient FL in mobile edge computing platforms, rather than on scheduling policies that maximize the convergence rate of FL under resource constraints. In Ref. [21], the authors discuss potential FL applications in mobile edge computing, as well as the resource allocation and data privacy problems in FL. Nevertheless, the authors of Ref. [21] do not provide an in-depth survey of scheduling policies according to the model aggregation technique of FL in wireless networks, which can greatly affect the design of the scheduling policies.
In summary, none of the existing works has studied FL in wireless networks from a scheduling perspective. Therefore, we provide a taxonomy of the aggregation methods used in FL and discuss scheduling policies that can optimize the training performance under resource constraints for both digital-aggregation-based and analog-aggregation-based FL. The rest of this paper is organized as follows. In Section 2, we first introduce FL systems with analog-transmission-based aggregation, and then discuss several scheduling policies designed for analog-aggregation-based FL. The scheduling policies designed for digital-aggregation-based FL are introduced in Section 3. Section 4 concludes this paper and points out future directions of federated learning.
In a conventional wireless system, a base station needs to decode (deliver) individual information from (to) each user. Accordingly, digital communications and orthogonal multiple access techniques have been developed and widely used. However, a key difference in the FL system is that, while aggregating the local models, the server is not interested in the individual parameters of edge devices, but in their average. Since the waveform-superposition property of a wireless multiple access channel naturally adds up all the transmitted signals, using analog transmission for global model aggregation is a more communication-efficient strategy [22-27]. Edge devices synchronize with each other and transmit their local models concurrently. The wireless channel then carries out the summation over the air, and the server obtains the desired value, i.e., the average of the local models, after dividing the received signal by the number of participating devices. Analog aggregation is also called over-the-air computation, and it can further support more flexible functions such as weighted summation via power allocation, so that the server can receive a weighted average of the local parameters. Some recent papers are summarized in Table 1.


▼Table 1. Summary of recent papers on analog aggregation
A key issue of analog aggregation is how to schedule devices based on their channel states and power constraints. In the $t$-th round, each device $n$ observes the channel state $h_{n,t}$ and then aligns its transmission power $p_{n,t}$ to ensure that the server can receive its desired value. The power alignment rule is given by
$$p_{n,t}=\begin{cases}\dfrac{a_t}{h_{n,t}}, & h_{n,t}\ge h_{\mathrm{th}},\\ 0, & h_{n,t}< h_{\mathrm{th}},\end{cases}$$
where $a_t$ is a power scalar that determines the received SNR at the server side, as well as the energy consumption at the device side. Parameter $h_{\mathrm{th}}$ is called the power-truncation threshold, i.e., a device can be scheduled only if its current channel state is better than the threshold. Parameters $a_t$ and $h_{\mathrm{th}}$ should be carefully selected in order to optimize the training performance of FL.
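As an illustration, the sketch below implements this truncated channel-inversion rule, assuming $h_{n,t}$ denotes the channel power gain and that Rayleigh fading yields exponentially distributed gains; the function name and the numerical values are hypothetical.

```python
import numpy as np

def power_alignment(h, a_t, h_th):
    """Truncated channel inversion for analog aggregation (sketch of the rule above):
    devices whose channel gain falls below the truncation threshold h_th are excluded;
    the rest invert their channels so that all scheduled signals arrive at the server
    with the same received power a_t."""
    h = np.asarray(h, dtype=float)
    scheduled = h >= h_th
    p = np.zeros_like(h)
    p[scheduled] = a_t / h[scheduled]      # transmit power p_{n,t} = a_t / h_{n,t}
    return p, scheduled

# Example: Rayleigh-fading power gains for 10 devices (assumed values).
rng = np.random.default_rng(1)
h = rng.exponential(scale=1.0, size=10)
p, scheduled = power_alignment(h, a_t=0.5, h_th=0.2)
```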
In Ref. [23], two fundamental tradeoffs, namely the SNR-truncation tradeoff and the reliability-quantity tradeoff, are rigorously characterized. Under the assumption of Rayleigh fading, the relation between the received SNR and the power-truncation threshold is studied. The SNR-truncation tradeoff is then revealed: increasing $h_{\mathrm{th}}$ improves the received SNR at the server, at the cost of truncating more devices that cannot satisfy the channel quality requirement. Moreover, the received SNR is limited by the furthest device with the largest path loss. Following this observation, a cell-interior scheduling policy is proposed, where only the devices within a distance threshold $r_{\mathrm{th}}$ can be scheduled in each communication round. Parameter $r_{\mathrm{th}}$ balances the tradeoff between communication reliability and data quantity: a larger $r_{\mathrm{th}}$ enables the server to schedule more devices and exploit more data for training, while it degrades the received SNR and leads to a noisier version of the average of local models. An alternating scheduling policy is also proposed, where the server alternates between the cell-interior scheduling policy and the all-included scheduling policy. Finally, theoretical analysis indicates that the communication latency of analog aggregation can be reduced by $O(N/\log_2 N)$ compared with its digital counterpart, where $N$ is the number of devices.
Removing the Rayleigh fading assumption, an online energy-aware dynamic device scheduling policy is proposed in Ref. [24]. Since the explicit mapping between the loss function of the FL task and the set of devices scheduled in each round remains unknown, an alternative objective that maximizes the average number of scheduled devices is considered. The long-term average energy constraint (which is equivalent to a power constraint) of each device is transformed into a virtual energy deficit queue based on Lyapunov optimization. In each communication round, each device acquires the current channel state $h_{n,t}$ and decides individually whether to update its local model, by considering the value of the virtual queue and the required energy consumption. The proposed device scheduling policy works in an online fashion, without requiring any information about future channel states. It also works well if the channel states are non-i.i.d. across time.
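A minimal sketch of the virtual-queue idea is given below. It only illustrates the Lyapunov-style queue update and a drift-plus-penalty type participation decision; the exact decision rule, the reward definition, and the parameter V are assumptions and differ from the precise formulation in Ref. [24].

```python
def virtual_queue_step(q, energy_used, energy_budget):
    """Virtual energy deficit queue update: the queue grows when a device spends
    more than its per-round average energy budget and drains otherwise."""
    return max(q + energy_used - energy_budget, 0.0)

def participate(q, energy_cost, reward=1.0, V=10.0):
    """Drift-plus-penalty style decision (illustrative, not the exact rule of
    Ref. [24]): join the round only if the weighted reward of being scheduled
    outweighs the energy cost scaled by the current queue backlog."""
    return V * reward >= q * energy_cost

# Per-round loop for one device (energy budget in J per round, assumed values).
q, energy_budget = 0.0, 5.0
for energy_cost in [3.0, 8.0, 6.0, 2.0]:        # e.g., derived from channel state h_{n,t}
    used = energy_cost if participate(q, energy_cost) else 0.0
    q = virtual_queue_step(q, used, energy_budget)
```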
A multi-antenna analog aggregation FL system is considered in Ref. [25], where the number of scheduled devices is maximized under a mean-square-error (MSE) constraint. Satisfying the MSE requirement limits the transmission error and thus guarantees the accuracy of the aggregated learning model parameters. In order to improve the efficiency of the device scheduling policy, a sparse and low-rank approach is introduced.
The neural networks to be trained in FL tasks usually have huge dimensions, with thousands to millions of parameters. However, the wireless bandwidth is in general limited, and thus the communication latency scales up with the dimension of the local models. To further reduce the communication cost of model aggregation, gradient sparsification techniques are introduced in Refs. [26] and [27]. Note that transmitting local gradients rather than local models can improve the power efficiency of analog aggregation, because all the power is used to transmit information unknown to the server. Therefore, in these schemes all the devices upload their gradients rather than the up-to-date models.
To reduce the dimension of the local gradients, a random linear projection is first employed, inspired by compressive sensing. In particular, each local gradient is multiplied by a random matrix whose entries follow a Gaussian distribution, and the random matrix is shared by the devices and the server. Then each device only retains the $k$ entries with the largest absolute values, which can be regarded as the most important parameters of the gradient, while setting all the other entries to zero. Here, $k$ is a design parameter that balances the tradeoff between communication reliability and distortion: with a smaller $k$, each entry can be transmitted at a higher power, so that the SNR at the server is higher; however, more information of the local gradient is lost due to the sparsification, degrading the accuracy of the neural network as well as the convergence rate of training.
Instead of discarding all the information lost due to sparsification, a more efficient way is to perform error accumulation at the device side. In particular, in each round, the device calculates the difference between the sparse gradient and the original gradient, and adds this difference to the gradient obtained in the next round before applying sparsification. In this way, the error due to sparsification is accumulated at the devices, and the training accuracy can be improved according to the experimental results.
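The following sketch illustrates gradient sparsification with error accumulation. For simplicity, the top-k selection here is applied before the shared random projection (the ordering differs from the description above), and all names, dimensions, and the Gaussian matrix scaling are illustrative assumptions rather than the exact scheme of Refs. [26] and [27].

```python
import numpy as np

rng = np.random.default_rng(2)

def compress(grad, A, k, error):
    """Illustrative sketch of sparsification with error accumulation:
    add the error carried over from earlier rounds, keep only the k
    largest-magnitude entries, remember what was dropped, and compress the
    sparse vector with a random projection shared by devices and server."""
    g = grad + error                          # error accumulation
    sparse = np.zeros_like(g)
    idx = np.argsort(np.abs(g))[-k:]          # top-k entries by absolute value
    sparse[idx] = g[idx]
    new_error = g - sparse                    # discarded part, kept for next round
    return A @ sparse, new_error              # low-dimensional vector to transmit

d, m, k = 1000, 200, 20                       # gradient size, projected size, kept entries
A = rng.normal(size=(m, d)) / np.sqrt(m)      # random matrix shared by devices and server
error = np.zeros(d)
grad = rng.normal(size=d)
compressed, error = compress(grad, A, k, error)
```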
Device scheduling policies are also designed for analog aggregation with gradient sparsification and error accumulation. In Ref. [26], an additive white Gaussian noise (AWGN) channel is considered, and both equal and unequal power allocation policies are designed. The unequal policy allocates more power to the initial rounds, motivated by the fact that the variance of the gradients diminishes over time. Ref. [27] further considers Rayleigh fading channels. Extensive experiments show that, compared with digital aggregation, analog aggregation can improve the convergence rate of training, particularly in low-bandwidth and stringent-power regimes.
Non-i.i.d. data, i.e., the different distributions of data samples at devices, is also a major bottleneck for FL. It is shown in Ref. [28] that highly non-i.i.d. data reduces the accuracy of the neural network by 11% on the Modified National Institute of Standards and Technology (MNIST) dataset, and by over 50% on the CIFAR-10 dataset. The non-i.i.d. level of data refers to the difference between the local data distribution and the global data distribution, which can be characterized by the earth mover's distance, a measure of the distance between two distributions. To reduce the non-i.i.d. level and thus improve the training accuracy, the server collects some sharable data samples from the devices and disseminates them to the whole FL system.
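As a simple illustration of how the non-i.i.d. level can be quantified, the sketch below computes the earth mover's distance between a device's label distribution and the global label distribution with SciPy, treating class indices as points on a line; this simplification and all names are assumptions, not the exact metric used in Ref. [28].

```python
import numpy as np
from scipy.stats import wasserstein_distance

def noniid_level(local_labels, global_labels, num_classes):
    """Earth mover's distance between a device's label distribution and the
    global label distribution, used here as a simple proxy for the non-i.i.d.
    level (treating class indices as points on a line is a simplification)."""
    classes = np.arange(num_classes)
    p = np.bincount(local_labels, minlength=num_classes) / len(local_labels)
    q = np.bincount(global_labels, minlength=num_classes) / len(global_labels)
    return wasserstein_distance(classes, classes, u_weights=p, v_weights=q)

# Example: a device holding mostly two digit classes vs. a balanced global set.
rng = np.random.default_rng(3)
global_labels = rng.integers(0, 10, size=10000)
local_labels = rng.choice([0, 1], size=500, p=[0.7, 0.3])
print(noniid_level(local_labels, global_labels, num_classes=10))
```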
Non-i.i.d. data is still a key issue in analog aggregation FL systems. In Ref. [24], data redundancy is introduced to reduce the non-i.i.d. level of data samples, which can be obtained by exchanging data between a group of devices or by collecting data with overlapped coverage in IoT networks. Fig. 2 illustrates analog aggregation for FL systems with data redundancy. Workers 1 and 2, as well as Workers 3 and 4, exchange their local datasets with each other, so the redundancy level of the system, i.e., the number of devices storing each data sample, is two. The experimental results with non-i.i.d. MNIST data are shown in Fig. 3, where $\bar{E}$ is the average energy constraint (in J), "dyn" is the proposed online energy-aware dynamic device scheduling policy, and "myopic" is a benchmark policy where devices can use as much energy as $\bar{E}$ in each round. Parameter $r$ denotes the redundancy level, and $\bar{E}=\infty$ refers to the case where devices have infinite energy, so that all of them can be scheduled in each round. We can see that the proposed dynamic device scheduling policy outperforms the myopic benchmark, and that data redundancy can improve the training accuracy significantly. In particular, when $\bar{E}=5$, increasing the redundancy from $r=1$ to $r=2$ achieves an improvement of 10% in training accuracy.
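The pairwise data exchange of Fig. 2 can be sketched as follows; the toy single-class datasets and the function name are illustrative only.

```python
def pairwise_exchange(local_datasets):
    """Pairwise data exchange as in Fig. 2: devices (0, 1) and (2, 3) share their
    local datasets, so every sample is stored on two devices (redundancy r = 2)."""
    merged = {}
    for a, b in [(0, 1), (2, 3)]:
        shared = local_datasets[a] + local_datasets[b]
        merged[a], merged[b] = list(shared), list(shared)
    return merged

# Toy example: each device initially holds samples of a single class (non-i.i.d.).
local_datasets = {0: ['a'] * 3, 1: ['b'] * 3, 2: ['c'] * 3, 3: ['d'] * 3}
redundant = pairwise_exchange(local_datasets)
```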
In many other studies, FL systems are deployed in existing wireless networks (e.g., cellular or Wi-Fi networks), where orthogonal-access schemes such as orthogonal frequency division multiple access (OFDMA) are used for model aggregation. To distinguish them from analog aggregation approaches, we categorize these approaches as digital aggregation. In digital aggregation, the participating devices need to share the scarce wireless bandwidth to upload the updated local models, making the global aggregation very time-consuming. Furthermore, the limited energy and computing resources of participating devices make it more challenging to deploy FL in real wireless networks. Therefore, various scheduling policies have been proposed to address these challenges. These scheduling policies can be divided into the following three categories: aggregation frequency adaptation, local accuracy tuning, and device scheduling. Table 2 summarizes the highlights of recent papers on digital aggregation.

▲Figure 2. Analog aggregation for federated learning with data redundancy.

▲Figure 3. Training accuracy of the dynamic device scheduling policy in Ref. [24] under independent and identically distributed (i.i.d.) and non-i.i.d. data.

▼Table 2. Summary of recent papers on digital aggregation
In FL, the local updates consume the computing resources of devices, and the global aggregations consume the bandwidth resources. Since FL iterates between local updates and global aggregations, the frequency of global aggregations (i.e., the reciprocal of the number of local updates between two adjacent global aggregations) should be carefully tuned to balance the consumption of computing and bandwidth resources. In Ref. [29], the authors first analyze the convergence bound of FL with respect to (w.r.t.) the number of local updates between two adjacent global aggregations. The bound shows that a higher global aggregation frequency can speed up FL convergence, at the cost of consuming more wireless resources for global aggregation. A scheduling policy that adapts the frequency of global aggregations in real time to maximize the convergence rate of FL is then derived based on this convergence bound. The proposed scheduling policy is applicable to non-i.i.d. data distributions and heterogeneous resource constraints of participating devices. Their simulation results show that adaptively adjusting the global aggregation frequency can greatly improve the convergence rate of FL, compared with fixed-frequency counterparts. Furthermore, the authors of Ref. [30] extend the scheduling policy proposed in Ref. [29] to a client-edge-cloud hierarchical system. In the client-edge-cloud hierarchical FL system, each edge server is allowed to perform partial aggregation, which aggregates the updated local models of the edge devices within its communication range; for the cloud-based global aggregation, the partially aggregated models at the edge servers are aggregated by the centralized cloud server through the backbone network. The aggregation frequencies of the two levels of model aggregation (i.e., edge-based partial aggregation and cloud-based global aggregation) are optimized to minimize the global loss function value under a constrained number of total local updates.
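The two-level aggregation can be sketched as follows on a toy scalar estimation task, where tau1 counts local updates per edge aggregation and tau2 counts edge aggregations per cloud aggregation; all names, data, and hyperparameters are illustrative assumptions rather than the exact setup of Ref. [30].

```python
import numpy as np

rng = np.random.default_rng(5)

def local_step(w, data, tau1, lr=0.1):
    """tau1 local gradient steps of scalar mean estimation on the device's data."""
    for _ in range(tau1):
        w -= lr * (w - data.mean())
    return w

# Two edge servers, each serving two devices with their own local data.
devices_per_edge = [[rng.normal(loc=m, size=50) for m in means]
                    for means in [(0.0, 1.0), (4.0, 5.0)]]

w_global, tau1, tau2 = 0.0, 5, 3
for cloud_round in range(10):
    edge_models = []
    for devices in devices_per_edge:
        w_edge = w_global
        for _ in range(tau2):                                   # edge-based partial aggregation
            w_edge = np.mean([local_step(w_edge, d, tau1) for d in devices])
        edge_models.append(w_edge)
    w_global = float(np.mean(edge_models))                      # cloud-based global aggregation
```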
In aggregation frequency adaptation, the tradeoff between computation and communication is balanced by optimizing the aggregation frequency. Alternatively, some researchers balance this tradeoff by tuning the accuracy level of the local models. In general, a higher local model accuracy requires more computation, while more accurate local models need fewer communication rounds to achieve a fixed global accuracy.
In Ref. [31], the authors use an upper bound on the number of communication rounds w.r.t. the global and local accuracy, which holds for strongly convex loss functions, to design the scheduling policy. They adopt time division multiple access (TDMA) in the media access control (MAC) layer and dynamic voltage and frequency scaling (DVFS) for the devices' CPUs. The CPU frequencies of the devices, the communication latency of local model uploading, and the local accuracy are jointly optimized to minimize the weighted sum of training latency and device energy consumption. As a result, both the computation-communication tradeoff and the tradeoff between device energy consumption and FL training latency can be characterized. The overall non-convex optimization problem is decoupled into convex sub-problems, and the closed-form optimal solutions to the sub-problems are illustrated by extensive numerical results. In Ref. [32], the authors consider a similar FL system but with frequency division multiple access (FDMA). Therefore, the bandwidth allocated to each device should be jointly optimized with the communication latency, the CPU frequency, and the local accuracy. Due to the complicated nature of the problem, the authors of Ref. [32] propose an iterative algorithm. Their simulation results show that up to 25.6% latency reduction and 37.6% energy reduction can be achieved compared with conventional FL.
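The computation side of such formulations is often captured by a standard DVFS model, sketched below; the constants (effective switched capacitance kappa, cycles per sample, weights) are illustrative assumptions and not values used in Refs. [31] or [32].

```python
def local_computation_cost(f, cycles_per_sample, num_samples, kappa=1e-28):
    """Standard DVFS computation model frequently used in FL scheduling papers:
    latency = c * D / f seconds and energy = kappa * c * D * f^2 joules, where
    f is the CPU frequency (Hz), c the cycles per sample, D the number of local
    samples, and kappa the effective switched capacitance (assumed value)."""
    total_cycles = cycles_per_sample * num_samples
    latency = total_cycles / f
    energy = kappa * total_cycles * f ** 2
    return latency, energy

def weighted_cost(f, cycles_per_sample, num_samples, weight):
    """Weighted sum of training latency and device energy, the kind of objective
    minimized in Refs. [31] and [32]; a higher f trades energy for latency."""
    latency, energy = local_computation_cost(f, cycles_per_sample, num_samples)
    return weight * latency + (1 - weight) * energy

# Example: sweep CPU frequencies to expose the latency-energy tradeoff.
for f in [0.5e9, 1.0e9, 2.0e9]:
    print(f, weighted_cost(f, cycles_per_sample=2e4, num_samples=500, weight=0.5))
```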
Due to limited wireless resources and stringent training delay budgets, only a portion of the devices are allowed to upload local models in each round in real FL systems [33]. Thus the device scheduling policy is critical to FL and affects the convergence performance in the following two aspects. On the one hand, the server needs to wait until all scheduled devices have finished updating and uploading their local models in each round. Therefore, scheduling more devices per round can significantly slow down the model aggregation, because of the reduced bandwidth allocated to each device and the higher probability of having a straggler device (i.e., a device with limited computation capability or a bad wireless channel). On the other hand, scheduling more devices per round increases the convergence rate (w.r.t. the number of rounds) [34] and can potentially reduce the number of rounds required to attain the same accuracy. To this end, the scheduling policy should carefully balance the per-round latency and learning efficiency.
Recently, device scheduling problems in FL have received much research attention. The authors of Ref. [17] consider a joint scheduling and radio resource allocation problem for FL. In Ref. [17], OFDMA is used for model uploading, where the bandwidth allocation can be optimized to reduce the energy consumption. To further characterize the convergence performance, they assume that the convergence rate increases linearly with the number of scheduled devices. Therefore, the optimization objective is set as the weighted sum of the energy consumption and the number of scheduled devices with a predetermined tradeoff factor, so as to balance the energy consumption and the convergence rate. After relaxing the integer constraint on device scheduling to a real-valued one, the optimization problem is solved by iteratively solving the bandwidth allocation and scheduling sub-problems.
Furthermore, some recent studies consider unreliable wireless transmissions. In Ref. [35], the authors propose to deploy FL in cellular networks where inter-cell interference can affect the transmissions for model aggregation. In terms of transmission quality, the received local models can be successfully decoded only if the received signal-to-interference-plus-noise ratio (SINR) exceeds a threshold. The convergence rate of FL under such settings, accounting for the effects of both scheduling and interference, is then derived. Furthermore, three basic scheduling policies, namely random scheduling, round-robin, and proportional fair, are compared in terms of the FL convergence rate. Their results show that the proportional fair policy performs better under a high SINR threshold, while round-robin is suitable for a low SINR threshold. In contrast, the authors of Ref. [36] consider OFDMA for model aggregation and use the packet error rate to capture the unreliability of wireless transmissions. In Ref. [36], a convergence rate bound w.r.t. packet errors is first derived, given the transmit power of devices, the OFDMA resource block allocation, and the device scheduling policy. Then, the authors formulate an optimization problem to maximize the convergence rate by jointly optimizing the transmit power allocation, resource block allocation, and scheduling policy. The optimization problem is solved in two steps: first obtaining the optimal transmit power of each device given the device scheduling and resource block allocation, and then using the Hungarian algorithm to find the optimal device scheduling and resource block allocation. As shown by simulations, the proposed method can reduce the loss function value by up to 10% and 16%, compared with 1) optimal device scheduling with random resource allocation and 2) random device scheduling and random resource allocation, respectively.
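The assignment step of the second stage can be illustrated with SciPy's Hungarian-algorithm implementation; the cost matrix below is a random placeholder, whereas in Ref. [36] the cost is derived from the convergence bound and the packet error rates.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

rng = np.random.default_rng(6)

# Hypothetical cost matrix: cost[n, b] could represent the packet error rate (or
# its contribution to the convergence bound) when device n transmits on resource
# block b at its optimal transmit power. Values here are random placeholders.
num_devices, num_blocks = 6, 4
cost = rng.uniform(0.0, 0.3, size=(num_devices, num_blocks))

# Hungarian algorithm: one-to-one assignment of devices to resource blocks that
# minimizes the total cost; devices left unassigned are not scheduled this round.
devices, blocks = linear_sum_assignment(cost)
for n, b in zip(devices, blocks):
    print(f"device {n} -> resource block {b} (cost {cost[n, b]:.3f})")
```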
However, the convergence rate w.r.t. time, which is critical for real-world FL applications, has not been addressed by the aforementioned works. To accelerate FL training, the authors of Ref. [37] propose to maximize the number of scheduled devices within a given time budget for each round, while stragglers are discarded to avoid slowing down the model aggregation. The proposed greedy scheduling policy iteratively schedules the device that consumes the least time in model updating and uploading, until the time budget is reached. Although the proposed scheduling policy is simple, their experiments show that it is efficient and applicable to both non-i.i.d. data distributions and heterogeneous devices.
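A minimal sketch of this greedy selection is given below; the additive timing model and the per-device times are simplifying assumptions rather than the exact model of Ref. [37].

```python
def greedy_schedule(update_time, upload_time, time_budget):
    """Greedy client selection in the spirit of Ref. [37] (illustrative):
    repeatedly pick the device that adds the least time for model updating and
    uploading, until the per-round time budget is exhausted. The exact timing
    model of Ref. [37] is simplified here to a single additive cost per device."""
    costs = {n: update_time[n] + upload_time[n] for n in update_time}
    scheduled, elapsed = [], 0.0
    for n in sorted(costs, key=costs.get):          # fastest devices first
        if elapsed + costs[n] > time_budget:
            break
        scheduled.append(n)
        elapsed += costs[n]
    return scheduled

# Example with assumed per-device update and upload times (in seconds).
update_time = {0: 1.2, 1: 0.4, 2: 3.0, 3: 0.9}
upload_time = {0: 0.5, 1: 0.6, 2: 0.8, 3: 2.5}
print(greedy_schedule(update_time, upload_time, time_budget=4.0))
```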
Nevertheless, the time budget is chosen through experiments and can hardly be adjusted in highly dynamic FL systems. To overcome this drawback, Ref. [16] proposes a joint scheduling and resource allocation policy with fast convergence for FL. Specifically, a latency-optimal bandwidth allocation policy for local model updating and uploading is first derived. Then, given the set of scheduled devices and the latency-optimal bandwidth allocation, an upper bound on the time required to attain a fixed global accuracy is derived, based on a known upper bound on the number of required rounds. Finally, an iterative scheduling policy is proposed that repeatedly schedules the device minimizing the approximate time upper bound, until the approximate upper bound begins to increase (i.e., scheduling more devices would make the convergence time longer). Fig. 4 shows the highest achievable accuracy within a total training time budget of 300 s under different scheduling policies, including the fast-convergence scheduling policy [16], the random scheduling policy with an empirically optimal number of scheduled devices (random-opt), the client selection policy [37], and the proportional fair policy [35]. The experiments are conducted on the non-i.i.d. MNIST dataset, and all devices are assumed to be randomly located in a cell. For different cell radii, the simulation results show that the fast-convergence scheduling policy always outperforms the other scheduling policies in terms of the convergence rate w.r.t. time, and is applicable to non-i.i.d. data.

▲Figure 4. Highest achievable accuracy under different scheduling policies vs. the radius of the device distribution area.
This paper presents a brief introduction to FL in wireless networks and, in particular, an overview of the scheduling policies for wireless FL. First, the motivation for deploying FL in wireless networks and the fundamentals of FL systems are introduced. Then, a series of works on FL systems with analog aggregation is discussed, including device scheduling, model sparsification, and data redundancy. Afterwards, we provide an overview of another series of works on FL systems with digital aggregation, including aggregation frequency adaptation, local accuracy tuning, and device scheduling. Apart from the aforementioned works, there are still some challenges and future research directions in deploying FL in wireless networks:
1) Delayed CSI: In the existing works on analog aggregation, power alignment is based on the perfect channel state information (CSI) of devices. In practice, however, the server only has delayed CSI, and how to align the transmission power of devices to minimize the distortion of the aggregated model under delayed CSI remains an open problem. To address this challenge, using recurrent neural networks to predict the instantaneous CSI from historical CSI estimations may be a future direction.
2) Non-i.i.d. data distribution: Since the data distributions of different devices are usually non-i.i.d. in practical wireless FL applications, it is crucial to design non-i.i.d.-aware scheduling policies. Although the non-i.i.d. issue in FL systems with digital aggregation has been considered in Refs. [16], [29-30], and [37], none of them proposes a method to alleviate the accuracy degradation caused by non-i.i.d. data. In the future, the data redundancy introduced in Ref. [24] and communication-efficient data exchange technologies between devices can be considered in FL systems with digital aggregation to address the non-i.i.d. issue.
3) Convergence guarantee: FL is essentially a distributed optimization algorithm that cannot always be guaranteed to converge. Although most FL algorithms converge empirically and several existing works have provided convergence analysis for FL with convex or strongly convex loss functions, theoretical analysis and evaluation of the convergence of FL with general non-convex loss functions remain open problems.