Gao-hua YAO, Jian-hai YU, Zhen-kun LU
(1 College of Electronic and Information Engineering, Wuzhou University, Wuzhou 543002, China)
(2 Guangxi Colleges and Universities Key Laboratory of Image Processing and Intelligent Information System, Wuzhou University, Wuzhou 543002, China)
(3 College of Information Science and Engineering, Guangxi University for Nationalities, Nanning 530000, China)
Abstract: Current convolutional neural networks suffer from disadvantages such as a large number of parameters, slow detection speed, low detection accuracy in complex environments, and an inability to be embedded in mobile electronic devices. To address these problems, this paper designs a face detection method based on two separate, cascaded lightweight convolutional neural networks. The first network is a fully convolutional neural network that quickly extracts facial features and generates a large number of candidate face bounding boxes. The second network is a deeper convolutional neural network with fully connected layers that screens the candidate face regions inferred by the first network and outputs the face size, coordinates, and confidence. Experiments show that the face detection method designed in this paper achieves higher detection accuracy and speed on the Face Detection Data Set and Benchmark (FDDB), and the lightweight network design makes it possible to port the algorithm to front-end electronic devices.
Key words: Parameter quantity, Electronic equipment, Complex environment, Full convolution network, Face boundary candidate box, Lightweight
Face detection, as a part of face recognition, has always been a hot topic in academia. Before the rise of deep learning, face detection mainly relied on manually designed image features such as Haar, HOG, SIFT, and LBP, combined with algorithms such as AdaBoost, ACF[1], SVM, and DPM[2] to detect faces. The common disadvantage of these algorithms is that the extracted features are relatively simple, so they cannot detect faces under complex factors such as pose, expression, blur, and occlusion.
With the development of integrated circuits[3], the performance of graphics cards has improved greatly, and many deep learning algorithms have sprung up, such as the classic deep learning face detection algorithm Cascade CNN[4], proposed by H. Li and derived from VJ[5]. It uses a three-layer cascaded convolutional neural network and can, to a certain extent, achieve fast and accurate detection of faces in complex scenes. However, because it adds three calibration networks for the face frame on top of its three detection networks, it consumes more computation.
In 2014, Girshick R proposed the algorithm RCNN[6] and the improved series of algorithms Fast-RCNN[7] and Faster-RCNN[8]. These algorithms are proposed for general targets and can be applied to face detection. Their advantage is better detection performance; their disadvantage is that they are slow and cannot meet the real-time speed requirements of face detection. In 2016, J Redmon proposed the end-to-end network YOLO[9], which can reduce many unnecessary calculations. Its detection speed is fast enough, but its detection accuracy is low. The algorithm SSD[10] appeared in the same period; it is as fast as YOLO, but its detection accuracy on dense small faces is poor. In 2018, Xu Tang et al. proposed the algorithm PyramidBox[11], which detects small faces in uncontrolled environments. Its backbone network is VGG16, augmented with low-level feature pyramid network layers, a context-sensitive prediction module, and a context enhancement module. Its detection accuracy is extremely high, but it is still in the exploratory stage and is rarely used in actual engineering at present.
Most deep learning face detection algorithms rely on high-performance servers and are difficult to apply to embedded electronic devices. To address this, based on the idea of cascading, this paper designs a face detection method using two cascaded convolutional neural networks separated front and back. First, a lightweight convolutional network is used as the first layer to roughly extract face features and generate a large number of face candidate frames. Then, referring to the design of Liu[12], the second network is designed as a deep convolutional neural network that removes the many redundant features in the candidate face image fragments generated by the first network and filters the face frames to regress the true position of the face.
The face detection framework in this paper is designed in two levels, and its operation diagram is shown in Fig.1. The original image is an RGB image downloaded randomly from the Internet. Before the detected image is input to the network, the original image must be scaled to generate a pyramid of images, so that the first-level network can cover face image fragments of different sizes. Facial features are then extracted from the image fragments of different scales input to the network. The first-level network is used to generate face candidate frames, and the second level is used to judge and filter out non-face candidate frames and output the final face border.
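The pyramid preprocessing step can be sketched as follows. The 0.709 scale factor between levels is an assumption (a value commonly used with cascaded detectors, not stated in the paper), and the nearest-neighbour resize is only there to keep the sketch dependency-light; a real pipeline would use bilinear interpolation:

```python
import numpy as np

def build_pyramid(image, min_size=18, scale_factor=0.709):
    """Generate successively downscaled copies of `image` so that a
    fixed 18x18 detector window covers faces of different sizes.
    Returns a list of (scale, scaled_image) pairs."""
    pyramid = []
    h, w = image.shape[:2]
    scale = 1.0
    while min(h, w) * scale >= min_size:
        new_h, new_w = int(h * scale), int(w * scale)
        # nearest-neighbour resize via index sampling
        rows = np.arange(new_h) * h // new_h
        cols = np.arange(new_w) * w // new_w
        pyramid.append((scale, image[rows][:, cols]))
        scale *= scale_factor
    return pyramid
```

Each (scale, image) pair is fed to the first-level network; boxes found at a given scale are divided by that scale to map them back to the original image.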

Fig.1 Face detection flow chart
F-Net is designed as a lightweight fully convolutional network. The input of this network is an RGB image fragment of 18*18*3 pixels, obtained by sliding an 18*18-pixel window over the detected image. After four convolutional layers and two pooling layers, the outputs are tensors of size 2*1*1 and 4*1*1, as shown in Fig.2. These two outputs respectively represent the face classification score and the scores of the four coordinates of the face frame. Comparing the face classification score with a preset threshold, the feature points larger than the threshold are mapped back by deconvolution to their true positions on the original image, forming several candidate frames R={r1, r2, r3, …, rn}. In order to obtain a higher recall rate, we set the threshold of F-Net relatively low. The candidate box ri contains information such as the coordinates of the candidate box, the face prediction score, and the feature value offset, so ri can be expressed as

ri = (xi, yi, wi, hi, scorei, offseti)
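The mapping from above-threshold feature points back to candidate frames on the original image can be sketched as below. The total stride of 4 (implied by the two pooling layers), the threshold value, and the convention that regression offsets are fractions of the window size are all assumptions, not details stated in the paper:

```python
import numpy as np

def generate_candidates(cls_map, box_map, scale, threshold=0.6,
                        stride=4, window=18):
    """Map feature-map points whose face score exceeds `threshold`
    back to [x1, y1, x2, y2, score] boxes on the original image.
    `cls_map` is the HxW face-score map, `box_map` the HxWx4 offsets,
    `scale` the pyramid level the map came from."""
    ys, xs = np.where(cls_map > threshold)
    boxes = []
    for y, x in zip(ys, xs):
        # undo the network stride and the pyramid scaling
        x1 = (x * stride) / scale
        y1 = (y * stride) / scale
        x2 = (x * stride + window) / scale
        y2 = (y * stride + window) / scale
        # refine the window with the regressed offsets
        dx1, dy1, dx2, dy2 = box_map[y, x]
        w, h = x2 - x1, y2 - y1
        boxes.append([x1 + dx1 * w, y1 + dy1 * h,
                      x2 + dx2 * w, y2 + dy2 * h,
                      cls_map[y, x]])
    return np.array(boxes)
```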
Because the first-level network outputs a large number of overlapping candidate frames, the overlapping frames need to be filtered. We use the NMS (Non-Maximum Suppression) algorithm to filter the overlapping candidate frames. The filtered candidate frame image fragments are scaled to 34*34 pixels by a linear interpolation algorithm and input to the next-level network for screening.
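A minimal NMS sketch over [x1, y1, x2, y2, score] candidate rows; the IoU threshold of 0.5 is an assumption, as the paper does not state the value used:

```python
import numpy as np

def nms(boxes, iou_threshold=0.5):
    """Keep the highest-scoring box, drop boxes overlapping it by more
    than `iou_threshold`, and repeat on the remainder. Returns the
    indices of the kept boxes."""
    boxes = np.asarray(boxes, dtype=float)
    order = boxes[:, 4].argsort()[::-1]  # indices by descending score
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        rest = order[1:]
        # intersection of box i with all remaining boxes
        xx1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        yy1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        xx2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        yy2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.maximum(0, xx2 - xx1) * np.maximum(0, yy2 - yy1)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_r = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_i + area_r - inter)
        order = rest[iou <= iou_threshold]
    return keep
```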
The input of the B-Net network is the output of the F-Net network. F-Net processes the original image and a series of scaled copies of it through a sliding window to generate a number of feature maps. By comparing with the preset threshold, the face features are mapped back to their original image positions by deconvolution, the image fragments are cropped, and their size is changed to 34*34 pixels by linear interpolation. B-Net is a deeper network than F-Net and contains fully connected layers. The task of this network is to make more accurate judgments on the output of the F-Net network, filter it, and output the final face coordinates. The network structure consists of four convolutional layers, three pooling layers, and two fully connected layers[13], as shown in Fig.3.

Fig.2 F-Net network structure diagram

Fig.3 B-Net network structure diagram
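The spatial arithmetic of such a stack can be checked with a small size calculator for 'valid' convolution and pooling. The kernel sizes below are assumptions chosen so that four convolutional layers and three pooling layers reduce the 34*34 input to 1*1 before the two fully connected layers; the paper does not state the actual kernel sizes:

```python
def conv_out(size, kernel, stride=1):
    """Spatial output size of a 'valid' convolution or pooling layer."""
    return (size - kernel) // stride + 1

def bnet_shape_trace(size=34):
    """Trace the spatial size through a plausible B-Net stack of four
    conv layers and three 2x2 pooling layers."""
    size = conv_out(size, 3)     # conv1: 34 -> 32
    size = conv_out(size, 2, 2)  # pool1: 32 -> 16
    size = conv_out(size, 3)     # conv2: 16 -> 14
    size = conv_out(size, 2, 2)  # pool2: 14 -> 7
    size = conv_out(size, 3)     # conv3: 7 -> 5
    size = conv_out(size, 2, 2)  # pool3: 5 -> 2
    size = conv_out(size, 2)     # conv4: 2 -> 1
    return size
```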
The experimental data come from WIDER FACE[14] and FDDB[15], two authoritative datasets worldwide. WIDER FACE is used as the data set for network training, and FDDB is used as the test set to evaluate the algorithm. We design F-Net and B-Net with inputs of 18*18 pixels and 34*34 pixels, and the two networks are trained separately. Before training F-Net, we randomly crop WIDER FACE into image fragments of 18*18 pixels, calculate the IOU values between the cropped fragments and the corresponding labeled face areas, and divide the training data into positive samples, negative samples, and partial face samples. Image fragments with IOU values greater than 0.65 are taken as positive samples, fragments with 0.35<IOU<0.65 as partial face samples, and fragments with IOU<0.35 as negative samples. Positive and negative samples are used for the classification task of the network, while positive samples and partial face samples are used for the regression task. The ratio of the three sample types is set to 3:1:1. The training samples of B-Net are generated by running the trained F-Net model over the training data; similarly, these samples are divided into positive, partial face, and negative samples to train the B-Net network.
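The IOU-based sample partitioning can be sketched directly from the thresholds given above:

```python
def iou(box_a, box_b):
    """Intersection-over-union of two [x1, y1, x2, y2] boxes."""
    x1 = max(box_a[0], box_b[0]); y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2]); y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def label_sample(crop_box, face_box):
    """Assign a training label from the IOU thresholds in the text:
    > 0.65 positive, 0.35-0.65 partial face, < 0.35 negative."""
    v = iou(crop_box, face_box)
    if v > 0.65:
        return 'positive'
    if v > 0.35:
        return 'part'
    return 'negative'
```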
F-Net and B-Net designed in this paper are trained by jointly optimizing the face classification task and the face border regression task. Weights are assigned to the two tasks according to their importance, and the design of the loss function also varies with the task of the network. The task of face classification is to determine whether each point on the feature map corresponds to a face or a non-face, so its loss is the cross-entropy loss function, as in formula (1):
$L_i^{cls} = -\left[ y_i^{cls} \log p_i + (1 - y_i^{cls}) \log(1 - p_i) \right]$  (1)
The value of $y_i^{cls}$ is either 0 or 1, representing the true label of the sample: 0 represents a face and 1 a non-face. $p_i$ represents the network's predicted probability for the sample. This loss function measures the difference between the true sample label and the predicted probability. The task of face border regression is to calculate the distance between the four coordinates of the face candidate window and the true face coordinates, and to adjust the candidate window according to this difference. Therefore, the Euclidean distance loss function is used, as in formula (2):
$L_i^{box} = \left\| \hat{y}_i^{box} - y_i^{box} \right\|_2^2$  (2)
In order to keep the gradients of the face classification task and the frame regression task at the same order of magnitude during training, and to reflect the relative importance of the two tasks, a weight $\lambda$ is introduced and set to 0.5. The total loss function $L_{total}$ is given in formula (3):
$L_{total} = \sum_i \left( L_i^{cls} + \lambda L_i^{box} \right)$  (3)
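For a single sample, formulas (1) to (3) combine into the following sketch; the small epsilon is a numerical guard added for the sketch, not part of the original formulas:

```python
import numpy as np

def total_loss(y_cls, p_cls, box_true, box_pred, lam=0.5):
    """Cross-entropy on the face/non-face label plus lambda-weighted
    squared Euclidean distance between predicted and true box
    coordinates, for one training sample."""
    eps = 1e-12  # numerical guard against log(0)
    l_cls = -(y_cls * np.log(p_cls + eps) +
              (1 - y_cls) * np.log(1 - p_cls + eps))
    l_box = np.sum((np.asarray(box_pred) - np.asarray(box_true)) ** 2)
    return l_cls + lam * l_box
```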
The experiments in this paper are performed under the deep learning framework TensorFlow 1.2. The experimental environment is shown in Table 1. Black and gray curves are used to represent the F-Net and B-Net networks, recording the changes of cls_accuracy (recall rate), cls_loss (face classification loss) and bbox_loss (face border regression loss) with the number of iterations. TensorBoard is used to visualize the changes of these three variables, as shown in Figs. 4 to 6. The recall rates of F-Net and B-Net reached 0.96 and 0.99 respectively, the face classification loss dropped to 0.15 and 0.05, and the frame regression loss dropped to 0.05 and 0.04.

Table 1 Lab environment

Fig.4 Recall rate

Fig.5 Face classification loss

Fig.6 Border regression loss
The ROC curve, also known as the receiver operating characteristic curve, is the line connecting the points obtained, under given stimulus conditions, by plotting the false-alarm probability P(y/N) measured under different judgment criteria on the abscissa and the hit probability P(y/SN) on the ordinate.
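A discrete-style curve of this kind can be computed by sorting detections by confidence and accumulating counts. The matching of each detection to a ground-truth face is assumed to have been done already; the function only sweeps the score threshold:

```python
import numpy as np

def discrete_roc_points(scores, is_true_face, n_faces):
    """Sort detections by confidence, then for each prefix count
    false positives (x-axis) and the fraction of ground-truth faces
    recovered (y-axis)."""
    order = np.argsort(scores)[::-1]  # descending confidence
    matched = np.asarray(is_true_face)[order]
    tp = np.cumsum(matched)
    fp = np.cumsum(1 - matched)
    return fp, tp / n_faces
```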
We use the authoritative benchmark dataset FDDB to judge the performance of the algorithm. The FDDB dataset is derived from news pictures and the database created from their news stories. The pictures include a variety of complex poses, lighting, backgrounds, expressions, actions, and occlusions, close to real recognition environments. There are two types of FDDB evaluation: continuous score evaluation and discrete score evaluation. This paper uses the discrete evaluation method. Compared with current face detection algorithms such as Viola-Jones, Fast Bounding Box[16], SURF Cascade[17], XZJY[18], SURF Fronta[19], Boosted Exemplar[20], Joint Cascade[21], Cascade CNN, DDFD[22], CCF[23], and BBFCN[24], our algorithm has obvious advantages. As shown in Fig.7, the abscissa represents the number of false positives and the ordinate the true positive rate. The accuracy of our algorithm can reach 0.917, and remains above 0.91 when the number of false positives is less than 500.

Fig.7 ROC curves of multiple algorithms
The advantages of the algorithm have been proven above by observing the ROC curve; its performance can also be felt intuitively through some examples. From the FDDB dataset, we select images constrained by factors such as scale, skin color, expression, out-of-focus blur, pose, and occlusion to observe the performance of our algorithm. As shown in Fig.8, our algorithm still performs well under the constraints of multiple uncontrolled factors and successfully detects faces in pictures with complex backgrounds.

Fig.8 Performance of the algorithm under multiple complex factors
This paper explores a face detection method that is robust to complex environmental factors and has fast detection speed.It designs a two-layer lightweight convolutional neural network,and the input sizes of the network are 18*18 and 34*34 respectively.The pyramidal preprocessing of the input image allows the network to detectmulti-scale faces.At the same time,the design of lightweight network and cascademethods reduces network parameters,and improves detection accuracy,and it also has obvious advantages comparing with the popular face detection algorithms.The focus of this paper is on the exploration of convolution methods,using deep separable convolutionmethods tofurther accelerate the network,and at the same time,to transplant the algorithm into the embedded devices to realize its due value.