
Multi-Head Attention Graph Network for Few Shot Learning

Computers, Materials & Continua, 2021, Issue 8

Baiyan Zhang, Hefei Ling*, Ping Li, Qian Wang, Yuxuan Shi, Lei Wu, Runsheng Wang and Jialie Shen

1 School of Computer Science and Technology, Huazhong University of Science and Technology, Wuhan, 430074, China

2 School of Electronics, Electrical Engineering and Computer Science, Queen's University Belfast, Belfast, BT7 1NN, UK

Abstract: The majority of existing graph-network-based few-shot models focus on a node-similarity update mode. The lack of adequate information intensifies the risk of overtraining. In this paper, we propose a novel Multi-head Attention Graph Network to excavate discriminative relations and fulfill effective information propagation. For the edge update, node-level attention evaluates the similarities between two nodes, while distribution-level attention extracts more in-depth global relations. The cooperation between these two parts provides a discriminative and comprehensive expression for the edge features. For the node update, we embrace label-level attention to soften the noise of irrelevant nodes and optimize the update direction. Our proposed model is verified through extensive experiments on two few-shot benchmarks, MiniImageNet and CIFAR-FS. The results suggest that our method has a strong capability of noise immunity and quick convergence, and its classification accuracy outperforms most state-of-the-art approaches.

Keywords: Few-shot learning; attention; graph network

1 Introduction

The past decade has seen the remarkable development of deep learning in a broad spectrum of computer vision fields, including image classification [1], object detection [2-4], person re-identification [5-8], face recognition [9], etc. Such progress cannot be divorced from vast amounts of labeled data. Nevertheless, performance can be adversely affected under data-hungry conditions. Thus, there is an urgent need to enable learning systems to efficiently resolve new tasks with few labeled data, which is termed few-shot learning (FSL).

The origin of FSL can be traced back to 2000, when E. G. Miller et al. investigated the Congealing algorithm to learn common features from a few examples and accomplished the matching of specific images [10]. Since then, considerable literature has grown up around the theme of few-shot learning [11]. The vast majority of existing methodologies belong to meta-learning (ML), which implements an episodic training strategy to learn task-agnostic knowledge from abundant meta-train tasks. Multifarious ML approaches fall into three major groups: learn-to-measure methods provide explicit criteria across different tasks to assess the similarity between labeled and unlabeled data [12,13]; learn-to-model methods generate and update parameters by collaborating with proven networks [14,15]; learn-to-optimize methods fine-tune a base learner for fast adaptation [16]. Despite their diversity and efficacy, mainstream meta-learning models mostly pay attention to generalizing to unseen tasks with transferable knowledge, but few explore the inherent structured relations and regularities [17].

To remedy the drawback above, another line of work has focused on graph networks, which adopt structural representations to support relational reasoning for few-shot learning [17]. Early work constructed a complete graph to represent each task, where label information was propagated by updating node features through neighborhood aggregation [18]. Thereafter, more and more graph methods have been devoted to few-shot learning, such as the edge-labeling framework EGNN [19], the transductive inference method TPN [20], and the distribution propagation method DPGN [21]. With various features involved in the graph update, limited label information is converted into multiple forms and then repeatedly counted and aggregated, entailing otherwise unnecessary costs [22]. Consequently, how to find the discriminative information and realize effective propagation is a problem that urgently needs to be settled.

Figure 1: The overall framework of the MAGN model. In this figure, we present a 3-way 1-shot problem as an example. After the Feature Embedding Module f_emb (details in Section 4.2.1), samples and their relations generate the initial graph. There are L generations in the GNN module (we show one of them for simplicity). Each generation consists of a node feature update and an edge feature update, with cooperation among the node-attention, distribution-attention and label-attention. The solid circles represent support samples and the hollow circles represent query samples. The squares indicate the edge features and the darkness of color denotes the value: the darker the color, the larger the value. The detailed process is described in Section 3

In this paper, we propose a novel Multi-head Attention Graph Network (MAGN) to address the problem stated above, as shown in Fig. 1. In the process of updating the graph network, different weights are assigned to different neighbor nodes. Compared to the node-similarity-based weights of existing methods, we provide new insights into a multi-level fusion similarity mechanism with distribution features and label information to improve discriminative performance. More specifically, for the node update, we treat the label information as an initial adjacency matrix to soften the noise of irrelevant nodes, thereby providing a constraint on the update direction. For the edge update, we excavate the distribution feature by calculating the edge-level similarity over all samples; as feedback of global information, it reveals more in-depth relations. Collocating with the regular node-level attention, more valuable and discriminative relations are involved in the process of knowledge transfer. Furthermore, we verify the effectiveness of our method through extensive experiments on the MiniImageNet and CIFAR-FS datasets. The results show that MAGN achieves superior convergence speed and robustness while preserving accuracy.

2 Related Work

2.1 Meta-Learning

Meta-learning, also known as "learning to learn," plays an essential role in addressing the issue of few-shot learning. According to the different content in the learning systems, it can be divided into three categories. Learn-to-measure methods, based on metric learning, employ an attentive nearest-neighbor classifier using the similarity between labeled and unlabeled data: Matching Networks adopt a cosine similarity [15], and Prototypical Networks [12] establish a prototype for each class and utilize Euclidean distance as the metric. Differing from the above, Relation Net [13] devises a CNN-based relation metric network. Learn-to-optimize methods fine-tune a base learner for fast adaptation. MAML [16] is a typical approach that learns a good initialization for rapid generalization. Thereafter, various models have been derived from MAML, such as the first-order gradient method Reptile [23], the task-agnostic method TAML [24], and the Bayes-based method BMAML [25]. Learn-to-model methods generate and update parameters on the basis of proven networks. Meta-LSTM [26] embraces an LSTM network to update the meta-learner parameters. VERSA [27] builds a probabilistic amortization network to obtain softmax layer weights. In order to predict weights, MetaOptNet [28] advocates SVM, R2-D2 adopts a ridge regression layer [29], while Dynamic Net [30] uses a memory module.

2.2 Graph Attention Network

The attention mechanism is essential for a wide range of technologies, such as sequence learning, feature extraction, signal enhancement and so on [31]. The core objective is to select, from the abundance of available information, the information that is most critical to the current task. Early GCN works were limited by the Fourier transform derivation, which made it challenging to deal with directed graphs and assigned indiscriminate equal weights [32]. Given that, Yoshua Bengio's group equipped the graph network with a masked self-attention mechanism [33]. During information propagation, it assigns different weights to each node according to the neighbor distribution. Benefiting from this strategy, GAT can filter noisy neighbors and improve the performance of the graph framework. Such an idea was adopted and enhanced by GAAN [34], which combines two mechanisms: multi-head attention to extract diverse information, and self-attention to aggregate it.

3 Model

In this section, we first summarize the preliminaries of few-shot classification following previous work and then describe our method in more technical detail.

3.1 Preliminaries

Few-shot learning: The goal of FSL is to train a reliable model with the capability of learning and generalizing from few samples. A common setting is the N-way K-shot classification task. Each task T consists of a support set S and a query set Q. There are N × K labeled samples in the support set, where N is the number of classes and K is the number of samples in each class. Samples in the query set are unlabeled, but they belong to the N classes of the support set. The learning algorithm aims to produce a mapping function from query samples to labels.

Meta-Learning: One of the main obstacles in FSL is overfitting caused by limited labeled data. Meta-learning adopts an episodic training strategy to make up for this, which increases generalization ability through extensive training on similar tasks. Given a training data set D_train and a test data set D_test with D_train ∩ D_test = ∅, each task T is randomly sampled from a task distribution P(T). It can be expressed as T = S ∪ Q, where x_i represents the i-th sample, y_i is its label, and T is the number of samples in Q. In the training stage, plenty of N-way K-shot classification tasks are sampled from D_train. Through extensive episodic training on these tasks, a feasible classifier can be obtained. In the testing stage, the samples of each task stem from D_test. Since tasks in D_train and D_test follow the same distribution P(T), such a classifier can generalize well on tasks sampled from D_test.
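To make the episodic setting concrete, the following is a minimal Python sketch of sampling one N-way K-shot task; the dataset layout (a dict mapping class labels to lists of samples) and all names are assumptions made for illustration, not part of the paper's code.

import random

def sample_episode(dataset, n_way=5, k_shot=1, n_query=15):
    """Sample one N-way K-shot task T = S ∪ Q from a class-indexed dataset.

    `dataset` is assumed to be a dict: class label -> list of samples.
    """
    classes = random.sample(sorted(dataset.keys()), n_way)
    support, query = [], []
    for label, cls in enumerate(classes):
        # Draw K support samples and (n_query / N) query samples without overlap.
        picked = random.sample(dataset[cls], k_shot + n_query // n_way)
        support += [(x, label) for x in picked[:k_shot]]
        query += [(x, label) for x in picked[k_shot:]]
    random.shuffle(support)   # shuffling the label order, as done in Section 4.3.1
    random.shuffle(query)
    return support, query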

3.2 Initialized GNN

Graph Neural Networks: In this section, we describe the overall framework of our proposed GNN, as shown in Fig. 1. Firstly, we utilize an embedding module to extract features (details in Section 4.2.1); after that, each task is expressed as a fully-connected graph. Through L layers of graph updates, the GNN realizes information transfer and relational reasoning. Specifically, the task T is formed as a graph G = (V, E), where each node v_i ∈ V denotes an embedded sample x_i in task T, and each edge e_{i,j} ∈ E corresponds to the relationship between the two connected nodes v_i and v_j, where i, j = 1, 2, ..., F and F is the number of all samples in T, F = N × K + T.

Initial graph features: In the graph G = (V, E), node features are initialized as the output of the feature embedding module: v_i^0 = f_emb(x_i; θ_emb), where θ_emb is the parameter set of the embedding module f_emb. Edge features are used to indicate the degree of correlation between two connected nodes, e_{i,j} ∈ [0, 1]. Given the label information, we set the edge features of labeled samples to the two extremes of intra-class and inter-class relations, while the edge features of unlabeled samples share the same relation to all others. Therefore, the edge features are initialized as Eq. (1):
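Eq. (1) is not reproduced in this extraction, but one plausible initialization consistent with the description above can be sketched as follows; the value 0.5 for edges involving unlabeled query samples is an assumption made for illustration.

import numpy as np

def init_edges(labels, num_support):
    """Initialize edge features e_{i,j} in [0, 1] for a task with F samples.

    `labels[i]` is the class of sample i; only the first `num_support` samples
    are treated as labeled. Labeled pairs take the extreme values 1 (same class)
    or 0 (different class); pairs involving an unlabeled query sample share one
    uniform value (0.5 here, an assumed choice).
    """
    F = len(labels)
    E = np.full((F, F), 0.5)
    for i in range(num_support):
        for j in range(num_support):
            E[i, j] = 1.0 if labels[i] == labels[j] else 0.0
    return E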

3.3 Multi-Head Attention

The majority of existing few-shot graph models focus on a node-attention update mode, which adopts node similarity to control neighborhood aggregation. This mode ignores the inherent relationships between the samples, which may lead to the risk of overtraining. Therefore, we propose a multi-head attention mechanism with distribution features and label information to enhance the model capability.

3.3.1 Node-Level Attention

Like existing methods such as EGNN and DPGN, the node-level attention is based on the similarity between two nodes. Since each node has a different neighborhood, we apply a normalization operation over nodes in the same neighborhood to obtain more discriminative and comparable results. We employ node-level attention with node similarity defined as follows:

In detail, given nodes v_i^k and v_j^k from the k-th layer, Att is a metric network with four Conv-BN-ReLU blocks that calculates the primary similarity of the two nodes. In Eq. (3), N(i) denotes the neighbor set of the node v_i. Then we apply a local normalization operation by softmax and get the final node similarity ñ_{i,j}^k.
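A minimal PyTorch-style sketch of this node-level attention follows. The exact input fed to Att is not specified in the text; feeding the absolute difference of the two node vectors through 1 × 1 convolutions is an assumption, as are all tensor shapes and names.

import torch
import torch.nn as nn

class Att(nn.Module):
    """Metric network with four Conv-BN-ReLU blocks (realized here as 1x1
    convolutions over the feature dimension, an assumed choice)."""
    def __init__(self, dim):
        super().__init__()
        layers = []
        for _ in range(4):
            layers += [nn.Conv2d(dim, dim, 1), nn.BatchNorm2d(dim), nn.ReLU()]
        layers += [nn.Conv2d(dim, 1, 1)]        # scalar similarity per pair
        self.net = nn.Sequential(*layers)

    def forward(self, vi, vj):
        # vi, vj: [F, F, dim] pairwise node features broadcast over the graph
        x = (vi - vj).abs().permute(2, 0, 1).unsqueeze(0)   # [1, dim, F, F]
        return self.net(x).squeeze(0).squeeze(0)            # [F, F]

def node_level_attention(v, att):
    """Eq. (2)-(3) style node similarity, normalized by softmax over each neighborhood."""
    F_, d = v.shape
    vi = v.unsqueeze(1).expand(F_, F_, d)
    vj = v.unsqueeze(0).expand(F_, F_, d)
    sim = att(vi, vj)                       # raw pairwise similarity
    return torch.softmax(sim, dim=1)        # normalize over neighbors N(i)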

3.3.2 Distribution-Level Attention

The node-level attention relies on the local relationships of node similarity, while the global relationship has not yet been fully investigated. To mine more discriminative information, we extract the global distribution feature by aggregating the edge features over all samples and then evaluate the similarity of the distribution features, with definitions as Eqs. (4) and (5).

where D_i^k is the distribution feature of node v_i^k at the k-th layer; it consists of all the edge features of v_i^k. Similarly, we can get the distribution feature of node v_j^k as D_j^k. Then both of them are sent to the Att network to assess the distribution similarity. The same softmax operation is applied to simplify the computations.
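A sketch of this distribution-level attention is given below, reusing the pairwise metric network from the node-level case. Taking D_i^k to be the i-th row of the edge matrix and applying the same Att-style network to it are assumptions made for illustration; Eqs. (4)-(5) themselves are not reproduced here.

import torch

def distribution_level_attention(E, att):
    """Distribution-level attention (Eqs. (4)-(5) style, sketched).

    E: [F, F] edge-feature matrix; the distribution feature D_i is taken to be
    the i-th row of E (all edge features of node v_i). `att` is a pairwise metric
    network as in the node-level sketch, constructed with input dimension F
    (an assumption).
    """
    F_ = E.shape[0]
    Di = E.unsqueeze(1).expand(F_, F_, F_)   # D_i broadcast over pairs (i, j)
    Dj = E.unsqueeze(0).expand(F_, F_, F_)   # D_j
    sim = att(Di, Dj)                        # [F, F] raw distribution similarity
    return torch.softmax(sim, dim=1)         # normalize over neighbors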

3.3.3 Label-Level Attention

In previous work, although the aggregation scope is the neighborhood of each node, it extends beyond the same class. Furthermore, the update of the graph network is a process of information interaction and fusion, which therefore increases the noise from nodes of diverse classes. We set an adjacency matrix to filter irrelevant information and constrain the update direction, as shown in Eq. (6).

where A^k is the adjacency matrix at the k-th layer, A is the label adjacency matrix, whose element a_{i,j} is equal to one when v_i and v_j have the same label and zero otherwise, and E^k is the matrix of edge features. Eq. (6) combines long-term label information with short-term updated edge features in the manner of a recurrent neural network. Such an operation prunes useless information from inter-class samples and distills useful information from intra-class samples.
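Since Eq. (6) is not reproduced in this extraction, the sketch below shows only one plausible realization of the described fusion; the convex combination, its weight, and the treatment of unlabeled rows are all assumptions.

import torch

def label_level_attention(A, E_k, alpha=0.5):
    """One plausible realization of Eq. (6): combine the fixed label adjacency
    matrix A with the current edge features E^k.

    A[i, j] = 1 if v_i and v_j share a label, else 0; rows of unlabeled query
    nodes can be left at 1 so they are not masked out (an assumption). The
    convex combination with weight `alpha` is also an assumption; the paper
    describes a recurrent-style fusion but its exact form is not shown here.
    """
    return alpha * A + (1.0 - alpha) * E_k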

3.4 Feature Update

Information transmission is facilitated through the alternate update of node features and edge features. In particular, the update of node features depends on neighborhood aggregation, where edge features cooperate with label information to control the relation transformation, while the edge features of MAGN are subject to node similarity and neighborhood distribution.

Based on the above update rules, the edge features at the (k+1)-th layer can be formulated as follows:

where conca/ave represents the way the two attention mechanisms are combined: conca means cascaded (concatenated) connection, and ave denotes mean fusion; the first term represents the node similarity as shown in Eq. (3), and the second term represents the distribution similarity as shown in Eq. (5).
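The sketch below illustrates the conca/ave choice described above. Because Eq. (7) is not reproduced here, the fusion network for the concatenated case, the multiplication by the previous edge features, and the row renormalization are all assumptions, not the paper's exact formulation.

import torch

def update_edges(E_k, n_sim, d_sim, mode="conca", mlp_e=None):
    """Sketch of the edge update combining node- and distribution-level attention.

    n_sim: node-level similarities (Eq. (3)); d_sim: distribution-level
    similarities (Eq. (5)); both [F, F]. With "conca" the two maps are stacked
    and fused by a small network `mlp_e` (assumed, e.g. nn.Conv2d(2, 1, 1));
    with "ave" they are simply averaged.
    """
    if mode == "conca":
        stacked = torch.stack([n_sim, d_sim], dim=0).unsqueeze(0)  # [1, 2, F, F]
        fused = mlp_e(stacked).squeeze(0).squeeze(0)               # [F, F]
    else:  # "ave"
        fused = 0.5 * (n_sim + d_sim)
    E_next = fused * E_k                                           # assumed detail
    return E_next / (E_next.sum(dim=1, keepdim=True) + 1e-8)       # row-normalize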

The node vectors at the (k+1)-th layer can be formulated as Eq. (8):

where MLP_v is the node update network with two Conv-BN-ReLU blocks and a_{i,j}^{k+1} is the adjacency status of v_j and v_i at the (k+1)-th layer. Eq. (8) aggregates the node features of the neighbor set with the multi-head attention mechanism shown in Fig. 2.
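A minimal sketch of such a node update is shown below. Since Eq. (8) is not reproduced here, concatenating each node's own features with its attention-weighted neighborhood aggregate, and realizing MLP_v as two Conv-BN-ReLU blocks over 1 × 1 "pixels", are assumptions.

import torch
import torch.nn as nn

class NodeUpdate(nn.Module):
    """Sketch of the node update: attention-weighted neighborhood aggregation
    followed by a two-block update network MLP_v (assumed realization)."""
    def __init__(self, dim):
        super().__init__()
        self.mlp_v = nn.Sequential(
            nn.Conv2d(2 * dim, dim, 1), nn.BatchNorm2d(dim), nn.ReLU(),
            nn.Conv2d(dim, dim, 1), nn.BatchNorm2d(dim), nn.ReLU(),
        )

    def forward(self, v, A_next):
        # v: [F, dim] node features; A_next: [F, F] adjacency at layer k+1
        agg = A_next @ v                                   # neighborhood aggregation
        x = torch.cat([v, agg], dim=-1)                    # [F, 2*dim]
        x = x.t().unsqueeze(0).unsqueeze(-1)               # [1, 2*dim, F, 1]
        return self.mlp_v(x).squeeze(0).squeeze(-1).t()    # [F, dim]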

Figure 2: Multi-head attention

3.5 Prediction

After L layers of node and edge feature updates, the classification result of node x_i can be obtained from the prediction probability of the corresponding edge features at the final layer via a softmax function:

In Eq. (9), δ(y_j = n) is the Kronecker delta function that outputs one if y_j = n and zero otherwise, and P(y_i = n) stands for the prediction probability that v_i is in the n-th category.
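The following sketch illustrates this prediction step: for each node, the final-layer edge features toward support nodes of each class are accumulated (the Kronecker delta selects y_j = n) and normalized with a softmax. Summation before the softmax is an assumed detail of Eq. (9).

import torch

def predict(E_L, support_labels, n_way):
    """Class probabilities from final-layer edge features (Eq. (9) style).

    E_L: [F, F] edge features at the last layer; support_labels: labels of the
    first len(support_labels) (support) nodes.
    """
    scores = torch.zeros(E_L.shape[0], n_way)
    for j, y in enumerate(support_labels):
        scores[:, y] += E_L[:, j]          # accumulate evidence for class y
    return torch.softmax(scores, dim=1)    # P(y_i = n)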

3.6 Training

During episodic training, the parameters of the proposed GNN are trained in an end-to-end manner. The final objective is to minimize the total loss function computed over all layers, as shown in Eq. (10):

where λ_k is the weight of the k-th layer, L_E represents the cross-entropy loss function, P_i^k is the probability prediction for sample x_i at the k-th layer, and y_i is the ground-truth label.
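A minimal sketch of this layer-weighted objective is given below; the exact reduction and the per-layer weights follow the description above (the paper reports a loss coefficient of 1), but the function names are assumptions.

import torch
import torch.nn.functional as F

def total_loss(layer_probs, labels, lambdas):
    """Weighted sum of per-layer cross-entropy losses (Eq. (10) style).

    layer_probs: list of [num_query, N] probability predictions, one per layer;
    labels: [num_query] ground-truth labels; lambdas: per-layer weights.
    """
    loss = 0.0
    for lam, probs in zip(lambdas, layer_probs):
        # NLL on log-probabilities equals the cross-entropy L_E for probability inputs.
        loss = loss + lam * F.nll_loss(torch.log(probs + 1e-8), labels)
    return loss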

4 Experiments

For a fair comparison, we evaluate our method on two standard few-shot learning datasets following the experimental settings proposed by EGNN and conduct contrast experiments with state-of-the-art approaches.

4.1 Datasets

MiniImageNet is a typical benchmark few-shot dataset. As a subset of ImageNet, it is composed of 60,000 images uniformly distributed over 100 classes. All of the images are RGB colored, with a size of 84 × 84 × 3. Following the setting provided by [26], we randomly select 64 classes for training, 16 classes for validation, and 20 classes for testing.

CIFAR-FS is derived from the CIFAR-100 dataset. The same as MiniImageNet, it is formed of 100 classes and each class contains 600 images, split into 64, 16, and 20 classes for training, validation, and testing, respectively. In particular, the main obstacles of low resolution (32 × 32) and high inter-class similarity make the classification task technically challenging.

Before training, both datasets undergo data augmentation with transformations such as horizontal flip, random crop, and color jitter (brightness, contrast, and saturation).
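A torchvision-style sketch of such an augmentation pipeline is shown below; the crop padding and jitter strengths are assumptions, not values taken from the paper.

from torchvision import transforms

# Augmentation pipeline as described above (84x84 images for MiniImageNet);
# padding and jitter magnitudes are assumed for illustration.
train_transform = transforms.Compose([
    transforms.RandomCrop(84, padding=8),                               # random crop
    transforms.RandomHorizontalFlip(),                                  # horizontal flip
    transforms.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.4),  # color jitter
    transforms.ToTensor(),
])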

4.2 Implementation Details

4.2.1 Embedding Network

We adopt ConvNet and ResNet12 as the backbone embedding modules. Following the same setting used in [19,23], the ConvNet architecture contains four convolutional blocks, each composed of a 3 × 3 convolution, batch normalization, 2 × 2 max-pooling and a LeakyReLU activation. Similar to ConvNet, ResNet12 also has four blocks, each of which is replaced by a residual block.
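A PyTorch sketch of one such ConvNet block, and of the four-block embedding built from it, is shown below; the channel widths (64 throughout) and the LeakyReLU negative slope are assumed, commonly used choices rather than values stated in the paper.

import torch.nn as nn

def conv_block(in_ch, out_ch):
    """One ConvNet block as described above: 3x3 convolution, batch
    normalization, 2x2 max-pooling and LeakyReLU activation."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.MaxPool2d(2),
        nn.LeakyReLU(0.2),
    )

# A minimal ConvNet embedding with four such blocks (assumed 64 channels).
conv_net = nn.Sequential(
    conv_block(3, 64), conv_block(64, 64),
    conv_block(64, 64), conv_block(64, 64),
)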

4.2.2 Parameter Settings

We evaluate MAGN on 5-way 1-shot and 5-way 5-shot classification tasks on both benchmarks. There are three layers in the proposed GNN model. In the meta-train stage, each batch consists of 60 tasks, while in the meta-test stage, each batch contains ten tasks. During training, we adopt the Adam optimizer with an initial learning rate of 5 × 10^-4 and a weight decay of 10^-6. The dropout rate is set to 0.3, and the loss coefficient is 1. The results of our proposed model are obtained after 100k iterations on MiniImageNet and CIFAR-FS.
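The optimizer configuration reported above can be written as the following one-line sketch; `model` stands for the full MAGN network and is assumed to be defined elsewhere.

import torch

def build_optimizer(model):
    """Adam optimizer with the reported settings: initial learning rate 5e-4
    and weight decay 1e-6."""
    return torch.optim.Adam(model.parameters(), lr=5e-4, weight_decay=1e-6)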

4.3 Results and Analysis

4.3.1 Main Results

We compare our approach with recent state-of-the-art models. The main results are listed in Tabs. 1 and 2. According to the embedding architecture, the backbones can be divided into ConvNet, ResNet12, ResNet18, and WRN28; the major difference is the number of residual blocks. In addition, GNN-based methods are listed separately for the sake of intuition. Extensive results show that our MAGN yields better performance on both datasets. For example, among all the ConvNet-architecture methods, MAGN is substantially better than the others. Although the results are slightly lower than DPGN, we still obtain second place with a narrow gap for both backbones. Nevertheless, some common graph network methods such as EGNN and DPGN train and test with labels in a consistent order; for example, in the 5-way 1-shot task, the label order of the support set (0, 1, 2, 3, 4) matches that of the query set (0, 1, 2, 3, 4). The learning system may learn the order of the task rather than the relations among samples. To avoid this effect, we disrupt the label order of the support set and the query set. This setup makes our results less than optimal, but it is more in line with real-world scenarios. The proposed MAGN acquires a robust result that is not biased by the noise of label order.

Table 1: Classification accuracy on CIFAR-FS

4.3.2 Ablation Study

Effect of data shuffling mode: There are three ways to scramble the data: shuffle the support set, shuffle the query set, and shuffle both sets. We conduct a 5-way 1-shot trial with label-node attention on MiniImageNet. The comparative result is shown in Tab. 3. As we can see, the data shuffling mode has little effect on the accuracy rate, while it makes a difference to the time of convergence, which is consistent with the essence of random selection. To further explore the convergence performance of the model, the default setting is to shuffle the order of both sets.

Effect of different attention: The major ablation results for the different attention components are shown in Fig. 3. All variants are evaluated on the 5-way 1-shot classification task of MiniImageNet. The baseline adopts only node attention ("NodeAtt"). On this basis, the variant "DisNode" adds distribution-level attention to assist the edge update. For samples in the same class, their surrounding neighborhoods follow a similar distribution; thus the "DisNode" model can mine a more discriminative relationship between two nodes and obtain an enhancement in accuracy. Besides, the performance of concatenating aggregation is superior to average aggregation. This advantage extends to the final state of three attentions, with a slight rise from 0.49 ("CatDisNode"-"AveDisNode") to 0.85 ("Cat3Att"-"Ave3Att"). The variant "LabNode" equips the node update with label-level attention, leading to a considerable improvement in convergence iterations from 89k to 63k. We attribute this to the filtering capability of the label adjacency matrix, which constrains the update direction and realizes fast convergence.

Table 2: Classification accuracies on MiniImageNet

Table 3: 5-way 1-shot results on MiniImageNet with different data shuffling modes

Figure 3: Effect of different attention. The left part shows the accuracy of variants with different attention components; the right part describes the convergence process of those variants

Effect of layers: In a GNN, the depth of the network has some influence on feature extraction and information transmission. To explore this, we perform 5-way 1-shot experiments with different numbers of layers. As shown in Tab. 4, both the accuracy rate and the convergence improve steadily as the network deepens. To manage the trade-off between convergence and accuracy, a 3-layer GNN is configured for our models.

Table 4: 5-way 1-shot results on MiniImageNet with different numbers of layers

5 Conclusion

In this paper, we propose a Multi-head Attention Graph Network for few-shot learning. The multiple attention mechanism includes three parts: node-level attention explores the similarities between two nodes, and distribution-level attention extracts more in-depth global relations; the cooperation between these two parts provides a discriminative expression for the edge features. Meanwhile, the label-level attention, serving as a filter, weakens the noise of inter-class information during the node update and accelerates the convergence process. Furthermore, we scramble the training data of the support set and the query set to guarantee the transfer of order-agnostic knowledge. Extensive experiments on few-shot benchmark datasets validate the accuracy and efficiency of the proposed method.

Funding Statement: This work was supported in part by the Natural Science Foundation of China under Grants 61972169 and U1536203, in part by the National Key Research and Development Program of China (2016QY01W0200), and in part by the Major Scientific and Technological Project of Hubei Province (2018AAA068 and 2019AAA051).

Conflicts of Interest:The authors declare that they have no conflicts of interest to report regarding the present study.
