
Multi-Head Attention Graph Network for Few Shot Learning

Computers, Materials & Continua, 2021, Issue 8

Baiyan Zhang, Hefei Ling*, Ping Li, Qian Wang, Yuxuan Shi, Lei Wu, Runsheng Wang and Jialie Shen

1 School of Computer Science and Technology, Huazhong University of Science and Technology, Wuhan, 430074, China

2 School of Electronics, Electrical Engineering and Computer Science, Queen's University Belfast, Belfast, BT7 1NN, UK

Abstract: The majority of existing graph-network-based few-shot models focus on a node-similarity update mode. The lack of adequate information intensifies the risk of overtraining. In this paper, we propose a novel Multi-head Attention Graph Network to excavate discriminative relations and fulfill effective information propagation. For the edge update, node-level attention evaluates the similarities between two nodes, while distribution-level attention extracts more in-depth global relations. The cooperation between these two parts provides a discriminative and comprehensive expression for the edge features. For the node update, we embrace label-level attention to soften the noise of irrelevant nodes and optimize the update direction. Our proposed model is verified through extensive experiments on two few-shot benchmarks, MiniImageNet and CIFAR-FS. The results suggest that our method has a strong capability of noise immunity and quick convergence, and its classification accuracy outperforms most state-of-the-art approaches.

Keywords: Few-shot learning; attention; graph network

1 Introduction

The past decade has seen the remarkable development of deep learning in a broad spectrum of computer vision fields, including image classification [1], object detection [2-4], person re-identification [5-8], face recognition [9], etc. Such progress cannot be divorced from vast amounts of labeled data. Nevertheless, performance can be adversely affected under data-hungry conditions. Thus, there is an urgent need to enable learning systems to efficiently resolve new tasks with few labeled data, which is termed few-shot learning (FSL).

The origin of FSL can be traced back to 2000, when E. G. Miller et al. investigated the Congealing algorithm to learn common features from a few examples and accomplished the matching of specific images [10]. Since then, considerable literature has grown up around the theme of few-shot learning [11]. The vast majority of existing methodologies belong to meta-learning (ML), which implements an episodic training strategy to learn task-agnostic knowledge from abundant meta-train tasks. Multifarious ML approaches fall into three major groups: learn-to-measure methods provide explicit criteria across different tasks to assess the similarity between labeled and unlabeled data [12,13]; learn-to-model methods generate and update parameters by collaborating with proven networks [14,15]; learn-to-optimize methods fine-tune a base learner for fast adaptation [16]. Despite their diversity and efficacy, mainstream meta-learning models mostly pay attention to generalizing to unseen tasks with transferable knowledge, but few explore the inherent structured relations and regularities [17].

To remedy the drawback above, another line of work has focused on graph networks, which adopt structural representations to support relational reasoning for few-shot learning [17]. Early work constructed a complete graph to represent each task, where label information was propagated by updating node features through neighborhood aggregation [18]. Thereafter, more and more graph methods have been devoted to few-shot learning, such as the edge-labeling framework EGNN [19], the transductive inference method TPN [20], and the distribution propagation method DPGN [21]. With various features involved in the graph update, limited label information is converted into multiple forms and then repeatedly counted and aggregated, entailing otherwise unnecessary costs [22]. Consequently, how to find the discriminative information and realize effective propagation is a problem that urgently needs to be settled.

Figure 1: The overall framework of the MAGN model. In this figure, we present a 3-way 1-shot problem as an example. After the Feature Embedding Module f_emb (details in Section 4.2.1), samples and their relations generate the initial graph. There are L generations in the GNN module (we show one of them for simplicity). Each generation consists of a node feature update and an edge feature update, with cooperation among the node-attention, distribution-attention and label-attention. The solid circles represent support samples and the hollow circles represent query samples. The squares indicate the edge features and the darkness of color denotes the value: the darker the color, the larger the value. The detailed process is described in Section 3

In this paper, we propose a novel Multi-head Attention Graph Network (MAGN) to address the problem stated above, as shown in Fig. 1. In the process of updating the graph network, different weights are assigned to different neighbor nodes. Compared to the node-similarity-based weights of existing methods, we provide new insights into a multi-level fusion similarity mechanism with distribution features and label information to improve discriminative performance. More specifically, for the node update, we treat the label information as an initial adjacency matrix to soften the noise of irrelevant nodes, thereby providing a constraint on the update direction. For the edge update, we excavate the distribution feature by calculating the edge-level similarity over all samples; as feedback of global information, it reveals more in-depth relations. Collocating with the regular node-level attention, more valuable and discriminative relations are involved in the process of knowledge transfer. Furthermore, we verify the effectiveness of our method through extensive experiments on the MiniImageNet and CIFAR-FS datasets. The results show that MAGN achieves superior convergence speed and robustness while preserving accuracy.

2 Related Work

2.1 Meta-Learning

Meta-learning, also known as "learning to learn," plays an essential role in addressing the issue of few-shot learning. According to the different content in the learning systems, it can be divided into three categories. Learn-to-measure methods, based on metric learning, employ an attentive nearest-neighbor classifier using the similarity between labeled and unlabeled data: Matching Networks adopt a cosine similarity [15], and Prototypical Networks [12] establish a prototype for each class and utilize Euclidean distance as the metric. Differing from the above, Relation Net [13] devises a CNN-based relation metric network. Learn-to-optimize methods fine-tune a base learner for fast adaptation. MAML [16] is a typical approach that learns a good initialization for rapid generalization. Thereafter, various models have been derived from MAML, such as the first-order gradient method Reptile [23], the task-agnostic method TAML [24], and the Bayes-based method BMAML [25]. Learn-to-model methods generate and update parameters on the basis of proven networks. Meta-LSTM [26] embraces an LSTM network to update the meta-learner parameters. VERSA [27] builds a probabilistic amortization network to obtain softmax layer weights. In order to predict weights, MetaOptNet [28] advocates SVM, R2-D2 adopts a ridge regression layer [29], while Dynamic Net [30] uses a memory module.

2.2 Graph Attention Network

The attention mechanism is essential for a wide range of technologies, such as sequence learning, feature extraction, signal enhancement and so on [31]. The core objective is to select, from the abundance of available information, the information that is most critical to the current task. Early GCN works were limited by the Fourier transform derivation, which made it challenging to deal with directed graphs and assigned indiscriminate equal weights [32]. Given that, Yoshua Bengio's group equipped the graph network with a masked self-attention mechanism [33]. During information propagation, it assigns different weights to each node according to the neighbor distribution. Benefiting from this strategy, GAT can filter noisy neighbors and improve the performance of the graph framework. Such an idea was adopted and enhanced by GAAN [34], which combines two mechanisms: multi-head attention to extract diverse information, and self-attention to aggregate it.

3 Model

In this section, we first summarize the preliminaries of few-shot classification following previous work and then describe our method in more technical detail.

3.1 Preliminaries

Few-shot learning: The goal of FSL is to train a reliable model with the capability of learning and generalizing from few samples. A common setting is the N-way K-shot classification task. Each task T consists of a support set S and a query set Q. There are N × K labeled samples in the support set, where N is the number of classes and K is the number of samples in each class. Samples in the query set are unlabeled, but they belong to the N classes of the support set. The learning algorithm aims to produce a mapping function from query samples to labels.

Meta-Learning: One of the main obstacles in FSL is overfitting caused by limited labeled data. Meta-learning adopts an episodic training strategy to make up for this, which increases generalization ability through extensive training on similar tasks. Given a training data set D_train and a test data set D_test with D_train ∩ D_test = ∅, each task T is randomly sampled from a task distribution P(T). It can be expressed as T = S ∪ Q, where x_i represents the i-th sample, y_i is its label, and T is the number of samples in Q. In the training stage, plenty of N-way K-shot classification tasks are sampled from D_train. Through extensive episodic training on these tasks, a feasible classifier can be obtained. In the testing stage, the samples of each task stem from D_test. Since tasks in D_train and D_test follow the same distribution P(T), such a classifier can generalize well on tasks sampled from D_test.
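To make the episodic setting concrete, the following is a minimal Python sketch of sampling one N-way K-shot task; the dataset layout (a dict mapping class labels to lists of samples) and all names are assumptions made for illustration, not part of the paper's code.

import random

def sample_episode(dataset, n_way=5, k_shot=1, n_query=15):
    """Sample one N-way K-shot task T = S ∪ Q from a class-indexed dataset.

    `dataset` is assumed to be a dict: class label -> list of samples.
    """
    classes = random.sample(sorted(dataset.keys()), n_way)
    support, query = [], []
    for label, cls in enumerate(classes):
        # Draw K support samples and (n_query / N) query samples without overlap.
        picked = random.sample(dataset[cls], k_shot + n_query // n_way)
        support += [(x, label) for x in picked[:k_shot]]
        query += [(x, label) for x in picked[k_shot:]]
    random.shuffle(support)   # shuffling the label order, as done in Section 4.3.1
    random.shuffle(query)
    return support, query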

3.2 Initialized GNN

Graph Neural Networks: In this section, we describe the overall framework of our proposed GNN, as shown in Fig. 1. Firstly, we utilize an embedding module to extract features (details in Section 4.2.1); after that, each task is expressed as a fully-connected graph. Through L layers of graph updates, the GNN realizes information transfer and relational reasoning. Specifically, the task T is formed as a graph G = (V, E), where each node v_i ∈ V denotes an embedded sample x_i in task T, and each edge e_{i,j} ∈ E corresponds to the relationship between the two connected nodes v_i and v_j, where i, j = 1, 2, ..., F and F is the number of all samples in T, F = N × K + T.

Initial graph features: In the graph G = (V, E), node features are initialized as the output of the feature embedding module: v_i^0 = f_emb(x_i; θ_emb), where θ_emb is the parameter set of the embedding module f_emb. Edge features are used to indicate the degree of correlation between two connected nodes, e_{i,j} ∈ [0, 1]. Given the label information, we set the edge features of labeled samples to the two extremes of intra-class and inter-class relations, while the edge features of unlabeled samples share the same relation to all others. Therefore, the edge features are initialized as Eq. (1):
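Eq. (1) is not reproduced in this extraction, but one plausible initialization consistent with the description above can be sketched as follows; the value 0.5 for edges involving unlabeled query samples is an assumption made for illustration.

import numpy as np

def init_edges(labels, num_support):
    """Initialize edge features e_{i,j} in [0, 1] for a task with F samples.

    `labels[i]` is the class of sample i; only the first `num_support` samples
    are treated as labeled. Labeled pairs take the extreme values 1 (same class)
    or 0 (different class); pairs involving an unlabeled query sample share one
    uniform value (0.5 here, an assumed choice).
    """
    F = len(labels)
    E = np.full((F, F), 0.5)
    for i in range(num_support):
        for j in range(num_support):
            E[i, j] = 1.0 if labels[i] == labels[j] else 0.0
    return E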

3.3 Multi-Head Attention

The majority of existing few-shot graph models focus on a node-attention update mode, which adopts node similarity to control neighborhood aggregation. This mode ignores the inherent relationships between the samples, which may lead to the risk of overtraining. Therefore, we propose a multi-head attention mechanism with distribution features and label information to enhance the model capability.

3.3.1 Node-Level Attention

Like existing methods such as EGNN and DPGN, the node-level attention is based on the similarity between two nodes. Since each node has a different neighborhood, we apply a normalization operation over nodes in the same neighborhood to obtain more discriminative and comparable results. We employ node-level attention with node similarity defined as follows:

In detail, given nodes v_i^k and v_j^k from the k-th layer, Att is a metric network with four Conv-BN-ReLU blocks that calculates the primary similarity of the two nodes. In Eq. (3), N(i) denotes the neighbor set of the node v_i. Then we apply a local normalization operation by softmax and get the final node similarity ñ_{i,j}^k.
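A minimal PyTorch-style sketch of this node-level attention follows. The exact input fed to Att is not specified in the text; feeding the absolute difference of the two node vectors through 1 × 1 convolutions is an assumption, as are all tensor shapes and names.

import torch
import torch.nn as nn

class Att(nn.Module):
    """Metric network with four Conv-BN-ReLU blocks (realized here as 1x1
    convolutions over the feature dimension, an assumed choice)."""
    def __init__(self, dim):
        super().__init__()
        layers = []
        for _ in range(4):
            layers += [nn.Conv2d(dim, dim, 1), nn.BatchNorm2d(dim), nn.ReLU()]
        layers += [nn.Conv2d(dim, 1, 1)]        # scalar similarity per pair
        self.net = nn.Sequential(*layers)

    def forward(self, vi, vj):
        # vi, vj: [F, F, dim] pairwise node features broadcast over the graph
        x = (vi - vj).abs().permute(2, 0, 1).unsqueeze(0)   # [1, dim, F, F]
        return self.net(x).squeeze(0).squeeze(0)            # [F, F]

def node_level_attention(v, att):
    """Eq. (2)-(3) style node similarity, normalized by softmax over each neighborhood."""
    F_, d = v.shape
    vi = v.unsqueeze(1).expand(F_, F_, d)
    vj = v.unsqueeze(0).expand(F_, F_, d)
    sim = att(vi, vj)                       # raw pairwise similarity
    return torch.softmax(sim, dim=1)        # normalize over neighbors N(i)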

3.3.2 Distribution-Level Attention

The node-level attention relies on the local relationships of node similarity, while the global relationship has not yet been fully investigated. To mine more discriminative information, we extract the global distribution feature by aggregating the edge features over all samples and then evaluate the similarity of the distribution features, with definitions as Eqs. (4) and (5).

where D_i^k is the distribution feature of node v_i^k at the k-th layer; it consists of all the edge features of v_i^k. Similarly, we can get the distribution feature of node v_j^k as D_j^k. Then both of them are sent to the Att network to assess the distribution similarity. The same softmax operation is applied to simplify the computations.
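A sketch of this distribution-level attention is given below, reusing the pairwise metric network from the node-level case. Taking D_i^k to be the i-th row of the edge matrix and applying the same Att-style network to it are assumptions made for illustration; Eqs. (4)-(5) themselves are not reproduced here.

import torch

def distribution_level_attention(E, att):
    """Distribution-level attention (Eqs. (4)-(5) style, sketched).

    E: [F, F] edge-feature matrix; the distribution feature D_i is taken to be
    the i-th row of E (all edge features of node v_i). `att` is a pairwise metric
    network as in the node-level sketch, constructed with input dimension F
    (an assumption).
    """
    F_ = E.shape[0]
    Di = E.unsqueeze(1).expand(F_, F_, F_)   # D_i broadcast over pairs (i, j)
    Dj = E.unsqueeze(0).expand(F_, F_, F_)   # D_j
    sim = att(Di, Dj)                        # [F, F] raw distribution similarity
    return torch.softmax(sim, dim=1)         # normalize over neighbors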

3.3.3 Label-Level Attention

In previous work, although the aggregation scope is the neighborhood of each node, it extends beyond the same class. Furthermore, the update of the graph network is a process of information interaction and fusion, which therefore increases the noise from nodes of diverse classes. We set an adjacency matrix to filter irrelevant information and constrain the update direction, as shown in Eq. (6).

where A^k is the adjacency matrix at the k-th layer, A is the label adjacency matrix, whose element a_{i,j} is equal to one when v_i and v_j have the same label and zero otherwise, and E^k is the matrix of edge features. Eq. (6) combines long-term label information with short-term updated edge features in the manner of a recurrent neural network. Such an operation prunes useless information from inter-class samples and distills useful information from intra-class samples.
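Since Eq. (6) is not reproduced in this extraction, the sketch below shows only one plausible realization of the described fusion; the convex combination, its weight, and the treatment of unlabeled rows are all assumptions.

import torch

def label_level_attention(A, E_k, alpha=0.5):
    """One plausible realization of Eq. (6): combine the fixed label adjacency
    matrix A with the current edge features E^k.

    A[i, j] = 1 if v_i and v_j share a label, else 0; rows of unlabeled query
    nodes can be left at 1 so they are not masked out (an assumption). The
    convex combination with weight `alpha` is also an assumption; the paper
    describes a recurrent-style fusion but its exact form is not shown here.
    """
    return alpha * A + (1.0 - alpha) * E_k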

3.4 Feature Update

Information transmission is facilitated through the alternate update of node features and edge features. In particular, the update of node features depends on neighborhood aggregation, where edge features cooperate with label information to control the relation transformation, while the edge features of MAGN are subject to node similarity and neighborhood distribution.

Based on the above update rules, the edge features at the (k+1)-th layer can be formulated as follows:

where conca/ave represents the way the two attention mechanisms are combined: conca means cascaded (concatenated) connection, and ave denotes mean fusion; the first term represents the node similarity as shown in Eq. (3), and the second term represents the distribution similarity as shown in Eq. (5).
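The sketch below illustrates the conca/ave choice described above. Because Eq. (7) is not reproduced here, the fusion network for the concatenated case, the multiplication by the previous edge features, and the row renormalization are all assumptions, not the paper's exact formulation.

import torch

def update_edges(E_k, n_sim, d_sim, mode="conca", mlp_e=None):
    """Sketch of the edge update combining node- and distribution-level attention.

    n_sim: node-level similarities (Eq. (3)); d_sim: distribution-level
    similarities (Eq. (5)); both [F, F]. With "conca" the two maps are stacked
    and fused by a small network `mlp_e` (assumed, e.g. nn.Conv2d(2, 1, 1));
    with "ave" they are simply averaged.
    """
    if mode == "conca":
        stacked = torch.stack([n_sim, d_sim], dim=0).unsqueeze(0)  # [1, 2, F, F]
        fused = mlp_e(stacked).squeeze(0).squeeze(0)               # [F, F]
    else:  # "ave"
        fused = 0.5 * (n_sim + d_sim)
    E_next = fused * E_k                                           # assumed detail
    return E_next / (E_next.sum(dim=1, keepdim=True) + 1e-8)       # row-normalize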

The node vectors at the (k+1)-th layer can be formulated as Eq. (8):

where MLP_v is the node update network with two Conv-BN-ReLU blocks and a_{i,j}^{k+1} is the adjacency status of v_j and v_i at the (k+1)-th layer. Eq. (8) aggregates the node features of the neighbor set with the multi-head attention mechanism shown in Fig. 2.
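A minimal sketch of such a node update is shown below. Since Eq. (8) is not reproduced here, concatenating each node's own features with its attention-weighted neighborhood aggregate, and realizing MLP_v as two Conv-BN-ReLU blocks over 1 × 1 "pixels", are assumptions.

import torch
import torch.nn as nn

class NodeUpdate(nn.Module):
    """Sketch of the node update: attention-weighted neighborhood aggregation
    followed by a two-block update network MLP_v (assumed realization)."""
    def __init__(self, dim):
        super().__init__()
        self.mlp_v = nn.Sequential(
            nn.Conv2d(2 * dim, dim, 1), nn.BatchNorm2d(dim), nn.ReLU(),
            nn.Conv2d(dim, dim, 1), nn.BatchNorm2d(dim), nn.ReLU(),
        )

    def forward(self, v, A_next):
        # v: [F, dim] node features; A_next: [F, F] adjacency at layer k+1
        agg = A_next @ v                                   # neighborhood aggregation
        x = torch.cat([v, agg], dim=-1)                    # [F, 2*dim]
        x = x.t().unsqueeze(0).unsqueeze(-1)               # [1, 2*dim, F, 1]
        return self.mlp_v(x).squeeze(0).squeeze(-1).t()    # [F, dim]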

Figure 2: Multi-head attention

3.5 Prediction

After L layers of node and edge feature updates, the classification result of node x_i can be obtained from the prediction probability of the corresponding edge features at the final layer via a softmax function:

In Eq. (9), δ(y_j = n) is the Kronecker delta function that outputs one if y_j = n and zero otherwise, and P(y_i = n) stands for the prediction probability that v_i is in the n-th category.
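The following sketch illustrates this prediction step: for each node, the final-layer edge features toward support nodes of each class are accumulated (the Kronecker delta selects y_j = n) and normalized with a softmax. Summation before the softmax is an assumed detail of Eq. (9).

import torch

def predict(E_L, support_labels, n_way):
    """Class probabilities from final-layer edge features (Eq. (9) style).

    E_L: [F, F] edge features at the last layer; support_labels: labels of the
    first len(support_labels) (support) nodes.
    """
    scores = torch.zeros(E_L.shape[0], n_way)
    for j, y in enumerate(support_labels):
        scores[:, y] += E_L[:, j]          # accumulate evidence for class y
    return torch.softmax(scores, dim=1)    # P(y_i = n)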

3.6 Training

During episodic training, the parameters of the proposed GNN are trained in an end-to-end manner. The final objective is to minimize the total loss function computed over all layers, as shown in Eq. (10):

where λ_k is the weight of the k-th layer, L_E represents the cross-entropy loss function, P_i^k is the probability prediction for sample x_i at the k-th layer, and y_i is the ground-truth label.
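A minimal sketch of this layer-weighted objective is given below; the exact reduction and the per-layer weights follow the description above (the paper reports a loss coefficient of 1), but the function names are assumptions.

import torch
import torch.nn.functional as F

def total_loss(layer_probs, labels, lambdas):
    """Weighted sum of per-layer cross-entropy losses (Eq. (10) style).

    layer_probs: list of [num_query, N] probability predictions, one per layer;
    labels: [num_query] ground-truth labels; lambdas: per-layer weights.
    """
    loss = 0.0
    for lam, probs in zip(lambdas, layer_probs):
        # NLL on log-probabilities equals the cross-entropy L_E for probability inputs.
        loss = loss + lam * F.nll_loss(torch.log(probs + 1e-8), labels)
    return loss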

4 Experiments

For a fair comparison, we evaluate our method on two standard few-shot learning datasets following the experimental settings proposed by EGNN and conduct contrast experiments with state-of-the-art approaches.

4.1 Datasets

MiniImageNet is a typical benchmark few-shot dataset. As a subset of ImageNet, it is composed of 60,000 images uniformly distributed over 100 classes. All of the images are RGB colored, with a size of 84 × 84 × 3. Following the setting provided by [26], we randomly select 64 classes for training, 16 classes for validation, and 20 classes for testing.

CIFAR-FS is derived from the CIFAR-100 dataset. The same as MiniImageNet, it is formed of 100 classes and each class contains 600 images, split into 64, 16, and 20 classes for training, validation, and testing, respectively. In particular, the main obstacles of low resolution (32 × 32) and high inter-class similarity make the classification task technically challenging.

Before training, both datasets undergo data augmentation with transformations such as horizontal flip, random crop, and color jitter (brightness, contrast, and saturation).
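A torchvision-style sketch of such an augmentation pipeline is shown below; the crop padding and jitter strengths are assumptions, not values taken from the paper.

from torchvision import transforms

# Augmentation pipeline as described above (84x84 images for MiniImageNet);
# padding and jitter magnitudes are assumed for illustration.
train_transform = transforms.Compose([
    transforms.RandomCrop(84, padding=8),                               # random crop
    transforms.RandomHorizontalFlip(),                                  # horizontal flip
    transforms.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.4),  # color jitter
    transforms.ToTensor(),
])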

4.2 Implementation Details

4.2.1 Embedding Network

We adopt ConvNet and ResNet12 as the backbone embedding modules. Following the same setting used in [19,23], the ConvNet architecture contains four convolutional blocks, each composed of a 3 × 3 convolution, batch normalization, 2 × 2 max-pooling and a LeakyReLU activation. Similar to ConvNet, ResNet12 also has four blocks, each of which is replaced by a residual block.
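A PyTorch sketch of one such ConvNet block, and of the four-block embedding built from it, is shown below; the channel widths (64 throughout) and the LeakyReLU negative slope are assumed, commonly used choices rather than values stated in the paper.

import torch.nn as nn

def conv_block(in_ch, out_ch):
    """One ConvNet block as described above: 3x3 convolution, batch
    normalization, 2x2 max-pooling and LeakyReLU activation."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.MaxPool2d(2),
        nn.LeakyReLU(0.2),
    )

# A minimal ConvNet embedding with four such blocks (assumed 64 channels).
conv_net = nn.Sequential(
    conv_block(3, 64), conv_block(64, 64),
    conv_block(64, 64), conv_block(64, 64),
)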

4.2.2 Parameter Settings

We evaluate MAGN on 5-way 1-shot and 5-way 5-shot classification tasks on both benchmarks. There are three layers in the proposed GNN model. In the meta-train stage, each batch consists of 60 tasks, while in the meta-test stage, each batch contains ten tasks. During training, we adopt the Adam optimizer with an initial learning rate of 5 × 10^-4 and a weight decay of 10^-6. The dropout rate is set to 0.3, and the loss coefficient is 1. The results of our proposed model are obtained after 100k iterations on MiniImageNet and CIFAR-FS.
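The optimizer configuration reported above can be written as the following one-line sketch; `model` stands for the full MAGN network and is assumed to be defined elsewhere.

import torch

def build_optimizer(model):
    """Adam optimizer with the reported settings: initial learning rate 5e-4
    and weight decay 1e-6."""
    return torch.optim.Adam(model.parameters(), lr=5e-4, weight_decay=1e-6)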

4.3 Results and Analysis

4.3.1 Main Results

We compare our approach with recent state-of-the-art models. The main results are listed in Tabs. 1 and 2. According to the embedding architecture, the backbones can be divided into ConvNet, ResNet12, ResNet18, and WRN28; the major difference is the number of residual blocks. In addition, GNN-based methods are listed separately for the sake of intuition. Extensive results show that our MAGN yields better performance on both datasets. For example, among all the ConvNet-architecture methods, MAGN is substantially better than the others. Although the results are slightly lower than DPGN, we still obtain second place with a narrow gap for both backbones. Nevertheless, some common graph network methods such as EGNN and DPGN train and test with labels in a consistent order; for example, in the 5-way 1-shot task, the label order of the support set (0, 1, 2, 3, 4) matches that of the query set (0, 1, 2, 3, 4). The learning system may learn the order of the task rather than the relations among samples. To avoid this effect, we disrupt the label order of the support set and the query set. This setup makes our results less than optimal, but it is more in line with real-world scenarios. The proposed MAGN acquires a robust result that is not biased by the noise of label order.

Table 1: Classification accuracy on CIFAR-FS

4.3.2 Ablation Study

Effect of data shuffling mode: There are three ways to scramble the data: shuffle the support set, shuffle the query set, and shuffle both sets. We conduct a 5-way 1-shot trial with label-node attention on MiniImageNet. The comparative result is shown in Tab. 3. As we can see, the data shuffling mode has little effect on the accuracy rate, while it makes a difference to the time of convergence, which is consistent with the essence of random selection. To further explore the convergence performance of the model, the default setting is to shuffle the order of both sets.

Effect of different attention: The major ablation results for the different attention components are shown in Fig. 3. All variants are evaluated on the 5-way 1-shot classification task of MiniImageNet. The baseline adopts only node attention ("NodeAtt"). On this basis, the variant "DisNode" adds distribution-level attention to assist the edge update. For samples in the same class, their surrounding neighborhoods follow a similar distribution; thus the "DisNode" model can mine a more discriminative relationship between two nodes and obtain an enhancement in accuracy. Besides, the performance of concatenating aggregation is superior to average aggregation. This advantage extends to the final state of three attentions, with a slight rise from 0.49 ("CatDisNode"-"AveDisNode") to 0.85 ("Cat3Att"-"Ave3Att"). The variant "LabNode" equips the node update with label-level attention, leading to a considerable improvement in convergence iterations from 89k to 63k. We attribute this to the filtering capability of the label adjacency matrix, which constrains the update direction and realizes fast convergence.

Table 2: Classification accuracies on MiniImageNet

Table 3: 5-way 1-shot results on MiniImageNet with different data shuffling modes

Figure 3: Effect of different attention. The left part shows the accuracy of variants with different attention components; the right part describes the convergence process of those variants

Effect of layers: In a GNN, the depth of the network has some influence on feature extraction and information transmission. To explore this, we perform 5-way 1-shot experiments with different numbers of layers. As shown in Tab. 4, both the accuracy rate and the convergence improve steadily as the network deepens. To manage the trade-off between convergence and accuracy, a 3-layer GNN is configured for our models.

Table 4: 5-way 1-shot results on MiniImageNet with different numbers of layers

5 Conclusion

In this paper, we propose a Multi-head Attention Graph Network for few-shot learning. The multiple attention mechanism includes three parts: node-level attention explores the similarities between two nodes, and distribution-level attention extracts more in-depth global relations; the cooperation between these two parts provides a discriminative expression for the edge features. Meanwhile, the label-level attention, serving as a filter, weakens the noise of inter-class information during the node update and accelerates the convergence process. Furthermore, we scramble the training data of the support set and the query set to guarantee the transfer of order-agnostic knowledge. Extensive experiments on few-shot benchmark datasets validate the accuracy and efficiency of the proposed method.

Funding Statement: This work was supported in part by the Natural Science Foundation of China under Grants 61972169 and U1536203, in part by the National Key Research and Development Program of China (2016QY01W0200), and in part by the Major Scientific and Technological Project of Hubei Province (2018AAA068 and 2019AAA051).

Conflicts of Interest:The authors declare that they have no conflicts of interest to report regarding the present study.
