
Multi-Attention Fusion and Fine-Grained Alignment for Bidirectional Image-Sentence Retrieval in Remote Sensing

IEEE/CAA Journal of Automatica Sinica, August 2022

Qimin Cheng, Yuzhuo Zhou, Haiyan Huang, and Zhongyuan Wang

Dear editor,

Cross-modal retrieval in remote sensing (RS) data has inspired increasing enthusiasm due to its merits of flexible input and efficient query. In this letter, we aim to establish the semantic relationship between RS images and their description sentences. Specifically, we propose a multi-attention fusion and fine-grained alignment network, termed MAFA-Net, for bidirectional cross-modal image-sentence retrieval in RS. Multiple attention mechanisms are fused to enhance the discriminative ability of visual features for RS images with complex scenes, while a fine-grained alignment strategy is introduced to study the hidden connection between RS observations and sentences. To validate the capability of MAFA-Net, we leverage four captioning benchmark datasets with paired RS images and descriptions, i.e., UCM-Captions, Sydney-Captions, RSICD, and NWPU-Captions. Experimental results on the four datasets demonstrate that MAFA-Net yields better performance than current state-of-the-art approaches.

Related work: The accelerated advancement of earth observation technology has led to an explosive growth of multi-modal and multi-source remote sensing data. Cross-modal retrieval in RS enables flexible and efficient query, has attracted extensive interest in recent years, and can be applied to natural disaster early warning, military intelligence generation, etc.

Significant efforts have been devoted to cross-modal retrieval for natural images. To probe fine-grained relationships between images and sentences, Chen et al. [1] proposed a cross-modal retrieval model (IMRAM) based on a recurrent attention technique. Lee et al. [2] proposed a stacked cross-attention based image-text retrieval model (SCAN) to learn more discriminative textual and visual feature representations. Wang et al. [3] proposed a multi-modal tensor fusion network (MTFN) to directly measure the similarity between different modalities through rank-based tensor fusion. Wang et al. [4] proposed a position focused attention network (PFAN) to improve cross-modal matching performance. Besides, to satisfy industrial requirements, Wu et al. [5] proposed a hashing approach that achieves large-scale cross-modal retrieval by learning a unified hash representation and deep hashing functions for different modalities in a self-supervised way. Although these achievements yield inspiring results for retrieval tasks on natural images, their robustness and generalization ability need to be verified when transferred to the RS field due to the intrinsic and extrinsic properties of RS data.

Motivated by the burgeoning demand for multi-modal requests in RS, such as military intelligence generation, researchers have paid more attention to RS cross-modal retrieval in recent years. To explore the semantic correlation between the visual features and textual descriptions of RS data, Abdullah et al. [6] proposed a novel deep bidirectional ternary network (DBTN) for the Text-to-Image (T2I) matching task through a feature fusion strategy. With regard to Image-to-Text (I2T) retrieval for RS data, Cheng et al. [7] proposed to use a cross-attention mechanism and a gating mechanism to enhance the association between RS images and descriptions, which was the first attempt to prove the possibility of bidirectional T2I and I2T retrieval in RS. Afterwards, Lv et al. [8] proposed a fusion-based correlation learning model (FCLM) to capture multi-modal complementary information and fusion features and to further supervise the learning of the feature extraction network. Yuan et al. [9] proposed an asymmetric multimodal feature matching network (AMFMN) to extract the salient visual features of RS images through a multi-scale visual self-attention technique, and exploited them to guide textual feature extraction. Moreover, they further designed a concise and efficient version of their cross-modal retrieval model, namely LW-MCR [10], on the basis of knowledge distillation. For fast and efficient retrieval on large-scale RS data, Mikriukov et al. [11] introduced a novel deep unsupervised cross-modal contrastive hashing model. Beyond image-sentence retrieval, there has also been work on visual-audio retrieval [12], image-sketch retrieval [13], cross-source panchromatic-multispectral image retrieval [14], [15], and zero-shot image-word matching [16].

There is no doubt that all of the above work advances cross-modal retrieval in RS from different aspects, including visual feature representation and description optimization strategies. However, current work on bidirectional image-sentence retrieval in RS is deficient in that: 1) Achievements on bidirectional image-sentence retrieval for RS data are very limited and comprehensive analysis is still lacking; without exception, current work [6]–[11] conducts comparative experiments only against baselines designed for natural images. 2) The generalization of existing approaches on much larger and more challenging RS captioning datasets needs to be verified; the size of the datasets used by existing approaches [6], [8]–[11] is limited (with a maximum of 24 333 original captions in RSICD [17] and 23 715 granular captions in RSITMD [9]). 3) Semantic ambiguity in complex scenes of RS data remains unsolved.

To address these limitations, we propose a novel cross-modal network for bidirectional T2I and I2T retrieval in RS. The contributions of our work are: 1) We differentiate visual features for complex scene representation by fusing multiple attention mechanisms and reinforce the cross-modal semantic association through a fine-grained alignment strategy. 2) We evaluate the effectiveness and robustness of the proposed network on a much larger dataset, NWPU-Captions, with 157 500 captions in total, along with several popular benchmark datasets.

MAFA-Net: The motivation of MAFA-Net is twofold. The first is to depict RS images, especially those with complex scenes, with more abstract and discriminative feature representations. The second is to address the semantic ambiguity existing between the different modalities of RS data by establishing fine-grained relevance between RS image regions and sentence words.

To this end, MAFA-Net consists of two main parts: a multi-attention fusion module and a fine-grained alignment module. The multi-attention fusion module aims to weaken the interference from background noise in RS images and enhance the salient objects, thereby improving the discriminative ability of the visual features. The fine-grained alignment module exploits sentence features as context information to further optimize and update the visual features of RS images. The overall architecture of MAFA-Net is shown in Fig. 1.
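The letter describes these modules at a conceptual level only. For intuition, the fine-grained alignment step can be pictured as a cross-attention between region features and word features, in the spirit of SCAN [2]. The sketch below is a minimal illustration under that assumption; the tensor shapes, the temperature lambda_softmax, and the function name fine_grained_align are ours, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def fine_grained_align(regions, words, lambda_softmax=9.0):
    """Illustrative cross-attention between image regions and sentence words.

    regions: (B, R, D) region features of an RS image batch
    words:   (B, W, D) word features of the paired sentences
    Returns a sentence-aware context vector for every region, shape (B, R, D).
    """
    regions_n = F.normalize(regions, dim=-1)
    words_n = F.normalize(words, dim=-1)
    # Region-word cosine similarities: (B, R, W)
    attn = torch.bmm(regions_n, words_n.transpose(1, 2))
    # Each region attends over the words of its paired sentence
    attn = F.softmax(lambda_softmax * attn, dim=-1)
    # Aggregate word features as textual context for every region: (B, R, D)
    return torch.bmm(attn, words)
```

The attended context vectors can then be compared with the original region features (e.g., by cosine similarity) to score how well a sentence matches an image.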

Fig. 1. The overall architecture of MAFA-Net.

Fig. 2. The architecture of multi-attention fusion module.

Dataset and metrics: Four RS datasets are selected to evaluate the performance of different approaches on the cross-modal image-sentence retrieval task.

1) UCM-Captions: This dataset is released by [18] based on the UCMerced dataset. The size of each image is 256×256 pixels, and the pixel resolution is 0.3048 m. Each of the 2100 images is described with five different sentences, and hence the dataset contains 10 500 descriptions in total.

2) Sydney-Captions: This dataset is released by [18] based on the Sydney dataset and includes 3065 descriptions for 613 cropped images. The original images have a size of 18 000×14 000 pixels and a pixel resolution of 0.5 m. Each cropped image is described by five varied sentences.

3) RSICD: This dataset [17] contains 10 921 RS images and 24 333 original descriptions in total, a scale larger than that of the two aforementioned datasets. Images in it are resized to 224×224 pixels, and the description set is expanded to 54 605 sentences by randomly duplicating existing descriptions.

4) NWPU-Captions: NWPU-Captions is provided by Wuhan University and Huazhong University of Science and Technology based on the NWPU-RESISC45 dataset. It covers 45 different classes, each containing 700 images. Each image is described by five sentences according to certain annotation rules, and the total number of descriptions is 157 500. This dataset is challenging due to its large scale and high variability.

We use R@K (K = 1, 5, 10), i.e., the proportion of queries whose correct match appears among the top-K retrieved results, to evaluate the performance of different approaches. A larger R@K indicates better performance.
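As a concrete reference, R@K can be computed from a query-gallery similarity matrix as sketched below. The single ground-truth index per query is a simplifying assumption (in these datasets each image has five captions, so in practice any of its captions counts as a hit); the function name and the random example data are illustrative only.

```python
import numpy as np

def recall_at_k(similarity, gt_index, k):
    """Fraction of queries whose ground-truth item appears in the top-k results.

    similarity: (N_query, N_gallery) similarity scores
    gt_index:   (N_query,) index of the correct gallery item for each query
    """
    # Rank gallery items for each query from most to least similar
    ranking = np.argsort(-similarity, axis=1)
    topk = ranking[:, :k]
    hits = (topk == gt_index[:, None]).any(axis=1)
    return hits.mean()

# Illustrative usage with random scores
sim = np.random.rand(100, 100)
gt = np.arange(100)
print(recall_at_k(sim, gt, 5))
```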

Experimental settings: In the training process, we set the batch size to 16 and the learning rate to 0.0005, which decays by a factor of 0.7 every 20 epochs. Training lasts 120 epochs in total. The margin threshold δ in the loss function is set to 0.2. The visual feature of each image region is 2048-dimensional, while each word feature is 300-dimensional. The hidden dimension of the Bi-GRU is 2048. During training, word features are initialized randomly and fed to the Bi-GRU.
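The letter specifies the margin δ = 0.2 but not the exact loss formulation. Networks in this line of work commonly use a bidirectional hinge-based triplet ranking loss over the in-batch similarity matrix; the sketch below assumes that choice, and the function name and the sum-over-negatives variant (rather than hardest-negative mining) are our illustrative choices, not necessarily the authors'.

```python
import torch

def bidirectional_triplet_loss(sim, delta=0.2):
    """Hinge-based triplet ranking loss over an image-sentence similarity matrix.

    sim: (B, B) similarities between B images and their B matching sentences;
         the diagonal holds the positive pairs.
    """
    batch_size = sim.size(0)
    positives = sim.diag().view(batch_size, 1)
    # Image-to-sentence direction: off-diagonal columns are negatives for each row
    cost_i2t = (delta + sim - positives).clamp(min=0)
    # Sentence-to-image direction: off-diagonal rows are negatives for each column
    cost_t2i = (delta + sim - positives.t()).clamp(min=0)
    # Exclude the positive pairs themselves
    mask = torch.eye(batch_size, dtype=torch.bool, device=sim.device)
    cost_i2t = cost_i2t.masked_fill(mask, 0)
    cost_t2i = cost_t2i.masked_fill(mask, 0)
    return cost_i2t.sum() + cost_t2i.sum()

# Usage: sim = image_features @ sentence_features.t() (both L2-normalized)
```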

Results and analysis: We conduct experiments on the four benchmark datasets, and Tables 1–4 report the results of various methods, including representative cross-modal models for natural images, such as IMRAM [1], SCAN [2], MTFN [3], and PFAN [4], and the latest models for RS data, such as FCLM [8], AMFMN [9], and LW-MCR [10].

It can be seen from Tables 1–4 that MAFA-Net generally achieves better retrieval performance than the other models on the four datasets, although on the first three datasets it occasionally underperforms slightly on some metrics. This might be related to the relatively small amount of data in UCM-Captions and Sydney-Captions, and to the unbalanced category distribution of Sydney-Captions. On the much larger and more challenging NWPU-Captions dataset, however, MAFA-Net achieves the best results on all evaluation metrics. The results of MAFA-Net on the four different datasets also demonstrate its robustness.

Table 1. Comparative Experimental Results on UCM-Captions

Table 2. Comparative Experimental Results on Sydney-Captions

We also conduct ablation experiments to evaluate the contributions of the multi-attention fusion module (MA) and the fine-grained alignment module (FA) to MAFA-Net. Table 5 reports the results on NWPU-Captions, in which _nMA_nFA denotes the basic network without the two modules, _nMA denotes the network without the MA module, and _nFA denotes the network without the FA module. It can be seen that each module alone significantly improves the retrieval performance of MAFA-Net, while their contributions are relatively close. Table 5 also tabulates the training and testing time of the different models on NWPU-Captions.

We further show the visualization results of our MAFA-Net in Figs. 3–6.

It can be seen that most of the retrieval results match the input, which indicates that the MAFA-Net proposed in this letter maintains a good semantic correspondence between RS images and sentences. It is worth mentioning that even for challenging high-density scenes with a great number of small and clustered objects, MAFA-Net still performs well (see Fig. 6).

Conclusion: In this letter, we propose a multi-attention fusion and fine-grained alignment network (MAFA-Net) for the cross-modal image-sentence retrieval task in the remote sensing domain. MAFA-Net aims at addressing the multi-scale property of RS imagery and the semantic ambiguity existing in cross-modal retrieval of RS data. Specifically, we design a multi-attention fusion module to improve the feature representation ability. Meanwhile, a fine-grained alignment module is designed to make the information of the two modalities (i.e., visual and textual) interact. Besides the three publicly available benchmark datasets, a much larger captioning dataset, NWPU-Captions, is utilized to evaluate the performance of MAFA-Net. Experimental results show that MAFA-Net outperforms current approaches and obtains satisfying results even for challenging high-density scenes. In the future, we would like to consider more modalities, such as LiDAR or multispectral images, as well as domain adaptation [19] for RS visual applications.

Table 3. Comparative Experimental Results on RSICD

Table 4. Comparative Experimental Results on NWPU-Captions

Table 5. Ablation Experimental Results on NWPU-Captions

Acknowledgments: This work was supported by the National Natural Science Foundation of China (42090012), Special Research and 5G Project of Jiangxi Province in China (20212ABC03A09), Guangdong-Macao Joint Innovation Project (2021A0505080008), Key R&D Project of Sichuan Science and Technology Plan (2022YFN0031), and Zhuhai Industry University Research Cooperation Project of China (ZH22017001210098PWC).

Fig. 3. Visualization results of MAFA-Net on UCM-Captions.

Fig. 4. Visualization results of MAFA-Net on Sydney-Captions.

Fig. 5. Visualization results of MAFA-Net on RSICD.

Fig. 6. Visualization results of MAFA-Net on NWPU-Captions.


