
ANALYSIS OF THE GENOMIC DISTANCE BETWEEN BAT CORONAVIRUS RATG13 AND SARS-COV-2 REVEALS MULTIPLE ORIGINS OF COVID-19

2021-06-17 14:00:08裴少君

Shaojun PEI(裴少君)

Department of Mathematical Sciences, Tsinghua University, Beijing 100084, China

Stephen S.-T. YAU(丘成棟)

Department of Mathematical Sciences, Tsinghua University, Beijing 100084, China; Yanqi Lake Beijing Institute of Mathematical Sciences and Applications, Beijing 101408, China

E-mail:yau@uic.edu

Abstract The severe acute respiratory syndrome COVID-19 was discovered on December 31, 2019 in China. Subsequently, COVID-19 cases were reported in many other countries. However, some positive COVID-19 samples in other countries, such as France and Italy, predate the cases officially accepted by their health authorities. Thus, it is of great importance to determine the place where SARS-CoV-2 was first transmitted to humans. To this end, we analyze SARS-CoV-2 genomes using the k-mer natural vector method and compare the similarities of global SARS-CoV-2 genomes with a new natural metric. Because it is commonly accepted that SARS-CoV-2 originated from bat coronavirus RaTG13, we only need to determine which SARS-CoV-2 genome sequence is closest to bat coronavirus RaTG13 under our natural metric. Our analysis indicates that SARS-CoV-2 most likely already existed in other countries, such as France, India, the Netherlands, England and the United States, before the outbreak in Wuhan, China.

Key words SARS-CoV-2; multiple origins of COVID-19; mathematical genomic distance; k-mer natural vector

1 Introduction

The severe acute respiratory syndrome COVID-19 was reported on December 31, 2019 in Wuhan (Hubei province, China) and is caused by a new type of coronavirus called SARS-CoV-2. SARS-CoV-2 is the seventh coronavirus known to be pathogenic to humans; the other six are MERS-CoV, SARS-CoV, HCoV-229E, HCoV-HKU1, HCoV-NL63, and HCoV-OC43. Although SARS-CoV-2 has a lower mortality rate than SARS-CoV, it is highly contagious and less detectable, with a long incubation period [1], which makes it more threatening than the other coronaviruses.

The early SARS-CoV-2 cases were associated with a seafood market in Wuhan, but the origin and intermediate host of the virus are still unclear. Bat coronavirus RaTG13 is the sequence most similar to SARS-CoV-2 found so far, which provides evidence for a bat origin of SARS-CoV-2 [2]. However, bat coronavirus RaTG13 was collected in 2013 and forms a lineage distinct from SARS-CoV-2, so it could not have been transmitted to humans directly. A series of studies on possible intermediate hosts, including pangolin and mink, has therefore been conducted [3-5]. Another controversial issue is the place of the earliest human-to-human SARS-CoV-2 transmission. Although the earliest cases reported in other countries were generally in February 2020, more and more studies indicate that SARS-CoV-2 was already spreading in December 2019 in France, Italy and the United States [6-8]. However, complete sequences of these samples are not available. We therefore analyze the relationships among the existing SARS-CoV-2 sequences to infer the early transmission of SARS-CoV-2 in human hosts.

Traditional methods for analyzing the relationships among genome sequences are based on multiple sequence alignment (MSA). After alignment, a matrix of similarities between genome sequences is obtained. However, this similarity does not satisfy the triangle inequality required of a mathematical distance [9], so it cannot reflect the real biological distance between genome sequences. In this paper, we use a mathematical method called the k-mer natural vector method to encode the high-quality complete genome sequences in GISAID (https://www.gisaid.org/) as vectors in Euclidean space [10,11]. A new natural distance between these vectors is then defined to measure the relationships among sequences. Based on the results, we conclude that SARS-CoV-2 most likely already existed in other countries, such as France, India, the Netherlands, England and the United States, before the outbreak in Wuhan, China.

2 Materials and Methods

2.1 Dataset

All the complete genome sequences of SARS-CoV-2 available on GISAID up to July 19, 2020 were downloaded. To ensure the accuracy of the analysis, low-quality sequences containing letters other than A, C, G and T were eliminated, leaving 15,641 sequences in our dataset. The accession numbers of these SARS-CoV-2 sequences are listed in supplementary file 1. All the reference sequences of ss-RNA viruses were downloaded from NCBI up to March 23, 2020. From these we removed three types of sequences: (1) viruses without a family label; (2) families containing only one or two sequences; and (3) viruses whose sequences include letters other than A, C, G and T. In total, 2,051 sequences belonging to 40 families were retained. The details of these sequences are given in supplementary file 2.
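The filtering rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `is_clean`, `filter_clean` and `drop_small_families`, the upper-casing of input, and the toy records are hypothetical and are not taken from the authors' pipeline.

```python
from collections import Counter
from typing import Dict

VALID_BASES = frozenset("ACGT")


def is_clean(sequence: str) -> bool:
    """True if the sequence is non-empty and uses only A, C, G and T."""
    seq = sequence.upper()  # assumption: lower-case input is accepted
    return len(seq) > 0 and set(seq) <= VALID_BASES


def filter_clean(records: Dict[str, str]) -> Dict[str, str]:
    """Keep only records whose sequence passes is_clean (rule 3)."""
    return {name: seq.upper() for name, seq in records.items() if is_clean(seq)}


def drop_small_families(families: Dict[str, str], min_size: int = 3) -> Dict[str, str]:
    """Drop unlabeled viruses (rule 1) and families with one or two members (rule 2)."""
    counts = Counter(fam for fam in families.values() if fam)
    return {name: fam for name, fam in families.items()
            if fam and counts[fam] >= min_size}


# Toy data, invented for illustration only.
toy_sequences = {"seq_a": "ACGTACGT", "seq_b": "ACGNNACG", "seq_c": "acgtt"}
toy_families = {"v1": "FamX", "v2": "FamX", "v3": "FamX", "v4": "FamY", "v5": ""}

clean_sequences = filter_clean(toy_sequences)
kept_families = drop_small_families(toy_families)
```

In this toy run `seq_b` is discarded because it contains `N`, and only the three `FamX` members survive the family filter.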

2.2 K-mer Natural Vector

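Definition 2.1 Let $S=s_1s_2s_3\ldots s_n$ be a genomic sequence of length $n$, where $s_i\in\{A,C,G,T\}$, $i=1,\ldots,n$. A k-mer is a string of $k$ consecutive nucleotides within a genomic sequence. For a given positive integer $k$, there are $4^k$ types of k-mers, denoted $l_1,\ldots,l_{4^k}$. The k-mer natural vector of the genomic sequence is composed of three components for each k-mer $l_i$: its number of occurrences in $S$, the mean position of those occurrences, and the normalized central moments of the positions.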

The correspondence between a genomic sequence and its associated k-mer natural vector is one-to-one, and for any given k-mer $l_i$ the higher central moments converge to zero quickly for a randomly generated sequence [11]. For example, for bat coronavirus RaTG13, the third central moments are of magnitude $10^{-3}$, significantly smaller than the second central moments, which are of magnitude $10^{3}$. The third-moment components therefore have little effect on the Euclidean distance between k-mer natural vectors. Thus, we calculate only up to the second central moment in our experiments.
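A minimal Python sketch of a k-mer natural vector truncated at the second central moment is given here for illustration. It assumes the usual natural vector construction of [11]: for each k-mer, the count, the mean of the 1-based start positions, and the sum of squared deviations from that mean divided by the count times the number of k-mer positions $n-k+1$. The function name `kmer_natural_vector`, the zero entries for absent k-mers and this exact normalization are assumptions, not details confirmed by the text above.

```python
from itertools import product
from typing import List


def kmer_natural_vector(sequence: str, k: int) -> List[float]:
    """Return counts, mean positions and second central moments (3 * 4**k values).

    Assumes the sequence contains only A, C, G and T.
    """
    seq = sequence.upper()
    n_positions = len(seq) - k + 1  # number of k-mer start positions
    # Enumerate all 4**k k-mers in lexicographic order over A, C, G, T.
    index = {"".join(p): i for i, p in enumerate(product("ACGT", repeat=k))}
    positions: List[List[int]] = [[] for _ in range(len(index))]
    for start in range(n_positions):
        positions[index[seq[start:start + k]]].append(start + 1)  # 1-based

    counts: List[float] = []
    means: List[float] = []
    moments: List[float] = []
    for pos in positions:
        n_l = len(pos)
        if n_l == 0:
            # Assumption: a k-mer that never occurs contributes zeros.
            counts.append(0.0)
            means.append(0.0)
            moments.append(0.0)
            continue
        mu = sum(pos) / n_l
        d2 = sum((p - mu) ** 2 for p in pos) / (n_l * n_positions)
        counts.append(float(n_l))
        means.append(mu)
        moments.append(d2)
    return counts + means + moments


vec1 = kmer_natural_vector("ACGTA", 1)  # 12 entries for k = 1
vec2 = kmer_natural_vector("ACGTA", 2)  # 48 entries for k = 2
```

For the toy sequence `ACGTA` and k = 1, the letter A occurs at positions 1 and 5, so its count is 2, its mean position is 3 and its second moment under the assumed normalization is 8 / (2 · 5) = 0.8.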

2.3 A new natural metric on the space of genome sequences

For a given $k$, each genomic sequence is associated with a $3\times 4^k$-dimensional k-mer natural vector in Euclidean space. Most previous studies consider only one specific $k$ when measuring the distances between sequences [12,13], so one tricky problem is how to choose the value of $k$. We believe instead that a natural metric should involve all the k-mers for $k\geq 1$, and we therefore propose a new metric containing the information of all the k-mers for $k\geq 1$.

Definition 2.2 Let $d_k(v_1,v_2)$ be the Euclidean distance between the k-mer natural vectors $v_1,v_2$ of two genome sequences $s_1,s_2$ for all $k\geq 1$. Then the new natural metric of the two genome sequences $s_1,s_2$ is defined as $D_k(s_1,s_2)=d_1(v_1,v_2)+\ldots$

Theorem 2.3 The new metric $D_k$ satisfies the following properties:

• Non-negativity: $D_k(s_1,s_2)\geq 0$.

• Positivity: if $D_k(s_1,s_2)=0$, then $s_1=s_2$.

• Symmetry: $D_k(s_1,s_2)=D_k(s_2,s_1)$.

• Triangle inequality: $D_k(s_1,s_2)\leq D_k(s_1,s_3)+D_k(s_2,s_3)$.

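Proof Since $d_k(v_1,v_2)\geq 0$ for all $k\geq 1$, we have $D_k(s_1,s_2)\geq 0$.

If $D_k(s_1,s_2)=0$, then $d_i(v_1,v_2)=0$ for $i=1,\ldots,k$, and therefore $v_1=v_2$. By the one-to-one correspondence between a genome sequence and its k-mer natural vector [11], $s_1=s_2$.

Symmetry and the triangle inequality hold for each Euclidean distance $d_i$ and carry over to $D_k$ term by term.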

The beauty of our new natural metric is that it contains information of the distributions from 1-mer to k-mer and is a mathematical metric for two genome sequences.
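The construction of Definition 2.2 can be sketched in Python as a positively weighted sum of the Euclidean distances $d_1,\ldots,d_k$. The weights are a hypothetical parameter: the text here does not show the terms after $d_1$, so the default of equal weights is an assumption made purely for illustration, as are the function names `euclidean` and `natural_metric`. The inputs are precomputed natural vectors, one per k-mer length.

```python
import math
from typing import Optional, Sequence


def euclidean(u: Sequence[float], v: Sequence[float]) -> float:
    """Euclidean distance between two vectors of equal length."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def natural_metric(vectors_a: Sequence[Sequence[float]],
                   vectors_b: Sequence[Sequence[float]],
                   weights: Optional[Sequence[float]] = None) -> float:
    """Weighted sum of d_1, ..., d_k.

    vectors_a[i] and vectors_b[i] are the (i+1)-mer natural vectors of the
    two sequences. The weights are hypothetical; equal weights are assumed
    when none are given.
    """
    if len(vectors_a) != len(vectors_b):
        raise ValueError("both sequences need vectors for the same k values")
    if weights is None:
        weights = [1.0] * len(vectors_a)
    if len(weights) != len(vectors_a) or any(w <= 0 for w in weights):
        raise ValueError("one positive weight is required per k")
    return sum(w * euclidean(u, v)
               for w, u, v in zip(weights, vectors_a, vectors_b))


# Toy vectors, invented for illustration (not real natural vectors).
a = [[0.0, 0.0], [0.0, 0.0, 0.0]]
b = [[3.0, 4.0], [1.0, 2.0, 2.0]]
c = [[1.0, 1.0], [2.0, 0.0, 1.0]]

dist_ab = natural_metric(a, b)               # 5 + 3
dist_ab_weighted = natural_metric(a, b, [1.0, 0.5])
```

Because every term is a Euclidean distance multiplied by a positive weight, the sum inherits non-negativity, symmetry and the triangle inequality regardless of the particular weights chosen.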

3 Results

3.1 The choice of the most accurate natural metric by nearest-neighbor classification of ss-RNA viruses

The new metric $D_k$ is given in Definition 2.2, where $d_k$ is the Euclidean distance between the k-mer natural vectors of two genome sequences. Because of limited computing capacity, we cannot use very large values of $k$, so all the reference sequences of ss-RNA viruses, which belong to 40 families, are used to determine which metric $D_k$ to choose. For $k$ from 1 to 11, we use the new metric $D_k$ as the distance to perform nearest-neighbor classification of the virus families. The results are shown by the black bars in Figure 1. The highest classification accuracy, 91.1%, is reached at $k=7$. For comparison, we also calculate the nearest-neighbor classification accuracies using $d_1$ to $d_{11}$ individually, shown by the white bars in Figure 1. The new natural metric is more accurate, so we choose $D_7$ for the subsequent analysis of SARS-CoV-2 genome sequences.

Figure 1 The classification accuracies of ss-RNA virus families for different $k$. The accuracies obtained with $D_k$ are shown as black bars and those obtained with $d_k$ as white bars.
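The nearest-neighbor evaluation described in this subsection can be sketched as follows. The sketch assumes a leave-one-out protocol, in which each sequence is assigned the family of its closest other sequence; the text does not state the exact protocol, so this is an assumption, and the function name and toy data are hypothetical.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def leave_one_out_1nn_accuracy(items: Sequence[T],
                               labels: Sequence[str],
                               distance: Callable[[T, T], float]) -> float:
    """Fraction of items whose nearest other item carries the same label."""
    if len(items) != len(labels) or len(items) < 2:
        raise ValueError("need at least two items and one label per item")
    correct = 0
    for i, item in enumerate(items):
        # Nearest neighbor among all other items (ties go to the lowest index).
        nearest = min((j for j in range(len(items)) if j != i),
                      key=lambda j: distance(item, items[j]))
        if labels[nearest] == labels[i]:
            correct += 1
    return correct / len(items)


# Toy one-dimensional "genomes", invented for illustration only.
toy_items = [0, 1, 2, 50, 51, 52, 24]
toy_labels = ["a", "a", "a", "b", "b", "b", "b"]
toy_accuracy = leave_one_out_1nn_accuracy(
    toy_items, toy_labels, lambda x, y: abs(x - y))
```

In the toy data only the item 24 is misclassified, since its nearest neighbor (2) belongs to the other family. In practice `distance` would be the chosen metric applied to precomputed natural vectors, and the accuracy would be computed once per candidate $k$.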

3.2 The new natural metric between RaTG13 and SARS-CoV-2 genome sequences

In a previous study, bat coronavirus RaTG13 was identified as the closest relative of SARS-CoV-2 [2]. We therefore calculate the distance between RaTG13 and each SARS-CoV-2 sequence under our new natural metric to analyze the transmission of SARS-CoV-2 in human hosts. Following the classification accuracies above, we choose $k=7$. The distances $D_7$ between the genome sequence of RaTG13 and all the SARS-CoV-2 genome sequences in our dataset are ranked. The five SARS-CoV-2 genome sequences with the shortest distances are shown in Table 1; they were collected in France, India, the Netherlands, England and the United States, respectively. The distances of the other sequences are given in supplementary file 3. The distance between the SARS-CoV-2 sequence collected in Wuhan and bat coronavirus RaTG13 is 31006.95, ranking 426th. This means that the SARS-CoV-2 genome sequences in Table 1, collected from these five countries, are more similar to bat coronavirus RaTG13 than the SARS-CoV-2 sequence collected in Wuhan is. These results indicate that the place where human-to-human SARS-CoV-2 transmission first happened is extremely unlikely to be Wuhan, but rather France, India, the Netherlands, England or the United States, with an accuracy rate higher than 91%.

Table 1 The top five genome sequences of SARS-CoV-2 with the shortest distance $D_7$
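The ranking step behind Table 1 amounts to sorting all candidate sequences by their distance to a single reference. A minimal sketch is shown below; the function name `rank_by_distance`, the tie-breaking by name and the toy values are hypothetical and serve only to illustrate the procedure.

```python
from typing import Callable, Dict, List, Tuple, TypeVar

T = TypeVar("T")


def rank_by_distance(reference: T,
                     candidates: Dict[str, T],
                     distance: Callable[[T, T], float]
                     ) -> List[Tuple[int, str, float]]:
    """Return (rank, name, distance) triples, closest candidate first.

    Ties are broken alphabetically by name (an assumption of this sketch).
    """
    scored = sorted((distance(reference, value), name)
                    for name, value in candidates.items())
    return [(rank, name, dist)
            for rank, (dist, name) in enumerate(scored, start=1)]


# Toy one-dimensional stand-ins for genome vectors, invented for illustration.
toy_reference = 100
toy_candidates = {"sample_x": 130, "sample_y": 95, "sample_z": 102}

toy_ranking = rank_by_distance(
    toy_reference, toy_candidates, lambda x, y: abs(x - y))
top_two = toy_ranking[:2]
```

With real data, `reference` would be the natural vectors of RaTG13, `candidates` the vectors of the SARS-CoV-2 sequences, and the first five entries of the result would correspond to the rows of a table such as Table 1.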

4 Discussion

Since December 2019, the severe respiratory disease COVID-19 has spread globally. However, the first cases of COVID-19 could be earlier than those officially reported in many countries. Many studies have detected SARS-CoV-2 in earlier preserved biological or environmental samples. For example, Sridhar et al. identified SARS-CoV-2-reactive antibodies and suggested that SARS-CoV-2 may have appeared in the United States in December 2019: of 7,389 samples, 106 were reactive by pan-Ig. The study could not confirm whether these positive tests came from community transmission or travel-related transmission, because only 11 of the volunteers who donated blood had recently been to Asia [7]. Carrat et al. reported, based on anti-SARS-CoV-2 IgG tests, that SARS-CoV-2 infection may have occurred in France as early as November 2019 [8]. The accurate identification of the origin of SARS-CoV-2 is therefore a very important problem. However, these studies do not provide complete genome sequences, so their samples cannot be analyzed by our method. In this paper, we not only provide a novel metric for studying viral sequences based on the k-mer natural vector, but also apply it to the existing complete genome sequences of SARS-CoV-2 to identify the early circulation of SARS-CoV-2.

Previous k-mer-based methods consider only the frequencies of k-mers for one fixed value of $k$ [12,13]. Here, the k-mer natural vectors contain both the frequency and the distribution of k-mers in the genome sequences, and the correspondence between a genome sequence and its k-mer natural vector is one-to-one. In particular, k-mers of every length from 1 to $k$ are involved in our newly defined metric, so it can reflect the real biological distance between genome sequences and does not lose any information.

It is commonly accepted that SARS-CoV-2 originated from bat coronavirus RaTG13, and it is uncertain whether the SARS-CoV-2 reference genome (NC_045512.2) [14] is earlier than the emerging strains. We therefore choose bat coronavirus RaTG13 as the reference and determine which SARS-CoV-2 genome sequence is closest to it under our natural metric. According to the ranking of distances, SARS-CoV-2 most likely already existed in other countries, such as France, India, the Netherlands, England and the United States, before the outbreak in Wuhan. Our result thus shows that Wuhan is extremely unlikely to be the first place of human-to-human SARS-CoV-2 transmission.

Acknowledgements We thank the researchers worldwide who sequenced the complete genomes of SARS-CoV-2 and other coronaviruses and shared them through GISAID (https://www.gisaid.org/).
