999精品在线视频,手机成人午夜在线视频,久久不卡国产精品无码,中日无码在线观看,成人av手机在线观看,日韩精品亚洲一区中文字幕,亚洲av无码人妻,四虎国产在线观看 ?

A novel machine learning approach (svmSomatic) to distinguish somatic and germline mutations using next-generation sequencing data

2021-04-15 04:36:52Yu-FangMao,Xi-GuoYuan,Yu-PengCun
Zoological Research 2021年2期

DEAR EDITOR,

Somatic mutations are a large category of genetic variations,which play an essential role in tumorigenesis.Detection of somatic single nucleotide variants (SNVs) could facilitate downstream analysis of tumorigenesis.Many computational methods have been developed to detect SNVs, but most require normal matched samples to differentiate somatic SNVs from the normal state, which can be difficult to obtain.Therefore, developing new approaches for detecting somatic SNVs without matched samples are crucial.In this work, we detected somatic mutations from individual tumor samples based on a novel machine learning approach, svmSomatic,using next-generation sequencing (NGS) data.In addition, as somatic SNV detection can be impacted by multiple mutations,with germline mutations and co-occurrence of copy number variations (CNVs) common in organisms, we used the novel approach to distinguish somatic and germline mutations based on the NGS data from individual tumor samples.In summary,svmSomatic: (1) considers the influence of CNV cooccurrence in detecting somatic mutations; and (2) trains a support vector machine algorithm to distinguish between somatic and germline mutations, without requiring normal matched samples.We further tested and compared svmSomatic with other common methods.Results showed that svmSomatic performance, as measured by F1-score, was significantly better than that of others using both simulation and real NGS data.

In recent years, many developed tools have achieved good results in somatic mutation detection.These approaches can be classified into two categories: i.e., those using paired tumor-normal samples to distinguish somatic mutations from uncommon germline polymorphisms, e.g., VarDict (Lai et al.,2015), Muse (Fan et al., 2016), and FaSD-somatic (Wang et al., 2014), and those using tumor samples without normal matched samples, e.g., SomVarIUS (Smith et al., 2016),SNVer (Wei et al., 2011), and ISOWN (Kalatskaya et al.,2017).The first detection category has the advantage of excluding germline mutations with allele frequencies ≥1% in global populations (Sherry et al., 2001).However, rare germline mutations specific to an individual can affect the detection of somatic mutations.Furthermore, obtaining matched normal samples in clinical practice can be difficult.The second detection category can save on sequencing costs and is favored in clinical practice.However, some novel single nucleotide variants (SNVs) found in individuals will severely influence somatic mutation detection accuracy, resulting in higher false positives (Liu et al., 2016).In general, existing methods achieve relatively good detection results, but these tools only consider one type of variation in the genome.

With the above considerations, we propose a new machine learning-based method, named svmSomatic, to distinguish somatic and germline mutations without normal matched samples using next-generation sequencing (NGS) cancer genome data.The svmSomatic approach incorporates copy number variation (CNV) analysis in somatic mutation detection, extracts a set of somatic-relevant features at each site, and trains the support vector machine (SVM) classifier.We applied svmSomatic using real and simulation sequencing data.Results showed that this method is superior to others with consideration of the influence of CNVs.

The svmSomatic procedure workflow is shown in Figure 1A.The process starts with input of a tumor sample without normal matched samples and a human reference genome,followed by short-read alignment.As svmSomatic is focused on distinguishing somatic SNVs from germline SNVs and considers the influence of CNVs, we used existing methods to first detect SNVs and CNVs.Therefore, svmSomatic follows afour-step process for task learning.In the first step, five somatic SNV-related features are extracted: i.e., read depth,allele frequency (AF), mapping quality, mismatched reads,and copy number of each site.In the second step, the SVM is employed to complete classification (Hastie & Tibshirani,1998).In the third step, the SMV classifier is trained with the labeled samples.In the final step, the trained SVM classifier is used to distinguish between germline and somatic mutations.

Figure 1 Overview of svmSomatic method and performance comparison among five methods

The detection of CNVs and SNVs is the first step before running the svmSomatic algorithm.Currently, many existing methods can detect CNVs and SNVs.We chose our previously proposed method STIC (Yuan et al, 2020b) for the detection of SNVs and the classic method FREEC (Boeva et al., 2012) for the detection of CNVs.Both methods can work on single tumor samples without normal matched samples and exhibit reasonable performance, even when tumor purity(fraction of tumor cells in tumor tissue mixture) is relatively low.We also conducted a simulation experiment to demonstrate the performance of the two methods, with results presented in Supplementary Text 1.It should be noted that this preprocess step is relatively independent from the implementation of the svmSomatic algorithm and users can choose other methods for the detection of CNVs and SNVs according to their requirements.

Genomic data were extracted using BWA (Li & Durbin,2009) and SAM tools (Li et al., 2009).Four features were extracted from the Pileup file, including read depth, number of mismatched reads, AF, and average mapping quality.Finally,according to the FREEC results, copy number information was added to each SNV site as the fifth feature.These five features are associated with SNVs (Yuan et al., 2020b).Read depth denotes the number of reads aligned on some sites and provides important information for the deduction of copy number and number of variant alleles.AF can distinguish germline and somatic mutations.Due to the influence of tumor purity and copy number, the number of mismatched reads will vary, and the AF value will deviate from the ideal.Average mapping quality also considers sequencing errors.These five features show good separability and reliability, allowing the classifier to easily distinguish between somatic and germline mutations.Table 1 shows the features and their corresponding definitions.

Distinction between somatic and germline mutations is primarily achieved through AF.Studies have shown that for heterozygous and homozygous genotypes, the AF of germline SNVs is 0.5 and 1, respectively (Xu, 2018).However, when germline AF is involved in somatic copy number change events, it may deviate from 0.5 or 1.Similarly, AF with somatic mutations can fluctuate due to CNV, normal tissue mixing, and subcloning (Cun et al., 2018; Xi et al., 2020).Therefore, it is necessary to add copy number as a feature to the classifier.

Here, the SVM was selected as the algorithm classifier as itshows outstanding performance in classification problems.The design of the SVM classifier considers the distance between different categories to determine the optimal classification boundary by maximizing the distance between classes (Guyon et al., 1993; Lappalainen et al., 2015).We used the SVM as a binary classifier.Further details can be found in Supplementary Text 2.

Table 1 Description of five extracted features

Crucially, the SVM classifier must be trained before performing classification.We trained the SVM classifier using simulation datasets.In brief, 10-fold cross-validation was used to assess algorithm performance and chose the best classification strategy.We generated 100 000 SNVs,containing 50 000 somatic mutations and 50 000 germline mutations.The training dataset contain 45 000 randomly selected somatic mutations and 45 000 randomly selected germline mutations.The training dataset contained only two data types, labeled 1 and 0, representing germline and somatic mutations, respectively.The best parameter combination was chosen using 10-fold cross-validation based on the highest F1-score.Further details can be found in Supplementary Text 3.

The new approach consists of two parameters, i.e., C and γ.The best method to determine the optimal parameter values in space was C={ 1.0,10.0,100.0,1000.0} and γ={0.001,0.01,0.1,1.0,10.0}, with the parameter combination C=1 000.0 and γ=0.1.However, due to hyperparameter distribution characteristics (Liu et al., 2006), the best combination was not unique.Here, we only present an optimal combination.

To evaluate performance, we applied the newly proposed method using the simulation datasets.As the simulation data showed a clear pattern, we calculated sensitivity and precision of the simulation experiment results and then used the F1-scores for comprehensive evaluation (Yuan et al., 2012,2017).In addition, we compared the new approach to four classic methods (i.e., STIC (Yuan et al, 2020b), FaSD-somatic(Wang et al., 2014), SNVSniffer (Liu et al., 2016), and VarScan2 (Koboldt et al., 2012)) using their default parameters for reasonable and fair comparison.

SInC (Pattnaik et al., 2014) was used to generate sequencing reads of chromosome 21.A total of 100 000 somatic SNVs and 100 000 germline SNVs were simulated.Half of the SNVs were heterozygous and the other half homozygous.We also simulated 226 CNVs in chromosome 21 ranging in length from 10 000 to 100 000.The simulated CNV types included gain and loss with copy numbers of 0, 1,3, 4, 5, and 6.To simulate different tumor purity levels, a pair of tumor-normal matched genomes was prepared.The tumor genome contained 200 000 SNVs and the normal genome contained only germline SNVs.FASTQ files from mixed samples with tumor purity ranging from 0.2 to 0.8 were generated.The sequencing coverage depths were 10X, 20X,30X, 40X, and 50X.To reduce the influence of noise from instruments and equipment, 10 simulation experiments for each coverage were carried out.The results presented are the average of the 10 replicates.Comparisons of the svmSomatic approach and four other methods were performed with the above data.Results are shown in Figure 1B, with coverage of 30X, and Supplementary Text 4.The recall and precision results of the five methods are presented in Supplementary Text 5.

As shown in Figure 1B, the prediction of somatic SNVs improved with the increase in tumor purity; when tumor purity remained constant, prediction of somatic SNVs increased with the increase in coverage.In contrast, for STIC, the overall performance fluctuated with the increase in tumor purity.Somatic SNV prediction by STIC was dependent on AF, and thus was impacted by the increase in copy number.SNVSniffer and VarScan2 achieved satisfactory results at various coverages.However, FaSD-somatic was greatly affected by coverage, and only achieved good results when coverage was high.Overall, the performance of svmSomatic showed advantage over the other methods.

The svmSomatic method was also applied to real data.As several of the methods (FaSD-somatic (Wang et al., 2014),VarScan2 (Koboldt et al., 2012), and SNVSniffer (Liu et al.,2016)) require matched samples for comparison, we collected paired tumor-normal samples (EGAR00001008630 and EGAR00001008681) for this experiment.Figure 1C shows the results of svmSomatic and other methods for chromosome 21.The blacked numbers in the table represents the number of somatic SNVs detected.SvmSomatic predicted the largest number of somatic SNVs, followed by STIC (Yuan et al,2020b), FaSD-somatic (Wang et al., 2014), SNVSniffer (Liu et al., 2016), and VarScan2 (Koboldt et al., 2012).For sample data, the F1-score could not be calculated.Thus, to evaluate method performance using real data, overlap among the five methods was analyzed using the overlapping density score(ODS), which developed by Yuan (Yuan et al, 2020a) as expressed in Equation (1).

whereNsis the mean number of overlaps of one method with other methods andNpis the mean number of overlaps divided by the total predictions by the method.Here, we assumed that the overlaps between different methods were true positives.Thus,Nscould be defined as sensitivity andNpcould be defined as precision.The product ofNsandNpis similar to the area under an ROC curve (AUC), but the greater the value,the higher the performance.ODS(FaSD-somatic)=137.7,ODS(VarScan2)=101.8, ODS(SNVSniffer2)=45.2, ODS(STIC)=887.5, ODS(svmSomatic)=899.3, svmSomatic had the highest ODS value, followed by STIC, FaSD-somatic, VarScan2, and SNVSniffer.These results indicate that svmSomatic has a higherNs, higher mean number of overlaps with other methods, and higher sensitivity.Overall, svmSomatic showed slightly better results compared to the simulation data when applied to real data.

In this paper, we developed a new open-source method(svmSomatic) to distinguish somatic SNVs from germline SNVs in tumor-only NGS data.SvmSomatic considers the influence of copy number variation when distinguishing SNVs.Furthermore, it is a single-sample-based method that does notrequire normal matched samples.The approach can be applied for individual chromosomes as well as whole exome and genome data.The detection of somatic SNVs should facilitate downstream research on tumors, including gene annotation and targeted drug therapy.SvmSomatic is written in Python language and implemented on the Linux system.The source code and manual documents are freely available at https://github.com/BDanalysis/svmSomatic.

SUPPLEMENTARY DATA

Supplementary data to this article can be found online.

COMPETING INTERESTS

The authors declare that they have no competing interests.

AUTHORS’ CONTRIBUTIONS

Y.F.M.and X.G.Y.participated in the design of algorithms and experiments.Y.F.M.and Y.P.C.participated in analysis of the performance of the proposed method.Y.P.C.and X.G.Y.directed and conceived the study and helped edited the manuscript.All authors read and approved the final version of the manuscript.

主站蜘蛛池模板: 国产亚洲精品在天天在线麻豆 | 五月婷婷伊人网| 国产成人精品18| 国产区在线看| 九九免费观看全部免费视频| 亚洲欧美日韩成人高清在线一区| 精品亚洲欧美中文字幕在线看| 日韩人妻精品一区| 国产亚洲欧美在线人成aaaa| 日a本亚洲中文在线观看| a亚洲视频| 中文一级毛片| 精品伊人久久久香线蕉| 久久一级电影| 成人一级黄色毛片| 波多野结衣中文字幕久久| 麻豆国产在线观看一区二区| 亚洲三级视频在线观看| 亚洲日韩精品欧美中文字幕 | 欧美亚洲香蕉| 亚洲精品无码成人片在线观看| 欧美a在线看| 国产成人综合亚洲网址| 久久综合成人| 精品久久久久无码| 18禁色诱爆乳网站| 天堂成人在线| 一级毛片在线播放免费| 国模私拍一区二区| 国产麻豆精品久久一二三| 99视频全部免费| V一区无码内射国产| 最新加勒比隔壁人妻| 全裸无码专区| 国产成人久视频免费| 欧美一级专区免费大片| 精品久久高清| 国产毛片不卡| 国产激情影院| 中文字幕在线一区二区在线| 亚洲福利一区二区三区| 亚洲第一区欧美国产综合| 亚洲无码高清一区二区| 国内精自线i品一区202| 成人国内精品久久久久影院| 中文字幕 91| 四虎亚洲国产成人久久精品| 美女亚洲一区| 天天做天天爱天天爽综合区| 男人的天堂久久精品激情| 久久精品午夜视频| 国产另类视频| 一级毛片免费的| 波多野结衣一区二区三区88| 成人在线视频一区| 久久精品只有这里有| 丁香亚洲综合五月天婷婷| 亚洲日本中文字幕乱码中文| 国产精品流白浆在线观看| 97青草最新免费精品视频| 成人午夜福利视频| 久草热视频在线| 91福利在线观看视频| 91福利免费视频| 国产91小视频在线观看| 欧美在线免费| 亚洲日本精品一区二区| 农村乱人伦一区二区| 美女国产在线| 99中文字幕亚洲一区二区| 一级黄色片网| 国产成人91精品免费网址在线| 国产又黄又硬又粗| 日韩A级毛片一区二区三区| 国产69精品久久| 啊嗯不日本网站| 国产99视频精品免费视频7| 亚洲精品在线影院| 日韩精品免费在线视频| 久久综合色天堂av| a免费毛片在线播放| 园内精品自拍视频在线播放|