

Callari et al.

Genome Medicine (2017) 9:35


DOI 10.1186/s13073-017-0425-1

METHOD Open Access

Intersect-then-combine approach:
improving the performance of somatic
variant calling in whole exome sequencing
data using multiple aligners and callers
Maurizio Callari, Stephen-John Sammut, Leticia De Mattos-Arruda, Alejandra Bruna, Oscar M. Rueda,
Suet-Feung Chin* and Carlos Caldas*

Abstract
Bioinformatic analysis of genomic sequencing data to identify somatic mutations in cancer samples is far from
achieving the required robustness and standardisation. In this study we generated a whole exome sequencing
benchmark dataset using the platinum genome sample NA12878 and developed an intersect-then-combine (ITC)
approach to increase the accuracy in calling single nucleotide variants (SNVs) and indels in tumour-normal pairs. We
evaluated the effect of alignment, base quality recalibration, mutation caller and filtering on sensitivity and false
positive rate. The ITC approach increased the sensitivity up to 17.1%, without increasing the false positive rate per
megabase (FPR/Mb) and its validity was confirmed in a set of clinical samples.
Keywords: Somatic mutation, Variant calling, Whole exome sequencing, NA12878, Platinum genome, Mutect2,
Strelka, BWA, Novoalign, Filtering

* Correspondence: [email protected]; [email protected]
Equal contributors
CRUK Cambridge Institute, University of Cambridge, Cambridge, UK

© The Author(s). 2017 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0
International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and
reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to
the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver
(http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

Background
The rapid development of high-throughput sequencing (or next generation sequencing (NGS)) technology has
enabled great progress in cancer genomics. Decreases in costs and increases in data output have resulted in the
systematic collection of genome-scale data in large tumour cohorts [1, 2], improving our understanding of
the mechanisms underlying tumour development, progression and response to treatments, in addition to setting
the foundation for precision medicine [3, 4]. However, in contrast with the increasing ease in data generation,
bioinformatic analyses have yet to reach a satisfactory level of robustness and standardisation, both of which
are essential for correct data interpretation and eventual clinical translation [5].

Currently, whole exome sequencing (WES) offers the best trade-off between costs and amount of genetic
information obtained for detecting single nucleotide variants (SNVs) and small insertions and deletions
(indels) in coding regions. Identifying somatic mutations is more challenging than identifying germline variants
for several reasons: (1) tumour samples can contain a high amount of normal tissue contamination, (2) tumour
cells can have acquired major changes in ploidy and DNA copy number and, finally, (3) somatic mutations
can be present in a subset of tumour cells (i.e. subclonal mutations). As a consequence, lower variant allele
frequencies (VAFs) need to be detected, making them harder to distinguish from technical noise [5].

A number of methods have been designed specifically to identify somatic mutations. Several studies comparing
their performances [6–8] have highlighted poor concordance between methods [5, 9, 10]. In addition, not only the
mutation caller, but also all the upstream computational steps can impact the final results. Unfortunately,
identifying the best approach or appropriate parameters is extremely challenging because the ground truth is
normally unknown.

An important benchmark dataset has been generated by the Platinum Genomes project, where a catalogue of
highly accurate whole genome variant calls and homozygous reference calls has been derived for sample
NA12878 by integrating independent sequencing data and the results of multiple pipelines
(https://www.illumina.com/platinumgenomes). For example, it is the basis of the Genome Comparison and Analytic
Testing (GCAT) platform [11] that allows an easy benchmarking of users' pipelines for the identification
of germline variants.

To generate a benchmark dataset for the detection of somatic mutations, we performed a WES experiment
using two lymphoblastoid cell lines from the HapMap/1000 Genomes Project (NA12878 and NA11840) in
order to mimic a tumour-normal pair, with NA12878 being the tumour and NA11840 being the normal. We
diluted the NA12878 DNA with an increasing amount of NA11840 DNA (from 0 to 99.8%) to estimate the
performance in detecting mutations within a wide range of VAFs. Using this dataset we aimed to: (1) evaluate the
effect of alignment and base quality recalibration on mutation calls; (2) compare the performance of Mutect2
and Strelka in identifying SNVs and indels; (3) optimize the mutation calling by parameter adjustment; (4) derive
an intersect-then-combine (ITC) approach to merge information from multiple tools to increase the sensitivity
and decrease the false positive rate (FPR). The validity of the ITC approach was then confirmed in a set of clinical
samples.

Methods
Sample preparation
Two lymphoblastoid cell lines, NA12878 and NA11840, from the Human Genome Diversity Project (HGDP)-CEPH
collection were obtained from the Coriell Cell Repository. The NA11840 cell line was chosen from a
set of 17 available CEPH cell lines in our laboratory as it shared the least number of SNVs with NA12878, so
as to generate the maximum number of virtual somatic SNVs. The cell lines were grown as suspensions in RPMI
1640-Glutamax (Invitrogen, Waltham, MA, USA) supplemented with 10% foetal calf serum and 5% penicillin and
streptomycin at 37 °C and 5% CO2. The cell lines were passaged at 1:10 dilution, and 10 × 10^6 cells were
harvested for DNA extractions.

DNA was extracted from the cell lines using the DNeasy Blood and Tissue DNA extraction kit (Qiagen,
Manchester, UK) and quantified using a Qubit High Sensitivity DNA quantification kit (Life Technologies,
Carlsbad, CA, USA). DNA from both cell lines was diluted to obtain 100 ng/μl stock concentrations.

To generate the serial dilutions of one cell line with the other, we mixed by volume to obtain the percentage
(volume/volume) as presented in Additional file 1 (n = 12).

Clinical samples
We included WES data from peripheral blood lymphocytes (buffy coat) of 10 individuals. The samples were
collected and analysed as part of the study "Cell-free DNA in non-metastatic setting", approved by the
Institutional Review Board of the Vall d'Hebron University Hospital, Barcelona, Spain (PR_AG_67-2013). DNA was
extracted using the QIAamp DNA Mini Kit (Qiagen) according to manufacturer's instructions and quantified
using the Qubit Fluorometer assay (Life Technologies) as previously described [12]. Two (or three)
independent libraries were generated from each sample and sequenced as described in the next paragraphs. These
libraries were sequenced on an Illumina HiSeq 2500 and Illumina HiSeq 2000 respectively (Illumina, San Diego,
CA, USA).

A breast cancer sample for which two independent WES data were available was also included. This sample
(ID: AB551) is part of a previously reported biobank [13].

Whole exome sequencing
Adapter-ligated indexed libraries were generated using the Illumina Nextera Rapid Capture kit (Illumina) from
50 ng of DNA as per manufacturer's instructions. The libraries were quantified using a Qubit High Sensitivity
dsDNA assay (Life Technologies). Five hundred nanograms of adapter-ligated barcoded DNA from each sample
from each library were pooled into a capture pool of 12. Each capture pool was hybridised twice with enrichment
probes for the exome. The fragment sizes of enriched libraries were assessed using a bioanalyser (Agilent
Technologies, Folsom, CA, USA) and quantified using KAPA Library Quantification Kits (Kapa Biosystems,
Wilmington, MA, USA).

Paired-end 125-bp sequencing runs were performed on an Illumina HiSeq 2500 instrument, aiming for a mean
read depth coverage of 100× for the NA12878 dilution series.

Alignment
Burrows-Wheeler Aligner (BWA)/Genome Analysis Toolkit (GATK) pipeline
After initial quality control using the FastQC application
(http://www.bioinformatics.babraham.ac.uk/projects/fastqc/), adapters and low-quality bases (Phred score below 20)
were trimmed off using Trim Galore (v0.3.7) (http://www.bioinformatics.babraham.ac.uk/projects/trim_galore/).
Reads were aligned to the human reference genome (hg19/GRCh37 decoy) using BWA-MEM (v0.7.12) and
default parameters. Local realignment and base quality recalibration (BQR) were performed using the GATK,
v3.4.46. Multiple bam files for the same sample (obtained on different sequencing lanes) were merged. Alignment
and coverage metrics as well as PCR duplicate marking
were computed using Picard tools (v1.125) before and after merging. Local realignment was repeated on all 12
samples together to ensure indel concordance between samples.

Novocraft pipeline
Adapter and low-quality base (Phred score below 20) trimming, alignment and BQR were performed in a single
step using Novoalign (v3.02). In a preliminary step, the alignment score threshold was varied between 50 and 300.
As shown in Additional file 2: Figure S1, modification of this threshold affects both alignment efficiency and the
running time, and we established that the best compromise between alignment time and efficiency was at a score
threshold of 250. Bam files were locally realigned using the GATK, v3.4.46. Sorting and duplicate marking were
done using Novosort (v3.02) before and after merging the distinct bam files from the same sample. Local
realignment was repeated after merging on all 12 samples together.

Identification of regions of interest
Platinum variant calls for sample NA12878 (the virtual tumour) and confident regions (high confidence
homozygous reference regions plus platinum calls) [14] were downloaded from
https://www.illumina.com/platinumgenomes (v7.0.0). Genotype data for sample NA11840 (the virtual
normal) were obtained from the 1000 Genomes website. Platinum calls were intersected with the Nextera exome
target regions, and variants shared with the NA11840 sample were excluded. Four multiallelic SNVs and 10
multiallelic indels were also excluded. In total we identified a list of 9968 SNVs and 420 indels that are
theoretically somatic variants in our virtual tumour-normal pair. The confident regions were also intersected
with the Nextera exome target regions, defining the regions of interest in which to search for mutations. After
subtracting platinum call regions, a total of 36,582,697 bp represented our set of reference regions.

Somatic mutation calling
Mutect2
Mutect2 (included in GATK 3.5) was run for each combination of NA12878 dilution series (from 100% to 0.2%
purity) with the NA11840 sample (tumour-normal mode) using default parameters, with the exception of the
minPruning parameter which determines the minimum support to not prune paths in the De Bruijn-like graph: this
was set at 3 (instead of the default value 2), as it dramatically reduced the running time without affecting
the number of variant calls (data not shown). In the platinum genome dilution experiment, we included all
mutations passing all the internal filters as well as mutations that failed the clustered_events and/or
homologous_mapping_event filters, as germline variations frequently occur close to each other in the genome.

Strelka
We ran Strelka (v1.0.14) [15] for each virtual tumour-normal pair with recommended starting parameters for
BWA in the configuration file and default parameters. The isSkipDepthFilters parameter was set to 1 to skip
depth filtration, as suggested by the authors. Mutations called by default were those passing internal filters
identified using Tier 1 reads and with a QSS_NT > 15 for SNVs and a QSI_NT > 30 for indels.

Performance evaluation
By intersecting the experimental calls with the list of platinum calls (9968 SNVs and 420 indels), we computed
the number of true positives (TP), false negatives (FN) and false positives (FP). This task was performed using
custom scripts that matched the genomic coordinates as well as reference and alternative alleles. Sensitivity was
then defined as:

$$\mathrm{Sens} = \frac{TP}{TP + FN}$$

We also computed the false positive rate per megabase (FPR/Mb) as:

$$\mathrm{FPR/Mb} = \frac{FP}{n_{\mathrm{ref}}} \times 10^{6}$$

where $n_{\mathrm{ref}}$ is our set of reference regions equal to 36,582,697 bp.

Variant annotation
Somatic mutations called in the breast cancer clinical sample were annotated using Variant Effect Predictor
(VEP) (http://grch37.ensembl.org/).

Results
Experimental design and quality control
We sequenced two cell lines from the HapMap/1000 Genomes Project, NA12878 (Platinum Genomes) and
NA11840 to mimic a tumour-normal pair allowing identification of somatic SNVs and indels. The NA12878
sample was also mixed with an increasing amount of NA11840 (up to 99.8% by concentration, Additional file 1)
to mimic normal contamination and the presence of subclonal somatic mutations. All samples were subjected to
WES, and we obtained an average on target coverage of 100× (Additional file 2: Figure S2A).

A list of 9968 high confidence SNVs present in NA12878 and not in NA11840 was derived (see Methods).
For these loci we computed the VAFs in our data to verify that the observed median values for heterozygous and
homozygous SNVs matched the expected values in the dilution series (Additional file 2: Figure S2B). It is worth
noting that whilst the median VAFs matched the expected values, confirming dilution accuracy, a wide dispersion of
VAF values was observed, with 265 SNVs having a VAF of 0 in the 100% NA12878 sample, mostly caused by low or
no coverage in that specific locus. An additional group of 495 SNVs had coverage less than 10× in our dataset. We
did not exclude these SNVs, as uneven coverage is commonly observed in WES data. Although the selected
platinum calls were not identified as SNVs in the publicly available data for NA11840, some were clearly present in
our NA11840 data (556 mutations had a VAF greater than 0.2, Additional file 2: Figure S3). Again, these loci were
retained because they might be either real SNVs or problematic regions with higher noise. The inclusion of the
above-mentioned SNVs meant that our sensitivity would never reach 100%.

Raw sequencing data were processed using two distinct alignment pipelines (hereafter named BWA/GATK
and Novocraft; the statistics above have been computed on Novocraft bam files), and somatic SNVs and indels
were called using Mutect2 and Strelka. Mutations were initially identified using default parameters and again
using optimised filtering criteria. We then applied an intersect-then-combine (ITC) approach as detailed in
Fig. 1 and subsequent paragraphs.

[Fig. 1 diagram: RAW data → alignment with BWA/GATK or Novocraft, without (w/o) or with (w/) BQR → Mutect2 and Strelka calls → optimised filtering → Intersect → Combine]

Fig. 1 Analysis workflow. Schematic representation of the workflow of analyses applied to the NA12878 platinum genome dilution series. Raw
data have been initially analysed using two alignment pipelines (BWA/GATK-based and Novocraft-based) with or without base quality recalibration.
Somatic calls were identified using two distinct tools: Mutect2 and Strelka. For base quality recalibrated data, mutation caller parameters have been
adjusted to improve overall performance. The intersection between mutations from the same caller but different alignment pipelines was selected to
reduce the number of false positives. Subsequently, the unions of these filtered calls from Mutect2 and Strelka were combined to obtain the final set
of calls

Performances of the different pipelines were measured in terms of sensitivity computed for each sample in the
dilution series and FPR/Mb estimated in the most diluted sample. For FPR/Mb estimation, we only used the most
diluted sample because we observed new mutations in the least diluted NA12878 samples (likely to be caused by
genetic drift) that would be wrongly accounted as false positives. This is supported by the fact that in the least
diluted samples (100% and 80% NA12878) most of the hypothetical false positives overlap, whilst in the most
diluted samples none of them overlaps (Additional file 2: Figure S4).

Mutation calling using default parameters and effect of base quality recalibration
When calling SNVs using default parameters, Mutect2 had higher sensitivity, particularly at low VAFs, but at
the expense of a slightly higher FPR/Mb when compared to Strelka (Fig. 2a, c, e). However, none of the approaches
reached 80% sensitivity in the pure NA12878 sample. Performances were lower in indel calling, with Strelka
showing the lowest sensitivity but also a much lower FPR/Mb in combination with the BWA/GATK pipeline
(Fig. 2b, d, f).

The base quality recalibration (BQR) step can considerably change the performance. Indeed, when BQR was not
performed, low frequency SNVs were called with a higher sensitivity, but associated with a dramatic increase in
FPR/Mb (Fig. 2a, c, e). Therefore, for all subsequent analyses, base quality recalibrated data were used. In base
quality recalibrated data, although the alignment algorithms did not seem to have a major impact on sensitivity
(Fig. 2a and c), the SNV FPR/Mb was higher in both callers when used with BWA/GATK alignments (Fig. 2e). In
addition, a higher FPR/Mb was observed when Strelka was used to call indels on Novocraft-aligned data (Fig. 2f).

Fig. 2 Effect of alignment and base quality recalibration on sensitivity and FPR. Sensitivity in detecting SNVs (a, c) or indels (b, d) in our dilution
series using Mutect2 (a, b) or Strelka (c, d) and each of four different alignment pipelines. e, f FPR/Mb in SNV (e) or indel (f) calling as a function
of the alignment pipeline and mutation caller used

Optimising mutation caller parameters
To improve the performance of both mutation callers, various caller-specific filtering criteria were assessed. In
Mutect2, we looked at the reasons for rejection of the false negative calls to understand which internal filters
caused rejection, thereby identifying which parameters needed adjustment (Additional file 2: Figure S5). We
considered both the undiluted NA12878 and the 10% diluted sample, hypothesising that reasons for failure
might be different at high and low VAFs. Indeed, in the undiluted NA12878, independent of the alignment
pipeline, most of the false negatives were rejected because the alternative alleles were observed in the normal
sample (i.e. NA11840). As previously mentioned, some of these false negatives might be due to the presence of
unreported SNVs in the NA11840 sample; however, a
proportion is likely caused by background noise or technical cross-sample contamination (Additional file 2:
Figure S3B).

Default Mutect2 parameters for the alt_allele_in_normal filter allow for no more than one read bearing the
alternative allele in the normal sample, and it must represent less than 3% of the reads mapping over the locus.
Taking advantage of a dataset where the ground truth (or a good approximation of it) is known, we measured the
change in performance as a function of the threshold applied. Increasing the percentage of alternative allele
present in the normal sample will likely increase not only the sensitivity but also the FPR, mainly because one
might call germline mutations as somatic. Therefore, we added an extra filter by computing the ratio between the
VAF observed in the tumour and the VAF observed in the normal (hereafter called T/N ratio). As expected,
increasing the percentage of alternative allele allowed in the normal increased both sensitivity and FPR/Mb, but by
applying the T/N ratio we could obtain higher sensitivity and lower FPR/Mb compared with default parameters.
This was true for both 100% and 10% NA12878 samples and independent of the alignment pipelines
(Additional file 2: Figures S6, 7).

Not surprisingly, in the 10% NA12878 sample the main reason for failure was the t_lod_fstar filter, where
the log-likelihood ratio of the data under the variant and reference models has to exceed a specified threshold
(default = 6.3). Varying the threshold confirmed that it is a trade-off between sensitivity and FPR/Mb (Additional
file 2: Figures S6, 7).

In Strelka, mutation calls are separated in Tier 1 and Tier 2, where the first is a set of input data filtration and
model parameter settings with relatively stringent values, whereas the second uses more permissive settings [15].
It has been suggested by the Strelka authors to consider Tier 1 calls only and apply a QSS_NT (quality score
reflecting the joint probability of somatic variant and genotype of the normal) threshold of 15. Including Tier
2 calls increased both sensitivity and FPR/Mb, but by also changing the QSS_NT threshold it was possible to
increase the sensitivity and decrease the FPR/Mb. As before, this is true independently of the dilution and
alignment pipeline (Additional file 2: Figure S8). The same applies to indels, where the default QSI_NT threshold is
30, although smaller improvements were observed (Additional file 2: Figure S9).

Based on the analyses above, we selected the best thresholds to increase the sensitivity (in particular at
high VAF, without losing sensitivity at low VAFs), whilst in most cases reducing the FPR/Mb. For Mutect2 we
included SNVs and indels with a percentage of alternative allele in normal up to 7% but a T/N ratio higher than 5.
In Strelka we included Tier 1 and Tier 2 calls with a QSS_NT > 25 for SNVs and QSI_NT > 35 for indels. The
improvement in performance using the aforementioned parameters is summarised in Fig. 3.

The intersect-then-combine (ITC) approach
After parameter optimisation, we hypothesised that some of the false positives might be caused by alignment
errors, and to test this hypothesis we compared the calls from the same mutation caller but from different
alignment pipelines. Interestingly, only a small percentage (an average of 1% for SNVs and 3.9% for indels) of true
positive calls was discordant, whilst half of the false positives were called in one case but not the other (an
average of 50% for SNVs and 48% for indels) (Table 1). Therefore, considering only the intersection of calls
identified by the same mutation caller but with two different alignment pipelines reduced the sensitivity slightly
but reduced the FPR dramatically. Importantly, this makes it possible to combine the calls from the two mutation
callers (i.e. Mutect2 and Strelka), significantly increasing the sensitivity (because they still show a significant
disagreement) but still at a low enough FPR. Notably, the subset of mutations called by both Mutect2 and Strelka
has an extremely high proportion of true positives (Table 2). We checked that variants called by one caller but
not the other were not caused by a different representation of the same variant. We found that only in one case
the same insertion next to a repetitive region was represented as 14:29261307 A->AC in Mutect2 and
14:29261305 A->AAAC in Strelka.

The ITC approach allowed us to achieve a sensitivity of 86.7% for SNV calling in the pure sample and 50.2%
sensitivity for indels; these values were significantly higher than the default performances for each single approach
whilst simultaneously controlling for the FPR/Mb (Fig. 4). For example, the ITC approach showed the same FPR/Mb
as the Strelka/Novocraft pipeline; however, the sensitivity was systematically higher across the dilution series. In
particular, sensitivity in SNV detection increased up to 17.1% in the 40% NA12878 sample, whilst sensitivity in
indel detection increased up to 16.4% in the 60% NA12878 sample. Sensitivity was also estimated for the subset of
SNVs and indels heterozygous in the NA12878 sample (Additional file 2: Figure S10). This allows a VAF
approximation for these SNVs and indels for each sample in the dilution series, hence allowing an estimation of the
sensitivity as a function of the VAF. In this subset of mutations, the sensitivity in the 100% NA12878 sample
(expected VAF = 50%) was 87.8% for SNVs and 55.4% for indels.

We anticipated a significant number of false negatives caused either by low coverage or alternative allele
drop-out (see the "Experimental design and quality control" subsection). Indeed, we found that 28.6% of the false
negative SNVs using the ITC approach were due to low coverage (<10×) whilst an additional 9.9% showed a
VAF = 0 in our data (but coverage >10×).
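
The two steps of the ITC approach described above can be sketched with plain set operations. This is a minimal illustration, not the authors' custom scripts: the `Variant` tuple, the function name and the toy coordinates are assumptions made for the example, and variants are matched on chromosome, position, reference and alternative allele as in the performance evaluation.

```python
from typing import NamedTuple, Set


class Variant(NamedTuple):
    """Hypothetical variant key: coordinates plus reference and alternative alleles."""
    chrom: str
    pos: int
    ref: str
    alt: str


def intersect_then_combine(
    mutect2_bwa: Set[Variant],
    mutect2_novo: Set[Variant],
    strelka_bwa: Set[Variant],
    strelka_novo: Set[Variant],
) -> Set[Variant]:
    # Step 1 (intersect): keep calls made by the same caller on both alignments.
    mutect2_consensus = mutect2_bwa & mutect2_novo
    strelka_consensus = strelka_bwa & strelka_novo
    # Step 2 (combine): take the union across the two callers.
    return mutect2_consensus | strelka_consensus


# Toy call sets (made-up coordinates, for illustration only).
a = Variant("1", 100, "A", "T")
b = Variant("1", 200, "G", "C")
c = Variant("2", 300, "C", "A")
d = Variant("3", 400, "T", "G")

itc_calls = intersect_then_combine(
    mutect2_bwa={a, b},
    mutect2_novo={a},
    strelka_bwa={c, d},
    strelka_novo={c},
)
print(sorted(itc_calls))
```

Exact matching on the four fields would treat the two representations of the chromosome 14 insertion mentioned above as different variants, so a real implementation would probably normalise indel representation before comparing call sets.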
Fig. 3 Effect of parameter adjustments on sensitivity and FPR. Changes in sensitivity (a, c) and FPR/Mb (b, d) after optimised filtering of Mutect2
(a, b) or Strelka (c, d) calls in combination with the two alignment pipelines (BWA/GATK and Novocraft)
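
The optimised thresholds reported in the text (alternative allele in normal up to 7% with a T/N ratio above 5 for Mutect2; QSS_NT > 25 and QSI_NT > 35 for Strelka) could be applied along the following lines. This is a simplified sketch: the function names and the read-count inputs are hypothetical, and the real callers' internal filters are more involved than a comparison of two allele fractions.

```python
import math

MAX_NORMAL_ALT_FRACTION = 0.07  # "up to 7%" alternative allele in the normal
MIN_TN_RATIO = 5.0              # T/N ratio must be higher than 5
MIN_QSS_NT = 25                 # optimised Strelka threshold for SNVs
MIN_QSI_NT = 35                 # optimised Strelka threshold for indels


def vaf(alt_reads: int, depth: int) -> float:
    """Variant allele frequency; 0.0 when the locus has no coverage."""
    return alt_reads / depth if depth > 0 else 0.0


def passes_mutect2_filter(t_alt: int, t_depth: int, n_alt: int, n_depth: int) -> bool:
    """Assumed form of the extra filter: cap the normal VAF, then require T/N > 5."""
    tumour_vaf = vaf(t_alt, t_depth)
    normal_vaf = vaf(n_alt, n_depth)
    if tumour_vaf == 0 or normal_vaf > MAX_NORMAL_ALT_FRACTION:
        return False
    # No alternative reads in the normal: the ratio is treated as infinite.
    tn_ratio = math.inf if normal_vaf == 0 else tumour_vaf / normal_vaf
    return tn_ratio > MIN_TN_RATIO


def passes_strelka_filter(score: float, is_indel: bool) -> bool:
    """score is QSI_NT for indels and QSS_NT for SNVs (Tier 1 or Tier 2 calls)."""
    threshold = MIN_QSI_NT if is_indel else MIN_QSS_NT
    return score > threshold


print(passes_mutect2_filter(t_alt=30, t_depth=100, n_alt=2, n_depth=100))
print(passes_strelka_filter(score=30, is_indel=True))
```

The T/N ratio is what lets the normal-sample threshold be relaxed from 3% to 7%: a variant with 5% alternative reads in the normal passes only if its tumour VAF is more than five times higher.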

Finally, we looked at the false positive calls still present using the ITC approach (Additional file 2: Figure S11
and Additional file 3). Some of the false positive SNVs were clustered and could be avoided using extra filtering
steps based on the distance between calls. In addition, we noticed the presence of several C > A transversions,
probably caused by oxidative DNA damage during sample preparation [16]. False positive indels were located in
low complexity/repetitive regions where polymerase slippage could introduce errors.

Validation in clinical samples
After developing our approach and testing the performance in the platinum genome experiment, we aimed to
confirm its validity in clinical samples. We first used a set of 10 normal samples for which two or three
independent replicates were available. We called somatic mutations in each pair of replicates using one replicate
as tumour and the other as normal and vice versa. In this setting, we assumed that any mutations called are
false positives, giving us the opportunity to estimate the FPR/Mb. For the four combinations of alignment
pipelines and mutation callers, the observed FPR/Mb values were slightly higher than what was observed in the
benchmark dataset. Importantly, the FPR/Mb was the lowest using the ITC approach (Fig. 5a, b) and slightly
lower than what was observed in the benchmark dataset (Fig. 4b, d).

Table 1 Intersection of calls obtained using different alignment pipelines

                  TP - 100% NA12878 (a)                                    FP
                  BWA/GATK only (%)  Intersection  Novocraft only (%)     BWA/GATK only (%)  Intersection  Novocraft only (%)
SNVs    Mutect2   90 (1.1)           8435          91 (1.1)               23 (65.7)          12            15 (55.6)
        Strelka   107 (1.3)          7928          37 (0.5)               12 (50.0)          12            5 (29.4)
Indels  Mutect2   11 (6.4)           162           6 (3.6)                3 (60.0)           2             2 (50.0)
        Strelka   7 (4.0)            169           3 (1.7)                0 (0.0)            2             8 (80.0)

(a) Out of 9968 candidate SNVs and 420 candidate indels
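
The percentages in Table 1 can be reproduced from the counts. The sketch below assumes that each percentage is the aligner-specific count divided by all calls made on that alignment (aligner-specific plus intersection); this definition is inferred from the numbers rather than stated in the text. The variable and function names are illustrative.

```python
def pct_only(only: int, intersection: int) -> float:
    """Percentage of calls from one alignment that the other alignment did not reproduce."""
    return round(100 * only / (only + intersection), 1)


def mean_pct(pairs) -> float:
    values = [pct_only(only, shared) for only, shared in pairs]
    return sum(values) / len(values)


# (aligner-only count, intersection count) pairs copied from Table 1.
tp_snvs = [(90, 8435), (91, 8435), (107, 7928), (37, 7928)]
tp_indels = [(11, 162), (6, 162), (7, 169), (3, 169)]
fp_snvs = [(23, 12), (15, 12), (12, 12), (5, 12)]
fp_indels = [(3, 2), (2, 2), (0, 2), (8, 2)]

for label, pairs in [
    ("TP SNVs", tp_snvs),
    ("TP indels", tp_indels),
    ("FP SNVs", fp_snvs),
    ("FP indels", fp_indels),
]:
    print(f"{label}: mean discordant {mean_pct(pairs):.1f}%")
```

Under that assumption the means come out close to the averages quoted in the text: about 1% and 3.9% for true positive SNVs and indels, and about 50% and 48% for false positives.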
Table 2 Union of somatic mutations identified using different callers

               TP - 100% NA12878                          FP
               Mutect2 only  Intersection  Strelka only   Mutect2 only  Intersection  Strelka only
Union SNVs     712           7723          205            8             4             8
Union indels   42            120           49             2             0             2
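
As a worked check, the sensitivity and FPR/Mb definitions from the Methods can be applied to the counts in Table 2. The sketch below uses the 9968 SNVs, 420 indels and 36,582,697 bp of reference regions given in the text; it assumes that the FP counts in Table 2 are the ones entering the FPR/Mb estimate, and the function names are illustrative.

```python
N_REF_BP = 36_582_697  # size of the reference regions in bp (Methods)
N_SNVS = 9968          # platinum SNVs expected in the virtual tumour
N_INDELS = 420         # platinum indels expected in the virtual tumour


def sensitivity(tp: int, fn: int) -> float:
    """Sens = TP / (TP + FN)."""
    return tp / (tp + fn)


def fpr_per_mb(fp: int, n_ref_bp: int = N_REF_BP) -> float:
    """FPR/Mb = FP / n_ref * 10^6."""
    return fp / n_ref_bp * 1e6


# Union of the Table 2 columns: Mutect2 only + intersection + Strelka only.
snv_tp = 712 + 7723 + 205
indel_tp = 42 + 120 + 49
snv_fp = 8 + 4 + 8
indel_fp = 2 + 0 + 2

snv_sens = sensitivity(snv_tp, N_SNVS - snv_tp)
indel_sens = sensitivity(indel_tp, N_INDELS - indel_tp)

print(f"SNV sensitivity:   {snv_sens:.1%}")
print(f"Indel sensitivity: {indel_sens:.1%}")
print(f"SNV FPR/Mb:        {fpr_per_mb(snv_fp):.2f}")
print(f"Indel FPR/Mb:      {fpr_per_mb(indel_fp):.2f}")
```

The union counts give 8640 of 9968 SNVs and 211 of 420 indels, which matches the 86.7% and 50.2% sensitivities reported for the ITC approach in the pure sample.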

Next, we looked at a tumour-normal pair for which two independent replicates of the tumour sample were
available. In this case, mutations called in both cases are very likely to be true positives, while those not
overlapping will be enriched in false positives. Consequently, an improvement in mutation calling performances
should lead to an increase in the percentage and absolute number of overlapping mutations. Importantly, one
replicate (R1) had a 52× average coverage whilst R2 had a 79× average coverage; this implies that most of the
calls in R1 should overlap the calls in R2, whilst a higher percentage of non-overlapping calls can be expected
for R2. For each replicate we computed the total number and the percentage of overlapping somatic mutations after
applying each combination of alignment pipeline and mutation caller or the ITC approach. The observed pattern
fits with the difference in coverage between the two replicates and, more importantly, shows that the ITC
strategy leads to the highest number of overlapping mutations (Fig. 5c, d). Indeed, 37 extra mutations
(overlapping between the two replicates) were identified with our approach compared with the second best (Mutect2
Novocraft). Interestingly, some of them were affecting cancer-related genes, i.e. a missense mutation in TFE3, a
stop-gain mutation in KMT2C and a stop-gain mutation in the putative tumour suppressor gene RPS6KA2 [17]
(Additional file 4).

Discussion
Cancer genomics has acquired a prominent role in oncology, providing information on cancer biology and
mechanisms of resistance, and its clinical application is becoming a reality [18, 19]. However, computational
analysis of sequencing-based data is facing a lack of standardisation, as demonstrated by recent reports [5, 20].
In this study we focused on improving the identification of somatic SNVs and indels in WES data.
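
The replicate comparison described above reduces to counting shared calls and expressing them as a percentage of each replicate's call set. The following sketch uses made-up variant keys and a hypothetical `overlap_stats` helper; it only illustrates the bookkeeping, not the actual call sets of sample AB551.

```python
from typing import Set, Tuple

VariantKey = Tuple[str, int, str, str]  # (chrom, pos, ref, alt), an assumed key


def overlap_stats(calls_r1: Set[VariantKey], calls_r2: Set[VariantKey]):
    """Return the number of shared calls and the percentage they represent in each replicate."""
    shared = calls_r1 & calls_r2
    pct_r1 = 100 * len(shared) / len(calls_r1) if calls_r1 else 0.0
    pct_r2 = 100 * len(shared) / len(calls_r2) if calls_r2 else 0.0
    return len(shared), pct_r1, pct_r2


# Toy example: the lower-coverage replicate (R1) yields fewer calls, all of them
# also found in the higher-coverage replicate (R2).
r1 = {("1", 100, "A", "T"), ("2", 200, "G", "C"), ("3", 300, "C", "A")}
r2 = r1 | {("4", 400, "T", "G"), ("5", 500, "G", "A")}

n_shared, pct_r1, pct_r2 = overlap_stats(r1, r2)
print(n_shared, pct_r1, pct_r2)
```

In this toy case all R1 calls overlap R2 while only part of the R2 calls overlap R1, which is the pattern the text expects from the difference in coverage.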

Fig. 4 Performance of the intersect-then-combine (ITC) approach. Sensitivity (a) and FPR/Mb (b) in identifying somatic SNVs after applying the ITC
strategy compared with performance using each single alignment pipeline in combination with each mutation caller. Sensitivity (c) and FPR/Mb
(d) in identifying somatic indels after applying the ITC strategy compared with performance using each single alignment pipeline in combination
with each mutation caller
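
To relate the dilution series to variant allele frequency, the expected VAF of a virtual somatic variant can be approximated from the NA12878 fraction. The sketch assumes two diploid genomes mixed at equal DNA concentration and a variant that is absent from NA11840; the function name is illustrative, and only dilution levels mentioned in the text are listed.

```python
def expected_vaf(tumour_fraction: float, zygosity: str) -> float:
    """Expected VAF of a variant private to the virtual tumour (NA12878)."""
    # Fraction of NA12878 alleles carrying the variant: half if heterozygous, all if homozygous.
    allele_fraction = {"het": 0.5, "hom": 1.0}[zygosity]
    return tumour_fraction * allele_fraction


for pct in (100, 80, 60, 40, 10, 0.2):
    frac = pct / 100
    het = expected_vaf(frac, "het")
    hom = expected_vaf(frac, "hom")
    print(f"{pct:>5}% NA12878: het {het:.3%}, hom {hom:.3%}")
```

For the undiluted sample this gives the 50% expected VAF quoted for heterozygous variants, and for the most diluted sample (0.2% NA12878) a heterozygous variant would be expected at about 0.1%.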
[Fig. 5 panels: a SNVs, b Indels (y-axis: FPR/Mb; x-axis: Mutect2 BWA/GATK, Strelka BWA/GATK, Mutect2 Novocraft, Strelka Novocraft, ITC); c, d]

Fig. 5 Validation in clinical samples. In 10 normal samples for which two or three sets of WES data from independent libraries were available we ran
the different pipelines according to Fig. 1 using one replicate as tumour and the other as normal and vice versa, for a total of 28 comparisons (a, b).
In this setting, all called mutations are treated as false positives. Boxplots represent FPR/Mb distributions for the 28 comparisons as a function of the
different pipelines applied to identify somatic SNVs (a) or indels (b). The same set of analysis pipelines was applied to a breast cancer tumour sample
for which two sets of WES data from independent libraries (R1 and R2) as well as matched normal WES data were available (c, d). The percentage
(x-axis) and the total number (y-axis) of overlapping calls in the two replicates are plotted in c for R1 and d for R2. Sequencing coverage was 52× for
R1 and 79× for R2
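
The total of 28 comparisons follows from using every ordered pair of replicates within a sample (one as tumour, the other as normal, and vice versa). The sketch below assumes 8 samples with two replicates and 2 samples with three, which is the split that gives 28 when each ordered pair is counted; this split and the sample and replicate labels are not stated in the text.

```python
from itertools import permutations

# Hypothetical replicate layout: 8 samples with two libraries, 2 samples with three.
replicates = {f"S{i}": ["R1", "R2"] for i in range(1, 9)}
replicates.update({f"S{i}": ["R1", "R2", "R3"] for i in (9, 10)})

# Each ordered pair within a sample is one tumour-versus-normal comparison.
comparisons = [
    (sample, tumour, normal)
    for sample, reps in replicates.items()
    for tumour, normal in permutations(reps, 2)
]

print(len(comparisons))
```

Because both replicates come from the same normal DNA, any call in these comparisons is counted as a false positive, and each comparison contributes one FPR/Mb value to the boxplots.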

Generating appropriate benchmark datasets to estimate pipeline performance is not a trivial task. In recent reports, somatic mutations have been spiked in computationally [20] or derived after manual curation of high coverage data [5]. The first approach is limited to SNVs and overestimates performance because it generates mutations only in regions with sufficient coverage. The second is likely to generate an incomplete list of real mutations, in particular those having low frequency. The best available standard for germline mutation caller benchmarking is represented by the platinum genome sample NA12878 [11, 21]. Using this sample, we created here a benchmark dataset suitable for the evaluation of new methods and pipelines aiming to identify somatic mutations. A similar approach has been proposed in [6], but we generated a dilution series experimentally and not in silico, mimicking more realistically the detection of low VAF mutations. Although not a perfect system, we believe it represents one of the best possible approximations to the ground truth. Limitations are linked with the fact that tumour genomes are more complex than normal lymphoblastoid cell lines, and some low complexity or repetitive genomic regions might not have been considered. Remarkably, we were able to confirm our findings in clinical samples, obtaining similar FPR/Mb values and evidence for an increase in sensitivity using our ITC approach.

Note that, among mutations called with our approach and not with the second best performing pipeline, there were several missense mutations in proliferation and cancer genes, among them a stop-gain mutation in RPS6KA2, a putative tumour suppressor gene [17], and a stop-gain mutation in KMT2C, which has been found mutated in several cancer types (http://cancer.sanger.ac.uk/cosmic). This highlights how the bioinformatic analysis can significantly impact downstream data interpretation and the chances to identify the functionally relevant aberrations in a tumour sample.

Many studies report precision and recall as performance metrics [5, 20]. We preferred to use sensitivity and FPR/Mb instead [6]. Precision, also known as positive predictive value, is highly dependent on the number of real mutations in the sample. In our platinum genome experiment, nearly 10,000 mutations could be

called, keeping the precision high even in the presence of hundreds of false positives. By contrast, the FPR/Mb gives a direct estimate of the expected number of false positives, independently of the number of real mutations present. Sensitivity and recall denote the same metric. Although the number of mutations in our benchmark dataset is higher than in most cancer types, this is not a shortcoming. On the contrary, from a statistical point of view, a higher number of candidate mutations helps in obtaining a more robust estimate of sensitivity.

We chose to use tools that have been shown to outperform others and are commonly used by the community and big cancer genomics consortia [5, 6]. Mutect2 has been recently released; therefore, the previous version has been more widely used and benchmarked. We present here the results obtained using Mutect2 because it has not been previously compared with other tools and also because of its ability to call both SNVs and indels. However, similar conclusions can be drawn (for SNVs) using the older version of Mutect (data not shown).

In our study we evaluated the impact of several factors on the list of called mutations. The effect of alignment on mutation calling has been recently reported as minor [20]. We report here concordant results in terms of overall performance; however, we clearly show that most of the false positives are a consequence of misalignment. Indeed, selecting the somatic mutations identified by the same caller but after alignment with two different algorithms allowed us to remove around 50% of false positives with a minimal loss (~1%) in sensitivity. The base quality recalibration step had a bigger impact; when it is not applied, a huge number of false positives results. Interestingly, in a recent comparison of whole genome sequencing pipelines, only 5 of the 18 involved groups applied a base quality recalibration step [5].

Overall, Mutect2 outperformed Strelka in terms of sensitivity, particularly at lower VAFs, although showing a tendency for higher FPR/Mb. Both tools benefited from an adjustment of default filtering thresholds, an aspect often overlooked in previous reports. Importantly, we introduced a T/N ratio as an additional filtering criterion that, in combination with a more relaxed threshold for the alternative allele in the matched normal, allowed an increase of up to 9% sensitivity in Mutect2 calls whilst reducing the FPR/Mb. As a cautionary note, some of the thresholds applied in the filtering optimisation step might depend on coverage, library preparation or sequencing platform and might not generalise, but we expect the ITC approach proposed here to be generalisable and valid even when different tools are chosen. We estimated that the use of the ITC approach approximately doubles the required CPU time compared with a single aligner/single caller approach, and extra storage is temporarily required for the additional bam file to be generated. However, we note that the use of high performance computing and parallelisation is common practice and would minimise this drawback. Indeed, the observed increase in performance far outweighs any drawbacks secondary to increased computing resources required.

Our study clearly indicates the importance and advantages of having a benchmark dataset to test somatic mutation calling pipelines and quantitatively measure their performance. The dilution series generated here therefore represents a valuable resource that we are making publicly available through the precisionFDA platform (https://precision.fda.gov/).

The combination of multiple callers has been previously suggested, but it has involved taking either the intersection or the union of their calls, drastically losing sensitivity in the first case (in particular when one tool performs worse than the other) or hugely increasing the number of false positives in the second. Here we propose a two-step strategy allowing us to merge the calls from two different tools whilst keeping the FPR/Mb low.

Conclusions
Identification of somatic SNVs and indels in WES data has been suboptimal. Here we propose a computational approach based on the combination of two aligners and two mutation callers to increase the sensitivity whilst controlling for the false positive rate. We also provide a benchmark dataset based on the platinum genome NA12878 to objectively test the performance of any bioinformatic pipeline for the identification of somatic mutations.

Additional files
Additional file 1: NA12878 dilution series. (XLSX 36 kb)
Additional file 2: Figures S1–S11. All supplementary figures. (PDF 2333 kb)
Additional file 3: List of false positive SNVs and indels using the ITC approach. (XLSX 43 kb)
Additional file 4: VEP annotation for mutations called using the ITC approach and not by the second best (Mutect2/Novocraft) in AB551 breast cancer sample. (XLSX 143 kb)

Abbreviations
FPR/Mb: False positive rate per megabase; Indels: Small insertions and deletions; ITC: Intersect-then-combine; NGS: Next generation sequencing; QSI_NT: Quality score reflecting the joint probability of a somatic variant (indel) and NT; QSS_NT: Quality score reflecting the joint probability of a somatic variant (SNV) and NT; SNV: Single nucleotide variant; VEP: Variant effect predictor; WES: Whole exome sequencing

Acknowledgements
We are grateful to Cancer Research UK, the University of Cambridge and Hutchison Whampoa Limited for their support. The Human Research Tissue Bank is supported by the NIHR Cambridge Biomedical Research Centre. We thank the Cancer Research UK Cambridge Institute Core Facilities that supported aspects of this work: Genomics and Bio-repository.

Funding
This project has received funding from Cancer Research UK and from the European Union's Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement number 660060. The funding bodies did not have any role in the design of the study, in the collection, analysis or interpretation of data or in writing the manuscript.

Availability of data and materials
Raw data for the NA12878-NA11840 dilution series have been uploaded to the precisionFDA platform (https://precision.fda.gov/). Please go to Files > Explore > Added by: maurizio.callari.

Authors' contributions
MC, SJS, SFC and CC conceived the study, SFC performed the dilution experiment and library preparations, MC, SJS and OR developed the computational approach, LDMA and AB provided the clinical samples and MC performed the analyses and drafted the manuscript. All authors read and approved the final manuscript.

Competing interests
The authors declare that they have no competing interests.

Consent for publication
Not applicable.

Ethics approval and consent to participate
All human samples used were collected after informed consent, and the study was fully compliant with the Helsinki Declaration. Buffy coat clinical samples were collected and analysed as part of the study "Cell-free DNA in non-metastatic setting", approved by the Institutional Review Board of the Vall d'Hebron University Hospital, Barcelona, Spain (PR_AG_67-2013). Breast cancer clinical samples are part of our previously reported biobank [13], and all research was done with the appropriate approval by the National Research Ethics Service (Cambridgeshire 2 REC reference number: 08/H0308/178).

Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Received: 5 October 2016 Accepted: 24 March 2017

References
1. Weinstein JN, Collisson EA, Mills GB, Shaw KRM, Ozenberger BA, Ellrott K, et al. The Cancer Genome Atlas Pan-Cancer analysis project. Nat Genet. 2013;45:1113–20.
2. Pereira B, Chin S-F, Rueda OM, Vollan H-KM, Provenzano E, Bardwell HA, et al. The somatic mutation profiles of 2,433 breast cancers refines their genomic and transcriptomic landscapes. Nat Commun. 2016;7:11479.
3. Mardis ER, Wilson RK. Cancer genome sequencing: a review. Hum Mol Genet. 2009;18:R163–8.
4. Koboldt DC, Steinberg KM, Larson DE, Wilson RK, Mardis ER. The next-generation sequencing revolution and its impact on genomics. Cell. 2013;155:27–38.
5. Alioto TS, Buchhalter I, Derdak S, Hutter B, Eldridge MD, Hovig E, et al. A comprehensive assessment of somatic mutation detection in cancer using whole-genome sequencing. Nat Commun. 2015;6:10001.
6. Xu H, DiCarlo J, Satya RV, Peng Q, Wang Y. Comparison of somatic mutation calling methods in amplicon and whole exome sequence data. BMC Genomics. 2014;15:244.
7. Kim SY, Speed TP. Comparing somatic mutation-callers: beyond Venn diagrams. BMC Bioinformatics. 2013;14:189.
8. Roberts ND, Kortschak RD, Parker WT, Schreiber AW, Branford S, Scott HS, et al. A comparative analysis of algorithms for somatic SNV detection in cancer. Bioinformatics. 2013;29:2223–30.
9. O'Rawe J, Jiang T, Sun G, Wu Y, Wang W, Hu J, et al. Low concordance of multiple variant-calling pipelines: practical implications for exome and genome sequencing. Genome Med. 2013;5:28.
10. Krøigård AB, Thomassen M, Lænkholm A-V, Kruse TA, Larsen MJ. Evaluation of nine somatic variant callers for detection of somatic mutations in exome and targeted deep sequencing data. Jordan IK, editor. PLoS One. 2016;11:e0151664.
11. Highnam G, Wang JJ, Kusler D, Zook J, Vijayan V, Leibovich N, et al. An analytical framework for optimizing variant discovery from personal genomes. Nat Commun. 2015;6:6275.
12. De Mattos-Arruda L, Mayor R, Ng CKY, Weigelt B, Martínez-Ricarte F, Torrejon D, et al. Cerebrospinal fluid-derived circulating tumour DNA better represents the genomic alterations of brain tumours than plasma. Nat Commun. 2015;6:8839.
13. Bruna A, Rueda OM, Greenwood W, Batra AS, Callari M, Batra RN, et al. A biobank of breast cancer explants with preserved intra-tumor heterogeneity to screen anticancer compounds. Cell. 2016;167:260–74.
14. Eberle MA, Fritzilas E, Krusche P, Källberg M, Moore BL, Bekritsky MA, et al. A reference dataset of 5.4 million human variants validated by genetic inheritance from sequencing a three-generation 17-member pedigree. Genome Res. 2017;27:157–64.
15. Saunders CT, Wong WSW, Swamy S, Becq J, Murray LJ, Cheetham RK. Strelka: accurate somatic small-variant calling from sequenced tumor-normal sample pairs. Bioinformatics. 2012;28:1811–7.
16. Costello M, Pugh TJ, Fennell TJ, Stewart C, Lichtenstein L, Meldrim JC, et al. Discovery and characterization of artifactual mutations in deep coverage targeted capture sequencing data due to oxidative DNA damage during sample preparation. Nucleic Acids Res. 2013;41:e67.
17. Bignone PA, Lee KY, Liu Y, Emilion G, Finch J, Soosay AER, et al. RPS6KA2, a putative tumour suppressor gene at 6q27 in sporadic epithelial ovarian cancer. Oncogene. 2007;26:683–700.
18. Robinson D, Van Allen EM, Wu Y-M, Schultz N, Lonigro RJ, Mosquera J-M, et al. Integrative clinical genomics of advanced prostate cancer. Cell. 2015;161:1215–28.
19. Swanton C, Soria JC, Bardelli A, Biankin A, Caldas C, Chandarlapaty S, et al. Consensus on precision medicine for metastatic cancers: a report from the MAP conference. Ann Oncol. 2016;27(8):1443–48.
20. Ewing AD, Houlahan KE, Hu Y, Ellrott K, Caloian C, Yamaguchi TN, et al. Combining tumor genome simulation with crowdsourcing to benchmark somatic single-nucleotide-variant detection. Nat Methods. 2015;12:623–30.
21. Zook JM, Chapman B, Wang J, Mittelman D, Hofmann O, Hide W, et al. Integrating human sequence data sets provides a resource of benchmark SNP and indel genotype calls. Nat Biotechnol. 2014;32:246–51.
