Mason - A Read Simulator For Second Generation Sequencing Data
Manuel Holtgrewe
B-10-06
October 2010
1. Introduction
Second generation sequencing technologies yield DNA sequence data at unprecedentedly
high throughput. The generated data has many applications, including genome
resequencing and structural variant detection.
Various sequencing technologies are commercially available: Instruments based on py-
rosequencing by Roche/454 Life Sciences [MEA+ 05], reversible terminator chemistry by
Illumina [BBS+ 08], and sequencing by oligonucleotide ligation and detection by Applied
Biosystems are examples of second-generation sequencing. Helicos Biosciences offers the
first commercial instrument for single-molecule sequencing.
The Short Read Archive (SRA) offers a huge amount of freely available read data.
However, when developing, testing and evaluating software that processes sequencing
data, using real-world data only is not desirable. While evaluating performance on
real-world data is indispensable, simulated data nicely complements real data.
First, the original sample location in the genome is not known for real-world data
while it is available for simulated data. Second, the data in the SRA is often more than
one year old since authors publish their data in the SRA after the publication of their
paper. This means that the technology used for reads in the SRA is older, e.g. with
shorter read lengths or no mate pairs. Third, simulation allows one to consider
certain characteristics of data in an isolated way. For example, one can increase the
error rate in the simulated reads to show the robustness of an algorithm.
Because of this, many authors use simulated data for evaluating their algorithms.
Many authors use their own, possibly ad-hoc, software for generating such reads. How-
ever, there are also some publications that only deal with the simulation of reads.
In [Mye99], Myers describes celsim, a program for the simulation of Sanger reads.
Since the genomes of important model organisms were not known at that time, it also
allows the synthesis of genomes. In [ROA+ 08], Richter et al. describe MetaSim, a
program for the simulation of reads. The focus of MetaSim is metagenomics: it
allows sampling reads from a larger set of genomes and also artificially evolving these
genomes. Recently, in [BML+ 10], Balzer et al. describe flowsim, a simulator for
pyrosequencing reads, based on the analysis of empirical data.
Table 1 shows properties of these read simulators and Mason, the software described
in this paper. The source code of Mason is freely available, and it supports the generation
Feature celsim MetaSim flowsim Mason
Source code available X X
Sanger reads X X X
Sanger qualities X X
454 reads X X X
454 qualities X X
Illumina reads X X
Illumina qualities X
Table 1: Properties of read simulation software. Citations are: celsim [Mye99], MetaSim
[ROA+ 08], flowsim [BML+ 10]. Mason is described in this article.
of Illumina, 454 and Sanger reads. Our group has successfully used the simulator for
benchmarking software for read mapping, read correction and transcript quantification.
The read simulator has been implemented in C++ using the SeqAn library.
2. Simulation Models
The simulation program framework is described in Section 3. In this section, we describe
the simulation model parts that depend on the simulated technology: The length of the
physical sample (reference sequence infix), the model for simulating sequencing errors
and base quality values.
Following George Box’s words “All models are wrong but some are useful,” the aim
is not to find a model that fits reality completely. Rather, we want to find simple models that
show important characteristics of the simulated sequencing technologies.
Read Set    Species  Length [bp]  Submission  Platform
SRR026674   fly      36           2009/09/28  Illumina GA 2
SRR026675   fly      36           2009/09/28  Illumina GA 2
SRR026676   fly      36           2009/09/28  Illumina GA 2
SRR049254   fly      100          2010/05/20  Illumina GA 2
SRR038098   yeast    20           2010/03/16  Illumina GA 2
SRR003673   yeast    36           2008/08/12  "Illumina" (1G?)
Table 2: Properties of the read sets chosen for determining position-dependent
probabilities.
errors  percentage
        SRR038098  SRR003673  SRR026674  SRR049254
0       95.39      78.41      74.77      43.52
1        3.58      11.98      17.25      20.85
2        1.02       5.61       5.64      10.87
3        0.01       3.97       2.33       6.61
4        0          0.01       0.01       4.51
5        0          0          0          3.34
6        0          0          0          2.68
7        0          0          0          2.23
8        0          0          0          1.94
9        0          0          0          1.73
10       0          0          0          1.55
11       0          0          0          0.05
• One for read sets SRR026674, SRR049254, SRR038098, and SRR003673 (sets A)
to show the variance between the results from different studies.
• One for the read sets SRR026674, SRR026675, and SRR026676 (sets B) to show
the variance within the same experiment.
The plots themselves can be found in Appendix A. Figures 1, 2, and 3 show the
positional mismatch, insertion and deletion error rates. We can see that there are differences
between the mismatch error rates in sets A (Figure 1a) and even between those from the same
experiment (sets B, Figure 1b).
The reads from sets B were sequenced in 2008 and were among the first reads to be
sequenced in paired-end mode outside Illumina ([Tho10]). The authors consider these
reads to be of low quality, mostly due to source prep. Newer but not yet published read
sets have a much better quality, according to the authors. Note that while the
consistently growing error rates in the order SRR026674, SRR026675, SRR026676 suggest
instrument drift, the authors did not confirm this. The runs were not necessarily done
in this order: two runs were made on the same day and one on the next. A new flow cell is
used for each run, so cleaning is not an issue either.
Positional insertion rates can be seen in Figure 2. The rates being 0 for the first and
last base are caused by the alignment algorithm in RazerS: At the end of the program,
the reads are aligned semiglobally against the reference sequence and the gap penalties
are slightly larger than mismatch penalties. This means, alignments like these:
... CAACAAC-AACAACAACAA-CAACAACAACAA ...
|||||||||||
AAACAACAACAAA
The half-moon shape of the error rates in Figure 2b for sets B also has to be explained
as an alignment algorithm artifact. Note that the insertion rates are one order of magnitude
lower than the mismatch rates. The closer a mismatch occurs to the end of an
alignment, the higher the probability that an inserted base shifts the leading or trailing
bases onto a non-significant, random match. In sets SRR049254, SRR038098,
and SRR003673, the right tip of the half-moon is much larger than the left one. This
can be explained by the fact that the insertion rates are much higher towards the end
of the read and that, for these reads, the mismatch error rate grows more strongly towards
the end than for read sets B.
A similar explanation holds for the shapes of the deletion rate curves shown in Figure 3.
We hope to correct such alignment algorithm biases by a multi read realignment program
such as seqcons [RKD+ 09] in the future. Currently, bugs in this program prevent us from
using it in a whole-genome setting.
We consider the insertion and deletion probabilities to be independent of the position
and calibrate them with average rates from the middle of the analyzed reads. Thus, our
error model for Illumina reads is:
We first note that the qualities before and after deleted bases hardly differ. Second, the
qualities for inserted bases roughly follow those of neighbours of deleted bases. Third,
the qualities of matching and mismatching bases don’t differ much at the beginning but
quickly separate towards the end. The qualities of inserted and deleted bases lie in
between, first following the quality of matching bases up to the relative position of 1/3.
In Section 2.1.2, we described methodological problems of determining whether a base
is inserted/deleted or mismatching. Because of this, inserted and deleted bases could in
fact be mismatching or matching bases. For now, we decide only to differentiate between
mismatches and the rest of error types. Hopefully, a realignment step can clarify whether
this is adequate or not.
We decide to model the qualities as position-specific normal distributions: one for
mismatching bases and one for all other bases. The position-dependent means follow a
linearly falling ramp; the standard deviations follow a linearly rising ramp.
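The quality model above can be illustrated with a minimal Python sketch. It is not
taken from the Mason source code: the function names, the default means and standard
deviations, and the clamping of qualities to the range 0..40 are assumptions made
for illustration only.

```python
import random


def ramp(begin, end, pos, length):
    """Linearly interpolate from `begin` (first position) to `end` (last position)."""
    if length <= 1:
        return begin
    return begin + (end - begin) * pos / (length - 1)


def simulate_qualities(is_mismatch, rng,
                       mean_begin=40.0, mean_end=30.0,        # assumed values
                       stddev_begin=0.5, stddev_end=5.0,      # assumed values
                       mm_mean_begin=38.0, mm_mean_end=15.0,  # assumed values
                       mm_stddev_begin=1.0, mm_stddev_end=8.0,
                       max_quality=40):
    """Draw one quality value per read position.

    `is_mismatch[i]` selects the distribution for position i: one normal
    distribution for mismatching bases, one for all other bases. Means fall
    linearly along the read, standard deviations rise linearly.
    """
    n = len(is_mismatch)
    qualities = []
    for pos, mismatch in enumerate(is_mismatch):
        if mismatch:
            mu = ramp(mm_mean_begin, mm_mean_end, pos, n)
            sigma = ramp(mm_stddev_begin, mm_stddev_end, pos, n)
        else:
            mu = ramp(mean_begin, mean_end, pos, n)
            sigma = ramp(stddev_begin, stddev_end, pos, n)
        q = int(round(rng.gauss(mu, sigma)))
        # Clamp to a valid quality range (an assumption of this sketch).
        qualities.append(min(max(q, 0), max_quality))
    return qualities
```

Passing a seeded random.Random instance as `rng` makes a simulation run reproducible.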
• Read lengths are either uniformly sampled from an interval or normally distributed
with given mean and standard deviation.
• Insertions, deletions and mismatches are randomly distributed with
position-dependent probabilities. These probabilities are computed by ramp
functions with configurable probabilities at the beginning and the end.
• The quality simulation is the same as for Illumina reads, as described in Sec-
tion 2.1.3.
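The first two points of this model can be sketched in Python as follows. This is a
hedged illustration, not the implementation in Mason: the function names, the default
length parameters, and the encoding of the edit string with the letters M (match),
E (mismatch), I (insertion) and D (deletion) are assumptions of this sketch.

```python
import random


def sample_read_length(rng, mode="uniform", min_len=400, max_len=600,
                       mean=500.0, stddev=40.0):
    """Sample a read length uniformly from an interval or from a normal distribution.

    All default values are illustrative assumptions.
    """
    if mode == "uniform":
        return rng.randint(min_len, max_len)
    return max(1, int(round(rng.gauss(mean, stddev))))


def error_probabilities(pos, length, begin, end):
    """Ramp each error probability linearly from its `begin` to its `end` value."""
    frac = pos / (length - 1) if length > 1 else 0.0
    return {k: begin[k] + (end[k] - begin[k]) * frac for k in begin}


def simulate_edit_string(length, rng, begin, end):
    """Build an edit string that yields a read with exactly `length` bases.

    `begin` and `end` are dicts with the keys 'mismatch', 'insert' and 'delete'.
    A deletion (D) consumes a reference base but no read base, so it does not
    advance the read position.
    """
    ops = []
    pos = 0
    while pos < length:
        p = error_probabilities(pos, length, begin, end)
        x = rng.random()
        if x < p["mismatch"]:
            ops.append("E")
            pos += 1
        elif x < p["mismatch"] + p["insert"]:
            ops.append("I")
            pos += 1
        elif x < p["mismatch"] + p["insert"] + p["delete"]:
            ops.append("D")
        else:
            ops.append("M")
            pos += 1
    return "".join(ops)
```

The deletion probability must stay well below 1 in this sketch, since a deletion does
not advance the read position.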
3. Simulation Framework
This section gives a short overview of the implementation. The read simulation frame-
work works as follows:
First, the reference sequence is loaded. Alternatively, a random sequence with a given
background distribution can be generated.
Second, model specific parameters are computed or loaded from a file. For example,
the empirical error distribution for Illumina reads can be loaded in this step.
Third, haplotypes are simulated from the reference sequence: The reference sequence
is taken and changes are applied to it (actually, we only store a list of modifications for
shorter running times). At each position, a base substitution, an insertion, or a deletion is
applied with user-defined probabilities. The lengths of insertions and deletions are randomly
picked; inserted bases are picked uniformly at random.
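The haplotype step can be sketched in Python as follows, keeping only a list of
modifications and materializing the haplotype on demand. The names, the default
probabilities, the maximal indel length, and the choice to substitute a base that
differs from the reference base are assumptions of this sketch, not details taken
from Mason.

```python
import random


def simulate_haplotype_edits(reference, rng, p_snp=0.001, p_ins=0.0001,
                             p_del=0.0001, max_indel_len=6):
    """Return a position-sorted list of (kind, position, value) modifications.

    kind is 'snp' (value: new base), 'ins' (value: bases inserted before the
    position) or 'del' (value: number of deleted reference bases).
    """
    edits = []
    pos = 0
    while pos < len(reference):
        x = rng.random()
        if x < p_snp:
            alt = rng.choice([b for b in "ACGT" if b != reference[pos]])
            edits.append(("snp", pos, alt))
            pos += 1
        elif x < p_snp + p_ins:
            n = rng.randint(1, max_indel_len)
            edits.append(("ins", pos, "".join(rng.choice("ACGT") for _ in range(n))))
            pos += 1
        elif x < p_snp + p_ins + p_del:
            n = min(rng.randint(1, max_indel_len), len(reference) - pos)
            edits.append(("del", pos, n))
            pos += n
        else:
            pos += 1
    return edits


def apply_edits(reference, edits):
    """Materialize the haplotype sequence from the reference and the edit list."""
    out = []
    pos = 0
    for kind, at, value in edits:
        out.append(reference[pos:at])  # unchanged stretch before the edit
        pos = at
        if kind == "snp":
            out.append(value)
            pos = at + 1
        elif kind == "ins":
            out.append(value)          # reference[at] itself is kept afterwards
        else:                          # 'del': skip `value` reference bases
            pos = at + value
    out.append(reference[pos:])
    return "".join(out)
```

Storing only the edit list keeps memory use proportional to the number of
modifications rather than to the genome length.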
Fourth, the reads are simulated. Section 2 describes the parts that depend on the
simulated technology. For each read:
• Depending on the simulation model, generate an edit string and a buffer with the
inserted and substituted bases.
• Simulate qualities, depending on the edit string and the simulation model.
• If mate pairs are to be simulated, pick a location for the mate and perform the
steps above for the mate.
• Add meta-information about the read's sample location, the originally sampled reference
infix, and the edit string to the sequence descriptor.
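The interplay of edit string, base buffer and sequence descriptor in the steps above
can be sketched as follows. The letters M, E, I and D for match, mismatch, insertion
and deletion, as well as the descriptor keys, are hypothetical choices for this
illustration and are not claimed to match the output format of Mason.

```python
def apply_edit_string(infix, edit_string, base_buffer):
    """Derive the read sequence from the sampled infix.

    M copies a base of the infix, E replaces it by the next buffer base,
    I emits the next buffer base without consuming the infix, and
    D skips a base of the infix.
    """
    read = []
    ref_pos = 0
    buf_pos = 0
    for op in edit_string:
        if op == "M":
            read.append(infix[ref_pos])
            ref_pos += 1
        elif op == "E":
            read.append(base_buffer[buf_pos])
            buf_pos += 1
            ref_pos += 1
        elif op == "I":
            read.append(base_buffer[buf_pos])
            buf_pos += 1
        elif op == "D":
            ref_pos += 1
        else:
            raise ValueError("unknown edit operation: " + op)
    return "".join(read)


def make_descriptor(read_id, contig, begin_pos, infix, edit_string):
    """Pack the sample location, infix and edit string into one descriptor line.

    The key names are hypothetical.
    """
    end_pos = begin_pos + len(infix)
    return (f"{read_id} contig={contig} begin_pos={begin_pos} end_pos={end_pos} "
            f"infix={infix} edit_string={edit_string}")
```

For example, the infix ACGTAC with the edit string MEMIDMM and the buffer TG gives
the read ATGGAC.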
Fifth, the reads are written out into a FASTA/FASTQ file. Optionally, the program
writes out the alignment of the reads against the reference sequence in a SAM [LHW+ 09]
file.
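As an illustration of the output step, the following Python sketch formats one read
as a FASTQ record. The function name is hypothetical, and the quality offset of 33
(Sanger-style encoding) is an assumption; the text above does not state which
encoding the simulator writes.

```python
def fastq_record(name, sequence, qualities, offset=33):
    """Format one read as a four-line FASTQ record.

    `qualities` holds one integer quality per base; each is encoded as the
    ASCII character with code quality + offset.
    """
    if len(sequence) != len(qualities):
        raise ValueError("sequence and qualities must have the same length")
    quality_string = "".join(chr(q + offset) for q in qualities)
    return f"@{name}\n{sequence}\n+\n{quality_string}\n"
```

Records built this way can be concatenated and written to a single output file.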
4.2. Future Work
Using empirical distributions and simulating degradation for 454 reads following [BML+ 10]
would be very useful. Furthermore, a future version should support the simulation of
SOLiD color space reads.
The simulation of Helicos reads is another point. However, direct access to raw
Helicos sequencing data would be necessary for this, similar to the work
in [BML+ 10].
5. Acknowledgements
I thank Anne-Kathrin Emde, David Weese, and Knut Reinert for enlightening
discussions on 2GS technologies.
References
[BBS+ 08] David R Bentley, Shankar Balasubramanian, Harold P Swerdlow, Geoffrey P
Smith, John Milton, Clive G Brown, Kevin P Hall, Dirk J Evers, Colin L
Barnes, Helen R Bignell, Jonathan M Boutell, Jason Bryant, Richard J
Carter, R Keira Cheetham, Anthony J Cox, Darren J Ellis, Michael R
Flatbush, Niall A Gormley, Sean J Humphray, Leslie J Irving, Mirian S
Karbelashvili, Scott M Kirk, Heng Li, Xiaohai Liu, Klaus S Maisinger,
Lisa J Murray, Bojan Obradovic, Tobias Ost, Michael L Parkinson, Mark R
Pratt, Isabelle M J Rasolonjatovo, Mark T Reed, Roberto Rigatti, Chiara
Rodighiero, Mark T Ross, Andrea Sabot, Subramanian V Sankar, Aylwyn
Scally, Gary P Schroth, Mark E Smith, Vincent P Smith, Anastassia Spiri-
dou, Peta E Torrance, Svilen S Tzonev, Eric H Vermaas, Klaudia Wal-
ter, Xiaolin Wu, Lu Zhang, Mohammed D Alam, Carole Anastasi, Ify C
Aniebo, David M D Bailey, Iain R Bancarz, Saibal Banerjee, Selena G Bar-
bour, Primo A Baybayan, Vincent A Benoit, Kevin F Benson, Claire Bevis,
Phillip J Black, Asha Boodhun, Joe S Brennan, John A Bridgham, Rob C
Brown, Andrew A Brown, Dale H Buermann, Abass A Bundu, James C
Burrows, Nigel P Carter, Nestor Castillo, Maria Chiara E Catenazzi, Simon
Chang, R Neil Cooley, Natasha R Crake, Olubunmi O Dada, Konstantinos D
Diakoumakos, Belen Dominguez-Fernandez, David J Earnshaw, Ugonna C
Egbujor, David W Elmore, Sergey S Etchin, Mark R Ewan, Milan Fedurco,
Louise J Fraser, Karin V Fuentes Fajardo, W Scott Furey, David George,
Kimberley J Gietzen, Colin P Goddard, George S Golda, Philip A Granieri,
David E Green, David L Gustafson, Nancy F Hansen, Kevin Harnish, Chris-
tian D Haudenschild, Narinder I Heyer, Matthew M Hims, Johnny T Ho,
Adrian M Horgan, Katya Hoschler, Steve Hurwitz, Denis V Ivanov, Maria Q
Johnson, Terena James, T A Huw Jones, Gyoung-Dong Kang, Tzvetana H
Kerelska, Alan D Kersey, Irina Khrebtukova, Alex P Kindwall, Zoya Kings-
bury, Paula I Kokko-Gonzales, Anil Kumar, Marc A Laurent, Cynthia T
Lawley, Sarah E Lee, Xavier Lee, Arnold K Liao, Jennifer A Loch, Mitch Lok,
Shujun Luo, Radhika M Mammen, John W Martin, Patrick G McCauley,
Paul McNitt, Parul Mehta, Keith W Moon, Joe W Mullens, Taksina New-
ington, Zemin Ning, Bee Ling Ng, Sonia M Novo, Michael J O’Neill, Mark A
Osborne, Andrew Osnowski, Omead Ostadan, Lambros L Paraschos, Lea
Pickering, Andrew C Pike, Alger C Pike, D Chris Pinkard, Daniel P Pliskin,
Joe Podhasky, Victor J Quijano, Come Raczy, Vicki H Rae, Stephen R
Rawlings, Ana Chiva Rodriguez, Phyllida M Roe, John Rogers, Maria C
Rogert Bacigalupo, Nikolai Romanov, Anthony Romieu, Rithy K Roth, Na-
talie J Rourke, Silke T Ruediger, Eli Rusman, Raquel M Sanches-Kuiper,
Martin R Schenker, Josefina M Seoane, Richard J Shaw, Mitch K Shiver,
Steven W Short, Ning L Sizto, Johannes P Sluis, Melanie A Smith, Jean
Ernest Sohna Sohna, Eric J Spence, Kim Stevens, Neil Sutton, Lukasz Sza-
jkowski, Carolyn L Tregidgo, Gerardo Turcatti, Stephanie Vandevondele,
Yuli Verhovsky, Selene M Virk, Suzanne Wakelin, Gregory C Walcott, Jing-
wen Wang, Graham J Worsley, Juying Yan, Ling Yau, Mike Zuerlein, Jane
Rogers, James C Mullikin, Matthew E Hurles, Nick J McCooke, John S
West, Frank L Oaks, Peter L Lundberg, David Klenerman, Richard Durbin,
and Anthony J Smith. Accurate whole human genome sequencing using
reversible terminator chemistry. Nature, 456(7218):53–9, November 2008.
[BML+ 10] S. Balzer, K. Malde, A. Lanzen, A. Sharma, and I. Jonassen. Characteris-
tics of 454 pyrosequencing data–enabling realistic simulation with flowsim.
Bioinformatics, 26(18):i420–i425, September 2010.
[DLBH08] Juliane C Dohm, Claudio Lottaz, Tatiana Borodina, and Heinz Himmel-
bauer. Substantial biases in ultra-short read data sets from high-throughput
DNA sequencing. Nucleic acids research, 36(16):e105, September 2008.
[HWRE10] Manuel Holtgrewe, David Weese, Knut Reinert, and Anne-Kathrin Emde.
Benchmark for Second-Generation Read Mapping. Unpublished., 2010.
[LHW+ 09] Heng Li, Bob Handsaker, Alec Wysoker, Tim Fennell, Jue Ruan, Nils
Homer, Gabor Marth, Goncalo Abecasis, and Richard Durbin. The Sequence
Alignment/Map format and SAMtools. Bioinformatics (Oxford, England),
25(16):2078–9, August 2009.
[MEA+ 05] Marcel Margulies, Michael Egholm, William E Altman, Said Attiya, Joel S
Bader, Lisa A Bemben, Jan Berka, Michael S Braverman, Yi-Ju Chen,
Zhoutao Chen, Scott B Dewell, Lei Du, Joseph M Fierro, Xavier V Gomes,
Brian C Godwin, Wen He, Scott Helgesen, Chun Heen Ho, Chun He Ho,
Gerard P Irzyk, Szilveszter C Jando, Maria L I Alenquer, Thomas P Jarvie,
Kshama B Jirage, Jong-Bum Kim, James R Knight, Janna R Lanza, John H
Leamon, Steven M Lefkowitz, Ming Lei, Jing Li, Kenton L Lohman, Hong
Lu, Vinod B Makhijani, Keith E McDade, Michael P McKenna, Eugene W
Myers, Elizabeth Nickerson, John R Nobile, Ramona Plant, Bernard P
Puc, Michael T Ronan, George T Roth, Gary J Sarkis, Jan Fredrik Si-
mons, John W Simpson, Maithreyan Srinivasan, Karrie R Tartaro, Alexan-
der Tomasz, Kari A Vogt, Greg A Volkmer, Shally H Wang, Yong Wang,
Michael P Weiner, Pengguang Yu, Richard F Begley, and Jonathan M Roth-
berg. Genome sequencing in microfabricated high-density picolitre reactors.
Nature, 437(7057):376–80, September 2005.
[Mye99] Gene Myers. A dataset generator for whole genome shotgun sequencing.
Proceedings / ... International Conference on Intelligent Systems for Molec-
ular Biology ; ISMB. International Conference on Intelligent Systems for
Molecular Biology, pages 202–10, January 1999.
[RKD+ 09] Tobias Rausch, Sergey Koren, Gennady Denisov, David Weese, Anne-Katrin
Emde, Andreas Döring, and Knut Reinert. A consistency-based consensus
algorithm for de novo and reference-guided sequence assembly of short reads.
Bioinformatics (Oxford, England), 25(9):1118–24, May 2009.
[ROA+ 08] Daniel C Richter, Felix Ott, Alexander F Auch, Ramona Schmid, and
Daniel H Huson. MetaSim: a sequencing simulator for genomics and metage-
nomics. PloS one, 3(10):e3373, January 2008.
[WER+ 09] David Weese, Anne-Katrin Emde, Tobias Rausch, Andreas Döring, and Knut
Reinert. RazerS–fast read mapping with sensitivity control. Genome Res,
19(9):1646–1654, September 2009.
A. Positional Error Rate Plots
[Figure 1: Positional mismatch error rates; x-axis: position, y-axis: error rate [%].
(a) Sets A: SRR026674_1, SRR049254_1, SRR038098_1, SRR003673.
(b) Sets B: SRR026674_1, SRR026675_1, SRR026676_1.]
[Figure 2: Positional insertion error rates; same axes and panels (a) and (b) as Figure 1.]
[Figure 3: Positional deletion error rates; same axes and panels (a) and (b) as Figure 1.]
B. Positional Quality Value Plots
[Plots: mean quality (y-axis, 0 to 60) by position (x-axis), in five pairs of panels.
The first panel of each pair shows sets A (SRR026674_1, SRR049254_1, SRR038098_1,
SRR003673), the second shows sets B (SRR026674_1, SRR026675_1, SRR026676_1).]