Jiz 286
SUPPLEMENT ARTICLE
Next generation sequencing (NGS) combined with bioinformatics has successfully been used in a vast array of analyses for infectious
disease research of public health relevance. For instance, NGS and bioinformatics approaches have been used to identify outbreak
origins, track transmissions, investigate epidemic dynamics, determine etiological agents of a disease, and discover novel human
pathogens. However, implementation of high-quality NGS and bioinformatics in research and public health laboratories can be
challenging. These challenges mainly include the choice of the sequencing platform and the sequencing approach, the choice of
bioinformatics methodologies, access to the appropriate computation and information technology infrastructure, and recruiting and
retaining personnel with the specialized skills and experience in this field. In this review, we summarize the most common NGS and
bioinformatics workflows in the context of infectious disease genomic surveillance and pathogen discovery, and highlight the main
challenges and considerations for setting up an NGS and bioinformatics-focused infectious disease research public health
laboratory. We describe the most commonly used sequencing platforms and review their strengths and weaknesses. We review sequencing
approaches that have been used for various pathogens and study questions, as well as the most common difficulties associated with
these approaches that should be considered when implementing in a public health or research setting. In addition, we provide a
review of some common bioinformatics tools and procedures used for pathogen discovery and genome assembly, along with the most
common challenges and solutions. Finally, we summarize the bioinformatics of advanced viral, bacterial, and parasite pathogen
characterization, including types of study questions that can be answered when utilizing NGS and bioinformatics.
Keywords. bioinformatics; public health; infectious disease; capacity building; pathogen discovery; genome assembly;
metagenomics; advanced characterization; next generation sequencing; high-throughput sequencing.
Next-generation sequencing (NGS) technology, or high-throughput sequencing, combined with bioinformatics has become a powerful tool for detection, identification, and analyses of human pathogens. Its advantages over conventional methods are many, as sequences produced can be used for more accurate detection and characterization of pathogens, screening for presence of resistance mutations/genes, vaccine escape variants, recombination or reassortment, and virulence and pathogenicity factors [1–10]. The assembly and analyses of pathogen genomes can shed light on pathogen spread, contact tracing, dynamics of epidemics, and even possible sources, times, and geographic origins of pathogen emergence [11–17]. This, coupled with improvements in sequencing error rates and simpler laboratory approaches, and the decreasing costs of NGS and computational requirements, has made NGS and bioinformatics a more achievable and increasingly desirable feature of research and public health laboratories around the world. However, NGS is powerful but complex and nuanced, requiring significant experience and expertise for production of accurate and informative results. In addition, implementation of NGS and bioinformatics methods as routine surveillance and tracking tools necessitates specialized information technology (IT) and quality management systems that can meet the goals of public health laboratories.
Many challenges exist in setting up a high-quality NGS and bioinformatics laboratory capacity, such as choosing the right sequencing platform, wet lab sequencing method, bioinformatics analyses tools, personnel with the right kind of skills and experience, and computational and IT infrastructure to
Correspondence: Irina Maljkovic Berry, PhD, Walter Reed Army Institute of Research, 503 Robert Grant Ave, 20910, Silver Spring, MD ([email protected]).
The Journal of Infectious Diseases® 2020;221(S3):S292–307
Published by Oxford University Press for the Infectious Diseases Society of America 2019.
This work is written by (a) US Government employee(s) and is in the public domain in the US.
DOI: 10.1093/infdis/jiz286
Sequencing Platform/Year Released | Applications | Observed Final Error Rate, % | Runtime | Computational Resources | Advantages | Disadvantages
Sanger ABI 3730xl/2002 | A | ~0.1 | 20 min–48 h | None needed | High quality, long reads, low cost for small studies | Low throughput, high cost, substitution errors, sequenced material has to be pure to produce good-quality sequence data
PacBio RSII/2010 | V, M, E, HE, RT, CP, EP | ~13 one pass; <1 multipass | 0.5–4 h | Cluster needed | Used in methylome research | Indels, large lab footprint, expensive
Ion Torrent/PGM318/2010 | A, V, M, E, HE, D, PS | ~1 | 4–7 h (chip) | Powerful desktop or cluster | Lower cost instrument, upgradable, simple machine | Higher error rate with homopolymer issues, more hands-on time, fewer overall reads, higher cost/MB, indel issues
ABI SOLiD | A, V, M, E, | ~5 one | 6–10 | Cluster needed | Independent flow cell lanes, | Longevity of platform, shorter reads, more gaps
Abbreviations: A, amplicon sequencing; C, ChIP-seq; CP, complex population sequencing; D, diagnostics; E, eukaryotic genome; EP, epigenetics; HE, human/exome genomics; M, microbial
genome; MB, mega base; ME, metagenomics; ML, methylation studies; MT, metatranscriptomics; PS, pathogen surveillance; RT, RNAseq/transcriptomics; SV, single nucleotide
polymorphism/variation studies; V, viral genome.
a https://nanoporetech.com/products/minit.
user support community. However, the field is increasingly demanding sequencing closer to the disease, and while the MinION provides portability, the high error rates and the continuous chemistry and software changes make this platform difficult to implement in routine public health surveillance laboratories. If used in a public health laboratory, the results may need to be validated with a different platform [15]. However, with further improvements of this technology, like the most recent advances in laboratory-independent sample extraction and library preparation, portable computational support (MinIT), and with additional error reduction and software stabilization, the MinION may be an excellent addition to the arsenal of current sequencing technologies for routine surveillance, especially in smaller laboratories with limited resources. For instance, the MinION was successfully used in the ZiBRA project for real-time Zika virus surveillance of mosquitoes and humans in Brazil, and in Guinea to perform real-time surveillance during the ongoing Ebola outbreak [12, 36]. For the Ebola outbreak, results were obtained within 24 hours of receiving a positive sample, and sequencing on the instrument took as little as 15 minutes, highlighting the potential of the MinION for a rapid response to an ongoing outbreak.
Sequencing Platform | Organism or Sample Type | Goal | Wet Lab Design | Software or Pipeline | Benefits Achieved
Illumina MiSeq | Dengue virus | Surveillance, transmission | Direct sample and viral isolate amplicon sequencing | ngs_mapper pipeline, PhyML, BEAST | Rapid surveillance design, prediction of burden of disease, intercountry movement [11, 13]
Illumina MiSeq | Enterovirus A71 | Surveillance | Viral isolate amplicon and random/unbiased sequencing | CLC Genomic Workbench, ClustalW, BLAST | Circulation of genogroup C, new genogroup E, genetic exchanges, emergence of pathogenic lineages, recombination [30]
Ion Torrent, GS-FLX/GS-Junior | Zika virus | Outbreak | Viral isolate amplicon sequencing | Mira, Geneious, MAFFT, Path-O-Gen, BEAST | Clarification of cross-border viral spread dynamics, hypothesis testing for viral origin, gene variant detection [14]
Abbreviations: CLIA, Clinical Laboratory Improvement Amendments; CSF, cerebrospinal fluid; FDA, Food and Drug Administration; MLST, multilocus sequence typing; PCR, polymerase chain
reaction; SNP, single nucleotide polymorphism; SOP, standard operating procedure.
LABORATORY APPROACHES AND SEQUENCING METHODS
In addition to the variety of sequencing platforms on the market, there are a variety of applications within NGS to consider. For instance, metagenomics and unbiased sequencing may be useful to broaden pathogen detection, elucidate unknown etiologic agents, or for sequencing of bacterial isolates [23, 35]. In cases of suspected low pathogen abundance or detection of pathogens in samples containing high host nucleic acid content, pathogen enrichment or host depletion procedures should be considered. Specific pathogen genomic amplification may be applied for samples where the agent is known, as is common
Figure 1. Schematic diagram of common sequencing laboratory workflows and approaches. Abbreviations: CRISPR, clustered regularly interspaced short palindromic
repeats; CSF, cerebrospinal fluid; QC, quality control.
background, while aiming to preserve the nucleic acid derived from the pathogens of interest. Degradation of genomic background can be performed through broad-spectrum digestion with nucleases, such as DNase I for DNA background, or by removing abundant RNA species (rRNA, mtRNA, globin mRNA) using sequence-specific RNA depletion kits [16, 45, 46]. Gu et al [47] used clustered regularly interspaced short palindromic repeats (CRISPR) approach to target and deplete human mitochondrial rRNA in clinical CSF samples, resulting in improved read coverage of meningitis and encephalitis-associated pathogens in those samples. Generally, however, subtraction approaches lead to a certain degree of loss of the targeted pathogen genome, as poor recovery may occur during the cleanup and additional enrichment steps [48]. These approaches may thus not initially be suitable for less experienced laboratories, or should be accompanied by an additional alternative approach.
A simpler approach resulting in less loss of target is positive enrichment, which is used to increase pathogen signal rather than removing background noise. This is commonly done through hybridization-based target capture by probes, which are used to pull out nucleic acid of interest for downstream amplification and sequencing. Probe-based enrichment has been used to allow for detection of viral genomes in Ebola virus outbreaks, Zika virus epidemics and respiratory virus surveillance [29, 45, 49, 50]. Pan-viral probes have been shown to successfully identify diverse types of pathogens in different clinical fluid and respiratory samples, and have been used for sequencing and characterization of novel viruses [51–54]. In a precision public health surveillance approach, Cummings et al [52] used pan-viral probe capture to enrich pathogens in samples from patients with influenza-negative severe acute respiratory infections (SARI). This approach resulted in identification of an unrecognized outbreak of measles-associated SARI, as well as detection of SARI associated with a novel picobirnavirus. Pan-viral probes can also be used for preemptive screening of environmental samples (of vector and animal origins) for existence of emerging and even novel pathogen threats, thereby
PhiX control analysis; Quality score and duplicate read removal; End trimming by quality; Read output statistics; Host removal; Control pathogen results; Background negative control removal
1 Analyses and tools tailored to less experienced personnel or laboratories with more limited computational resources
2 Analyses and tools that may require specific additional training and/or custom standard-operating procedures when used by less experienced personnel
3 Analyses and tools requiring more skilled/experienced personnel (command line, bash scripting, interpretation) and/or higher computational infrastructure
Bioinformatics workflow considerations:
• Available computational infrastructure
• Pipeline or analysis program requirements
• Pipeline output inspection/verification
• Reproducibility of computational outputs
Figure 2. Bioinformatics workflow and considerations for sequence analysis. Nondashed boxes describe analyses types and dashed boxes describe tools that can be used
for these analyses. Abbreviations: MLST, multilocus sequence typing; SNP, single nucleotide polymorphism.
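The read-level quality control steps named in Figure 2 (end trimming by quality, duplicate read removal, read output statistics) can be illustrated with a minimal Python sketch. It is not the method of any pipeline cited in this review. The Phred+33 quality encoding, the thresholds (min_q, min_len), the order of the steps, and all function names are assumptions made for illustration only.

```python
from io import StringIO
from typing import Dict, Iterator, List, Tuple

Read = Tuple[str, str, List[int]]  # (name, sequence, Phred quality scores)

PHRED_OFFSET = 33  # assumption: Phred+33 encoded FASTQ


def parse_fastq(handle) -> Iterator[Read]:
    """Yield reads from a FASTQ stream with exactly 4 lines per record."""
    while True:
        header = handle.readline().strip()
        if not header:
            return
        seq = handle.readline().strip()
        handle.readline()  # '+' separator line, ignored
        qual = handle.readline().strip()
        yield header[1:], seq, [ord(c) - PHRED_OFFSET for c in qual]


def trim_3prime(read: Read, min_q: int) -> Read:
    """Drop bases from the 3' end while their quality is below min_q."""
    name, seq, quals = read
    end = len(seq)
    while end > 0 and quals[end - 1] < min_q:
        end -= 1
    return name, seq[:end], quals[:end]


def remove_duplicates(reads: List[Read]) -> List[Read]:
    """Keep the first read for each distinct sequence (exact duplicates only)."""
    seen = set()
    unique = []
    for read in reads:
        if read[1] not in seen:
            seen.add(read[1])
            unique.append(read)
    return unique


def run_qc(handle, min_q: int = 20, min_len: int = 5) -> Tuple[List[Read], Dict[str, float]]:
    """Trim, length-filter, and deduplicate reads; return reads and summary statistics."""
    raw = list(parse_fastq(handle))
    trimmed = [trim_3prime(r, min_q) for r in raw]
    kept = [r for r in trimmed if len(r[1]) >= min_len]
    unique = remove_duplicates(kept)
    mean_len = sum(len(r[1]) for r in unique) / len(unique) if unique else 0.0
    stats = {
        "input_reads": len(raw),
        "after_trimming": len(kept),
        "after_dedup": len(unique),
        "mean_length": mean_len,
    }
    return unique, stats


# Toy input: 'I' encodes Phred 40 and '#' encodes Phred 2 under Phred+33.
EXAMPLE_FASTQ = """@r1
ACGTACGTAC
+
IIIIIIII##
@r2
ACGTACGT
+
IIIIIIII
@r3
TTGCA
+
I####
@r4
GGGCCCAAAT
+
IIIIIIIIII
"""

reads, stats = run_qc(StringIO(EXAMPLE_FASTQ))
print(stats)
```

In this toy input, r1 becomes identical to r2 after trimming and is counted once, and r3 is discarded because only one base survives trimming. Production workflows would normally rely on established, validated tools rather than hand-written code, and the output of each step should still be inspected as the workflow considerations in Figure 2 suggest.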
depth and comprehensiveness of sequencing, technique efficiency and analysis workflow robustness, data interpretation, as well as clinical (symptoms, epidemiological, environmental context) and pathogen biological insights. In a case report from Mongkolrattanothai et al [31] an 11-year-old patient with headache, back pain, and nausea went through several diagnoses including Epstein-Barr virus, human herpesvirus 7, residual complications from a recent Salmonella infection, and putative tuberculosis disease. Finally, metagenomics sequencing revealed presence of Brucella, which was then further confirmed by both PCR and agglutinin test. The persistent symptoms, NGS and PCR testing showing Brucella, and positive confirmatory serology allowed for a diagnosis of chronic neurobrucellosis [31]. Thus, the results of an NGS and bioinformatics metagenomic analysis, especially in diagnostic settings, should be confirmed with a different method, such as PCR.
Genome Assembly: Putting the Pieces Together
One of the main factors that plays a role in the accuracy and completeness of a genome assembly is sequencing read quality and read depth of coverage. These aspects differ between short-read and long-read platforms and sequencing approaches. If read quality and depth requirements are not met, the consensus sequence should not be considered. Usually, such incomplete/gapped genomes are filled by additional sequencing [4, 15]. The quality of the assembly will also be affected by the assembly algorithm used. Many genome assembly and consensus calling algorithms exist, and they vary greatly in their complexity, accuracy, speed, and flexibility (Table 2). In general, 2 main genome assembly approaches exist, reference-based (mapping based) assembly and de novo assembly.
Reference-based assembly is a very useful and accurate tool for assembly of known genomes, and can be especially beneficial for laboratories with limited computational capacity or those with high sequencing throughput and/or when time is of the essence [11, 12, 29]. For instance, reference mapping was used for assembly of >500 dengue genomes from Thailand, and combined with other data the results revealed that most of dengue infections are obtained close to home [11]. During a Legionella outbreak in a large Australian hospital, NGS and genome assembly through reference mapping was employed to, in real time, distinguish the bacterial outbreak isolates [23]. In a