Discovery of Unfixed Endogenous Retrovirus Insertions in Diverse Human Populations
Contributed by John M. Coffin, February 11, 2016 (sent for review November 25, 2015; reviewed by Norbert Bannert, Robert Belshaw, and Jack Lenz)
MICROBIOLOGY

Endogenous retroviruses (ERVs) have contributed to more than 8% of the human genome. The majority of these elements lack function due to accumulated mutations or internal recombination resulting in a solitary (solo) LTR, although members of one group of human ERVs (HERVs), HERV-K, were recently active with members that remain nearly intact, a subset of which is present as insertionally polymorphic loci that include approximately full-length (2-LTR) and solo-LTR alleles in addition to the unoccupied site. Several 2-LTR insertions have intact reading frames in some or all genes that are expressed as functional proteins. These properties reflect the activity of HERV-K and suggest the existence of additional unique loci within humans. We sought to determine the extent to which other polymorphic insertions are present in humans, using sequenced genomes from the 1000 Genomes Project and a subset of the Human Genome Diversity Project panel. We report analysis of a total of 36 nonreference polymorphic HERV-K proviruses, including 19 newly reported loci, with insertion frequencies ranging from <0.0005 to >0.75 that varied by population. Targeted screening of individual loci identified three new unfixed 2-LTR proviruses within our set, including an intact provirus present at Xq21.33 in some individuals, with the potential for retained infectivity.

HERV-K | HML-2 | human endogenous retrovirus | 1000 Genomes Project | Human Genome Diversity Project

Significance

The human endogenous retrovirus (HERV) group HERV-K contains nearly intact and insertionally polymorphic integrations among humans, many of which code for viral proteins. Expression of such HERV-K proviruses occurs in tissues associated with cancers and autoimmune diseases, and in HIV-infected individuals, suggesting possible pathogenic effects. Proper characterization of these elements necessitates the discrimination of individual HERV-K loci; such studies are hampered by our incomplete catalog of HERV-K insertions, motivating the identification of additional HERV-K copies in humans. By examining >2,500 sequenced genomes, we have discovered 19 previously unidentified HERV-K insertions, including an intact provirus without apparent substitutions that would alter viral function, only the second such provirus described. Our results provide a basis for future studies of HERV evolution and implication for disease.

During a retrovirus infection, a DNA copy of the viral RNA genome is permanently integrated into the nuclear DNA of the host cell as a provirus. The provirus is flanked by short target site duplications (TSDs), and consists of an internal region encoding the genes for replication that is flanked by identical LTRs. Infection of cells contributing to the germ line may result in a provirus that is transmitted to progeny as an endogenous retrovirus (ERV), and may reach population fixation (1). Indeed, more than 8% of the human genome is recognizably of retroviral origin (2). The majority of human ERVs (HERVs) represent ancient events and lack function due to accumulated mutations or deletions, or from recombination leading to the formation of a solitary (solo) LTR; however, several HERVs have been coopted for physiological functions to the host (3).

The HERV-K (HML-2) proviruses (4–9), so-named for their use of a Lys tRNA primer and similarity to the mouse mammary tumor virus (human MMTV like) (10), represent an exception to the antiquity of most HERVs. HML-2 has contributed to at least 120 human-specific insertions, and population-based surveys indicate as many as 15 unfixed sites, including 11 loci with more or less full-length proviruses (5, 6, 8, 9). To distinguish the latter from recombinant solo-LTRs, we refer to these elements as “2-LTR” insertions throughout this study. The majority of these insertions are estimated to have occurred within the past ∼2 My, the youngest after the appearance of anatomically modern humans (4, 8, 11). Population modeling has implied a relatively constant rate of HML-2 accumulation since the Homo-Pan divergence (5, 12, 13). All known insertionally polymorphic HML-2 proviruses have signatures of purifying selection, implying ongoing exogenous replication, and retain one or more ORFs (8, 13–15). HML-2 expression has been observed in tumor-derived tissues as well as normal placenta in the form of RNAs, proteins, and noninfectious retrovirus-like particles (3, 16–19). These unique properties raise the possibility that some HML-2 group members are still capable of replication by exogenous transmission from rare intact proviruses, from the generation of infectious recombinants via copackaged viral RNAs, or from rare viruses still in circulation in some populations. A naturally occurring infectious provirus has yet to be observed, although the well-studied “K113” provirus, which is not in the GRCh37 (hg19) reference genome but maps to chr19:21,841,544, has intact ORFs (9) and engineered recombinant HML-2 proviruses are infectious in cell types, including human cells (20, 21). The goal of this study was to enhance our understanding of such elements by identifying and characterizing additional polymorphic HML-2 insertions in the population.

The wealth of available human whole-genome sequence (WGS) data should, in principle, provide the information needed to identify transposable elements (TEs), including proviruses, in the sequenced population. However, algorithms for routine analysis of short-read (e.g., Illumina) paired-end sequence data exclude reads that do not match the reference genome. Based on read

Author contributions: J.H.W., Z.H.W., J.M.K., and J.M.C. designed research; J.H.W., Z.H.W., M.M., and R.P.S. performed research; J.H.W., Z.H.W., M.M., and R.P.S. contributed new reagents/analytic tools; J.H.W., Z.H.W., R.P.S., J.M.K., and J.M.C. analyzed data; and J.H.W., Z.H.W., J.M.K., and J.M.C. wrote the paper.

Reviewers: N.B., Robert Koch Institute; R.B., University of Plymouth; and J.L., Albert Einstein Medical School.

The authors declare no conflict of interest.

Data deposition: The sequences reported in this paper have been deposited in the GenBank database (accession nos. KU054242–KU054309).

¹J.H.W. and Z.H.W. contributed equally to this work.

²To whom correspondence should be addressed. Email: [email protected].

This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.1602336113/-/DCSupplemental.

Downloaded by guest on June 18, 2020
the hg19 reference to confirm the position of the preintegration, or empty, site per call.

(10q24.2b and 15q13.1), or insertions that could not be mapped to the hg19 assembly (10q26.3, 12q24.32, and 22q11.23b).

First, we identified insertions based on read pair signatures using the program RetroSeq (28) (Fig. 1, Left). To improve the detection of insertions present in multiple samples, we combined reads within a population (1KGP) or study (HGDP) (32). Excluding calls within ±500 bp of a reference HML-2 sequence, we obtained 140.3 ± 56.1 candidate calls per pool. Next, we applied

Validation and Sequencing. We validated the presence of 34 of the 36 candidate insertions in at least one individual predicted to have the insertion (Table 1 and Dataset S1). The remaining two sites (at 10q24.2 and 15q13.1) were predicted to have an unusual inverted repeat structure based on assemblies of supporting reads at either site (Fig. S2), and could not be conclusively confirmed by sequencing, possibly due to hairpin formation. For the 34 validated nonreference sites, we confirmed 29 sites as having solo-LTRs and five sites with 2-LTR proviruses (at 8q24.3c,
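The exclusion step mentioned above (discarding candidate calls that fall within ±500 bp of a reference HML-2 sequence) can be illustrated with a minimal Python sketch. This is not the RetroSeq implementation or the pipeline used in this study; the function names, the tuple layouts, and the coordinates in the usage note are hypothetical and chosen only for illustration.

```python
def near_reference(call, ref_elements, window=500):
    """Return True if a candidate call lies within `window` bp of any
    annotated reference element on the same chromosome.

    call         -- (chrom, position) tuple (hypothetical layout)
    ref_elements -- iterable of (chrom, start, end) tuples (hypothetical layout)
    """
    chrom, pos = call
    for ref_chrom, start, end in ref_elements:
        if chrom == ref_chrom and start - window <= pos <= end + window:
            return True
    return False


def filter_calls(calls, ref_elements, window=500):
    """Keep only the candidate calls that are not near a reference element."""
    return [c for c in calls if not near_reference(c, ref_elements, window)]
```

For example, with a single made-up reference element `("chr1", 1000, 2000)`, a call at `("chr1", 600)` would be excluded and a call at `("chr1", 499)` would be retained. A linear scan is adequate for a few hundred calls per pool; an interval index would be the natural choice for larger inputs.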
primer (black arrow) to infer the presence of a full-length allele. Representative products are shown in a genotyping gel to the right.

searched for, but did not find evidence of, this site in subsequent PCR screens of other samples.

*Reported originally in the sequenced Neandertal (Ne) or Denisovan (De) by Agoni et al. (42) or Lee et al. (51), or in modern humans (K) by Marchi et al. (12) or Lee et al. (41).
†Alleles detected. LTR, solo LTR; pre, preinsertion site; pro, 2-LTR provirus.
‡Previously PCR validated as solo-LTR by Lee et al. (41).
§Insertion is located within an encompassing structural variant not present in the hg19 reference.
Estimated Frequencies of Unfixed HML-2 Loci. We performed in silico read-based genotyping to obtain estimations of the allele frequencies of 27 nonreference insertions with clear integration coordinates, and extended the analysis to include 13 annotated polymorphic HML-2 loci from the hg19 human reference (5, 8) (Dataset S2). Briefly, reference and alternate alleles representing each HML-2 locus were recreated, and individual genotypes were then inferred based on the remapping of proximal Illumina reads to the reconstructed alleles per site per sample (Materials and Methods). Given the larger size of the HML-2 LTR (∼968 bp) and relatively short reads in these data, 2-LTR and solo-LTR insertions are indistinguishable in read-based genotyping alone, such that genotypes were based on the presence or absence of the HML-2 insertion at each locus. Values reported below correspond to allele frequencies unless otherwise noted.

Estimated frequencies of the variable HML-2 insertions present in the reference genome ranged from ∼0.25 to >0.99 in genotyped samples (Fig. 2, Upper). Sites with the highest estimated frequencies corresponded to those loci previously reported with a solo-LTR or provirus present, but not a preinsertion site, based on limited PCR screens of those sites (6) (at 1p31.1, 3q13.2, 7p22.1, 12q14.1, and 6q14.1 in Fig. 2). This pattern is consistent with variability at these sites based predominantly on the 2-LTR and solo-LTR states. Genotyping of the insertions at 11q22.1 and 8p23.1a (K115) implied the presence of both insertion and preinsertion alleles, also consistent with PCR screens in other reports (6, 9, 33, 45), noting the higher frequency of K115 within our samples (∼53%) than in those reports (up to ∼34% depending on ancestry). Four unfixed reference solo-LTRs ranged in frequencies from ∼0.25 to as high as ∼0.93, also consistent with previous analysis of these sites (5). Extending the analysis to the 85 remaining human-specific HML-2 insertions that are suitable for genotyping in the human reference (81 solo-LTRs and four full-length proviruses) (5, 8) was consistent with sample-wide fixation among the vast majority of these loci; just eight loci had evidence of the nonreference allele among genotyped samples (Fig. S4 and Dataset S3).

Estimated frequencies of the nonreference HML-2 insertions were inferred to be from <0.0005 (the insertion having clear support in one or few individuals) to >0.75 of genotyped samples (Fig. 2, Lower). More than half of the nonreference HML-2 insertions were rare, with 15 insertions detected at frequencies of <5% and six insertions in <1% of all samples; just four of these loci have been previously reported (12). Sites with the lowest allele frequencies were predominantly in individuals of African ancestry, with nine of 13 loci inferred in <5% of all samples but mostly limited to African populations, although insertions were also detected in non-African samples at ∼0.005 to ∼0.016 in those populations (e.g., at 5q14.1 and Xq21.33 in Fig. 3). The solo-LTR insertion at 1p31.1c was only identified in a single sample and was not detected in any other sample by genotyping; however, this observation does not exclude the possibility of its presence in some individuals, given the variability in read coverage between samples (Discussion). Nine of the 10 common insertions (detected in >10% of all samples), including the K113 provirus, have been previously reported in searches of WGS data (12, 41). A comparison of the overall presence of each HML-2 insertion, calculated as the proportion of individuals with evidence of the insertion, was generally in agreement with those reports (Fig. S5). The presence of K113 was estimated at a higher prevalence across samples here than in previous reports, in ∼27% of all samples and as high as ∼52% in African
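The read-based genotyping outlined in this section (reads remapped to reconstructed insertion and preinsertion alleles, with allele frequencies then computed from the inferred genotypes) can be illustrated with a minimal Python sketch. The actual procedure is described in Materials and Methods and is not reproduced here; the binomial read model, the fixed per-read error rate, the diploid assumption (which would not hold for X-linked sites in males), and all function names are assumptions made for illustration only.

```python
import math


def genotype_log_likelihoods(empty_reads, insertion_reads, error=0.01):
    """Log-likelihoods for 0, 1, or 2 copies of the insertion allele.

    Assumes each read independently supports the insertion allele with
    probability copies/2, adjusted by a symmetric error rate. The binomial
    coefficient is omitted because it is the same for every genotype.
    """
    log_liks = []
    for copies in (0, 1, 2):
        frac = copies / 2
        p_ins = frac * (1 - error) + (1 - frac) * error
        log_liks.append(insertion_reads * math.log(p_ins)
                        + empty_reads * math.log(1 - p_ins))
    return log_liks


def call_genotype(empty_reads, insertion_reads, error=0.01):
    """Return the most likely insertion copy number, or None without reads."""
    if empty_reads + insertion_reads == 0:
        return None
    log_liks = genotype_log_likelihoods(empty_reads, insertion_reads, error)
    return max(range(3), key=lambda g: log_liks[g])


def allele_frequency(genotypes):
    """Insertion allele frequency over samples with a called genotype."""
    called = [g for g in genotypes if g is not None]
    if not called:
        return None
    return sum(called) / (2 * len(called))
```

Samples without informative reads are left uncalled rather than scored as lacking the insertion, which mirrors the caveat in the text that an insertion missed by genotyping may still be present when read coverage varies between samples.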
due to deletion of the majority of its 5′ LTR (Fig. 4). This truncation has also been observed in a few reference LTR5Hs

tions, and gray bars indicate the elements for which the frequency could not be determined.

being the noninfectious K113 (19p12b) that shares 98.9% amino acid identity to HERV-KCON (9). Its potential for generation of infectious virus is currently under investigation.

estimations are shown for each site. n.d., not determined. The black vertical line indicates a frameshift mutation (as indicated “+1 bp”); black lines with asterisks are used to indicate positions of stop codons where present. Reading frames are shown for the Xq21.33 2-LTR provirus as colored as in A. Black vertical lines within the frames indicate the positions of base changes that are observed in other full-length HML-2 proviruses. Red vertical lines are used to indicate base changes that are unique to the sequenced Xq21.33 provirus.

Discussion

We report 36 nonreference HML-2 insertions, including 19 previously unidentified loci, from analysis of WGS read data from more than 2,500 globally sampled individuals. Seventeen of the 36 sites were recently reported in humans (12, 41), although with limited validation or element characterization. Here, we take full advantage of the 1KGP and HGDP WGS read data to identify nonreference viral-genome junctions from assembled anchored read pairs and individual unmapped reads, and use these data to estimate the presence of each of these elements within our sampled populations. We validated the presence of 34 of the 36 loci, including five loci with 2-LTR proviruses (including K113) and 29 solo-LTRs, and report the complete sequences for 30 of these insertions, including a 2-LTR provirus at Xq21.33 that appears to be intact. We provide a thorough analysis of unfixed HML-2 insertions that complements and builds on previous studies, and should enable future examination of the HML-2 group.

We used the available reads from each sample for in silico genotyping of a subset of sites to infer the population-wide frequencies of unfixed HML-2 elements, which is impractical on this scale in standard PCR-based screens. The inferred allele frequencies of the nonreference insertions ranged from <0.05% to >75% of genotyped samples and varied between populations, generally with the highest presence in African populations. With the exception of two previously sequenced sites in our set (dup1 and 12q24.11), all nonreference insertions were validated in samples of African ancestry, as has been observed for all HERV-K loci characterized to date, implying their insertion before the human migration out of Africa ∼45,000–60,000 y ago (50). These two insertions could not be confidently mapped to the hg19 reference, and were therefore excluded from genotyping. All but one nonreference insertion was identified in more than one individual, with the exception being the 1p31.1c solo-LTR validated in NA18867. Genotyping of that site failed to reveal (but does not rule out) its presence in other individuals. Analysis of the surrounding region revealed the presence of several SNPs that were unique to NA18867 within the 1KGP panel, suggesting that 1p31.1c may be associated with a very rare haplotype, rather than a de novo event, in the absence of comprehensive screening. These observations support the utility of short read data for element discoveries and sequence-based analysis, but also underscore the necessity of additional experimental validation steps and characterization of candidate proviruses.

Eight of our validated loci have been recently reported in the genomes of two sequenced archaic samples (42, 51) in addition to modern humans (12). We confirmed an additional three reported “archaic” sites in our data [19p12e and 10q24.2b, respectively: “De11” and “De12” in the study by Agoni et al. (42); 19q13.43: “Ne5” in the study by Lee et al. (51)], but found no evidence of the remaining eight reported archaic events. Properties of these 11 HML-2 loci are more consistent with insertion before the most recent ancestor with modern humans ∼0.6–0.8 Mya (52) than with introgression. For example, the 2-LTR insertions at 19p12e and Xq21.33 are most prevalent in samples of African ancestry, and LTR divergences indicate their respective insertions to have been ∼1.8–3.3 Mya and ∼0.67–1.3 Mya, consistent with this time frame. Both sites are rare, with sample-wide allele frequencies estimated at 0.0103 and 0.0157 (∼0.026–0.069 in the African sample) in our data (Dataset S3). Of the remaining genotyped loci also in archaic genomes, each was also most represented in African ancestry, with exception of the insertions at 11q12.2 and 5q14.1 (sample-wide allele frequencies estimated at ∼0.046 and 0.026) that appeared most frequently in populations from the Americas or of East Asian ancestries but are also present in African populations, again implying ancient events (50). Given their overall distribution, it is likely these insertions are also older, although our ability to estimate insertion times is limited, given their presence as solo-LTRs.

We confirmed the presence of full-length proviruses at four loci, including the Xq21.33 provirus, which appears to be intact and without obvious defects, which implies the potential for replication competence and is now under further investigation. Given a genomic mutation rate of ∼2.2 × 10⁻⁹ changes per site per year (53), an ERV could maintain infectivity over very long periods, and a number of infectious ERVs are known in other
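To give a sense of scale for the mutation-rate argument above, the following Python sketch works through the arithmetic. Only the rate of ∼2.2 × 10⁻⁹ changes per site per year (53) and the ∼968-bp LTR length come from the text. The provirus length passed in the usage note, the Poisson model for the number of changes, and the function names are assumptions for illustration. The LTR-based age function uses the generic relation that two LTRs identical at integration diverge at twice the per-site rate; the insertion times reported in this study may rest on a different rate calibration that is not shown in this excerpt.

```python
import math

# Rate quoted in the text (53): changes per site per year.
MUTATION_RATE = 2.2e-9


def expected_changes(length_bp, years, rate=MUTATION_RATE):
    """Expected number of changes in a sequence of length_bp after `years`."""
    return rate * length_bp * years


def prob_unchanged(length_bp, years, rate=MUTATION_RATE):
    """Probability of zero changes, assuming Poisson-distributed changes."""
    return math.exp(-expected_changes(length_bp, years, rate))


def ltr_divergence_age(differences, ltr_length_bp, rate=MUTATION_RATE):
    """Approximate years since integration from 5'/3' LTR differences.

    Both LTRs accumulate changes independently, so their pairwise
    divergence grows at twice the per-site rate.
    """
    divergence = differences / ltr_length_bp
    return divergence / (2 * rate)
```

Under these assumptions, `expected_changes(9500, 1_000_000)` gives about 21 changes across a hypothetical 9,500-bp provirus per million years, and `ltr_divergence_age(2, 968)` gives roughly 470,000 y for two differences between 968-bp LTRs.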
1. Boeke JD, Stoye JP (1997) Retrotransposons, endogenous retroviruses, and the evolution of retroelements. Retroviruses, eds Hughes S, Varmus H (Cold Spring Harbor Laboratory Press, Plainview, NY), pp 343–435.
2. McPherson JD, et al.; International Human Genome Mapping Consortium (2001) A physical map of the human genome. Nature 409(6822):934–941.
3. Jern P, Coffin JM (2008) Effects of retroviruses on host genome function. Annu Rev Genet 42:709–732.
4. Barbulescu M, et al. (1999) Many human endogenous retrovirus K (HERV-K) proviruses are unique to humans. Curr Biol 9(16):861–868.
5. Belshaw R, et al. (2005) Genomewide screening reveals high levels of insertional polymorphism in the human endogenous retrovirus family HERV-K(HML2): Implications for present-day activity. J Virol 79(19):12507–12514.
6. Hughes JF, Coffin JM (2004) Human endogenous retrovirus K solo-LTR formation and insertional polymorphisms: Implications for human and viral evolution. Proc Natl Acad Sci USA 101(6):1668–1672.
7. Medstrand P, Mager DL (1998) Human-specific integrations of the HERV-K endogenous retrovirus family. J Virol 72(12):9782–9787.
8. Subramanian RP, Wildschutte JH, Russo C, Coffin JM (2011) Identification, characterization, and comparative genomic distribution of the HERV-K (HML-2) group of human endogenous retroviruses. Retrovirology 8:90.
9. Turner G, et al. (2001) Insertional polymorphisms of full-length endogenous retroviruses in humans. Curr Biol 11(19):1531–1535.
10. Ross SR (2008) MMTV infectious cycle and the contribution of virus-encoded proteins to transformation of mammary tissue. J Mammary Gland Biol Neoplasia 13(3):299–307.
11. Jha AR, et al. (2011) Human endogenous retrovirus K106 (HERV-K106) was infectious after the emergence of anatomically modern humans. PLoS One 6(5):e20234.
12. Marchi E, Kanapin A, Magiorkinis G, Belshaw R (2014) Unfixed endogenous retroviral insertions in the human population. J Virol 88(17):9529–9537.
13. Belshaw R, et al. (2004) Long-term reinfection of the human genome by endogenous retroviruses. Proc Natl Acad Sci USA 101(14):4894–4899.
26. Martin AR, et al. (2014) Transcriptome sequencing from diverse human populations reveals differentiated regulatory architecture. PLoS Genet 10(8):e1004549.
27. McKenna A, et al. (2010) The Genome Analysis Toolkit: A MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res 20(9):1297–1303.
28. Keane TM, Wong K, Adams DJ (2013) RetroSeq: Transposable element discovery from next-generation sequencing data. Bioinformatics 29(3):389–390.
29. Jurka J, et al. (2005) Repbase Update, a database of eukaryotic repetitive elements. Cytogenet Genome Res 110(1-4):462–467.
30. Smit AFA, Hubley R, Green P (2013) RepeatMasker Open-4.0. Available at www.repeatmasker.org. Accessed August 5, 2013.
31. Huang X, Madan A (1999) CAP3: A DNA sequence assembly program. Genome Res 9(9):868–877.
32. Wildschutte JH, Baron A, Diroff NM, Kidd JM (2015) Discovery and characterization of Alu repeat sequences via precise local read assembly. Nucleic Acids Res 43(21):10292–10307.
33. Wildschutte JH, Ram D, Subramanian R, Stevens VL, Coffin JM (2014) The distribution of insertionally polymorphic endogenous retroviruses in breast cancer patients and cancer-free controls. Retrovirology 11(1):62.
34. Larkin MA, et al. (2007) Clustal W and Clustal X version 2.0. Bioinformatics 23(21):2947–2948.
35. Tamura K, Stecher G, Peterson D, Filipski A, Kumar S (2013) MEGA6: Molecular Evolutionary Genetics Analysis version 6.0. Mol Biol Evol 30(12):2725–2729.
stander or tumorigenic accomplice? Int J Cancer 137(6):1249–1257.
50. Henn BM, Cavalli-Sforza LL, Feldman MW (2012) The great human expansion. Proc Natl Acad Sci USA 109(44):17758–17764.
51. Lee A, et al. (2014) Novel Denisovan and Neanderthal retroviruses. J Virol 88(21):12907–12909.
52. Reich D, et al. (2010) Genetic history of an archaic hominin group from Denisova Cave in Siberia. Nature 468(7327):1053–1060.
53. Kumar S, Subramanian S (2002) Mutation rates in mammalian genomes. Proc Natl Acad Sci USA 99(2):803–808.
54. Young GR, et al. (2012) Resurrection of endogenous retroviruses in antibody-deficient mice. Nature 491(7426):774–778.
55. Matsuda H, Hamet P, Tremblay J (2014) Hypertension-related, calcium-regulated gene (HCaRG/COMMD5) and kidney diseases: HCaRG accelerates tubular repair. J Nephrol 27(4):351–360.
56. Itahana Y, et al. (2015) The uric acid transporter SLC2A9 is a direct target gene of the tumor suppressor p53 contributing to antioxidant defense. Oncogene 34(14):1799–1810.
57. Stacey D, et al.; IMAGEN Consortium (2012) RASGRF2 regulates alcohol-induced reinforcement by influencing mesolimbic dopamine neuron activity and dopamine release. Proc Natl Acad Sci USA 109(51):21128–21133.
58. Parson J (1995) Miropeats: Graphical DNA sequence comparisons. Comput Appl Biosci 11(6):615–619.