Molecular Plant

Research Article

Widespread Whole Genome Duplications Contribute to Genome Complexity and Species Diversity in Angiosperms
Ren Ren1,5, Haifeng Wang2,5, Chunce Guo1,5, Ning Zhang1,3, Liping Zeng1, Yamao Chen1,
Hong Ma1,4,* and Ji Qi1,*
1 State Key Laboratory of Genetic Engineering and Collaborative Innovation Center for Genetics and Development, Ministry of Education Key Laboratory of Biodiversity Science and Ecological Engineering and Institute of Biodiversity Science, Institute of Plant Biology, Center for Evolutionary Biology, School of Life Sciences, Fudan University, Shanghai, China
2 State Key Laboratory of Ecological Pest Control for Fujian and Taiwan Crops, Fujian Agriculture and Forestry University, Fuzhou, Fujian 350002, China
3 Department of Botany, National Museum of Natural History, MRC 166, Smithsonian Institution, Washington, DC, USA
4 Institutes of Biomedical Sciences, Fudan University, Shanghai, China
5 These authors contributed equally to this article.
*Correspondence: Hong Ma ([email protected]), Ji Qi ([email protected])
https://doi.org/10.1016/j.molp.2018.01.002

ABSTRACT
Gene duplications provide evolutionary potentials for generating novel functions, while polyploidization or whole genome duplication (WGD) doubles the chromosomes initially and results in hundreds to thousands of retained duplicates. WGDs are strongly supported by evidence commonly found in many species-rich lineages of eukaryotes, and thus are considered as a major driving force in species diversification. We performed comparative genomic and phylogenomic analyses of 59 public genomes/transcriptomes and 46 newly sequenced transcriptomes covering major lineages of angiosperms to detect large-scale gene duplication events by surveying tens of thousands of gene family trees. These analyses confirmed most of the previously reported WGDs and provided strong evidence for novel ones in many lineages. The detected WGDs supported a model of exponential gene loss during evolution with an estimated half-life of approximately 21.6 million years, and were correlated with both the emergence of lineages with high degrees of diversification and periods of global climate changes. The new datasets and analyses detected many novel WGDs widely spread during angiosperm evolution, uncovered preferential retention of gene functions in essential cellular metabolisms, and provided clues for the roles of WGD in promoting angiosperm radiation and enhancing their adaptation to environmental changes.
Key words: whole genome duplication, duplicate gene, polyploidization, angiosperm, phylogenomics
Ren R., Wang H., Guo C., Zhang N., Zeng L., Chen Y., Ma H., and Qi J. (2018). Widespread Whole Genome
Duplications Contribute to Genome Complexity and Species Diversity in Angiosperms. Mol. Plant. 11, 414–428.

INTRODUCTION

Genomic variations, including single nucleotide polymorphisms and structural variations, provide raw materials for evolution of novel gene functions that can be retained due to drift and selection (Stebbins, 1999). Gene duplicates (GDs) initially have redundant functions and could potentially alter gene dosages and/or reshape genome structure (Long et al., 2003; Mitchell-Olds and Schmitt, 2006). GDs often result from polyploidizations or whole genome duplications (WGDs), which generate a copy of all genes (Pontes et al., 2004; Madlung et al., 2005), including genes involved in the same networks or pathways, with greater potential than single gene duplication for neo/subfunctionalizations of duplicates by the co-evolution of functionally related genes (Otto, 2007). WGDs are found in the histories of diverse eukaryotes, including Saccharomyces cerevisiae (Wolfe and Shields, 1997), Danio rerio (Postlethwait et al., 2000) and Arabidopsis thaliana (Blanc et al., 2000; Vision et al., 2000; Simillion et al., 2002; Bowers et al., 2003). Ancient WGDs have been strongly supported by large-scale phylogenomic analyses in the ancestors of vertebrates (Panopoulou and Poustka, 2005), seed plants, and angiosperms (Blanc and Wolfe, 2004; Doyle et al., 2008; Jiao et al., 2011). WGDs are thought to enhance organismal adaptation to environmental

Published by the Molecular Plant Shanghai Editorial Office in association with Cell Press, an imprint of Elsevier Inc., on behalf of CSPB and IPPE, SIBS, CAS.

414 Molecular Plant 11, 414–428, March 2018 © The Author 2018.
Genomic and Phylogenomic Analysis of WGDs Molecular Plant
challenges by offering genomic novelties and complexities (Hegarty and Hiscock, 2008) and promote reproductive isolation and species diversification via reciprocal loss or subfunctionalization of gene duplicates from WGDs in different populations of a species (Lynch and Conery, 2000; Sémon and Wolfe, 2007). However, recent polyploids seem to exhibit higher extinction rates than diploids, in part due to improper chromosome pairing in meiosis (Madlung et al., 2005) or lack of mating partners (Vanneste et al., 2014). Therefore, the impact of WGD on organismal diversity is not clear.

WGDs are particularly prevalent in angiosperms, the group with the highest biodiversity in land plants with more than 13 000 genera and 300 000 species (Christenhusz and Byng, 2016). Multiple polyploidization events have been uncovered during angiosperm history, including a whole genome triplication in the ancestor of core eudicots (Bowers et al., 2003; Jaillon et al., 2007; Tang et al., 2008). Furthermore, many species-rich angiosperm families exhibit strong evidence of WGD in their histories, including Brassicaceae, Fabaceae, Asteraceae, Solanaceae, Poaceae, and Orchidaceae (Blanc and Wolfe, 2004; Paterson et al., 2004; Bertioli et al., 2009; Tang et al., 2010; Jiao et al., 2011, 2014; Cai et al., 2015; Huang et al., 2016b); recent polyploidizations have been detected in Brassica rapa (Wang et al., 2011), Glycine max (Blanc and Wolfe, 2004; Schmutz et al., 2010), Solanum tuberosum (Blanc and Wolfe, 2004; Sato et al., 2012), Gossypium raimondii (Blanc and Wolfe, 2004; Wang et al., 2012a), Linum usitatissimum (Wang et al., 2012b), Malus domestica (Velasco et al., 2010), and Musa acuminata (D’Hont et al., 2012). It has been suggested that WGDs have played an important role in angiosperm radiation and environmental adaptation (Jiao et al., 2011). However, most of the previously reported WGDs are associated with large groups, in part because plant large-scale datasets tend to be from agricultural and economic crops, which mostly belong to large families. Much less is known about possible WGDs for members of smaller lineages. Therefore, it is important to include lineages with less diversity to further explore the possible effects of WGD on species richness.

To survey across the expanse of angiosperms, we generated transcriptomic datasets for 46 species representing 41 families lacking genome sequences. These datasets contained members of families in Rosids, Asterids, monocots, and three basal angiosperm lineages (Magnoliids, Ceratophyllales, and Chloranthales) with key positions in the angiosperm phylogeny (Zeng et al., 2014, 2017). In particular, among the 64 families with transcriptomic datasets, 29 have 500 or fewer species, with 19 having fewer than 100 species, allowing investigation of WGDs in relatively small groups. Phylogenomic detection of hundreds or more GDs at the same node in a species tree has been considered as strong support for a WGD event at the node of angiosperms by several studies (D’Hont et al., 2012; Jiao et al., 2014; McKain et al., 2016). In addition, analyses of rates of synonymous substitution values have provided evidence for WGDs in several plant species (Blanc and Wolfe, 2004). We performed phylogenomic analyses of gene families on 105 large-scale angiosperm datasets and detected WGD events on many lineages, indicating that WGDs have likely occurred widely during angiosperm evolution. These proposed WGDs were further dated by sequence divergence of duplicates with fossil calibrations and estimated divergence times, providing a basis for evaluating the rate of loss of gene duplicates from WGDs. Furthermore, we also investigated post-WGD duplicate retention in those lineages/organisms that exhibited multiple rounds of polyploidization, e.g., the likelihood of loss of ancient duplicates compared between plants with and without recent WGDs, providing clues for understanding the balance between loss of redundant genes and retention of duplicates for potential evolutionary novelties.

RESULTS

Genome/Transcriptome Sequencing of Angiosperm Species and Homolog Identification

Even with multiple detected angiosperm WGDs, many angiosperm orders and families have not yet been examined for lineage-specific WGDs. To examine whether more angiosperm groups also experienced WGDs, and to investigate gene contents of retained duplicates following WGDs and their correlation with species diversity, we performed phylogenomic analyses of gene families from 105 organisms, including 70 eudicots (Rosids, Asterids, Gunnerales, Santalales, Vitales, and others), 22 monocots, six Magnoliids, two Chloranthales, one Ceratophyllales, and four basal angiosperms. Among these, 36 species have genomes in the Phytozome database (Goodstein et al., 2012) and 23 species have transcriptomes in previous studies (Supplemental Figure 1 and Supplemental Table 1). In addition, transcriptome sequences of 46 species were obtained in this laboratory (Supplemental Tables 1 and 2, project PRJNA421868 at NCBI Sequence Read Archive) to cover major clades of angiosperms. The 69 transcriptome datasets include 12 573–288 265 assembled unigenes for individual species, with an average of 97 568 and median of 97 638. In addition, we compared these assembled unigenes with 969 and 357 universally conserved orthologs (UCOs) among angiosperms and eukaryotes (Yang et al., 2015b), respectively, as an indication of the transcriptome quality. Over 95% of the angiosperm UCOs could be detected in 57 of the 69 datasets, while the remaining 12 datasets contain over 70% of the angiosperm UCOs. For the eukaryote UCOs, 60 datasets cover 95% of them while the other nine have over 67%. A major goal of this study was to detect lineage-specific WGDs at the level of order, family, or genus; thus we performed sequence comparisons among species within each of the three large groups with multiple orders: Rosids, Asterids, and monocots. In addition, to detect WGDs associated with Magnoliids and two small orders, Chloranthales and Ceratophyllales, we analyzed these in a combined group (MCC). In total, 50 950, 49 157, 37 753, and 21 588 gene families (Supplemental Figure 2) with four or more genes were identified by homolog clustering for Rosids, Asterids, monocots, and the MCC group, respectively. The number of gain/loss events for these gene families was estimated using the Dollo parsimony method (see Methods and Supplemental Figure 3). Lastly, maximum-likelihood trees were reconstructed for each gene family.

Phylogenomic Comparisons Revealed Large-Scale Gene Duplications in Many Lineages

To detect WGD, either in the ancestor of groups with two or more taxa or on terminal lineages represented by a

Figure 1. Comparisons of Ks Distributions of Duplicated Genes on Different Evolutionary Stages Divided by Phylogenetic Information for Arabidopsis thaliana.
(A) Numbers of GDs at different nodes (nodes 1–5, filled circles) of the species tree are colored in red, blue, green, orange, and purple, respectively.
(B) The distribution of Ks for GDs at each of the nodes (nodes 1–5), in colors consistent with (A). The black dotted line denotes the summarized GD counts from all nodes.
(C) Distribution of Ks for gene pairs recognized by using all-against-all BLAST matches.
(D) Distribution of Ks for gene pairs recognized by using reciprocal best hits of all-against-all BLAST matches. Three bell curves represent significant components fitted to Ks frequency by using mixture model test of the EMMIX software (McLachlan et al., 1999).
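As a rough illustration of the kind of fit described for panel (D), the sketch below runs a plain expectation-maximization (EM) loop for a one-dimensional Gaussian mixture. It is not the EMMIX software and makes no claim to reproduce its model selection or significance testing. The function names, the deterministic starting values, and the synthetic Ks sample are all assumptions made for this example.

```python
import math
import random


def normal_pdf(x, mean, sd):
    """Density of a normal distribution at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def fit_gaussian_mixture(values, k, iterations=100):
    """Fit a k-component 1D Gaussian mixture by EM.

    Returns a list of (weight, mean, sd) tuples sorted by mean.
    """
    data = sorted(values)
    n = len(data)
    # Deterministic start: split the sorted data into k equal-sized chunks.
    chunks = [data[i * n // k:(i + 1) * n // k] for i in range(k)]
    means = [sum(c) / len(c) for c in chunks]
    sds = [max(1e-3, math.sqrt(sum((x - m) ** 2 for x in c) / len(c)))
           for c, m in zip(chunks, means)]
    weights = [1.0 / k] * k
    for _ in range(iterations):
        # E-step: responsibility of each component for each Ks value.
        resp = []
        for x in data:
            dens = [w * normal_pdf(x, m, s)
                    for w, m, s in zip(weights, means, sds)]
            total = sum(dens) or 1e-300
            resp.append([d / total for d in dens])
        # M-step: update weight, mean, and sd of each component.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            if nj < 1e-12:
                continue
            weights[j] = nj / n
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var = sum(r[j] * (x - means[j]) ** 2
                      for r, x in zip(resp, data)) / nj
            sds[j] = max(1e-3, math.sqrt(var))
    return sorted(zip(weights, means, sds), key=lambda c: c[1])


# Synthetic Ks values with three made-up peaks (illustration only, not data
# from the paper).
rng = random.Random(0)
sample = ([rng.gauss(0.15, 0.04) for _ in range(300)]
          + [rng.gauss(0.80, 0.12) for _ in range(500)]
          + [rng.gauss(1.90, 0.25) for _ in range(200)])
components = fit_gaussian_mixture(sample, k=3)
for weight, mean, sd in components:
    print(f"weight={weight:.2f} mean Ks={mean:.2f} sd={sd:.2f}")
```

Real Ks distributions are usually right-skewed and truncated at both ends, so a production analysis would likely need a transformation or a different component family; this sketch only shows the mechanics of fitting several bell curves to one distribution.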

single species, we used a phylogenomic approach of examining many gene trees by mapping GDs onto a reference species tree, as has been successfully used to detect WGDs shared by angiosperms, core eudicots, monocots, and other groups (Tang et al., 2010; Jiao et al., 2011, 2012). Here, gene family trees were compared with the species tree from recent studies of angiosperm phylogeny (Zeng et al., 2014, 2017). When hundreds or more GDs occur at the same node on the species tree, they provide strong support for a WGD. Large-scale GDs were detected on many nodes/lineages in Rosids, Asterids, monocots, and the MCC group, covering all 22 WGDs (Supplemental Table 3 and Supplemental Figure 1) previously detected using genome sequences.

Ks Distributions of Duplicated Genes as Further Evidence for Large-Scale Duplications

Synonymous substitution rate (Ks) between gene pairs, as an estimate of divergence age between paralogs, has been widely used as evidence for large-scale duplications when many paralogs exhibit Ks values within a close range (Blanc and Wolfe, 2004; Adams and Wendel, 2005; Velasco et al., 2010; Jiao et al., 2011) and especially when the organisms lack completely sequenced genomes, such as the analyses of Ks evidence for WGDs in A. thaliana, cotton, maize, soybean, potato, and tomato (Blanc and Wolfe, 2004), and consistent with other studies (Vision et al., 2000; Bowers et al., 2003; Schnable et al., 2009; Schmutz et al., 2010; Sato et al., 2012; Wang et al., 2012a). To provide further evidence for the WGD events detected by phylogenomic analyses of gene families described in the previous section (Supplemental Figure 1), we investigated Ks distributions of the duplicated genes identified from the phylogenetic analyses for each candidate WGD (panels B in Supplemental Figures 4–65). In these graphs, Ks values for GDs supporting the same WGD as detected by gene tree analyses were plotted. However, Ks values of other gene pairs not found at the same node in the gene tree were not included; for example, gene pairs that did not meet the minimal length requirement or were species specific were not used in the gene phylogenies and not included in graphs shown in panel B of these figures.

For example, 556 GDs were detected in A. thaliana after its divergence from Arabidopsis lyrata (Figure 1A). Consistently, the Ks peak value of these 556 pairs in A. thaliana was 0.13 (Figure 1B), supporting young ages of the duplication events. However, the number of GDs was too small to support WGD. In addition, progressively larger Ks peak values were estimated for the 204 GDs (0.29) corresponding to the last common ancestor (LCA) of A. thaliana and A. lyrata, for 335 GDs (0.38) at the LCA of A. thaliana, A. lyrata, and Capsella rubella, and for 2624 GDs (with Ks peak as 0.80) at the LCA of Brassicaceae (known as the “alpha” WGD) (Vision et al., 2000; Bowers et al., 2003), accounting for 76.4% of GDs in A. thaliana after divergence from Tarenaya hassleriana (formerly Cleome hassleriana; Cleomaceae). Finally, 550 GDs in A. thaliana after divergence from Carica papaya (Caricaceae) but before the divergence of Cleomaceae have a broad distribution of Ks values (peak value at 1.88), indicating an older duplication and providing a more precise placement of the “beta” WGD previously reported for the ancestor of Brassicaceae (Bowers et al., 2003; Blanc and Wolfe, 2004). It is worth noting that the summarized Ks distribution of all five GD groups above can be resolved into only three peaks, highly similar to those of gene pairs from BLAST searches (Figure 1C and 1D). Thus the strategy of grouping GDs according to their phylogenetic positions relative to taxon groups provided better resolution for the identification of possible WGDs, and was then applied to gene pairs from other plant genomes/transcriptomes and corresponding to putative
Figure 2. A Model for Exponential Decrease
over Time of Number of Duplication Events,
with Actually Detected GDs.
Each dot denotes GDs corresponding to a WGD
at a node on the species tree in Supplemental
Figure 1. The analyses here of public genome
and transcriptome datasets provided strong
support for WGDs (blue and purple dots,
respectively), consistent with previous reports.
Red dots are newly detected WGDs in this
study, and gray dots represent GDs for nodes
not strong enough to be considered as WGD
here. The blue solid curve represents proposed
theoretical decay rate of GDs after duplication,
estimated based on comparison of retained GD
numbers with corresponding median Ks values
from previously reported WGDs associated with
available genomes, while the purple curve is
based on reported WGDs with transcriptomic
data in this study. The one-fold SDs for the two
decay curves are colored by blue and yellow
backgrounds, respectively. Linear regression
analysis of Ks values and evolutionary time
(calibrated by using fossil records [Magallón
et al., 2015]) of previously reported WGDs is
illustrated in the top right panel.
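The decay model in Figure 2 can be sketched numerically. The block below assumes a simple exponential loss of duplicates with the half-life quoted in the text (about 0.34 in Ks, stated to correspond to about 21.6 MY). The strictly proportional Ks-to-time conversion and the back-extrapolation helper are simplifications introduced for this example, and all function names are hypothetical; the paper's own fitting procedure and regression coefficients are not reproduced here.

```python
# Values quoted in the text; everything else in this block is an assumption.
HALF_LIFE_KS = 0.34   # half-life of duplicates in Ks units (genome-based curve)
HALF_LIFE_MY = 21.6   # the same half-life expressed in million years


def retained_fraction(ks, half_life_ks=HALF_LIFE_KS):
    """Fraction of WGD-derived duplicates expected to survive at a given Ks."""
    return 0.5 ** (ks / half_life_ks)


def ks_to_my(ks):
    """Convert Ks to million years, assuming strict proportionality (a simplification)."""
    return ks * HALF_LIFE_MY / HALF_LIFE_KS


def implied_initial_duplicates(observed_gds, ks, half_life_ks=HALF_LIFE_KS):
    """Back-extrapolate how many duplicates the observed count would imply at Ks = 0."""
    return observed_gds / retained_fraction(ks, half_life_ks)


# GD counts and Ks peaks reported in the text for the A. thaliana lineage.
examples = [
    ("A. thaliana after divergence from A. lyrata", 556, 0.13),
    ("Brassicaceae 'alpha' WGD", 2624, 0.80),
    ("'beta' WGD", 550, 1.88),
]
for label, gds, ks in examples:
    print(f"{label}: Ks={ks}, about {ks_to_my(ks):.0f} MY, "
          f"implied initial duplicates about {implied_initial_duplicates(gds, ks):.0f}")
```

Under these assumptions, 550 GDs at Ks = 1.88 imply roughly 25 000 original duplicates, while 556 GDs at Ks = 0.13 imply only about 700. This is one way to see why similar raw counts can carry very different weight once age is taken into account.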

WGDs (panels B in Supplemental Figures 4–65), as further support for the detected WGD events.

Estimation of Loss Rate of Gene Duplicates from Known WGDs to Recognize Novel Ones

Considering that the numbers of GDs resulting from polyploidization decrease during evolution, with only a small fraction of GDs from ancient WGDs being still maintained today, integration of both numbers and ages of GDs is necessary for evaluating potential novel WGDs detected in each lineage. One duplicate in many pairs from WGD is often lost soon after duplication (Lynch and Conery, 2003; Zhang et al., 2012), with further losses in other gene pairs during evolution. A model of “half-life decay” was proposed for the rate of duplicate loss from animal WGDs, with half-life for duplicates in human of 7.5 million years (MY), 3.2 MY for Drosophila melanogaster, and 1.7 MY for Caenorhabditis elegans (Lynch and Conery, 2003), possibly related to their generation times. To estimate the possible half-life of plant GDs from WGDs, we examined the plant species/lineages with detected WGDs for possible relationships between Ks values (Supplemental Figures 4–65) and GD numbers and found a possible strong correlation between these two parameters (Figure 2), with those for species with sequenced genomes falling on both sides of a theoretical decay curve. For example, G. max, having the largest number of GDs (14 664), had the smallest Ks value of 0.13 among those with sequenced genomes, whereas Musa acuminata had only 3201 detected GDs but with a Ks peak as 0.45. Comparison of GD numbers with Ks values of the 22 known WGDs supported by genome sequencing revealed that the half-life of gene duplicates in the form of Ks was approximately 0.34 (Figure 2, blue curve), or 21.6 MY (Figure 2) as calibrated by using fossil records (Magallón et al., 2015). This is more than double those for human and mouse (around 8 MY), suggesting that duplicates from WGDs are retained more and for longer in angiosperms than in mammals. The deviation from the curve might be due to variations in gene loss and mutation rates, life span, or completeness of genome sequences.

Because transcriptome sequencing is usually incomplete, the numbers of GDs detected using transcriptome datasets are likely underestimates, with deviations from exponential decay depending on the size and quality of different transcriptome datasets. Consequently, comparison of the 15 known WGDs supported by transcriptomes revealed a much lower decay rate of GDs (half-life Ks = 0.61, the purple curve in Figure 2). Assuming that the loss of duplicates actually follows decay with the underestimated half-life due to the use of transcriptomic data, 18 novel large-scale GDs were detected to exhibit similar extents of deviation (red dots in Figure 2; Supplemental Table 3). Among these, two novel WGDs are older ones shared by two species: one for Boraginales, as 463 of 1010 GDs (45.8%) have two paralogs retained in both Bothriospermum chinense and Trigonotis peduncularis; the other for Dipsacales, as 92 of 253 GDs (36.4%) retain both paralogs in Dipsacus laciniatus and Lonicera japonica. It is worth noting that, although the number of detected GDs (212, with Ks = 0.91) on Lamiales is slightly lower than the criterion we set, 85.8% of them (182 of 212 GDs) retain both paralogs from Mimulus guttatus, Paulownia, and Mentha canadensis, strongly supporting WGD in Lamiales. Other possible WGDs were detected with greater deviation from the theoretical curve, possibly due to more
rapid loss of duplicates or limitations of transcriptome datasets. Nevertheless, WGD is still a more reasonable explanation for the hundreds of GDs detected than independent duplication of many genes.

Comprehensive Evaluation of Known and Novel WGDs Based on Loss Rate of Duplicates

In Rosids, 19 lineages/plants exhibited large numbers of GDs (Supplemental Table 3), supporting WGDs, including those in 12 lineages/plants (Figure 3, blue dots) consistent with previous reports (Vision et al., 2000; Bowers et al., 2003; Tuskan et al., 2006; Bertioli et al., 2009; Schmutz et al., 2010; Velasco et al., 2010; Jiao et al., 2011; Wang et al., 2011, 2012a, 2012b; Cheng et al., 2013; Myburg et al., 2014; Bredeson et al., 2016). In particular, the Faboideae (Fabaceae, or legumes) had an ancient WGD before the diversification of its members (Bertioli et al., 2009) and experienced further polyploidization in its descendants (Schmutz et al., 2010), e.g., G. max. Consistently, 2821 GDs (with Ks = 0.72, Supplemental Table 3) were detected in the LCA of Faboideae (Supplemental Figure 1), with an additional 14 664 GDs (Ks = 0.13) for the G. max lineage. In addition, we identified six Rosid lineages/plants with large-scale GDs supporting novel WGDs (Figure 3, red dots; Supplemental Table 3), e.g., Oxalis corniculata (1400 GDs with Ks = 0.72), Toona sinensis (2236 GDs, Ks = 0.073), Hypericum perforatum (2653 GDs, Ks = 0.13), Euonymus carnosus (1410 GDs, Ks = 0.12), Pelargonium hortorum (1282 GDs, Ks = 0.36), and Bryophyllum pinnatum (1250 GDs, Ks = 0.25).

In Asterids, gene family analyses found evidence for 16 WGDs (Figure 3, colored dots) (Jones and Reed, 2007; Shi et al., 2010; Sato et al., 2012; Iorizzo et al., 2016; Huang et al., 2016b). Among the Asterid families is the largest angiosperm one, Asteraceae, with over 24 000 species and accounting for nearly 10% of angiosperms (Stevens and Davis, 2005). We detected 4100 GDs (with Ks = 0.20) and 1093 GDs (Ks = 0.91) for two Asteraceae species, Flourensia thurifera and Lactuca sativa, respectively, and 614 GDs (Ks = 1.28) shared by both, consistent with previous findings of multiple WGDs in both the LCA of Asteraceae and its subclades (Barker et al., 2016; Huang et al., 2016b; Reyes-Chin-Wo et al., 2017). We also found 3532 GDs (Ks = 0.69) for the LCA of Solanum lycopersicum and S. tuberosum (Supplemental Table 3), in agreement with a reported polyploidization before its diversification (Sato et al., 2012). Several of the WGDs detected here are supported by results from different approaches (Figure 3, purple dots); the detection of 1444 GDs (Ks = 0.16) here in Hydrangea macrophylla (Cornales) after its divergence from Philadelphus incanus strongly supports the polyploidization in H. macrophylla proposed according to results from flow cytometry (Jones and Reed, 2007). Similarly supported WGDs in the lineages represented by Olea europaea and Actinidia arguta provide strong evidence for previously proposed polyploidization events (Shi et al., 2010; Edger et al., 2017). In addition, lineage-specific novel WGDs are supported by large numbers of GDs detected for Trigonotis peduncularis (1764 GDs, Ks = 0.13), Vinca major (2025 GDs, Ks = 0.10), Paulownia (2504 GDs, Ks = 0.12), Dipsacus laciniatus (2050 GDs, Ks = 0.17), and Hedera nepalensis (2709 GDs, Ks = 0.09) (Figure 3, red dots). Moreover, three older WGDs were supported for Boraginales, Lamiales, and Dipsacales, respectively (Figure 3, red dots), with GDs shared by two species in each of these orders.

There is strong evidence for WGDs detected in monocots (Figure 3), including those that were previously proposed for Poaceae (Tang et al., 2010; Jiao et al., 2014; Ming et al., 2015; McKain et al., 2016), Zingiberales (D’Hont et al., 2012), Alismatales (Yi et al., 2005), and other smaller lineages (Huang et al., 2003; Cui et al., 2006; Schnable et al., 2009). The position of a previously reported WGD on Asparagaceae (McKain et al., 2012; Cai et al., 2015) was reassigned by our results to Asparagales (represented by members of Asparagaceae, Iridaceae, and Orchidaceae). Moreover, the detection of large-scale gene duplicates supports novel WGDs in monocot lineages represented by Curcuma longa (3516 GDs, Ks = 0.11) and Pandanus utilis (1024 GDs, Ks = 0.18). We further provided evidence for a novel WGD in Illicium henryi (1003 GDs, Ks = 0.16) (Figure 3, red dot), and for previously reported WGDs in Chloranthales (228 GDs, Ks = 0.89) (Ehrendorfer et al., 1968) and Cabomba caroliniana (1509 GDs, Ks = 0.16) (Ørgaard, 1991) based on chromosome number comparisons.

In short, our analyses of 105 large datasets not only obtained phylogenomic support for nearly all previously reported WGDs based on genomic sequences and many of the WGDs proposed from cDNA sequences, but also identified novel WGD events (Figure 3, red dots) in the three largest clades of angiosperms, Rosids (6 WGDs), Asterids (9 WGDs), and monocots (3 WGDs) (see also Supplemental Figure 1). However, our results did not support a reported WGD in Zostera marina (Franssen et al., 2014), whose transcriptomic dataset yielded relatively few assembled genes (Supplemental Table 2).

Potential Functions of Retained Duplicates from WGDs

Genes encoding transcription factors and signaling pathway components are more likely retained than is average among GDs from WGD according to analyses of a few species (Aury et al., 2006; McGrath et al., 2014). To obtain clues about other possible biological functions of GDs using the large number of datasets here, we examined the gene ontology (GO) categories of the GDs for each lineage with large-scale GD events and found that they were enriched in GO categories of both external and internal biological processes (Figure 4). Among GO categories for external processes are cell wall metabolism, transporters, and secondary metabolism. For example, the EXPA family of expansins promotes cell wall loosening (Sampedro and Cosgrove, 2005), extending the previous findings of the expansion of the EXPA family in A. thaliana (Sampedro et al., 2005) to more species. GO categories for enriched internal functions included amino acid and protein metabolism, such as protein targeting, synthesis, and post-translational modification. Similar findings were reported from analyses of Brassicaceae and Poaceae genes encoding SET domain proteins important for histone modifications (Zhang and Ma, 2012). The greater rates of retention of gene duplicates for these regulatory functions suggest that WGDs provide functional diversities or novelties to enhance adaptation of plants to changing environments, including both abiotic and biotic factors.
Figure 3. Illustration of Phylogenetic Relationships of 105 Angiosperms and Proposed WGDs.
Previously reported WGDs by using genome and transcriptome information are denoted by blue dots and purple dots, respectively. Newly detected WGDs are represented by red dots. Numbers of detected GDs associated with previously reported and newly detected WGDs are available in Supplemental Figure 1. Branch lengths on the tree are not to scale.
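Placing each gene duplication on a node of the species tree, as summarized in this figure, can be illustrated with a small last-common-ancestor lookup. The sketch below is a toy version under stated assumptions: the four-species tree, the node labels, and the function names are hypothetical, and the rule used (a duplication is assigned to the LCA of all species found below the duplicated gene-tree node) is a common convention, not necessarily the exact procedure of the study.

```python
from collections import Counter

# Toy species tree stored as child -> parent links (hypothetical labels).
PARENT = {
    "sp1": "nodeA", "sp2": "nodeA",
    "nodeA": "nodeB", "sp3": "nodeB",
    "nodeB": "root", "sp4": "root",
}


def path_to_root(node):
    """Return the list of nodes from `node` up to the root, inclusive."""
    path = [node]
    while node in PARENT:
        node = PARENT[node]
        path.append(node)
    return path


def last_common_ancestor(species):
    """Deepest species-tree node that is an ancestor of (or equal to) every species given."""
    species = list(species)
    common = set(path_to_root(species[0]))
    for name in species[1:]:
        common &= set(path_to_root(name))
    for node in path_to_root(species[0]):
        if node in common:
            return node
    raise ValueError("species do not share a root")


def place_duplication(left_species, right_species):
    """Assign a gene-tree duplication to the LCA of all species in its two child clades."""
    return last_common_ancestor(set(left_species) | set(right_species))


def count_gds_per_node(duplications):
    """Tally how many duplications map to each species-tree node."""
    return Counter(place_duplication(left, right) for left, right in duplications)


# Each tuple lists the species seen in the two child clades of one duplication node.
toy_duplications = [
    ({"sp1"}, {"sp1"}),                 # lineage-specific duplication in sp1
    ({"sp1", "sp2"}, {"sp1"}),          # shared by sp1 and sp2
    ({"sp1", "sp2"}, {"sp2"}),          # shared by sp1 and sp2
    ({"sp1", "sp3"}, {"sp2", "sp3"}),   # predates the split of sp3
]
print(count_gds_per_node(toy_duplications))
```

In a real analysis the per-node tallies would then be read together with the Ks distribution of the same duplicates, since a raw count alone does not separate a young small-scale burst from an old WGD with heavy loss.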


Figure 4. Enrichment Analysis for Duplicated Genes on Different GO Categories for Each WGD Shown in Figure 3.
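The statistical test behind this enrichment analysis is not stated in this part of the text. A one-sided hypergeometric test is a common choice for GO enrichment, and the sketch below shows that calculation only as an assumed stand-in. The function name and the numbers in the example are hypothetical.

```python
from math import comb


def hypergeom_upper_tail(population, annotated, sample, hits):
    """P(X >= hits) for X ~ Hypergeometric(population, annotated, sample).

    population: number of genes in the background set
    annotated:  background genes carrying the GO term
    sample:     number of retained duplicates examined
    hits:       retained duplicates carrying the GO term
    """
    total = comb(population, sample)
    upper = min(annotated, sample)
    favourable = sum(
        comb(annotated, k) * comb(population - annotated, sample - k)
        for k in range(hits, upper + 1)
    )
    return favourable / total


# Hypothetical example: 20 000 background genes, 400 with the term,
# 1 000 retained duplicates of which 40 carry the term (20 expected by chance).
p_value = hypergeom_upper_tail(20000, 400, 1000, 40)
print(f"enrichment p-value: {p_value:.3g}")
```

With many GO categories tested per lineage, a multiple-testing correction would normally be applied to these p-values; that step is omitted here.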

Retention of Duplicated Genes in Species with Both Ancestral and Recent WGDs

Angiosperm groups have experienced multiple rounds of WGDs, resulting in hundreds to thousands of retained GDs, with possible differential retention or loss of GDs among different lineages. It is possible that later duplication relieves the selection pressure on gene copies and allows a previously maintained copy to be lost. As a first test for possible effects of recent polyploidization on preservation of ancient gene duplicates, we compared the GDs between G. raimondii and T. cacao, as their common ancestor experienced at least one ancestral WGD (the “gamma” WGD), while G. raimondii underwent a more recent WGD (Wang et al., 2012a) after its divergence from T. cacao. Among the 1067 detected ancestral GDs (Figure 3), 36 of them (type I GD in Figure 5A) had homologs in G. raimondii for both duplicates while T. cacao lost one copy after the divergence of the two lineages. On the contrary, 91 GDs (type II GD in Figure 5A) experienced gene loss in G. raimondii rather than in T. cacao, more than twice the number of type I GDs. These results suggest that ancient GDs are more likely lost in species with recent polyploidization. We thus compared 18 pairs of organisms/lineages (Supplemental Table 4) with sister relationship on the species tree, in which one lineage experienced a recent polyploidization while the other did not. In eight of the 18 pairs, the number of type I GDs was significantly smaller than that of type II GDs (two-fold changes), e.g., Brassicaceae/Sapindales, Cucurbitales/Rosales, and G. raimondii/T. cacao, while there were only four pairs of lineages having an opposite situation, suggesting that ancestral GDs have a higher possibility of loss after the occurrence of extra recent WGDs. It is possible that increased copies of one ancient paralog are able to replace another ancient paralog, possibly because the number of paralogs is important whereas their specific sequences are less important.

The ancient GDs that show increased copies due to recent WGDs (type I in Figure 5A) have three possible patterns of retention (Figure 5B): each of two ancient paralogs retained their recent duplicates, thus leaving four paralogs; one of the ancient paralogs lost a recent duplicate; or each of the ancient clades retained one copy. By adopting the ages of these recent WGDs, associated with estimated exponential decay rate of GDs, we obtained expected GD numbers for each of the three types for each recent WGD event to compare with the corresponding observed GD numbers (Supplemental Table 5). On average, the detected GDs for type I-I, I-II, and I-III were close to the expected GD numbers (Figure 5C), suggesting that the loss of gene duplicates follows a random process, especially for “younger” recent WGDs (lighter dots in Figure 5C). However, the “older” recent WGDs (darker dots in Figure 5C) showed more detected GDs for type I-I and fewer for type I-III than expected. As “older” GDs are more likely to have undergone functional divergence, e.g., neo-/subfunctionalization, there is probably greater selection pressure for the retention of these duplicates than that for the “younger” ones, which are probably still (partially) functionally redundant and under less selection pressure.

DISCUSSION

Occurrence of WGDs Is Correlated with Global Climate Changes

Global climate in the Cenozoic Era experienced several sharp fluctuations in a relatively short period, including
Figure 5. Comparison of GD Retention between Lineages with Multiple Rounds of WGDs with Their Sister Clades without Recent WGDs.
(A) Possible patterns of loss of gene duplicates from an ancient WGD, shared by Gossypium raimondii (labeled by “G”) and Theobroma cacao (labeled by “T”). G. raimondii underwent a further WGD after the divergence from T. cacao, which lacks a recent WGD. The number under each tree denotes the count of the corresponding type of GD retention.
(B) Illustration of retention of recent gene duplicates when both ancient duplicates are preserved.
(C) Comparison of observed numbers of gene pairs and expected counts for the three patterns in (B). Each dot represents a node on the species tree with multiple rounds of studied or proposed WGDs, and is colored by the level of its median Ks value.

Paleocene–Eocene Thermal Maximum (PETM), Early Eocene Climatic Optimum (EECO), Eocene–Oligocene transition, Mid-Eocene Climatic Optimum (MECO), and Mid-Miocene Climatic Optimum (MMCO) (Zachos et al., 2008). They are often associated with mass extinction events, e.g., the Grande Coupure, which occurred in Europe (Hooker et al., 2004) close to the Eocene–Oligocene transition, with abrupt cooling toward an icehouse climate (Zachos et al., 2001).

It was suggested that polyploids might be more successful under environmental stresses due to the opportunities created by changing environments to realize the evolutionary benefits of WGDs (Vanneste et al., 2014). Interestingly, the two WGDs (Figure 6) experienced by M. acuminata (44.6 MY) and the LCA of Solanum (41.9 MY) were very close to the time of MECO (41.5 MY), suggesting greater chances than random for becoming polyploids at such times. Furthermore, the two WGDs in G. raimondii (35.6 MY) and Manihot esculenta (37.5 MY) were close to the Eocene–Oligocene transition (33.5 MY), while the four WGDs in Populus trichocarpa (18.6 MY), M. domestica (15.9 MY), F. thurifera (13.7 MY), and Actinidia arguta (16.1 MY) were close to MMCO (16–18 MY). We then performed a simulated analysis to measure the co-occurrence of detected WGDs with random time points representing climate changes. The time intervals of marked climate changes (Zachos et al., 2008) from the most adjacent WGD were significantly shorter than those based on simulated time points (p value <0.05), supporting a correlation between major geological events and polyploidy. However, many observed WGDs were dated far from PETM, EECO, and other well-known rapid climate transitions, suggesting that evolutionary advantages of polyploids could be related to other environmental factors. Future studies are needed to gain further understanding of the possible relationship between environmental changes and polyploidy.

The idea that additional copies of many genes due to WGDs could provide evolutionary advantages also suggests that WGDs are more likely associated with ancestors of large diverse groups. Previously identified WGDs were indeed often associated with large families, such as those for Brassicaceae (Bowers et al., 2003) and Poaceae (Paterson et al., 2004), but this could be because only members of large families were analyzed due to the availability of sequenced genomes or transcriptomes. Here, we have expanded the number of species with large datasets, including members of small groups. Significantly larger species numbers are found in a number of lineages with early WGD than in their sister lineages (Figure 7A and 7B), e.g., Brassicaceae (3709 species versus 150 species in Cleomaceae), Malvaceae (4225 species versus 2070/705/1630 species in Rutaceae/Meliaceae/Sapindaceae), Fabaceae (19 560 species versus 2625/1125/170/2520 species in Urticaceae/Moraceae/Cannabaceae/Rosaceae), Asteraceae (25 040 species versus 3780/1450 species in Apiaceae/Araliaceae), and Poaceae (11 337 species versus 2585 species in Arecaceae). We further detected species radiation for three large clades (Figure 7C): Asparagales, core Lamiales, and core monocots. In addition, we examined the orders or families included here with or without detected WGDs, and found that indeed large orders had a greater percentage with WGDs than smaller orders (p value <0.05, Figure 7D and 7E), suggesting that WGDs could have contributed to the species richness of large groups. At the same time, the presence of WGD in some small groups suggests that WGDs did not always associate with large groups, possibly due to environmental factors that did not promote diversification. Alternatively, the smaller groups might have had less time to expand, as supported by the relatively young age (small Ks values) of the WGDs in smaller groups (Figure 7F–7I).

Figure 6. Comparison of Time Spans with Global Climate Changes with Predicted Ages of WGDs Calibrated by Using Fossil Data and
Computational Estimation.
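
As an illustration of the co-occurrence test described in the Discussion, the following minimal sketch compares the mean gap between climate events and their nearest WGD with the same statistic for random time points. It is not the authors' script: the uniform sampling window (0–66 MY), the number of simulations, the random seed, the MMCO midpoint of 17 MY, and the function names mean_nearest_gap and cooccurrence_p_value are assumptions made for illustration. Only the WGD ages quoted in the text are used, so the resulting p value is illustrative only.

```python
import random

# WGD ages (MY) quoted in the Discussion; a subset of all detected WGDs.
wgd_ages = [44.6, 41.9, 35.6, 37.5, 18.6, 15.9, 13.7, 16.1]

# Climate events (MY) quoted in the Discussion.
# Assumption: MMCO is represented by the midpoint of its 16-18 MY range.
climate_events = {"MECO": 41.5, "Eocene-Oligocene": 33.5, "MMCO": 17.0}


def mean_nearest_gap(time_points, wgds):
    """Mean distance (MY) from each time point to its nearest WGD."""
    return sum(min(abs(t - w) for w in wgds) for t in time_points) / len(time_points)


def cooccurrence_p_value(events, wgds, window=(0.0, 66.0), n_sim=10000, seed=1):
    """One-sided Monte Carlo p value for climate events lying close to WGDs.

    Assumptions: random time points are uniform over `window`, and the
    test statistic is the mean gap to the nearest WGD.
    """
    rng = random.Random(seed)
    observed = mean_nearest_gap(events, wgds)
    hits = 0
    for _ in range(n_sim):
        simulated = [rng.uniform(*window) for _ in events]
        if mean_nearest_gap(simulated, wgds) <= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return observed, (hits + 1) / (n_sim + 1)


observed_gap, p_value = cooccurrence_p_value(list(climate_events.values()), wgd_ages)
print(f"observed mean gap = {observed_gap:.2f} MY, simulated p = {p_value:.4f}")
```

A smaller mean gap than in most simulations would indicate that WGDs cluster near climate transitions more than expected by chance under these assumptions.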

Scale of Reciprocal Loss of Gene Duplicates for Increasing Genomic Diversity of Plants
Reciprocal loss (RGL) of gene duplicates in different populations has been proposed to facilitate reproductive isolation due to copy number variations and functional diversification (Werth and Windham, 1991; Lynch and Conery, 2000). However, dozens to hundreds of RGL of GDs are required to provide strong reproductive barriers between populations (McGrath and Lynch, 2012), although RGL of even one duplicate pair might result in reproductive isolation (Mizuta et al., 2010). Here we examined 3168 GDs from alpha WGD on the LCA of Brassicaceae, and detected 7–56 RGLs (Supplemental Figure 66) among pairs of six species with whole genome information. For example, seven of 3168 GDs have been reciprocally lost in the genomes of A. thaliana and A. lyrata during the 12 MY after their divergence (Huang et al., 2016a), suggesting that these reciprocally lost

Figure 7. Comparison of Speciation Rate for Different Families of Angiosperms Involved in Figure 3.
(A) Illustration of speciation rates for different lineages, on which WGDs are labeled in colors as shown in Figure 3. Changes of speciation rate for each
lineage were obtained by using BAMM and are represented by gradient colors.
(B) Comparison between species richness of groups with WGDs and those of corresponding sister groups.
(C) Display of speciation rate variations among three clades along with evolutionary time.
(D) Percentage of orders with or without WGDs and having fewer than 2000 species, 2000–5000 species, and more than 5000 species, respectively.

(legend continued below)

genes might not be the major cause of the isolation of the two species. Furthermore, 14 RGL genes were detected between A. thaliana and C. rubella, whose estimated divergence time was 18 MY (Huang et al., 2016a), and 36 RGLs between A. thaliana and even earlier diverged Eutrema salsugineum (34 MY) (Huang et al., 2016a). Unlike the small numbers of RGLs, over 90% of GDs from the alpha WGD are preserved in all six Brassicaceae species analyzed here. A recent study (Postlethwait et al., 2004) suggests that functional divergence of gene duplicates through subfunctionalization also provides reproductive incompatibility when organisms from different populations are crossed. It is possible that subfunctionalization is more important than RGLs in the contribution to species diversity, although further studies are still needed.

In this study, we analyzed newly generated transcriptome datasets for 46 angiosperms in combination with 59 public angiosperm genomes/transcriptomes, covering most major clades of eudicots, monocots, and basal angiosperms. By sequence comparison, tens of thousands of homologs were identified for each species, yielding a total of 21 588–50 950 gene families for Rosids, Asterids, and other major angiosperm lineages. Detailed examination of GDs on the corresponding single gene family trees detected many novel WGD events, either ancient or relatively recent, expanding the number of angiosperm lineages with WGDs during their evolutionary histories. These new datasets and newly identified gene duplicates provide useful resources for further genomic and genetic studies.

METHODS

Plant Sampling, RNA Extraction, and High-Throughput Sequencing
Sampling information of 105 species is shown in Supplemental Table 1. Total RNA was extracted using plant MiniPrep Kits (ZYMO Research, USA) and purified using RNeasy Plant Mini Kits (Qiagen, CA). RNA quality was examined on a Bioanalyzer 2100 (Agilent, CA) with RNA integrity number values greater than 8 for all samples. The total mRNA was sequenced using an Illumina HiSeq2000 instrument and subjected to 100 cycles of paired-end (2 × 100 bp) sequencing (available at NCBI Sequence Read Archive under project ID PRJNA421868 with sample accession numbers from SAMN08159217 to SAMN08159263). Unigenes were assembled by using Trinity v2.3.2 (Grabherr et al., 2011) and TGICL v2.1 (Pertea et al., 2003). Proteomes of 36 organisms with whole genome information were downloaded from the Phytozome database version 8.0 (Goodstein et al., 2012), and transcriptomes of 23 species were taken from previous studies (Supplemental Table 1).

Ortholog Identification and Homolog Clustering
We identified homologs for each protein from the 105 species in this study by an all-against-all BlastP search with e-value cutoff at 10⁻⁵. Global protein identities of each BLAST match were calculated by using Inparanoid (O’Brien et al., 2005), to filter out matches exhibiting poor global identities (lower than 50%) or gene coverage (<50%). Redundant alternatively spliced transcripts from the same organism were removed using IsoSVM v07.2005 (Spitzer et al., 2006). Homologous groups were predicted using OrthoMCL v1.4 (Li et al., 2003) with the inflation value of 2.0.

Estimation of Gain and Loss of Gene Families
Homolog groups from the above procedure were adopted for estimating gain and loss of gene duplicates at each branch on the species tree of the 105 angiosperms. First, the species tree was reconstructed based on current understanding of angiosperm phylogeny (Zeng et al., 2014, 2017). Second, presence/absence of species was labeled by 1/0 depending on appearance of at least one gene for each homolog group. Third, a potential gain/loss event for each homolog group at each branch on the species tree was determined by applying the Dollo parsimony method in the PHYLIP package (Felsenstein, 1989). Last, total gains and losses across different gene families were summed using custom scripts (available upon request).

Reconstruction of Phylogenetic Trees for Each Gene Family
Multiple alignments were performed for proteins with alignable lengths of 100 amino acids or longer for each homolog cluster having sequences from multiple species by using MUSCLE v3.8.31 (Edgar, 2004) with default parameters, trimmed by using trimAl v1.4 (Capella-Gutiérrez et al., 2009) with options “-gt 0.1 -resoverlap 0.75 -seqoverlap 80”, then transferred to a nucleic acid alignment matrix by PAL2NAL (Suyama et al., 2006). Maximum-likelihood trees were reconstructed by applying RAxML (Stamatakis, 2006) to the nucleic acid sequences with evolutionary model of “GTRCAT,” and a bootstrap significance test was performed with 100 replicates.

Detection of Gene Duplication Events
Gene duplication on different lineages was detected by comparing gene family trees with a reference species tree (Causier et al., 2005; Madlung et al., 2005), as this procedure was applied in several recent studies (Jiao et al., 2012; Cannon et al., 2015; Li et al., 2015; Yang et al., 2015a). In brief (Supplemental Figure 67), for each gene family tree the LCA was assigned for each gene clade, determined by taxon groups of the species carrying the genes in the clade. The nodes on the gene family tree with bootstrap support smaller than 50% were not considered in subsequent analyses. We then examined the species corresponding to all genes in a clade before and two sister clades after a putative duplication. When a gene duplication node involves two genes from a single species, a GD was counted for the lineage represented by the species. When at least one of two clades (paralogs) of a duplication node included two sister lineages (each represented by a single species) in the species tree, a GD on the LCA of the two lineages was counted. When the node with a candidate WGD included three or more species, a GD was counted when two requirements were met: (1) the two paralogous clades shared two or more species; (2) the difference in the depths of the two paralogous clades was 1 or zero, where the depth of a paralogous clade was defined as the number of branches in the species tree from the LCA of the gene clade to the root of the species tree. Numbers of duplications were summarized on the species tree by iterating all single gene family trees. Synonymous substitution rate (Ks) was evaluated for each pair of duplicates in each plant genome by KaKs_calculator (Zhang et al., 2006) using the MA method (model averaging on a set of candidate models).

Evaluation of WGD Candidates Based on Number of Paralogous Pairs, Ks Distribution, and Loss Rate of Gene Duplicates
One paralog in many paralogous pairs from a WGD is often lost during evolution, leading to a small probability of detecting GDs from ancient WGDs. To evaluate each candidate WGD having hundreds or thousands of GDs observed in the above analysis, we estimate the rate of GD loss (λ)

(E) Percentage of families with or without WGDs and consisting of fewer than 300 species, 300–2500 species, and more than 2500 species, respectively.
p Values were calculated using Fisher’s exact test.
(F–I) Comparison of GD numbers with Ks for orders and families are shown in (F) and (G), while their detailed species numbers are given in (H) and (I),
respectively. Each dot represents an order/family and is colored red (size >5000), blue (2000–5000), or green (<2000), with size proportional to GD or
species numbers (log-transformed).
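
For readers unfamiliar with the test named in (E), a two-sided Fisher's exact test on a 2 × 2 table can be sketched as follows. This is a generic standard-library illustration, not the authors' code: the counts are hypothetical placeholders rather than values from Figure 7, and the function name fisher_exact_two_sided is an assumption.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of all tables with the same
    margins that are no more likely than the observed table.
    """
    row1, row2, col1, n = a + b, c + d, a + c, a + b + c + d

    def prob(x):
        # Probability of x in the top-left cell given fixed margins.
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    p_obs = prob(a)
    lowest, highest = max(0, col1 - row2), min(row1, col1)
    # Small relative tolerance guards against floating-point ties.
    return sum(prob(x) for x in range(lowest, highest + 1)
               if prob(x) <= p_obs * (1 + 1e-9))


# Hypothetical counts: rows are large vs. small groups,
# columns are groups with vs. without a detected WGD.
p = fisher_exact_two_sided(8, 2, 3, 12)
print(f"two-sided p = {p:.4f}")
```

With real data, each row would hold the number of orders or families in one size class, split by whether a WGD was detected.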

based on known WGDs by applying linear regression on detected GD numbers (N) and peak values of Ks (log-transformed), based on a simple survivorship assumption (individual gene pairs have an equal rate of losing one duplicate, thus their life spans are completely random). Considering change of GD numbers with respect to change of Ks as dN = −λN dKs, and the initial condition as N = N0 when Ks = 0, we have

$$N = N_0 e^{-\lambda \mathit{Ks}},$$

where N0 represents total gene number in an ancient genome/clade before a WGD and is estimated by appearance frequency of the genome/clade in single gene family trees. An accurate estimation of N and N0 requires consideration of the effects of gene gain and loss especially for single-copy genes, incomplete gene sets of transcriptomes, and other possibilities, and thus calls for further studies. Here, we applied the following equation for estimating the rate of GD loss on transcriptomic data by choosing an initial condition as N = PN0 when Ks = 0, where P stands for the probability of the discovery of both paralogs of a gene pair from transcriptome sampling:

$$N = N_0 e^{m - \lambda \mathit{Ks}},$$

where m = ln P. Both m and λ are estimated based on known WGDs supported by transcriptome data. Subsequently, GDs on each transcriptome/clade are classified into “newly identified” WGDs when the following criteria are met: (1) the observed N/N0 falls within one SD of the theoretical survivorship curve; (2) lineage-specific WGDs should have GD numbers larger than 1000; (3) for older WGDs shared by two or more species/lineages, considering the incompleteness of transcriptome datasets, 30% or more of duplication events are required to retain both paralogs for each lineage/species. Finally, as the Ks distribution of small-scale GDs should be L-shaped and is mostly associated with very small peak values, those candidate WGDs with Ks peak values less than 0.02 were not retained, as a cautionary measure.

Dating WGDs Using Ks and Divergence Time of Associated Lineages Calibrated by Fossil Data
The ages of WGDs detected in this study were estimated under the assumption that synonymous mutations are accumulated at a constant rate. Specifically, if the position of each WGD is surrounded by two adjacent species divergence events in the species tree, the time of species divergence prior to the WGD was considered as the upper limit of age of WGD (denoted by Tprior) while the divergence time after the WGD was used as its lower limit (denoted by Tpost). The time of emergence of each lineage on the species tree was collected from a recent study based on fossil data and computational estimation on chloroplast genes (Magallón et al., 2015) for obtaining both Tprior and Tpost for each WGD. The time of a given WGD, Twgd, was determined by using the function

$$T_{wgd} = T_{post} + \frac{\mathit{Ks}_{wgd} - \mathit{Ks}_{post}}{\mathit{Ks}_{prior} - \mathit{Ks}_{post}} \times \left(T_{prior} - T_{post}\right),$$

where Ksprior and Kspost represent the peak Ks values of orthologs between the two divergent lineages before and after the WGD event, respectively, and are obtained by comparison of reciprocal best matched genes from all-against-all BLAST between paired species, while Kswgd represents the average Ks value of all gene duplicates from the WGD of interest. For a lineage-specific WGD, a simplified formula is adopted to estimate Twgd by setting both Tpost and Kspost to zero as no further species divergence event exists in the species tree. For more accurate age estimates of WGD, lineages with divergence times close to the WGD should be included for calibration.

Gene Ontology Annotation on Homolog Clusters
Functional categories (GO) of genes from the species with whole genome information were obtained from MapMan database (Usadel et al., 2009). We assume that all genes within the same cluster have similar functions. For each homolog cluster, the GO term found in the majority of genes was used to represent the functional annotation of the cluster. GO enrichment analysis of WGDs involved in this study was performed by comparison of counts of GDs with those of all homolog clusters from each GO category using Fisher’s exact test, and the heatmap was plotted in R environment.

Estimation of Speciation Rate
Diversification rates were estimated by BAMM version 2.1.0 (Rabosky et al., 2014). The phylogeny and divergence times of 84 families involved in this study were reconstructed based on recent studies of angiosperm phylogeny and related molecular clock estimates (Zeng et al., 2014, 2017). The species richness data were obtained from the Angiosperm Phylogeny Group web site V12 (http://www.mobot.org). Two chains running simultaneously for a total of 10 million generations were performed during the analysis, and tree space was sampled every 5000 generations. Assessment of MCMC convergence and visualization of BAMM output were performed using BAMM. To display the speciation rate variations through time for some large clades, we also used BAMM to draw the density plots.

SUPPLEMENTAL INFORMATION
Supplemental Information is available at Molecular Plant Online.

FUNDING
This work was supported by grants from the National Natural Science Foundation of China (grant number 31570224 to J.Q. and 91531301, 31670209 to H.M.) and funds from the State Key Laboratory of Genetic Engineering at Fudan University.

AUTHOR CONTRIBUTIONS
J.Q. and H.M. designed the study and managed the project. L.Z. and N.Z. prepared RNA samples and performed sequencing. H.W. performed raw data analysis, transcriptome assembly, gene prediction, and annotation. R.R. and C.G. performed phylogenetic analyses. J.Q. and R.R. performed comparative genomic and transcriptomic analyses. Y.C. performed data collection and revised the manuscript. J.Q. and H.M. wrote the manuscript. All authors read and approved the final manuscript.

ACKNOWLEDGMENTS
We thank Professors Ming Li (Institute of Theoretical Physics) and Hongxing Yang (Shanghai Chenshan Plant Science Research Center) for discussion, and the Computing Center of Beijing Institutes of Life Science for assistance in computation. No conflict of interest declared.

Received: November 8, 2017
Revised: December 13, 2017
Accepted: January 2, 2018
Published: January 6, 2018

REFERENCES
Adams, K.L., and Wendel, J.F. (2005). Polyploidy and genome evolution in plants. Curr. Opin. Plant Biol. 8:135–141.
Aury, J.M., Jaillon, O., Duret, L., Noel, B., Jubin, C., Porcel, B.M., Ségurens, B., Daubin, V., Anthouard, V., Aiach, N., et al. (2006). Global trends of whole-genome duplications revealed by the ciliate Paramecium tetraurelia. Nature 444:171–178.
Barker, M.S., Li, Z., Kidder, T.I., Reardon, C.R., Lai, Z., Oliveira, L.O., Scascitelli, M., and Rieseberg, L.H. (2016). Most Compositae (Asteraceae) are descendants of a paleohexaploid and all share a paleotetraploid ancestor with the Calyceraceae. Am. J. Bot. 103:1203–1211.
Bertioli, D.J., Moretzsohn, M.C., Madsen, L.H., Sandal, N., Lealbertioli, S.C., Guimarães, P.M., Hougaard, B.K., Fredslund,

J., Schauser, L., and Nielsen, A.M. (2009). An analysis of synteny of Arachis with Lotus and Medicago sheds new light on the structure, stability and evolution of legume genomes. BMC Genomics 10:10–45.
Blanc, G., Barakat, A., Guyot, R., Cooke, R., and Delseny, M. (2000). Extensive duplication and reshuffling in the Arabidopsis genome. Plant Cell 12:1093–1101.
Blanc, G., and Wolfe, K.H. (2004). Widespread paleopolyploidy in model plant species inferred from age distributions of duplicate genes. Plant Cell 16:1667–1678.
Bowers, J.E., Chapman, B.A., Rong, J., and Paterson, A.H. (2003). Unravelling angiosperm genome evolution by phylogenetic analysis of chromosomal duplication events. Nature 422:433–438.
Bredeson, J.V., Lyons, J.B., Prochnik, S.E., Wu, G.A., Ha, C.M., Edsinger-Gonzales, E., Grimwood, J., Schmutz, J., Rabbi, I.Y., Egesi, C., et al. (2016). Sequencing wild and cultivated cassava and related species reveals extensive interspecific hybridization and genetic diversity. Nat. Biotechnol. 34:562–570.
Cai, J., Liu, X., Vanneste, K., Proost, S., Tsai, W., Liu, K., Chen, L., He, Y., Xu, Q., and Bian, C. (2015). The genome sequence of the orchid Phalaenopsis equestris. Nat. Genet. 47:65–72.
Cannon, S.B., McKain, M.R., Harkess, A., Nelson, M.N., Dash, S., Deyholos, M.K., Peng, Y., Joyce, B., Stewart, C.N., Jr., Rolf, M., et al. (2015). Multiple polyploidy events in the early radiation of nodulating and nonnodulating legumes. Mol. Biol. Evol. 32:193–210.
Capella-Gutiérrez, S., Silla-Martínez, J.M., and Gabaldón, T. (2009). trimAl: a tool for automated alignment trimming in large-scale phylogenetic analyses. Bioinformatics 25:1972–1973.
Causier, B., Castillo, R., Zhou, J., Ingram, R., Xue, Y., Schwarz-Sommer, Z., and Davies, B. (2005). Evolution in action: following function in duplicated floral homeotic genes. Curr. Biol. 15:1508–1512.
Cheng, S., van den Bergh, E., Zeng, P., Zhong, X., Xu, J., Liu, X., Hofberger, J., de Bruijn, S., Bhide, A.S., Kuelahoglu, C., et al. (2013). The Tarenaya hassleriana genome provides insight into reproductive trait and genome evolution of crucifers. Plant Cell 25:2813–2830.
Christenhusz, M.J.M., and Byng, J.W. (2016). The number of known plants species in the world and its annual increase. Phytotaxa 261:201–217.
Cui, L., Wall, P.K., Leebens-Mack, J.H., Lindsay, B.G., Soltis, D.E., Doyle, J.J., Soltis, P.S., Carlson, J.E., Arumuganathan, K., Barakat, A., et al. (2006). Widespread genome duplications throughout the history of flowering plants. Genome Res. 16:738–749.
Doyle, J.J., Flagel, L.E., Paterson, A.H., Rapp, R.A., Soltis, D.E., Soltis, P.S., and Wendel, J.F. (2008). Evolutionary genetics of genome merger and doubling in plants. Annu. Rev. Genet. 42:443–461.
Edgar, R.C. (2004). MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. 32:1792–1797.
Edger, P.P., Smith, R.D., McKain, M.R., Cooley, A.M., Vallejo-Marin, M., Yuan, Y., Bewick, A.J., Ji, L., Platts, A.E., Bowman, M.J., et al. (2017). Subgenome dominance in an interspecific hybrid, synthetic allopolyploid, and a 140 year old naturally established neo-allopolyploid monkeyflower. bioRxiv https://doi.org/10.1101/094797.
Ehrendorfer, F., Krendl, F., Habeler, E., and Sauer, W. (1968). Chromosome numbers and evolution in primitive angiosperms. Taxon 17:337–353.
Felsenstein, J. (1989). PHYLIP - phylogeny inference package. Cladistics 5:164–166.
Franssen, S.U., Gu, J., Winters, G., Huylmans, A.K., Wienpahl, I., Sparwel, M., Coyer, J.A., Olsen, J.L., Reusch, T.B., and Bornberg-Bauer, E. (2014). Genome-wide transcriptomic responses of the seagrasses Zostera marina and Nanozostera noltii under a simulated heatwave confirm functional types. Mar. Genomics 15:65–73.
Goodstein, D.M., Shu, S., Howson, R., Neupane, R., Hayes, R.D., Fazo, J., Mitros, T., Dirks, W., Hellsten, U., Putnam, N., et al. (2012). Phytozome: a comparative platform for green plant genomics. Nucleic Acids Res. 40:D1178–D1186.
Grabherr, M.G., Haas, B.J., Yassour, M., Levin, J.Z., Thompson, D.A., Amit, I., Adiconis, X., Fan, L., Raychowdhury, R., Zeng, Q., et al. (2011). Full-length transcriptome assembly from RNA-Seq data without a reference genome. Nat. Biotechnol. 29:644–652.
Hegarty, M.J., and Hiscock, S.J. (2008). Genomic clues to the evolutionary success of polyploid plants. Curr. Biol. 18:R435–R444.
D’Hont, A., Denoeud, F., Aury, J., Baurens, F.C., Carreel, F., Garsmeur, O., Noel, B., Bocs, S., Droc, G., and Rouard, M. (2012). The banana (Musa acuminata) genome and the evolution of monocotyledonous plants. Nature 488:213–217.
Hooker, J.J., Collinson, M.E., and Sille, N.P. (2004). Eocene-Oligocene mammalian faunal turnover in the Hampshire Basin, UK: calibration to the global time scale and major cooling event. J. Geol. Soc. Lond. 161:161–172.
Huang, C.H., Sun, R., Hu, Y., Zeng, L., Zhang, N., Cai, L., Zhang, Q., Koch, M.A., Alshehbaz, I.A., and Edger, P.P. (2016a). Resolution of Brassicaceae phylogeny using nuclear genes uncovers nested radiations and supports convergent morphological evolution. Mol. Biol. Evol. 33:394–412.
Huang, C.H., Zhang, C., Liu, M., Hu, Y., Gao, T., Qi, J., and Ma, H. (2016b). Multiple polyploidization events across Asteraceae with two nested events in the early history revealed by nuclear phylogenomics. Mol. Biol. Evol. 33:2820–2835.
Huang, S., Su, X., Haselkorn, R., and Gornicki, P. (2003). Evolution of switchgrass (Panicum virgatum L.) based on sequences of the nuclear gene encoding plastid acetyl-CoA carboxylase. Plant Sci. 164:43–49.
Iorizzo, M., Ellison, S., Senalik, D., Zeng, P., Satapoomin, P., Huang, J., Bowman, M., Iovene, M., Sanseverino, W., Cavagnaro, P., et al. (2016). A high-quality carrot genome assembly provides new insights into carotenoid accumulation and asterid genome evolution. Nat. Genet. 48:657–666.
Jaillon, O., Aury, J.M., Noel, B., Policriti, A., Clepet, C., Casagrande, A., Choisne, N., Aubourg, S., Vitulo, N., Jubin, C., et al. (2007). The grapevine genome sequence suggests ancestral hexaploidization in major angiosperm phyla. Nature 449:463–467.
Jiao, Y., Leebens-Mack, J., Ayyampalayam, S., Bowers, J.E., McKain, M.R., McNeal, J., Rolf, M., Ruzicka, D.R., Wafula, E., Wickett, N.J., et al. (2012). A genome triplication associated with early diversification of the core eudicots. Genome Biol. 13:R3.
Jiao, Y., Li, J., Tang, H., and Paterson, A.H. (2014). Integrated syntenic and phylogenomic analyses reveal an ancient genome duplication in monocots. Plant Cell 26:2792–2802.
Jiao, Y., Wickett, N.J., Ayyampalayam, S., Chanderbali, A.S., Landherr, L., Ralph, P.E., Tomsho, L.P., Hu, Y., Liang, H., Soltis, P.S., et al. (2011). Ancestral polyploidy in seed plants and angiosperms. Nature 473:97–100.
Jones, K.D., and Reed, S.M. (2007). Analysis of ploidy level and its effects on guard cell length, pollen diameter, and fertility in Hydrangea macrophylla. Hort Sci. 42:483–488.
Li, L., Stoeckert, C.J., Jr., and Roos, D.S. (2003). OrthoMCL: identification of ortholog groups for eukaryotic genomes. Genome Res. 13:2178–2189.

Li, Z., Baniaga, A.E., Sessa, E.B., Scascitelli, M., Graham, S.W., consequences for comparative genomics. Proc. Natl. Acad. Sci. USA
Rieseberg, L.H., and Barker, M.S. (2015). Early genome 101:9903–9908.
duplications in conifers and other seed plants. Sci. Adv. 1:e1501084. Pertea, G., Huang, X., Liang, F., Antonescu, V., Sultana, R.,
Long, M., Betran, E., Thornton, K., and Wang, W. (2003). The origin of Karamycheva, S., Lee, Y., White, J., Cheung, F., Parvizi, B., et al.
new genes: glimpses from the young and old. Nat. Rev. Genet. (2003). TIGR Gene Indices clustering tools (TGICL): a software
4:865–875. system for fast clustering of large EST datasets. Bioinformatics
Lynch, M., and Conery, J.S. (2000). The evolutionary fate and 19:651–652.
consequences of duplicate genes. Science 290:1151–1155. Pontes, O., Neves, N., Silva, M., Lewis, M.S., Madlung, A., Comai, L.,
Lynch, M., and Conery, J.S. (2003). The evolutionary demography of Viegas, W., and Pikaard, C.S. (2004). Chromosomal locus
duplicate genes. J. Struct. Funct. Genomics 3:35–44. rearrangements are a rapid response to formation of the allotetraploid
Arabidopsis suecica genome. Proc. Natl. Acad. Sci. USA 101:18240–18245.
Madlung, A., Tyagi, A.P., Watson, B., Jiang, H., Kagochi, T., Doerge, R.W., Martienssen, R., and Comai, L. (2005). Genomic changes in synthetic Arabidopsis polyploids. Plant J. 41:221–230.
Magallón, S., Gómezacevedo, S., Sánchezreyes, L.L., and Hernández-Hernández, T. (2015). A metacalibrated time-tree documents the early rise of flowering plant phylogenetic diversity. New Phytol. 207:437–453.
McGrath, C.L., Gout, J.F., Johri, P., Doak, T.G., and Lynch, M. (2014). Differential retention and divergent resolution of duplicate genes following whole-genome duplication. Genome Res. 24:1665–1675.
McGrath, C.L., and Lynch, M. (2012). Evolutionary significance of whole-genome duplication. In Polyploidy and Genome Evolution, P.S. Soltis and D.E. Soltis, eds. (Berlin Heidelberg: Springer), pp. 1–20.
McKain, M.R., Tang, H., McNeal, J.R., Ayyampalayam, S., Davis, J.I., dePamphilis, C.W., Givnish, T.J., Pires, J.C., Stevenson, D.W., and Leebens-Mack, J.H. (2016). A phylogenomic assessment of ancient polyploidy and genome evolution across the Poales. Genome Biol. Evol. 8:1150–1164.
McKain, M.R., Wickett, N., Zhang, Y., Ayyampalayam, S., McCombie, W.R., Chase, M.W., Pires, J.C., dePamphilis, C.W., and Leebens-Mack, J. (2012). Phylogenomic analysis of transcriptome data elucidates co-occurrence of a paleopolyploid event and the origin of bimodal karyotypes in Agavoideae (Asparagaceae). Am. J. Bot. 99:397–406.
McLachlan, G.J., Peel, D., Basford, K.E., and Adams, P. (1999). The EMMIX software for the fitting of mixtures of normal and t-components. J. Stat. Softw. 4:1–14.
Ming, R., VanBuren, R., Wai, C.M., Tang, H., Schatz, M.C., Bowers, J.E., Lyons, E., Wang, M.L., Chen, J., Biggers, E., et al. (2015). The pineapple genome and the evolution of CAM photosynthesis. Nat. Genet. 47:1435–1442.
Mitchell-Olds, T., and Schmitt, J. (2006). Genetic mechanisms and evolutionary significance of natural variation in Arabidopsis. Nature 441:947–952.
Mizuta, Y., Harushima, Y., and Kurata, N. (2010). Rice pollen hybrid incompatibility caused by reciprocal gene loss of duplicated genes. Proc. Natl. Acad. Sci. USA 107:20417–20422.
Myburg, A.A., Grattapaglia, D., Tuskan, G.A., Hellsten, U., Hayes, R.D., Grimwood, J., Jenkins, J., Lindquist, E., Tice, H., Bauer, D., et al. (2014). The genome of Eucalyptus grandis. Nature 510:356–362.
O’Brien, K.P., Remm, M., and Sonnhammer, E.L. (2005). Inparanoid: a comprehensive database of eukaryotic orthologs. Nucleic Acids Res. 33:D476–D480.
Ørgaard, M. (1991). The genus Cabomba (Cabombaceae)—a taxonomic study. Nord. J. Bot. 11:179–203.
Otto, S.P. (2007). The evolutionary consequences of polyploidy. Cell 131:452–462.
Panopoulou, G., and Poustka, A.J. (2005). Timing and mechanism of ancient vertebrate genome duplications—the adventure of a hypothesis. Trends Genet. 21:559–567.
Paterson, A.H., Bowers, J.E., and Chapman, B.A. (2004). Ancient polyploidization predating divergence of the cereals, and its
Postlethwait, J., Amores, A., Cresko, W., Singer, A., and Yan, Y.L. (2004). Subfunction partitioning, the teleost radiation and the annotation of the human genome. Trends Genet. 20:481–490.
Postlethwait, J.H., Woods, I.G., Ngo-Hazelett, P., Yan, Y.L., Kelly, P.D., Chu, F., Huang, H., Hill-Force, A., and Talbot, W.S. (2000). Zebrafish comparative genomics and the origins of vertebrate chromosomes. Genome Res. 10:1890–1902.
Rabosky, D.L., Grundler, M., Anderson, C., Title, P., Shi, J.J., Brown, J.W., Huang, H., and Larson, J.G. (2014). BAMMtools: an R package for the analysis of evolutionary dynamics on phylogenetic trees. Methods Ecol. Evol. 5:701–707.
Reyes-Chin-Wo, S., Wang, Z., Yang, X., Kozik, A., Arikit, S., Song, C., Xia, L., Froenicke, L., Lavelle, D.O., Truco, M.J., et al. (2017). Genome assembly with in vitro proximity ligation data and whole-genome triplication in lettuce. Nat. Commun. 8:14953.
Sampedro, J., and Cosgrove, D.J. (2005). The expansin superfamily. Genome Biol. 6:242.
Sampedro, J., Lee, Y., Carey, R.E., dePamphilis, C., and Cosgrove, D.J. (2005). Use of genomic history to improve phylogeny and understanding of births and deaths in a gene family. Plant J. 44:409–419.
Sato, S., Tabata, S., Hirakawa, H., Asamizu, E., Shirasawa, K., Isobe, S., Kaneko, T., Nakamura, Y., Shibata, D., and Aoki, K. (2012). The tomato genome sequence provides insights into fleshy fruit evolution. Nature 485:635–641.
Schmutz, J., Cannon, S.B., Schlueter, J., Ma, J., Mitros, T., Nelson, W., Hyten, D.L., Song, Q., Thelen, J.J., and Cheng, J. (2010). Genome sequence of the palaeopolyploid soybean. Nature 463:178–183.
Schnable, P.S., Ware, D., Fulton, R.S., Stein, J.C., Wei, F., Pasternak, S., Liang, C., Zhang, J., Fulton, L., and Graves, T. (2009). The B73 maize genome: complexity, diversity, and dynamics. Science 326:1112–1115.
Sémon, M., and Wolfe, K.H. (2007). Reciprocal gene loss between Tetraodon and zebrafish after whole genome duplication in their ancestor. Trends Genet. 23:108–112.
Shi, T., Huang, H., and Barker, M.S. (2010). Ancient genome duplications during the evolution of kiwifruit (Actinidia) and related Ericales. Ann. Bot. 106:497–504.
Simillion, C., Vandepoele, K., Van Montagu, M.C., Zabeau, M., and Van de Peer, Y. (2002). The hidden duplication past of Arabidopsis thaliana. Proc. Natl. Acad. Sci. USA 99:13627–13632.
Spitzer, M., Lorkowski, S., Cullen, P., Sczyrba, A., and Fuellen, G. (2006). IsoSVM—distinguishing isoforms and paralogs on the protein level. BMC Bioinformatics 7:110.
Stamatakis, A. (2006). RAxML-VI-HPC: maximum likelihood-based phylogenetic analyses with thousands of taxa and mixed models. Bioinformatics 22:2688–2690.
Stebbins, G.L. (1999). A brief summary of my ideas on evolution. Am. J. Bot. 86:1207–1208.

Stevens, P.F., and Davis, H. (2005). The angiosperm phylogeny website—a tool for reference and teaching in a time of change. Proc. Am. Soc. Inform. Sci. Technol. 42.
Suyama, M., Torrents, D., and Bork, P. (2006). PAL2NAL: robust conversion of protein sequence alignments into the corresponding codon alignments. Nucleic Acids Res. 34:W609–W612.
Tang, H., Bowers, J.E., Wang, X., Ming, R., Alam, M., and Paterson, A.H. (2008). Synteny and collinearity in plant genomes. Science 320:486–488.
Tang, H., Bowers, J.E., Wang, X., and Paterson, A.H. (2010). Angiosperm genome comparisons reveal early polyploidy in the monocot lineage. Proc. Natl. Acad. Sci. USA 107:472–477.
Tuskan, G.A., Difazio, S., Jansson, S., Bohlmann, J., Grigoriev, I., Hellsten, U., Putnam, N., Ralph, S., Rombauts, S., Salamov, A., et al. (2006). The genome of black cottonwood, Populus trichocarpa (Torr. & Gray). Science 313:1596–1604.
Usadel, B., Poree, F., Nagel, A., Lohse, M., Czedik-Eysenberg, A., and Stitt, M. (2009). A guide to using MapMan to visualize and compare Omics data in plants: a case study in the crop species, Maize. Plant Cell Environ. 32:1211–1229.
Vanneste, K., Baele, G., Maere, S., and Van de Peer, Y. (2014). Analysis of 41 plant genomes supports a wave of successful genome duplications in association with the Cretaceous-Paleogene boundary. Genome Res. 24:1334–1347.
Velasco, R., Zharkikh, A., Affourtit, J., Dhingra, A., Cestaro, A., Kalyanaraman, A., Fontana, P., Bhatnagar, S.K., Troggio, M., Pruss, D., et al. (2010). The genome of the domesticated apple (Malus x domestica Borkh.). Nat. Genet. 42:833–839.
Vision, T.J., Brown, D.G., and Tanksley, S.D. (2000). The origins of genomic duplications in Arabidopsis. Science 290:2114–2117.
Wang, K., Wang, Z., Li, F., Ye, W., Wang, J., Song, G., Yue, Z., Cong, L., Shang, H., and Zhu, S. (2012a). The draft genome of a diploid cotton Gossypium raimondii. Nat. Genet. 44:1098–1103.
Wang, X., Wang, H., Wang, J., Sun, R., Wu, J., Liu, S., Bai, Y., Mun, J.H., Bancroft, I., and Cheng, F. (2011). The genome of the mesopolyploid crop species Brassica rapa. Nat. Genet. 43:1035–1039.
Wang, Z., Hobson, N., Galindo, L., Zhu, S., Shi, D., Mcdill, J., Yang, L., Hawkins, S., Neutelings, G., and Datla, R. (2012b). The genome of flax (Linum usitatissimum) assembled de novo from short shotgun sequence reads. Plant J. 72:461–473.
Werth, C.R., and Windham, M.D. (1991). A model for divergent, allopatric speciation of polyploid pteridophytes resulting from silencing of duplicate-gene expression. Am. Nat. 137:515–526.
Wolfe, K.H., and Shields, D.C. (1997). Molecular evidence for an ancient duplication of the entire yeast genome. Nature 387:708–713.
Yang, Y., Moore, M.J., Brockington, S.F., Soltis, D.E., Wong, G.K., Carpenter, E.J., Zhang, Y., Chen, L., Yan, Z., Xie, Y., et al. (2015a). Dissecting molecular evolution in the highly diverse plant clade caryophyllales using transcriptome sequencing. Mol. Biol. Evol. 32:2001–2014.
Yang, Z., Wafula, E.K., Honaas, L.A., Zhang, H., Das, M., Fernandez-Aparicio, M., Huang, K., Bandaranayake, P.C., Wu, B., Der, J.P., et al. (2015b). Comparative transcriptome analyses reveal core parasitism genes and suggest gene duplication and repurposing as sources of structural novelty. Mol. Biol. Evol. 32:767–790.
Yi, T., Li, H., and Li, D. (2005). Chromosome variation in the genus Pinellia (Araceae) in China and Japan. Bot. J. Linn. Soc. 147:449–455.
Zachos, J., Pagani, M., Sloan, L., Thomas, E., and Billups, K. (2001). Trends, rhythms, and aberrations in global climate 65 Ma to present. Science 292:686–693.
Zachos, J.C., Dickens, G.R., and Zeebe, R.E. (2008). An early Cenozoic perspective on greenhouse warming and carbon-cycle dynamics. Nature 451:279–283.
Zeng, L., Zhang, N., Zhang, Q., Endress, P.K., Huang, J., and Ma, H. (2017). Resolution of deep eudicot phylogeny and their temporal diversification using nuclear genes from transcriptomic and genomic datasets. New Phytol. 214:1338–1354.
Zeng, L., Zhang, Q., Sun, R., Kong, H., Zhang, N., and Ma, H. (2014). Resolution of deep angiosperm phylogeny using conserved nuclear genes and estimates of early divergence times. Nat. Commun. 5:4956.
Zhang, L., and Ma, H. (2012). Complex evolutionary history and diverse domain organization of SET proteins suggest divergent regulatory interactions. New Phytol. 195:248–263.
Zhang, N., Zeng, L., Shan, H., and Ma, H. (2012). Highly conserved low-copy nuclear genes as effective markers for phylogenetic analyses in angiosperms. New Phytol. 195:923–937.
Zhang, Z., Li, J., Zhao, X.Q., Wang, J., Wong, G.K., and Yu, J. (2006). KaKs_calculator: calculating Ka and Ks through model selection and model averaging. Genomics Proteomics Bioinformatics 4:259–263.
