Pauli et al. Genome Res. 2012 Mar; 22(3): 577–591


Resource

Systematic identification of long noncoding RNAs expressed during zebrafish embryogenesis
Andrea Pauli,1,7,8 Eivind Valen,2,7 Michael F. Lin,3,4 Manuel Garber,4
Nadine L. Vastenhouw,1 Joshua Z. Levin,4 Lin Fan,4 Albin Sandelin,2
John L. Rinn,4,5 Aviv Regev,3,4,6,8 and Alexander F. Schier1,4,8
1Department of Molecular and Cellular Biology (MCB), Harvard University, Cambridge, Massachusetts 02138, USA; 2The
Bioinformatics Centre, Department of Biology and the Biotech, Research and Innovation Centre (BRIC), University of Copenhagen,
Copenhagen DK-2200, Denmark; 3Massachusetts Institute of Technology (MIT), Cambridge, Massachusetts 02139, USA; 4The Broad
Institute of Harvard and MIT, Cambridge, Massachusetts 02142, USA; 5Department of Stem Cell and Regenerative Biology (SCRB),
Harvard University, Cambridge, Massachusetts 02138, USA; 6Howard Hughes Medical Institute (HHMI), Chevy Chase, Maryland
20815, USA

Long noncoding RNAs (lncRNAs) comprise a diverse class of transcripts that structurally resemble mRNAs but do not
encode proteins. Recent genome-wide studies in humans and the mouse have annotated lncRNAs expressed in cell lines
and adult tissues, but a systematic analysis of lncRNAs expressed during vertebrate embryogenesis has been elusive. To
identify lncRNAs with potential functions in vertebrate embryogenesis, we performed a time-series of RNA-seq experiments
at eight stages during early zebrafish development. We reconstructed 56,535 high-confidence transcripts in 28,912
loci, recovering the vast majority of expressed RefSeq transcripts while identifying thousands of novel isoforms and
expressed loci. We defined a stringent set of 1133 noncoding multi-exonic transcripts expressed during embryogenesis.
These include long intergenic ncRNAs (lincRNAs), intronic overlapping lncRNAs, exonic antisense overlapping
lncRNAs, and precursors for small RNAs (sRNAs). Zebrafish lncRNAs share many of the characteristics of their mammalian
counterparts: relatively short length, low exon number, low expression, and conservation levels comparable to that
of introns. Subsets of lncRNAs carry chromatin signatures characteristic of genes with developmental functions. The
temporal expression profile of lncRNAs revealed two novel properties: lncRNAs are expressed in narrower time windows
than are protein-coding genes and are specifically enriched in early-stage embryos. In addition, several lncRNAs show
tissue-specific expression and distinct subcellular localization patterns. Integrative computational analyses associated
individual lncRNAs with specific pathways and functions, ranging from cell cycle regulation to morphogenesis. Our study
provides the first systematic identification of lncRNAs in a vertebrate embryo and forms the foundation for future
genetic, genomic, and evolutionary studies.
[Supplemental material is available for this article.]

Large-scale genomic studies have identified a significant number of transcripts that do not code for proteins (Kapranov et al. 2002, 2007; Bertone 2004; Carninci et al. 2005; ENCODE Project Consortium et al. 2007; Ponjavic et al. 2007; Fejes-Toth et al. 2009; Guttman et al. 2009, 2010; Cabili et al. 2011). Such noncoding RNAs (ncRNAs) can be broadly classified as either small (<200 nucleotides [nt]; sRNAs) or large (>200 nt; lncRNAs) based on the size of their mature transcripts. While miRNAs (microRNAs), the best-studied class of sRNAs, regulate their mRNA targets post-transcriptionally (Bartel 2009), mRNA-like lncRNAs act by a range of mechanisms (for reviews, see Koziol and Rinn 2010; Pauli et al. 2011; Wang and Chang 2011). For example, several lncRNAs have been shown to interact with and modulate the activity of the chromatin modifying machinery (Rinn et al. 2007; Nagano et al. 2008; Pandey et al. 2008; Zhao et al. 2008, 2010; Khalil et al. 2009; Huarte et al. 2010; Tian et al. 2010; Tsai et al. 2010; Guttman et al. 2011; Wang et al. 2011). Other lncRNAs may act as decoys in the sequestration of miRNAs (Poliseno et al. 2010), transcription factors (Hung et al. 2011), or other proteins (Tripathi et al. 2010). Yet others may serve as precursors for the generation of sRNAs (Kapranov et al. 2007; Wilusz et al. 2008; Fejes-Toth et al. 2009).

Although most lncRNAs have not been functionally characterized, an emerging theme is their role in the regulation of gene expression in either cis or trans. Several trans-acting lncRNAs have been identified, including HOTAIR (Rinn et al. 2007), TP53COR1 (also known as lincRNA-p21) (Huarte et al. 2010), and PANDA (Hung et al. 2011). Moreover, knockdown of more than 100 individual long intergenic ncRNAs (lincRNAs) in mouse embryonic stem cells led to widespread changes in gene expression that could not be explained by a cis-acting mechanism (Guttman et al. 2011). Other well-described lncRNAs act in cis. For example, mammalian X chromosome inactivation and allelic imprinting depend on lncRNAs that mediate the silencing of neighboring genes by recruiting repressive chromatin modifiers (Sleutels et al. 2002; Mancini-Dinardo et al. 2006; Nagano et al. 2008; Pandey et al. 2008; Zhao et al. 2008). Additional recently identified cis-acting RNAs activate the expression of neighboring genes (Kim et al. 2010; Ørom et al. 2010; Wang et al. 2011). Collectively, these studies have demonstrated that lncRNAs can have a profound impact on gene regulation in both cis and trans.

7These authors contributed equally to this work.
8Corresponding authors.
E-mail [email protected].
E-mail [email protected].
E-mail [email protected].
Article published online before print. Article, supplemental material, and publication date are at http://www.genome.org/cgi/doi/10.1101/gr.133009.111.

22:577–591 © 2012 by Cold Spring Harbor Laboratory Press; ISSN 1088-9051/12; www.genome.org Genome Research 577

Existing annotations of mammalian lncRNAs are derived from large-scale studies of cultured cells (Kapranov et al. 2002; Rinn et al. 2003; Carninci et al. 2005; ENCODE Project Consortium et al. 2007; Dinger et al. 2008; Guttman et al. 2009, 2010) or adult tissue samples (Ponjavic et al. 2009; Cabili et al. 2011). Such relatively homogenous and abundant samples have facilitated the identification of low abundance, cell type–specific transcripts. However, this strategy is likely to miss lncRNAs that are only expressed during narrow developmental time windows. To fully characterize vertebrate lncRNAs, it is therefore necessary to systematically search for lncRNAs that are expressed during specific developmental stages.

Here, we report the systematic identification and characterization of developmental lncRNAs. We leveraged the ability to obtain large numbers of developmentally synchronous zebrafish embryos in order to perform a time-series of eight RNA-seq experiments (200–300 million reads per stage) from shortly after fertilization to early larval stages. As a measure of quality of our data set, we were able to reconstruct the vast majority of annotated zebrafish RefSeq genes and a large fraction of Ensembl gene models (Flicek et al. 2011). In contrast to recent smaller-scale RNA-seq studies that focused on protein-coding genes (Aanes et al. 2011; Vesterlund et al. 2011), we annotated and analyzed lncRNAs at high temporal resolution. We combined RNA-seq–based de novo transcript identification with a stringent filtering of putative protein-coding transcripts to define a high-confidence set of 1133 multi-exonic noncoding transcripts. Our lncRNA catalog includes 397 intergenic, 184 intronic overlapping, and 566 antisense exonic overlapping transcripts, many of which are expressed in a developmentally regulated manner. We characterized each lncRNA by diverse features, including transcript structure, evolutionary conservation, developmental expression, and associated chromatin marks. Our expression pattern data revealed several intriguing properties of zebrafish lncRNAs. Notably, lncRNAs are expressed in particularly narrow developmental windows and in specific cell types. Moreover, lncRNAs are particularly numerous in the very early embryo. Computational analysis of expression correlation with functional gene sets associated subsets of lncRNAs with developmental processes ranging from cell cycle regulation to morphogenesis. Collectively, the systematic annotation and characterization of lncRNAs expressed during zebrafish embryogenesis opens the way for future genetic, genomic, and evolutionary studies.

Results

Assembly of a high-confidence embryonic transcriptome

To systematically discover noncoding transcripts with potential functions during early vertebrate development, we performed large-scale cDNA sequencing experiments across zebrafish embryogenesis. We chose eight time-points that mark important developmental stages (Fig. 1A): (1) shortly after fertilization (two- to four-cell stage); (2) at the time when zygotic transcription of the genome is initiated (1000-cell stage); (3–5) during blastula and gastrula stages (dome, shield, and bud stages), when cell fates are specified and large-scale cell movements occur; and (6–8) at late embryonic and early larval developmental stages, when organs are forming (28 h post fertilization [hpf], 48 hpf, and 120 hpf) (see overview in Fig. 1A). Polyadenylated RNA was purified from approximately 1000 embryos per time-point and converted into cDNA libraries for strand-specific, paired-end 76 bp sequencing on Illumina’s HiSeq platform (see Methods; Parkhomchuk et al. 2009; Levin et al. 2010). On average, we obtained about 200–300 million reads per stage (more than two billion reads in total) (Supplemental Table 1). Eighty-eight percent of the reads passed initial quality thresholds, of which ~80% could be aligned to the latest assembly (Zv9) of the zebrafish genome sequence (Methods; Supplemental Table 1).

We assembled transcripts using a step-wise protocol (Methods; Fig. 1A). Briefly, we used TopHat (Trapnell et al. 2009) to align all reads per time-point, including those that span splice junctions. We then reconstructed transcripts using two assemblers, Cufflinks (Trapnell et al. 2010) and Scripture (Guttman et al. 2010), resulting in the assembly of a total of 316,373 nonredundant transcript isoforms from 143,626 loci across all embryonic stages.

We defined a "high-confidence" set of 56,535 embryonic transcripts, following a strategy similar to that described by Cabili et al. (2011). Specifically, we developed a filtering pipeline aimed at reducing the number of transcripts that might be erroneously assembled or below significance thresholds (Supplemental Fig. 1A). We first required that a transcript be assembled at least twice: either identified by both assemblers or in at least two developmental stages. Next, we removed transcripts that were likely to be assembly artifacts or run-on fragments or that did not pass our high-confidence thresholds (Methods; Supplemental Fig. 1A). This resulted in a final set of 56,535 embryonic transcripts from 28,912 loci (on average, 1.95 transcripts per locus) (Supplemental Fig. 1B,C), of which 50,904 were multi-exonic and 5631 were single exon transcripts. We will henceforth refer to this set as the "embryonic transcriptome," and all subsequent analyses are based on it.

The embryonic transcriptome has high coverage, quality, and depth

To estimate the quality and coverage of our embryonic transcriptome, we compared it to the current RefSeq and Ensembl gene annotations. Compared to the 15,175 zebrafish RefSeq genes, our embryonic transcriptome provides more than a threefold increase in the number of identified transcripts (56,535) from nearly twice as many loci (28,912), suggesting that the increase in the number of individual transcripts is due to both novel isoforms of known genes and novel loci (Fig. 1B). Notably, of the 13,942 RefSeq genes that are expressed (FPKM [fragments per kilobase of exon per million fragments mapped] > 0) during the stages covered by our data set, 90% (12,527/13,942) have transcript evidence (exonic overlapping transcripts) in our embryonic transcriptome, and 70% of those (8751/12,527) are identical to RefSeq isoforms (Supplemental Fig. 2A). In addition, 3532 of our transcripts are variants of known RefSeq genes (novel isoforms and partial transcripts), many of which extend the existing exon–intron structures with additional 5′ or 3′ exons (Supplemental Fig. 2A).

Compared to the most recent Ensembl Zv9 gene models (52,873 transcripts in 31,711 loci), our embryonic transcriptome is of similar size (Supplemental Fig. 2B). The Ensembl gene set integrates transcript annotations from several sources and includes RNA-seq transcript models built from a total of 376 million reads derived from embryonic, larval, and adult zebrafish (Sanger Institute). Thus, the comparable transcriptome size and larger read numbers in our purely embryonic RNA-seq experiments again suggest that our embryonic transcriptome is of high depth. Moreover, we have transcript evidence for ~74% of Ensembl gene loci of comparable (>160 nt) transcript sizes, corresponding to ~68% of our embryonically identified loci (Fig. 1B). This high degree of overlap provides independent confirmation for a large fraction of our transcriptome.
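
The "assembled at least twice" requirement used to build the high-confidence set can be illustrated with a minimal sketch. It assumes that equivalent transcript structures from different assemblies have already been given a shared identifier (how that matching is done is not shown here), and the identifiers, function name, and example records are hypothetical rather than taken from the study's pipeline.

```python
from collections import defaultdict

def reproducibly_assembled(observations):
    """Return the transcript ids that were assembled at least twice.

    observations: iterable of (transcript_id, assembler, stage) tuples.
    A transcript is kept if it was reported by both assemblers or if it
    was reported in at least two developmental stages.
    """
    assemblers = defaultdict(set)
    stages = defaultdict(set)
    for transcript_id, assembler, stage in observations:
        assemblers[transcript_id].add(assembler)
        stages[transcript_id].add(stage)
    return {
        tid for tid in assemblers
        if len(assemblers[tid]) >= 2 or len(stages[tid]) >= 2
    }

# Hypothetical example records (not data from the paper).
example = [
    ("tx1", "cufflinks", "dome"),
    ("tx1", "scripture", "dome"),    # both assemblers, one stage -> kept
    ("tx2", "cufflinks", "dome"),
    ("tx2", "cufflinks", "shield"),  # one assembler, two stages -> kept
    ("tx3", "scripture", "bud"),     # seen only once -> dropped
]
print(sorted(reproducibly_assembled(example)))  # ['tx1', 'tx2']
```

The later artifact and expression-threshold filters described in the Methods are deliberately left out of this sketch.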


Figure 1. Overview of the RNA-seq–based embryonic transcriptome assembly. (A) Overview of the RNA-seq–based transcript reconstruction pipeline
that was employed to identify embryonically expressed transcripts in zebrafish. Stage-specific transcriptomes were reconstructed from a time-series of
eight embryonic stages: two to four cell, 1000 cell, dome, shield, bud, 28 h post fertilization (hpf), 48 hpf, and 120 hpf. Stage-specific drawings of
representative embryos are adapted from Kimmel et al. (1995) (with permission from Wiley © 1995). A schematic outline of the process of transcriptome
reconstruction is shown at the bottom for three genes. Reads were mapped to either the + (blue) or – (red) strand using TopHat. Gaps inferred from
mapping each of the two paired-end reads are indicated as dashed gray lines; dashed black arrows indicate splice-junctions inferred from a gap in mapping
of a single read; and the deduced final transcript structures reconstructed by Scripture or Cufflinks are depicted at the bottom. (B) Overlap between loci
from the RNA-seq–based embryonic transcriptome assembly (blue) and previously annotated genes (gray): RefSeq genes (left) and Ensembl loci >160 bp
(right). The majority of known loci (84% of RefSeq loci and 74% of Ensembl loci >160 bp) are recovered in the embryonic transcriptome. Note that the
number of loci in the Ensembl transcriptome is based on comparison with loci of the embryonic transcriptome (which were used as reference), which
reduced the number of 27,751 Ensembl loci (>160 bp) to 26,587.

Compared to two recent RNA-seq–based transcriptome studies in zebrafish embryos (Aanes et al. 2011; Vesterlund et al. 2011), our data set is of significantly higher depth: a total of about 220 million (Vesterlund et al. 2011) and about 100 million (Aanes et al. 2011) mapped reads from four and six embryonic stages, respectively, versus about 1.5 billion mapped reads in our study (Supplemental Table 1). Moreover, we identify many more known and novel transcribed loci: about 4000 "novel transcribed regions" reported by Vesterlund et al. (2011) and Aanes et al. (2011) versus more than 9000 novel loci in our embryonic transcriptome with no previous annotations in RefSeq or Ensembl. This suggests that our embryonic transcriptome provides a highly comprehensive and more complete assembly than that previously available.

Our transcript assemblies are also consistent with chromatin marks known to be associated with promoters (Zhou et al. 2011) (see also below). The fraction of marked protein-coding loci of our embryonic transcriptome (44% for H3K4me3 only, 19% for H3K4me3 and H3K27me3) is nearly identical to the fraction of marked RefSeq loci (46% for H3K4me3 only, 16% for H3K4me3 and H3K27me3) (see Fig. 5). This suggests that (1) our embryonic transcriptome is of a quality comparable to RefSeq genes, and (2) many of our RNA-seq–based transcript structures contain complete 5′ ends.

Identification of a stringent set of embryonic lncRNAs

To identify mRNAs that exert their biological function as lncRNAs, we developed a highly stringent filtering pipeline aimed at removing transcripts with evidence for protein-coding potential (Methods; Fig. 2). We identified putative lncRNAs by considering their phylogenetic conservation across species, homology with known proteins and protein domains, and potential ORFs.

Four filters were used. First, we used PhyloCSF (phylogenetic codon substitution frequency; see Methods) to score the coding potential of transcripts using phylogenetic alignments (Lin et al. 2011). PhyloCSF exploits the fact that protein-coding sequences (but not lncRNAs and other sequences) tend to have a higher rate of synonymous versus nonsynonymous substitutions (Supplemental Figs. 3, 4A). We chose a PhyloCSF threshold of less than 20 because it retained the majority of RefSeq ncRNAs (Supplemental Figs. 3, 4A) but removed 96.2% of protein-coding RefSeq transcripts. This filter retained 4867 putative noncoding transcripts (Fig. 2).

Second, we removed transcripts that had similarity to known proteins or protein domains based on blastx, blastp, and HMMER (Pfam domains) (Eddy 2009). This filter retained 2531 putative noncoding transcripts (Fig. 2B). The excluded transcripts had not been captured by the PhyloCSF filter because they typically received low PhyloCSF scores due to poorly aligned sequences (complete branch lengths [CBLs] of zero) (Methods; Supplemental Fig. 4B).

Third, we removed any remaining transcript of uncertain coding potential by applying a maximal ORF filter. Consistent with the traditional cutoff for protein-coding transcripts (Okazaki et al. 2002), we excluded any transcript with a maximal ORF > 100 amino acids (aa). For transcripts that were not scored by PhyloCSF due to lacking sequence alignments (CBL = 0), we used a more stringent maximal ORF cutoff of 30 aa. The ORF filter retained 1301 transcripts.

Finally, to exclude potentially incomplete transcript structures, we removed any transcript that had sense exonic overlap with a protein-coding transcript. The resulting set contained 902 lncRNAs (mean PhyloCSF score of 5) (Fig. 2B; Supplemental Fig. 3).
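
The four filters described above can be read as a simple cascade of per-transcript checks. The sketch below is only an illustration under stated assumptions: the `Transcript` record, its field names, and the example values are hypothetical; the homology result is assumed to be precomputed (any blastx, blastp, or HMMER hit); transcripts without alignments (CBL = 0) are assumed to carry a low PhyloCSF score; and an ORF exactly at the cutoff is assumed to pass, because the text excludes ORFs greater than 100 aa whereas the Figure 2 legend states the criterion as less than 100 aa.

```python
from dataclasses import dataclass

@dataclass
class Transcript:
    name: str
    phylocsf: float              # PhyloCSF score of the transcript
    cbl: float                   # complete branch length; 0 means no alignment
    has_protein_homology: bool   # assumed precomputed from blastx/blastp/HMMER
    max_orf_aa: int              # length of the longest ORF in amino acids
    sense_overlap_coding: bool   # sense exonic overlap with a coding transcript

def is_putative_lncrna(t: Transcript) -> bool:
    """Apply the four filters in the order given in the text."""
    if t.phylocsf >= 20:                     # filter 1: PhyloCSF score below 20
        return False
    if t.has_protein_homology:               # filter 2: no known protein homology
        return False
    orf_cutoff = 30 if t.cbl == 0 else 100   # filter 3: stricter cutoff without alignments
    if t.max_orf_aa > orf_cutoff:
        return False
    if t.sense_overlap_coding:               # filter 4: no sense overlap with coding exons
        return False
    return True

# Hypothetical transcripts, for illustration only.
candidates = [
    Transcript("lnc_a", -12.0, 0.6, False, 45, False),
    Transcript("coding_b", 850.0, 0.9, True, 420, False),
    Transcript("unaligned_c", 0.0, 0.0, False, 60, False),
    Transcript("unaligned_d", 0.0, 0.0, False, 25, False),
]
print([t.name for t in candidates if is_putative_lncrna(t)])  # ['lnc_a', 'unaligned_d']
```

Ordering the checks this way mirrors the stepwise counts reported in the text (4867, 2531, 1301, and 902 transcripts), although the final set produced by a conjunction of filters does not depend on the order.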

Identification of antisense overlapping embryonic lncRNAs

Some putative noncoding transcripts had antisense exonic overlap with protein-coding genes. Examination of the range of PhyloCSF scores obtained for antisense strands of sense-coding transcripts revealed that transcripts with a high-scoring sense strand also tended to score relatively high on the antisense strand (Supplemental Fig. 4C). Thus, PhyloCSF scores of antisense exonic overlapping transcripts can be confounded by high coding potential on the opposite strand.

To address this issue and "rescue" noncoding antisense transcripts, we employed a modified filtering pipeline with four additional criteria (Fig. 2; for details, see Methods): (1) The putative noncoding transcript had a lower PhyloCSF score than the overlapping coding transcript; (2) its highest PhyloCSF score was obtained in the region of overlap (e.g., Supplemental Fig. 4D); (3) its PhyloCSF score was less than 300; and (4) the sense/antisense exonic overlap did not exceed 81% of the sense strand. This approach retained 231 multi-exonic antisense transcripts and resulted in a final stringent set of 1133 lncRNAs (Fig. 2B).

Figure 2. Overview of the stringent filtering pipeline that defined a conservative set of 1133 lncRNAs. (A) Filters at a glance: overview of classification criteria used to define noncoding transcripts. (B) Detailed outline of the filtering pipeline that defined a conservative set of 1133 multi-exonic, embryonically expressed lncRNAs. The following filtering criteria were used: (1) Phylogenetic Codon Substitution Frequency (PhyloCSF) score <20 (left branch of the top node) or rescue by the antisense pipeline (right branch of the top node [dashed lines]: PhyloCSFsense < 300 and PhyloCSFsense < PhyloCSFanti and highest scoring region [HSR] overlapping with an exon on the opposite strand); (2) no known protein homologs based on blastx, blastp, and HMMER; (3) maximal ORF (ORFmax) <100 aa (transcripts with alignments [complete branch length (CBL) > 0]) or <30 aa (transcripts without alignments [CBL = 0]); and (4) no sense-overlap with any protein-coding transcript. At each step, a green arrow denotes the transcripts that passed the filter; a red arrow, those that were removed. Black bold numbers indicate the number of transcripts that passed the filter. Blue boxes highlight the number of transcripts that passed all filters and are considered noncoding (1133 lncRNAs in 859 loci).

Genomic characterization of embryonic lncRNAs

According to their genomic location, our 1133 embryonic lncRNAs are partitioned into 397 lincRNAs without overlap with any genes, 184 intronic overlapping lncRNAs, and 566 antisense exonic overlapping lncRNAs (Fig. 3). Intronic overlapping lncRNAs are defined as loci that have no exon–exon overlap with another locus, i.e., there is no overlap between the mature lncRNA and exons of the overlapping locus. Intronic overlapping lncRNAs are in either sense or antisense orientation with respect to the overlapping gene and can be further partitioned into 105 intronic contained lncRNAs (incs; the lncRNA is contained within the transcribed region of another locus), 60 completely overlapping lncRNAs (concs; the other locus is contained within the transcribed region of the lncRNA locus), and 19 partially overlapping lncRNAs (poncs; neither incs nor concs but with at least one exon of the lncRNA contained within an intron of another locus).

Some lncRNAs may function as precursors for the generation of sRNAs (ENCODE Project Consortium et al. 2007; Wilusz et al. 2008). To identify sRNA-precursor lncRNAs, we compared our lncRNA transcripts to a set of sRNAs present in 2-d-old zebrafish (Methods; Cifuentes et al. 2010). We identified 41 lncRNAs that appear to function as precursors for the production of miRNAs (16), snoRNAs (nine), or sRNAs of unknown categories (20) (Supplemental Table 2). Four lncRNAs of the latter category contained a vast number of sRNAs throughout the entire transcript. For example, the zebrafish ortholog of the abundant nuclear lncRNA MALAT1 (also called NEAT2) was cleaved throughout its transcript and gave rise to multiple sRNAs (Supplemental Fig. 5). Consistent with this observation, MALAT1 has previously been shown to be associated with Ago2 (also known as EIF2C2), a known component of the sRNA processing machinery (Weinmann et al. 2009). This analysis indicates that the large majority of our lncRNAs are not processed into sRNAs.

Zebrafish lncRNAs are shorter, less conserved and expressed at lower levels than are protein-coding genes

Previous studies in mammals have shown that lncRNAs are shorter, less conserved, and expressed at significantly lower levels than are protein-coding genes (Guttman et al. 2010; Cabili et al. 2011). To determine whether embryonic lncRNAs have similar features, we analyzed the structure, expression level, and conservation of our lncRNAs (Fig. 4). We found that zebrafish lncRNAs were on average about one-third of the length of protein-coding transcripts (mean length of 1113 nt for lncRNAs versus 3352 nt for coding transcripts) (Fig. 4Aa). Moreover, lncRNAs had fewer exons
per transcript (about 2.8) than the average protein-coding gene (about 11) (Fig. 4Ab). These properties are comparable to the estimated transcript length and exon number of human lincRNAs (on average, ~1 kb and 2.9 exons, respectively) (Cabili et al. 2011). Notably, zebrafish embryonic lncRNAs were expressed on average at about 10-fold lower levels than protein-coding genes (Fig. 4B), consistent with the low expression levels of their mammalian counterparts (Guttman et al. 2010; Cabili et al. 2011).

To assess the level of conservation of lncRNAs, we used the CBL score, a measure of the fraction of phylogenetic teleost alignments present over the region of interest (Methods). In agreement with signatures of conservation in mammalian lncRNAs (Ponjavic et al. 2007, 2009; Guttman et al. 2009, 2010; Ørom et al. 2010), a few lncRNAs were clearly conserved across fish species (Fig. 4C; for two conserved examples, see Supplemental Fig. 6A). However, the majority of zebrafish lncRNA loci had low CBL scores, indicating a lack of sequence alignments over many noncoding regions (Fig. 4C). The conservation of zebrafish lncRNAs as reflected by CBL scores was substantially lower than the conservation of protein-coding genes and was comparable to the conservation of intronic sequences (Supplemental Fig. 6B).

Figure 3. Classification of lncRNAs. Numbers of lncRNAs in each of the three main classes, as defined by their genomic location relative to neighboring or overlapping genes. Intergenic lncRNAs (blue; lincRNAs) have no overlap with any gene. lncRNAs with intronic overlap (green) are defined as loci that have overlap with another transcribed locus but no exon–exon overlap (no overlap between the mature lncRNA transcript with exons of the overlapping locus). They are on either the same or the opposite strand relative to the overlapping gene and can be partitioned into intronic contained lncRNAs (incs, light green; the lncRNA is contained within the transcribed region of another locus), completely overlapping lncRNAs (concs, green; the other locus is contained within the transcribed region of the lncRNA locus), and partially overlapping lncRNAs (poncs, dark green; neither inc nor conc, but at least one exon of the lncRNA has overlap with an intron of another locus). LncRNAs with antisense exonic overlap (red) have at least one exon that overlaps with an exon of a protein-coding transcript on the opposite strand; they can be partitioned into those identified via the general pipeline (PhyloCSF < 20, light red) and those rescued via the antisense pipeline (20 < PhyloCSF < 300, dark red). A scheme of the position of the lncRNA gene (in color) relative to neighboring or overlapping gene(s) (black) is shown at the bottom.

Figure 4. LncRNAs are shorter, less conserved, and expressed at lower levels than protein-coding genes. (A) Transcript length (a), number of exons (b), and maximum ORF length (ORFmax) (c) of the 1133 lncRNAs (top row) and of the 1133 lncRNAs (blue) in comparison to protein-coding transcripts (44,810 transcripts with PhyloCSF > 50; gray; bottom row). LncRNAs are generally shorter, have fewer exons, and contain shorter ORFs than protein-coding transcripts. Note that this might be an underestimation of the actual size of lncRNAs due to a potentially more incomplete assembly of low-expressed transcripts. (B) Comparison of the expression levels of lncRNA loci (859) and protein-coding loci (19,592 loci with PhyloCSF >50), plotted as fragments per kilobase of exon per million fragments mapped (FPKM). LncRNA loci are expressed at approximately 10-fold lower levels than the majority of protein-coding loci. (C) Comparison of the alignment quality across the locus of interest, assessed by two alternative measurements of the branch lengths present in the alignment. Branch lengths are measured on a scale from 0 to 1, where 0 indicates no alignments over the region of interest and 1 indicates the presence of 100% of sequence alignments. The branch length (BL) score refers to the alignment quality of the region that scores highest in PhyloCSF (the highest scoring region [HSR]; left). The complete branch length (CBL) score refers to the alignment quality over the entire length of the transcript (right). In the case of noncoding genes, alignments are poorer for the HSRs than for the entire gene length (BL scores < CBL scores). The reverse is true for protein-coding genes, which tend to have the best alignments over the HSRs (BL scores close to one). The values of the median (yellow dashed line) and mean are indicated in all panels.

lncRNA genes carry chromatin marks associated with developmental regulators

To assess to which extent lncRNA genes carry chromatin marks that are known to be associated with protein-coding genes (Vastenhouw et al. 2010; Zhou et al. 2011), we performed chromatin immunoprecipitation assays in shield stage embryos followed by deep sequencing (ChIP-seq). We tested for the presence of trimethylated lysine 4 on histone 3 (H3K4me3), a known marker of promoters, and trimethylated lysine 27 on histone H3 (H3K27me3), a repressive histone


modification. We restricted our analysis to lincRNAs and intronic overlapping lncRNAs since unambiguous assignment of marks to antisense exonic overlapping transcripts can be confounded by the overlapping genes.

Of all lncRNA promoter regions that were assessed, 29% were marked with H3K4me3 (both H3K4me3-only and H3K4me3/H3K27me3) (Fig. 5A). Notably, the fraction of H3K4me3-marked zebrafish lncRNA genes was similar to the 24% of human lincRNA genes that have a K4-K36 domain (Cabili et al. 2011), but was smaller than the fraction (63%) of marked zebrafish protein-coding genes (Fig. 5A).

To consider the possibility that the discrepancy between the fraction of H3K4me3-marked lncRNA and protein-coding loci could be due to the lower expression levels of lncRNA loci, we restricted our analysis to protein-coding genes expressed at shield stage and at expression levels similar to lncRNAs. Even under these conditions, the discrepancy between H3K4me3-positive noncoding (34%) and coding (74%) loci remained (Fig. 5B). This suggests that (1) the different expression levels of noncoding and protein-coding loci are not the primary cause of the different fractions of H3K4me3-marked loci, and (2) similarly to protein-coding genes (Vastenhouw et al. 2010), noncoding loci are marked with H3K4me3 largely independently of their expression status.

Interestingly, 7% of lincRNA and intronic overlapping lncRNA loci were marked by both H3K4me3 and H3K27me3 at shield stage (Fig. 5). Since Gene Ontology (GO)–term analysis of protein-coding genes marked with both H3K4me3 and H3K27me3 at shield stage revealed enrichment for developmental and regulatory functions (Supplemental Table 3), lncRNA loci may be important developmental regulators.

Nearest neighbor analysis of lncRNA genes

Previous studies have shown that mammalian lncRNAs are preferentially located next to genes with developmental functions (Dinger et al. 2008; Mercer et al. 2008; Guttman et al. 2009; Ponjavic et al. 2009; Cabili et al. 2011). We therefore analyzed the GO terms of genes that overlap with or are neighbors of zebrafish lncRNAs. We found significant enrichments (P < 0.05) of transcription factor activity, fate specification, and embryonic development and morphogenesis for genes that overlap with antisense exonic lncRNAs

that neighbor or overlap lncRNAs. Comparison with enrichments observed in a control set of protein–protein gene neighbors revealed no significant enrichment of particular evolutionary age groups for our lncRNA neighbors (Supplemental Fig. 7C).

Collectively, our analysis suggests that the neighbors of zebrafish lncRNAs belong to various classes of protein-coding genes of both ancient and more recent evolutionary origin and generally do not correlate in their expression with the neighboring lncRNAs.

Temporal expression profiles of lncRNAs

Our high-resolution time-series of RNA-seq experiments allowed us to follow the expression dynamics of lncRNAs and protein-coding genes as development proceeds. Comparison of independently clustered expression profiles of noncoding and protein-coding loci (Methods) revealed that both types of loci could be grouped into three broad classes (Fig. 6A): (1) loci whose transcripts were parentally supplied (these transcripts were present in the two- to four-cell-stage embryo [cleavage stages] and rapidly decayed after the first few hours of embryogenesis); (2) loci whose expression peaked during blastula and gastrula stages (dome, shield, and bud stages; these transcripts were absent or only present at low levels during the early cleavage stages and were zygotically transcribed); and (3) loci that were only induced 1 d after fertilization during the process of organogenesis.

We discovered two differences between the expression patterns of protein-coding and noncoding loci. First, lncRNAs were more likely to be parentally supplied than were protein-coding mRNAs (see Fig. 6A). A locus was classified as "parentally provided" if at least 10% of its total expression across all eight embryonic stages was derived from the two- to four-cell stage. Of all transcripts present in our catalog, ~42% of lncRNAs were classified as parentally provided, compared with only ~34% of protein-coding transcripts (Fisher’s exact test, P < 10^-5). These observations suggest that parentally provided transcripts are specifically enriched in lncRNAs.

Second, the changes in a transcript’s expression level between two consecutive stages were more pronounced for lncRNAs than for protein-coding genes. This observation suggests that lncRNAs have a more restricted temporal expression than do coding RNAs.
(Supplemental Fig. 7A; Supplemental Table 4) but not for neigh- To further test this hypothesis, we calculated a Shannon entropy-
bors of lincRNAs and intronic overlapping lncRNAs (Supplemental based specificity score per locus as a measure of expression level
Table 4). divergence during embryogenesis (Methods). All three classes of
The mere physical proximity of lncRNAs and genes with de- lncRNAs (lincRNAs, intronic overlapping, and antisense exonic
velopmental functions does not necessarily imply a functional link overlapping lncRNAs) showed an increased temporal specificity
between the protein-coding gene and the lncRNA. For example, compared with protein-coding genes (Fig. 6B). To rule out that this
recent studies in the mouse did not detect a strong correlation effect was caused by an increase in noise due to the lower expres-
between the expression levels of most lncRNAs and their neigh- sion levels of lncRNAs, we also sampled protein-coding loci from
bors (Guttman et al. 2011). Consistent with this study and with data the same expression quantiles as lncRNAs (Methods). Although
from human lincRNAs (Cabili et al. 2011), we did not detect a higher protein-coding loci that were expressed at low levels tended to be
degree of expression correlation for the majority of lncRNAs and more restricted in time than were highly expressed protein-coding
their neighbors (or overlapping genes) than for protein–protein loci, they were significantly less restricted than were lncRNAs (P <
gene pairs or randomly assigned gene pairs (Supplemental Fig. 7B). 10 4) (Fig. 6B). Together, these analyses reveal high temporal
The only exceptions were sense intronic overlapping lncRNAs, specificity during development as a novel property of lncRNAs.
which tended to positively correlate in expression with the over-
lapping genes (Supplemental Fig. 7B). Such overlapping lncRNAs
might resemble enhancer-associated lncRNAs (De Santa et al. 2010; Assigning function through expression correlation
Kim et al. 2010; Ørom et al. 2010; Wang et al. 2011). The lack of annotated features makes the assignment of func-
To test whether lncRNA genes are preferentially located near tions to lncRNAs a more challenging task than for proteins.
protein-coding genes of certain evolutionary ages, we analyzed the Therefore, functional predictions for mammalian lncRNAs have
phylostratographic classes (Domazet-Lošo and Tautz 2010) of genes often been based on ‘‘guilt-by-association’’ analyses (Dinger et al.


Figure 5. LncRNA genes carry chromatin marks associated with developmental regulators. Shown are the fractions of promoters (±500 bp relative to the transcription start site [TSS]) that are marked by a specific histone modification at shield stage. Histone marks were assessed by ChIP-seq experiments and analyzed for the presence of H3K4me3 only, H3K27me3 only, and both H3K4me3 and H3K27me3. RefSeq genes (gray bars); protein-coding loci (black bars); lncRNA loci (blue bars). (A) Marked fractions of promoters considering all loci. (B) Marked fractions of promoters only considering loci expressed at shield stage. In B, protein-coding loci were sampled from expression levels comparable to the set of 145 lncRNA loci expressed at shield (see Methods). Error bars, 1 SD of 10,000-times sampling. (C) Example chromatin profiles for a shield-expressed lincRNA gene marked by H3K4me3 (top) and for a lncRNA locus (overlapping the protein-coding genes eng2a and insig1) marked by both H3K4me3 and H3K27me3 (bottom). Signals are shown as the number of ChIP-seq reads that aligned overlapping in a 5-bp window (note that the y-axis ranges from 0–12).
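
The promoter-marking scheme summarized in this legend (a ±500-bp window around each TSS, scored for H3K4me3 only, H3K27me3 only, or both) can be illustrated with a short Python sketch. This is not the pipeline used in the study: it assumes that all peaks and TSS positions lie on one chromosome, that intervals are half-open integer coordinates, and that any overlap between a peak and the window counts as "marked". The function names are hypothetical.

```python
from collections import Counter

LABELS = ("H3K4me3_only", "H3K27me3_only", "both", "unmarked")

def promoter_window(tss, flank=500):
    """Half-open window [tss - flank, tss + flank), clipped at coordinate 0."""
    return (max(0, tss - flank), tss + flank)

def overlaps_any(window, peaks):
    """True if any (start, end) peak overlaps the half-open window."""
    start, end = window
    return any(p_start < end and start < p_end for p_start, p_end in peaks)

def classify_promoter(tss, k4_peaks, k27_peaks, flank=500):
    """Assign one of the four labels in LABELS to a single promoter."""
    window = promoter_window(tss, flank)
    has_k4 = overlaps_any(window, k4_peaks)
    has_k27 = overlaps_any(window, k27_peaks)
    if has_k4 and has_k27:
        return "both"
    if has_k4:
        return "H3K4me3_only"
    if has_k27:
        return "H3K27me3_only"
    return "unmarked"

def marked_fractions(tss_positions, k4_peaks, k27_peaks, flank=500):
    """Fraction of promoters carrying each label (assumed toy summary)."""
    if not tss_positions:
        raise ValueError("need at least one TSS")
    counts = Counter(
        classify_promoter(t, k4_peaks, k27_peaks, flank) for t in tss_positions
    )
    return {label: counts[label] / len(tss_positions) for label in LABELS}
```

A real analysis would additionally key peaks by chromosome and use an interval index rather than a linear scan; the linear scan is kept here only to make the overlap rule explicit.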


Figure 6. Temporal expression profiles of lncRNA genes compared to protein-coding genes. (A) Dynamic changes in expression profiles of loci (rows) across eight embryonic stages (columns). Heatmaps of 859 lncRNA loci (blue; left) and 23,462 protein-coding loci (gray; right) show normalized expression values (the sum of expression across all stages per locus is set to one). Three main expression patterns can be distinguished: “cleavage stages” (transcripts present in two- to four-cell-stage embryos), “zygotic” (transcripts enriched during blastula and gastrula stages and absent/only present at low levels at the two- to four-cell stage), and “larval” (transcripts induced only 1 d after fertilization). Note that the fraction of parentally provided (cleavage stage) transcripts is higher for lncRNAs than for protein-coding transcripts. (B) Temporal restriction of expression. Shown are distributions of Shannon entropy-based temporal specificity scores that were calculated for distinct classes of lncRNA loci and protein-coding loci (see Methods): exonic overlapping antisense lncRNAs (red), intronic overlapping lncRNAs (green), intergenic lncRNAs (blue), all protein-coding loci (black), and protein-coding loci of similar expression levels as lncRNA loci (gray; 95% confidence interval based on 10,000-times sampling). All classes of lncRNA loci display higher temporal specificity than protein-coding loci. (C) Expression-based association matrix of 835 lncRNA loci (rows) and functional gene sets (columns), derived from gene set enrichment analysis (GSEA). (Red) Positive correlation; (blue) negative correlation; (white) no correlation. Rows corresponding to lncRNAs whose RNA expression pattern is shown by in situ hybridization in Figure 7 are indicated on the left. Black boxes highlight two clusters associated with functions in signaling (cluster 2) and development (cluster 6). (Top right) The most enriched GO terms per cluster in comparison to all other clusters. (Bottom right) The 10 most enriched GO terms in the two boxed clusters in comparison to all other clusters, ranked by their –log10(P-values).
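
The quantities plotted in panels A and B can be made concrete with a small Python sketch. It rests on several assumptions that go beyond the text: the specificity score is read as 1 − H(g)/log2(N), with H(g) taken to be the Shannon entropy (in bits) of a locus's sum-to-one expression profile across the N = 8 stages; the first entry of each profile is assumed to be the two- to four-cell stage; and the 10% cutoff for "parentally provided" is the one stated in the Results. Function names are hypothetical, and this is an illustration rather than the authors' code.

```python
import math

def normalize_profile(expr):
    """Scale a per-stage expression vector so that it sums to one."""
    total = sum(expr)
    if total <= 0:
        raise ValueError("locus has no expression in any stage")
    return [x / total for x in expr]

def temporal_specificity(expr):
    """Assumed score: 1 - H/log2(N), H = Shannon entropy (bits) of the profile.

    0 means equal expression at every stage; 1 means a single stage only.
    """
    p = normalize_profile(expr)
    n = len(p)
    if n < 2:
        raise ValueError("need at least two stages")
    entropy = -sum(x * math.log2(x) for x in p if x > 0)
    return 1.0 - entropy / math.log2(n)

def is_parentally_provided(expr, first_stage_index=0, min_fraction=0.10):
    """True if the earliest stage contributes at least min_fraction of the total."""
    return normalize_profile(expr)[first_stage_index] >= min_fraction
```

As a worked case under these assumptions, a locus expressed equally in exactly two of eight stages has H = 1 bit and therefore a score of 1 − 1/3 ≈ 0.67, midway between a flat profile (0) and a single-stage profile (1).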


2008; Guttman et al. 2009; Cabili et al. 2011). We therefore analyzed the correlation between the expression dynamics of each protein-coding gene with the expression dynamics of each lncRNA locus. We performed gene set enrichment analysis (GSEA) (Methods; Mootha et al. 2003; Subramanian et al. 2005; Guttman et al. 2009) and associated GO terms and lncRNAs. This analysis identified several groups of lncRNAs associated with protein-coding gene sets of distinct functional categories such as signaling (cluster 2), development (cluster 6), and cell cycle (cluster 8) (Fig. 6C; for a complete list of enriched GO terms in each cluster, see also Supplemental Table 5). Interestingly, about one-third of our lncRNAs were associated with clusters enriched in developmental functions (clusters 4–7). These results indicate that many embryonic lncRNAs are putative developmental regulators.

LncRNAs show tissue-specific and subcellularly restricted expression patterns
To determine whether embryonic lncRNAs were expressed in specific tissues, we performed RNA in situ hybridization for a selected set of lncRNAs. Thirty-two lncRNAs were amplified from cDNA. While the majority of lncRNAs did not reveal strong or tissue-specific expression (Supplemental Fig. 8), several lncRNAs showed both expression in specific cell types and distinct subcellular RNA localization patterns (Fig. 7 and data not shown). Examples of cell type–specific expression patterns included lncRNAs that were loaded into the fertilized embryo by cytoplasmic streaming (e.g., hoxAa_lncRNA) (Fig. 7Ai), a lncRNA expressed in developing somites (myo18a-lncRNA) (Fig. 7Aii), and lncRNAs with distinct expression patterns in the developing nervous system (Fig. 7Aiii,iv).

Several zebrafish lncRNAs showed distinct subcellular localization patterns, supporting and extending previous localization studies in the mouse brain (Mercer et al. 2008; Ponjavic et al. 2009). For example, we observed nuclear enrichment of some lncRNAs in early cleavage stage embryos, including chromatin association in mitotically dividing cells (e.g., hoxAa_lncRNA) (Fig. 7Bi). Other lncRNAs such as mprip_lncRNA were found to accumulate at the cytoplasmic side of yolk syncytial layer nuclei at the bud stage (Fig. 7Bii). A particularly striking example of a subcellularly localized lncRNA was myo18a-lncRNA, which was enriched specifically at the

Figure 7. LncRNAs show tissue-specific and subcellularly restricted expression patterns. (A) Examples of lncRNAs with cell type–specific expression
patterns at different stages of embryogenesis. Shown are in situ hybridization images with probes specific to the indicated lncRNAs. Expression is observed
(i) in a two-cell stage embryo (cytoplasmic streaming from the yolk), (ii) in developing muscles, and (iii,iv) in distinct cells in the developing nervous
system. (i,ii) Lateral views (anterior toward the left in ii); (iii,iv) dorsal views, anterior toward the left. (B) Examples of subcellularly localized lncRNAs. Bottom
panels in i and ii (middle panel in iii, right) show a counterstain of the in situ image with the DNA-dye OliGreen (green). Black arrowheads point to
subcellularly localized RNAs; white arrowheads point to the same position in the OliGreen-stained images. (i) Nuclear enrichment and association with
chromatin (hoxAa-lncRNA); (top) 16-cell stage embryo with mitotically dividing nuclei; (middle, bottom) four-cell stage embryo. (ii) Enrichment at the
nuclear periphery (mprip_lncRNA): (top) overview of a bud-stage embryo, showing accumulation of the lncRNA around nuclei of the yolk syncytial layer
(YSL); (middle, bottom) close-up view of a dissected portion of the embryo shown in the top panel. Note that the lncRNA is specifically enriched around the
large nuclei of the YSL but not around the small nuclei of the overlying cell-sheet. (iii) Enrichment at the myoseptum, the boundary between two adjacent
myotubes (myo18a-lncRNA; top left, right); dystrophin mRNA (middle left) is a known marker of the myoseptum (Bassett 2003); myzh1.1 (myosin heavy
chain) mRNA (bottom left) is detected throughout the somites (not subcellularly localized); and (right) myo18a-lncRNA (red, in situ) is enriched at the
myoseptum, which is characterized by the absence of nuclei (regions of no green in the OliGreen-stained panel). Note that there is no overlap between red
and green in the merge panel.


myoseptum, the boundary where myotubes of adjacent somites meet (Fig. 7Aii,Biii). The myo18a-lncRNA localization pattern was distinct from many muscle-specific mRNAs that are ubiquitously expressed in somites (e.g., myzh1.1 [myosin heavy chain]), but resembled the mRNA localization of the protein-coding gene dystrophin, a known marker of the myoseptum (Fig. 7Biii, left panel; Bassett 2003). Dystrophin is a key component of the protein complex that connects the cytoskeleton of the muscle fiber to the extracellular matrix, and its deficiency causes severe myopathies (Koenig et al. 1988). Intriguingly, a potential function of myo18a_lncRNA in cell–cell contact formation was supported by our expression-based GSEA approach, which associated myo18a_lncRNA with functions in “cell adhesion” and “structural molecule activity” (Fig. 6C, cluster 3). Collectively, these results reveal that several embryonic lncRNAs are expressed not only in specific tissues but also in specific subcellular domains.

Discussion

We have generated a systematic annotation of the zebrafish embryonic transcriptome, focusing specifically on the identification and characterization of lncRNAs. Large-scale RNA-seq experiments at eight embryonic stages allowed us to reconstruct 56,535 high-confidence coding and noncoding transcripts from 28,912 loci. We recovered the vast majority of expressed RefSeq transcripts, identified thousands of novel expressed loci and novel isoforms, and also captured the dynamic changes in expression levels of each transcript as development proceeds. Our data set is of about three- to fourfold higher depth than two recent zebrafish RNA-seq studies (Aanes et al. 2011; Vesterlund et al. 2011). This higher sequencing depth also translated into a significant increase in the number of identified expressed genes and was particularly important for the detection of lncRNAs that are expressed at relatively low levels. While both previous studies report read coverage across about 11,000 annotated genes, we have transcript evidence for 12,816 RefSeq genes (Fig. 1B) and 19,668 Ensembl loci (Supplemental Fig. 2B). In addition, we identified and reconstructed high-confidence (two-times evidence) transcripts expressed from more than 9000 novel loci with no previous annotations in RefSeq or Ensembl, almost twice as many as the number of “novel transcribed regions” reported by Aanes et al. (2011) and Vesterlund et al. (2011). Thus, our data set provides the most comprehensive annotation of the zebrafish embryonic transcriptome to date.

We defined a stringent set of 1133 multi-exonic noncoding transcripts, which includes lincRNAs, intronic overlapping lncRNAs, exonic antisense overlapping lncRNAs, and precursors for sRNAs. Our lncRNAs (the first long noncoding transcript catalog in a vertebrate embryo and in the zebrafish) share many of the characteristics of their mammalian counterparts (Dinger et al. 2008; Guttman et al. 2009, 2010, 2011; Ponjavic et al. 2009; Cabili et al. 2011): relatively short length, low exon number, relatively low expression, and conservation levels comparable to introns. Several observations indicate that zebrafish lncRNAs are likely to have diverse functions: They are associated with chromatin marks characteristic of genes with developmental functions (Bernstein et al. 2006; Vastenhouw et al. 2010), several are expressed in spatially and temporally restricted domains, and functional “guilt-by-association” analyses predict roles in processes ranging from cell cycle regulation to morphogenesis. Thus, zebrafish lncRNAs will be an excellent model system for functional studies that are difficult to perform in mammals.

Analysis of the developmental in vivo expression profile of our data set highlighted two novel properties of lncRNAs. First, the fraction of parentally biased transcripts is higher for lncRNAs than for protein-coding genes. Because there is no de novo transcription from the zygotic genome at this stage, these lncRNAs must be either maternally or paternally provided. The vast majority of protein-coding mRNAs and proteins present in the early embryo are of maternal origin and stored in the oocyte. This might also apply to lncRNAs, but in light of the striking testis enrichment of lincRNAs in humans (Cabili et al. 2011), it is intriguing to speculate that some of the early lncRNAs may belong to the yet poorly characterized small class of sperm-provided RNAs (Lalancette et al. 2008).

Second, lncRNAs are expressed in narrower time windows than are protein-coding genes. Thus, in addition to being highly tissue-specific (Cabili et al. 2011), lncRNA expression is highly temporally restricted. The association of specific sets of lncRNAs with well-defined developmental stages, together with their chromatin state and GSEA predictions, suggests diverse roles in development. For example, lncRNAs present during the early cleavage cycles may function in the still mysterious process that orchestrates the ubiquitous repression of zygotic transcription. This is an intriguing possibility in light of the fact that numerous lncRNAs have been shown to interact with repressive chromatin modifying complexes (Rinn et al. 2007; Khalil et al. 2009; Huarte et al. 2010; Schmitz et al. 2010; Tsai et al. 2010; Zhao et al. 2010; Guttman et al. 2011). In addition, early embryonic lncRNAs might regulate transcription of cell-cycle genes, a function recently suggested for a subset of cell-cycle promoter–associated human lincRNAs (Hung et al. 2011). LncRNAs expressed during blastula and gastrula stages might have important roles in cell fate decisions, differentiation, and cell migration. Indeed, recent large-scale knockdown analyses in mouse ESCs revealed key roles for lincRNAs in cell fate specification and maintenance of pluripotency (Guttman et al. 2011).

LncRNAs expressed during later embryonic and early larval stages are candidates for functioning in specific tissues and cell types during organogenesis. Potential roles during organogenesis are also supported by the tissue-specific expression of several lncRNAs. For example, specific lncRNAs are enriched in muscles and distinct subsets of neurons. Intriguingly, we also found lncRNAs with specific subcellular localization patterns. These patterns range from nuclear accumulation during the early cleavage stages to the enrichment at the boundary between adjacent myotubes. Studies of mRNA localization patterns of protein-coding genes in yeast (e.g., ASH1) (Long et al. 1997) and flies (e.g., bicoid, oskar, gurken) (for review, see Johnstone and Lasko 2001) have shown that the subcellular localization of specific RNAs is essential for normal development. Thus, enrichment of lncRNAs in specific subcellular compartments may be of fundamental importance for the regulatory functions of ncRNAs.

In summary, our study provides the first catalog of lncRNAs in a developing vertebrate. It suggests numerous roles of lncRNAs in vertebrate development and provides a high-quality resource for future genetic, evolutionary, and genomic studies.

Methods

RNA-seq of embryonic time course
Wild-type zebrafish embryos (TLAB) were staged according to standard procedures. About 1000 embryos were collected per stage (two to four cell, 1000 cell, dome, shield, bud, 28 hpf, 48 hpf, and 120 hpf) within a tight time window of ~10 min; it was ensured that all embryos were at the same developmental stage. Total RNA was isolated using the standard TRIzol (Invitrogen) protocol. Genomic DNA was removed by DNase treatment and confirmed by


qPCR assay. Two rounds of PolyA+-RNA purification were performed for each sample, using the PolyA(Purist)-MAG kit (Ambion). The quality of the RNA and lack of contaminating ribosomal RNA were confirmed using the Agilent 2100 Bioanalyzer. Strand-specific libraries for 76-bp paired-end sequencing were prepared according to a modified UTP-method (Parkhomchuk et al. 2009), as detailed by Levin et al. (2010). Libraries were sequenced on the GA-analyzer (shield stage library) and on the Illumina HiSeq 2000 (all stages), at a depth of 200–300 million reads per library (for statistics on read counts, see Supplemental Table 1).

Transcriptome assembly

RNA-seq–derived reads were aligned independently for each developmental stage with TopHat (version 1.2.1) (Trapnell et al. 2009). To aid these alignments, all known transcript annotations (Ensembl, RefSeq, and mRNAs from UCSC danRer7 [Zv9]) were pooled and used as an additional junction set (AJS) for each TopHat run. The junction outputs from individual stage-specific TopHat runs were pooled and added to the AJS (augmented AJS) to allow TopHat to use junction information from all stages. TopHat was rerun on each of the stages using the augmented AJS. The output of this second run comprised the final alignment and junction set for transcript assembly.

Transcriptomes were assembled with two different assemblers: Cufflinks (version 1.0.3) (Trapnell et al. 2010) and Scripture (version R4) (Guttman et al. 2010). The resulting transcripts were pooled, and a transcript was only considered reliable if it had support either from both assemblers or from at least two stages. Transcripts <160 bp were excluded, as these were most likely sequencing or assembly artifacts.

Multi-exonic transcripts

Multi-exonic transcripts were merged with Cuffcompare, and all transcripts classified as repeat were discarded. Scripture’s strategy is to call all possible isoforms, including some that are most likely wrong and have no Cufflinks support. Therefore, all Scripture-only isoforms lacking Cufflinks support were excluded whenever Cufflinks had an assembled transcript for this locus. Furthermore, any di-exonic antisense transcript only supported by Scripture was removed since these transcripts are likely artifacts due to Scripture’s lack of strand-aware library support.

Single exon transcripts
Single exon transcripts were subjected to additional scrutiny: They had to be significantly enriched in read coverage by Scripture (multiple testing corrected P < 0.01) (Guttman et al. 2010) and had to have at least one supporting transfrag from Cufflinks. Cufflinks uses library strand information and can therefore correctly assign the strand for single exon transcripts, while Scripture relies only on splice junctions and therefore cannot determine the strand-orientation of single exons. Transcripts classified by Cuffcompare (Trapnell et al. 2010) as contained (c), exon–intron fragment (e), exonic overlap (o), RNA pol II run-on (p), and repeat (r) were removed. Moreover, any single exon that was within a range of 500 bp in the sense direction relative to a multi-exonic transcript was removed.

Finally, single exon and multi-exonic transcripts were merged with Cuffcompare, discarding all contained and redundant isoforms.

ncRNA classification

Classification of each transcript as either coding or noncoding was determined using a step-wise filtering pipeline. First, all candidates were scored with PhyloCSF (Lin et al. 2011) to determine their coding potential (see PhyloCSF section). All transcripts that scored less than 20 were retained as potential noncoding candidates, and transcripts with PhyloCSF scores greater than 50 were considered proteins. The remaining transcripts (20 < PhyloCSF < 50) were initially classified as an ambiguous “gray” set. Second, the putatively noncoding transcripts and the “gray” set transcripts were repeat-masked and subjected to blastx, blastp, and HMMER (versus Pfam-A and Pfam-B) (Eddy 2009). For blastp and HMMER, the transcripts were translated (stop to stop codon, due to possible incomplete assemblies that could result in incomplete ORFs lacking the ATG start codon) in all three sense frames. Any transcript with an E-value less than 10^-4 in any of the three search algorithms was considered as protein-coding.

Not all candidates were alignable to regions in the other four fish species. PhyloCSF-based coding potential predictions are less reliable for transcripts with no alignments over their entire region (see Supplemental Fig. 4B). Therefore, a maximal ORF cutoff was imposed. For candidates without alignments (CBL = 0, see Comparative Genomics Analysis of Conservation and Coding Potential) this cutoff was set to 30 aa; for all remaining transcripts (CBL > 0) it was set to 100 aa.

Finally, transcripts were removed that have exonic sense overlap with either Ensembl or RefSeq protein-coding genes or with the protein-coding gene set of the embryonic transcriptome.

Antisense rescue pipeline
PhyloCSF scores of antisense transcripts can be confounded by high-scoring protein-coding genes on the opposite strand, necessitating an alternative strategy for this set. The antisense-rescue pipeline was similar to the general pipeline (see above), with the exception that the PhyloCSF threshold was set to 300. In addition, the highest scoring region (HSR) for the putative antisense lncRNA had to have overlap with a protein-coding transcript on the opposite strand, and the PhyloCSF score for the protein-coding gene had to be higher than the PhyloCSF score for the lncRNA (see Supplemental Fig. 4D). Finally, after manual inspection, a threshold of maximal 81% was set for the sense/antisense exonic overlap with the protein-coding gene to remove a small number of likely artifactual transcripts that had substantial overlap and likely stemmed from errors in strand-calling during assembly.

Classification of lncRNAs
The resulting set of lncRNAs was subdivided into three categories: (1) lncRNAs without any overlap with other loci classify as lincRNAs (intergenic lncRNAs); (2) lncRNAs with intronic overlap are expressed from loci that have overlap (exon–intron or intron–intron but not exon–exon) with another transcribed locus (i.e., there is no overlap of the mature lncRNA with exons of the overlapping locus). They can be in either sense or antisense orientation with respect to the overlapping gene and can be further partitioned into intronic contained lncRNAs (incs; the lncRNA is contained within the transcribed region of another locus), completely overlapping lncRNAs (concs; the other locus is contained within the transcribed region of the lncRNA locus), and partially overlapping lncRNAs (poncs; neither incs nor concs, but with at least one exon of the lncRNA contained within an intron of another locus). (3) Exonic antisense overlapping lncRNAs have exonic overlap with an exon of a protein-coding transcript on the opposite strand.

Comparative genomics analysis of conservation and coding potential
The RNA-seq transcripts were analyzed in MULTIZ whole-genome alignments of zebrafish with four other fish species (Tetraodon, Fugu,


Stickleback, Medaka), generated by UCSC (http://genome.ucsc.edu/cgi-bin/hgTrackUi?db=danRer7&g=cons8way).

Analysis of conservation by branch length analysis
Due to the large phylogenetic distances separating these species, sequences that can be aligned with BLASTZ/MULTIZ can generally be assumed to have evolved under negative selection to some extent. Therefore, the branch length (BL) score, a measure of alignment coverage that accounts for phylogenetic distances separating the species, was used as a simple conservation score for each transcript. The BL score is based on a phylogenetic tree with neutral BLs relating the species under analysis. For a single alignment column, the BL score is the ratio of the total BL of the tree relating only the species that aligned (not gapped) in that position, to the total BL of the tree relating all five species. The CBL score for a transcript is the average of the column-wise BL scores across the transcript.

Analysis of coding potential by PhyloCSF

PhyloCSF (Lin et al. 2011) was used to assess coding potential in the transcripts based on evolutionary signatures in the five-fish genome alignment. The alignment of each transcript was extracted from the genome alignments (“stitching” the alignments of individual exons as needed), and PhyloCSF was applied using the settings “--strategy=omega -f3 --orf=StopStop3 --minCodons=25”. This command causes the program to enumerate complete and partial regions between stop codons, in three frames, and report the best scoring. Due to the limited completeness and reliability of existing zebrafish gene annotations, PhyloCSF was run in the simplified “--strategy=omega” mode that estimates evidence for a reduced dN/dS ratio, rather than performing a full empirical codon model comparison (which requires extensive training data).

A transcript was classified as potentially protein-coding if PhyloCSF reported a score of 20 or above, corresponding to a likelihood ratio of (10^(20/10)):1 in favor of reduced dN/dS. Furthermore, each transcript was scored on both the “sense” and “antisense” strands.

sRNA analysis

sRNAs expressed in 2-d-old wild-type zebrafish larvae (two biological replicates) were obtained from Cifuentes et al. (2010) and mapped to Zv9 using Bowtie (Langmead et al. 2009). The number of sRNAs overlapping lncRNA loci was counted. Transcripts with at least five uniquely mapped overlapped sRNAs were annotated according to known sRNA classes (miRNA precursor, snoRNA precursor, MALAT1-like transcripts, transcripts of unknown sRNA types).

Expression analysis
The expression level of each locus was assessed using Cuffdiff (Trapnell et al. 2010) in its time-series mode with upper quantile normalization. To visualize developmental expression profiles via heatmaps, expression levels were normalized to get relative expression levels over the developmental time-course (sum to an expression of one over all stages). LncRNAs and protein-coding loci were clustered separately using k-means (10 clusters) with a distance matrix constructed from the Pearson correlation.

The temporal specificity score over N time-points (N = 8 embryonic stages) was defined as 1 − H(g)/log2(N), where H(g) is the

was then calculated for samples of proteins equal in number to the lncRNAs and from the same expression bins (10,000 repetitions to estimate the dispersion and to calculate the P-value).

In situ expression analysis

To analyze the expression pattern and localization of RNAs, 300- to 800-nt-long partial or full sequences of mRNAs were amplified from cDNA by PCR and cloned into the pSC-vector (Strataclone) according to standard procedures (primer sequences are available upon request). Thirty-two lncRNAs were cloned and tested by in situ hybridization experiments for their expression patterns from shortly after fertilization to 2-d-old larvae. Digoxigenin (DIG)-labeled antisense RNA probes were generated by in vitro transcription with T3 or T7 RNA polymerases, using plasmid-encoded T3 or T7 polymerase binding sites. In situ hybridization of zebrafish embryos of different embryonic stages was performed according to standard procedures (Thisse and Thisse 2008), using immunohistochemical detection of the DIG-labeled RNA–RNA hybrids by an anti-DIG Alkaline-Phosphatase coupled antibody, followed by nonfluorescent detection with BCIP/NBT. DNA was visualized by incubation of stained embryos with the DNA-dye OliGreen (Invitrogen, used at 1/400 in PBST). Images were processed with Photoshop and ImageJ.

Chromatin mark analysis: ChIP-seq for H3K4me3 and H3K27me3 at shield stage
ChIP was performed as previously described (Vastenhouw et al. 2010). Antibodies used were H3K4me3 (Millipore no. 07-473) and H3K27me3 (Millipore no. 07-449). For analysis on the Illumina HiSeq platform, sequencing libraries were prepared according to Illumina protocols.

Peak calling for chromatin marks was done using Scripture’s ChIP-seq module (Guttman et al. 2010). This module scans fixed-size windows across the genome and computes read coverage and a multiple hypothesis corrected P-value for the observed coverage. For both H3K4me3 and H3K27me3, 500- and 1000-bp windows were scanned to account for both short regions with high read coverage and for larger regions with lower read coverage. All windows that were covered at a significant level (P < 0.01) were merged into “peaks.” The ends of the peaks were finally trimmed until coverage at the ends is at least the average peak coverage. To account for systematic biases (e.g., due to open chromatin), peaks were filtered using input genomic DNA sequence by requiring that every peak called contained a 500-bp window with a library score at least threefold higher than the input genomic sequence score.

Peaks were intersected with promoter regions (±500 bp relative to the transcriptional start site [TSS]) of our transcripts. To obtain protein-coding loci of similar expression levels as lncRNA loci at shield stage, the same strategy was used as for expression analysis (see above), except that ranking of protein-coding loci was based exclusively on their expression levels at shield stage.

Gene set enrichment analysis
The expression level of each lncRNA locus was correlated with all protein-coding loci, similar to Guttman et al. (2009). For each lncRNA locus, a list of correlation-based ranked protein-coding loci was constructed and subjected to GSEA (Mootha et al. 2003; Subramanian et al. 2005). An association matrix between lncRNA
Shannon entropy expressed in bits of the expression vector of gene g. loci and GO terms was constructed, using a false-discovery rate
To compare lncRNA loci to protein-coding loci of similar expression threshold of 0.01. Rows (lncRNA loci) and columns (GO terms)
levels, the expression levels of each locus were summed over all were clustered (k-means, 10 clusters), resulting in distinct subsets
time-points and sorted into 100 quantiles. The Shannon entropy of lncRNAs associated with functional GO terms. To determine the

Genome Research 589


www.genome.org
Pauli et al.

enrichment level of positively associated GO terms for each cluster deciphers novelties in transcriptome dynamics during maternal to
with respect to other clusters, positively correlated GO terms were zygotic transition. Genome Res 21: 1328–1338.
Bartel DP. 2009. MicroRNAs: target recognition and regulatory functions.
ranked according to a binominal test. Cell 136: 215–233.
Bassett DI. 2003. Dystrophin is required for the formation of stable muscle
attachments in the zebrafish embryo. Development 130: 5851–5860.
Nearest neighbor analysis Beissbarth T, Speed TP. 2004. GOstat: find statistically overrepresented Gene
Ontologies within a group of genes. Bioinformatics 20: 1464–1465.
For each lincRNA locus the nearest protein-coding neighbor within Bernstein BE, Mikkelsen TS, Xie X, Kamal M, Huebert DJ, Cuff J, Fry B,
<10 kb was identified. For antisense overlapping and intronic Meissner A, Wernig M, Plath K, et al. 2006. A bivalent chromatin
overlapping lncRNAs, overlapping gene(s) were identified. This structure marks key developmental genes in embryonic stem cells. Cell
resulted in a list of lncRNA loci/protein-coding loci pairs. Similar to 125: 315–326.
Bertone P. 2004. Global identification of human transcribed sequences with
the method described by Cabili et al. (2011), Pearson correlation genome tiling arrays. Science 306: 2242–2246.
was used to explore the expression-based relationship between Cabili MN, Trapnell C, Goff L, Koziol M, Tazon-Vega B, Regev A, Rinn JL.
these pairs. The results were subdivided based on (1) the class of the 2011. Integrative annotation of human large intergenic noncoding
lncRNA and (2) the orientation of the lncRNA locus relative to the RNAs reveals global properties and specific subclasses. Genes Dev 25:
1915–1927.
neighbor/overlapping protein-coding locus. The list of pairs (lncRNA Carninci P, Kasukawa T, Katayama S, Gough J, Frith MC, Maeda N, Oyama R,
loci/protein-coding loci) also formed the basis for GO term enrich- Ravasi T, Lenhard B, Wells C, et al. 2005. The transcriptional landscape
ment analysis using GOstat (Beissbarth and Speed 2004) and phy- of the mammalian genome. Science 309: 1559–1563.
lostratographic analysis of nearest neighbors (see below). Cifuentes D, Xue H, Taylor DW, Patnode H, Mishima Y, Cheloufi S, Ma E,
Mane S, Hannon GJ, Lawson ND, et al. 2010. A novel miRNA processing
pathway independent of dicer requires Argonaute2 catalytic activity.
Science 328: 1694–1698.
Phylostratographic analysis of nearest neighbors De Santa F, Barozzi I, Mietton F, Ghisletti S, Polletti S, Tusi BK, Muller H,
Phylostratographic classes for zebrafish genes were obtained from Ragoussis J, Wei C-L, Natoli G. 2010. A large fraction of extragenic RNA
Domazet-Lošo and Tautz (2010). LncRNA neighboring/overlapping pol II transcription sites overlap enhancers. PLoS Biol 8: e1000384. doi:
10.1371/journal.pbio.1000384.
protein-coding loci (see above) were tested for enrichment in cer- Dinger ME, Amaral PP, Mercer TR, Pang KC, Bruce SJ, Gardiner BB, Askarian-
tain phylostratographic classes by a sampling procedure that used Amiri ME, Ru K, Solda G, Simons C, et al. 2008. Long noncoding RNAs in
the protein population as a null model. mouse embryonic stem cell pluripotency and differentiation. Genome
Res 18: 1433–1445.
Domazet-Lošo T, Tautz D. 2010. A phylogenetically based transcriptome age
Data access index mirrors ontogenetic divergence patterns. Nature 468: 815–818.
Eddy SR. 2009. A new generation of homology search tools based on
The RNA-seq and ChIP-seq data have been submitted to the NCBI probabilistic inference. Genome Inform 23: 205–211.
Gene Expression Omnibus (GEO) under accession no. GSE32900, ENCODE Project Consortium, Birney E, Stamatoyannopoulos JA, Dutta A,
Guigó R, Gingeras TR, Margulies EH, Weng Z, Snyder M, Dermitzakis
containing the Subseries GSE32898 (RNA-seq) and GSE32899 ET, et al. 2007. Identification and analysis of functional elements in
(ChIP-seq). All data will also be accessible for downloading and con- 1% of the human genome by the ENCODE pilot project. Nature 447:
venient viewing on our website Z-Seq (http://www.broadinstitute.org/ 799–816.
software/z-seq/). Fejes-Toth K, Sotirova V, Sachidanandam R, Assaf G, Hannon GJ, Kapranov
P, Foissac S, Willingham AT, Duttagupta R, Dumais E, et al. 2009. Post-
transcriptional processing generates a diversity of 59-modified long and
short RNAs. Nature 457: 1028–1032.
Acknowledgments Flicek P, Amode MR, Barrell D, Beal K, Brent S, Chen Y, Clapham P, Coates G,
We thank Cole Trapnell for helpful advice and continuous support Fairley S, Fitzgerald S, et al. 2011. Ensembl 2011. Nucleic Acids Res 39:
D800–D806.
with Cufflinks; Geo Pertea for help with Cuffcompare; UCSC and Guttman M, Amit I, Garber M, French C, Lin MF, Feldser D, Huarte M, Zuk O,
Ann Zweig for providing the teleost sequence alignments; the Carey BW, Cassady JP, et al. 2009. Chromatin signature reveals over
Broad Sequencing Platform for all sequencing work; Sara Chauvin a thousand highly conserved large non-coding RNAs in mammals.
for Project Management at the Broad Institute; Dongkeun Jang, Nature 458: 223–227.
Guttman M, Garber M, Levin JZ, Donaghey J, Robinson J, Adiconis X, Fan L,
Michael Reich, and Jill Mesirov (Broad Institute) for help in cre- Koziol MJ, Gnirke A, Nusbaum C, et al. 2010. Ab initio reconstruction of
ating our website Z-Seq; Xian Adiconis for assistance in library cell type-specific transcriptomes in mouse reveals the conserved multi-
preparation; Moran Cabili for sharing the human lincRNA data set exonic structure of lincRNAs. Nat Biotechnol 28: 503–510.
prior to publication and for helpful discussions and comments on Guttman M, Donaghey J, Carey BW, Garber M, Grenier JK, Munson G,
Young G, Lucas AB, Ach R, Bruhn L, et al. 2011. lincRNAs act in the
the manuscript; and Guo-Liang Chew and James Gagnon for circuitry controlling pluripotency and differentiation. Nature 477: 295–
discussions and helpful comments on the manuscript. A.P. was 300.
supported by an EMBO Long-term postdoctoral fellowship and Huarte M, Guttman M, Feldser D, Garber M, Koziol MJ, Kenzelmann-Broz D,
a Human Frontier Science Program (HFSP) postdoctoral fellow- Khalil AM, Zuk O, Amit I, Rabani M, et al. 2010. A large intergenic
noncoding RNA induced by p53 mediates global gene repression in the
ship. E.V. was supported by a grant from FNU, Denmark. N.L.V. p53 response. Cell 142: 409–419.
was supported by the Human Frontier Science Program (HFSP), Hung T, Wang Y, Lin MF, Koegel AK, Kotake Y, Grant GD, Horlings HM, Shah
Charles A. King Trust postdoctoral fellowships, and an NIH grant N, Umbricht C, Wang P, et al. 2011. Extensive and coordinated
(1K99HD067220-01). A.S. was supported by the EU Seventh transcription of noncoding RNAs within cell-cycle promoters. Nat Genet
43: 621–629.
Framework Programme (FP7/2007-2013)/ERC grant agreement Johnstone O, Lasko P. 2001. Translational regulation and RNA localization
204135. J.L.R. is a Runyon Rachleff Innovation and Searle Scholar in Drosophila oocytes and embryos. Annu Rev Genet 35: 365–406.
supported by the NIH (1R01ES02026). A.F.S. is supported by the Kapranov P, Cawley SE, Drenkow J, Bekiranov S, Strausberg RL, Fodor SPA,
NIH (5RO1 GM056211). This work was supported by the NHGRI Gingeras TR. 2002. Large-scale transcriptional activity in chromosomes
21 and 22. Science 296: 916–919.
grant 1RO1HG005111-01. Kapranov P, Cheng J, Dike S, Nix DA, Duttagupta R, Willingham AT, Stadler
PF, Hertel J, Hackermüller J, Hofacker IL, et al. 2007. RNA maps reveal
new RNA classes and a possible function for pervasive transcription.
References Science 316: 1484–1488.
Khalil AM, Guttman M, Huarte M, Garber M, Raj A, Rivea Morales D,
Aanes H, Winata CL, Lin CH, Chen JP, Srinivasan KG, Lee SGP, Lim AYM, Thomas K, Presser A, Bernstein BE, van Oudenaarden A, et al. 2009.
Hajan HS, Collas P, Bourque G, et al. 2011. Zebrafish mRNA sequencing Many human large intergenic noncoding RNAs associate with

590 Genome Research


www.genome.org
LncRNA expression during zebrafish embryogenesis

chromatin-modifying complexes and affect gene expression. Proc Natl Rinn JL, Euskirchen G, Bertone P, Martone R, Luscombe NM, Hartman S,
Acad Sci 106: 11667–11672. Harrison PM, Nelson FK, Miller P, Gerstein M, et al. 2003. The
Kim T-K, Hemberg M, Gray JM, Costa AM, Bear DM, Wu J, Harmin DA, transcriptional activity of human chromosome 22. Genes Dev 17:
Laptewicz M, Barbara-Haley K, Kuersten S, et al. 2010. Widespread 529–540.
transcription at neuronal activity-regulated enhancers. Nature 465: Rinn JL, Kertesz M, Wang JK, Squazzo SL, Xu X, Brugmann SA, Goodnough
182–187. LH, Helms JA, Farnham PJ, Segal E, et al. 2007. Functional demarcation
Kimmel CB, Ballard WW, Kimmel SR, Ullmann B, Schilling TF. 1995. Stages of active and silent chromatin domains in human HOX loci by
of embryonic development of the zebrafish. Dev Dyn 203: 253–310. noncoding RNAs. Cell 129: 1311–1323.
Koenig M, Monaco AP, Kunkel LM. 1988. The complete sequence of Schmitz K-M, Mayer C, Postepska A, Grummt I. 2010. Interaction of
dystrophin predicts a rod-shaped cytoskeletal protein. Cell 53: 219–228. noncoding RNA with the rDNA promoter mediates recruitment of
Koziol MJ, Rinn JL. 2010. RNA traffic control of chromatin complexes. Curr DNMT3b and silencing of rRNA genes. Genes Dev 24: 2264–2269.
Opin Genet Dev 20: 142–148. Sleutels F, Zwart R, Barlow DP. 2002. The non-coding Air RNA is required for
Lalancette C, Miller D, Li Y, Krawetz SA. 2008. Paternal contributions: new silencing autosomal imprinted genes. Nature 415: 810–813.
functional insights for spermatozoal RNA. J Cell Biochem 104: 1570– Subramanian A, Tamayo P, Mootha VK, Mukherjee S, Ebert BL, Gillette
1579. MA, Paulovich A, Pomeroy SL, Golub TR, Lander ES, et al. 2005. Gene
Langmead B, Trapnell C, Pop M, Salzberg SL. 2009. Ultrafast and memory- set enrichment analysis: a knowledge-based approach for interpreting
efficient alignment of short DNA sequences to the human genome. genome-wide expression profiles. Proc Natl Acad Sci 102: 15545–
Genome Biol 10: R25. doi: 10.1186/gb-2009-10-3-r25. 15550.
Levin JZ, Yassour M, Adiconis X, Nusbaum C, Thompson DA, Friedman N, Thisse C, Thisse B. 2008. High-resolution in situ hybridization to whole-
Gnirke A, Regev A. 2010. Comprehensive comparative analysis of mount zebrafish embryos. Nat Protoc 3: 59–69.
strand-specific RNA sequencing methods. Nat Methods 7: 709–715. Tian D, Sun S, Lee JT. 2010. The long noncoding RNA, Jpx, is a molecular
Lin MF, Jungreis I, Kellis M. 2011. PhyloCSF: a comparative genomics switch for X chromosome inactivation. Cell 143: 390–403.
method to distinguish protein coding and non-coding regions. Trapnell C, Pachter L, Salzberg SL. 2009. TopHat: discovering splice
Bioinformatics 27: i275–i282. junctions with RNA-Seq. Bioinformatics 25: 1105–1111.
Long RM, Singer RH, Meng X, Gonzalez I, Nasmyth K, Jansen RP. 1997. Trapnell C, Williams BA, Pertea G, Mortazavi A, Kwan G, van Baren MJ,
Mating type switching in yeast controlled by asymmetric localization of Salzberg SL, Wold BJ, Pachter L. 2010. Transcript assembly and
ASH1 mRNA. Science 277: 383–387. quantification by RNA-Seq reveals unannotated transcripts and
Mancini-Dinardo D, Steele SJS, Levorse JM, Ingram RS, Tilghman SM. 2006. isoform switching during cell differentiation. Nat Biotechnol 28:
Elongation of the Kcnq1ot1 transcript is required for genomic 511–515.
imprinting of neighboring genes. Genes Dev 20: 1268–1282. Tripathi V, Ellis JD, Shen Z, Song DY, Pan Q, Watt AT, Freier SM, Bennett CF,
Mercer TR, Dinger ME, Sunkin SM, Mehler MF, Mattick JS. 2008. Specific Sharma A, Bubulya PA, et al. 2010. The nuclear-retained noncoding RNA
expression of long noncoding RNAs in the mouse brain. Proc Natl Acad MALAT1 regulates alternative splicing by modulating SR splicing factor
Sci 105: 716–721. phosphorylation. Mol Cell 39: 925–938.
Mootha VK, Lindgren CM, Eriksson K-F, Subramanian A, Sihag S, Lehar J, Tsai M-C, Manor O, Wan Y, Mosammaparast N, Wang JK, Lan F, Shi Y, Segal
Puigserver P, Carlsson E, Ridderstråle M, Laurila E, et al. 2003. PGC-1a- E, Chang HY. 2010. Long noncoding RNA as modular scaffold of histone
responsive genes involved in oxidative phosphorylation are modification complexes. Science 329: 689–693.
coordinately downregulated in human diabetes. Nat Genet 34: 267–273. Vastenhouw NL, Zhang Y, Woods IG, Imam F, Regev A, Liu XS, Rinn J, Schier
Nagano T, Mitchell JA, Sanz LA, Pauler FM, Ferguson-Smith AC, Feil R, Fraser AF. 2010. Chromatin signature of embryonic pluripotency is established
P. 2008. The Air noncoding RNA epigenetically silences transcription by during genome activation. Nature 464: 922–926.
targeting G9a to chromatin. Science 322: 1717–1720. Vesterlund L, Jiao H, Unneberg P, Hovatta O, Kere J. 2011. The zebrafish
Okazaki Y, Furuno M, Kasukawa T, Adachi J, Bono H, Kondo S, Nikaido I, transcriptome during early development. BMC Dev Biol 11: 30. doi:
Osato N, Saito R, Suzuki H, et al. 2002. Analysis of the mouse 10.1186/1471-213X-11-30.
transcriptome based on functional annotation of 60,770 full-length Wang KC, Chang HY. 2011. Molecular mechanisms of long noncoding
cDNAs. Nature 420: 563–573. RNAs. Mol Cell 43: 904–914.
Ørom UA, Derrien T, Beringer M, Gumireddy K, Gardini A, Bussotti G, Lai F, Wang KC, Yang YW, Liu B, Sanyal A, Corces-Zimmerman R, Chen Y, Lajoie
Zytnicki M, Notredame C, Huang Q, et al. 2010. Long noncoding RNAs BR, Protacio A, Flynn RA, Gupta RA, et al. 2011. A long noncoding RNA
with enhancer-like function in human cells. Cell 143: 46–58. maintains active chromatin to coordinate homeotic gene expression.
Pandey RR, Mondal T, Mohammad F, Enroth S, Redrup L, Komorowski J, Nature 472: 120–124.
Nagano T, Mancini-Dinardo D, Kanduri C. 2008. Kcnq1ot1 antisense Weinmann L, Höck J, Ivacevic T, Ohrt T, Mütze J, Schwille P, Kremmer E,
noncoding RNA mediates lineage-specific transcriptional silencing Benes V, Urlaub H, Meister G. 2009. Importin 8 is a gene silencing factor
through chromatin-level regulation. Mol Cell 32: 232–246. that targets argonaute proteins to distinct mRNAs. Cell 136: 496–507.
Parkhomchuk D, Borodina T, Amstislavskiy V, Banaru M, Hallen L, Wilusz JE, Freier SM, Spector DL. 2008. 39 end processing of a long nuclear-
Krobitsch S, Lehrach H, Soldatov A. 2009. Transcriptome analysis by retained noncoding RNA yields a tRNA-like cytoplasmic RNA. Cell 135:
strand-specific sequencing of complementary DNA. Nucleic Acids Res 37: 919–932.
e123. doi: 10.1093/nar/gkp596. Zhao J, Sun BK, Erwin JA, Song J-J, Lee JT. 2008. Polycomb proteins targeted
Pauli A, Rinn JL, Schier AF. 2011. Non-coding RNAs as regulators of by a short repeat RNA to the mouse X chromosome. Science 322: 750–
embryogenesis. Nat Rev Genet 12: 136–149. 756.
Poliseno L, Salmena L, Zhang J, Carver B, Haveman WJ, Pandolfi PP. 2010. A Zhao J, Ohsumi TK, Kung JT, Ogawa Y, Grau DJ, Sarma K, Song J-J, Kingston
coding-independent function of gene and pseudogene mRNAs regulates RE, Borowsky M, Lee JT. 2010. Genome-wide identification of
tumour biology. Nature 465: 1033–1038. polycomb-associated RNAs by RIP-seq. Mol Cell 40: 939–953.
Ponjavic J, Ponting CP, Lunter G. 2007. Functionality or transcriptional Zhou VW, Goren A, Bernstein BE. 2011. Charting histone modifications and
noise? Evidence for selection within long noncoding RNAs. Genome Res the functional organization of mammalian genomes. Nat Rev Genet 12:
17: 556–565. 7–18.
Ponjavic J, Oliver PL, Lunter G, Ponting CP. 2009. Genomic and
transcriptional co-localization of protein-coding and long non-coding
RNA pairs in the developing brain. PLoS Genet 5: e1000617. doi:
10.1371/journal.pgen.1000617. Received October 12, 2011; accepted in revised form November 21, 2011.
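As an illustration of the temporal specificity score described under Expression analysis, read here as 1 − H(g)/log2(N), the following is a minimal Python sketch rather than the authors' implementation. The function name `temporal_specificity` and the example vectors are hypothetical, and rejecting an all-zero expression vector is an assumption that the text does not address.

```python
import math


def temporal_specificity(expression):
    """Return 1 - H(g)/log2(N) for a non-negative expression vector.

    H(g) is the Shannon entropy (in bits) of the vector after it has been
    normalized to sum to one over the N time-points.
    """
    n = len(expression)
    total = sum(expression)
    # Assumption (not from the text): loci with no expression are not scored.
    if n < 2 or total <= 0:
        raise ValueError("need at least two stages and non-zero expression")
    # Relative expression levels that sum to one over all stages.
    relative = [x / total for x in expression]
    # Zero entries contribute nothing to the entropy.
    entropy = -sum(p * math.log2(p) for p in relative if p > 0)
    return 1.0 - entropy / math.log2(n)


# Hypothetical expression vectors over N = 8 stages.
uniform = temporal_specificity([1.0] * 8)
single_stage = temporal_specificity([0, 0, 5.0, 0, 0, 0, 0, 0])
two_stages = temporal_specificity([0, 0, 2.0, 2.0, 0, 0, 0, 0])
```

Under this definition a locus expressed evenly across all stages scores 0 and a locus expressed at a single stage scores 1, with intermediate profiles falling in between.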
