Single-Cell Analysis of Chromatin Accessibility in The Adult Mouse Brain


Article

Single-cell analysis of chromatin accessibility in the adult mouse brain

Songpeng Zu1,10, Yang Eric Li1,2,10, Kangli Wang1,10, Ethan J. Armand1, Sainath Mamde1, Maria Luisa Amaral1, Yuelai Wang1, Andre Chu1, Yang Xie1, Michael Miller3, Jie Xu1, Zhaoning Wang1, Kai Zhang1, Bojing Jia1, Xiaomeng Hou3, Lin Lin3, Qian Yang3, Seoyeon Lee1, Bin Li1, Samantha Kuan1, Hanqing Liu4, Jingtian Zhou4, Antonio Pinto-Duarte5, Jacinta Lucero5, Julia Osteen5, Michael Nunn6, Kimberly A. Smith7, Bosiljka Tasic7, Zizhen Yao7, Hongkui Zeng7, Zihan Wang8, Jingbo Shang8, M. Margarita Behrens5, Joseph R. Ecker6, Allen Wang3, Sebastian Preissl3,9 & Bing Ren1,3 ✉

https://doi.org/10.1038/s41586-023-06824-9
Received: 31 March 2023
Accepted: 1 November 2023
Published online: 13 December 2023
Open access


Recent advances in single-cell technologies have led to the discovery of thousands of
brain cell types; however, our understanding of the gene regulatory programs in these
cell types is far from complete1–4. Here we report a comprehensive atlas of candidate
cis-regulatory DNA elements (cCREs) in the adult mouse brain, generated by analysing
chromatin accessibility in 2.3 million individual brain cells from 117 anatomical
dissections. The atlas includes approximately 1 million cCREs and their chromatin
accessibility across 1,482 distinct brain cell populations, adding over 446,000 cCREs
to the most recent such annotation in the mouse genome. The mouse brain cCREs are
moderately conserved in the human brain. The mouse-specific cCREs—specifically,
those identified from a subset of cortical excitatory neurons—are strongly enriched
for transposable elements, suggesting a potential role for transposable elements in
the emergence of new regulatory programs and neuronal diversity. Finally, we infer
the gene regulatory networks in over 260 subclasses of mouse brain cells and develop
deep-learning models to predict the activities of gene regulatory elements in different
brain cell types from the DNA sequence alone. Our results provide a resource for the
analysis of cell-type-specific gene regulation programs in both mouse and human
brains.

The Brain Initiative Cell Census Network aims to achieve a comprehensive understanding of the cellular and molecular composition of the mammalian brain1. As an experimental model, the laboratory mouse has a critical role in the investigation of gene function in vivo as well as in the development and safety evaluation of various therapeutics. A detailed catalogue of cell types in the mouse brain along with their spatial distribution and functional connections would therefore greatly facilitate the study of the complex neurocircuits and gene pathways as well as help in the development of treatments for neurological disorders. Single-cell transcriptomics studies2–7 have identified hundreds of subclasses and thousands of cell types across the brain. This considerable cellular and spatial complexity underscores the need for a better understanding of the cis-regulatory elements (CREs) that are responsible for the identity and gene expression patterns in each cell type.

CREs control spatiotemporal gene expression through the binding of sequence-specific transcription factors (TFs) and the recruitment of chromatin remodeller proteins and/or transcription machinery to their target genes8–10. These elements, including promoters, enhancers, insulators, silencers and other less-well-characterized regulatory sequences work together to drive cell-type-specific gene expression in development11,12, differentiation and disease13,14. Comprehensive mapping of CREs in mouse brain cells will provide mechanistic insights into gene regulation and function in different brain cell types and advance our understanding of brain development and neurological disorders.

Previous catalogues of cCREs in mouse brain cells were derived through epigenomic profiling of a limited number of brain regions and are therefore incomplete2,15–22. To more comprehensively delineate the cCREs in the mouse brain cells, we used the single-nucleus assay for transposase-accessible chromatin followed by sequencing (snATAC–seq) to profile chromatin accessibility at the single-cell resolution across the entire adult mouse brain. In a previous study19 that focused on the mouse cerebrum, we reported the delineation of 160 cell types comprising approximately 800,000 brain cells across 45 anatomic dissections, and the annotation of 491,818 cCREs that are probably deployed in one or more of these cell types. Here we report the analysis of an additional 1.5 million brain cells from the rest of mouse

1Department of Cellular and Molecular Medicine, University of California San Diego, School of Medicine, La Jolla, CA, USA.
2Department of Neurosurgery and Genetics, Washington University School of Medicine, St Louis, MO, USA.
3Center for Epigenomics, University of California San Diego, School of Medicine, La Jolla, CA, USA.
4Genomic Analysis Laboratory, The Salk Institute for Biological Studies, La Jolla, CA, USA.
5The Salk Institute for Biological Studies, La Jolla, CA, USA.
6Howard Hughes Medical Institute, The Salk Institute for Biological Studies, La Jolla, CA, USA.
7Allen Institute for Brain Science, Seattle, WA, USA.
8Department of Computer Science and Engineering, University of California San Diego, La Jolla, CA, USA.
9Institute of Experimental and Clinical Pharmacology and Toxicology, Faculty of Medicine, University of Freiburg, Freiburg, Germany.
10These authors contributed equally: Songpeng Zu, Yang Eric Li, Kangli Wang.
✉e-mail: [email protected]

378 | Nature | Vol 624 | 14 December 2023


brain regions, including 72 new anatomical dissections. Through integrative analysis of a total of 2.3 million mouse brain cells, we provide a comprehensive map of cCREs representing 1,482 brain cell types. Our results not only provide independent evidence to support the complexity and diversity of cell types across brain regions, but also double the annotated mouse brain cCREs to 1 million.

A large fraction of the mouse brain cCREs has sequence homology in the human genome, and displays chromatin accessibility in the human brain cells23, suggesting conserved gene regulatory functions. Consistent with previous reports10,24,25, mouse-specific brain cCREs, especially those found in the subclasses of excitatory neurons, are strongly enriched for transposable elements (TEs) including LINE-1 and endogenous retrotransposons, highlighting a potential role of TEs in the evolution of neuronal functions in the mammalian brain. We also predict gene regulatory networks (GRNs) in over 260 subclasses of brain cell types and develop deep-learning-based models to predict cell-type-specific use of cCREs from DNA sequence information.

Single-cell analysis of the mouse brain
We dissected 117 brain regions from the isocortex, olfactory bulb (OLF), hippocampal formation (HPF), striatum (STR), pallidum (PAL), amygdala (AMY), thalamus (TH), hypothalamus (HY), midbrain (MB), pons (P), medulla (MY) and cerebellum (CB) in 8-week-old male mice (Fig. 1a, Extended Data Fig. 1 and Supplementary Table 1), including 45 dissections from the isocortex, OLF, HPF, STR and PAL reported previously19. The dissections were performed on 600-µm-thick coronal brain slices according to the Allen Brain Reference Atlas26 (Extended Data Fig. 1) with two replicates obtained from pools of the same region dissected from at least two brains (Fig. 1a and Methods). We performed snATAC–seq for all of the 234 samples using an automated single-cell combinatorial indexing ATAC–seq27 protocol. The sequencing reads corresponding to each nucleus were then deconvoluted on the basis of nucleus-specific DNA barcode combinations (Extended Data Fig. 2a–e). High correlations between biological replicates (median, 0.99; range, 0.96–1.0) and between datasets from similar brain regions (ranges: 0.97–0.99 (AMY); 0.94–0.98 (CB); 0.89–0.99 (HPF); 0.97–0.99 (HY); 0.93–0.99 (isocortex); 0.94–0.99 (MB); 0.98–0.99 (MY); 0.89–0.99 (OLF); 0.95–0.99 (PAL); 0.94–0.99 (P); 0.83–0.98 (STR); and 0.92–0.99 (TH)) support the high reliability and robustness of the assays (Extended Data Fig. 2f). We confirmed the high quality of all of the datasets (n = 234: 117 dissections with 2 replicates) using a set of quality-control metrics (Methods and Extended Data Fig. 2a–f). For the subsequent analyses, we focused on the nuclei with at least 1,000 sequenced fragments and a transcriptional start site (TSS) enrichment above 10 (Extended Data Fig. 3a). We next removed potential doublets in each dataset based on a modified Scrublet28 procedure using SnapATAC2 (ref. 29). As Scrublet was originally designed for single-cell RNA-sequencing (scRNA-seq) doublet removal, we compared it with another method, AMULET30, which was recently published for doublet detection and removal in snATAC–seq data. We found that it achieved similar results for our data based on a simulation study, in which the doublets were simulated from several samples from our data (Extended Data Fig. 3b). After removing 7% of nuclei that were deemed to be potential doublets (Extended Data Fig. 3c,d), we retained the chromatin accessibility profiles from 2,355,842 nuclei, with a median of 4,368 DNA fragments per nucleus (Supplementary Table 2). Among them, 817,655 were from the isocortex (including 370,841 from the previous study), 201,113 from the OLF (including 137,209 from the previous study), 155,952 from the STR (including 114,743 from the previous study), 81,834 from the PAL (including 38,960 from the previous study), 271,933 from the HPF (including 164,568 from the previous study), 65,958 from the AMY, 142,890 from the TH, 83,321 from the HY, 243,137 from the MB, 82,488 from the MY, 103,147 from the pons and 106,414 from the CB (Fig. 1a,b and Extended Data Fig. 3e,f). This dataset represents a considerable number of single-cell chromatin accessibility profiles for the mammalian brain.

Clustering and cell type annotation
We performed iterative clustering using SnapATAC2 (ref. 29) to classify the 2.3 million nuclei into distinct cell groups on the basis of their pairwise similarity of chromatin accessibility profiles (Methods, Extended Data Figs. 4 and 5 and Supplementary Table 3). Before clustering, we first visualized the data using uniform manifold approximation and projection (UMAP; Fig. 1c) with a 5 kb resolution for genomic bin features in SnapATAC31 for a global view. In the UMAP, we marked the nuclei into three major divisions, including 998,000 nuclei predominantly comprising glutamatergic (Glut) neurons (based on the neurotransmitter genes Slc17a7, Slc17a6, Slc17a8); 384,000 nuclei predominantly comprising GABAergic neurons (GABA, based on the neurotransmitter gene Slc32a1) and 959,000 nuclei consisting of primarily non-neuronal cell types. We performed four rounds of iterative clustering to further classify the cells into subclasses and cell subtypes (Extended Data Fig. 4a). During clustering, we used a 500 bp resolution for genomic bin features. After the first iteration (hereafter, L1-level clustering), we divided the 2.3 million nuclei into 37 groups for L2-level clustering, using over 4 million chromatin features. For each group, we then performed a second and a third round of clustering (L2-level and L3-level clustering) sequentially with the top 500,000 genomic bin features and identified a total of 248 subgroups and 899 subtypes of brain cells, respectively (Extended Data Fig. 4a). A total of 291 out of 899 L3-level subtypes consisted of more than 400 cells per subtype and, in total, they captured 1.8 million cells. For these 291 L3-level subtypes, we also performed a fourth round of clustering (L4-level clustering) to further classify them into a total of 874 clusters. In summary, we identified a total of 1,482 cell clusters (874 L4-level clusters and 608 L3-level clusters without L4-level clustering). The number of nuclei in each cluster ranges from 34 to 48,694, with a median number of 484 nuclei per cluster (Supplementary Tables 3 and 4). We use the term subtypes to refer to the 1,482 clusters in the remainder of this Article.

To annotate the cell type identity of the 1,482 subtypes, we performed integration analysis using the data reported in a companion single-cell RNA-seq study of 2 million cells (over 5,300 clusters) from adult male mouse brains5. We first calculated the gene expression scores in each nucleus using SnapATAC2 with the fragments mapped to the gene promoter (up to 2 kb to TSSs) and gene body regions as described previously31,32. We next performed integration analysis using Seurat32,33 separately for neuronal cells and non-neuronal cells (Methods). The co-embedding of both the scRNA-seq and the snATAC–seq neuronal cells showed excellent overlap between the two modalities (Fig. 1d) and the mouse brain major regions (Extended Data Fig. 6a,b). We also observed the same result for non-neuronal cells (Extended Data Fig. 6c–e). The consensus matrix calculated on the basis of the ratio of transferred labels from the scRNA-seq data to our snATAC–seq data showed excellent correspondence between the two datasets, suggesting the robustness of the cell type identification based on either transcriptome or chromatin accessibility (Fig. 1e, Extended Data Fig. 6f–h and Supplementary Table 5). For each snATAC–seq-based subtype, we used the top-ranked cluster label transferred from the scRNA-seq data to represent its scRNA-seq cluster-level annotation. In total, 1,267 neuronal subtypes in the snATAC–seq data were mapped to 965 scRNA-seq clusters. In the scRNA-seq data, the 5,300 clusters were grouped into 338 cell subclasses, the most representative layer for cell type analysis. To annotate our data more robustly, we next mapped our cell subtypes into this layer using the hierarchical relationship between cell cluster and cell subclass defined in the scRNA-seq data. The heat map of the consensus matrix between our subtypes and the scRNA-seq subclasses showed excellent correspondence (Fig. 1e and Supplementary Table 5). To reduce the potential annotation bias induced by different numbers



[Fig. 1 graphic omitted. Panels: a, sagittal view (anterior to posterior) of the 18 coronal slices (600 μm each); b, number of cells per dissection region (A to L) and slice, major-region share (% of cells), and counts for previous (45 dissections, >800,000 cells) and new (72 dissections, >1,500,000 cells) data; c, UMAP of n = 2.3 million nuclei (Glut, >998,000; GABA, >384,000; NN, >959,000), coloured by major region; d, co-embedding of neurons from snATAC–seq and scRNA-seq, coloured by modality (RNA, ATAC); e, consensus score (0.25 to 1.00) between 253 of 315 neuronal subclasses of scRNA-seq and 1,267 neuronal clusters of snATAC–seq; f, the 253 neuronal subclasses ordered by subclass ID, annotated with class, NT type (Glut, GABA, Dopa, Glut–GABA, Chol, Sero, GABA–Glyc), replicate (Rep. 1, Rep. 2), major region, number of clusters per subclass and number of nuclei per subclass.]
[Class legend: 01, IT–ET Glut; 02, NP–CT–L6b Glut; 03, OB–CR Glut; 04, DG–IMN Glut; 05, OB–IMN GABA; 06, CTX–CGE GABA; 07, CTX–MGE GABA; 08, CNU–MGE GABA; 09, CNU–LGE GABA; 10, LSX GABA; 11, CNU–HYa GABA; 12, HY GABA; 13, CNU–HYa Glut; 14, HY Glut; 15, HY Gnrh1 Glut; 17, MH–LH Glut; 18, TH Glut; 19, MB Glut; 20, MB GABA; 21, MB Dopa; 22, MB–HB Sero; 23, P Glut; 24, MY Glut; 25, Pineal Glut; 26, P GABA; 27, MY GABA; 28, CB GABA; 29, CB Glut.]

Fig. 1 | Single-cell analysis of chromatin accessibility in the adult whole mouse brain. a, Schematic of the sample dissection strategy. The brain map was generated using coordinates from the Allen Mouse Brain Common Coordinate Framework (CCF) v.3 (ref. 26). b, The number of nuclei for 117 dissections after quality control and doublet removal. The dot size is proportional to the number of cells and the dissections that were not covered by our previous study19 are shown in grey. A to L on the left were used as the dissection region labels on each slice (details are provided in Extended Data Fig. 1). The number of dissections represents the number of dissections covered by our previous study (last) and updated in the current study (new). The total number of cells represents the number of cells covered by our previous study (last) and updated in the current study (new). c, UMAP81 embedding and clustering analysis of snATAC–seq data. The light colours denote major cell classes. NN, non-neuronal cells. Cells are coloured on the basis of major regions as in b. d, The UMAP co-embedding of the neuronal cells from scRNA-seq data5 and the snATAC–seq data on the same space coloured by the two modalities. e, The consensus score between neuronal subclasses from the scRNA-seq data above and L4-level neuronal clusters from our snATAC–seq data. f, The 253 neuronal subclasses in our snATAC–seq data matched to neuronal subclasses in the scRNA-seq above, and ordered on the basis of the subclass IDs (for all of the following figures, the order was kept the same unless otherwise mentioned). From left to right, the bar plots represent the class, major neurotransmitter (NT) type, biological replicate distribution of nuclei, major region distribution of nuclei, number of clusters and number of nuclei. Detailed information about class, neurotransmitter type and subclass is reported in the companion paper5. A list of full names of the subclasses is provided in Supplementary Table 3. CTX, cerebral cortex; HYa, anterior hypothalamus; L6b, layer 6b; LSX, lateral septal complex; IT, intratelencephalic; ET, extratelencephalic; NP, near-projecting; CT, corticothalamic; OB, olfactory bulb; CR, Cajal-Retzius; DG, dentate gyrus; IMN, immature neurons; CGE, caudal ganglionic eminence; MGE, medial ganglionic eminence; CNU, cerebral nuclei; LGE, lateral ganglionic eminence; MH, medial habenula; LH, lateral habenula; Chol, cholinergic neurons; Dopa, dopaminergic neurons; Glyc, glycinergic neurons; Sero, serotonergic neurons.



of cells in the clusters, for each of our 1,482 subtypes, we manually checked the major regions of the top three cluster-related subclasses, and the gene markers for some subclasses using the bigwig data and gene expression scores (Extended Data Fig. 7) generated using SnapATAC2. Finally, 275 out of 338 subclasses were annotated to the 1,482 subtypes. This includes 253 out of 315 neuronal subclasses, covering 28 neuronal classes and 7 neurotransmitter types, as well as 22 out of 23 non-neuronal subclasses, covering 5 non-neuronal classes (Supplementary Table 4). We confirmed that the matched subclasses in our snATAC–seq data were robust to variations in the sequencing depth, signal-to-noise ratio between brain regions and replicates (Extended Data Fig. 4b,c) by performing the k-nearest-neighbour batch effect test34 and local inverse Simpson’s index analysis35 (Extended Data Fig. 4d,e) and by comparing the ratio of biological replicates across multiple subclasses (Extended Data Fig. 5). The unmatched 63 subclasses correspond mainly to rare cell populations, accounting for a total of 1.7% of the scRNA-seq data. For example, the only unmatched non-neuronal subclass is monocytes, with 21 cells. Other unmatched subclasses correspond to rare cell subclasses mainly from the MB, pons and MY regions, in which the subtle differences between cell types may hinder their identification using chromatin accessibility profiles alone5. Nevertheless, the general agreement between the open-chromatin-based clustering and transcriptomics-based clustering laid the foundation for integrative analysis of cell-type-specific gene regulatory programs in the mouse brain, as for the mouse cerebral region19. In the text below, we focus on the snATAC–seq subclasses and the subtypes within each subclass based on the above integrative analysis.

Most neuronal cell types and some non-neuronal cell types showed strong regional specificity (Fig. 1f and Extended Data Fig. 8). For example, in the CB region, we identified 15 subtypes consisting of 97,000 nuclei that were annotated as CB granule Glut neurons, and two Bergmann glial subtypes comprising about 1,600 nuclei. In the HY region, one subtype with 297 nuclei specifically showed the imputed gene expression of the neuropeptide gene Pmch, which integrated well with the lateral hypothalamic area Pmch-positive Glut neurons from the scRNA-seq data. A series of astrocyte-related cells were identified with region specificity, such as astrocytes in the telencephalon region, astrocytes in non-telencephalon regions, choroid plexus cells and tanycytes, which were integrated well with the corresponding subclasses in the scRNA-seq data (Extended Data Fig. 6i).

Identification and annotation of cCREs
To identify the cCREs in each of the 1,482 subtypes, we aggregated the DNA-sequence reads from cells in the subtype and determined peaks of open chromatin signals using MACS2 (ref. 36) (Extended Data Fig. 9a). When the number of cells of a subtype was fewer than 200, we combined it with other subtypes that were within the same L3-level subtype and mapped to the same cluster in the scRNA-seq data. Only 19 subtypes were affected by this step. Finally, we performed the peak calling on the resulting 1,463 clusters. We selected the genomic regions mapped as accessible chromatin in both biological replicates. To account for potential biases introduced by factors such as sequencing depth and/or number of nuclei in individual clusters, we retained only the reproducible peaks based on a modified MACS2 score (hereafter, score per million (SPM))37 (Methods and Extended Data Fig. 9a). The peaks with SPM ≥ 5 were retained. For each subtype, we retained the peaks that were determined to be open chromatin regions in a significant fraction of the cells (false-discovery rate (FDR) < 0.01, zero-inflated β-model; Extended Data Fig. 9b). In total, we identified a union of 1,053,811 open chromatin regions (500 bp extension surrounding the peak summit) or cCREs (Supplementary Table 6), which together make up 19% of the mouse genome (Supplementary Tables 7 and 8). This list includes 98% of the cCREs reported in our previous study on the mouse cerebral regions19 (Extended Data Fig. 9c), and further expands it by an additional 446,606 cCREs. They are also enriched for active chromatin states or potential insulator-protein-binding sites mapped in bulk mouse brain tissues (Extended Data Fig. 9d). Nearly all of the frequently interacting regions previously identified from the mouse cortex region38 (3,158 out of 3,169) overlap with our cCREs (Methods and Extended Data Fig. 9e,f). Only 2.3% were in promoter regions (defined as 1.5 kb upstream and 500 bp downstream of the TSS) of protein-coding and long non-coding RNA genes, while 34.2% were in intron regions, 35.9% in intergenic regions and 22.8% in TEs, including long terminal repeats (LTRs), long interspersed nuclear elements (LINEs), short interspersed nuclear elements (SINEs) and other repeats (Fig. 2a). We found an average of 45,303 (range, between 4,947 and 177,906) peaks (501 bp in length) in each cell cluster (Extended Data Fig. 9g).

The list of cCREs greatly expands the previous catalogue of mouse cCREs defined by bulk chromatin accessibility data. Importantly, 44% of the mouse brain cCREs (Supplementary Table 9) did not overlap with the DNase-hypersensitive sites (DHSs) mapped in a broad spectrum of mouse tissues (not limited to brain) and multiple developmental stages39,40 (Fig. 2b). Several lines of evidence indicate that these cCREs probably participate in regulatory functions. First, they display higher levels of sequence conservation compared with random genomic regions with similar GC content (Fig. 2c). Second, they feature cell-type-restricted accessibility, a potential factor in their lack of detection in previous bulk tissue assays. More than 62% of the cCREs are active in fewer than ten subtypes, and more than 19% of them are accessible in only one cell subtype (Fig. 2d,e and Extended Data Fig. 9h). Third, the cell-type-specific chromatin accessibility profiles of these cCREs strongly correlate with DNA hypomethylation41 (Fig. 2f, Methods and Extended Data Fig. 9i). The cCREs were organized on the basis of the non-negative matrix factorization (NMF)42 using the matrix of normalized chromatin accessibility of the cCREs (all of the cCREs and the cCREs with no overlaps with the DHSs separately) across the 275 cell subclasses (Methods and Supplementary Tables 10 and 11). Notably, two subclasses show DNA hypomethylation across most of the cCREs (Extended Data Fig. 9j).

Inferring GRNs
To further dissect the gene regulatory programs in each of the 275 subclasses on the basis of the subtype-specific cCREs identified previously, we first assessed the relationship between the chromatin accessibility at the cCREs and the transcription levels of putative target genes across the cell subclasses, and we then constructed cell-subclass-specific GRNs43. We performed the analysis at the subclass level because cell clusters are sufficiently resolved and the open-chromatin landscapes align strongly with the scRNA-seq dataset.

We began by detecting pairs of co-accessible cCREs within 500 kb for each cell subclass using Cicero44 and inferred candidate target promoters for each distal cCRE located more than 1 kb away from the annotated TSSs in the mouse genome (Fig. 3a and Methods). We determined hundreds of thousands of cCRE–cCRE pairs within 500 kb of each other in 274 out of 275 cell subclasses (Supplementary Table 12). This set included the promoter–distal cCRE combinations between 502,704 distal cCREs and 24,414 promoters of protein-coding and long non-coding RNA genes (Extended Data Fig. 10a,b). The median distance between all of the promoter–distal cCRE pairs is 156 kb (Extended Data Fig. 10c).

To link potential enhancers to their putative target genes, we looked for the subsets of distal cCREs showing positive correlations between their chromatin accessibility and RNA expression of the putative target genes across the 275 cell subclasses. We computed Pearson correlation coefficients (PCCs) between the normalized chromatin accessibility signals and the RNA expression for each pair of distal cCRE and the corresponding genes of the proximal cCRE (Fig. 3a). As a control, we randomly shuffled the cCREs and the putative target genes, then



[Fig. 2 graphic omitted. Panels: a, genomic annotation of the cCREs (n = 1,053,811): intergenic, 35.9%; intron, 34.2%; LTR, 8.1%; LINE, 7.5%; SINE, 4.5%; other repeats, 2.7%; promoter–TSS, 2.3%; exon, 1.8%; TTS, 1.1%; 3′ UTR, 1.0%; CpG island, 0.4%; non-coding, 0.3%; 5′ UTR, 0.2%. b, overlap between the 1,053,811 cCREs and the 1,192,301 ENCODE rDHSs (labels: 56% of cCREs; 60% of rDHSs). c, PhastCons score from –250 bp to 250 bp around the cCRE summit for ovlp, non-ovlp and random regions. d, number of clusters (1 to ≥10) in which each cCRE is captured, for non-ovlp cCREs (462,522) and ovlp cCREs (591,289). e, genome browser tracks of non-ovlp and ovlp cCREs for selected subclasses (snATAC–seq). f, heat maps of snATAC–seq log[CPM + 1] and snmC-seq mCG score across cell subclasses 1 to 244, with cCREs ordered by modules from 1 to 150, for all 1 million cCREs and for the 460,000 cCREs with no overlap with ENCODE rDHSs.]

Fig. 2 | Identification and characterization of cCREs across mouse brain cell types. a, The fraction of cCREs that overlaps with annotated sequences in the mouse genome was determined using HOMER45. TTS, transcription termination site; UTR, untranslated region. b, The overlaps between the cCREs in this study (red) and the representative DHSs (rDHSs; blue) from the SCREEN database18. c, The average PhastCons conservation scores of cCREs (red) overlapping (ovlp) with rDHSs, cCREs (blue) with no overlaps with rDHSs, and random genomic background (grey) were determined using deepTools82. d, The fraction of cCREs captured by different cell subtypes for peak calling. Left, the cCREs with no overlaps with rDHSs. Right, the cCREs with overlaps with rDHSs. e, Genome browser tracks of the two types of cCREs. Left, cCREs with no overlaps with rDHSs. Right, the cCREs with overlaps with rDHSs. The subclass names were the same as for the scRNA-seq data in the companion paper5. f, The chromatin accessibility at 150 cis-regulatory modules across the 244 shared cell subclasses in the snATAC–seq data for all of the 1 million cCREs (top left). Rows represent subclasses, and columns are representative cCREs sampled from each module. Right, heat map showing the snDNA-methylation signals from the snmC-seq41 analysis at the genomic locations of the corresponding cCREs for the same subclasses. Bottom, heat maps similar to those above but for only the 460,000 cCREs with no overlaps with the ENCODE rDHSs.

computed the PCCs of the shuffled cCRE–gene pairs (Fig. 3b and Methods). This analysis revealed a total of 613,485 positively correlated distal cCRE (putative enhancer)–gene pairs and 107,413 negatively correlated distal cCRE–gene pairs at an empirically defined significance threshold of FDR < 0.01 (Extended Data Fig. 10d and Supplementary Table 13). The median distance between the potential enhancers and the target promoters was 133 kb (Extended Data Fig. 10e). Each promoter region was assigned to a median of 24 putative enhancers (Extended Data Fig. 10f). The top proximal–distal cCRE pairs and positive pairs showed enrichment signals using the chromatin conformation data from the companion study41 (Methods and Extended Data Fig. 10g,h).

For the subsequent analysis, we focused mainly on the positively correlated pairs, including 281,200 potential enhancers and 20,703 putative target genes. To investigate how the putative enhancers may regulate cell-type-specific gene expression, we further classified them into 54 modules using the NMF42 on the matrix of normalized chromatin accessibility across the cell subclasses based on the integration analysis with the scRNA-seq data, and organized the distal cCREs based on the modules (Fig. 3c and Supplementary Tables 14 and 15). The putative enhancers in each module showed cell-subclass-specific chromatin accessibility profiles co-occurring with the RNA expression of their putative target genes (Fig. 3c). We next performed the motif-enrichment analysis for each module using HOMER45 with a threshold of P < 10^−10 (Fig. 3c and Supplementary Table 16). The known motifs showed a similar cell-subclass-specific pattern, which indicated cell-subclass-specific regulatory programs. For example, EBF transcription factor 1 (EBF1), which is important for B cell development, was expressed in the pericytes from human brain tissues46. We found that EBF1 motifs are enriched in the cCREs from pericytes in the mouse brain (Fig. 3c). As another example, motifs for both the TF PU.1 and interferon regulatory factor 8 (IRF8) were enriched in border-associated macrophages (BAMs) and microglia (Fig. 3c and Supplementary Tables 15 and 16). IRF8 is critical to transform microglia into a reactive phenotype47,48. PU.1 is especially expressed in microglia and can regulate genes associated with Alzheimer’s disease in primary human microglia49. PU.1 and IRF8 also have essential roles in macrophages50,51.



[Fig. 3 graphic not reproduced. Panel a: schematic in two steps, identify co-accessible cCRE–gene pairs, then detect correlated cCRE–gene pairs between snATAC–seq (cCRE accessibility) and scRNA–seq (RNA expression) across cell types. Panel b: density of Pearson correlation (x axis, –1.0 to 1.0; y axis, density), 613,485 positively correlated pairs. Panel c: heat maps of 281,200 positively correlated cCREs in 54 modules across 275 subclasses (cCRE accessibility and putative target gene expression) and motif analysis for the 54 cCRE modules; colour scales are log[CPM + 1], z score of log[CPM + 1] and –log10[P]; subclasses are annotated by class and neurotransmitter (NT) type.]

Fig. 3 | Integrative analysis to identify the potential enhancer–gene connections across the whole mouse brain. a, Schematic of the computational strategy used to identify cCREs that are positively correlated with the mRNA expression of the target genes; PCCs were calculated across 275 cell subclasses between the snATAC–seq and scRNA-seq data. Co-accessible cis-regulatory DNA interactions were predicted using Cicero44 for each cell subclass. b, In total, 613,485 positively correlated cCRE–gene pairs (red) were identified (FDR < 0.01). The grey-filled curve shows the distribution of PCCs for randomly shuffled cCRE–gene pairs. c, The chromatin accessibility of putative enhancers (left); mRNA expression of the linked genes in the 275 cell subclasses across the whole mouse brain (middle); and the enrichment of known TF motifs in distinct enhancer gene modules (right). A total of 428 out of 440 known motifs from HOMER45 with enrichment P < 10^−10 are shown. The unadjusted P values were calculated using two-sided Fisher’s exact tests.
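The two-sided Fisher’s exact test named in the caption can be sketched in a few lines. This is a generic textbook implementation on a 2 × 2 table, not the code used in the study. The framing of the table (cCREs in a module with or without a motif, against background cCREs with or without it) and the counts are assumptions made for illustration.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of all tables with the same margins
    that are no more likely than the observed table.
    """
    row1, row2, col1 = a + b, c + d, a + c
    total = comb(row1 + row2, col1)

    def prob(x):
        # Probability of observing x in the top-left cell given fixed margins.
        return comb(row1, x) * comb(row2, col1 - x) / total

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Small relative tolerance so that ties with p_obs are included.
    p = sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-7))
    return min(1.0, p)

# Hypothetical counts: 40 of 100 module cCREs carry the motif,
# against 10 of 200 background cCREs.
p_enriched = fisher_exact_two_sided(40, 60, 10, 190)
p_small_table = fisher_exact_two_sided(3, 1, 1, 3)
print(p_enriched, p_small_table)
```

A threshold such as P < 10^−10 would then be applied per motif and per module; no multiple-testing correction is shown here because the caption describes the P values as unadjusted.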

We next applied CellOracle52 to the snATAC–seq and scRNA-seq data (Methods and Extended Data Fig. 11a,b) for GRN analysis. To achieve this, the subclass-specific distal cCREs detected using Cicero above were first mapped to mouse TFs based on TF-binding motifs using the tool gimmemotifs53. A regularized linear regression model was then used to predict the gene expression at the single-cell level on the basis of the mapped TF-motif instances surrounding each gene promoter and generate GRNs for each subclass. The 3,000 most variable genes across all of the subclasses, selected from the scRNA-seq data using Seurat, and 499 TFs reported to have essential roles in defining cell subclasses in the scRNA-seq data5 were included in this analysis. Finally, we successfully inferred GRNs for 267 out of 275 cell subclasses (one example of GRN from the subclass ASC-TE_NN, that is, astrocytes from the telencephalon region, is shown in Fig. 4a). The resulting GRNs contained a total of 403 TFs and 2,628 non-TF genes (Methods and Supplementary Table 17). As expected, the connectivity of the nodes follows a power-law distribution54 (Fig. 4b) in 266 of the 267 GRNs (Extended Data Fig. 11c). On average, each GRN contained 312 TFs and 681 genes (Fig. 4c).

Recurring network motifs are a common feature of GRNs55. We compared the 17 common network motifs56 in each of the above GRNs (Methods and Supplementary Table 18) across different cell classes defined in the scRNA-seq data (Extended Data Figs. 11d,e and 12a) and across different brain regions (Methods, Extended Data Fig. 12b and Supplementary Table 19). We first mapped the 267 subclasses to five main regions, that is, the telencephalon (isocortex, OLF, AMY, STR, PAL), diencephalon (TH, HY), hindbrain (pons, MY), MB and CB, retaining a subclass only if at least 60% of its cells could be mapped to these regions (248 subclasses remained), and identified regulated double-positive motifs (TF A increases the expression of both TF B and TF C, and TF B and TF C can positively regulate each other) (Fig. 4d and Supplementary Table 20). The GRN from BAMs (BAM_NN; Fig. 4e) includes a regulated double-positive motif composed of activating transcription factor 3 (ATF3), KLF4 and TAL1, indicating that the three factors may positively regulate each other in the BAM subclass. ATF3 is an inflammatory mediator and a key regulator of interferon response in macrophages57. KLF4 from the Kruppel-like family of factors has an essential role in monocyte differentiation58, and is a mediator of proinflammatory signals in macrophages59. The Tal1 gene, which encodes a basic helix-loop-helix TF, is expressed during monocyte–macrophage lineage differentiation and has an important role in cell cycle progression and proliferation during monocytopoiesis60,61. Using the Cistrome Data Browser62 as a resource for chromatin immunoprecipitation followed by sequencing data, we noticed that ATF3 binds to putative enhancers near both Tal1 and Klf4 in bone-marrow-derived macrophages (Gene Expression Omnibus: GSE99895; Extended Data Fig. 12c,d). Overall, non-neuronal cells showed higher counts of several network motifs (such as the regulated double-positive motif) compared with Glut neurons and GABAergic neurons (Fig. 4e and Extended Data Figs. 11d and 12a).

Furthermore, we highlighted the importance of key TFs within these networks by calculating their eigenvector centrality scores using CellOracle. In Fig. 4f, the 267 subclasses and 226 TFs were ordered in the same manner as described in the companion paper5 (Supplementary Table 21). Notably, we observed a similar pattern of importance scores for the TFs as seen in the scRNA-seq data, where normalized gene expression was shown. This consistency of the TF signatures across modalities reinforced the fidelity of our GRN inferences. It also demonstrated how regulatory codes of TFs across the whole mouse brain could be revealed through integrated analysis of snATAC–seq and scRNA-seq data.

TFs such as JUN, JUNB and FOS have high importance scores across multiple neuronal and non-neuronal subclasses. TFs of the bHLH family such as NEUROD1, NEUROD2, NEUROD6 and BHLHE22 have

[Fig. 4 graphic not reproduced. Panel a: example of GRN from ASC-TE_NN, with TF nodes including Lhx2, Olig1, Gli3, Id4, Irx2, Meis2, Nr2f1, Rfx3, Npas2 and Rreb1 and edges marked negative or positive. Panel b: degree distribution of the GRN from ASC-TE_NN, P(k) against k and log[P(k)] against log[k], r^2 = 0.79. Panel c: box plots of number of TFs, number of genes, TFs per gene and genes per TF. Panel d: density of counts of the regulated double-positive motif by main class (GABA, GABA–Glyc, Glut, NN), with the BAM_NN example involving ATF3, KLF4 and TAL1. Panel e: density of counts of the double-positive and cascades-positive motifs by region (CB, MB, diencephalon, telencephalon, hindbrain). Panel f: heat map of eigenvector centrality (0 to 0.8) of TFs (labelled rows include NR4A2, FOS, NEUROD6, SP8, TCF7L2, PAX5, HOXB4, GATA2 and MEIS2) across 267 subclasses ordered by subclass ID from smallest to largest (001 to 338), annotated by class and NT type.]

Fig. 4 | Inference of subclass-specific GRNs across the whole mouse brain. a, Example of the GRN inferred in telencephalon-region astrocyte (ASC-TE_NN) using CellOracle52. Edges are weighted and directed to reflect the putative regulation strength and mode (inhibition or activation). b, The degree distribution of the GRN in a. P(k), the probability of a node having k degree in the GRN. The degree of one node is the number of other nodes with links to it. c, The number of TFs, the number of genes, the number of regulated TFs per gene and the number of genes regulated by the TFs among the GRNs for each of 267 cell subclasses. The numbers of dots in each box plot from left to right are as follows: 267, 267, 185,000 and 82,000. For the latter two plots, treat TFs and genes from different subclasses as different ones. For the box plots in c, the box limits span the first to third quartiles, the centre line denotes the median and the whiskers show 1.5× the interquartile range. d, Normalized histograms of the number of the regulated double-positive56 network motifs for each main cell class. The lines are the kernel-based density curves fitted for different histograms. e, Histograms of the two network motifs for five mouse brain regions: telencephalon (isocortex, OLF, HPF, STR, PAL and AMY), diencephalon (TH and HY), MB, hindbrain (MY and pons) and CB. f, Heat map of eigenvector-based centralities or importance scores of TFs in each of the subclass-specific GRNs. Each row represents a TF, and each column a subclass. The orders of the TFs and subclasses are based on the companion paper5 for the similar heat map but using the scRNA-seq data. The names of the rows and columns are listed in Supplementary Table 18.
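The regulated double-positive motif counted in Fig. 4d (TF A activates both TF B and TF C, and B and C activate each other) can be made concrete with a small search over a signed, directed edge list. This is a sketch under assumptions: the edge representation and the function name are hypothetical, and the toy edges are arranged only to mirror the ATF3, KLF4 and TAL1 example in the text; they are not taken from the inferred GRN.

```python
from itertools import combinations

def find_regulated_double_positive(edges):
    """Return (A, B, C) triples where A activates B and C, and B and C activate each other.

    edges maps (source, target) to a sign: +1 for activation, -1 for inhibition.
    """
    positive = {pair for pair, sign in edges.items() if sign > 0}
    targets = {}
    for src, dst in positive:
        if src != dst:  # ignore self-loops
            targets.setdefault(src, set()).add(dst)
    motifs = []
    for a in sorted(targets):
        # Each unordered pair of positively regulated targets is a candidate (B, C).
        for b, c in combinations(sorted(targets[a]), 2):
            if (b, c) in positive and (c, b) in positive:
                motifs.append((a, b, c))
    return motifs

# Hypothetical toy network.
toy_edges = {
    ("ATF3", "KLF4"): +1,
    ("ATF3", "TAL1"): +1,
    ("KLF4", "TAL1"): +1,
    ("TAL1", "KLF4"): +1,
    ("JUN", "KLF4"): +1,
    ("JUN", "TAL1"): -1,   # inhibition, so JUN does not form the motif
}
motifs = find_regulated_double_positive(toy_edges)
print(motifs)
```

Counting the returned triples per GRN gives one value per subclass, which is the kind of quantity summarized as a histogram per cell class or brain region.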

high importance scores for many types of neurons such as the Glut neurons in the isocortex region. Our analysis also indicated potential regulation of gene expression in GABAergic neurons by TFs such as ARX, SP8 and SP9 in the telencephalon regions, whereas TFs such as GATA2, TAL1 and GATA3 showed high importance scores for GABAergic neurons in the MB and pons regions. TCF7L2, SHOX2 and EBF1 had high importance scores associated with Glut neurons specifically in the TH region. Moreover, TCF7L2 exhibited high importance in the MB region. Next, we observed that the TFs FOXA1 and FOXA2 had a specific association with the Glut neurons in the MB region. HOX-family TFs



[Fig. 5 graphic not reproduced. Panel a: schematic of mouse-specific and orthologous cCREs between the mouse and human genomes, with a bar plot of the number of cCREs (×10^5): mouse specific, 42%; orthologous, 58%. Panel b: genomic distribution (0% to 45%) of mouse-specific and orthologous cCREs over 5′ UTR, promoter–TSS, exon, intron, intergenic, TTS, 3′ UTR, non-coding, CpG island, LINE, SINE, LTR and other repeats. Panel c: number of cell subclasses against the fraction of open cCREs overlapped with TEs in each subclass, coloured by NT type, with the highTE-Glut group marked. Panels d,e: GO terms (for example, glutamatergic synapse, neuron to neuron synapse and postsynaptic density) plotted by –log10[FDR]. Panel f: –log10[FDR] against log2[fold change] for TE-cCREs (number of DCA TE-cCREs, 1,331; total tested TE-cCREs, 31,137), with the top ten DCA TE-cCREs with synaptic genes in PDC labelled: L1MB8–Cdkl5, RMER19B2–Psenen, LX5C–Lin7b, LX6–Snca, L1MD–Grin2a, LTR78B–Itpr1, LX5C–Itpr1, L2–Dlgap2, L1MB5–Prkcg and ORR1E–Vhl. Panel g: motifs enriched in DCA TE-cCREs of highTE-Glut: bHLH family (NEUROG2, TCF4), bZIP family (FRA1, ATF3) and ZF family (EGR1, WT1). Panel h: ATAC and RNA signal tracks for NN, GABA, highTE-Glut and other Glut subclasses at the Dlgap2 locus, with RefSeq genes, cCREs, TEs (L2) and positive PDC.]

Fig. 5 | Analyses of chromatin accessibility at TEs of cCREs. a, Schematic of mouse-specific and orthologous cCREs. The bar plot shows the numbers of mouse-specific and orthologous cCREs. b, The fraction of the genomic distribution of mouse-specific and orthologous cCREs. c, The fraction of cCREs overlapping with TEs in each subclass of Glut neurons, GABAergic neurons, dopaminergic neurons, cholinergic neurons, serotonergic neurons, glycinergic neurons and non-neurons. The two curves show the Gaussian distribution from the mixture model. highTE-Glut refers to the Glut neuron subclasses with a high percentage of their cCREs overlapping with TEs. d, Gene Ontology (GO) analysis revealing an enrichment of neuronal-specific functions among genes that exhibited positive correlations with TE-cCREs (TE-related cCREs) in highTE-Glut subclasses, compared with genes positively correlated with TE-cCREs in all subclasses. e, GO analysis revealing an enrichment of neuronal-specific functions among genes that exhibited positive correlations with TE-cCREs in highTE-Glut subclasses, compared with genes positively correlated with all cCREs in highTE-Glut subclasses. f, DCA at TE-cCREs in highTE-Glut subclasses compared with other subclasses. The top ten DCA TE-cCREs correlating with synaptic-related genes are shown. The top ten DCA TE-cCRE–gene pairs (such as L1MB8–Cdkl5) are indicated by red boxes. The super family of the top ten DCA TE-cCREs are indicated by different shapes. g, The top three motif families enriched in the DCA TE-cCREs in highTE-Glut neurons. The unadjusted P values were calculated using two-sided Fisher’s exact tests. h, Genome browser tracks of aggregate chromatin accessibility profiles for NN, GABA, highTE-Glut and other Glut subclasses at selected DCA TE-cCREs and gene pairs. RNA signals shown here were collected from the previous study2. PDC, proximal–distal connections.

displayed high importance scores in both GABAergic and Glut neurons in the MY region. Last, MAF and MAFB showed high importance scores in GABAergic neurons in the cortex region.

Conservation of the mouse brain cCREs
To investigate the conservation of the gene regulatory landscapes in mouse brain cells, we compared the mouse brain cCREs defined in this study with a separate study of single-cell chromatin accessibility in 42 human brain regions23. We first identified orthologues of mouse cCREs in the human genome by performing reciprocal homology searches and found 613,073 cCREs (58% of total mouse cCREs) defined in mouse brains to have orthologous sequences in the human genome (more than 50% of bases lifted over to the mouse genomes) (Fig. 5a and Extended Data Fig. 13a). The percentage of orthologous cCREs is significantly higher than the random expectation (32% orthologous

for randomly shuffled cCREs). Among these orthologous cCREs, 39% (22% of total mouse cCREs) were identified as open chromatin regions in one or more cell types in the human brain (Extended Data Fig. 13a,b). We therefore defined the 22% of mouse cCREs with both DNA sequence similarity and open chromatin in the human brain cells as chromatin-accessibility-conserved cCREs. This modest rate of conservation may reflect the still incomplete annotation of cCREs in the human brain. Indeed, nearly 33% of the human brain cCREs defined in the other study have a homologous sequence in the mouse genome that also displays chromatin accessibility in one or more mouse brain cell types23. Nevertheless, the chromatin-accessibility-conserved cCREs appear to have constraints during evolution, and probably have important regulatory roles in mammalian brain cells. Consistent with a recent report63, the fraction of cCREs that are classified as chromatin-accessibility conserved in the human brain varies significantly among different brain cell types. Furthermore, the chromatin-accessibility-conserved cCREs tend to be at promoter regions (Extended Data Fig. 13c) and accessible in a broader spectrum of cell types (Extended Data Fig. 13d–f).

Mouse-specific cCREs are enriched for TEs
Notably, 42% of mouse cCREs defined in mouse brain cells lack orthologous genome sequences in the human genome (Fig. 5a). These mouse-specific cCREs show strong enrichment of TEs, especially the LINEs, SINEs and LTRs (Fig. 5b and Extended Data Fig. 14a). Notably, cCREs defined in 22 subclasses of excitatory neurons display an unusually high rate of overlap with TEs, and we refer to them as highTE-Glut subclasses (Fig. 5c and Extended Data Fig. 14b–e). In total, 20 out of 22 highTE-Glut subclasses were specifically found in the isocortex, OLF and HPF. Notably, the genes near the 115,772 TE-overlapping cCREs, including both mouse-specific and orthologous cCREs, and expressed in at least one of the highTE-Glut neuron subclasses were enriched for those involved in synaptic-related functions (Extended Data Fig. 14f–h). We found 14,619 genes whose expression was positively correlated with chromatin accessibility at 31,137 TE-overlapping cCREs (hereafter, TE-cCREs) across the different subclasses of brain cells, and found that they were also significantly enriched for synapse-related functions (Fig. 5d,e, Extended Data Fig. 14i and Supplementary Table 22). The large number of genes with nearby accessible TE-cCREs is unexpected. To further investigate the genes potentially subject to TE-derived regulatory cCREs, we performed differential chromatin accessibility (DCA) analysis between highTE-Glut and other cell subclasses, and uncovered 1,331 such TE-cCREs. Among them, accessibility profiles at 228 DCA TE-cCREs, including L1MB8, L2 and ORR1E, were correlated with expression of synaptic-related genes (Fig. 5f, Extended Data Fig. 14j and Supplementary Table 23). Motif analysis of these DCA TE-cCREs showed enrichment of many bHLH-family and bZIP-family TFs, such as NeuroG2, TCF4 and FRA1 (Fig. 5g and Supplementary Table 24).

Examples of positively correlated TE-cCRE and synaptic-related gene pairs are shown in Fig. 5h and Extended Data Fig. 14k. Furthermore, we examined the superfamilies and families of the DCA TE-cCREs in highTE-Glut, comparing them to all TE-cCREs in highTE-Glut as the background. We observed a significant enrichment of DCA TE-cCREs in the LINE superfamily (FDR = 8.05 × 10^−36) and the L1 subfamily (FDR = 1.27 × 10^−38). L1, an active retrotransposon in both mouse and human, has accumulated in mammalian genomes. It can serve as a source of evolutionary novelties by providing essential motifs64.

On the basis of the analysis of variability of chromatin accessibility of TEs, we found 90 TEs that display variable patterns of chromatin accessibility across brain cell subclasses (Extended Data Fig. 15a,b). Most of them showed strong negative correlation with DNA CpG methylation signals in the matched cell subclasses. Many of them, such as LTR64, X2_LINE and MamTip1, also showed positive correlations with RNA expression signals in the matched cell subclasses, suggesting a potential role for these TEs in regulating gene expression. We further performed motif analysis on those variable TEs that may have a regulatory role. We found that distal variable TEs in positive proximal–distal cCRE connections were enriched for many binding sites of TFs, including NF1-halfsite, RORγt and HNF1 (Extended Data Fig. 15c). In addition to the above variable TE families, a greater number of TEs showed invariable chromatin accessibility across brain cell types (Extended Data Fig. 15d).

Deep-learning models for brain cCREs
Deep-learning models have shown great promise in the dissection of gene regulatory mechanisms65–69. Sequence-based predictors of gene expression or epigenetic features have been developed for large mammalian genomes using cell-type-specific epigenetic and transcriptional profiles as training data65,67,70. These models can help to annotate sequence motifs that drive regulatory element function, and to predict the influence of DNA variants on gene regulation. To develop sequence-based predictors of chromatin accessibility in different brain cell types (Fig. 6a and Methods), we adapted the deep-learning model architecture Basenji, which uses densely connected dilated convolutional neural networks that are used in natural language processing tasks65. We generated training, validation and testing datasets (Methods) from the 275 subclasses (also referred to as cell types in this section) and evaluated the model on the 221 subclasses with at least 500 cells including 93 GABAergic and 111 Glut cell subtypes, and 17 non-neuronal types (Fig. 6b). The resulting model successfully predicted open chromatin regions across these cell types, with an average PCC of 0.825 between the predicted signals and true chromatin accessibility signals across cell types (Fig. 6c). To further improve the model performance in under-represented cell types, we introduced a weighted loss function to enable the model to better learn the cell-type-specific signals during training (Methods). To compare the peaks identified from experimental signals to the peaks called from predicted signals, we calculated the area under the receiver operating characteristic curve (AUROC) and demonstrated that the model can predict the open chromatin regions very well (from 0.72 to 0.94, and 0.85 on average) for different cell types (Fig. 6d and Supplementary Table 25). This high performance was comparable to the prediction of chromatin accessibility signals from the most advanced deep-learning model67. We further evaluated the model’s ability to predict cell-type-specific chromatin accessibility at each cCRE across the diverse cell subclasses, achieving a median PCC of 0.59 for the variable cCREs (coefficient of variation > 1) in the testing set (Fig. 6e). To demonstrate the performance of our model, we visualized predictions in unseen test regions among 12 cell types representing diverse brain regions, cell classes and neurotransmitters (Fig. 6f). Our model not only recapitulated signals that were common across subclasses (Nr4a2), but also showed subclass-specific predictions. For example, signals around Apoe were specific to astrocytes (Astro-TE-NN and Bergmann-NN) and signals around Ecel1 were specific to neurons.

While still poorly characterized, the grammar and syntax of gene regulatory elements are believed to be evolutionarily conserved71. We therefore tested how well the above-described deep-learning model trained using mouse single-cell chromatin accessibility data can predict cCREs in the matched human brain cell types with human sequences as inputs23 (Fig. 6g and Extended Data Fig. 16). Satisfyingly, we found that the mouse deep-learning model can predict chromatin accessibility profiles in the matching human brain cell types fairly accurately (AUROC, 0.75 on average) (Fig. 6h). It achieves modest accuracy in predicting cell type specificity among cCREs (median PCC = 0.41) (Fig. 6i). The cell-type-specific distal cCREs, such as the ones close to marker genes CUX2, GAD2, DRD1 and OLIG1, were well predicted (Fig. 6j). These results open a window to evaluate the influence of risk variants on regulatory activities across corresponding cell types in the human brain.
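The per-cCRE evaluation described in this section (a Pearson correlation between true and predicted signals across cell types, summarized over variable cCREs) can be sketched as follows. The coefficient of variation is taken here as standard deviation over mean, which is an assumption, since the caption of Fig. 6e describes it as variance/mean. The function names and the toy matrices are hypothetical.

```python
import numpy as np

def per_ccre_pearson(true, pred):
    """Pearson r between true and predicted signals for each cCRE (rows) across cell types."""
    t = true - true.mean(axis=1, keepdims=True)
    p = pred - pred.mean(axis=1, keepdims=True)
    denom = np.sqrt((t ** 2).sum(axis=1) * (p ** 2).sum(axis=1))
    out = np.full(true.shape[0], np.nan)     # NaN where a row is constant
    return np.divide((t * p).sum(axis=1), denom, out=out, where=denom > 0)

def coefficient_of_variation(true):
    """Standard deviation over mean of the true signal for each cCRE (assumed definition)."""
    mean = true.mean(axis=1)
    return np.divide(true.std(axis=1), mean, out=np.zeros_like(mean), where=mean > 0)

# Toy signals for 3 cCREs across 4 cell types (hypothetical values).
true = np.array([[10.0, 0.0, 0.0, 0.0],     # cell-type specific
                 [5.0, 5.0, 5.0, 6.0],      # nearly invariable
                 [2.0, 2.0, 8.0, 8.0]])     # moderately variable
pred = np.array([[9.0, 1.0, 0.0, 1.0],
                 [5.0, 6.0, 5.0, 5.0],
                 [1.0, 3.0, 7.0, 9.0]])

r = per_ccre_pearson(true, pred)
variable = coefficient_of_variation(true) > 1.0
median_r_variable = float(np.median(r[variable]))
print(variable, median_r_variable)
```

Restricting the summary to variable cCREs matters because an invariable cCRE has little cell-type-to-cell-type signal for a correlation to capture.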



[Fig. 6 graphic not reproduced. Panel a: DNA sequence (A, C, G, T; 131 kb in 128 bp bins) passed to the DL model (Basenji) to predict ATAC–seq signal for 221 cell subclasses. Panel b: number of subclasses per class: Glut, n = 111; GABA, n = 93; NN, n = 17. Panel c: accuracy (PCC) for GABA, Glut and NN. Panel d: true-positive rate against false-positive rate, with AUROC: SST GABA = 0.915, ET Glut = 0.920, OPC NN = 0.929. Panel e: Pearson r against coefficient of variance, shaded by cCRE density, with invariable and variable cCREs marked. Panel f: true and predicted signals in mouse for L2/3 IT CTX Glut, CA1-ProS Glut, L6 CT CTX Glut, Vip GABA, Pvalb GABA, Sst GABA, CBX MLI Megf11 GABA, CB granule Glut, Bergmann NN, Astro-TE NN, Oligo NN and Microglia NN near Nr4a2, Pou4f2, Ecel1, Hopx, Apoe and Pf4. Panel g: schematic of training on cell types and sequence from mouse and predicting for cell types and sequence from human. Panel h: AUROC for human cell types. Panel i: Pearson r for overall, distal and proximal cCREs. Panel j: true and predicted signals in human for IT-L6, IT-L5, IT-L4/5, IT-L2/3, VIP, SST, D1CaB, OPC, OGC and MGC near DRD1, CUX2, GAD2 and OLIG1.]

Fig. 6 | Deep-learning models predict chromatin accessibility in different brain cell types from the DNA sequence. a, Schematic of the deep-learning (DL) model Basenji for predicting chromatin accessibility. b, The number of subclasses of each cell class in the training dataset. c, The accuracy (Pearson correlation) of each class. n = 93 (GABA), n = 111 (Glut) and n = 17 (NN) subclasses. d, The AUROC was calculated for representative subclasses by comparing the peaks called from predicted genomic signals with the peaks called from real experimental signals. e, The model’s ability to predict cell-type-specific patterns of open chromatin. The coefficient of variance (variance/mean) across cell types was compared with the Pearson r calculated between true signals and the predicted signals across cell subclasses. Each dot represents one cCRE in the testing set. f, True signals from ATAC–seq data in mouse cell subclasses were compared with the predicted chromatin accessibility in the test set. Representative loci near Nr4a2, Pou4f2, Ecel1, Hopx, Apoe and Pf4 are shown. g, Schematic of predicting potential chromatin accessibility signals using human DNA sequence as inputs. h, The AUROC was calculated for matched human cell types. n = 26 cell types for the human brain dataset. i, The Pearson r of true signals and the predicted signals across cell types for all tested cCREs, tested distal cCREs and tested proximal cCREs. The numbers of overall, distal and proximal cCREs are 452,531, 437,207 and 15,324, respectively. j, True signals captured from ATAC–seq analysis in human cell types and predicted chromatin accessibilities are shown at representative genomic loci near the genes CUX2, GAD2, DRD1 and OLIG1. Cell-type-specific cCREs are highlighted in grey. For the box plots, the box limits span the first to third quartiles, the centre line denotes the median and the whiskers show 1.5× the interquartile range.
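The AUROC reported in panels d and h can be computed from binary peak labels and predicted scores with the rank-based (Mann–Whitney) formulation sketched below. How genomic bins were labelled as peaks and which scores were used is described in the paper's Methods; the labels, scores and function name here are hypothetical.

```python
import numpy as np

def auroc(labels, scores):
    """Area under the ROC curve via average ranks (ties receive the mean rank)."""
    labels = np.asarray(labels, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    n_pos = int(labels.sum())
    n_neg = labels.size - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUROC needs at least one positive and one negative label")
    order = np.argsort(scores, kind="mergesort")
    sorted_scores = scores[order]
    ranks = np.empty(scores.size, dtype=float)
    i = 0
    while i < scores.size:
        j = i
        # Extend j over a run of tied scores.
        while j + 1 < scores.size and sorted_scores[j + 1] == sorted_scores[i]:
            j += 1
        ranks[order[i:j + 1]] = 0.5 * (i + j) + 1.0   # mean of 1-based ranks i+1..j+1
        i = j + 1
    # Mann-Whitney U statistic for the positive class.
    u = ranks[labels].sum() - n_pos * (n_pos + 1) / 2.0
    return float(u / (n_pos * n_neg))

# Hypothetical example: 1 marks a bin inside an experimentally called peak,
# scores are predicted accessibility values for the same bins.
example_auc = auroc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(example_auc)
```

An AUROC of 0.5 corresponds to scores that do not separate peaks from background, and 1.0 to perfect separation.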

Discussion
Here we describe a comprehensive cCRE catalogue of the mouse brain, through single-cell chromatin accessibility analysis of more than 2.3 million cells from 117 anatomical dissections in the adult mouse brain. This catalogue represents a comprehensive annotation of candidate gene regulatory elements of the mammalian brain. It greatly expands on the previous cCRE annotation of the mouse brain cells, adding more than 446,000 cCREs. This addition is enabled by the use of single-cell-resolution chromatin profiling, which enables the identification of chromatin accessibility in rare brain cell types that are under-represented in previous bulk assays and brain regions that were not surveyed in previous studies. Indeed, more than two-thirds of the new cCREs are detected in ten or fewer brain cell subtypes (Fig. 2d), with a median of six cell subtypes. By comparison, the cCREs reported in the previous catalogues39,40 based on bulk tissue studies are typically detected as accessible in ten or more cell types, with a median of 28 cell subtypes. It is possible that additional mouse brain cCREs remain to be discovered because many cell types defined by scRNA-seq or other molecular modalities are not currently represented in the snATAC-based cell clusters. Furthermore, the current catalogue was at the resolution of cell subclasses, and may not reflect subtle differences between cell types, subtypes and states defined in the companion single-cell transcriptomics or single-cell methylome studies5,6,41.

We have attempted to reconstruct the GRNs in over 260 different brain cell subclasses by applying CellOracle52 to the single-cell ATAC–seq and RNA-seq datasets collected from the adult mouse brain. The GRNs that we inferred for brain cells would be the first such GRNs characterized for the mammalian brain cells. We characterized the common network motifs in these cell types. Indeed, the GRN-based eigenvector centralities of TFs across the subclasses (Fig. 4f) showed a pattern similar to that in the scRNA-seq study5. There are limitations to the GRNs inferred using the CellOracle strategy. For example, owing to the use of a regression model, CellOracle cannot infer autoregulatory loops. In addition, the double-negative network motif (A inhibits B and B inhibits A) was seldom predicted, potentially also due to the limitation of using a regression

model. In our opinion, instead of treating all of the cells in one population in such a static way, the pseudotime reconstruction models72–75 from the single-cell data can be used to organize the cells in a dynamic manner, which would enable time-series-related models76,77 to be used to predict the autoregulatory loops and the double-negative-network-motif-like structures. Indeed, a recent method, Dictys78, uses stochastic process modelling to infer the feedback loops. Furthermore, to have more confident GRNs, from the computational view, multiple methods from different aspects can be combined to provide diverse evidence43.
We investigated the sequence conservation of gene regulatory elements in the whole mouse brain by comparing the cCRE atlas in the mouse brain defined in the present study to a cCRE atlas obtained from a separate snATAC–seq analysis of 42 adult human brain regions in three adult male donors. We found that around 22% of cCREs defined in the current study are conserved in both sequence and in chromatin accessibility in the human brain. This modest number of conserved cCREs is probably due to the still incomplete cataloguing of cCREs in the human brain cells. Nevertheless, the cCREs showing conserved chromatin accessibility and sequence in both the mouse and human brains are clearly under evolutionary constraints and, therefore, probably possess functional importance. Consistent with previous reports, the chromatin-accessibility-conserved cCREs tend to be promoters or distal elements (probable enhancers) that display accessibility in a broader spectrum of cell types24,63. By contrast, the mouse-specific cCREs are strongly enriched for TEs, implicating a potential role of TEs in cell-type-specific gene expression patterns in the mouse brain. The finding is consistent with previous observations of TE reactivation in development and in various tissues79. Note that the strongest enrichment of TE in cCREs is observed especially in 20 Glut (excitatory) neurons from the isocortex, OLF and HPF. We speculate that TEs may contribute positively to transcriptional regulation and chromatin structure in these cells. In support of this possibility, nearly 1,300 TE-overlapping cCREs display positive correlation between chromatin accessibility and mRNA levels from potential target genes. Their putative target genes include those involved in synaptic function and synapse organization. Our results raise the interesting possibility that neural circuit diversity could be influenced by TEs during evolution.
By extracting the context information from DNA sequence, deep-learning methods have recently been used for the prediction of various genomic functional features, such as epigenetic modifications, 3D interactions and gene expression65–69. We adapted this approach to develop sequence-based models to predict the chromatin accessibility in 275 mouse brain cell subclasses. We achieved excellent performance comparable to the prediction of ATAC–seq signals from the most recent attention-based model architecture67. Although previous efforts have attempted to train deep-learning models simultaneously on multiple genomes80, evaluation of how well the sequence-based predictors trained in one species can be applied to a different species is lacking for matched cell types between species. Our results demonstrate that deep-learning models trained using open chromatin landscapes in the mouse brain cell types generalize well in the corresponding human brain cell types.

Online content
Any methods, additional references, Nature Portfolio reporting summaries, source data, extended data, supplementary information, acknowledgements, peer review information; details of author contributions and competing interests; and statements of data and code availability are available at https://doi.org/10.1038/s41586-023-06824-9.

1. BRAIN Initiative Cell Census Network (BICCN). A multimodal cell census and atlas of the mammalian primary motor cortex. Nature 598, 86–102 (2021).
2. Yao, Z. et al. A transcriptomic and epigenomic cell atlas of the mouse primary motor cortex. Nature 598, 103–110 (2021).
3. Scala, F. et al. Phenotypic variation of transcriptomic cell types in mouse motor cortex. Nature 598, 144–150 (2021).
4. Kozareva, V. et al. A transcriptomic atlas of mouse cerebellar cortex comprehensively defines cell types. Nature 598, 214–219 (2021).
5. Yao, Z. et al. A high-resolution transcriptomic and spatial atlas of cell types in the whole mouse brain. Nature https://doi.org/10.1038/s41586-023-06812-z (2023).
6. Zhang, M. et al. Molecularly defined and spatially resolved cell atlas of whole mouse brain. Nature https://doi.org/10.1038/s41586-023-06808-9 (2023).
7. Langlieb, J. et al. The cell type composition of the adult mouse brain revealed by single cell and spatial genomics. Preprint at bioRxiv https://doi.org/10.1101/2023.03.06.531307 (2023).
8. Preissl, S., Gaulton, K. J. & Ren, B. Characterizing cis-regulatory elements using single-cell epigenomics. Nat. Rev. Genet. 24, 21–43 (2023).
9. Levine, M., Cattoglio, C. & Tjian, R. Looping back to leap forward: transcription enters a new era. Cell 157, 13–25 (2014).
10. Long, H. K., Prescott, S. L. & Wysocka, J. Ever-changing landscapes: transcriptional enhancers in development and evolution. Cell 167, 1170–1187 (2016).
11. Rada-Iglesias, A. et al. A unique chromatin signature uncovers early developmental enhancers in humans. Nature 470, 279–283 (2011).
12. Batut, P. J. et al. Genome organization controls transcriptional dynamics during development. Science 375, 566–570 (2022).
13. Long, H. K. et al. Loss of extreme long-range enhancers in human neural crest drives a craniofacial disorder. Cell Stem Cell 27, 765–783 (2020).
14. Lee, T. I. & Young, R. A. Transcriptional regulation and its misregulation in disease. Cell 152, 1237–1251 (2013).
15. Su, Y. et al. Neuronal activity modifies the chromatin accessibility landscape in the adult brain. Nat. Neurosci. 20, 476–483 (2017).
16. Sinnamon, J. R. et al. The accessible chromatin landscape of the murine hippocampus at single-cell resolution. Genome Res. 29, 857–869 (2019).
17. Gorkin, D. U. et al. An atlas of dynamic chromatin landscapes in mouse fetal development. Nature 583, 744–751 (2020).
18. The ENCODE Project Consortium et al. Expanded encyclopaedias of DNA elements in the human and mouse genomes. Nature 583, 699–710 (2020).
19. Li, Y. E. et al. An atlas of gene regulatory elements in adult mouse cerebrum. Nature 598, 129–136 (2021).
20. Thornton, C. A. et al. Spatially mapped single-cell chromatin accessibility. Nat. Commun. 12, 1274 (2021).
21. Doni Jayavelu, N., Jajodia, A., Mishra, A. & Hawkins, R. D. Candidate silencer elements for the human and mouse genomes. Nat. Commun. 11, 1061 (2020).
22. Liu, H. et al. DNA methylation atlas of the mouse brain at single-cell resolution. Nature 598, 120–128 (2021).
23. Li, Y. E. et al. A comparative atlas of single-cell chromatin accessibility in the human brain. Science 382, eadf7044 (2023).
24. Roller, M. et al. LINE retrotransposons characterize mammalian tissue-specific and evolutionarily dynamic regulatory regions. Genome Biol. 22, 62 (2021).
25. Zhang, Y. et al. Single-cell epigenome analysis reveals age-associated decay of heterochromatin domains in excitatory neurons in the mouse brain. Cell Res. 32, 1008–1021 (2022).
26. Wang, Q. et al. The Allen Mouse Brain Common Coordinate Framework: a 3D reference atlas. Cell https://doi.org/10.1016/j.cell.2020.04.007 (2020).
27. Preissl, S. et al. Single-nucleus analysis of accessible chromatin in developing mouse forebrain reveals cell-type-specific transcriptional regulation. Nat. Neurosci. 21, 432–439 (2018).
28. Wolock, S. L., Lopez, R. & Klein, A. M. Scrublet: computational identification of cell doublets in single-cell transcriptomic data. Cell Syst. 8, 281–291 (2019).
29. Zhang, K., Zemke, N. R., Armand, E. J. & Ren, B. SnapATAC2: a fast, scalable and versatile tool for analysis of single-cell omics data. Preprint at bioRxiv https://doi.org/10.1101/2023.09.11.557221 (2023).
30. Thibodeau, A. et al. AMULET: a novel read count-based method for effective multiplet detection from single nucleus ATAC-seq data. Genome Biol. 22, 252 (2021).
31. Fang, R. et al. Comprehensive analysis of single cell ATAC-seq data with SnapATAC. Nat. Commun. 12, 1337 (2021).
32. Stuart, T. et al. Comprehensive integration of single-cell data. Cell 177, 1888–1902 (2019).
33. Hao, Y. H. et al. Dictionary learning for integrative, multimodal and scalable single-cell analysis. Nat. Biotechnol. https://doi.org/10.1038/s41587-023-01767-y (2023).
34. Buttner, M., Miao, Z., Wolf, F. A., Teichmann, S. A. & Theis, F. J. A test metric for assessing single-cell RNA-seq batch correction. Nat. Methods 16, 43–49 (2019).
35. Korsunsky, I. et al. Fast, sensitive and accurate integration of single-cell data with Harmony. Nat. Methods 16, 1289–1296 (2019).
36. Zhang, Y. et al. Model-based analysis of ChIP-seq (MACS). Genome Biol. 9, R137 (2008).
37. Corces, M. R. et al. The chromatin accessibility landscape of primary human cancers. Science 362, eaav1898 (2018).
38. Shen, Y. et al. A map of the cis-regulatory sequences in the mouse genome. Nature 488, 116–120 (2012).
39. The ENCODE Project Consortium. A user’s guide to the encyclopedia of DNA elements (ENCODE). PLoS Biol. 9, e1001046 (2011).
40. The ENCODE Project Consortium. An integrated encyclopedia of DNA elements in the human genome. Nature 489, 57–74 (2012).
41. Liu, H. et al. Single-cell DNA methylome and 3D multi-omic atlas of the adult mouse brain. Nature https://doi.org/10.1038/s41586-023-06805-y (2023).
42. Kim, H. & Park, H. Sparse non-negative matrix factorizations via alternating non-negativity-constrained least squares for microarray data analysis. Bioinformatics 23, 1495–1502 (2007).
43. Badia-i-Mompel, P. et al. Gene regulatory network inference in the era of single-cell multi-omics. Nat. Rev. Genet. https://doi.org/10.1038/s41576-023-00618-5 (2023).
44. Pliner, H. A. et al. Cicero predicts cis-regulatory DNA interactions from single-cell chromatin accessibility data. Mol. Cell 71, 858–871 (2018).



45. Heinz, S. et al. Simple combinations of lineage-determining transcription factors prime cis-regulatory elements required for macrophage and B cell identities. Mol. Cell 38, 576–589 (2010).
46. Parker, K. R. et al. Single-cell analyses identify brain mural cells expressing CD19 as potential off-tumor targets for CAR-T immunotherapies. Cell 183, 126–142 (2020).
47. Masuda, T. et al. IRF8 is a critical transcription factor for transforming microglia into a reactive phenotype. Cell Rep. 1, 334–340 (2012).
48. Kierdorf, K. et al. Microglia emerge from erythromyeloid precursors via Pu.1- and Irf8-dependent pathways. Nat. Neurosci. 16, 273–280 (2013).
49. Rustenhoven, J. et al. PU.1 regulates Alzheimer’s disease-associated genes in primary human microglia. Mol. Neurodegener. 13, 44 (2018).
50. Pourcet, B. et al. LXRα regulates macrophage arginase 1 through PU.1 and interferon regulatory factor 8. Circ. Res. 109, 492–501 (2011).
51. Langlais, D., Barreiro, L. B. & Gros, P. The macrophage IRF8/IRF1 regulome is required for protection against infections and is associated with chronic inflammation. J. Exp. Med. 213, 585–603 (2016).
52. Kamimoto, K. et al. Dissecting cell identity via network inference and in silico gene perturbation. Nature 614, 742–751 (2023).
53. van Heeringen, S. J. & Veenstra, G. J. GimmeMotifs: a de novo motif prediction pipeline for ChIP-sequencing experiments. Bioinformatics 27, 270–271 (2011).
54. Watts, D. J. & Strogatz, S. H. Collective dynamics of ‘small-world’ networks. Nature 393, 440–442 (1998).
55. Shen-Orr, S. S., Milo, R., Mangan, S. & Alon, U. Network motifs in the transcriptional regulation network of Escherichia coli. Nat. Genet. 31, 64–68 (2002).
56. Shoval, O. & Alon, U. SnapShot: network motifs. Cell 143, 326 (2010).
57. Labzin, L. I. et al. ATF3 is a key regulator of macrophage IFN responses. J. Immunol. 195, 4446–4455 (2015).
58. Feinberg, M. W. et al. The Kruppel-like factor KLF4 is a critical regulator of monocyte differentiation. EMBO J. 26, 4138–4148 (2007).
59. Feinberg, M. W. et al. Kruppel-like factor 4 is a mediator of proinflammatory signaling in macrophages. J. Biol. Chem. 280, 38247–38258 (2005).
60. Dey, S., Shi, Y. B. & Brandt, S. J. Novel function of the TAL1/SCL transcription factor in differentiation of murine bone marrow monocytes. Blood 108, 1272 (2006).
61. Dey, S., Curtis, D. J., Jane, S. M. & Brandt, S. J. The TAL1/SCL transcription factor regulates cell cycle progression and proliferation in differentiating murine bone marrow monocyte precursors. Mol. Cell. Biol. 30, 2181–2192 (2010).
62. Zheng, R. et al. Cistrome Data Browser: expanded datasets and new tools for gene regulatory analysis. Nucleic Acids Res. 47, D729–D735 (2019).
63. Sarropoulos, I. et al. Developmental and evolutionary dynamics of cis-regulatory elements in mouse cerebellar cells. Science 373, eabg4696 (2021).
64. Sookdeo, A., Hepp, C. M., McClure, M. A. & Boissinot, S. Revisiting the evolution of mouse LINE-1 in the genomic era. Mob. DNA 4, 3 (2013).
65. Kelley, D. R. et al. Sequential regulatory activity prediction across chromosomes with convolutional neural networks. Genome Res. 28, 739–750 (2018).
66. Fudenberg, G., Kelley, D. R. & Pollard, K. S. Predicting 3D genome folding from DNA sequence with Akita. Nat. Methods 17, 1111–1117 (2020).
67. Avsec, Z. et al. Effective gene expression prediction from sequence by integrating long-range interactions. Nat. Methods 18, 1196–1203 (2021).
68. Lotfollahi, M. et al. Mapping single-cell data to reference atlases by transfer learning. Nat. Biotechnol. 40, 121–130 (2022).
69. Chen, K. M., Wong, A. K., Troyanskaya, O. G. & Zhou, J. A sequence-based global map of regulatory activity for deciphering human genetics. Nat. Genet. 54, 940–949 (2022).
70. Kelley, D. R., Snoek, J. & Rinn, J. L. Basset: learning the regulatory code of the accessible genome with deep convolutional neural networks. Genome Res. 26, 990–999 (2016).
71. Wong, E. S. et al. Deep conservation of the enhancer regulatory code in animals. Science 370, eaax8137 (2020).
72. Trapnell, C. et al. The dynamics and regulators of cell fate decisions are revealed by pseudotemporal ordering of single cells. Nat. Biotechnol. 32, 381–386 (2014).
73. Ji, Z. & Ji, H. TSCAN: pseudo-time reconstruction and evaluation in single-cell RNA-seq analysis. Nucleic Acids Res. 44, e117 (2016).
74. Saelens, W., Cannoodt, R., Todorov, H. & Saeys, Y. A comparison of single-cell trajectory inference methods. Nat. Biotechnol. 37, 547–554 (2019).
75. Van den Berge, K. et al. Trajectory-based differential expression analysis for single-cell sequencing data. Nat. Commun. 11, 1201 (2020).
76. Chen, T., He, H. L. & Church, G. M. Modeling gene expression with differential equations. Pac. Symp. Biocomput. 1999, 29–40 (1999).
77. Ma, B., Fang, M. & Jiao, X. Inference of gene regulatory networks based on nonlinear ordinary differential equations. Bioinformatics 36, 4885–4893 (2020).
78. Wang, L. et al. Dictys: dynamic gene regulatory network dissects developmental continuum with single-cell multiomics. Nat. Methods 20, 1368–1378 (2023).
79. Fueyo, R., Judd, J., Feschotte, C. & Wysocka, J. Roles of transposable elements in the regulation of mammalian transcription. Nat. Rev. Mol. Cell Biol. 23, 481–497 (2022).
80. Kelley, D. R. Cross-species regulatory sequence activity prediction. PLoS Comput. Biol. 16, e1008050 (2020).
81. McInnes, L., Healy, J., Saul, N. & Großberger, L. UMAP: uniform manifold approximation and projection. J. Open Source Softw. 3, 861 (2018).
82. Ramírez, F. et al. deepTools2: a next generation web server for deep-sequencing data analysis. Nucleic Acids Res. 44, W160–W165 (2016).

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

© The Author(s) 2023

Methods

Tissue preparation and nucleus isolation
All experimental procedures using live animals were approved by the Salk Institute Animal Care and Use Committee under protocol number 18-00006. Adult C57BL/6J male mice were purchased from Jackson Laboratories. Brains were extracted from 56–63-day-old mice and sectioned into 600 µm coronal sections along the anterior–posterior axis in ice-cold dissection medium2,83. Specific brain regions were dissected according to the Allen Brain Reference Atlas26 (Extended Data Fig. 1) and nuclei were isolated as described previously26. For each region, dissected brain tissues were pooled from 2–31 animals of the same sex (only 2 dissections from the mouse CB region had 2 animals for snATAC–seq library construction; all of the other samples had 4–31 animals) to obtain enough nuclei for snATAC–seq for each biological replicate, and two biological replicates were performed. We used the same fluorescence-activated cell sorting (FACS) sequential gating/sorting strategy and the Sony SH800S software as in our previous study19.

snATAC–seq library preparation and sequencing
snATAC–seq libraries were generated as described using version 2 indexing19. PCR amplification was performed for 11 or 12 cycles. A step-by-step protocol for library preparation is available online (https://doi.org/10.17504/protocols.io.4zzgx76). Libraries were sequenced using a HiSeq 2500 (Illumina), a HiSeq 4000 (Illumina) or a NovaSeq 6000 (Illumina) system with the following settings: 50 + 10 + 12 + 50 (read1 + index1 + index2 + read2).

Processing and alignment of sequencing reads
Paired-end sequencing reads were demultiplexed and the cell index was transferred to the read name. Sequencing reads were aligned to the mm10 reference genome using bwa84. After alignment, we checked the fragment length distribution, which is characteristic for ATAC–seq libraries (Extended Data Fig. 2e), for each of the 234 samples. We then combined the sequencing reads to fragments using the make_fragment_file function of SnapATAC229 and, for each fragment, we applied the following quality control criteria: (1) retain only fragments with quality scores MAPQ > 30; (2) remove PCR duplicates. Reads were also sorted on the basis of cell barcodes in read names, and shifted +4 bp for positive strand and −5 bp for negative strand to correct the 9 bp duplication induced from Tn5 transposase85 during processing.

TSSe calculation
Enrichment of ATAC–seq accessibility at TSSs was used to quantify data quality without the need for a defined peak set. We followed a previously described procedure86, and used the function filter_cells in SnapATAC2 to calculate TSS enrichment (TSSe). TSS positions were obtained from the GENCODE87 database v.16. In brief, Tn5-corrected insertions (reads aligned to the positive strand were shifted +4 bp and reads aligned to the negative strand were shifted –5 bp) were aggregated ±2,000 bp relative (TSS-strand-corrected) to each unique TSS genome wide. This profile was then normalized to the mean accessibility ±1,900–2,000 bp from the TSS and smoothed every 11 bp. The maximum of the smoothed profile was taken as the TSSe.

add_tile_matrix function to add the 500 bp genomic bin features, then used the select_features function to filter out the features with frequencies along the samples of lower than 0.5% or higher than 99.5%. We then applied the scrublet function of SnapATAC2 to get the doublet scores. The parameter expected_doublet_rate was set to 0.08, which is based on our previous experiment on the snATAC–seq pipeline19. Barcodes with scrublet scores of greater than 0.5 were treated as potential doublets and removed from our analysis.
We compared Scrublet with another recently published method named AMULET30, which is used for doublet detection and removal in snATAC–seq data. We simulated datasets containing singlets and artificial doublets from eight samples in the primary motor area and evaluated the performances of the two methods using the precision-recall curve (PRC) and the area under the PRC (AUPRC).

Iterative cell clustering
After nucleus filtering by quality control and doublet removal, we adapted a four-round iterative clustering using SnapATAC2 for later identification of cell-type-specific cCREs (Extended Data Fig. 4a). The following basic procedure was used. For the first round of clustering (L1-level clustering), we used all of the 2.3 million nuclei to perform the standard clustering. At the second round (L2-level), for each of the 37 clusters above, we performed independent clustering. At the third round (L3-level), for each of the 248 clusters above, we performed independent clustering again. At the fourth round of clustering (L4-level clustering), we clustered only the L3-level clusters containing no fewer than 400 cells. The details are as follows.

Feature selection. We applied the function add_tile_matrix from SnapATAC2 to extract the cell by genomic bin count matrix. The size of a consecutive genomic region was chosen as 500 bp. We filtered out any bins overlapping with the ENCODE blacklist and removed the top 0.5% and tail 0.5% bins based on the read coverage from the count matrix. Only chromosomes 1–19, X and Y were considered. For our L1-level clustering, we used all of the bin features (over 4 million) that passed the criteria above as non-neuronal cells and diverse neuronal cells were included. For clustering of other levels, we chose the default top 500,000 features using the function select_features of SnapATAC2.

Dimensionality reduction. We applied the function spectral from SnapATAC2 to convert the high-dimension sparse 500 bp genomic bin features per cell into low dimensional representations, which used spectral embedding of the normalized graph Laplacian defined by the cell-to-cell similarity matrix using cosine distance. For L1-level and L2-level clustering, we chose 50 as the dimension of the low-dimensional representation space as usually a large number of cells and potentially diverse cell types were involved in the two levels. We used ‘elbow plot’ to rank all of the principal components to make sure that the top 50 components were sufficient for our analysis. For later analysis, we chose 30 instead. The parameter ‘weighted_by_sd’ in the function spectral was set to be true for all dimensionality reduction. We did not use the parameter ‘sample_size’ in the function spectral, so no approximation method was used for the spectral embedding. For 2.3 million cells, it took about 300 GB memory in our high-performance computing system88.
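The procedure described under "TSSe calculation" can be illustrated with a minimal NumPy sketch. It is not the SnapATAC2 implementation: the function names `tn5_shift` and `tsse` are hypothetical, the 11 bp smoothing is taken to be a running mean, and the normalization uses the outermost 100 bp on each side of the ±2,000 bp window, which is one reading of "±1,900–2,000 bp".

```python
import numpy as np

def tn5_shift(pos, strand):
    """Shift a read start by +4 bp (positive strand) or -5 bp (negative strand)."""
    return pos + 4 if strand == "+" else pos - 5

def tsse(insertions, tss_list, flank=2000, edge=100, window=11):
    """TSS enrichment from Tn5-corrected insertion positions (illustrative).

    insertions: genomic coordinates of insertions on one chromosome.
    tss_list:   (position, strand) tuples on the same chromosome.
    """
    profile = np.zeros(2 * flank + 1)
    ins = np.sort(np.asarray(list(insertions), dtype=np.int64))
    for tss, strand in tss_list:
        lo, hi = np.searchsorted(ins, [tss - flank, tss + flank + 1])
        rel = ins[lo:hi] - tss
        if strand == "-":
            rel = -rel  # orient every TSS so that downstream is positive
        np.add.at(profile, rel + flank, 1)
    # Assumption: background is the mean over the outer `edge` bp on both sides.
    background = np.concatenate([profile[:edge], profile[-edge:]]).mean()
    if background == 0:
        return 0.0  # assumption: report 0 when enrichment is undefined
    # Assumption: "smoothed every 11 bp" is implemented as a running mean.
    smooth = np.convolve(profile / background, np.ones(window) / window, mode="same")
    return float(smooth.max())
```

A uniform background with a tenfold excess of insertions in the 11 bp centred on the TSS yields a TSSe of about 11, which is above the threshold of 10 used for nucleus filtering.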

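The four-round scheme described under "Iterative cell clustering" can be outlined as a level-by-level loop. This is a sketch of the control flow only: `cluster_fn` is a hypothetical stand-in for the per-level feature selection, spectral embedding and Leiden clustering, and the hierarchical label format is chosen purely for illustration. The rule of at least 400 cells is applied only at the last level, as stated in the text.

```python
from typing import Callable, Dict, List, Sequence

def iterative_clustering(
    cells: Sequence[int],
    cluster_fn: Callable[[List[int], int], List[int]],
    n_levels: int = 4,
    min_cells_last_level: int = 400,
) -> Dict[int, str]:
    """Assign each cell a hierarchical label such as '3-1-0-2'.

    cluster_fn(cell_ids, level) is a hypothetical callable that returns one
    integer cluster label per cell for the given clustering round.
    """
    groups = {"": list(cells)}  # label prefix -> member cells
    for level in range(1, n_levels + 1):
        next_groups: Dict[str, List[int]] = {}
        for prefix, members in groups.items():
            is_last = level == n_levels
            if is_last and len(members) < min_cells_last_level:
                # Small L3-level clusters are kept as final subtypes.
                next_groups[prefix] = members
                continue
            labels = cluster_fn(members, level)
            for cell, lab in zip(members, labels):
                key = f"{prefix}-{lab}" if prefix else str(lab)
                next_groups.setdefault(key, []).append(cell)
        groups = next_groups
    return {cell: key for key, members in groups.items() for cell in members}
```

With this layout, the final subtypes are a mixture of L3-level labels (three fields) and L4-level labels (four fields), mirroring how subtypes are defined in the text.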
Nucleus filtering by quality control
Nuclei with ≥1,000 uniquely mapped fragments and TSSe ≥ 10 were retained for each of 234 samples according to the ENCODE ATAC–seq data standards and process pipeline (https://www.encodeproject.org/atac-seq/). We used the filter_cells function of SnapATAC2 to achieve this.

Doublet removal
We used a modified version of Scrublet28 to remove potential doublets for every sample independently using SnapATAC2. First, we used the

Graph-based clustering. We then applied the function knn from SnapATAC2 to construct the k-nearest neighbour graph using the parameter n_neighbors = 50, and the parameter method was set to ‘kdtree’. We next used the function leiden of SnapATAC2 for clustering with the parameter objective_function set as modularity. The parameter resolution, which strongly affects the number of clusters, was selected from 0.1 to 2 with a step size of 0.1 based on the silhouette coefficient89 using the Python package Scikit-learn90. We also manually checked the UMAP81 for each clustering result to make sure that the resolution corresponding to the top silhouette coefficient was suitable. UMAP projections were calculated using the Python package umap with the parameters a as 1.8956, b as 0.8005 and init as spectral. All of the resolution parameters during clustering are provided in Supplementary Table 3. In our later analysis, we used the term subtypes to represent all of the final clusters from L3-level clustering and L4-level clustering.

For every cell cluster above, we combined all properly paired reads to generate a pseudobulk ATAC–seq dataset for individual biological replicates. Moreover, we generated two pseudoreplicates comprising half of the reads from each biological replicate. We called peaks for each of the four datasets and a pool of both replicates independently.
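The resolution scan described under "Graph-based clustering" (0.1 to 2 in steps of 0.1, ranked by silhouette coefficient) can be sketched as follows. `cluster_fn` is a hypothetical placeholder for a Leiden call at a given resolution, and the silhouette is computed here with a small NumPy routine on Euclidean distances rather than with Scikit-learn, purely to keep the example self-contained. Ties are resolved in favour of the lowest resolution, which is an assumption.

```python
import numpy as np

def silhouette(X, labels):
    """Mean silhouette coefficient using Euclidean distances (illustrative)."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    uniq = np.unique(labels)
    if len(uniq) < 2:
        return -1.0  # assumption: a single cluster gets the worst score
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    scores = np.zeros(len(X))
    for i in range(len(X)):
        same = labels == labels[i]
        n_same = int(same.sum())
        if n_same == 1:
            continue  # singleton clusters score 0 by convention
        a = dist[i, same].sum() / (n_same - 1)  # mean intra-cluster distance
        b = min(dist[i, labels == c].mean() for c in uniq if c != labels[i])
        denom = max(a, b)
        scores[i] = 0.0 if denom == 0 else (b - a) / denom
    return float(scores.mean())

def select_resolution(X, cluster_fn, start=0.1, stop=2.0, step=0.1):
    """Return (best_resolution, best_score) over the scanned grid.

    cluster_fn(resolution) is a hypothetical callable returning one label per row of X.
    """
    n_steps = int(round((stop - start) / step)) + 1
    grid = [round(start + k * step, 10) for k in range(n_steps)]
    scored = [(silhouette(X, cluster_fn(r)), r) for r in grid]
    best_score, best_res = max(scored, key=lambda t: t[0])
    return best_res, best_score
```

As in the text, the selected resolution would still be checked by eye on the UMAP, because the silhouette coefficient alone can favour overly coarse partitions.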
Peak calling was performed on the Tn5-corrected single-base insertions using MACS236 with the following parameters: --shift -75 --extsize 150 --nomodel --call-summits --SPMR -q 0.01. Finally, we extended peak summits by 250 bp on either side to a final width of 501 bp for merging and downstream analysis. If the number of cells in any of the pseudobulk ATAC–seq datasets from either individual biological replicates or individual pseudoreplicates was fewer than 200, we did not run MACS2 for it. We did this to reduce the potential false negatives during the next filtering step induced by the limited number of cells in the replicates.
To generate a list of reproducible peaks, we retained peaks that (1) were detected in the pooled dataset and overlapped ≥50% of peak length with a peak in both individual replicates or (2) were detected in the pooled dataset and overlapped ≥50% of peak length with a peak in both pseudoreplicates.
We found that, when the cell population varied in read depth or number of nuclei, the MACS2 score varied proportionally due to the nature of the Poisson distribution test in MACS219. Ideally, we should perform a reads-in-peaks normalization but, in practice, this type of normalization is not possible because we do not know how many peaks we will get. To account for differences in the performance of MACS2 based on read depth and/or number of nuclei in individual clusters, we converted MACS2 peak scores (−log10[q]) to score per million (SPM)37. We filtered reproducible peaks by choosing an SPM cut-off of 5.
We then retained only reproducible peaks on chromosomes 1–19 and both sex chromosomes and filtered ENCODE mm10 blacklist regions.

Integration analysis with scRNA-seq data
We performed integration analysis of the 1,482 subtypes with all of the more than 5,300 clusters reported in a companion scRNA-seq study of 4.5 million cells for the whole adult mouse brain5. Only cells from male mice were considered in the scRNA-seq data, which amounts to over 2 million cells. The scRNA-seq data are mainly from 10x v.2 and 10x v.3 platforms, and only a few thousand cells are from snRNA-seq. On the basis of our integration analysis, we did not see significant differences between using 10x v.3 alone and using all of them. Very few cell clusters were found using the 10x v.2 but not using the 10x v.3 platform. We therefore used all of the cells without distinguishing their platform information in the later analysis.
We first imputed RNA expression levels according to the chromatin accessibility of the gene promoter (up to 2 kb to TSSs) and gene body as described previously32 using the function make_gene_matrix in SnapATAC2. We next performed integration analysis using Seurat32 for neuronal cells and non-neuronal cells separately. For neuronal cells, in the scRNA-seq data, we randomly selected 50 cells for each of over 5,100 clusters, and finally got more than 200,000 cells. To have a comparable number of cells in our snATAC–seq data, we randomly selected 150 cells for each of over 1,260 L4-level neuronal subtypes and got over 180,000 nuclei. For non-neuronal cells, we sampled 500 cells per cluster and got 35,000 cells in the scRNA-seq data. For the snATAC–seq, we sampled 300 cells per L4-level subtype, and got over 57,000 nuclei.
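The per-cluster downsampling used to balance the scRNA-seq and snATAC–seq inputs before integration (for example, 50 cells per scRNA-seq cluster and 150 nuclei per neuronal L4-level subtype) can be sketched as below. The function name and the handling of clusters smaller than the requested size (kept whole) are assumptions, as the text does not specify them.

```python
import random
from collections import defaultdict

def sample_per_cluster(cell_to_cluster, n_per_cluster, seed=0):
    """Randomly keep at most n_per_cluster cells from every cluster.

    cell_to_cluster: dict mapping a cell barcode to its cluster label.
    Assumption: clusters with fewer cells than n_per_cluster are kept whole.
    """
    rng = random.Random(seed)
    by_cluster = defaultdict(list)
    for cell, cluster in cell_to_cluster.items():
        by_cluster[cluster].append(cell)
    selected = []
    for cluster in sorted(by_cluster):
        cells = sorted(by_cluster[cluster])  # fixed order for reproducibility
        k = min(n_per_cluster, len(cells))
        selected.extend(rng.sample(cells, k))
    return selected
```

Fixing the seed makes the subsample reproducible, which matters when label-transfer results are compared across parameter settings such as different k.anchor values.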
For the variable features, we applied the >8,000 genes from differential expression analysis in the scRNA-seq study5, and used their data as the reference. We next applied canonical correlation analysis for integration using Seurat v.5. Canonical correlation analysis was recommended for the cross-modality integration, which indeed showed more promising results than reciprocal principal component analysis in our experiments. Seurat v.5 is specifically designed to handle large-scale datasets and is especially important for our scenario. We used the function FindTransferAnchors with the parameter k.anchor as 50 for single-cell level label transfer. k.anchor is important for large-scale data integration as mentioned in Seurat. The default k.anchor value is 5 for that function, and we tested k.anchor as 5, 10, 30, 50, 70, 100 and 120; a k.anchor value of 50 showed more reliable results compared with others. For UMAP visualization, we used the FindIntegrationAnchors function of Seurat, and then calculated UMAP based on the co-embedding space. It was also recommended by Seurat to perform integration in this manner. The transfer label score for a given L4-level subtype in our snATAC–seq data is a numeric vector, where each element is the number of cells annotated as the corresponding cluster in the scRNA-seq data divided by the number of cells in that L4-level subtype. For each L4-level subtype, we used the corresponding top 3 clusters in the scRNA-seq data as the candidate annotations, then mapped the three clusters to the subclasses defined in the scRNA-seq data, and manually checked whether they were consistent on mouse brain major regions and gene markers.

Identification of reproducible peak sets in each cell cluster
We performed peak calling according to the ENCODE ATAC–seq pipeline (https://www.encodeproject.org/atac-seq/) on 1,482 L4-level subtypes and used the same procedure to filter the peaks at both the bulk and single-cell level (Extended Data Fig. 9a) as in our previous study19. Before calling peaks, we merged clusters with fewer than 200 cells if they shared the same cell cluster annotation based on the preceding integration analysis and were in the same L3-level cluster. Next, 1,463 subtypes (including merged ones) were used.

A union peak list for the whole dataset was obtained by merging peak sets from all of the cell clusters using BEDtools91.
Finally, as snATAC–seq data are very sparse, we selected only elements that were identified as open chromatin in a significant fraction of the cells in each cluster. To this end, we first randomly selected the same number of non-DHS regions from the genome as background using the shuffleBed function of BEDtools, and calculated the fraction of nuclei for each cell type that showed a signal at these sites. We next fitted a zero-inflated β-model, and empirically identified a significance threshold of FDR < 0.01 to filter potential false positive peaks. Peak regions with FDR < 0.01 in at least one of the clusters were included in downstream analysis. Given one cell subclass, we treat all of the peaks from the subtypes mapped to this subclass as the peaks for the subclass.

Identification of cis-regulatory modules
We used NMF42 to group cCREs into cis-regulatory modules on the basis of their relative accessibility across major clusters. We adapted NMF (Python package sklearn90) to decompose the cCRE-by-cluster matrix V (N × M, N rows: cCRE, M columns: cell clusters) into a coefficient matrix H (R × M, R rows: number of modules) and a basis matrix W (N × R), with a given rank R19:

V ≈ WH

The basis matrix defines module-related accessible cCREs, and the coefficient matrix defines the cell cluster components and their weights in each module. The key issue to decompose the occupancy profile matrix was to find a reasonable value for the rank R (that is, the number of modules). Several criteria have been proposed to decide whether a given rank R decomposes the occupancy profile matrix into meaningful clusters. Here we applied two measurements, Sparseness92 and Entropy42, to evaluate the clustering result. Average values were calculated from five NMF runs at each given rank with a random seed, which ensures that the measurements are stable (Extended Data Fig. 9f).
We next used the coefficient matrix to associate modules with distinct cell clusters. In the coefficient matrix, each row represents a module, and each column represents a cell cluster. The values in the matrix
indicate the weights of the clusters in their corresponding module. The coefficient matrix was then scaled by column (cluster) from 0 to 1. Subsequently, we used a coefficient > 0.1 (~95th percentile of the whole matrix) as a threshold to associate a cluster with a module.
Moreover, we associated each module with accessible elements using the basis matrix. For each element and each module, we derived a basis coefficient score, which represents the accessible signal contributed by all clusters in the defined module. We also implemented and calculated a basis-specificity score, called the feature score, for each accessible element using the kim method42. The feature score ranges from 0 to 1. A high feature score means that a distinct element is specifically associated with a specific module. Only features that fulfilled both of the following criteria were retained as module-specific elements: (1) feature score greater than median + 3 s.d.; (2) the maximum contribution to a basis component is greater than the median of all contributions (that is, of all elements of W).

Inference of cis-co-accessible cCREs
Cis-co-accessible cCREs were predicted for all open regions in each of the 275 cell subclasses separately using Cicero for Monocle 3 (refs. 72,93) with the default parameters and the mouse mm10 genome, scanning the mouse genome with a window size of 500 kb. For each subclass, we randomly selected 5,000 nuclei, and used all of the nuclei for cell clusters with <5,000 nuclei. Only one subclass, annotated as ‘Hypendymal_NN’, failed when running Cicero; it had 92 nuclei in total and showed the smallest number of peaks (fewer than 5,000) of all of the subclasses. To find an optimal co-accessibility threshold for each subclass, we randomly shuffled the columns of the cell-by-cCREs matrix (that is, the cCREs) in the cells as the background and identified co-accessibility regions from this shuffled matrix. A normal distribution was then fitted to the co-accessibility scores from the shuffled background using the R package fitdistrplus94. Co-accessible cCRE pairs were retained only if their co-accessibility scores were significantly larger than the background (FDR < 0.001 using Benjamini–Hochberg adjustment). cCREs outside ±1 kb of TSSs in GENCODE mm10 (version 23) were treated as distal cCREs, and the others as proximal ones. All of the cis-co-accessible cCREs were then grouped into three classes: proximal-to-proximal, distal-to-distal and distal-to-proximal pairs. In our study, we focused only on distal-to-proximal pairs.

Enrichment analysis of FIREs
We called frequently interacting regions (FIREs) in the mouse cortex38 by applying the criteria in our group’s FIRE paper95. The result showed that most FIREs (3,158 out of 3,169) overlap with cCREs in the mouse brain, and a fraction of the cCREs (71,626 out of 1,053,811) overlap with FIREs (Extended Data Fig. 10e).
We next tested whether cCREs are enriched at FIREs through permutation analysis. In brief, we shuffled the mouse genome 1,000 times, each time generating 1,053,811 random regions with sizes equivalent to those of the cCREs. We then calculated the number of overlaps between the randomly generated regions and the FIREs during each shuffle. We found that cCREs are significantly enriched at FIREs (P < 0.001; Extended Data Fig. 10f), with the actual number of overlaps on FIREs substantially higher than expected.

Motif enrichment
We performed both de novo and known motif-enrichment analysis using Homer45.

Enrichment analysis of chromatin conformation
We cross-referenced the dataset from the companion study41, in which a comprehensive chromatin conformation/methylome joint profile throughout the adult mouse brain is described, and most of the subclass annotations (244 subclasses of 275 subclasses in our data) are shared between these two datasets.
To evaluate the confidence of identified subclass-specific cCRE–gene pairs, we randomly selected 11 major subclasses (Sst_GABA, Pvalb_GABA, CBX_MLI_Megf11_GABA, Vip_GABA, CA1-ProS_Glut, CB_granule_Glut, L6_CT_CTX_Glut, L2-3_IT_CTX_Glut, Astro-TE_NN, Microglia_NN, Bergmann_NN), and calculated the Hi-C signal enrichment (at 1 kb resolution) at the top 20% subclass-specific cCRE–gene pair anchors identified in this study. We found significantly higher enrichment (P = 0.004) of chromatin interaction signal at the corresponding subclass-specific cCRE–gene pair anchors, compared with non-corresponding pair anchors (Extended Data Fig. 10g), suggesting that subclass-specific cCRE–gene pairs are more likely to interact in the cell types in which the cCREs are active.
Meanwhile, we selected the two peak modules that show global accessibility across the subclasses based on the NMF analysis (Fig. 2f (top left)). We then selected all of the proximal–distal connections with cCREs in these peak modules and ranked the proximal–distal connections by their highest Cicero scores. We treated them as global proximal–distal connections and computed the Hi-C signal by aggregating all of the Hi-C data. From the heat maps (Extended Data Fig. 10h), we observed strong enrichment signals for the global proximal–distal connections.

Predicting GRNs for each cell subclass
We applied the recently published Python package CellOracle52 to our data to infer GRNs for each cell subclass across the whole mouse brain based on our integration analysis between our snATAC–seq data and the scRNA-seq data5. Three steps were followed. First, we identified the co-accessible distal-to-proximal pairs for each subclass, as described above. Second, we mapped the distal cCREs to TFs. Lastly, we identified the regulatory relationships between TFs and the potential target genes by fitting a regularized linear regression model using scRNA-seq data. For the second step, according to the CellOracle tutorial, we used the Python package gimmemotifs53 for the TF-binding-motif scan with the mouse genome mm10 and the default motif database provided by CellOracle. The proximal cCREs were mapped to the genes based on GENCODE mm10 (v.23, the same as above). We used Seurat32 to randomly sample 1,000 cells per subclass (all of the cells of a cell subclass were used if it had <1,000 cells). To select the variable features, we used the FindVariableFeatures function of Seurat to select the top 3,000 genes, and then we manually added the 499 TFs (if any of them were missed in the previous 3,000 genes) that were reported in the scRNA-seq data of ref. 5. For each subclass, we ran CellOracle on the scRNA-seq data with the default parameters. We used P < 0.001 and the top 10,000 edges based on the absolute values of the weights to filter the predicted interactions between TFs and genes, as suggested by CellOracle. Finally, GRNs were successfully predicted for 267 of the 275 subclasses.

Sequence conserved, chromatin accessibility conserved and mouse-specific cCREs
The orthologous cCREs of the mouse brain in the human genome were identified by performing reciprocal homology searches using the liftover tool96. The mouse cCREs for which human genome sequences had high similarity (more than 50% of bases lifted over to the mouse genome) were defined as orthologous cCREs. We next compared these orthologous cCREs in the mouse brain with our previously identified cCREs in the human brain23. Those orthologous cCREs that were both DNA sequence conserved across species and had open chromatin in orthologous regions were defined as chromatin-accessibility-conserved cCREs. The other orthologous cCREs, which were only sequence conserved to orthologous regions but had not been identified as open chromatin regions in other species, were defined as chromatin-accessibility-divergent cCREs. Mouse-specific cCREs were those for which no orthologous regions could be found in the human genome.
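The three cCRE categories defined above (chromatin-accessibility-conserved, chromatin-accessibility-divergent and mouse-specific) can be illustrated with a minimal Python sketch. It assumes that the reciprocal liftover and the overlap with human brain cCREs have already been computed for each mouse cCRE; the record fields, the function names and the toy values are hypothetical and are not taken from the study's code. Treating exactly 50% as not orthologous is also an assumption, based on reading 'more than 50%' as a strict inequality.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CcreRecord:
    """Hypothetical per-cCRE summary; field names are illustrative only."""
    name: str
    lifted_fraction: float      # fraction of bases recovered by reciprocal liftover (0 to 1)
    overlaps_human_ccre: bool   # orthologous region overlaps a human brain cCRE


def classify_ccre(rec: CcreRecord, min_fraction: float = 0.5) -> str:
    """Assign one of the three categories named in the text."""
    # Assumption: "more than 50% of bases lifted over" is a strict inequality,
    # so a cCRE at or below the threshold has no orthologous region.
    if rec.lifted_fraction <= min_fraction:
        return "mouse-specific"
    # Orthologous and open in the human brain.
    if rec.overlaps_human_ccre:
        return "CA-conserved"
    # Orthologous by sequence only.
    return "CA-divergent"


def count_categories(records):
    """Tally how many cCREs fall into each category."""
    counts = {"CA-conserved": 0, "CA-divergent": 0, "mouse-specific": 0}
    for rec in records:
        counts[classify_ccre(rec)] += 1
    return counts


# Toy records, made up for illustration.
toy = [
    CcreRecord("peak_1", 0.92, True),
    CcreRecord("peak_2", 0.75, False),
    CcreRecord("peak_3", 0.10, False),
    CcreRecord("peak_4", 0.50, True),
]
print(count_categories(toy))
```

In this sketch the sequence test is applied before the accessibility test, so a cCRE that fails the liftover threshold is labelled mouse-specific regardless of any overlap flag.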

TE analysis
TE annotation of cCREs was performed using Homer45 and the UCSC mm10 refGene and RepeatMasker annotations. To define subclasses with a high TE-cCRE fraction, we fitted a mixture model for the TE-cCRE fraction across all subclasses using the R package mixtools97 (v.2.0.0). The P value was calculated based on the null distribution.
To annotate the TE-cCREs, we used two strategies. One was based on the genomic regions: we mapped the TE-cCREs to genes within 3 kb flanking regions using the R package ChIPseeker98 (v.1.34.1). The other linked genes to TE-cCREs on the basis of the correlation between cCREs and genes. For each GO test, we also filtered unexpressed genes in defined subclasses based on the single-cell RNA-seq data (see the companion manuscript5). The DCA of TE-cCREs between groups was calculated using the Wilcoxon rank-sum test. Motif-enrichment analysis of TE-cCREs was performed using Homer software with the ‘given size’ parameter.
To analyse the TE-accessible variability with decreased noise, the TE signal was aggregated from the TE-cCREs. To calculate the correlation between chromatin accessibility and mCG methylation in TEs across subclasses, we averaged and normalized the TE-cCRE mCG signal for each TE in matched subclasses from the companion paper41. To calculate the correlation between chromatin accessibility and RNA expression, we aggregated RNA signals at TE-cCREs of each TE in matched subclasses from a previous study99.

GO enrichment
We performed GO enrichment analysis using the R package clusterProfiler100,101. The background genes were selected on the basis of the enrichment analysis and are described in the text. The P value was computed using the Fisher exact test and adjusted for multiple comparisons using the Benjamini–Hochberg method.

Deep-learning model
Our model was trained on all 275 subclasses annotated based on the integration with the scRNA-seq data. We generated aggregated genome signal tracks in bigwig format by running MACS2 (ref. 36). The training, validation and testing datasets were generated using the script basenji_data.py from Basenji65 with the parameters: “-b mm10.blacklist.bed -l 131072 --local -p 16 -t 0.1 -v 0.1 -w 128”.
The model architecture, layers and parameters are adapted from the mouse model from a previous study80, with modification only in the last output head layer with parameter: “units”: 275. To encourage the model to predict cCREs in under-represented cell types, we created one novel loss function:

$$w_{i,i} = \operatorname{cov}\left(y_{\mathrm{true}}(i,i)\right)$$

$$w = \sum_{i=1}^{n} w_{ii}/n$$

$$\text{Poisson loss} = y_{\mathrm{pred}}(i,j) - y_{\mathrm{true}}(i,j) \times \log\left(y_{\mathrm{pred}}(i,j)\right)$$

$$\text{loss function} = w \cdot \text{Poisson loss}$$

Here i represents the cell type and j represents genomic bins. The ytrue represents the genomic bin-by-type matrix calculated from true signals. The ypred represents the predicted genomic bin-by-type matrix. The pairwise covariance wi,i was calculated between cell types. We then summed the scores across rows and normalized by the number of cell types to obtain the weights. Last, the weights w were dot multiplied by the original Poisson loss.
We trained the subclass-level deep-learning model on four NVIDIA A100 80 GB GPUs using the script basenji_train.py from Basenji65. For training, we set the parameter batch size to 32, epochs to 150 and patience to 30.
To evaluate the model’s ability to identify cell-type-specific patterns of cCREs, we computed the Spearman correlation between model predictions and true accessibility across cell types for all peaks in the test set. We further compared the cross-cell-type correlation to the coefficient of variation (the ratio of s.d. to mean) of each peak.
We also evaluated the model’s accuracy when applied to human cell types. We first identified matched human cell types from a previous study23. For each subclass in human and mouse cCREs, we calculated the Spearman correlation across orthologous cCREs (Extended Data Fig. 16). We next selected pairs based on correlation and annotation matching. We then used the model to predict chromatin accessibility in the paired human cell types, across all chromosomes. We further evaluated this prediction accuracy within and across cell types.

External datasets
External datasets used were as follows: (1) ENCODE rDHS regions for both hg19 and mm10 were obtained from the SCREEN database (https://screen.encodeproject.org)39,40. (2) ChromHMM38,102 states for mouse brain were downloaded from GitHub (https://github.com/gireeshkbogu/chromatin_states_chromHMM_mm9) and coordinates were converted with LiftOver (https://genome.ucsc.edu/cgi-bin/hgLiftOver) to mm10 with the default parameters96. (3) PhastCons103 conserved elements were downloaded from the UCSC Genome Browser (http://hgdownload.cse.ucsc.edu/goldenpath/mm10/phastCons60way/). (4) The ENCODE mm10 blacklist file was downloaded from http://mitra.stanford.edu/kundaje/akundaje/release/blacklists/mm10-mouse/mm10.blacklist.bed.gz. (5) Mouse mm10 genome information was downloaded from GENCODE (https://www.gencodegenes.org/mouse/).

Statistics
No statistical methods were used to predetermine sample sizes. There was no randomization of the samples, and investigators were not blinded to the specimens being investigated. However, clustering of single nuclei based on chromatin accessibility was performed in an unbiased manner, and cell types were assigned after clustering. Low-quality nuclei and potential barcode collisions were excluded from downstream analysis as described above.

Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Data availability
Demultiplexed FASTQ files are available at the NEMO archive (NEMO, RRID: SCR_016152) at https://assets.nemoarchive.org/dat-bej4ymm (the raw directory under the source data URL in this archive), and at the NCBI under GEO accession number GSE246791. Processed data are available at our web portal (http://www.catlas.org) and the same GEO accession number above.

Code availability
Custom code and scripts used for analysis are available at GitHub (https://github.com/beyondpie/CEMBA_wmb_snATAC).

83. Luo, C. et al. Single-cell methylomes identify neuronal subtypes and regulatory elements in mammalian cortex. Science 357, 600–604 (2017).
84. Li, H. & Durbin, R. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 25, 1754–1760 (2009).
85. Yan, F., Powell, D. R., Curtis, D. J. & Wong, N. C. From reads to insight: a hitchhiker’s guide to ATAC-seq data analysis. Genome Biol. 21, 22 (2020).
86. Zhang, K. et al. A single-cell atlas of chromatin accessibility in the human genome. Cell 184, 5985–6001 (2021).
87. Harrow, J. et al. GENCODE: producing a reference annotation for ENCODE. Genome Biol. 7, S4 (2006).
88. Triton Shared Computing Cluster (San Diego Supercomputer Center, 2022); https://doi.org/10.57873/T34W2R.
89. Rousseeuw, P. J. Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. J. Comput. Appl. Math. 20, 53–65 (1987).
90. Pedregosa, F. et al. Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011).
91. Quinlan, A. R. & Hall, I. M. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics 26, 841–842 (2010).
92. Hoyer, P. O. Non-negative matrix factorization with sparseness constraints. J. Mach. Learn. Res. 5, 1457–1469 (2004).
93. Cao, J. et al. The single-cell transcriptional landscape of mammalian organogenesis. Nature 566, 496–502 (2019).
94. Delignette-Muller, M. L. & Dutang, C. fitdistrplus: an R package for fitting distributions. J. Stat. Softw. 64, 1–34 (2015).
95. Schmitt, A. D. et al. A compendium of chromatin contact maps reveals spatially active regions in the human genome. Cell Rep. 17, 2042–2059 (2016).
96. Tyner, C. et al. The UCSC Genome Browser database: 2017 update. Nucleic Acids Res. 45, D626–D634 (2017).
97. Benaglia, T., Chauveau, D., Hunter, D. R. & Young, D. S. mixtools: an R package for analyzing mixture models. J. Stat. Softw. 32, 1–29 (2009).
98. Yu, G., Wang, L. G. & He, Q. Y. ChIPseeker: an R/Bioconductor package for ChIP peak annotation, comparison and visualization. Bioinformatics 31, 2382–2383 (2015).
99. Yao, Z. et al. A taxonomy of transcriptomic cell types across the isocortex and hippocampal formation. Cell 184, 3222–3241 (2021).
100. Wu, T. et al. clusterProfiler 4.0: a universal enrichment tool for interpreting omics data. Innovation 2, 100141 (2021).
101. Yu, G., Wang, L. G., Han, Y. & He, Q. Y. clusterProfiler: an R package for comparing biological themes among gene clusters. OMICS 16, 284–287 (2012).
102. Ernst, J. & Kellis, M. ChromHMM: automating chromatin-state discovery and characterization. Nat. Methods 9, 215–216 (2012).
103. Siepel, A. et al. Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes. Genome Res. 15, 1034–1050 (2005).

Acknowledgements We thank all of the other members of the Ren laboratory for their input. This study was supported by NIH grant U19MH114831 to J.R.E. and B.R., and NIH grant U19MH114830 to H.Z. J.R.E. is an investigator of the Howard Hughes Medical Institute. Zhaoning Wang is a DDBrown Awardee of the Life Sciences Research Foundation. Work at the Center for Epigenomics was also supported by the UC San Diego School of Medicine. This publication includes data that were generated at the UC San Diego IGM Genomics Center using an Illumina NovaSeq 6000 system that was purchased with funding from a National Institutes of Health SIG grant (S10 OD026929).

Author contributions Study supervision: B.R. Contribution to data analysis: S.Z., Y.E.L., K.W., E.A., S.M., Y.W., M.L.A., H.L., J.Z., H.Z. and J.S. Contribution to data generation and management: S.Z., S.P., Y.E.L., A.W., X.H., M.M., S.K., J.O., J.L., A.P.-D., M.M.B., H.Z., Z.Y., B.L., K.A.S., M.N., B.J., L.L., Q.Y. and S.L. Contribution to the web portal: Y.E.L. and S.Z. Contribution to data interpretation: S.Z., Y.E.L., K.W., E.A., S.P., B.R., J.R.E., M.M.B., B.T., H.Z., J.X., Zihan Wang, S.M., Y.X., K.Z. and A.C. Contribution to writing the manuscript: S.Z., Y.E.L., B.R., K.W., H.Z., B.T., S.P., M.M.B., J.X., Zhaoning Wang and S.M. All of the authors edited and approved the manuscript.

Competing interests B.R. is a co-founder and consultant of Arima Genomics and co-founder of Epigenome Technologies. J.R.E. is on the scientific advisory board of Zymo Research. H.Z. is on the scientific advisory board of MapLight Therapeutics.

Additional information
Supplementary information The online version contains supplementary material available at https://doi.org/10.1038/s41586-023-06824-9.
Correspondence and requests for materials should be addressed to Bing Ren.
Peer review information Nature thanks the anonymous reviewers for their contribution to the peer review of this work. Peer reviewer reports are available.
Reprints and permissions information is available at http://www.nature.com/reprints.
Extended Data Fig. 1 | Maps of the 117 anatomical dissections of the adult whole mouse brain. a, Schematic of brain tissue dissection strategy. Mouse brains were cut into 600-µm-thick coronal slices. b, These brain maps were generated using coordinates from the Allen Mouse Brain Common Coordinate Framework (CCF) v3 (ref. 26). Brain regions dissected from each coronal slice are marked according to the Allen Brain Reference Atlas26. The frontal view of each slice from slices 1–18 is shown, with the dissected regions alphabetically labelled on the left, and the anatomic labelling listed on the right. A detailed list of the dissected regions and the full anatomic labelling can be found in Supplementary Table 1.

Extended Data Fig. 2 | Quality control metrics of the snATAC-seq datasets at the bulk level. a, Box plots showing the distribution of mapping ratios (the fraction of the mapped sequencing reads) in replicates (rep) 1 and 2 of the snATAC-seq experiments from each brain dissection. b, Box plots showing the distribution of the number of proper read pairs (reads are correctly oriented) in rep 1 and 2 of the snATAC-seq experiments. c, Box plots showing the distribution of numbers of unique chromatin fragments detected in rep 1 and 2 of the snATAC-seq experiments. d, Box plots showing the distribution of the number of unique barcodes captured in replicates 1 and 2 of snATAC-seq experiments. In a-d, the number per each boxplot (rep1 or rep2) is 117. In each boxplot, the box spans the first to third quartiles, the horizontal line denotes the median, and whiskers show 1.5x the interquartile range. e, Frequency distribution plot showing the fragment size distribution of each snATAC-seq sample or datasets (234 samples/datasets in total). f, Heat map showing the pairwise Spearman correlation coefficients of the mapping correlations of the bam files between the snATAC-seq datasets. The column and row names consist of two parts: brain region name and replicate label. Study represents dissections covered by our previous study (Last) or updated in the current study (New).
Extended Data Fig. 3 | Quality control metrics of the snATAC-seq datasets at the single-cell level. a, Dot plot illustrating fragments per nucleus and individual TSS enrichment. Nuclei in the top right quadrant were selected for analysis (TSS enrichment > 10 and > 1,000 fragments per nucleus). b, Box plots showing the AUPRCs of AMULET30 and Scrublet 28 on the simulated data sets from the corresponding samples labelled in x axis. Each bar represents the mean value of 10 random experiments with 1x standard deviation as the error bar. Two-sided t-tests were used, and *** means P-value < 0.0001. c, Box plots showing the doublet rates across the samples. Samples were grouped based on their replicate information. n = 117 biologically independent samples for each replicate 1 and 2. d, Number of nuclei retained after each step of quality control. e, Bar plots showing the numbers of nuclei passing quality control for subregions. f, Box plots showing the TSS enrichments and unique fragments per nuclei for the replicates in different mouse brain regions. The smallest sample size is ORB region replicate 1 with n = 4,943 cells, while the largest is PAL-2 replicate 1 with n = 12,464 cells. In c and f, boxes span the first to third quartiles, horizontal line denotes the median, and whiskers show 1.5x the interquartile range.

Extended Data Fig. 4 | Iterative clustering for the snATAC-seq data. a, A multi-stage cell clustering pipeline is organized for all the nuclei passing our quality control. b, Violin plots showing the number of unique fragments per nucleus in each cell subclass. c, Violin plots showing the TSS enrichment in each nucleus of each cell subclass. d, Boxplots of acceptance rates from k-nearest neighbour batch effect test 34 (kBET) for the 275 subclasses. Boxes span the first to third quartiles, horizontal line denotes the median, and whiskers show 1.5× the interquartile range. Two-sided t-tests showed no significant P-values between the values from the two boxes. e, Distribution of the local inverse Simpson’s index35 (LISI) scores for cells in each subclass.
Extended Data Fig. 5 | Quality and reproducibility of the cell clusters. a, CDF plot showing the consistency of the estimated fraction of each cell subclass between the biological replicates. Two-sided Kolmogorov-Smirnov test shows no significant difference between the biological replicates. b, Box plots of the P values of two-sided Kolmogorov-Smirnov tests illustrate consistent results between the two biological replicates for each subclass across major brain regions, sub-regions and brain dissections tested. n = 12 comparisons for major regions, n = 41 comparisons for sub-regions and n = 117 comparisons for dissection regions. c, Heat map showing the pairwise Spearman correlation coefficients of cell subclass composition between each replicate of brain dissections. The column and row names consist of two parts: brain region name and replicate label. For example, CB-1.1 represents the replicate 1 of the first brain dissection of the cerebellum (CB-1). The embedded box plot shows the distribution of Spearman correlation coefficients between two biological replicates, replicates from intra-major brain regions and inter-major brain regions. Significance is denoted as ***P < 2.2e-16, determined by one-sided Wilcoxon rank-sum test. n = 22720 pairs for “intra-major regions” group, n = 4424 pairs for “inter-major regions” group, n = 117 for “between replicates” group. Boxes span the first to third quartiles, horizontal line denotes the median, and whiskers show 1.5x the interquartile range.
Extended Data Fig. 6 | Integration analysis between the snATAC-seq and the scRNA-seq data for neurons and non-neurons separately. UMAP on the co-embedding space of neurons from the snATAC-seq data (a) and scRNA-seq data (b). Colours as major regions. c, The co-embedding UMAP embedding of non-neuronal cells from the scRNA-seq data and the snATAC-seq data on the same space coloured by the two modalities. UMAP on the co-embedding space of non-neurons from snATAC-seq data (d) and scRNA-seq data (e). Colours as major regions. f, Consensus scores (i.e., transfer-label scores) between non-neuronal subclasses from the scRNA-seq data and L4-level non-neuronal clusters from the snATAC-seq data. g, Consensus scores between neuronal clusters from the scRNA-seq data of Allen Institute and L4-level neuronal clusters from the snATAC-seq data. h, Consensus score between non-neuronal clusters from the scRNA-seq data and L4-level non-neuronal clusters from the snATAC-seq data. i, The 22 non-neuronal subclasses matched to the non-neuronal subclasses in the scRNA-seq. From left to right, the bar plots represent class, biological replicate distribution of nuclei, major region distribution of nuclei, number of clusters, and number of nuclei.

Extended Data Fig. 7 | Marker genes for the subclasses after integration in the snATAC-seq data using the imputed gene expressions. Dotplot showing the snATAC-seq gene activity scores of the marker genes (columns) used for identification of the scRNA-seq data across the cell subclasses 5. The first 13 columns correspond to major neuronal cell type marker genes including neurotransmitter genes as follows: Snap25 (Neuron), Gad1 (GABA), Gad2 (GABA), Slc32a1 (GABA), Slc17a6 (Glut-subcortical), Slc17a7 (Glut-cortical), Slc17a8 (Glut), Slc6a5 (Gly-GABA), Slc6a4 (Glut-Sero), Slc6a3 (Dopa), Slc18a3 (Chol), Hdc (Hist), Slc6a2 (Nora). The subsequent columns are the most occurring marker gene reported within each Allen Institute subclass designation corresponding to each subclass annotation (row) of the snATAC-seq data.
Extended Data Fig. 8 | Cellular composition of brain dissections for cell subclasses. a, Bar plot shows the total number of nuclei sampled for each brain dissection region. b, Normalized percentages (pct) of each subclass in all the dissected regions are shown as different sized dots. The sizes of dots correspond to the percentage and the colours of the dots indicate the brain dissections.


Extended Data Fig. 9 | Statistics of peak calling on snATAC-seq data for each cell subtype. a, Schematic of peak calling and filtering pipeline. b, Density distribution plot showing the fraction of cells per cell type in which a peak was accessible and a corresponding background for each cell type. For each cell type, the background is defined as the non-DHS and non-peak regions randomly picked from the genome. c, Venn plot showing the overlapping between the peaks from the whole mouse brain and the ones from the cerebral regions19. d, Enrichment analysis of the peak sets with a 15-state ChromHMM model in the mouse brain chromatin102. e, Density map comparing the median and maximum variation of chromatin accessibility at each cCRE across cell subclasses. The left density map refers to the cCREs overlapping with the ENCODE DHSs, and the right one refers to the cCREs having no overlaps with the ENCODE DHSs. i, Scatter plot showing entropy (blue) and sparseness (red) trends when increasing the number of modules used for non-negative matrix factorization. When the module number is 150, we can see a significant drop in entropy and a significant increase in sparseness. j, The red arrows point to the two subclasses with lowest number of cells in the snmC-seq data41.

Extended Data Fig. 10 | Characterization of predicted cCRE-target gene pairs. a, Scatter plot showing the number of identified connections between all the cCREs pairs within 500k bp along with the number of nuclei for each cell subclass identified based on the integration analysis. b, Scatter plot showing the number of proximal-distal cCREs along with the number of nuclei for each cell subclass. c, Histogram showing the distances along the genome for each proximal-distal cCREs. d, Histogram showing the distances along the genome for each pair of enhancer and targeted gene’s promoter (positive proximal-distal cCREs) inferred by the correlation study (Fig. 3b). e, In total, 613,485 positively correlated proximal-distal cCREs and 107,413 negatively correlated proximal-distal cCREs were identified. f, Boxplot showing the identified potential enhancers for each of 20,703 gene in the positively correlated pairs. g, Boxplots of the enrichment scores (1 kb resolution) of aggregate peak analysis (APA) for the top 20% positive proximal-distal connections (ppdc) from several represented subclasses. Match, the subclass’s Hi-C data41 used for the same subclasses. Unmatch, the subclass’s Hi-C data used for other subclasses as a random background. 11 data points were included in the match group and 110 points in the unmatched groups. P value was calculated by the one-sided Wilcoxon rank sum test. In f and g, boxes span the first to third quartiles, horizontal line denotes the median, and whiskers show 1.5x (f) and 2x (g) the interquartile ranges. h, Heatmaps of enrichment signals for the top 10% global proximal-distal connections (pdc) and enrichment signals for the random pairs.
Extended Data Fig. 11 | Inference of gene regulatory networks (GRNs) at cell subclass level across the whole mouse brain. a, Schematic of identifying co-accessible cCREs for each cell subclass using Cicero44. b, Schematic view of inference of GRNs from predicting the putative target genes’ expression with the corresponding transcription factors (TFs) for each cell subclass using CellOracle52. c, Boxplot of 267 P values from two-sided Kolmogorov-Smirnov test to check the power-law distributions of the nodes’ degrees from GRNs. Only one cell subclass (OB_Eomes_Ms4a15_Glut) did not pass this examination with the P values smaller than 0.05. The box spans the first to third quartiles, the horizontal line denotes the median, and whiskers show 1.5x the interquartile range. d, 15 commonly used network motifs56 used in our analysis. Each node is a TF or a gene, and edges describe the regulation directions, i.e., arrows pointed to the ones that were regulated by the source nodes or TFs. The blue colour means the negative regulation (TFs inhibit target gene expressions), while the orange colour means the positive regulation (TFs upregulate target gene expressions). PFL, positive-feedback loops; RDP, regulated double-positive; FC, fully connected triad; FFL, feedforward loops; SIM, single-input module. e, Stacked bar plots of the ratio of the network motifs above in each subclass. Each column corresponds to one cell subclass.
Extended Data Fig. 12 | Histograms of the counts of the network motifs in each subclass’s gene regulation network (GRN) grouped by main class (a) or regions (b). The names of the network motifs are the same as in Extended Data Fig. 11d. Only classes with at least 3 subclasses are shown here. For each histogram, we added the corresponding density plot. The telencephalon region includes isocortex, olfactory bulb, hippocampus, striatum, pallidum, and amygdala; the diencephalon region includes thalamus and hypothalamus; the hindbrain includes pons and medulla. c, Normalized signals of Atf3 ChIP-seq at Klf4 in bone marrow-derived macrophages (BMM) showing Klf4 is likely to be a putative target of Atf3. d, Normalized signals of Atf3 ChIP-seq at Tal1 in bone marrow-derived macrophages (BMM) showing Tal1 is likely to be a putative target of Atf3.
Extended Data Fig. 13 | Comparison of chromatin accessibility (CA) conserved and divergent cCREs between mouse and human. a, A schematic of CA-conserved and CA-divergent cCREs. The CA-conserved cCREs are the cCREs in our snATAC-seq data that are conserved across species and have open chromatin in orthologous regions. The CA-divergent cCREs are sequence conserved to orthologous regions but have not been identified as open chromatin regions in the other species. The bar plot shows the numbers of CA-conserved and CA-divergent cCREs. b, Bar plot showing the relative fraction of CA-conserved and CA-divergent cCREs across subclasses. c, Radar chart showing the fraction of genomic distribution of CA-conserved and CA-divergent cCREs. The CA-conserved cCREs show an increase in percentage in Promoter-TSS regions. d, Histograms showing the number of CA-conserved and CA-divergent cCREs in subclasses. The number of CA-conserved cCREs is higher than that of CA-divergent cCREs. e, Histograms showing the CA-conserved cCREs captured by the number of cell subclasses. A fraction of CA-conserved cCREs are captured by more than 200 cell subclasses. f, Histograms showing the CA-divergent cCREs captured by the number of cell subclasses. Most CA-divergent cCREs are captured by fewer than 50 cell subclasses.
Extended Data Fig. 14 | Analyses of chromatin accessibility at transposon elements (TEs) of cCREs. a, Pie charts showing the genomic distribution of mouse-specific cCREs. b, Histograms showing the fraction of cCREs overlapping with TEs in subclasses of glutamatergic neurons (Glut), non-glutamatergic neurons (nonGlut-Neu), and non-neurons (NN). c, Boxplot showing the fraction of cCREs overlapping with TEs in highTE-Glut, other-Glut, nonGlut-Neu, and NN subclasses. The P values are calculated by the one-sided Wilcoxon rank-sum test. Boxes span the first to third quartiles, horizontal line denotes the median, and whiskers show 1.5× the interquartile range. There are n = 22 subclasses in the “highTE-Glut” group, n = 108 subclasses in the “other-Glut” group, n = 123 subclasses in the “nonGlut-Neu” group, and n = 22 subclasses in the “NN” group. d, Heatmap showing the fraction of genomic distribution of cCREs in each cell subclass. e, Heatmap showing the fraction of TE family distribution of cCREs in each cell subclass. f, GO analysis showing genes near TE-cCREs in highTE-Glut versus genes near TE-cCREs in all subclasses are enriched for neuronal specific functions. g, GO analysis showing genes near TE-cCREs in highTE-Glut versus genes near all cCREs in highTE-Glut are enriched for neuronal specific functions. h, Top 3 motif families enriched in the TE-cCREs in highTE-Glut. The unadjusted P-values were calculated using a two-sided Fisher’s exact test. i, Top 3 motif families enriched in the TE-cCREs that were positively correlated with genes and occurred in highTE-Glut. The unadjusted P-values were calculated using a two-sided Fisher’s exact test. j, Volcano plot showing differential chromatin accessibility (DCA) TE-cCREs in highTE-Glut subclasses compared to other subclasses. Red labels all DCA TE-cCREs that correlated with synaptic related genes. k, Genome browser tracks of aggregate chromatin accessibility profiles for NN, GABA, highTE-Glut, and other Glut subclasses at selected DCA TE-cCREs and gene pairs. RNA signals shown here were collected from a previous study99.
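To make the two-sided Fisher's exact test mentioned in panels h and i concrete, here is a minimal standard-library Python sketch for a 2x2 table. The counts (motif present or absent in a foreground set versus a background set) and the function name `fisher_exact_two_sided` are hypothetical; the authors' own motif-enrichment tooling is not reproduced here.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test P value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    total = row1 + row2
    denom = comb(total, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell with fixed margins.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    lowest, highest = max(0, col1 - row2), min(row1, col1)
    p_observed = prob(a)
    # Sum over all tables that are no more probable than the observed one;
    # the small relative tolerance guards against floating-point ties.
    p_value = sum(
        p for p in (prob(x) for x in range(lowest, highest + 1))
        if p <= p_observed * (1 + 1e-7)
    )
    return min(1.0, p_value)


# Hypothetical counts: rows = foreground / background cCREs,
# columns = motif present / motif absent.
p_value = fisher_exact_two_sided(8, 2, 1, 5)
```

The P value returned is unadjusted, matching the wording of the caption; any multiple-testing correction would be a separate step.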
Extended Data Fig. 15 | Accessible variability at transposon elements (TEs) across cell subclasses. a, Density scatter plot comparing the averaged accessibility and coefficient of variation across cell subclasses at each transposon element. Variable TEs are defined on the upper right side of the dashed lines; invariable TEs are defined on the upper left of the dashed lines. b, Normalized accessibility at variable TEs in different cell subclasses. The middle bar plot shows the correlation between mCG level and accessibility at variable TEs across subclasses. The right bar plot shows the correlation between expression level and accessibility at variable TEs across subclasses. c, Top 10 motifs enriched in positive distal cCREs overlapping variable TEs. The unadjusted P-values were calculated using a two-sided Fisher’s exact test. d, Normalized accessibility at invariable TEs in different cell subclasses. The middle bar plot shows the correlation between mCG level and accessibility at invariable TEs across subclasses. The right bar plot shows the correlation between expression level and accessibility at invariable TEs across subclasses.
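The split into variable and invariable TEs in panel a rests on two summary statistics per TE: the mean accessibility across subclasses and the coefficient of variation (CV, standard deviation divided by mean). A minimal Python sketch of such a classification is given below. The cutoff values, the function name `classify_te`, and the reading of the plot (high mean with high CV as variable, high mean with low CV as invariable) are assumptions made for illustration; the caption does not state the thresholds or the axis assignment.

```python
from statistics import mean, pstdev


def classify_te(accessibility, mean_cutoff, cv_cutoff):
    """Label one TE from its accessibility values across cell subclasses."""
    avg = mean(accessibility)
    if avg < mean_cutoff:
        # Below the accessibility cutoff: neither group in this sketch.
        return "low"
    # Coefficient of variation: population standard deviation over the mean.
    cv = pstdev(accessibility) / avg
    return "variable" if cv >= cv_cutoff else "invariable"


# Hypothetical accessibility values across four subclasses, with assumed cutoffs.
uniform_te = classify_te([1.0, 1.0, 1.0, 1.0], mean_cutoff=0.5, cv_cutoff=0.5)
specific_te = classify_te([0.0, 0.0, 0.0, 4.0], mean_cutoff=0.5, cv_cutoff=0.5)
closed_te = classify_te([0.1, 0.1, 0.1, 0.1], mean_cutoff=0.5, cv_cutoff=0.5)
```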
Extended Data Fig. 16 | Spearman correlation across orthologous cCREs between all paired human and mouse subclasses (mba: mouse brain atlas; hba:
human brain atlas).
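For readers unfamiliar with the statistic, Spearman correlation is the Pearson correlation of the ranks. The following standard-library Python sketch computes it for two accessibility vectors over the same set of orthologous cCREs. The input values and function names are hypothetical, and the sketch does not reproduce the authors' pipeline.

```python
def average_ranks(values):
    """Return 1-based ranks, giving tied values the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = shared_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; assumes neither vector is constant."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical accessibility of four orthologous cCREs in one mouse
# subclass and one human subclass.
mouse_subclass = [0.1, 0.4, 0.9, 2.5]
human_subclass = [1.0, 4.0, 9.0, 100.0]
rho = spearman(mouse_subclass, human_subclass)
```

Because only ranks enter the calculation, any monotonic relationship between the two vectors yields a coefficient of 1, which is why the statistic is robust to scale differences between datasets.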
nature portfolio | reporting summary
Corresponding author(s): Bing Ren
Last updated by author(s): Oct 30, 2023

Reporting Summary
Nature Portfolio wishes to improve the reproducibility of the work that we publish. This form provides structure for consistency and transparency
in reporting. For further information on Nature Portfolio policies, see our Editorial Policies and the Editorial Policy Checklist.

Statistics
For all statistical analyses, confirm that the following items are present in the figure legend, table legend, main text, or Methods section.
n/a Confirmed
The exact sample size (n) for each experimental group/condition, given as a discrete number and unit of measurement
A statement on whether measurements were taken from distinct samples or whether the same sample was measured repeatedly
The statistical test(s) used AND whether they are one- or two-sided
Only common tests should be described solely by name; describe more complex techniques in the Methods section.

A description of all covariates tested


A description of any assumptions or corrections, such as tests of normality and adjustment for multiple comparisons
A full description of the statistical parameters including central tendency (e.g. means) or other basic estimates (e.g. regression coefficient)
AND variation (e.g. standard deviation) or associated estimates of uncertainty (e.g. confidence intervals)

For null hypothesis testing, the test statistic (e.g. F, t, r) with confidence intervals, effect sizes, degrees of freedom and P value noted
Give P values as exact values whenever suitable.

For Bayesian analysis, information on the choice of priors and Markov chain Monte Carlo settings
For hierarchical and complex designs, identification of the appropriate level for tests and full reporting of outcomes
Estimates of effect sizes (e.g. Cohen's d, Pearson's r), indicating how they were calculated
Our web collection on statistics for biologists contains articles on many of the points above.

Software and code


Policy information about availability of computer code
Data collection Sony Cell Sorter Software v2.1.2-5, Biomek Software 5.1 (library preparation), Illumina HiSeq2500, HiSeq4000, and NovaSeq 6000 instrument
control software (sequencing)

Data analysis bwa (v0.7.17), HOMER (v4.11), BEDTools (v2.25.0), MACS2 (v2.1.2), GNU parallel (20220822),
GNU R (v4.3.1), ggplot2 (v3.4.3), stringr (v1.5.0), purrr (v1.0.2), dplyr (v1.1.3), Seurat (v5),
Python (v3.10), SnapATAC2 (v2.4), Sklearn (v1.1.0), Cicero (v3.16), CellOracle (v0.15.0),
Sony SH800S software,
https://github.com/beyondpie/CEMBA_wmb_snATAC
For manuscripts utilizing custom algorithms or software that are central to the research but not yet described in published literature, software must be made available to editors and
reviewers. We strongly encourage code deposition in a community repository (e.g. GitHub). See the Nature Portfolio guidelines for submitting code & software for further information.
April 2023

Data
Policy information about availability of data
All manuscripts must include a data availability statement. This statement should provide the following information, where applicable:
- Accession codes, unique identifiers, or web links for publicly available datasets
- A description of any restrictions on data availability
- For clinical datasets or third party data, please ensure that the statement adheres to our policy

Demultiplexed FASTQ files are available at the NEMO archive (NEMO, RRID: SCR_016152) at https://assets.nemoarchive.org/dat-bej4ymm (the raw directory under
the source data URL in this archive), and at the NCBI under GEO accession number GSE246791. Processed data are available at our web portal
(http://www.catlas.org) and under the same GEO accession number.

Research involving human participants, their data, or biological material


Policy information about studies with human participants or human data. See also policy information about sex, gender (identity/presentation),
and sexual orientation and race, ethnicity and racism.
Reporting on sex and gender N/A

Reporting on race, ethnicity, or other socially relevant groupings N/A

Population characteristics N/A

Recruitment N/A

Ethics oversight N/A

Note that full information on the approval of the study protocol must also be provided in the manuscript.

Field-specific reporting
Please select the one below that is the best fit for your research. If you are not sure, read the appropriate sections before making your selection.

Life sciences Behavioural & social sciences Ecological, evolutionary & environmental sciences
For a reference copy of the document with all sections, see nature.com/documents/nr-reporting-summary-flat.pdf

Life sciences study design


All studies must disclose on these points even when the disclosure is negative.
Sample size No statistical methods were used to predetermine sample size. For each of 117 regions from the mouse brain, dissected brain tissues were
pooled from 2-31 animals of the same sex (only 2 dissections from the mouse cerebellum region had 2 animals for snATAC-seq library construction; all the other
samples had 4-31 animals) to obtain enough nuclei for single-nucleus ATAC-seq for each biological replicate, and two biological
replicates were performed. In total, 234 samples were included.

Data exclusions No samples were excluded. For analysis, only nuclei with >1,000 fragments per nucleus and transcription start site enrichment > 10 were selected.
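The nucleus-level filter stated above can be expressed as a short Python sketch. The two thresholds (more than 1,000 fragments and a transcription start site enrichment above 10) come from the text; the `Nucleus` record, its field names, and the example barcodes are hypothetical and do not reflect the data structures used in the authors' pipeline.

```python
from dataclasses import dataclass


@dataclass
class Nucleus:
    barcode: str        # hypothetical cell barcode
    n_fragments: int    # unique fragments recovered for this nucleus
    tsse: float         # transcription start site enrichment score


def passes_qc(nucleus, min_fragments=1000, min_tsse=10.0):
    """Keep nuclei with >1,000 fragments and TSS enrichment > 10 (strict inequalities)."""
    return nucleus.n_fragments > min_fragments and nucleus.tsse > min_tsse


# Hypothetical nuclei, for illustration only.
nuclei = [
    Nucleus("AAAC", 5200, 14.2),   # passes both thresholds
    Nucleus("AAAG", 800, 18.0),    # too few fragments
    Nucleus("AAAT", 3000, 6.5),    # TSS enrichment too low
    Nucleus("AACA", 1000, 12.0),   # exactly 1,000 fragments does not pass
]
kept = [n.barcode for n in nuclei if passes_qc(n)]
```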

Replication Experiments were performed for 2 biological replicates for each of 117 dissection regions. All the replicates were successfully collected.

Randomization There was no randomization of the samples. For each dissection region, dissected brain tissues were pooled from 2-31 animals of the same
sex (only 2 dissections from the mouse cerebellum region had 2 animals for snATAC-seq library construction; all the other samples had 4-31 animals).
The 117 dissection regions were designed before the experiments to analyse the whole mouse brain comprehensively.

Blinding Investigators were not blinded to the specimen being investigated based on our experimental design above.

Reporting for specific materials, systems and methods



We require information from authors about some types of materials, experimental systems and methods used in many studies. Here, indicate whether each material,
system or method listed is relevant to your study. If you are not sure if a list item applies to your research, read the appropriate section before selecting a response.

Materials & experimental systems Methods



n/a Involved in the study n/a Involved in the study
Antibodies ChIP-seq
Eukaryotic cell lines Flow cytometry
Palaeontology and archaeology MRI-based neuroimaging
Animals and other organisms
Clinical data
Dual use research of concern
Plants

Animals and other research organisms


Policy information about studies involving animals; ARRIVE guidelines recommended for reporting animal research, and Sex and Gender in
Research

Laboratory animals Adult (P56) C57BL/6J male mice were purchased from Jackson Laboratories at seven weeks of age and maintained in the Salk animal
barrier facility under a 12-h light/12-h dark cycle in a temperature-controlled room with ad libitum access to water and food until
euthanasia. The temperature in the animal facility was maintained within the range of 20 to 22.2 °C, while the humidity levels varied
between 35 and 60%.

Wild animals No wild animals were used in this study.

Reporting on sex Only male mice were used.

Field-collected samples No field-collected samples were used in this study.

Ethics oversight All experimental procedures using live animals were approved by the SALK Institute Animal Care and Use Committee under protocol
number 18-00006.
Note that full information on the approval of the study protocol must also be provided in the manuscript.

Plants
Seed stocks N/A

Novel plant genotypes N/A

Authentication N/A

Flow Cytometry
Plots
Confirm that:
The axis labels state the marker and fluorochrome used (e.g. CD4-FITC).
The axis scales are clearly visible. Include numbers along axes only for bottom left plot of group (a 'group' is an analysis of identical markers).
All plots are contour plots with outliers or pseudocolor plots.
A numerical value for number of cells or percentage (with statistics) is provided.

Methodology
Sample preparation Nuclei were stained with DRAQ7 (#7406, Cell Signaling)
Instrument Sony SH800

Software Sony SH800S software

Cell population abundance Cell populations within each sample were determined using snATAC-seq as described in the manuscript. See Methods and
Supplementary Tables 2 and 3 for details.

Gating strategy Potential nuclei were first identified using FSC-Area and BSC-Area. Next, doublets were removed based on BSC and FSC signal
width. DRAQ7-positive nuclei with 2n count were sorted.

Tick this box to confirm that a figure exemplifying the gating strategy is provided in the Supplementary Information.
