Single-Cell Analysis of Chromatin Accessibility in The Adult Mouse Brain
https://doi.org/10.1038/s41586-023-06824-9

Received: 31 March 2023
Accepted: 1 November 2023
Published online: 13 December 2023
Open access

Songpeng Zu1,10, Yang Eric Li1,2,10, Kangli Wang1,10, Ethan J. Armand1, Sainath Mamde1, Maria Luisa Amaral1, Yuelai Wang1, Andre Chu1, Yang Xie1, Michael Miller3, Jie Xu1, Zhaoning Wang1, Kai Zhang1, Bojing Jia1, Xiaomeng Hou3, Lin Lin3, Qian Yang3, Seoyeon Lee1, Bin Li1, Samantha Kuan1, Hanqing Liu4, Jingtian Zhou4, Antonio Pinto-Duarte5, Jacinta Lucero5, Julia Osteen5, Michael Nunn6, Kimberly A. Smith7, Bosiljka Tasic7, Zizhen Yao7, Hongkui Zeng7, Zihan Wang8, Jingbo Shang8, M. Margarita Behrens5, Joseph R. Ecker6, Allen Wang3, Sebastian Preissl3,9 & Bing Ren1,3 ✉
The Brain Initiative Cell Census Network aims to achieve a comprehensive understanding of the cellular and molecular composition of the mammalian brain1. As an experimental model, the laboratory mouse has a critical role in the investigation of gene function in vivo as well as in the development and safety evaluation of various therapeutics. A detailed catalogue of cell types in the mouse brain along with their spatial distribution and functional connections would therefore greatly facilitate the study of the complex neurocircuits and gene pathways as well as help in the development of treatments for neurological disorders. Single-cell transcriptomics studies2–7 have identified hundreds of subclasses and thousands of cell types across the brain. This considerable cellular and spatial complexity underscores the need for a better understanding of the cis-regulatory elements (CREs) that are responsible for the identity and gene expression patterns in each cell type.

CREs control spatiotemporal gene expression through the binding of sequence-specific transcription factors (TFs) and the recruitment of chromatin remodeller proteins and/or transcription machinery to their target genes8–10. These elements, including promoters, enhancers, insulators, silencers and other less-well-characterized regulatory sequences, work together to drive cell-type-specific gene expression in development11,12, differentiation and disease13,14. Comprehensive mapping of CREs in mouse brain cells will provide mechanistic insights into gene regulation and function in different brain cell types and advance our understanding of brain development and neurological disorders.

Previous catalogues of cCREs in mouse brain cells were derived through epigenomic profiling of a limited number of brain regions and are therefore incomplete2,15–22. To more comprehensively delineate the cCREs in the mouse brain cells, we used the single-nucleus assay for transposase-accessible chromatin followed by sequencing (snATAC–seq) to profile chromatin accessibility at the single-cell resolution across the entire adult mouse brain. In a previous study19 that focused on the mouse cerebrum, we reported the delineation of 160 cell types comprising approximately 800,000 brain cells across 45 anatomic dissections, and the annotation of 491,818 cCREs that are probably deployed in one or more of these cell types. Here we report the analysis of an additional 1.5 million brain cells from the rest of mouse
1Department of Cellular and Molecular Medicine, University of California San Diego, School of Medicine, La Jolla, CA, USA. 2Department of Neurosurgery and Genetics, Washington University School of Medicine, St Louis, MO, USA. 3Center for Epigenomics, University of California San Diego, School of Medicine, La Jolla, CA, USA. 4Genomic Analysis Laboratory, The Salk Institute for Biological Studies, La Jolla, CA, USA. 5The Salk Institute for Biological Studies, La Jolla, CA, USA. 6Howard Hughes Medical Institute, The Salk Institute for Biological Studies, La Jolla, CA, USA. 7Allen Institute for Brain Science, Seattle, WA, USA. 8Department of Computer Science and Engineering, University of California San Diego, La Jolla, CA, USA. 9Institute of Experimental and Clinical Pharmacology and Toxicology, Faculty of Medicine, University of Freiburg, Freiburg, Germany. 10These authors contributed equally: Songpeng Zu, Yang Eric Li, Kangli Wang.
✉e-mail: [email protected]
Fig. 1 | Single-cell analysis of chromatin accessibility in the adult whole mouse brain. a, Schematic of the sample dissection strategy. The brain map was generated using coordinates from the Allen Mouse Brain Common Coordinate Framework (CCF) v.3 (ref. 26). b, The number of nuclei for 117 dissections after quality control and doublet removal. The dot size is proportional to the size of cells and the dissections that were not covered by our previous study19 are shown in grey. A to L on the left were used as the dissection region labels on each slice (details are provided in Extended Data Fig. 1). The number of dissections represents the number of dissections covered by our previous study (last) and updated in the current study (new). The total number of cells represents the number of cells covered by our previous study (last) and updated in the current study (new). c, UMAP81 embedding and clustering analysis of snATAC–seq data. The light colours denote major cell classes. NN, non-neuronal cells. Cells are coloured on the basis of major regions as in b. d, The co-embedding UMAP embedding of the neuronal cells from scRNA-seq data5 and the snATAC–seq data on the same space coloured by the two modalities. e, The consensus score between neuronal subclasses from the scRNA-seq data above and L4-level neuronal clusters from our snATAC–seq data. f, The 253 neuronal subclasses in our snATAC–seq data matched to neuronal subclasses in the scRNA-seq above, and ordered on the basis of the subclass IDs (for all of the following figures, the order was kept the same unless otherwise mentioned). From left to right, the bar plots represent the class, major neurotransmitter (NT) type, biological replicate distribution of nuclei, major region distribution of nuclei, number of clusters and number of nuclei. Detailed information about class, neurotransmitter type and subclass is reported in the companion paper5. A list of full names of the subclasses is provided in Supplementary Table 3. CTX, cerebral cortex; HYa, anterior hypothalamus; L6b, layer 6b; LSX, lateral septal complex; IT, intratelencephalic; ET, extratelencephalic; NP, near-projecting; CT, corticothalamic; OB, olfactory bulb; CR, Cajal-Retzius; DG, dentate gyrus; IMN, immature neurons; CGE, caudal ganglionic eminence; MGE, medial ganglionic eminence; CNU, cerebral nuclei; LGE, lateral ganglionic eminence; MH, medial habenula; LH, lateral habenula; Chol, cholinergic neurons; Dopa, dopaminergic neurons; Glyc, glycinergic neurons; Sero, serotonergic neurons.
Fig. 2 | Identification and characterization of cCREs across mouse brain cell types. a, The fraction of cCREs that overlaps with annotated sequences in the mouse genome was determined using HOMER45. TTS, transcription termination site; UTR, untranslated region. b, The overlaps between the cCREs in this study (red) and the representative DHSs (rDHSs; blue) from the SCREEN database18. c, The average PhastCons conservation scores of cCREs (red) overlapping (ovlp) with rDHSs, cCREs (blue) with no overlaps with rDHSs, and random genomic background (grey) were determined using deepTools82. d, The fraction of cCREs captured by different cell subtypes for peak calling. Left, the cCREs with no overlaps with rDHSs. Right, the cCREs with overlaps with rDHSs. e, Genome browser tracks of the two types of cCREs. Left, cCREs with no overlaps with rDHSs. Right, the cCREs with overlaps with rDHSs. The subclass names were the same as for the scRNA-seq data in the companion paper5. f, The chromatin accessibility at 150 cis-regulatory modules across the 244 shared cell subclasses in the snATAC–seq data for all of the 1 million cCREs (top left). Rows represent subclasses, and columns are representative cCREs sampled from each module. Right, heat map showing the snDNA-methylation signals from the snmC-seq41 analysis at the genomic locations of the corresponding cCREs for the same subclasses. Bottom, heat maps similar to those above but for only the 460,000 cCREs with no overlaps with the ENCODE rDHSs.
computed the PCCs of the shuffled cCRE–gene pairs (Fig. 3b and Methods). This analysis revealed a total of 613,485 positively correlated distal cCRE (putative enhancer)–gene pairs and 107,413 negatively correlated distal cCRE–gene pairs at an empirically defined significance threshold of FDR < 0.01 (Extended Data Fig. 10d and Supplementary Table 13). The median distance between the potential enhancers and the target promoters was 133 kb (Extended Data Fig. 10e). Each promoter region was assigned to a median of 24 putative enhancers (Extended Data Fig. 10f). The top proximal–distal cCRE pairs and positive pairs showed enrichment signals using the chromatin conformation data from the companion study41 (Methods and Extended Data Fig. 10g,h).

For the subsequent analysis, we focused mainly on the positively correlated pairs, including 281,200 potential enhancers and 20,703 putative target genes. To investigate how the putative enhancer may regulate cell-type-specific gene expression, we further classified them into 54 modules using the NMF42 on the matrix of normalized chromatin accessibility across the cell subclasses based on the integration analysis with the scRNA-seq data, and organized the distal cCREs based on the modules (Fig. 3c and Supplementary Tables 14 and 15). The putative enhancers in each module showed cell-subclass-specific chromatin accessibility profiles co-occurring with the RNA expression of their putative target genes (Fig. 3c). We next performed the motif-enrichment analysis for each module using HOMER45 with a threshold of P < 10^−10 (Fig. 3c and Supplementary Table 16). The known motifs showed a similar cell-subclass-specific pattern, which indicated cell-subclass-specific regulatory programs. For example, EBF transcription factor 1 (EBF1), which is important for B cell development, was expressed in the pericytes from human brain tissues46. We found that EBF1 motifs are enriched in the cCREs from pericytes in the mouse brain (Fig. 3c). For example, motifs for both the TF PU.1 and interferon regulatory factor 8 (IRF8) were enriched in border-associated macrophages (BAMs) and microglia (Fig. 3c and Supplementary Tables 15 and 16). IRF8 is critical to transform microglia into a reactive phenotype47,48. PU.1 is especially expressed in microglia and can regulate genes associated with Alzheimer’s disease in primary human microglia49. PU.1 and IRF8 also have essential roles in macrophages50,51.
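The correlation-based linking of distal cCREs to genes described above can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the function names, the toy accessibility and expression vectors, and the particular way of deriving an empirical FDR cutoff from the shuffled pairs are all assumptions made for illustration.

```python
import random
from bisect import bisect_left


def pearson(x, y):
    """Pearson correlation coefficient (PCC) of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant profile carries no correlation signal
    return sxy / (sxx * syy) ** 0.5


def shuffled_pccs(pairs, accessibility, expression, seed=0):
    """Null PCCs obtained by randomly reassigning genes among the cCRE-gene pairs."""
    rng = random.Random(seed)
    genes = [gene for _, gene in pairs]
    rng.shuffle(genes)
    return [pearson(accessibility[c], expression[g])
            for (c, _), g in zip(pairs, genes)]


def empirical_fdr_cutoff(observed, null, fdr=0.01):
    """Smallest observed PCC t whose estimated FDR is at most `fdr`.

    Estimated FDR (an assumption of this sketch) = fraction of null PCCs >= t
    divided by fraction of observed PCCs >= t.
    """
    obs, nul = sorted(observed), sorted(null)
    for t in obs:
        frac_obs = (len(obs) - bisect_left(obs, t)) / len(obs)
        frac_null = (len(nul) - bisect_left(nul, t)) / len(nul)
        if frac_null / frac_obs <= fdr:
            return t
    return None


# Hypothetical toy data: values across five cell subclasses.
accessibility = {"cCRE_1": [0, 1, 2, 3, 4], "cCRE_2": [4, 1, 0, 2, 1]}
expression = {"GeneA": [1, 2, 3, 5, 6], "GeneB": [3, 3, 1, 0, 2]}
pairs = [("cCRE_1", "GeneA"), ("cCRE_2", "GeneB")]
observed = [pearson(accessibility[c], expression[g]) for c, g in pairs]
null = shuffled_pccs(pairs, accessibility, expression)
```

In practice the PCCs would be computed across all matched subclasses for every co-accessible distal cCRE and gene pair, and the cutoff returned by `empirical_fdr_cutoff` would separate the positively correlated pairs from the rest.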
Fig. 3 | Integrative analysis to identify the potential enhancer–gene connections across the whole mouse brain. a, Schematic of the computational strategy used to identify cCREs that are positively correlated with the mRNA expression of the target genes; PCCs were calculated across 275 cell subclasses between the snATAC–seq and scRNA-seq data. Co-accessible cis-regulatory DNA interactions were predicted using Cicero44 for each cell subclass. b, In total, 613,485 pairs (red) of positively correlated cCRE–gene pairs were identified (FDR < 0.01). The grey-filled curve shows the distribution of PCCs for randomly shuffled cCRE–gene pairs. c, The chromatin accessibility of putative enhancers (left); mRNA expression of the linked genes in the 275 cell subclasses across the whole mouse brain (middle); and the enrichment of known TF motifs in distinct enhancer–gene modules (right). A total of 428 out of 440 known motifs from HOMER45 with enrichment P < 10^−10 are shown. The unadjusted P values were calculated using two-sided Fisher’s exact tests.
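As a worked illustration of the two-sided Fisher's exact test named in the caption, the sketch below tests whether a motif occurs more (or less) often in the cCREs of one module than in background cCREs. The function name and the 2×2 counts are hypothetical, and the two-sided rule used here (summing all tables no more probable than the observed one) is one common convention rather than a statement of how the original analysis was implemented.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test P value for the 2x2 table [[a, b], [c, d]].

    Rows: module cCREs / background cCREs.
    Columns: motif present / motif absent.
    """
    row1, row2, col1 = a + b, c + d, a + c
    total_tables = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x motif-containing cCREs in the module.
        return comb(row1, x) * comb(row2, col1 - x) / total_tables

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum every table that is at most as probable as the observed one;
    # the small relative tolerance guards against floating-point ties.
    p = sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-7))
    return min(1.0, p)


# Hypothetical counts: 40 of 100 module cCREs carry the motif,
# compared with 50 of 900 background cCREs.
p_value = fisher_exact_two_sided(40, 60, 50, 850)
```

A motif would then be reported for a module only if its P value fell below the chosen threshold (P < 10^−10 in the text).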
We next applied CellOracle52 to the snATAC–seq and scRNA-seq data (Methods and Extended Data Fig. 11a,b) for GRN analysis. To achieve this, the subclass-specific distal cCREs detected using Cicero above were first mapped to mouse TFs based on TF-binding motifs using the tool gimmemotifs53. A regularized linear regression model was then used to predict the gene expression at the single-cell level on the basis of the mapped TF-motif instances surrounding each gene promoter and generate GRNs for each subclass. The 3,000 most variable genes across all of the subclasses from the scRNA-seq data using Seurat and 499 TFs reported to have essential roles in defining cell subclasses in the scRNA-seq data5 were included for this analysis. Finally, we successfully inferred GRNs for 267 out of 275 cell subclasses (one example of GRN from the subclass ASC-TE_NN, that is, astrocytes from the telencephalon region, is shown in Fig. 4a). The resulting GRNs contained a total of 403 TFs and 2,628 non-TF genes (Methods and Supplementary Table 17). As expected, the connectivity of the nodes follows a power-law distribution54 (Fig. 4b) in 266 of 267 of them (Extended Data Fig. 11c). On average, each GRN owned 312 TFs and 681 genes (Fig. 4c).

Recurring network motifs are a common feature of GRNs55. We compared the 17 common network motifs56 in each of the above GRNs (Methods and Supplementary Table 18) across different cell classes defined in the scRNA-seq data (Extended Data Figs. 11d,e and 12a) and across different brain regions (Methods, Extended Data Fig. 12b and Supplementary Table 19). We first mapped the 267 subclasses to five main regions, that is, the telencephalon (isocortex, OLF, AMY, STR, PAL), diencephalon (TH, HY), hindbrain (pons, MY), MB and CB, only if at least 60% (248 subclasses left) of the cells in the subclass could be mapped to these regions, and identified regulated double-positive motifs (TF A increases the expression of both TF B and TF C, and TF B and TF C can positively regulate each other) (Fig. 4d and Supplementary Table 20). The GRN from BAMs (BAM_NN; Fig. 4e) includes a regulated double-positive motif composed of activating transcription factor 3 (ATF3), KLF4 and TAL1, indicating that the three factors may positively regulate each other in the BAM subclass. ATF3 is an inflammatory mediator and a key regulator of interferon response in macrophages57. KLF4 from the Kruppel-like family of factors has an essential role in monocyte differentiation58, and is a mediator of proinflammatory signals in macrophages59. The Tal1 gene, which encodes a basic helix-loop-helix TF, is expressed during monocyte–macrophage lineage differentiation and has an important role in cell cycle progression and proliferation during monocytopoiesis60,61. Using the Cistrome Data Browser62 as a resource for chromatin immunoprecipitation followed by sequencing data, we noticed that ATF3 binds to putative enhancers near both Tal1 and Klf4 in bone-marrow-derived macrophages (Gene Expression Omnibus: GSE99895; Extended Data Fig. 12c,d). Overall, non-neuronal cells showed higher numbers on several network motifs (such as the regulated double-positive motif) compared with Glut neurons and GABAergic neurons (Fig. 4e and Extended Data Figs. 11d and 12a).

Furthermore, we highlighted the importance of key TFs within these networks by calculating their eigenvector centrality scores using CellOracle. In Fig. 4f, the 267 subclasses and 226 TFs were ordered in the same manner as described in the companion paper5 (Supplementary Table 21). Notably, we observed a similar pattern of importance scores for the TFs as seen in the scRNA-seq data, where normalized gene expression was shown. This consistency of the TF signatures across modalities reinforced the fidelity of our GRN inferences. It also demonstrated how regulatory codes of TFs across the whole mouse brain could be revealed through integrated analysis of snATAC–seq and scRNA-seq data.

TFs such as JUN, JUNB and FOS have high importance scores across multiple neuronal and non-neuronal subclasses. TFs of the bHLH family such as NEUROD1, NEUROD2, NEUROD6 and BHLHE22 have
Fig. 4 | Inference of subclass-specific GRNs across the whole mouse brain. a, Example of the GRN inferred in telencephalon-region astrocyte (ASC-TE_NN) using CellOracle52. Edges are weighted and directed to reflect the putative regulation strength and mode (inhibition or activation). b, The degree distribution of the GRN in a. P(k), the probability of a node having k degree in the GRN. The degree of one node is the number of other nodes with links to it. c, The number of TFs, the number of genes, the number of regulated TFs per gene and the number of genes regulated by the TFs among the GRNs for each of 267 cell subclasses. The numbers of dots in each box plot from left to right are as follows: 267, 267, 185,000 and 82,000. For the latter two plots, we treated TFs and genes from different subclasses as different ones. For the box plots in c, the box limits span the first to third quartiles, the centre line denotes the median and the whiskers show 1.5× the interquartile range. d, Normalized histograms of the number of the regulated double-positive56 network motifs for each main cell class. The lines are the kernel-based density curves fitted for different histograms. e, Histograms of the two network motifs for five mouse brain regions: telencephalon (isocortex, OLF, HPF, STR, PAL and AMY), diencephalon (TH and HY), MB, hindbrain (MY and pons) and CB. f, Heat map of eigenvector-based centralities or importance scores of TFs in each of the subclass-specific GRNs. Each row represents a TF, and each column a subclass. The orders of the TFs and subclasses are based on the companion paper5 for the similar heat map but using the scRNA-seq data. The names of the rows and columns are listed in Supplementary Table 18.
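Two of the network quantities discussed for the GRNs, the regulated double-positive motif and eigenvector centrality, can be sketched in plain Python. This is an illustration only: the edge weights are invented, the centrality is computed on an undirected, absolute-weight version of the network (an assumption of this sketch), and CellOracle's own implementation may differ.

```python
from itertools import combinations


def regulated_double_positive(edges):
    """Find motifs where A activates B and C, and B and C activate each other.

    `edges` is a list of (source, target, weight); weight > 0 means activation.
    Returns sorted (A, B, C) tuples with B < C.
    """
    positive = {(s, t) for s, t, w in edges if w > 0 and s != t}
    targets = {}
    for s, t in positive:
        targets.setdefault(s, set()).add(t)
    motifs = []
    for a, ts in targets.items():
        for b, c in combinations(sorted(ts), 2):
            if (b, c) in positive and (c, b) in positive:
                motifs.append((a, b, c))
    return sorted(motifs)


def eigenvector_centrality(edges, iterations=200, tol=1e-12):
    """Eigenvector centrality by power iteration on the undirected |weight| graph."""
    adj = {}
    for s, t, w in edges:
        if s == t:
            continue
        adj.setdefault(s, {})
        adj.setdefault(t, {})
        adj[s][t] = adj[s].get(t, 0.0) + abs(w)
        adj[t][s] = adj[t].get(s, 0.0) + abs(w)
    if not adj:
        return {}
    x = {n: 1.0 for n in adj}
    for _ in range(iterations):
        # Iterating with (A + I) avoids oscillation on bipartite graphs.
        new = {n: x[n] + sum(w * x[m] for m, w in adj[n].items()) for n in adj}
        norm = sum(v * v for v in new.values()) ** 0.5
        new = {n: v / norm for n, v in new.items()}
        if max(abs(new[n] - x[n]) for n in adj) < tol:
            return new
        x = new
    return x


# Toy network using the TFs named in the text; the weights are hypothetical.
bam_edges = [
    ("ATF3", "KLF4", 0.8),
    ("ATF3", "TAL1", 0.6),
    ("KLF4", "TAL1", 0.5),
    ("TAL1", "KLF4", 0.4),
]
motifs = regulated_double_positive(bam_edges)
scores = eigenvector_centrality(bam_edges)
```

Counting such motifs per subclass and comparing the counts across classes or regions would give distributions of the kind summarized in Fig. 4d,e.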
high importance scores for many types of neurons such as the Glut neurons in the isocortex region. Our analysis also indicated potential regulation of gene expression in GABAergic neurons by TFs such as ARX, SP8 and SP9 in the telencephalon regions, whereas TFs such as GATA2, TAL1 and GATA3 showed high importance scores for GABAergic neurons in the MB and pons regions. TCF7L2, SHOX2 and EBF1 had high importance scores associated with Glut neurons specifically in the TH region. Moreover, TCF7L2 exhibited high importance in the MB region. Next, we observed that the TFs FOXA1 and FOXA2 had a specific association with the Glut neurons in the MB region. HOX-family TFs
Fig. 5 | Analyses of chromatin accessibility at TEs of cCREs. a, Schematic of mouse-specific and orthologous cCREs. The bar plot shows the numbers of mouse-specific and orthologous cCREs. b, The fraction of the genomic distribution of mouse-specific and orthologous cCREs. c, The fraction of cCREs overlapping with TEs in each subclass of Glut neurons, GABAergic neurons, dopaminergic neurons, cholinergic neurons, serotonergic neurons, glycinergic neurons and non-neurons. The two curves show the Gaussian distribution from the mixture model. highTE-Glut refers to the Glut neuron subclasses with a high percentage of their cCREs overlapping with TEs. d, Gene Ontology (GO) analysis revealing an enrichment of neuronal-specific functions among genes that exhibited positive correlations with TE-cCREs (TE-related cCREs) in highTE-Glut subclasses, compared with genes positively correlated with TE-cCREs in all subclasses. e, GO analysis revealing an enrichment of neuronal-specific functions among genes that exhibited positive correlations with TE-cCREs in highTE-Glut subclasses, compared with genes positively correlated with all cCREs in highTE-Glut subclasses. f, DCA at TE-cCREs in highTE-Glut subclasses compared with other subclasses. The top ten DCA TE-cCREs correlating with synaptic-related genes are shown. The top ten DCA TE-cCRE–gene pairs (such as L1MB8–Cdkl5) are indicated by red boxes. The superfamilies of the top ten DCA TE-cCREs are indicated by different shapes. g, The top three motif families enriched in the DCA TE-cCREs in highTE-Glut neurons. The unadjusted P values were calculated using two-sided Fisher’s exact tests. h, Genome browser tracks of aggregate chromatin accessibility profiles for NN, GABA, highTE-Glut and other Glut subclasses at selected DCA TE-cCREs and gene pairs. RNA signals shown here were collected from the previous study2. PDC, proximal–distal connections.
displayed high importance scores in both GABAergic and Glut neurons in the MY region. Last, MAF and MAFB showed high importance scores in GABAergic neurons in the cortex region.

Conservation of the mouse brain cCREs
To investigate the conservation of the gene regulatory landscapes in mouse brain cells, we compared the mouse brain cCREs defined in this study with a separate study of single-cell chromatin accessibility in 42 human brain regions23. We first identified orthologues of mouse cCREs in the human genome by performing reciprocal homology searches and found 613,073 cCREs (58% of total mouse cCREs) defined in mouse brains to have orthologous sequences in the human genome (more than 50% of bases lifted over to the mouse genomes) (Fig. 5a and Extended Data Fig. 13a). The percentage of orthologous cCREs is significantly higher than the random expectation (32% orthologous
Fig. 6 | Deep-learning models predict chromatin accessibility in different brain cell types from the DNA sequence. a, Schematic of the deep-learning (DL) model Basenji for predicting chromatin accessibility. b, The number of subclasses of each cell class in the training dataset. c, The accuracy (Pearson correlation) of each class. n = 93 (GABA), n = 111 (Glut) and n = 17 (NN) subclasses. d, The AUROC was calculated for representative subclasses by comparing the peaks called from predicted genomic signals with the peaks called from real experimental signals. e, The model’s ability to predict cell-type-specific patterns of open chromatin. The coefficient of variance (variance/mean) across cell types was compared with the Pearson r calculated between true signals and the predicted signals across cell subclasses. Each dot represents one cCRE in the testing set. f, True signals from ATAC–seq data in mouse cell subclasses were compared with the predicted chromatin accessibility in the test set. Representative loci near Nr4a2, Pou4f2, Ecel1, Hopx, Apoe and Pf4 are shown. g, Schematic of predicting potential chromatin accessibility signals using human DNA sequence as inputs. h, The AUROC was calculated for matched human cell types. n = 26 cell types for the human brain dataset. i, The Pearson r of true signals and the predicted signals across cell types for all tested cCREs, tested distal cCREs and tested proximal cCREs. The numbers of overall, distal and proximal cCREs are 452,531, 437,207 and 15,324, respectively. j, True signals captured from ATAC–seq analysis in human cell types and predicted chromatin accessibilities are shown at representative genomic loci near the genes CUX2, GAD2, DRD1 and OLIG1. Cell-type-specific cCREs are highlighted in grey. For the box plots, the box limits span the first to third quartiles, the centre line denotes the median and the whiskers show 1.5× the interquartile range.
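The AUROC values reported in panels d and h can be illustrated with a short rank-based computation. This is a sketch under stated assumptions: each genomic bin is given a binary label (1 if it overlaps a peak called from the experimental signal) and a predicted score, and the AUROC is obtained from the Mann–Whitney U statistic. The function name and the example values are hypothetical and do not reproduce the authors' evaluation code.

```python
def auroc(labels, scores):
    """Area under the ROC curve from binary labels and predicted scores.

    Uses the rank-sum (Mann-Whitney U) identity with average ranks for ties.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUROC needs at least one positive and one negative")
    pairs = sorted(zip(scores, labels))
    rank_sum_pos = 0.0
    i = 0
    while i < len(pairs):
        j = i
        while j < len(pairs) and pairs[j][0] == pairs[i][0]:
            j += 1
        # Tied scores occupy ranks i+1 .. j and share their average rank.
        avg_rank = (i + 1 + j) / 2
        rank_sum_pos += avg_rank * sum(label for _, label in pairs[i:j])
        i = j
    u_stat = rank_sum_pos - n_pos * (n_pos + 1) / 2
    return u_stat / (n_pos * n_neg)


# Hypothetical example: four bins, two of which overlap experimental peaks.
example_labels = [0, 0, 1, 1]
example_scores = [0.1, 0.4, 0.35, 0.8]
example_auroc = auroc(example_labels, example_scores)
```

A value near 1 indicates that bins overlapping experimental peaks consistently receive higher predicted signal than the remaining bins, and a value near 0.5 indicates no discrimination.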
Processing and alignment of sequencing reads
Paired-end sequencing reads were demultiplexed and the cell index was transferred to the read name. Sequencing reads were aligned to the mm10 reference genome using bwa84. After alignment, we checked the fragment length distribution, which is characteristic for ATAC–seq libraries (Extended Data Fig. 2e), for each of the 234 samples. We then combined the sequencing reads to fragments using the make_fragment_file function of SnapATAC229 and, for each fragment, we applied the following quality control criteria: (1) retain only fragments with quality scores MAPQ > 30; (2) remove PCR duplicates. Reads were also sorted on the basis of cell barcodes in read names, and shifted +4 bp for positive strand and −5 bp for negative strand to correct the 9 bp duplication induced from Tn5 transposase85 during processing.

TSSe calculation
Enrichment of ATAC–seq accessibility at TSSs was used to quantify data quality without the need for a defined peak set. We followed a previously described procedure86, and used the function filter_cells in SnapATAC2 to calculate TSS enrichment (TSSe). TSS positions were obtained from the GENCODE87 database v.16. In brief, Tn5-corrected insertions (reads aligned to the positive strand were shifted +4 bp and reads aligned to the negative strand were shifted –5 bp) were aggregated ±2,000 bp relative (TSS-strand-corrected) to each unique TSS genome wide. This profile was then normalized to the mean accessibility ±1,900–2,000 bp from the TSS and smoothed every 11 bp. The maximum of the smoothed profile was taken as the TSSe.

Feature selection. We applied the function add_tile_matrix from SnapATAC2 to extract the cell by genomic bin count matrix. The size of a consecutive genomic region was chosen as 500 bp. We filtered out any bins overlapping with the ENCODE blacklist and removed the top 0.5% and bottom 0.5% of bins based on the read coverage from the count matrix. Only chromosomes 1–19, X and Y were considered. For our L1-level clustering, we used all of the bin features (over 4 million) that passed the criteria above as non-neuronal cells and diverse neuronal cells were all included. For clustering of other levels, we chose the default top 500,000 features using the function select_features of SnapATAC2.

Dimensionality reduction. We applied the function spectral from SnapATAC2 to convert the high-dimension sparse 500 bp genomic bin features per cell into low-dimensional representations, which used spectral embedding of the normalized graph Laplacian defined by the cell-to-cell similarity matrix using cosine distance. For L1-level and L2-level clustering, we chose 50 as the dimension of the low-dimensional representation space as usually a large number of cells and potentially diverse cell types were involved in the two levels. We used an ‘elbow plot’ to rank all of the principal components to make sure that the top 50 components were sufficient for our analysis. For later analysis, we chose 30 instead. The parameter ‘weighted_by_sd’ in the function spectral was set to be true for all dimensionality reductions. We did not use the parameter ‘sample_size’ in the function spectral, so no approximation method was used for the spectral embedding. For 2.3 million cells, it took about 300 GB of memory in our high-performance computing system88.
Nucleus filtering by quality control Graph-based clustering. We then applied the function knn from
Nuclei with ≥1,000 uniquely mapped fragments and TSSe ≥ 10 were SnapATAC2 to construct the k-nearest neighbour graph using the
filtered for each of 234 samples according to the ENCODE ATAC–seq parameter n_neighbors = 50 and the parameter method was set to
data standards and process pipeline (https://www.encodeproject.org/ ‘kdtree’. We next used the function leiden of SnapATAC2 for clustering
atac-seq/). We used the filter_cells function of SnapATAC2 to achieve with the parameter object_function set as modularity. The parameter
this. resolution, which affected the number of clusters a lot, was selected
from 0.1 to 2 with a step size 0.1 based on the silhouette coefficient89
Doublet removal using the Python package Scikit-learn90. We also manually checked
We used a modified version of Scrublet28 to remove potential doublets the UMAP81 for each clustering result to make sure that the resolution
for every sample independently using SnapATAC2. First, we used the was suitable corresponding to the top silhouette coefficient. UMAP
projections were calculated using the Python package umap with the
parameters a as 1.8956, b as 0.8005 and init as spectral. All of the resolution
parameters used during clustering are provided in Supplementary
Table 3. In our later analysis, we used the term subtypes to represent
all of the final clusters from L3-level clustering and L4-level clustering.

Integration analysis with scRNA-seq data
We performed integration analysis of the 1,482 subtypes with all of the over
5,300 clusters reported in a companion scRNA-seq study of 4.5 million
cells for the whole adult mouse brain5. Only cells from male mice
were considered in the scRNA-seq data, which amounted to over 2 million cells.
The scRNA-seq data are mainly from 10x v.2 and 10x v.3 platforms,
and only a few thousand cells are from snRNA-seq. On the basis of our
integration analysis, we did not see significant differences between
using 10x v.3 alone and using all of them. Very few cell clusters were
found using the 10x v.2 but not using the 10x v.3 platform. We therefore
used all of the cells without distinguishing their platform information
in the later analysis.
We first imputed RNA expression levels according to the chromatin
accessibility of the gene promoter (up to 2 kb to TSSs) and gene body
as described previously32 using the function make_gene_matrix in
SnapATAC2. We next performed integration analysis using Seurat32 for
neuronal cells and non-neuronal cells separately. For neuronal cells, in
the scRNA-seq data, we randomly selected 50 cells for each of over 5,100
clusters, and finally got more than 200,000 cells. To have a comparable
number of cells in our snATAC–seq data, we randomly selected 150 cells
for each of over 1,260 L4-level neuronal subtypes and got over 180,000
nuclei. For non-neuronal cells, we sampled 500 cells per cluster and got
35,000 cells in the scRNA-seq data. For the snATAC–seq, we sampled
300 cells per L4-level subtype, and got over 57,000 nuclei.
For the variable features, we applied the >8,000 genes from differential
expression analysis in the scRNA-seq study5, and used their data
as the reference. We next applied canonical correlation analysis for
integration using Seurat v.5. Canonical correlation analysis was recommended
for cross-modality integration, and indeed showed more
promising results than reciprocal principal component analysis in our
experiments. Seurat v.5 is specifically designed to handle large-scale
datasets and is especially important for our scenario. We used the
function FindTransferAnchors with the parameter k.anchor as 50 for
single-cell level label transfer. k.anchor is important for large-scale
data integration as mentioned in Seurat. The default k.anchor value
is 5 for that function, and we tested k.anchor as 5, 10, 30, 50, 70, 100
and 120; a k.anchor value of 50 showed more reliable results compared
with others. For UMAP visualization, we used the FindIntegrationAnchors
function of Seurat, and then calculated UMAP based on the
co-embedding space. It was also recommended by Seurat to perform
integration in this manner. The transfer-label score for a given L4-level
subtype in our snATAC–seq data is a numeric vector, in which each element
is the number of cells annotated as the corresponding cluster
in the scRNA-seq data divided by the number of cells in that L4-level
subtype. For each L4-level subtype, we used the corresponding top
3 clusters in the scRNA-seq data as the candidate annotations, then
mapped the three clusters to the subclasses defined in the scRNA-seq
data, and manually checked whether they were consistent on mouse
brain major regions and gene markers.

Identification of reproducible peak sets in each cell cluster
We performed peak calling according to the ENCODE ATAC–seq pipeline
(https://www.encodeproject.org/atac-seq/) on 1,482 L4-level subtypes
and used the same procedure to filter the peaks at both the bulk
and single-cell level (Extended Data Fig. 9a) as in our previous study19.
Before calling peaks, we merged clusters with fewer than 200 cells
if they shared the same cell cluster annotation based on the
integration analysis above and were in the same L3-level cluster. Next,
1,463 subtypes (including merged ones) were used.
For every cell cluster above, we combined all properly paired reads
to generate a pseudobulk ATAC–seq dataset for individual biological
replicates. Moreover, we generated two pseudoreplicates comprising
half of the reads from each biological replicate. We called peaks for
each of the four datasets and a pool of both replicates independently.
Peak calling was performed on the Tn5-corrected single-base insertions
using MACS2 (ref. 36) with the following parameters: --shift -75 --extsize 150
--nomodel --call-summits --SPMR -q 0.01. Finally, we extended peak
summits by 250 bp on either side to a final width of 501 bp for merging
and downstream analysis. If the number of cells in any of the pseudobulk
ATAC–seq datasets from either individual biological replicates or individual
pseudoreplicates was fewer than 200, we did not run MACS2 for it. We
did this to reduce the potential false negatives during the next filtering
step induced by the limited number of cells in the replicates.
To generate a list of reproducible peaks, we retained peaks that
(1) were detected in the pooled dataset and overlapped ≥50% of peak
length with a peak in both individual replicates or (2) were detected in
the pooled dataset and overlapped ≥50% of peak length with a peak in
both pseudoreplicates.
We found that, when the cell population varied in read depth or number
of nuclei, the MACS2 score varied proportionally due to the nature
of the Poisson distribution test in MACS2 (ref. 19). Ideally, we should perform
a reads-in-peaks normalization but, in practice, this type of normalization
is not possible because we do not know how many peaks we will
get. To account for differences in the performance of MACS2 based on
read depth and/or number of nuclei in individual clusters, we converted
MACS2 peak scores (−log10[q]) to score per million (SPM; ref. 37). We filtered reproducible peaks
by choosing an SPM cut-off of 5.
We then retained only reproducible peaks on chromosomes 1–19 and
both sex chromosomes and filtered ENCODE mm10 blacklist regions.
A union peak list for the whole dataset was obtained by merging peak
sets from all of the cell clusters using BEDtools91.
Finally, as snATAC–seq data are very sparse, we selected only elements
that were identified as open chromatin in a significant fraction
of the cells in each cluster. To this end, we first randomly selected the
same number of non-DHS regions from the genome as background
using the shuffleBed function of BEDtools, and calculated the fraction
of nuclei for each cell type that showed a signal at these sites. We next
fitted a zero-inflated β-model, and empirically identified a significance
threshold of FDR < 0.01 to filter potential false positive peaks. Peak
regions with FDR < 0.01 in at least one of the clusters were included in
downstream analysis. Given one cell subclass, we treated all of the peaks
from the subtypes mapped to this subclass as the peaks for the subclass.

Identification of cis-regulatory modules
We used NMF42 to group cCREs into cis-regulatory modules on the basis
of their relative accessibility across major clusters. We adapted NMF
(Python package sklearn90) to decompose the cCRE-by-cluster matrix V
(N × M, N rows: cCRE, M columns: cell clusters) into a coefficient matrix
H (R × M, R rows: number of modules) and a basis matrix W (N × R), with
a given rank R (ref. 19):
$V \approx WH$
The basis matrix defines module-related accessible cCREs, and
the coefficient matrix defines the cell cluster components and their
weights in each module. The key issue in decomposing the occupancy
profile matrix was to find a reasonable value for the rank R (that is, the
number of modules). Several criteria have been proposed to decide
whether a given rank R decomposes the occupancy profile matrix
into meaningful clusters. Here we applied two measurements, Sparseness92
and Entropy42, to evaluate the clustering result. Average values
were calculated from five NMF runs at each given rank with a random
seed, which ensures that the measurements are stable (Extended Data
Fig. 9f).
We next used the coefficient matrix to associate modules with distinct
cell clusters. In the coefficient matrix, each row represents a module,
and each column represents a cell cluster. The values in the matrix
indicate the weights of the clusters in their corresponding module.
The coefficient matrix was then scaled by column (cluster) from 0 to 1.
Subsequently, we used a coefficient > 0.1 (~95th percentile of the whole
matrix) as a threshold to associate a cluster with a module.
Moreover, we associated each module with accessible elements using
the basis matrix. For each element and each module, we derived a basis
coefficient score, which represents the accessible signal contributed
by all clusters in the defined module. We also implemented and calculated
a basis-specificity score called feature score for each accessible
element using the kim method42. The feature score ranges from 0
to 1. A high feature score means that a distinct element is specifically
associated with a specific module. Only features that fulfil both of the following
criteria were retained as module-specific elements: (1) feature
score greater than median + 3 s.d.; (2) the maximum contribution to
a basis component is greater than the median of all contributions
(that is, of all elements of W).

Inference of cis-co-accessible cCREs
Cis-co-accessible cCREs were predicted for all open regions in each
of the 275 cell subclasses separately using Cicero for Monocle 3 (refs. 72,93)
with the default parameters and the mouse mm10 genome, scanning
the mouse genome with a window size of 500 kb. For each subclass,
we randomly selected 5,000 nuclei, and used all of the nuclei for cell
clusters with <5,000 nuclei. Only one subclass failed while running
Cicero, which was annotated as ‘Hypendymal_NN’ with 92 nuclei in
total, and showed the smallest number of peaks (less than 5,000) of
all of the subclasses. To find an optimal co-accessibility threshold for
each subclass, we randomly shuffled the columns of the cell-by-cCREs
matrix (that is, the cCREs) in the cells as the background and identified
co-accessibility regions from this shuffled matrix. A normal distribution
was then used to fit the co-accessibility scores from the shuffled
background using the R package fitdistrplus94. Co-accessibility cCREs
were retained only if their co-accessibility scores were significantly
larger than the background (FDR < 0.001 using Benjamini–Hochberg
adjustment). cCREs outside of ±1 kb of TSSs in GENCODE mm10 (v.23)
were treated as distal cCREs, others as proximal ones. All of
the cis-co-accessibility cCREs were then grouped into three classes:
proximal-to-proximal, distal-to-distal and distal-to-proximal pairs. In
our study, we focused only on distal-to-proximal pairs.

Enrichment analysis of FIREs
We called frequently interacting regions (FIREs) in the mouse cortex38
by applying the criteria in our group’s FIRE paper95. The result showed
that most FIREs (3,158 out of 3,169) overlap with cCREs in the mouse
brain, and a fraction of the cCREs (71,626 out of 1,053,811) overlap with
FIREs (Extended Data Fig. 10e).
We next tested whether cCREs are enriched at FIREs through permutation
analysis. In brief, we shuffled the mouse genome 1,000 times,
each time generating 1,053,811 random regions with equivalent sizes
as the cCREs. We then calculated the number of overlaps between
the randomly generated regions and the FIREs during each shuffle.
We found that cCREs are significantly enriched at FIREs (P < 0.001;
Extended Data Fig. 10f), with the actual number of overlaps on FIREs
substantially higher than expected.

Motif enrichment
We performed both de novo and known motif-enrichment analysis
using Homer45.

Enrichment analysis of chromatin conformation
We cross-referenced the dataset from the companion study41, in which
a comprehensive chromatin conformation/methylome joint profile
throughout the adult mouse brain is described, and most of the subclass
annotations (244 subclasses of 275 subclasses in our data) are shared
between these two datasets.
To evaluate the confidence of identified subclass-specific cCRE–gene
pairs, we randomly selected 11 major subclasses (Sst_GABA, Pvalb_GABA,
CBX_MLI_Megf11_GABA, Vip_GABA, CA1-ProS_Glut, CB_granule_Glut,
L6_CT_CTX_Glut, L2-3_IT_CTX_Glut, Astro-TE_NN, Microglia_NN,
Bergmann_NN), and calculated the Hi-C signal enrichment (at 1 kb
resolution) at the top 20% subclass-specific cCRE–gene pair anchors
identified in this study. We found statistically significantly
higher enrichment (P = 0.004) of chromatin interaction signal at the
corresponding subclass-specific cCRE–gene pair anchors, compared
with non-corresponding pair anchors (Extended Data Fig. 10g), suggesting
that subclass-specific cCRE–gene pairs are more likely to interact
in the cell types in which the cCREs are active.
Meanwhile, we selected the two peak modules that show global accessibility
across the subclasses based on the NMF analysis (Fig. 2f (top
left)). We then selected all of the proximal–distal connections with
cCREs in the peak modules above and ranked the proximal–distal connections
based on the highest Cicero scores they have. We treated them
as global proximal–distal connections and calculated the Hi-C signals
by aggregating all of the Hi-C data. From the heat maps (Extended Data
Fig. 10h), we observed strong enrichment signals for the global
proximal–distal connections.

Predicting GRNs for each cell subclass
We adapted the recently published Python package CellOracle52 to
our data to infer GRNs for each cell subclass across the whole mouse
brain based on our integration analysis between our snATAC–seq data
and the scRNA-seq data5. Three steps were followed. First, we identified
the co-accessible distal-to-proximal pairs for each subclass, as described
above. Second, we mapped the distal cCREs to
TFs. Lastly, we identified the regulatory relationships between TFs
and the potential target genes by fitting a regularized linear regression
model using scRNA-seq data. For the second step, according to
the CellOracle tutorial, we used the Python package gimmemotifs53
for the TF-binding-motif scan with the mouse genome mm10 and the
default motif database provided by CellOracle. The proximal cCREs
were mapped to the genes based on GENCODE mm10 (v.23, the same as
above). We used Seurat32 to randomly sample 1,000 cells per subclass
(all of the cells of a cell subclass were used if it had <1,000 cells). To
select the variable features, we used the FindVariableFeatures
function of Seurat to select the top 3,000 genes, and then we manually
added the 499 TFs (if any of them were missed in the previous 3,000
genes) that were reported in the scRNA-seq data of ref. 5. For each subclass,
we ran CellOracle on the scRNA-seq data with the default
parameters. We used P < 0.001 and the top 10,000 edges based on
the absolute values of the weights to filter the predicted interactions
between TFs and genes as suggested by CellOracle. Finally, GRNs were
successfully predicted for 267 out of 275 subclasses.

Sequence conserved, chromatin accessibility conserved and
mouse-specific cCREs
The orthologous cCREs of the mouse brain in the human genome
were identified by performing reciprocal homology searches using
the liftOver tool96. The mouse cCREs for which human genome
sequences had high similarity (more than 50% of bases lifted over
to the mouse genome) were defined as orthologous cCREs. We next
compared these orthologous cCREs in the mouse brain with our
previously identified cCREs in the human brain23. Those orthologous
cCREs that were both DNA sequence conserved across species
and had open chromatin in orthologous regions were defined
as chromatin-accessibility-conserved cCREs. The other orthologous
cCREs, which were only sequence conserved to orthologous regions
but had not been identified as open chromatin regions in other
species, were defined as chromatin-accessibility-divergent cCREs.
Mouse-specific cCREs were those for which no orthologous
regions could be found in the human genome.
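The three-way classification of mouse cCREs described above (chromatin-accessibility-conserved, chromatin-accessibility-divergent and mouse-specific) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the pipeline used in the study: the LiftoverResult record and the classify_ccre function are hypothetical names, the liftOver run itself is not shown, and treating a cCRE with 50% or fewer lifted bases as mouse-specific is our reading of the text.

```python
from dataclasses import dataclass
from typing import Optional

# Threshold taken from the text: "more than 50% of bases lifted over".
MIN_LIFTED_FRACTION = 0.5


@dataclass(frozen=True)
class LiftoverResult:
    """Hypothetical summary of a reciprocal liftOver for one mouse cCRE."""
    frac_bases_lifted: float   # fraction of bases that lifted over (0 to 1)
    overlaps_human_ccre: bool  # lifted region overlaps a human brain cCRE


def classify_ccre(result: Optional[LiftoverResult]) -> str:
    """Assign a mouse cCRE to one of the three conservation classes.

    `result` is None when no orthologous region was found at all.
    """
    # Assumption: cCREs failing the >50% criterion count as mouse-specific.
    if result is None or result.frac_bases_lifted <= MIN_LIFTED_FRACTION:
        return "mouse-specific"
    # Orthologous sequence that is also open chromatin in human.
    if result.overlaps_human_ccre:
        return "chromatin-accessibility-conserved"
    # Orthologous sequence without open chromatin in human.
    return "chromatin-accessibility-divergent"


if __name__ == "__main__":
    examples = {
        "no_ortholog": None,
        "conserved": LiftoverResult(0.9, True),
        "divergent": LiftoverResult(0.7, False),
    }
    for name, res in examples.items():
        print(name, classify_ccre(res))
```

Keeping the threshold as a named constant makes it easy to check how sensitive the class sizes are to the 50% cut-off.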

TE analysis
The TE annotation of cCREs was performed using Homer45 and UCSC
mm10 refGene and RepeatMasker annotation. To define the high
TE-cCRE fraction of subclasses, we fitted a mixture model for the
TE-cCRE fraction across all subclasses using the R package mixtools97
(v.2.0.0). The P value was calculated based on the null distribution.
To annotate the TE-cCREs, we used two strategies. One was based
on the genomic regions. We mapped the TE-cCREs to genes within
3 kb flanking regions using the R package ChIPseeker98 (v.1.34.1).
Another method to link the gene to TE-cCREs was based on the cCREs
and gene correlation. For each GO test, we also filtered unexpressed
genes in defined subclasses based on the single-cell RNA-seq data (see
the companion manuscript5). The DCA of TE-cCREs between groups
was calculated using the Wilcoxon rank-sum test. Motif-enrichment
analysis of TE-cCREs was performed using Homer software using the
‘given size’ parameter.
To analyse the TE-accessible variability with decreased noise, the TE
signal was aggregated from the TE-cCREs. To calculate the correlation
between chromatin accessibility and mCG methylation in TEs across
subclasses, we averaged and normalized the TE-cCRE mCG signal for
each TE in matched subclasses from the companion paper41. To calculate
the correlation between chromatin accessibility and RNA expression,
we aggregated RNA signals at TE-cCREs of each TE in matched
subclasses from a previous study99.

GO enrichment
We performed GO enrichment analysis using the R package clusterProfiler100,101.
The background genes were selected on the basis of the
enrichment analysis and described in text. The P value was computed
using the Fisher exact test and adjusted for multiple comparisons using
the Benjamini–Hochberg method.

Deep-learning model
Our model was trained on all 275 subclasses annotated based on the
integration with the scRNA-seq data. We generated aggregated genome
signal tracks in bigwig format by running MACS2 (ref. 36). The training, validation
and testing datasets were generated using the script
basenji_data.py from Basenji65 with the parameters: “-b mm10.blacklist.bed
-l 131072 --local -p 16 -t 0.1 -v 0.1 -w 128”.
The model architecture, layers and parameters are adapted from the
mouse model from a previous study80, with modification only in the
last output head layer with parameter: “units”: 275. To encourage the
model to predict cCREs in under-represented cell types, we created
one novel loss function:

For training, we set the parameter batch size to 32, epochs to 150 and
patience to 30.
To evaluate the model’s ability to identify cell-type-specific patterns
of cCRE, we compared the Spearman correlation of model predictions
to true accessibility across cell types in all peaks in the test set. We further
compared cross-cell-type correlation to the coefficient of variation
(the ratio of s.d. to mean) of each peak.
We also evaluated the model’s accuracy when applied to human
cell types. We first identified matched human cell types from a previous
study23. For each subclass in human and mouse cCREs, we performed
Spearman correlation across orthologous cCREs (Extended
Data Fig. 16). We next selected pairs based on correlation and annotation
matching. We then used the model to predict chromatin accessibility
in the paired human cell types, across all chromosomes.
We further evaluated this prediction accuracy within and across cell
types.

External datasets
External datasets used were as follows: (1) ENCODE rDHS regions for
both hg19 and mm10 were obtained from the SCREEN database
(https://screen.encodeproject.org)39,40. (2) ChromHMM38,102 states for mouse
brain were downloaded from GitHub
(https://github.com/gireeshkbogu/chromatin_states_chromHMM_mm9) and
coordinates were lifted over with LiftOver
(https://genome.ucsc.edu/cgi-bin/hgLiftOver) to mm10 with the
default parameters96. (3) PhastCons103 conserved elements were downloaded
from the UCSC Genome Browser
(http://hgdownload.cse.ucsc.edu/goldenpath/mm10/phastCons60way/). (4) The ENCODE mm10
blacklist file was downloaded from
http://mitra.stanford.edu/kundaje/akundaje/release/blacklists/mm10-mouse/mm10.blacklist.bed.gz.
(5) Mouse mm10 genome information was downloaded from GENCODE
(https://www.gencodegenes.org/mouse/).

Statistics
No statistical methods were used to predetermine sample sizes.
There was no randomization of the samples, and investigators were
not blinded to the specimens being investigated. However, clustering
of single nuclei based on chromatin accessibility was performed
in an unbiased manner, and cell types were assigned after clustering.
Low-quality nuclei and potential barcode collisions were excluded
from downstream analysis as described above.

Reporting summary
Further information on research design is available in the Nature Portfolio
Reporting Summary linked to this article.
Extended Data Fig. 2 | Quality control metrics of the snATAC-seq datasets
at the bulk level. a, Box plots showing the distribution of mapping ratios (the
fraction of the mapped sequencing reads) in replicates (rep) 1 and 2 of the
snATAC-seq experiments from each brain dissection. b, Box plots showing the
distribution of the number of proper read pairs (reads are correctly oriented)
in rep 1 and 2 of the snATAC-seq experiments. c, Box plots showing the
distribution of numbers of unique chromatin fragments detected in rep 1 and 2
of the snATAC-seq experiments. d, Box plots showing the distribution of the
number of unique barcodes captured in replicates 1 and 2 of snATAC-seq
experiments. In a-d, the number per each boxplot (rep1 or rep2) is 117. In each
boxplot, the box spans the first to third quartiles, the horizontal line denotes
the median, and whiskers show 1.5x the interquartile range. e, Frequency
distribution plot showing the fragment size distribution of each snATAC-seq
sample or datasets (234 samples/datasets in total). f, Heat map showing the
pairwise Spearman correlation coefficients of the mapping correlations of the
bam files between the snATAC-seq datasets. The column and row names consist
of two parts: brain region name and replicate label. Study represents dissections
covered by our previous study (Last) or updated in the current study (New).
Extended Data Fig. 3 | Quality control metrics of the snATAC-seq datasets
at the single-cell level. a, Dot plot illustrating fragments per nucleus and
individual TSS enrichment. Nuclei in the top right quadrant were selected for
analysis (TSS enrichment > 10 and > 1,000 fragments per nucleus). b, Box plots
showing the AUPRCs of AMULET30 and Scrublet28 on the simulated data sets
from the corresponding samples labelled in x axis. Each bar represents the
mean value of 10 random experiments with 1x standard deviation as the error
bar. Two-sided t-tests were used, and *** means P-value < 0.0001. c, Box plots
showing the doublet rates across the samples. Samples were grouped based on
their replicate information. n = 117 biologically independent samples for each
replicate 1 and 2. d, Number of nuclei retained after each step of quality control.
e, Bar plots showing the numbers of nuclei passing quality control for subregions.
f, Box plots showing the TSS enrichments and unique fragments per nucleus for
the replicates in different mouse brain regions. The smallest sample size is ORB
region replicate 1 with n = 4,943 cells, while the largest is PAL-2 replicate 1 with
n = 12,464 cells. In c and f, boxes span the first to third quartiles, horizontal line
denotes the median, and whiskers show 1.5x the interquartile range.
Extended Data Fig. 4 | Iterative clustering for the snATAC-seq data.
a, A multi-stage cell clustering pipeline is organized for all the nuclei passing
our quality control. b, Violin plots showing the number of unique fragments
per nucleus in each cell subclass. c, Violin plots showing the TSS enrichment in
each nucleus of each cell subclass. d, Boxplots of acceptance rates from k-nearest
neighbour batch effect test34 (kBET) for the 275 subclasses. Boxes span the first
to third quartiles, horizontal line denotes the median, and whiskers show 1.5× the
interquartile range. Two-sided t-tests showed no significant P-values between
the values from the two boxes. e, Distribution of the local inverse Simpson’s
index35 (LISI) scores for cells in each subclass.
Extended Data Fig. 5 | Quality and reproducibility of the cell clusters. a, CDF
plot showing the consistency of the estimated fraction of each cell subclass
between the biological replicates. Two-sided Kolmogorov-Smirnov test shows
no significant difference between the biological replicates. b, Box plots of the
P values of two-sided Kolmogorov-Smirnov tests illustrate consistent results
between the two biological replicates for each subclass across major brain
regions, sub-regions and brain dissections tested. n = 12 comparisons for major
regions, n = 41 comparisons for sub-regions and n = 117 comparisons for
dissection regions. c, Heat map showing the pairwise Spearman correlation
coefficients of cell subclass composition between each replicate of brain
dissections. The column and row names consist of two parts: brain region name
and replicate label. For example, CB-1.1 represents the replicate 1 of the first
brain dissection of the cerebellum (CB-1). The embedded box plot shows the
distribution of Spearman correlation coefficients between two biological
replicates, replicates from intra-major brain regions and inter-major brain
regions. Significance is denoted as ***P < 2.2e-16, determined by one-sided
Wilcoxon rank-sum test. n = 22,720 pairs for “intra-major regions” group, n = 4,424
pairs for “inter-major regions” group, n = 117 for “between replicates” group.
Boxes span the first to third quartiles, horizontal line denotes the median, and
whiskers show 1.5x the interquartile range.
Extended Data Fig. 6 | Integration analysis between the snATAC-seq and the
scRNA-seq data for neurons and non-neurons separately. UMAP on the
co-embedding space of neurons from the snATAC-seq data (a) and scRNA-seq
data (b). Colours as major regions. c, The UMAP co-embedding of
non-neuronal cells from the scRNA-seq data and the snATAC-seq data on the
same space coloured by the two modalities. UMAP on the co-embedding space
of non-neurons from snATAC-seq data (d) and scRNA-seq data (e). Colours
as major regions. f, Consensus scores (i.e., transfer-label scores) between
non-neuronal subclasses from the scRNA-seq data and L4-level non-neuronal
clusters from the snATAC-seq data. g, Consensus scores between neuronal
clusters from the scRNA-seq data of Allen Institute and L4-level neuronal clusters
from the snATAC-seq data. h, Consensus score between non-neuronal clusters
from the scRNA-seq data and L4-level non-neuronal clusters from the snATAC-seq
data. i, The 22 non-neuronal subclasses matched to the non-neuronal subclasses
in the scRNA-seq. From left to right, the bar plots represent class, biological
replicate distribution of nuclei, major region distribution of nuclei, number of
clusters, and number of nuclei.
Extended Data Fig. 7 | Marker genes for the subclasses after integration in
the snATAC-seq data using the imputed gene expressions. Dotplot showing
the snATAC-seq gene activity scores of the marker genes (columns) used for
identification of the scRNA-seq data across the cell subclasses5. The first 13
columns correspond to major neuronal cell type marker genes including
neurotransmitter genes as follows: Snap25 (Neuron), Gad1 (GABA), Gad2 (GABA),
Slc32a1 (GABA), Slc17a6 (Glut-subcortical), Slc17a7 (Glut-cortical), Slc17a8 (Glut),
Slc6a5 (Gly-GABA), Slc6a4 (Glut-Sero), Slc6a3 (Dopa), Slc18a3 (Chol), Hdc (Hist),
Slc6a2 (Nora). The subsequent columns are the most occurring marker gene
reported within each Allen Institute subclass designation corresponding to
each subclass annotation (row) of the snATAC-seq data.
Extended Data Fig. 8 | Cellular composition of brain dissections for cell
subclasses. a, Bar plot shows the total number of nuclei sampled for each brain
dissection region. b, Normalized percentages (pct) of each subclass in all the
dissected regions are shown as different sized dots. The sizes of dots correspond
to the percentage and the colours of the dots indicate the brain dissections.
Extended Data Fig. 10 | Characterization of predicted cCRE-target gene pairs. a, Scatter plot showing the number of identified connections between all cCRE pairs within 500 kb along with the number of nuclei for each cell subclass identified based on the integration analysis. b, Scatter plot showing the number of proximal-distal cCREs along with the number of nuclei for each cell subclass. c, Histogram showing the distances along the genome for each pair of proximal-distal cCREs. d, Histogram showing the distances along the genome for each pair of enhancer and targeted gene’s promoter (positive proximal-distal cCREs) inferred by the correlation study (Fig. 3b). e, In total, 613,485 positively correlated proximal-distal cCREs and 107,413 negatively correlated proximal-distal cCREs were identified. f, Boxplot showing the identified potential enhancers for each of 20,703 genes in the positively correlated pairs. g, Boxplots of the enrichment scores (1 kb resolution) of aggregate peak analysis (APA) for the top 20% positive proximal-distal connections (ppdc) from several represented subclasses. Match, the subclass’s Hi-C data41 used for the same subclasses. Unmatch, the subclass’s Hi-C data used for other subclasses as a random background. 11 data points were included in the match group and 110 points in the unmatched groups. P value was calculated by the one-sided Wilcoxon rank sum test. In f and g, boxes span the first to third quartiles, horizontal line denotes the median, and whiskers show 1.5x (f) and 2x (g) the interquartile ranges. h, Heatmaps of enrichment signals for the top 10% global proximal-distal connections (pdc) and enrichment signals for the random pairs.
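The match versus unmatch comparison in panel g can be illustrated with a minimal sketch of a one-sided rank sum test. This is not the authors' code: the function name, the synthetic scores, and the use of a normal approximation without tie or continuity correction are all assumptions made here for illustration.

```python
import math
from statistics import NormalDist


def rank_sum_one_sided(x, y):
    """One-sided Wilcoxon rank sum (Mann-Whitney U) test of H1: x tends to exceed y.

    Assumption of this sketch: normal approximation, no tie or continuity
    correction. The paper does not state which variant was used.
    """
    n1, n2 = len(x), len(y)
    # U counts the pairs in which the x value exceeds the y value (ties count 0.5).
    u = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in x for b in y)
    mean_u = n1 * n2 / 2
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mean_u) / sd_u
    p_value = 1.0 - NormalDist().cdf(z)
    return u, p_value


# Hypothetical APA enrichment scores: 11 matched and 110 unmatched values,
# mirroring only the group sizes given in the legend.
match = [2.0 + 0.05 * i for i in range(11)]
unmatch = [1.0 + 0.005 * i for i in range(110)]
u, p = rank_sum_one_sided(match, unmatch)
print(f"U = {u:.0f}, one-sided P = {p:.3g}")
```

With only 11 points in the smaller group, an exact test may be preferable to the normal approximation; the approximation is used here only to keep the sketch dependency-free.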
Extended Data Fig. 11 | Inference of gene regulatory networks (GRNs) at cell subclass level across the whole mouse brain. a, Schematic of identifying co-accessible cCREs for each cell subclass using Cicero44. b, Schematic view of inference of GRNs from predicting the putative target genes’ expression with the corresponding transcription factors (TFs) for each cell subclass using CellOracle52. c, Boxplot of 267 P values from two-sided Kolmogorov-Smirnov test to check the power-law distributions of the nodes’ degrees from GRNs. Only one cell subclass (OB_Eomes_Ms4a15_Glut) did not pass this examination, with a P value smaller than 0.05. The box spans the first to third quartiles, the horizontal line denotes the median, and whiskers show 1.5x the interquartile range. d, 15 commonly used network motifs56 used in our analysis. Each node is a TF or a gene, and edges describe the regulation directions, i.e., arrows point to the nodes that are regulated by the source nodes or TFs. The blue colour means negative regulation (TFs inhibit target gene expression), while the orange colour means positive regulation (TFs upregulate target gene expression). PFL, positive-feedback loops; RDP, regulated double-positive; FC, fully connected triad; FFL, feedforward loops; SIM, single-input module. e, Stacked bar plots of the ratio of the network motifs above in each subclass. Each column corresponds to one cell subclass.
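As an illustration of how one of the motifs in panel d might be counted, the sketch below tallies feedforward loops (FFLs) in a directed network, using the standard definition in which x regulates y, y regulates z, and x also regulates z. The function name and the toy edge list are hypothetical, edge signs are ignored, and this is not the motif-counting procedure used in the paper.

```python
from collections import defaultdict


def count_feedforward_loops(edges):
    """Count ordered triples (x, y, z) with edges x->y, y->z and x->z.

    Self-loops are ignored. Edge signs (activation or repression) are not
    considered in this sketch.
    """
    targets = defaultdict(set)
    for src, dst in edges:
        if src != dst:
            targets[src].add(dst)
    count = 0
    for x, x_targets in targets.items():
        for y in x_targets:
            # z must be regulated by both x and y and differ from both.
            shared = (x_targets & targets.get(y, set())) - {x, y}
            count += len(shared)
    return count


# Hypothetical toy network: TF_A regulates TF_B and TF_D, and all three regulate gene_C.
toy_edges = [
    ("TF_A", "TF_B"), ("TF_B", "gene_C"), ("TF_A", "gene_C"),
    ("TF_A", "TF_D"), ("TF_D", "gene_C"),
]
print(count_feedforward_loops(toy_edges))
```

Counting ordered triples means a fully connected triad with edges in both directions contributes several FFLs, so a real analysis would need to decide whether nested motifs are counted separately.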
Extended Data Fig. 12 | Histograms of the counts of the network motifs in each subclass’s gene regulation network (GRN) grouped by main class (a) or regions (b). The names of the network motifs are the same as in Extended Data Fig. 11d. Only classes with at least 3 subclasses are shown here. For each histogram, we added the corresponding density plot. The telencephalon region includes isocortex, olfactory bulb, hippocampus, striatum, pallidum, and amygdala; the diencephalon region includes thalamus and hypothalamus; the hindbrain includes pons and medulla. c, Normalized signals of Atf3 ChIP-seq at Klf4 in bone marrow-derived macrophages (BMM) showing Klf4 is likely to be a putative target of Atf3. d, Normalized signals of Atf3 ChIP-seq at Tal1 in bone marrow-derived macrophages (BMM) showing Tal1 is likely to be a putative target of Atf3.
Extended Data Fig. 13 | Comparison of chromatin accessibility (CA) conserved and divergent cCREs between mouse and human. a, A schematic of CA-conserved and CA-divergent cCREs. The CA-conserved cCREs are the cCREs in our snATAC-seq data that are conserved across species and have open chromatin in orthologous regions. The CA-divergent cCREs are sequence conserved to orthologous regions but have not been identified as open chromatin regions in other species. The bar plot shows the numbers of CA-conserved and CA-divergent cCREs. b, Bar plot showing the relative fraction of CA-conserved and CA-divergent cCREs across subclasses. c, Radar chart showing the fraction of genomic distribution of CA-conserved and CA-divergent cCREs. The CA-conserved cCREs show an increase in percentage in Promoter-TSS regions. d, Histograms showing the number of CA-conserved and CA-divergent cCREs in subclasses. The number of CA-conserved cCREs is higher than that of CA-divergent cCREs. e, Histograms showing the CA-conserved cCREs captured by the number of cell subclasses. A fraction of CA-conserved cCREs are captured by more than 200 cell subclasses. f, Histograms showing the CA-divergent cCREs captured by the number of cell subclasses. Most CA-divergent cCREs are captured by fewer than 50 cell subclasses.
Extended Data Fig. 15 | Accessible variability at transposon elements (TEs) across cell subclasses. a, Density scatter plot comparing the averaged accessibility and coefficient of variation across cell subclasses at each transposon element. Variable TEs are defined on the upper right side of the dashed lines, invariable TEs are defined on the upper left of the dashed lines. b, Normalized accessibility at variable TEs in different cell subclasses. The middle bar plot shows correlation between mCG level and accessibility at variable TEs across subclasses. The right bar plot shows correlation between expression level and accessibility at variable TEs across subclasses. c, Top 10 motifs enriched in positively distal cCREs overlapped with variable TEs. The unadjusted P values were calculated using a two-sided Fisher’s exact test. d, Normalized accessibility at invariable TEs in different cell subclasses. The middle bar plot shows correlation between mCG level and accessibility at invariable TEs across subclasses. The right bar plot shows correlation between expression level and accessibility at invariable TEs across subclasses.
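The split into variable and invariable TEs in panel a can be sketched as a two-threshold rule on mean accessibility and coefficient of variation (CV). The sketch assumes that the vertical axis is mean accessibility and the horizontal axis is CV, so that "upper right" means accessible and variable while "upper left" means accessible and stable. The function name, both cutoffs, and the example values are hypothetical, since the legend does not give the positions of the dashed lines.

```python
from statistics import mean, pstdev


def classify_te(accessibility, min_mean, cv_cutoff):
    """Label one TE from its accessibility across cell subclasses.

    min_mean and cv_cutoff stand in for the two dashed lines of the scatter
    plot; their actual values are not given in the legend.
    """
    avg = mean(accessibility)
    # Coefficient of variation: standard deviation relative to the mean.
    cv = pstdev(accessibility) / avg if avg > 0 else 0.0
    if avg < min_mean:
        label = "low accessibility"
    elif cv >= cv_cutoff:
        label = "variable"
    else:
        label = "invariable"
    return avg, cv, label


# Hypothetical accessibility of two TEs across four cell subclasses.
print(classify_te([1.0, 1.0, 1.0, 1.0], min_mean=0.5, cv_cutoff=0.5))
print(classify_te([0.0, 0.0, 0.0, 4.0], min_mean=0.5, cv_cutoff=0.5))
```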
Extended Data Fig. 16 | Spearman correlation across orthologous cCREs between all paired human and mouse subclasses (mba: mouse brain atlas; hba:
human brain atlas).
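For readers unfamiliar with the statistic, the following is a minimal sketch of a Spearman correlation between one mouse and one human subclass, computed over a shared set of orthologous cCREs. The function names and the accessibility vectors are hypothetical, and the sketch does not reproduce how the orthologous cCREs were paired or normalized in the paper.

```python
import math


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank-transformed data."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical accessibility at five orthologous cCREs in one mouse (mba)
# and one human (hba) subclass.
mba_subclass = [0.2, 1.5, 0.9, 3.1, 0.0]
hba_subclass = [0.1, 1.1, 1.4, 2.8, 0.3]
print(round(spearman(mba_subclass, hba_subclass), 3))
```

Because the statistic depends only on ranks, it is insensitive to differences in scale between the two atlases, which is one reason it suits cross-species comparisons.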
nature portfolio | reporting summary
Corresponding author(s): Bing Ren
Last updated by author(s): Oct 30, 2023
Reporting Summary
Nature Portfolio wishes to improve the reproducibility of the work that we publish. This form provides structure for consistency and transparency
in reporting. For further information on Nature Portfolio policies, see our Editorial Policies and the Editorial Policy Checklist.
Statistics
For all statistical analyses, confirm that the following items are present in the figure legend, table legend, main text, or Methods section.
n/a Confirmed
The exact sample size (n) for each experimental group/condition, given as a discrete number and unit of measurement
A statement on whether measurements were taken from distinct samples or whether the same sample was measured repeatedly
The statistical test(s) used AND whether they are one- or two-sided
Only common tests should be described solely by name; describe more complex techniques in the Methods section.
For null hypothesis testing, the test statistic (e.g. F, t, r) with confidence intervals, effect sizes, degrees of freedom and P value noted
Give P values as exact values whenever suitable.
For Bayesian analysis, information on the choice of priors and Markov chain Monte Carlo settings
For hierarchical and complex designs, identification of the appropriate level for tests and full reporting of outcomes
Estimates of effect sizes (e.g. Cohen's d, Pearson's r), indicating how they were calculated
Our web collection on statistics for biologists contains articles on many of the points above.
Data analysis bwa (v0.7.17), HOMER (v4.11), BEDTools (v2.25.0), MACS2 (v2.1.2), GNU parallel (20220822),
GNU R (v4.3.1), ggplot2 (3.4.3), stringr (1.5.0), purrr (1.0.2), dplyr (1.1.3), Seurat v5,
Python (v3.10), SnapATAC2 (v2.4), Sklearn (v1.1.0), Cicero (v3.16), CellOracle (v0.15.0),
Sony SH800S software,
https://github.com/beyondpie/CEMBA_wmb_snATAC
For manuscripts utilizing custom algorithms or software that are central to the research but not yet described in published literature, software must be made available to editors and
reviewers. We strongly encourage code deposition in a community repository (e.g. GitHub). See the Nature Portfolio guidelines for submitting code & software for further information.
April 2023
Data
Policy information about availability of data
All manuscripts must include a data availability statement. This statement should provide the following information, where applicable:
- Accession codes, unique identifiers, or web links for publicly available datasets
- A description of any restrictions on data availability
- For clinical datasets or third party data, please ensure that the statement adheres to our policy
Demultiplexed FASTQ files are available at the NEMO archive (NEMO, RRID: SCR_016152) at https://assets.nemoarchive.org/dat-bej4ymm (the raw directory under the source data URL in this archive), and at the NCBI under GEO accession number GSE246791. Processed data are available at our web portal (http://www.catlas.org) and under the same GEO accession number.
Recruitment N/A
Note that full information on the approval of the study protocol must also be provided in the manuscript.
Field-specific reporting
Please select the one below that is the best fit for your research. If you are not sure, read the appropriate sections before making your selection.
Life sciences Behavioural & social sciences Ecological, evolutionary & environmental sciences
For a reference copy of the document with all sections, see nature.com/documents/nr-reporting-summary-flat.pdf
Replication Experiments were performed with 2 biological replicates for each of 117 dissection regions. All the replicates were successfully collected.
Randomization There was no randomization of the samples. For each dissection region, dissected brain tissues were pooled from 2-31 animals of the same sex (only 2 dissections from the mouse cerebellum region had 2 animals for snATAC-seq library construction; all the other samples had 4-31 animals). The 117 dissection regions were designed before the experiments to analyze the whole mouse brain comprehensively.
Blinding Investigators were not blinded to the specimen being investigated based on our experimental design above.
We require information from authors about some types of materials, experimental systems and methods used in many studies. Here, indicate whether each material,
system or method listed is relevant to your study. If you are not sure if a list item applies to your research, read the appropriate section before selecting a response.
Materials & experimental systems Methods
Laboratory animals Adult (P56) C57BL/6J male mice were purchased from Jackson Laboratories at seven weeks of age and maintained in the Salk animal barrier facility under a 12-h light/12-h dark cycle in a temperature-controlled room with ad libitum access to water and food until euthanasia. The temperature in the animal facility was maintained within the range of 20 to 22.2 °C, while the humidity levels varied between 35 and 60%.
Ethics oversight All experimental procedures using live animals were approved by the SALK Institute Animal Care and Use Committee under protocol
number 18-00006.
Note that full information on the approval of the study protocol must also be provided in the manuscript.
Plants
Seed stocks N/A
Authentication N/A
Flow Cytometry
Plots
Confirm that:
The axis labels state the marker and fluorochrome used (e.g. CD4-FITC).
The axis scales are clearly visible. Include numbers along axes only for bottom left plot of group (a 'group' is an analysis of identical markers).
All plots are contour plots with outliers or pseudocolor plots.
A numerical value for number of cells or percentage (with statistics) is provided.
Methodology
Sample preparation Nuclei were stained with DRAQ7 (#7406, Cell Signaling)
Cell population abundance Cell populations within each sample were determined using snATAC-seq as described in the manuscript. See Methods and
Gating strategy Potential nuclei were first identified using FSC-Area and BSC-Area. Next, doublets were removed based on BSC and FSC signal width. DRAQ7-positive nuclei with 2n count were sorted.
Tick this box to confirm that a figure exemplifying the gating strategy is provided in the Supplementary Information.