Towards a clinically-based common coordinate framework for the human gut cell atlas: the gut models

Burger, Albert; Baldock, Richard A.; Adams, David J.; Din, Shahida; Papatheodorou, Irene; Glinka, Michael; Hill, Bill; Houghton, Derek; Sharghi, Mehran; Wicks, Michael; Arends, Mark J.

doi:10.1186/s12911-023-02111-9

Research article
Open access
Published: 15 February 2023

BMC Medical Informatics and Decision Making volume 23, Article number: 36 (2023)

Abstract

Background

The Human Cell Atlas resource will deliver single cell transcriptome data spatially organised in terms of gross anatomy and tissue location, together with images of cellular histology. This will enable the application of bioinformatics analysis, machine learning and data mining, revealing an atlas of cell types, sub-types, varying states and ultimately cellular changes related to disease conditions. To further develop the understanding of specific pathological and histopathological phenotypes with their spatial relationships and dependencies, a more sophisticated spatial descriptive framework is required to enable integration and analysis in spatial terms.

Methods

We describe a conceptual coordinate model for the Gut Cell Atlas (small and large intestines). Here, we focus on a Gut Linear Model (1-dimensional representation based on the centreline of the gut) that represents the location semantics as typically used by clinicians and pathologists when describing location in the gut. This knowledge representation is based on a set of standardised gut anatomy ontology terms describing regions in situ, such as ileum or transverse colon, and landmarks, such as ileo-caecal valve or hepatic flexure, together with relative or absolute distance measures. We show how locations in the 1D model can be mapped to and from points and regions in both a 2D model and 3D models, such as a patient's CT scan where the gut has been segmented.

Results

The outputs of this work include 1D, 2D and 3D models of the human gut, delivered through publicly accessible JSON and image files. We also illustrate the mappings between models using a demonstrator tool that allows the user to explore the anatomical space of the gut. All data and software are fully open-source and available online.

Conclusions

Small and large intestines have a natural “gut coordinate” system best represented as a 1D centreline through the gut tube, reflecting functional differences. Such a 1D centreline model with landmarks, visualised using viewer software, allows interoperable translation to both a 2D anatomogram model and multiple 3D models of the intestines. This permits users to accurately locate samples for data comparison.

Background

Following a short preamble introducing the Human Cell Atlas endeavour, the main objective of this background section is to provide the reader with the biomedical context of our work. Specifically, we begin with brief introductions to (1) human gut anatomy, the focal point of our models, (2) Inflammatory Bowel Disease, a primary medical concern that should ultimately benefit from this research, (3) clinical investigations, which underpin the nature of our models, and (4) single-cell RNA sequencing technology, which is the main development pushing biomedical atlasing work down to the cellular level. Building on this basis, we set out the general case for so-called Common Coordinate Frameworks and the specific objectives of our work.

Introduction to human cell atlas

The mission of the Human Cell Atlas (HCA) programme [1] is “To create comprehensive reference maps of all human cells—the fundamental units of life—as a basis for both understanding human health and diagnosing, monitoring, and treating disease” [2]. The human body is a complex amalgamation of cells organised into tissue, organs and systems which can be studied in health and disease states. The ability to study complex organisms at the most basic cellular level has generated vast quantities of molecular data necessitating suitable data capture and modelling platforms to support the interpretation of the data. The ability to visualise and map cell to tissue and tissue to organ data will allow in future for a more comprehensive understanding of changes related to health and pathological conditions.

Human gut anatomy

The gastrointestinal tract can be represented as a long cylindrical tube from oesophagus through stomach, small intestines and large intestines to the anal canal, terminating at the anus. The main function of the gut is to digest and absorb nutrients and to excrete waste products. It also has essential roles in endocrine, immune and barrier function, delicately balancing the symbiotic relationship with the microbiome and supporting continuous epithelial tissue renewal. Here, we focus on the small and large intestines, from gastro-duodenal junction to anus. These gut components have internationally standardised gut anatomy ontology terms that describe the various regions, such as duodenum, jejunum and ileum of the small intestines, and caecum, ascending colon, transverse colon, descending colon, sigmoid colon, rectum and anal canal of the large intestines. Some, but not all, of the junctions between these component regions are marked by established landmarks, such as the ileo-caecal valve, hepatic flexure, splenic flexure and anorectal junction. These gut regions with landmarks, together with consensus average length measurements, can be used to generate a 1-dimensional map or model of the gut that allows normal or disease samples to be located more precisely.

Inflammatory bowel disease

Mapping of disease location accurately within the gut is important for Inflammatory Bowel Diseases (IBD). These are chronic inflammatory conditions of the gastrointestinal tract with an increasing incidence worldwide [3]. The underlying inflammation is postulated to be secondary to the interactions between the microbiome, an activated immune system and mucosal barrier dysfunction in genetically susceptible individuals. The increase in incidence has been linked to adoption of a westernised diet with ultra-processed food [4, 5] and medications such as proton pump inhibitors [6, 7]. There are two main types of IBD: Ulcerative Colitis and Crohn’s Disease (CD). Ulcerative Colitis affects the large bowel, often starting in the rectum and progressing proximally, resulting in abdominal pain and a change in bowel function. Crohn’s Disease is the more complex disorder affecting any part of the gastrointestinal tract from mouth to anus, with distinct disease manifestations associated with the specific region of the affected gut [8].

Clinical investigations

IBD is diagnosed by standard methods including clinical assessment and radiological, endoscopic and histological evaluation. The Lennard–Jones criteria are considered the gold standard for confirming the diagnosis of Crohn’s Disease [9]. Endoscopic evaluation is used frequently to obtain tissue samples, which are analysed to confirm the pathognomonic changes of discontinuous transmural inflammation, often with a fissuring pattern of deep ulceration and fibrosis [10]. In addition to the fissuring ulcers, there is both acute and chronic inflammation with focal cryptitis, crypt destruction and granuloma formation in around 60% of CD cases. Longstanding inflammation may predispose to dysplasia, which in some patients may evolve to invasive adenocarcinoma. Some patients may develop fibrotic tissue resulting in narrowing or stricturing of the affected bowel, precipitating bowel obstruction. Subsequently, fibrotic tissue can be surgically excised, although the mechanism of fibrosis is poorly understood and other treatment strategies are less effective [11].

It is a challenge to accurately map these changes to their correct positions in a three-dimensional model to illustrate the distribution within the gastrointestinal tract. If the patient has undergone surgery and there is associated radiological imaging, then location can be determined reasonably straightforwardly. During endoscopic procedures (and many surgical resections), the endoscopist (or surgeon or pathologist) usually describes the small or large intestinal region (e.g. ileum, ascending colon, etc.) involved by a lesion and sometimes provides a distance (in cm) from the anus using the distance markings on the endoscope surface (see Fig. 1), or for surgical resection specimens, a distance of the lesion from a landmark such as the ileo-caecal valve or the resection margin of the specimen. Some change to clinical practice is required to capture these data routinely as distances from recognisable gut landmarks.

Single cell RNA sequence data analysis

Single-cell sequencing is a powerful technology for profiling the transcriptome of large numbers of individual cells (see [12, 13] and [14] for a recent review). The technique generates large amounts of data that require specialised computational and statistical analysis. Generally, single cells are isolated into wells of a plate or into droplets, such that transcripts from each cell can be barcoded or tagged (marking them with a unique molecular identifier; UMI), allowing the expression profile of the cell to be ascertained after RNA-sequencing, which is normally performed on pools of transcriptomes from many cells. The major variables in all single cell sequencing experiments are the number of cells profiled and the depth of sequence generated for each cell. The initial steps of single cell sequencing data quality control include removing data associated with UMIs that are not well represented, which are often associated with cells that are dying or damaged, and then examining the proportion of multi-mapping, un-mappable and mitochondrial reads for each cell, the frequency of which tends to correlate with poor data quality. Since the aim is to profile the transcriptome at single cell resolution, empty droplets, cell-free RNA and doublet-cells are removed using software such as EmptyDrops, SoupX and DoubletFinder, respectively. Following these steps, the data are normalised to account for differences in sequencing depth and, where appropriate, batch correction is applied to account for non-biological factors such as time of sample collection. Further data processing steps can involve data smoothing and imputation, cell cycle analysis, and unsupervised clustering as a prelude to dimensionality reduction and data visualisation, which can be performed using approaches such as PCA, t-SNE and UMAP. Where differential expression analysis is a key aim, various methods have been developed, including MAST and MetaCell.

The field of single cell sequence analysis is rapidly evolving, with many robust and elegant approaches allowing data exploration, and there is a requirement for integration of such data with histological, radiological, clinical disease metadata and other data using a common coordinate framework approach. In addition, mapping the locations of the source tissue samples within the gut context will allow the discovery and analysis of the gradients of variation along the proximal–distal gut axis and yield new understanding of gut biology. Without a mechanism for capturing gut location, this aspect of gut biology will remain undiscovered.

Towards a human gut cell atlas common coordinate framework

The primary aim of the Human Gut Cell Atlas (HGCA) is to capture a detailed atlas of tissue and single cell data in the spatial context of the adult human gut. Whether for the clinical care of individual patients or for more general research studies of the gut, a wide range of data is now collected and stored by hospitals and research institutions, from patient-specific information, histopathological images and radiological images to single cell sequencing data of gut cells. Data integration is a key prerequisite for applying AI techniques, in particular Machine Learning, to derive medically useful knowledge from these large, distributed data sources and so reveal the spatial organisation of the underlying molecular and cellular processes in normal and diseased samples. One of the primary integration criteria is the anatomical origin of the tissues and cells to which the collected data refer. In the context of the Human Cell Atlas, such anatomical locations are to be recorded using a computational framework called the Common Coordinate Framework (CCF) [15].

The number and types of use cases for a Human Gut Cell Atlas, and therefore the requirements for its CCF, are large and varied. A balance has to be struck between catering for all eventualities and a simplicity that makes the use of the CCF practical. In this paper, we describe a CCF for the human gut that is based on clinical practice and has at its core an easily understandable 1D gut model, but extends to complex 2D and 3D representations. A mechanism for capturing proximal–distal gut location is critical to enable not just an atlas of cell-types as revealed by scRNA-seq but also the gradients of change of the cell-types, sub-types and cell states, as well as cell populations along the gut axis, and how these link to gut anatomy in both health and disease. Once we have introduced our own models for a Human Gut Cell Atlas CCF in the Methods and Results sections, we provide further details on related frameworks in the Discussion.

Objectives

The primary objective of the Human Gut Cell Atlas programme is to enable integration of all data-types to deliver a research and analysis capability that supports scientific discovery and clinical benefit in the context of the gut and related tissue diseases and pathological abnormalities. Our objective with this work is to deliver a practical CCF for the gut that makes this possible. For that we have developed a conceptual model of the gut based on the natural coordinate of distance along the gut midline, with semantic extension to specific tissues and cells. In addition, we develop a mapping mechanism that allows cross comparison with 2D and 3D gut representations, including patient-specific data. A further aim is the interoperability of the proposed CCF with other similar efforts, to facilitate cross-CCF data integration.

In the Methods section we set the scientific context for our models in terms of the specific use case underpinning our work and then describe how the models were developed. The Results section presents the 1D, 2D and 3D models that we created and which form the core of the proposed CCF. In addition, a publicly accessible online tool illustrating the use and interaction of these models is presented. Related work is reviewed in the Discussion section, as are limitations as well as future prospects of the Gut CCF. The Conclusions summarise the primary contribution of the models and their implementations, as well as their importance and potential impact in the context of the Human Gut Cell Atlas endeavour.

Methods

Edinburgh–Cambridge Helmsley trust project HGCA CCF use-case

Although the exact CCF requirements across different projects will vary, the project described here includes many of the typical components for this kind of work, and thus facilitated the development of the gut models with common clinical and research practice in mind. For this project, Crohn’s disease lesion samples are collected from surgical resections. From the resection specimen, tissue slices are taken from various sample points, capturing both diseased and morphologically healthy (no visible pathological abnormality) tissue. The CCF must be able to capture the location in the gut from which the samples (either biopsies or blocks from a surgical resection) were taken and, for multiple tissue slices, the relative location of slices in terms of their sequence order as taken from the surgical resection specimen. Following slicing of a surgical resection specimen of gut, one or more parts of the slices form blocks of tissue for further processing, either dissociation of fresh tissue into single cells for single-cell transcriptome sequencing, or fixation for histological analysis. In the latter case, tissue blocks are fixed in buffered formalin and processed into paraffin, and sections are cut for staining, scanning and analysis. Both the source of the sections (in terms of their original blocks) and their relative ordering and adjacency in the blocks must be tracked. Histology and sequence data generated during analysis are annotated with relevant CCF location information to allow the integration of data based on the same precise location of tissue, and also to map across different samples from different patients. Datasets will be made available where possible in accordance with the appropriate legal framework or within a secure research environment [16]. A conceptual overview of the project is provided in Fig. 2.
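The tracking requirement described above (specimen, ordered slices, blocks, ordered sections) can be pictured as a small nested record. The following is only a sketch under our own assumptions: the class names, field names and identifiers are hypothetical and do not reflect the project's actual data model.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Section:
    index: int                       # cutting order within its block

@dataclass
class Block:
    block_id: str                    # hypothetical identifier
    use: str                         # "single-cell" or "histology"
    sections: List[Section] = field(default_factory=list)

@dataclass
class Slice:
    index: int                       # sequence order along the resection specimen
    blocks: List[Block] = field(default_factory=list)

@dataclass
class Specimen:
    specimen_id: str                 # hypothetical identifier
    gut_location: str                # where in the gut the resection came from
    slices: List[Slice] = field(default_factory=list)

    def section_paths(self):
        """Yield (slice index, block id, section index) in specimen order."""
        for s in sorted(self.slices, key=lambda s: s.index):
            for b in s.blocks:
                for sec in sorted(b.sections, key=lambda x: x.index):
                    yield (s.index, b.block_id, sec.index)

# Illustrative data only: slices and sections are entered out of order on purpose.
specimen = Specimen(
    specimen_id="resection-01",
    gut_location="terminal ileum",
    slices=[
        Slice(index=2, blocks=[Block("B3", "histology", [Section(2), Section(1)])]),
        Slice(index=1, blocks=[Block("B1", "single-cell"),
                               Block("B2", "histology", [Section(1)])]),
    ],
)
print(list(specimen.section_paths()))
```

Keeping the order indices explicit, rather than relying on list position, means that adjacency between slices and between sections can be recovered even if records arrive out of sequence.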

Regev et al. [1] state that “To be useful, an atlas must also be an abstraction, comprehensively representing certain features, while ignoring others.” What then are the appropriate abstractions for a Human Gut Cell Atlas? The answer to this question is guided, first, by what data can be reliably obtained to build the atlas and to map new data onto it (e.g. the location from which a resection specimen was obtained) and, secondly, by the questions we want to answer using these data. We start with a simple, clinically orientated, 1D abstraction, which in turn we extend to 2D and 3D models, including the mappings between them. These are complemented with a semantic layer of location descriptions. How we created these abstractions is discussed next, and specific details and parameters are provided in the Results section.

1D—core model

The primary abstraction of the gut, representing both the small and large intestines, is that of a tube connecting the stomach to the anus. Location is captured in terms of distance along the centreline of the tube to anatomical landmarks, as measured, for example, by the use of an endoscope during a colonoscopy.
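As a minimal sketch of this abstraction, a 1D location can be recorded as a named landmark plus a distance along the centreline. The class name, field names and the example reading below are illustrative assumptions and are not the schema of the published model files.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class LinearLocation:
    """A point on the gut centreline relative to a landmark (hypothetical schema)."""
    landmark: str        # e.g. "anus" or "ileo-caecal valve"
    distance_cm: float   # distance along the centreline from the landmark
    direction: str       # "proximal" (towards the mouth) or "distal" (towards the anus)

    def __post_init__(self):
        if self.distance_cm < 0:
            raise ValueError("distance must be non-negative")
        if self.direction not in ("proximal", "distal"):
            raise ValueError("direction must be 'proximal' or 'distal'")

    def describe(self) -> str:
        return f"{self.distance_cm:g} cm {self.direction} to the {self.landmark}"

# Example: an endoscope marking of 25 cm from the anus (value chosen for illustration).
loc = LinearLocation(landmark="anus", distance_cm=25.0, direction="proximal")
print(loc.describe())
```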

2D—anatomograms

Anatomograms have been developed at the European Bioinformatics Institute (EBI) within the Single Cell Expression Atlas (SCEA) programme as 2D graphical representations of certain organs, tissues and cellular assemblies for the purpose of presenting a pictorial overview of transcriptome data and potentially as a graphical interface for data query [17]. Here we have taken the gut 2D anatomogram image and created image domains (regions) that were drawn within the anatomogram for the anatomy of the large and small intestines. These domains were then segmented into sub-domains corresponding to the regions delineated in the anatomogram, e.g. anus, anal canal, descending colon, splenic flexure, etc. Where the anatomogram depicted distant parts of the gut domains as overlapping or touching, e.g. as the small intestine passed behind the transverse colon, then cut-domains were created to preserve the appropriate connectivity for the intestines.

Midline-paths were computed from the anus to the tip of the appendix for the large intestine and from the ileocaecal valve—ileum (ICVi) to the gastro-duodenum junction for the small intestine. A propagation algorithm in which possible image locations are considered in priority order was used to compute initial midline paths. Image location priority was determined by a combination of the distance to the path endpoint and from the domain boundary. From the ordered set of path image locations found along each path a smooth B-spline curve was then computed as the primary path representation. All image processing was done using Woolz [18].
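To illustrate the idea of priority-ordered propagation, the sketch below runs a best-first search over a pixel domain, preferring pixels that are close to the endpoint and far from the domain boundary. It is not the Woolz implementation: the exact priority function is not given in the text, so the formula, the `weight` parameter and the function names are assumptions, and the final B-spline fit is omitted.

```python
import heapq
from collections import deque

NEIGHBOURS = ((1, 0), (-1, 0), (0, 1), (0, -1))

def boundary_distance(domain):
    """4-connected step distance from each domain pixel to the nearest outside pixel."""
    dist = {}
    queue = deque()
    for (r, c) in domain:
        if any((r + dr, c + dc) not in domain for dr, dc in NEIGHBOURS):
            dist[(r, c)] = 1
            queue.append((r, c))
    while queue:
        r, c = queue.popleft()
        for dr, dc in NEIGHBOURS:
            p = (r + dr, c + dc)
            if p in domain and p not in dist:
                dist[p] = dist[(r, c)] + 1
                queue.append(p)
    return dist

def midline_path(domain, start, end, weight=2.0):
    """Best-first propagation from start to end; lower priority value is visited first."""
    depth = boundary_distance(domain)

    def priority(p):
        to_end = abs(p[0] - end[0]) + abs(p[1] - end[1])
        return to_end - weight * depth[p]   # assumed combination of the two criteria

    parent = {start: None}
    heap = [(priority(start), start)]
    while heap:
        _, p = heapq.heappop(heap)
        if p == end:
            break
        for dr, dc in NEIGHBOURS:
            q = (p[0] + dr, p[1] + dc)
            if q in domain and q not in parent:
                parent[q] = p
                heapq.heappush(heap, (priority(q), q))
    else:
        raise ValueError("end is not reachable from start within the domain")

    path, node = [], end
    while node is not None:
        path.append(node)
        node = parent[node]
    return path[::-1]

# Toy domain: a straight tube three pixels wide and seven pixels long.
tube = {(r, c) for r in range(3) for c in range(7)}
path = midline_path(tube, start=(1, 0), end=(1, 6))
print(path)
```

On this toy tube the path stays on the central row. An ordered list of pixel locations like this is the kind of input from which a smooth curve could then be fitted.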

3D—radiological-image based models

3D models of the human gut (limited to the large intestine and ileum of the small intestine) were computed from anonymised CT images. Two models have been built, one from an image in which the colon had been inflated and a second from an image in which the colon was non-inflated. In both cases the image domain of the large intestine and all or part of the ileum was segmented from the 3D CT image. For the inflated colon the domain was computed by using threshold-based region growing and morphological operations, with the region growing seed locations entered manually. ITK-SNAP [19] was used for region growing and Woolz was used for all other image processing operations.

For the non-inflated colon, threshold-based segmentation could not be used because of the wide variation in image values and textures throughout the colonic region, so a pre-segmentation image classification was performed using a machine learning approach based on the fully convolutional neural network described by Long et al. [20] and implemented as “U-Net” by Ronneberger et al. [21]. To train the convolutional neural network, a small number of virtual sections were cut through the 3D image with a range of sectioning parameters, including position and 3D orientation. These were manually segmented using MAPaint [22], an interactive drawing application for segmenting 3D image data. The segmented section images (2D) were then used to train a U-Net classifier.

To reduce the manual segmentation effort, the number of segmented images was augmented using a combination of affine and non-affine transforms. The trained network was then used to generate a 2D colon classification image for all planes parallel to a virtual section of the original 3D image, resulting in a full 3D classification image. The prediction was repeated for 36 sets of virtual sectioning parameters and the resulting 3D classification images were averaged to give a single such image. The classification image was then segmented using region growing and morphological operations in a similar manner to that used for the inflated model. The U-Net was built using PyTorch [23]; all other image processing was executed using Woolz. With the large intestine and ileum domains segmented from the 3D images, paths through them were computed using the same approach as for the 2D anatomogram.
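The averaging step can be pictured with a toy example. The sketch below takes several classification images of identical size, flattened to lists of per-voxel values, and computes their voxel-wise mean. The function name and the 0.5 cut-off used to derive a candidate mask are assumptions for illustration; the text specifies region growing and morphological operations for the final segmentation, which are not reproduced here.

```python
def average_classifications(volumes):
    """Voxel-wise mean of equally sized classification images (flattened to lists)."""
    if not volumes:
        raise ValueError("need at least one classification image")
    n_voxels = len(volumes[0])
    if any(len(v) != n_voxels for v in volumes):
        raise ValueError("all classification images must have the same size")
    return [sum(v[i] for v in volumes) / len(volumes) for i in range(n_voxels)]

# Three toy predictions for a four-voxel image (the paper averages 36 full 3D images).
preds = [
    [1.0, 0.0, 1.0, 0.0],
    [1.0, 0.0, 0.0, 0.0],
    [1.0, 1.0, 1.0, 0.0],
]
mean_img = average_classifications(preds)

# Assumed cut-off, used here only to show how agreement between predictions
# could feed a later segmentation step.
candidate_mask = [value >= 0.5 for value in mean_img]
print(mean_img, candidate_mask)
```

Averaging over differently oriented sectioning planes lets voxels that are classified consistently stand out from those flagged in only a few orientations.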

Model–model mapping transforms

Each of the 1D, 2D and 3D models represents a spatial context in which data locations can be visualised and queried. It is critical for spatial query and analysis that a location in one model can be mapped to any other, so that the spatial frameworks are interoperable and data can be cross compared. For this, the 1D linear model was mapped onto the 2D and 3D midline paths computed through the anatomogram and 3D image models respectively. Actual distances along each path are model dependent, so a piecewise-linear mapping approach was adopted as an initial or base-level cross-mapping. On each path within each of the 1D, 2D and 3D models the landmarks defined in Fig. 3 are marked. These are indicated within the anatomogram (Fig. 4) and the 3D models (Fig. 5) with marker “flags” and a change in colour of the visualised intestine segment. A location within a model is defined by the proportional distance along the midline path between the closest proximal (towards the mouth) and closest distal (towards the anus) landmarks. This simple definition allows locations and any data associated with them to be mapped between 1D, 2D and 3D models of the large and small intestines. This base-level mapping of locations between two landmarks without additional information is linear; however, it can be enhanced to a non-linear mapping to better reflect the anatomical structure as more detailed knowledge is acquired. Locations away from the midline path (but within the gut region) are mapped to the closest midline point of the same region. For efficiency this may be precomputed.

This mapping mechanism allows data from other coordinate frameworks to be mapped to these models. It also allows data from a specific patient to be mapped either through distances to landmarks noted during sample collection or retrospectively using pre-surgery 3D image data.
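The base-level mapping described above can be sketched as follows. Each model is represented only by the positions of its landmarks along its own midline path, listed in the same order and strictly increasing. A position is converted to a segment index and a proportional distance within that segment, then re-expressed in the target model. The function names and all landmark values are hypothetical and chosen for illustration.

```python
from bisect import bisect_right

def to_proportional(landmarks, position):
    """Return (segment index, fraction in [0, 1]) for a position along a path."""
    if not landmarks[0] <= position <= landmarks[-1]:
        raise ValueError("position lies outside the landmarked path")
    # Clamp so that the final landmark belongs to the last segment.
    i = min(bisect_right(landmarks, position) - 1, len(landmarks) - 2)
    lo, hi = landmarks[i], landmarks[i + 1]
    return i, (position - lo) / (hi - lo)

def from_proportional(landmarks, segment, fraction):
    """Inverse of to_proportional for a (possibly different) model."""
    lo, hi = landmarks[segment], landmarks[segment + 1]
    return lo + fraction * (hi - lo)

def map_position(src_landmarks, dst_landmarks, position):
    """Piecewise-linear mapping of a path position from one model to another."""
    segment, fraction = to_proportional(src_landmarks, position)
    return from_proportional(dst_landmarks, segment, fraction)

# Hypothetical landmark positions for the same four landmarks in two models.
model_1d_cm = [0.0, 4.0, 16.0, 56.0]       # distance along the 1D model
model_2d_px = [0.0, 20.0, 50.0, 250.0]     # arc length along the 2D midline path

print(map_position(model_1d_cm, model_2d_px, 10.0))   # halfway between landmarks 2 and 3
```

Because only the fraction within a segment is carried across, the same two helper functions serve every pair of models, and a non-linear refinement could later replace the linear interpolation inside `from_proportional` without changing the interface.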

Semantic extension

The descriptions of locations in the gut so far have focused on distances along its tubular structure from the anus to the caecum and appendix (for the large intestine) and from the ileocaecal valve proximally (for the small intestine), but they do not provide a mechanism to capture more detailed cell location information with respect to the layers of the gastrointestinal wall at the given position, i.e. the mucosa, submucosa, muscularis propria and serosa. We therefore semantically complement the locations by combining them with the relevant ontological concepts representing these layers, tissue types and cell-types. More specifically, the corresponding standard anatomy terms are those that were agreed as part of the HuBMAP project as a series of Anatomy Structures, Cell Types and Biomarkers (ASCT+B) tables [24, 25], which include links to matching terms in UBERON [26], the Foundational Model of Anatomy (FMA) [27] and the Cell Ontology (CL) [28]. Hence, a typical description might specify the origin of a tissue sample as coming from the mucosa halfway along the transverse colon, and scRNAseq data could include further specification of cell-type, e.g. goblet cell. Future work will also address the representation of location in terms of villus versus crypt (for small intestine) and left versus right (for large intestine).
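The example in the paragraph above (mucosa, halfway along the transverse colon, goblet cell) could be recorded along the following lines. This is a sketch only: the class and field names are hypothetical, the convention that the fraction runs from the distal to the proximal end of the region is our assumption, and ontology identifiers are deliberately left unset rather than guessed.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class OntologyTerm:
    label: str
    term_id: Optional[str] = None   # e.g. an UBERON, FMA or CL identifier; unset here

@dataclass(frozen=True)
class SampleLocation:
    region: OntologyTerm
    fraction_along_region: float    # assumed: 0.0 = distal end, 1.0 = proximal end
    wall_layer: Optional[OntologyTerm] = None
    cell_type: Optional[OntologyTerm] = None

    def __post_init__(self):
        if not 0.0 <= self.fraction_along_region <= 1.0:
            raise ValueError("fraction must lie between 0 and 1")

    def describe(self) -> str:
        parts = [f"{self.fraction_along_region:.0%} along the {self.region.label}"]
        if self.wall_layer:
            parts.insert(0, self.wall_layer.label)
        if self.cell_type:
            parts.insert(0, self.cell_type.label)
        return ", ".join(parts)

sample = SampleLocation(
    region=OntologyTerm("transverse colon"),
    fraction_along_region=0.5,
    wall_layer=OntologyTerm("mucosa"),
    cell_type=OntologyTerm("goblet cell"),
)
print(sample.describe())
```

Keeping the wall layer and cell type optional reflects the fact that a histology block may only be annotated to region and layer, while scRNAseq data can add the cell-type term.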