
Title.

Biodiversity data standards for the organization and dissemination of complex

research projects and digital twins: a guide

Authors. Carrie Andrew1, Sharif Islam2, Claus Weiland3, Dag Endresen1

Institutions. 1Natural History Museum, University of Oslo, Sars’ gate 1, 0562 Oslo,

Norway; 2 Naturalis Biodiversity Center, Darwinweg 2, 2333 CR Leiden, Netherlands; 3

Senckenberg – Leibniz Institution for Biodiversity and Earth System Research,

Senckenberganlage 25, 60325 Frankfurt, Germany.

ORCIDs. CA: https://orcid.org/0000-0002-0524-8334; SI: https://orcid.org/0000-0001-

8050-0299; CW: https://orcid.org/0000-0003-0351-6523; DE: https://orcid.org/0000-

0002-2352-5497.

Emails. CA: [email protected]; SI: [email protected]; CW:

[email protected]; DE: [email protected].

Key words. Biodiversity data; BioDT; Centralized data; Data standards; Data

integration; Digital twin; Distributed collaboration; Network architecture design;

Ontology; Thesaurus; Vocabulary

Abstract. Biodiversity data are substantially increasing, spurred by technological

advances and community (citizen) science initiatives. Integrating data is, likewise,

becoming more commonplace. Open science promotes open sharing and data usage.

Data standardization is an instrument for the organization and integration of biodiversity

data, which is required for complex research projects and digital twins. However, just
like with an actual instrument, there is a learning curve to understanding the data

standards field. Here we provide a guide, for data providers and data users, on the

logistics of compiling and utilizing biodiversity data. We emphasize data standards,

because they are integral to data integration. Three primary avenues for compiling

biodiversity data are compared, explaining the importance of research infrastructures for

coordinated long-term data aggregation. We exemplify the Biodiversity Digital Twin

(BioDT) as a case study. Four approaches to data standardization are presented in

terms of the balance between practical constraints and the advancement of the data

standards field. We aim for this paper to guide and raise awareness of the existing

issues related to data standardization, and especially how data standards are key to

data interoperability, i.e., machine accessibility. The future is promising for

computational biodiversity advancements, such as with the BioDT project, but it rests

upon the shoulders of machine actionability and readability, and that requires data

standards for computational communication.

1. Introduction. The need to organize, integrate, disseminate and utilize

biodiversity data.

Biodiversity data are being generated at an exceedingly rapid rate, propelled by novel

techniques (e.g., eDNA, drones, satellites, camera traps, acoustic monitors, text

extraction, AI) and community (citizen) science initiatives to gather information at levels

never before achieved. It is an exciting time for biodiversity research (Heberling et al.

2021), as the greater extent of data brings unanswered questions in ecology and

evolution ever closer within reach. While the data can help address such questions, they

rarely do so instantaneously. Instead, biodiversity data are a massively increasing resource. They

must be organized, and then disseminated, in order to be effectively utilized. There are

challenges to integrating biodiversity data, as we describe here.

We explain, step-by-step, the logistics of compiling and utilizing biodiversity data, when

considered as a whole across independent data collection events. Our target audiences

are biodiversity data providers and biodiversity data users, for whom we find that such a

guideline resource is currently lacking, yet would prove helpful. We aim

for readers to find this guide useful as a general reference for understanding integrated

biodiversity data, with our final recommendations aimed towards research

infrastructures. We begin with the organization of biodiversity data for dissemination,

and emphasize data standards, because they are how data become integrated. As will

be exemplified, no matter the solution, new challenges will - and do - arise with

compiling and integrating “big data”. We then utilize the Biodiversity Digital Twin (de

Koning et al. 2023, Trantas et al. 2023, Golivets et al. 2024) as a data standards case

study, because it exemplifies the varied states of data standards that most biodiversity

research projects are involved with, and the needs to support technological advances

alongside data mobilization within data infrastructures. We illustrate the processes that

are involved to simultaneously advance multiple research projects, each with different

data sources and modeling outcomes, to function as one singular biodiversity resource,

i.e., a digital twin.

There are two assumptions that we begin with. First, that biodiversity data, across

independent research projects, should be made Findable, Accessible, Interoperable and

Reusable, i.e., FAIR (Wilkinson et al. 2016) - and not relegated to a trash bin, whether

real or digital. Second, that data should be deposited in integrated databases and data

infrastructures, instead of private initiatives, to help establish their reusability and

preserve their longevity.

2. From fractionated datasets to unified resources.

Whether by researchers or community scientists, and whether from a single project or a

global initiative, people and machines are creating vast amounts of biodiversity data - in

parallel independent projects. This is a challenge, as biodiversity data typically originate

from multiple, fractionated sources, and often from dissimilar methodologies and

measures (Gadelha et al. 2020). Data also vary in their degrees of FAIRness, which

impacts their accessibility for people, as well as their machine interoperability and

readability (Wilkinson et al. 2016). Computationally, software agents are able to

autonomously interpret and process FAIR data, meaning that they are machine-

actionable (Jacobsen et al. 2019; Section 3).
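As a minimal sketch of machine-actionability (our illustration, not part of any cited workflow; the DOI is hypothetical, and we assume the identifier registrar supports JSON-LD content negotiation, as DataCite does), a software agent can resolve a dataset identifier and request machine-readable metadata rather than a human-readable web page:

```python
import requests

# A software agent resolves a dataset DOI and negotiates for machine-readable
# (schema.org JSON-LD) metadata instead of an HTML landing page.
doi = "10.0000/example-dataset"  # hypothetical DOI, for illustration only
response = requests.get(
    f"https://doi.org/{doi}",
    headers={"Accept": "application/ld+json"},  # HTTP content negotiation
    timeout=30,
)
if response.ok:
    metadata = response.json()
    # FAIR, machine-actionable metadata lets the agent proceed without human
    # interpretation, e.g., locating the licence and title by key.
    print(metadata.get("name"), metadata.get("license"))
```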

There are three common ways to find data resources and to deal with data integration:

Either a) a list of available data must be created, which the community can access for

independent integration, or data must be deposited within established b) databases or

c) data infrastructures for community access of integrated data (Figure 1). No solution is

ideal, but, as we will explain, shared or interoperable data standards make data fusion

significantly easier.

Creating a list of available data sources is the only option that does not directly work

with the data, but it does inventory the data that are available. It is, therefore,

most useful when data sources still need to be compiled. Take, for example, global

conservation initiatives. Inadequate data are a source of frustration that impedes

conservation policy and management (Stephenson et al. 2022). One solution has been

to compile a list of existing conservation data sources (e.g., Stephenson & Stengel

2020). People can independently access and integrate the data from the list, depending

on their own objectives. A list spanning all biodiversity topics is impractical, as previous

attempts have shown (Blair et al. 2020), but lists can function successfully for

subtopics, e.g., conservation (Stephenson & Stengel 2020).

However, lists that have been compiled require consistent updating to remain actively

usable, as new datasets may become available and/or existing datasets may become

degraded, especially if not made FAIR (e.g., due to nonfunctional weblinks). Lists

may also result in a lot of duplicated effort, since data integration will occur

independently between users, and without any of them necessarily openly sharing their

harmonized data. It is important to note that the lack of sharing may be due to data

protection rights of at least one of the original data sources. In those cases, the data will

always need to be individually accessed, irrespective of any data integration and

sharing initiatives. It is also not always possible to integrate all of the data listed if the

contents do not match. There are, thus, reasons for lists of data sources in certain situations.

Applying a community-driven approach to curating and maintaining any list could extend

its effective lifespan, as many people would be able to help contribute to and update it

(Blair et al. 2020).

It may be more beneficial to, instead, create a database that researchers can submit

their data to (Figure 1). This may occur independently through the creation of a single,

accessible database. For example, often such a database, begun by a select group of

researchers, is first published in a peer-reviewed journal and then opened up for further

data accessions to it, as well as use of the data. When accepted by the general

scientific community, this is an effective bottom-up approach to unify biodiversity data.

For example, this approach has proven popular for traits-based data; both the TRY

(Plant trait database; Kattge et al. 2011) and the FRED (Fine-Root Ecology Database,

Iversen et al. 2017) databases are examples of such initiatives. However, by being

kickstarted via independent researchers and/or within specific institutions, the continual

requirements for maintenance of any database may, as with lists, be difficult to

perpetuate across longer time scales. Dependency on project funding limits the duration, as

well as the benefits, of the aggregated data to the initiative’s lifetime. Community-driven

approaches again have the potential to maintain databases longer than individual

people can (Blair et al. 2020). For example, both the TRY and FRED databases can

now be found as part of the Open Traits Network (Gallagher et al. 2020), a community-

driven organismal traits database initiative that has unified earlier initiatives into one.

Some currently hope that the management of both lists and databases can

be sustained by willing community members (e.g., Blair et al. 2020, Gallagher et al.

2020). However, how durable such distributed collaboration initiatives can be depends

on a multitude of factors related to user demographics, degree of support by the hosting

institution, and, ultimately, still the need for database management by a select group of

people. Lists and databases require maintenance by groups of dedicated individuals,

and, in addition, they can be prone to errors from public contributors, if not checked or

limited. Community-driven approaches to lists and databases are, thus, not automatic

promises of success (Shaw & Hargittai 2018, O’Leary et al. 2020).

It may be more sustainable in the long-term to integrate data into a data infrastructure,

which is the third option (Guralnick et al. 2007). A larger network of support, and longer

funding duration of institutionalized, non-project-based data infrastructures, which

essentially are libraries for digitized data, should provide protection and access to the

data across many generations of scientists. One example is the Global Biodiversity

Information Facility, GBIF (Guralnick et al. 2007, Robertson et al. 2014, de Poorter et al.

2017). GBIF has functioned by building upon local data repositories to aggregate data

into their infrastructure. Dedicated individuals (nationally funded nodes staff and globally

funded secretariat staff) work to maintain and integrate the existing and newly deposited

data. Community-driven approaches are also applied, but as subcomponents for

data deposition that are ultimately managed by regulations implemented by GBIF employees

(Robertson et al. 2014) and set by the wider governing community, TDWG (Section 3).
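From a data user's perspective, the payoff of such coordinated aggregation is programmatic access to records that are already standardized. A minimal sketch, assuming the pygbif client for GBIF's public API (the species queried is only an example):

```python
from pygbif import occurrences

# Search GBIF occurrence records; returned fields follow Darwin Core, so
# "scientificName", "decimalLatitude", etc. mean the same across all datasets.
results = occurrences.search(scientificName="Amanita muscaria", limit=5)
for record in results["results"]:
    print(record.get("scientificName"), record.get("eventDate"),
          record.get("decimalLatitude"), record.get("decimalLongitude"))
```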

Data in a data infrastructure must fit into the protocols that have been established. It is a

result of what is referred to as a centralized and connected network architecture design

for a data infrastructure (Gallagher et al. 2020); the biodiversity data are connected into

a data infrastructure from multiple sources, but are limited to the formatting that a

central hub has designated, as the manager of the data. The connection to many

sources is not the issue with the network design, but the centralization aspect can be

seen as a governance issue, even though such constraints are often imposed due to

standardization requirements (discussed further in Sections 3 - 5). Another caveat of

data infrastructures is that no single infrastructure can capture all of biodiversity data

(De Pooter et al. 2017). And, finally, data infrastructures help streamline data deposition

and management, but they, ultimately, function the same as a database to many data

depositors and users. They may, thus, end up on a list alongside other individual

databases, for any given biodiversity data topic.

Given the need for multiple lists, databases and/or data infrastructures, and given the

complexity of biodiversity data, there is the potential for overlap. Databases originally

created independently may, with time, be deposited within a data infrastructure, perhaps

after the infrastructure was developed or gained popularity. Or databases may be combined into a

larger one (Gallagher et al. 2020). Data infrastructures pull in data from multiple

sources, despite repetition, although this is, to a degree, contingent on the taxonomic

systems utilized (Feng et al. 2022). Data can, thus, fit within multiple lists, databases

and/or data infrastructure categories, which does create data duplication. So long as

key information (e.g., a persistent identifier or combination of original source, date,

location, taxon, etc.) is passed along to the different resources, duplication is not likely

to be a significant issue to data users. Importantly, the key information must be

standardized in order to most easily discern any data repetitions between sources.
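A sketch of why standardized key information matters when the same record reaches a user through several resources (the records and identifier are invented; the column names follow Darwin Core):

```python
import pandas as pd

# Two aggregators redistributing the same original observation.
aggregated = pd.DataFrame([
    {"occurrenceID": "urn:example:1", "scientificName": "Picea abies",
     "eventDate": "2021-07-03", "source": "database A"},
    {"occurrenceID": "urn:example:1", "scientificName": "Picea abies",
     "eventDate": "2021-07-03", "source": "infrastructure B"},
])

# With a standardized persistent identifier, duplicates collapse trivially;
# without one, users must fall back on combinations of source, date, location
# and taxon, which only works if those fields are standardized too.
deduplicated = aggregated.drop_duplicates(subset=["occurrenceID"])
```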

3. There are (many) standards to follow.

As long as the aim is to save and share biodiversity data, we advocate that the

optimal solutions are to deposit data within databases and data infrastructures.

However, this requires harmonization and integration of the data, i.e., they need to be

standardized (Box 1). Broadly considered, there are multiple types of, and reasons for

using, data standards. For example, contents of variables have standards (i.e., units or

forms of measurement, such as the SI units), variables need to be standardized by

name and meaning across datasets, and even metadata require standards for machine

interoperability and readability. For the purposes here, we focus on the standardization

of biodiversity datasets, including the relationships between them and the data they

contain (e.g., taxonomic hierarchy or interspecies interactions). We utilize the term

“dataset record-level data standards” to differentiate them from the other forms.

Standardization can occur very simply and independently; for example, a single

researcher integrating datasets employs a form of standardization. Collectively

and more formally, standards are ratified at an international level for widespread

adoption and utilization. The Biodiversity Information Standards working group (TDWG,

the acronym is based on a former name of the group; https://www.tdwg.org) is the

international resource for biodiversity data standards. It is a collaborative international

network of scientists and computational experts from a variety of fields, who together

establish standards for dataset integration. Data standards have developed primarily

based on the interest and computational needs of those depositing and integrating data;

thus, they have been especially influenced by museum data digitization (e.g., Graham

et al. 2004), and the rise of data infrastructures such as GBIF (Wieczorek et al. 2012,

Robertson et al. 2014, De Pooter et al. 2017). However, as the field of biodiversity data

changes, and the data sources diversify, so too do the practicalities of data

standardization (e.g., Schneider et al. 2019; Section 4).

Dataset record-level data standards help integrate and harmonize data into databases,

whether private or shared, and, ultimately, into data infrastructures. They make different

datasets uniform during integration, by requiring formatting of data to established

identifiers (e.g., persistent identifiers such as DOIs), labels (e.g., names of variables),

definitions (e.g., how variables are defined) and interrelationships (e.g., how variables

interrelate). The focus is, from the data providers’ and users’ points of view, primarily on

the variables: how they are named and how they interrelate (Figure 1). From the data

scientists’ point of view, it is more complicated, because very large databases do not

usually store data as a single table that retains variables as, for example, the columns.

They instead parse data to save storage space, often by reducing redundancies and

blank entries. For example, they work around data cores, which are subsets of the

original datasets that connect to all of the rest of the dataset material (and, thus, need to

be standardized).

Data cores are an example of required (i.e., core) variables for data deposition into a

database or data infrastructure. They have become somewhat synonymous with

dataset record-level standards because the terms (i.e., names of variable) have

established vocabularies, thesauri and/or ontologies associated with them (Box 1). For

example, GBIF’s star schema is built around the Darwin Core data standard, a set of

terms (variable names) describing the taxon, location, date and related basic

information, and which have been ratified in TDWG (Wieczorek et al. 2012, Baskauf et

al. 2016, Baskauf & Webb 2016). Cores function also to limit the amount of storage

space that data take up by revolving around a core standard of terms, compared to

storing data in a more basic table format - although data cores have been criticized

for constraining how easily new variables can be added, thus they may become

obsolete with time (Robertson et al. 2014, De Pooter et al. 2017, Gallagher et al. 2020;

Section 4 below). There is rarely a perfect match to how a database structure is set up;

tradeoffs between types of data, storage space, computational time to access data, and

the incorporation of new forms of data all weigh into any option.
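A minimal sketch of the star-schema idea, with invented values (real Darwin Core Archives use flat text files rather than DataFrames): a core table of standardized records, with additional material held in extension tables that link back to the core identifier:

```python
import pandas as pd

# Core table: the standardized, required terms.
core = pd.DataFrame({
    "occurrenceID": ["occ-1", "occ-2"],  # hypothetical identifiers
    "scientificName": ["Fagus sylvatica", "Picea abies"],
    "eventDate": ["2020-05-11", "2020-05-12"],
})

# Extension table: optional measurements, keyed back to the core record.
measurements = pd.DataFrame({
    "occurrenceID": ["occ-1", "occ-1", "occ-2"],
    "measurementType": ["height_m", "dbh_m", "height_m"],
    "measurementValue": [23.4, 0.61, 18.9],
})

# The star layout avoids one wide table full of blanks and redundancies;
# a flat view is re-assembled only when a user needs it.
flat_view = core.merge(measurements, on="occurrenceID", how="left")
```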

Dataset record-level data standards, more concretely, include collections of terms (e.g.,

names of variables), associated concepts of the terms (e.g., definitions), and described

relationships between the terms (e.g., taxonomic hierarchies or species interactions

between variables). They are assembled into semantic artifacts such as vocabularies,

thesauri and/or ontologies. They often follow semantic web standards via the application

of the resource description framework (RDF), as expressed with the Web Ontology

Language (OWL) for computational relevancy (Box 1). It can be a major semantic

endeavor to create any of the types of standards, requiring many participants for

discussions, and the general scientific community for implementation. When employed,

they define and standardize variables, e.g., the data fields across different data sets,

make data integration possible, and can pave the way for machine actionability (with a

consistent structure for computation) and machine readability (with instructions for

computational interpretation and manipulation).

Despite their relatedness, there are critical differences between the forms of dataset

record-level data standards. Vocabularies are lists of words (terms, such as variable

names) with definitions, like a dictionary. They organize specific terms (names of

variables) into an inventory. For example, the Darwin Core standards list terms and their

definitions in a controlled vocabulary. Many times, however, relationships between

terms are needed to better express them. A classic example would be the Linnean

classification system, which organizes taxonomic terms into a hierarchy that nests more

specific terms within broader ones. In contrast, a vocabulary does not, on its own,

achieve any such definition of the relationships between terms.

Ontologies and thesauri are often used interchangeably to refer to a vocabulary that

also includes established relationships between the terms, even though, semantically,

they differ. Compared to vocabularies, thesauri contain a more prominent relationship

structure by organizing terms into a hierarchy. Ontologies contain an even greater focus

on knowledge representation and provide semantic linkages between vocabulary terms

by specifying relationships between terms, which can extend beyond hierarchical to, for

example, interspecies interactions (e.g., Wohner et al. 2022). Ontologies were

developed to provide greater structure than could any singular vocabulary, thereby

making data more accessible, and standardizing across vocabularies (Mi & Thomas

2011, Kartika et al. 2022). Vocabularies can be used for multiple different ontologies, so

that ontologies of broadly similar topics (e.g., biology, medicine and ecology) may

partially overlap (Kartika et al. 2022). If vocabularies are like a dictionary for data, then

ontologies are like the language of data, consisting of not only the words, but also

relationships between words and their meanings. Ontologies, while ultimately more

helpful for data integration and standards, are also more complex to develop and,

therefore, less applied in practice than are vocabularies (Gadelha et al. 2020).
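This gradation can be sketched in a few lines of Python (our illustration; the definitions are paraphrased and the relations are illustrative, not drawn from any ratified standard):

```python
# A vocabulary: a flat inventory of terms with definitions, like a dictionary.
vocabulary = {
    "scientificName": "The full scientific name, with authorship if known.",
    "genus": "The genus in which the taxon is classified.",
}

# A thesaurus: the same kind of terms, plus a hierarchy (narrower -> broader).
thesaurus_broader = {"genus": "family", "family": "order"}

# An ontology: typed relationships beyond hierarchy, here as
# (subject, predicate, object) statements.
ontology_statements = [
    ("Amanita", "is_a", "genus"),
    ("Amanita muscaria", "interacts_with", "Picea abies"),  # e.g., mycorrhizal partner
]
```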

A gradation in complexity can be used, as here, to describe the basics of vocabularies,

thesauri and ontologies. However, the standards can also be viewed alongside a

continuum which describes a transition from machine actionability to machine

readability. For computation, the managed terms of vocabularies provide tagging

information that support data aggregation and information retrieval. In contrast, thesauri

and, especially, ontologies are designed for machine readable knowledge. They both

allow cross-referencing and semantic interfacing with computers (Walls et al. 2014,

Schneider et al. 2019, Kartika et al. 2022).

Applications of the resource description framework (RDF) are, essentially, the

computational voices for how data standards can be interfaced with computers. RDF

can be used to computationally link two terms (e.g., variables) to reflect relationships,

thereby allowing large databases to push away from flat file storage formats (such as

GBIF’s star schema; Wieczorek et al. 2012, Baskauf et al. 2016, Baskauf & Webb

2016). RDF is, in other words, how basic sentences are computationally formed,

explaining, for example, taxonomic hierarchies and species interactions. It explains

relationships in a triplet sequence of subject, predicate and object. RDF can

computationally express ontological relationships, with the Web Ontology Language,

OWL, an example of a data modelling language of data standards that are expressed in

RDF.
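A minimal sketch of an RDF triple in practice, using the rdflib library and the published Darwin Core term IRIs (the occurrence identifier is hypothetical):

```python
from rdflib import Graph, Literal, Namespace, URIRef

DWC = Namespace("http://rs.tdwg.org/dwc/terms/")  # Darwin Core term IRIs
g = Graph()
g.bind("dwc", DWC)

occurrence = URIRef("https://example.org/occurrence/1")  # hypothetical subject
# Each statement is a subject-predicate-object triple.
g.add((occurrence, DWC.scientificName, Literal("Amanita muscaria")))
g.add((occurrence, DWC.eventDate, Literal("2021-09-15")))
g.add((occurrence, DWC.decimalLatitude, Literal(59.97)))

print(g.serialize(format="turtle"))
```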

Despite the many available ontologies to date, the ability to match data to existing

sources can remain very low due to the diversity of variables possible to quantify (e.g.,

Kartika et al. 2022, Wohner et al. 2020, Wohner et al. 2022). A pragmatic approach to

implementing data standardization concerns semantic entity mapping, which connects

and unifies, for example, similar terms (variables) between different semantic artifacts

(vocabularies, thesauri, ontologies) used to annotate different datasets. Data users

may, without realizing it, routinely employ a simple version of mapping when they

connect variables when integrating independently produced datasets. They would sort

and combine, when possible, the geographical and date variables, or species

taxonomy, between different datasets. Mapping can be fairly intuitive for integrating and

synonymizing variables between a small number of related datasets; however, it becomes

more challenging as datasets increase in size and diversity. It is also subject to user

interpretation. Another potential issue is that it can be employed only so far as there is

availability of established data standards for the terms to be mapped, else the decisions

become even more subjective (Wieczorek et al. 2012, Walls et al. 2014).
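A sketch of this everyday form of mapping, harmonizing two invented field datasets onto Darwin Core term names (the local column names and values are hypothetical):

```python
import pandas as pd

# Two independently produced datasets with different local column names.
plots = pd.DataFrame({"species": ["Quercus robur"], "lat": [59.94],
                      "lon": [10.72], "date": ["2022-06-01"]})
survey = pd.DataFrame({"taxon_name": ["Quercus robur"], "latitude": [59.81],
                       "longitude": [10.61], "sampled_on": ["2022-06-02"]})

# The mapping itself: local terms -> shared Darwin Core terms. Building and
# maintaining this dictionary is the (often subjective) work described above.
TO_DWC = {
    "species": "scientificName", "taxon_name": "scientificName",
    "lat": "decimalLatitude", "latitude": "decimalLatitude",
    "lon": "decimalLongitude", "longitude": "decimalLongitude",
    "date": "eventDate", "sampled_on": "eventDate",
}
merged = pd.concat([df.rename(columns=TO_DWC) for df in (plots, survey)],
                   ignore_index=True)
```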

Some argue that ontologies, which reflect interrelationships, should be built to

comprehensively cover biodiversity data rather than to use what could be a simpler, but

more repetitive and subjective approach to independently map data. The latter

“reinvents the wheel” with every mapping occasion, as opposed to the former, which

gives instructions on how to build the wheel. However, the latter is much easier to

implement directly in data integration, unlike the former, which takes substantial

additional work to create the ontologies for the terms of interest, which are nearly as

diverse as biodiversity itself. Therefore, mapping is often a more pragmatic solution to

data integration than is the development of a specific ontology.

With regard to dataset record-level data standards, there are four important points to

keep in mind. First, data standards refer to any individual or collection of vocabularies,

thesauri, and/or ontologies (Box 1). They grade in the degree of complexity with which they

define the relationships between terms, from the simplest, a vocabulary (which lacks

relationships), to the more complex, an ontology (which specifies relationships). All can

be applied as a standard. Each degree of complexity increases the accuracy of

computationally representing the terms (e.g., variables) and their relationships (e.g.,

taxonomic hierarchy) and, hence, the biodiversity data. For example, a vocabulary can

be a data standard, and an ontology is also a data standard. Either may be applied to

any given database or data infrastructure, but only the ontology will define relationships

between terms.

There is also a hierarchy to data standards, which is important to bring up due to its

impact on data interoperability (e.g., computational communication). Data standards

vary on how broadly or specifically applicable they are to, and across, disciplines (also

referred to as domains). Multiple ontologies can sync across disciplines by linking into a

top-level ontology that spans them all (Box 1). Top-level ontologies ensure

uniform meaning of objects (such as terms) and how they interrelate, irrespective of the

discipline they are applicable for. In the early 2000s, the Open Biomedical Ontologies

(OBO) consortium formed the OBO-Foundry, which does exactly this for biological and

related disciplines (Smith et al. 2007). There is only one level of hierarchy above the

OBO-Foundry, and that is the Basic Formal Ontology (BFO), which lacks any domain-

specific terminology (Otte et al. 2022). The domain-specific terms must be referenced

within the OBO-Foundry and, even more specifically, in its more specialized ontologies.

The Biological Collections Ontology (BCO) has started the process to integrate some of

the first key terms from Darwin Core in an OBO-Foundry ontology (Walls et al. 2014).

Thus, data standards segue from very broad to very specific, and the more discipline-

specific standards should, ideally, fit into this hierarchy. The greater relatability

discipline-specific ontologies have to top-level ontologies, the better for data

interoperability.
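To make the hierarchy concrete, here is a sketch of how a domain-specific class can be anchored to an upper-level one through subclass axioms; the namespaces and term IRIs below are placeholders, not actual BFO or BCO identifiers:

```python
from rdflib import Graph, Namespace
from rdflib.namespace import RDFS

# Placeholder namespaces standing in for an upper-level ontology (e.g., BFO)
# and a domain ontology (e.g., BCO); the term IRIs are invented.
UPPER = Namespace("https://example.org/upper#")
DOMAIN = Namespace("https://example.org/biocollections#")

g = Graph()
# The domain-specific classes inherit the upper-level meaning of "process",
# so any reasoner that understands the upper ontology can interpret them.
g.add((DOMAIN.SpecimenCollectingEvent, RDFS.subClassOf, UPPER.Process))
g.add((DOMAIN.TaxonIdentification, RDFS.subClassOf, UPPER.Process))
```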

The third point is that data standards can be applied, either in part or in full, to any

number of databases and data infrastructures. For example, the Darwin Core data

standard is a vocabulary that is expressed in RDF (Baskauf et al. 2016, Baskauf &

Webb 2016). It is applied as ratified in TDWG for GBIF data (Wieczorek et al. 2012), but

it is modified in another data infrastructure, OBIS (the Ocean Biogeographic Information

System; De Pooter et al. 2017), to include an extension of the terms covered (see

Section 4 for more about extensions).

The final point is with respect to ratification. TDWG supports and ratifies many

standards, including Darwin Core, but many other vocabularies, thesauri and ontologies

also independently exist (Kartika et al. 2022). There can be greater or lesser adoption of

them by their scientific communities, i.e., not all are ratified, and even more are being

developed on an as-needed basis (Schneider et al. 2019, Gallagher et al. 2020). Data

standards are usually created when needed, thus, they rarely follow a single given

protocol or method for development, and rarely encompass all terms an individual

research project may contain in its dataset. Data standards are community-driven in

creation, and TDWG is open for anyone to join. Thus, data standards impose

restrictions on data quality and formats (Section 4), but they in themselves are not

restricted in terms of who creates them.

4. How and whether to select a standard core?

To date, a straightforward approach to selecting which data can be deposited into a

database is to delimit a standard core of common terms (e.g., variables). This approach

is employed when centralizing data into a hub that manages it, and requires that dataset

record-level data standards have already been developed, to form the standard core. A

classic example is how GBIF is structured around a set of Darwin Core standard terms

(Wieczorek et al. 2012). Data deposition is streamlined to GBIF by ensuring all datasets

contain the core standard variables; however, this also removes or archives extraneous

variables which may have originally been a part of the datasets (De Pooter et al. 2017,

Guralnick et al. 2018, Gallagher et al. 2020). Information is lost when not all data are

included. Standard cores have been created due to the variety of dataset variables; the

cores represent the most reliable and common terms (variables) across a variety of

datasets, for a given purpose or objective (e.g., museums; Graham et al. 2004), but this

cannot cover the whole of biodiversity data. An emerging issue we are facing, as

community-based databases and data infrastructures become more prevalent, is the

need to include data terms (variables) that were earlier excluded. We provide

three options for this, discussing how it may not always be possible to align all data

around a core standard that centralizes the data architecture.

One choice is at the level of the data standards governance, whether ratified in TDWG

or maintained independently elsewhere. The number of terms in a standard core can be

increased through an update of the existing standard core. For example, the ratified

TDWG Darwin Core standard began with 24 terms (i.e., variables) in 1998, grew to 169

by their 2009 version (Wieczorek et al. 2012; Darwin Core Maintenance Group 2021)

and currently contains 206 terms in their July 2023 version

(https://github.com/tdwg/dwc/releases). Still, 206 terms cannot describe all of

biodiversity data, and new terms must be ratified in an update before they can be

included in the core. In addition, for any given database or data infrastructure

which began based on an older version of a data standard, there will need to be a new

solution to how to update to the new standard. For example, they would have to accept

blanks in the data deposited before the update for the newly added terms, especially

when the data did not originally contain any of the updated terms. Another option could

be to retain an archival version of more complete data that can be used to add in at

least some of the new terms. Original data sources could also update the data to

include variables earlier excluded, if the data managers were motivated and able to. A

final option would be to not accept the new version of the standard, and instead

maintain data with an older standards version. In the case of GBIF, Darwin Core has

been made backwards compatible for such purposes, to accommodate data updates based

on updates to the core terms. While updating a data standard is very helpful for future

databases and data infrastructures, it complicates matters for existing ones, and brings

forward a degree of catch-up work for them.

A second option does not need to be directly enacted at the level of data standards

governance; it does not modify a ratified data standard. It is applied at the level of an

individual database or data infrastructure, and works by including a new optional set of

data standards to the existing database, which is referred to as an extension, or an

extended core. When a standard core has already been implemented, which has

steered the database structure, and it cannot be drastically modified, this is a practical

solution to extend out the data beyond the standard core. Adding an extension of the

standard core (which will be a new set of standards) allows data to be integrated that fit

with not only the original standard core, but also the newly established extended core.

The Humboldt Core (which is ratified by TDWG) is an extension now available for GBIF-

deposited data. It allows users to include field inventory information additional to the

standard Darwin Core (Guralnick et al. 2018). GBIF contains multiple other extensions

(Robertson et al. 2014). Similar logistical issues regarding the implementation of

extensions for databases or data infrastructures with earlier deposited data also exist

with this option. Usually, the extended core is not required, perhaps somewhat

alleviating the issues of adopting one after the fact, at least in practical terms of earlier

deposited data. Otherwise, the same issues remain for updating already established

databases and data infrastructures as explained for the first option.

The third option is to alter how new databases and data infrastructures are structured

for data deposition. By switching to a decentralized network architecture, data standards

are not required prior to data deposition. They instead may be created after the fact, on

an as-needed basis. It bypasses the above issues regarding creating and/or applying

only established data standards. A decentralized (but still connected) network

architecture is promoted for cases when data standards do not yet exist for the data that

are to be integrated. For example, the Open Traits Network is advocating this approach

to incentivize community collaboration in integrating diverse traits data that lack any

current data standards (Gallagher et al. 2020). It works opposite to the centralized (and

connected) network architecture, as is used, for example, with GBIF and OBIS.

Considering that ratified data standards take years to be created, removing that

requirement would open the potential to integrate a much greater amount of biodiversity

data. In theory, the approach promotes data standards creation at a level that

supersedes the community-driven approaches of TDWG and similar groups. However,

the degree to which such a design really will deviate from the TDWG governance

approach to data standards is questionable, i.e., could it really be a current-day

reenactment of what originally led to TDWG and the existing data standards

governance? How will mapping be implemented to join up similar dataset variables? We

caution that a distributed collaboration approach brings with it substantial logistical

challenges in terms of community representation and data management (Shaw &

Hargittai 2018, O’Leary et al. 2020), but highlight it as an option being promoted by

some for breaking out of how they perceive the current data standards regulations to be

(Gallagher et al. 2020).

It should be clarified that other bottom-up community approaches have already been

tested by both GBIF and TDWG, to support anyone in describing a new term needed for

their data. In a recent approach, GBIF is computationally compiling all terms used in

datasets that are submitted to and published in GBIF. Many datasets contain more

variables than what is ultimately distributed by GBIF in its core and extensions, and

these data are available to access in their archives. The inventory of all data terms will

be used as a starting point for future discussions to standardize more terms than

currently found in controlled vocabularies.

Data standards, however created, are critical for integrating data; otherwise, database

contents (i.e., variables) may overlap, not match up properly or be incorrectly defined

(i.e., semantic mismatches). Data standards also can constrain the extent of data able

to be integrated, and thus are not always adopted in fully ratified ways. No approach to

working with data integration is a “silver bullet” for biodiversity data, nor can any network

structure promise perfect harmonization.

5. Case study part I: introducing the Biodiversity Digital Twin and available data

standards

Digital twins (DTs) produce real-time monitoring and decision-making information

through the automated processing of data, models and output. As with any modeling

approach, DTs require data. They utilize both existing and, more uniquely, real-time

sensor data to run models and produce results for people to interpret, whether for

industrial, engineering, or, more recently, biodiversity purposes (de Koning et al. 2023,

Trantas et al. 2023). The data used by a DT are constantly updated with new

information that the DT accesses, then uses to reprocess the modeling output. The

revisions happen continuously and, ideally, in real-time. One example of a DT is a

weather app. Weather predictions are based on sensor data continuously being fed into

weather models, with each being updated continuously to predict current and future

conditions. Results are graphed and tabulated, and fed into apps that users view to

monitor and check the weather forecasts. The Biodiversity Digital Twin (BioDT) project

has been developed as a DT analog for biodiversity projects - an ambitious endeavor to

tackle the world’s biodiversity issues in real-time analyses. What is remarkable about a BioDT

is the data integration plus modelling – it is a “twinning process” that will allow us to

attempt to better understand biodiversity through real-time monitoring as well as

predictive modelling for current and future time.
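Stripped to its skeleton, the twinning cycle described above might look like the following sketch, in which every function is a hypothetical placeholder:

```python
import time

def fetch_new_observations():
    """Hypothetical: pull the latest sensor or occurrence data."""
    ...

def rerun_models(state, observations):
    """Hypothetical: update model state and predictions with the new data."""
    ...

def publish_outputs(state):
    """Hypothetical: push refreshed predictions to users and apps."""
    ...

state = {}
while True:  # the continuous, ideally real-time, update loop of a digital twin
    observations = fetch_new_observations()
    state = rerun_models(state, observations)
    publish_outputs(state)
    time.sleep(3600)  # in practice, driven by data arrival rather than a timer
```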

The ambitions do not stop with creating the BioDT, however. Multiple DTs are planned

to be integrated into a global system that can represent current to future aspects of the

natural, physical and social world, i.e., a computational rendering of earth, now and into the

future. The BioDT should, thus, be built to be able to connect with other DTs. For

example, to interact with the Destination Earth DT, a European Union initiative on

climatological, meteorological and atmospheric conditions of the world (Hoffmann et al.

2023). For the DTs to interact, they will require interoperability, and this requires uniform

data standards to allow the DTs to communicate. It is supported through the

advancement of open science, such as the work of the European Open Science Cloud

(EOSC) Association.

In the BioDT project, ten prototype Digital Twins (pDTs) have been developed which

encompass a variety of biodiversity research topics (Golivets et al. 2024). They are

organized around four themes: species and environmental change; genetic biodiversity;

threats of policy concern; species and humans. Each pDT requires input data to run

their models, but the sources and types of data vary. For example, the pDT data differ in

the organismal groups, the taxonomic coverage, the spatiotemporal scales, and the

sources of the data, i.e., from the data users’ perspectives, they differ in the explanatory

and response variables, and where to find those data. The variety in datasets is a

challenge for FAIR data and the application of data standards by the BioDT. It also

exemplifies the practicalities of integrating biodiversity data across an array of topics, data

sources, and data types.

The BioDT is pillared by participating Research Infrastructures (RIs), which are

organizations involved in maintaining, integrating and disseminating biodiversity data

openly in Europe and/or globally; some are also data infrastructures (i.e., research data

infrastructures). The BioDT RIs consist of: GBIF; the Distributed System of Scientific

Collections (DiSSCo); the Integrated European Long-Term Ecosystem, critical zone and

socio-ecological system Research Infrastructure (eLTER); the e-Science European

Infrastructure for Biodiversity and Ecosystem Research (LifeWatch ERIC); and,

informally included later in the project, the European life science infrastructure for

biological information (ELIXIR). The BioDT RIs differ in their objectives, any forms of

data which they integrate and manage, and any applications of data standards (Table

1). The data standards that each utilizes differ, which governs how standardized data

may be integrated into the BioDT.

As the RIs are European to global in scope, they are as relevant for other data

integration initiatives as for the BioDT project. Whether depositing data into or extracting

data from the RIs, how they standardize their data will impact decisions and logistics of

data use and further integration (e.g., Wohner et al. 2020). Ideally, all data would filter

first through one of the RIs. However, in practice, they do not. The first step to

integrating data that will include contributions from the listed RIs, or any further similar

organizations, is to understand what data standards they apply, and how the data

network is organized (Table 1). In terms of DTs, and given their novelty at this point in

time, the RIs require our guidance in how they can better support digital twinning, which

includes the need to harmonize their data standards across disciplines (domains) and

subdisciplines.

Among the five RIs, GBIF and DiSSCo overlap the most in data standards, because

they both deal with museum data (Table 1). They employ the Access to Biological

Collection Data (ABCD) and Darwin Core (DwC) data standards (Holetschek et al.

2012, Wieczorek et al. 2012, Walls et al. 2014). The ABCD standard defines

relationships between terms specific to both specimens (e.g., taxon) and collections

(e.g., holding institution). In contrast, DwC terms include those specific to records (e.g.,

type, basis of record), occurrences (e.g., recorded by), event (e.g., year, month, day),

location (e.g., decimal latitude, decimal longitude, altitude), and 13 other categories of

terms. DiSSCo independently also utilizes the Collection Descriptions (CD) for entire

collections of natural history museums (e.g., expeditions), and the Minimum Information

about a Digital Specimen (MIDS) for clarifications on digitization. Unique to GBIF is their

recent adoption of the Humboldt Core Extension (HumbExt; https://eco.tdwg.org) to

standardize aspects of field inventories (Guralnick et al. 2018). The HumbExt is a

vocabulary of field-based terms that, for example, allows the recording of species

absences, details on the event (e.g., event duration), sampling effort (e.g., if described

and where described, such as in a publication), and similar relevant information not

captured by DwC. Note that extensions are optional rather than required, i.e., not

all GBIF data will have data within the HumbExt core.

Data that are deposited within eLTER can contain a diversity of variables that often are

difficult to integrate or standardize together, in contrast to the more centralized

organization of GBIF and DiSSCo data systems. Instead of a primary core, eLTER data

are independent datasets that cover atmospheric, ecological, geological, hydrological &

social research topics. The eLTER network architecture is, in other words, more

decentralized. As a result, eLTER primarily focuses on standardizing dataset

measurements and protocols through their Standard Observations (eLTER SOs; Table

1). For ecological data, eLTER does also support the EnvThes, an ecological thesaurus

built on existing vocabularies (Schentz et al. 2013). Examples of EnvThes terms include

definitions of species abundance, biomass and growth. However, to our knowledge

there is no strict requirement to adhere to EnvThes vocabulary, nor does EnvThes

match to any of the standards applied by GBIF and DiSSCo. eLTER and EnvThes focus

on site-level field data (Wohner et al. 2022).

The final two RIs, LifeWatch ERIC and ELIXIR, are not directly involved with data

aggregation, and so have not adopted any specific data standards. LifeWatch ERIC

primarily facilitates biodiversity data logistics. Among its resources is the EcoPortal

(https://ecoportal.lifewatch.eu), a repository of about 30 ecological data standards and

related semantic domains, including EnvThes (supported by eLTER). ELIXIR’s focus is

more on life science data, i.e., biomolecular and chemically-derived data. They also

provide guidance and weblinks to help with databases and standards. As with LifeWatch

ERIC, the ELIXIR reference lists can be a stepping-stone towards finding relevant data

sources and/or standards.

If all data will originate from within an RI, any research project requiring the data can

bypass further data standards applications. However, if data originate from multiple RIs

and/or only partly from an RI, the data must be standardized to be harmonized. Six of

the ten BioDT pDT projects utilize biodiversity data from either GBIF or other

independent sources (i.e., otherwise published or private), and two utilize eLTER data

(Figure 2). The differences between the data standards of GBIF and eLTER will prove

challenging for any direct data integration procedures between them.

6. Case study part II: the ways to standardize biodiversity data during integration

for complex research and DTs.

Philosophies on data standards range from the most idealistic, top-level approaches

that utilize and contribute to standards that are suitable across all possible disciplines

(domains), to the most practical, applied methods that directly address one specific data

integration case without relevance to even their own discipline within the field of data

standards. There is a struggle between moving the field of data standards forward and

two countervailing pressures: a lack of knowledge that such a field exists or is relevant to

individual researchers, and the weight each individual research project bears

to complete its task(s) in a limited and timely manner.

In the case of the BioDT pDTs, and many data users’ objectives, data are created

and/or selected for the research, and not based on existing standards or data

infrastructures. The goals are to integrate the data, but only to utilize them in modelling.

However, the more data standards can be advanced alongside the actual research, the

more it will aid future data standardization requirements. In this final section, we explain

the possible ways to standardize biodiversity data during integration, transitioning from

ideal to practical. The efforts and outcomes are a balance between practical constraints

in enacting standardizations versus the advancement of biodiversity data standards as

a field. The labels we provide for each approach are not meant to serve as lasting

names, but rather to help differentiate between the possible options we propose.

The first approach to data standards is the most broadly applicable. It is conducted with

relevance across all domains, by basing a new biodiversity ontology on a top-level

ontology. We thus call this the “top-ontology” approach to data integration, as it begins

most broadly, for example at the level of the OBO-Foundry (Section 3), to ensure all

biodiversity data objects (such as terms) will be interoperable across the lower-level

ontologies and different disciplines. In such a case, data and information scientists

could work with RIs to ensure and guide them in their data standards efforts, thereby

producing an upper-level ontology with interoperability capable of, for example, linking

different DTs.

The next approach, which we dub the “single-ontology” approach, would be to build off

of existing ontologies to unify them, and to define the relationships between all of the

data terms (variables) of the different biodiversity datasets. Like the “top-ontology

approach,” it would advance the biodiversity data standards field, and in such a way

result in less repetition for future initiatives. Communication with the general data

standards community would be critical, for awareness as well as for developing the

ontology. The ontology would be built from established data standards, not starting from

scratch, to add in terms not yet established. One by one, a list of all possible terms

would need to be compiled, mapped to existing vocabulary and ontologies, and then

filled in with further information to create a full biodiversity ontology. For example, in the

case of the BioDT, it would be built from ABCD, DwC, eLTER SO, EnvThes, and

HumbExt data standards, applying them to the different data sources of the pDTs

(Table 1). Semantic mismatches would need to be identified, alongside establishing

terms, the definitions, synonym terms, and the interrelationships between terms.

Another, similar sub-approach would stop at building a comprehensive vocabulary,

instead of an ontology, and thus be somewhat more achievable. Relationships between

terms would not need to be established, but the bulk of the work in defining terms and,

potentially, synonyms would still be required.

The efforts of both the “top-ontology” and “single-ontology” approaches to integration

would be substantial, and most practical if the general data standards community (i.e.,

TDWG) and the RIs could drive it forward. To succeed, it would require extensive time

for discussion and implementation; more than is afforded in most research project

durations. The process itself would also require a community component and could

bring up a myriad of conversations and opinions on the semantics of terms. How

practical such an approach is within the realities of funding timelines is questionable.

However, it can alternatively be argued that even beginning such a task, to allow then

the TDWG community and/or RIs to take it over, would be substantial progress in the

field of data standards. Linking DTs without any standards based on

top-level ontologies would prove an impossible endeavor, with little progress likely

to be made towards the vision of a Destination Earth unifying the DTs. Even

given how diverse the BioDT pDT projects are (Golivets et al. 2024), our goal is more to

guide the direction of data standards (especially within and among RIs) towards

interoperability. Within the timespan of the BioDT, it would be nearly impossible to build

any unifying ontology or vocabulary to cover the data of all ten pDTs.

In similar situations, one bottom-level variation of the “single-ontology” approach could

be to focus on one topic (e.g., crop ontologies or another topic of the pDTs for the

BioDT). Usually, it would connect to a researcher’s own interests, and while it would

help move forward the data standards field, it would not suffice to meet a project’s

needs when data standards must cover the whole of a variety of topics, nor would it be

cross-domain. It would likely still be a daunting task for someone to begin who has not

already gained knowledge in the data standards field (hence our guide here). By

developing research projects in collaboration with data scientists experienced in data

standards, the “single-ontology” approach should become more feasible to implement

with time and sustained investment. This approach has, for example, been advocated, but

not implemented, in a review of biodiversity digital twin challenges (Trantas et al. 2023).

The third approach is more pragmatic to implement, but it also will not forward the state

of biodiversity data standards to the same degree as the prior two. What we call the “map

to existing” approach begins similarly to the prior two methods. The first step would be

to find any applicable and already established data standards, for example as we have

done here for the BioDT (Table 1). The existing data standards could be compiled, as

possible, into a list of terms that are then used to map the research data terms

(variables) to. If research data terms do not match to an existing standard, they should

still be matched, as relevant, between the dataset variables of the project. In such an

approach, there would be no effort to create new data standards, but existing applicable

standards would be utilized to map the research data. In the same way that lists can

help notify the scientific community of available datasets (i.e., Section 2), an initiative

which compiled existing biodiversity data standards would be helpful for the data

standards community. This approach has been implemented by Wohner et al. (2020) to

create a set of core site-based terms, analogous to the Darwin Core of GBIF, with

additional recommended terms for lesser used field terms. Other RIs such as LifeWatch

ERIC and ELIXIR have begun similarly, but stopped short at supplying links to the

different vocabularies. We suggest that the “map to existing” approach is better suited to

research projects that do not cover a large extent of biodiversity topics, as otherwise,

and as with the first two approaches, the implementation becomes challenging within

the full data scope, as for example, with the BioDT.
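A sketch of the first step of the “map to existing” approach: compile terms from established standards, then report which of a project's variables can and cannot be mapped (the compiled excerpt and local variables are invented):

```python
# A tiny, invented excerpt of terms compiled from established standards
# (e.g., Darwin Core and EnvThes would contribute many more).
STANDARD_TERMS = {
    "scientificName": "DwC", "eventDate": "DwC",
    "decimalLatitude": "DwC", "decimalLongitude": "DwC",
    "abundance": "EnvThes", "biomass": "EnvThes",
}

# A project's local variables and their proposed matches to standard terms.
local_to_standard = {
    "species": "scientificName",
    "sampling_date": "eventDate",
    "soil_moisture": None,  # no established term found; map within project only
}

for local, standard in local_to_standard.items():
    if standard in STANDARD_TERMS:
        print(f"{local} -> {standard} ({STANDARD_TERMS[standard]})")
    else:
        print(f"{local}: unmatched; document and harmonize across project datasets")
```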

The final alternative, which we call the “map independently” approach, is the easiest to

apply, but it also fails to forward the field of data standards. Independently, without

regard to any existing data standards, data can be mapped between the different

sources. It will achieve the goals of a research project to be able to model the integrated

data. It is the method that, essentially, researchers utilize by default when independently

integrating data. The lack of data standards advancement and the high subjectivity are

cons that trade off against the timeliness and ease of enacting this approach. In the case of the

BioDT, the interdisciplinary nature of the highly varied biodiversity data of the pDTs has

prompted us to adopt the “map independently” approach. It should be noted that over

half of the pDTs do utilize an RI for data; hence, many of the data terms are already

standardized. In alignment with EOSC’s Metadata Schema and Crosswalk Registry, the

BioDT project is looking into mapping the pDT data through a newly developed

Mapping.bio tool (Wolodkin et al. 2023). We recognize the importance of data standards

and furthering the field, as we have taken care to explain throughout this guide.

Ultimately, however, we are as tethered to fulfilling the project goals as is the next

research project. It demonstrates a conundrum to data standards: how do we advance a

field that itself is all too often an “afterthought” to a methodological component of a

research project? We cannot fully answer our question, but can begin to by raising

awareness of the issue through the production of this guide.

7. Conclusion.

Data standardization is an instrument for the integration of massively diverse data, as

needed for complex research projects such as the BioDT. Mismatches between data

standards of existing biodiversity research data infrastructures render it challenging to

adhere fully to them across clustered communities of data producers and data users.

Research infrastructures will continue to augment their data contents, which may make

them increasingly more relevant for data users. The challenges facing data

standardization by an independent research project may be too many for most to

overcome, but projects can at least be aware of and adhere to the existing standards

that have been implemented for their data utilized in a project. Much of the work with

data standards will be achieved within research infrastructures, alongside inter-research

collaboration and wider open-science initiatives like EOSC and Destination Earth, and,

finally, with interoperability from cross-digital twin project collaborations.

Acknowledgements.

Dmitry Schigel is thanked for contributions to our knowledge of the data sources of the

BioDT projects, and for helpful manuscript edits and comments. This project has

received funding from the European Union's Horizon Europe research and innovation

programme under grant agreement No 101057437 (BioDT

project, https://doi.org/10.3030/101057437). Views and opinions expressed are those of

the author(s) only and do not necessarily reflect those of the European Union or the

European Commission. Neither the European Union nor the European Commission can

be held responsible for them.

Declaration of competing interest.

There are no conflicts of interest that need to be declared.

Author contributions.

C. Andrew: Conceptualization, Data – BioDT pDT projects, Visualization, Writing – original draft, Writing – review & editing. S. Islam: Data – BioDT pDT projects, Writing – review & editing. C. Weiland & D. Endresen: Writing – review & editing.

Data availability.

Further information on the BioDT pDT projects is published in the special issue: Golivets M, Sharif I, Wohner C, Grimm V and Schigel D (eds), 2024. Building Biodiversity Digital Twins. RIO. https://doi.org/10.3897/rio.coll.240. No other data are referenced.

Figures and boxes (with captions).

Figure 1. Data production flow chart. To make biodiversity datasets available, they must first be made interoperable through shared data standards. Data standardization is critical to the compilation of data, whether from lists of available datasets, singular databases, or data within data infrastructures. Despite its criticality, the process of data standardization is rarely discussed, except at the institutional level of data infrastructures.

Box 1. Explanations of what data standards are, and comparisons of important

terminologies and relationships between the standards.

Figure 2. Number of BioDT pDTs utilizing a Research Infrastructure (RI) or other source(s) for their input data. The source(s) that the pDTs utilize depend, to a large extent, on the types of data. Biodiversity data are used to model biological aspects of biodiversity (i.e., what is often predicted; left side), while environmental data are used to predict responses (i.e., what are often the explanatory data; right side). Circles are sized and shaded by the number of pDTs utilizing an RI or other source. There are 10 pDTs in total.

Tables (with captions).

Table 1. Currently published data standards used by the different BioDT research infrastructures (GBIF, LifeWatch ERIC, DiSSCo, eLTER & ELIXIR). The research infrastructures contribute differently towards aspects of biodiversity data. The names of the research infrastructures and data standards link to the webpages of each. The data standards most applicable to biodiversity data are distinguished by a larger, bolder X.

Name | Acronym | Year initiated* | Type of contribution to biodiversity data | ABCD | CD | DwC | EnvThes | HumbExt | MIDS | eLTER SO

Global Biodiversity Information Facility | GBIF | 2001 | Occurrence, checklist & sampling-event data | X | – | X | – | X | – | –

European life science infrastructure for biological information | ELIXIR | 2013 | Life science resources & interoperability services (storage, access, analyses) | Not applicable (provides links to repositories of established standards)

e-Science European Infrastructure for Biodiversity and Ecosystem Research | LifeWatch ERIC | 2017 | e-Science research facilities & services (Virtual Research Environments) | Not applicable (provides links to repositories of established standards; EcoPortal)

Distributed System of Scientific Collections | DiSSCo | 2018 | Natural science (museum) collections data | X | X | X | – | – | X | –

Integrated European Long-Term Ecosystem, critical zone and socio-ecological system Research Infrastructure | eLTER | 2018 | Integrated (biotic, abiotic & social) datasets | – | – | – | X | – | – | X

*Year initiated: for DiSSCo and eLTER, the year they joined the ESFRI Roadmap; for LifeWatch ERIC, the year of its EU establishment; for GBIF, the year of its official establishment; for ELIXIR, the start of its permanent phase (the preparatory phase began in 2007).

References.

Baskauf, S.J. and Webb, C.O., 2016. Darwin-SW: Darwin Core-based terms for expressing biodiversity data as RDF. Semantic Web, 7(6), pp.629-643.

Baskauf, S.J., Wieczorek, J., Deck, J. and Webb, C.O., 2016. Lessons learned from adapting the Darwin Core vocabulary standard for use in RDF. Semantic Web, 7(6), pp.617-627.

Blair, J., Gwiazdowski, R., Borrelli, A., Hotchkiss, M., Park, C., Perrett, G. and Hanner, R., 2020. Towards a catalogue of biodiversity databases: An ontological case study. Biodiversity Data Journal, 8.

Darwin Core Maintenance Group, 2021. List of Darwin Core terms. Biodiversity Information Standards (TDWG). http://rs.tdwg.org/dwc/doc/list/

de Koning, K., Broekhuijsen, J., Kühn, I., Ovaskainen, O., Taubert, F., Endresen, D., Schigel, D. and Grimm, V., 2023. Digital twins: dynamic model-data fusion for ecology. Trends in Ecology & Evolution.

De Pooter, D., Appeltans, W., Bailly, N., Bristol, S., Deneudt, K., Eliezer, M., Fujioka, E., Giorgetti, A., Goldstein, P., Lewis, M. and Lipizer, M., 2017. Toward a new data standard for combined marine biological and environmental datasets – expanding OBIS beyond species occurrences. Biodiversity Data Journal, (5).

Feng, X., Enquist, B.J., Park, D.S., Boyle, B., Breshears, D.D., Gallagher, R.V., Lien, A., Newman, E.A., Burger, J.R., Maitner, B.S. and Merow, C., 2022. A review of the heterogeneous landscape of biodiversity databases: Opportunities and challenges for a synthesized biodiversity knowledge base. Global Ecology and Biogeography, 31(7), pp.1242-1260.

Gadelha Jr, L.M., de Siracusa, P.C., Dalcin, E.C., da Silva, L.A.E., Augusto, D.A., Krempser, E., Affe, H.M., Costa, R.L., Mondelli, M.L., Meirelles, P.M. and Thompson, F., 2021. A survey of biodiversity informatics: Concepts, practices, and challenges. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 11(1), p.e1394.

Gallagher, R.V., Falster, D.S., Maitner, B.S., Salguero-Gómez, R., Vandvik, V., Pearse, W.D., Schneider, F.D., Kattge, J., Poelen, J.H., Madin, J.S. and Ankenbrand, M.J., 2020. Open Science principles for accelerating trait-based science across the Tree of Life. Nature Ecology & Evolution, 4(3), pp.294-303.

Golivets, M., Sharif, I., Wohner, C., Grimm, V. and Schigel, D. (eds), 2024. Building Biodiversity Digital Twins. RIO. https://doi.org/10.3897/rio.coll.240

Graham, C.H., Ferrier, S., Huettman, F., Moritz, C. and Peterson, A.T., 2004. New developments in museum-based informatics and applications in biodiversity analysis. Trends in Ecology & Evolution, 19(9), pp.497-503.

Guralnick, R.P., Hill, A.W. and Lane, M., 2007. Towards a collaborative, global infrastructure for biodiversity assessment. Ecology Letters, 10(8), pp.663-672.

Guralnick, R., Walls, R. and Jetz, W., 2018. Humboldt Core – toward a standardized capture of biological inventories for biodiversity monitoring, modeling and assessment. Ecography, 41(5), pp.713-725.

Heberling, J.M., Miller, J.T., Noesgaard, D., Weingart, S.B. and Schigel, D., 2021. Data integration enables global biodiversity synthesis. Proceedings of the National Academy of Sciences, 118(6), p.e2018093118.

Hoffmann, J., Bauer, P., Sandu, I., Wedi, N., Geenen, T. and Thiemert, D., 2023. Destination Earth – A digital twin in support of climate services. Climate Services, 30, p.100394.

Holetschek, J., Dröge, G., Güntsch, A. and Berendsohn, W.G., 2012. The ABCD of primary biodiversity data access. Plant Biosystems – An International Journal Dealing with all Aspects of Plant Biology, 146(4), pp.771-779.

Iversen, C.M., McCormack, M.L., Powell, A.S., Blackwood, C.B., Freschet, G.T., Kattge, J., Roumet, C., Stover, D.B., Soudzilovskaia, N.A., Valverde-Barrantes, O.J. and van Bodegom, P.M., 2017. A global Fine-Root Ecology Database to address below-ground challenges in plant ecology. New Phytologist, 215(1), pp.15-26.

Jacobsen, A., de Miranda Azevedo, R., Juty, N., Batista, D., Coles, S., Cornet, R., Courtot, M., Crosas, M., Dumontier, M., Evelo, C.T. and Goble, C., 2020. FAIR principles: interpretations and implementation considerations. Data Intelligence, 2(1-2), pp.10-29.

Kartika, Y.A., Akbar, Z., Saleh, D.R. and Fatriasari, W., 2022. An empirical analysis of knowledge overlapping from big vocabulary in biodiversity domain. In: Proceedings of the 2022 International Conference on Computer, Control, Informatics and Its Applications, pp.199-203.

Kattge, J., Diaz, S., Lavorel, S., Prentice, I.C., Leadley, P., Bönisch, G., Garnier, E., Westoby, M., Reich, P.B., Wright, I.J., Cornelissen, J.H.C., et al., 2011. TRY – a global database of plant traits. Global Change Biology, 17(9), pp.2905-2935.

Mi, H. and Thomas, P.D., 2011. Ontologies and standards in bioscience research: for machine or for human. Frontiers in Physiology, 2, p.5.

O'Leary, K., Gleasure, R., O'Reilly, P. and Feller, J., 2020. Reviewing the contributing factors and benefits of distributed collaboration.

Otte, J.N., Beverley, J. and Ruttenberg, A., 2022. BFO: Basic Formal Ontology. Applied Ontology, 17(1), pp.17-43.

Robertson, T., Döring, M., Guralnick, R., Bloom, D., Wieczorek, J., Braak, K., Otegui, J., Russell, L. and Desmet, P., 2014. The GBIF integrated publishing toolkit: facilitating the efficient publishing of biodiversity data on the internet. PLoS ONE, 9(8), p.e102623.

Schentz, H., Peterseil, J. and Bertrand, N., 2013. EnvThes – interlinked thesaurus for long term ecological research, monitoring, and experiments.

Schneider, F.D., Fichtmueller, D., Gossner, M.M., Güntsch, A., Jochum, M., König-Ries, B., Le Provost, G., Manning, P., Ostrowski, A., Penone, C. and Simons, N.K., 2019. Towards an ecological trait-data standard. Methods in Ecology and Evolution, 10(12), pp.2006-2019.

Shaw, A. and Hargittai, E., 2018. The pipeline of online participation inequalities: The case of Wikipedia editing. Journal of Communication, 68(1), pp.143-168.

Smith, B., Ashburner, M., Rosse, C., Bard, J., Bug, W., Ceusters, W., Goldberg, L.J., Eilbeck, K., Ireland, A., Mungall, C.J. and OBI Consortium, 2007. The OBO Foundry: coordinated evolution of ontologies to support biomedical data integration. Nature Biotechnology, 25(11), pp.1251-1255.

Stephenson, P.J. and Stengel, C., 2020. An inventory of biodiversity data sources for conservation monitoring. PLoS ONE, 15(12), p.e0242923.

Stephenson, P.J., Londoño-Murcia, M.C., Borges, P.A., Claassens, L., Frisch-Nwakanma, H., Ling, N., McMullan-Fisher, S., Meeuwig, J.J., Unter, K.M.M., Walls, J.L. and Burfield, I.J., 2022. Measuring the impact of conservation: The growing importance of monitoring fauna, flora and funga. Diversity, 14(10), p.824.

Trantas, A., Plug, R., Pileggi, P. and Lazovik, E., 2023. Digital twin challenges in biodiversity modelling. Ecological Informatics, p.102357.

Walls, R.L., Deck, J., Guralnick, R., Baskauf, S., Beaman, R., Blum, S., Bowers, S., Buttigieg, P.L., Davies, N., Endresen, D. and Gandolfo, M.A., 2014. Semantics in support of biodiversity knowledge discovery: an introduction to the Biological Collections Ontology and related ontologies. PLoS ONE, 9(3), p.e89606.

Wieczorek, J., Bloom, D., Guralnick, R., Blum, S., Döring, M., Giovanni, R., Robertson, T. and Vieglais, D., 2012. Darwin Core: an evolving community-developed biodiversity data standard. PLoS ONE, 7(1), p.e29715.

Wilkinson, M.D., Dumontier, M., Aalbersberg, I.J., Appleton, G., Axton, M., Baak, A., Blomberg, N., Boiten, J.W., da Silva Santos, L.B., Bourne, P.E. and Bouwman, J., 2016. The FAIR Guiding Principles for scientific data management and stewardship. Scientific Data, 3(1), pp.1-9.

Wohner, C., Peterseil, J., Genazzio, M.A., Guru, S., Hugo, W. and Klug, H., 2020. Towards interoperable research site documentation – Recommendations for information models and data provision. Ecological Informatics, 60, p.101158.

Wohner, C., Peterseil, J. and Klug, H., 2022. Designing and implementing a data model for describing environmental monitoring and research sites. Ecological Informatics, 70, p.101708.

Wolodkin, A., Weiland, C. and Grieb, J., 2023. Mapping.bio: Piloting FAIR semantic mappings for biodiversity digital twins. Biodiversity Information Science and Standards, 7, p.e111979.
